Running Apache Spark for Big Data on VMware Cloud on AWS – Part 1

This article gives an overview of the deployment steps that were used in a series of tests done by VMware staff for the Spark and Cloudera CDH distributed application platforms for big data on the VMware Cloud on AWS. This is part 1 of a set of articles on this subject.

Part 1 describes an out-of-the-box deployment of these platforms. No attention is given to tuning the environment for best performance – the validations done here were purely functionality tests. In a separate exercise, a Standalone Spark cluster was deployed and a set of performance tests were executed on it. Both exercises showed that the functionality and performance of Spark on VMware Cloud on AWS and on on-premise hardware were very similar.

In the functional testing sequence being described here, the VMware Cloud on AWS unit of deployment – a single Software Designed Datacenter (SDDC), was used with the standard minimum of four ESXi server hosts in it. The SDDC was created in the VMware Cloud on AWS portal ahead of time by the Cloud Administrator user. This is shown in the vCenter Client with the vCenter Server instance for that SDDC in Figure 1.

The set of host servers within the SDDC could be expanded to contain more servers at a later point, according to workload demands.

Figure 1 – Top-level vCenter view within VMware Cloud on AWS

Two separate sets of virtual machines were constructed,

  • one set making up a Standalone Spark cluster (i.e. not managed by YARN) and
  • a second set of VMs containing both a Cloudera Manager instance and a Cloudera CDH cluster of nodes (including CDH-supplied Spark services).
  • These two clusters operated in parallel within one SDDC as shown in figure 2. The functionality testing described here was executed on the two environments. The Cloudera cluster’s virtual machines are shown with that prefix for their name, whereas the Standalone Spark cluster virtual machines are named “Spark“.

    Figure 2 – The virtual machines for the separate Cloudera and Standalone Spark clusters shown in vCenter

    Background on Spark

    Spark is an open source distributed application platform for running big data applications. It was originally developed by a research team led by Matei Zafaria in the UC Berkeley AMPlab starting in 2009. The idea was to produce a radical improvement on Hadoop technology, both for performance and simplicity of API usage, as it stood then. Spark was placed into the open source community in 2013 and it continues to be used and developed as a major Apache open source project.

    Spark has now been widely adopted throughout the big data industry. The major Hadoop vendors all support it on their platforms. A huge amount of technical innovation work is being done on it, as well as many production deployments.

    An outline of the Spark architecture is shown in Figure 3. The principal components are the Driver program (sometimes call the “Master”) and the Worker Nodes. The “Master” virtual machine contains a Driver JVM process principally. The Worker machines contain mainly JVM processes called Executors that take parts of the overall application or job from the Driver and execute code on their parts. The framework makes use of the partitioning of large quantities of data into Resilient Distributed Datasets (RDDs) across machines. RDDs are in-memory datasets that are split across the various Executors. For more background information on Spark and its ecosystem, look at http://spark.apache.org

    The Driver program in our setup ran in a virtual machine of its own and each Worker virtual machine contained one or more Executors (usually more than one, depending on the memory and CPU capacity of the virtual machine it resided in).

    Figure 3 – The Spark Architecture

    You can run Spark in several different ways:

    1. Spark Standalone form, without any cluster manager apart from Spark’s own in-built scheduler. This is the simplest form of Spark deployment. It is ideal for single-tenant, single-purpose clusters for development and testing groups.

    2. Spark on YARN form, that is controlled by the YARN scheduler that is built into contemporary Hadoop distributions such as Cloudera CDH, Hortonworks HDP or MapR. Within this more multi-tenant capable form of running Spark, you can configure it in “Client” mode or in “Cluster” Mode. “Client” mode means that the Driver program runs in a Cloudera Gateway node. “Cluster” mode means that the Driver runs under the control of a YARN Application Master process on a NodeManager. We used “Client” mode in our work here.

    3. In hosted Spark form, such as that promoted by the Databricks company (formed from the Berkely AMPlab technical team members), or in Altus, a hosted big data service from Cloudera.

    The particular advantages and disadvantages of using each type of deployment are beyond the scope of this current work. Here, we tried out the first two types of deployment on VMware Cloud on AWS, as examples of what VMware’s customers would want to do.

    In the Spark Standalone form, the cluster manager’s functionality was built into the Spark scheduler itself. In the YARN implementation of Spark (i.e. on Cloudera CDH) the cluster manager functionality was done using YARN’s cluster and resource management algorithms. The Spark subsystem comes as an optional part of the Cloudera CDH distribution and is installed using Cloudera Manager in the normal way.

    Cluster Setup

    A number of traditional workloads were executed on the Cloudera cluster initially. These were MapReduce style tests such as TeraGen, TeraSort and TeraValidate. This was a simple health check on that CDH cluster. The tests ran to completion in a satisfactory time.

    In order to run some more contemporary workloads through the two clusters, a toolkit called Spark-Perf was used as a simple way of setting up Logistic Regression and other machine learning algorithms for testing. This toolkit was built by Databricks engineers for this purpose and was freely downloadable from Github https://github.com/databricks/spark-perf.

    1. Standalone Spark Cluster Setup
    Because the Standalone Spark installation and setup were straightforward, involving just a small number of Jar and Zip files and the appropriate JVM and test environment variable setup, then more of the setup details are given here. The full YARN-based Spark cluster setup was done using the Cloudera Manager tool in the same way that it would be used for other environments, so some details are omitted.

    One Spark-capable virtual machine was created and configured first. Then the Driver for Spark and many more workers were easily created through cloning the original virtual machine in vCenter from the first one, with subsequent renaming of the virtual machine’s hostname and resetting its IP address. The Driver virtual machine was configured with

  • 8 virtual CPUs,
  • 22 GB of memory and
  • a 20 GB virtual disk.
  • As an aside here on a useful technical tip, it was convenient to store a template for this virtual machine, once fully configured, on S3 storage and import it into vCenter for creating Spark clusters similar to this one in different SDDCs, if required. This applied particularly to the Worker virtual machines.

    As the initial steps, the following items were downloaded and installed into a Linux (CentOS 7.3) virtual machine:

    1. The Spark tar file from http://spark.apache.org/downloads.html (Spark versions 1.6.3 and 2.1.1 were tested separately).

    2. The spark-perf-master.zip – a Databricks performance toolkit, downloaded from https://github.com/databricks/spark-perf (This can also be done using “git clone” on that repository)

    3. The Java Standard Edition from http://www.oracle.com/technetwork/java/javase/downloads/index.html

    4. The Scala compiler from http://www.scala-lang.org This component was needed only for the Driver program in that virtual machine, not for the Worker virtual machines.

    The extract command used for the Spark tar file was “tar xvf spark1.6.3-bin-hadoop2.6.tar”. The environment variable SPARK_HOME was set to the directory where the Spark tar file was extracted to.

    The Scala compiler was extracted to its own directory also using the appropriate command for the downloaded file (tar xzvf scala-2.10.6.tgz). This Scala step was needed for the Driver VM only. Some of the Scala sources are recompiled on each test run of the Spark Perf toolkit, described below.

    The Test Runtime Toolkit: Spark Perf

    The Spark-Perf toolkit was extracted to a separate directory (in our case /root/spark-perf-master). There was a “config” directory under the main one that contains configuration files with a “.py” extension. An appropriate file is required to be constructed from the “config.py.template” file supplied with a set of adjustments made to its contents, such as:

    1.SPARK_HOME_DIR is required to be set to the extract directory for the Spark tar file above.

    2. The SPARK_CLUSTER_URL variable should be set to “spark://%s:7077” % socket.gethostname()

    3. The variables RUN_MLLIB_TESTS and PREP_MLLIB_TESTS were set to True

    4. The SPARK_DRIVER_MEMORY was set to “20g” Hence a virtual machine with a larger amount of memory is required for this Driver virtual machine

    5. The variable MLLIB_SPARK_VERSION was set to 1.6 for our initial tests and changed later to 2.1

    6. The “num_partitions”, “num_examples”, “num_features” were set to appropriate values for the size of the Spark cluster (num_partitions was initially set to twice the number of executors, for example)

    7. The option setting “spark_executor_instances” was set to the total number of Worker virtual machines times 3 or 6 depending on how large the memory consumed by each executor was. We did not over-commit the memory allocated to the virtual machine with too many executor instances.

    8. The option setting “spark.executor.cores” was set to 2. This enabled us to have 3 executor instances in a 6 VCPU virtual machine, for example.

    9. The option setting “spark.executor.memory” was set to “4g”

    10. The “optimizer” setting is preset to “sgd” and the “loss” option is preset to “logistic” indicating that Stochastic Gradient Descent and Logistic Regression models would be used on the data set. These are Machine Learning algorithms that are built into Spark’s MLlib library. You can get more information on these on the Spark MLlib site.

    All of the above settings influence the runtime behavior of the Spark system and these would be tuned in a performance study exercise. This set of tuning steps was not done in this round of initial VMware Cloud on AWS functionality-only testing.

    The command used to run each test was:
    (Located within the Spark-perf-master directory)

    $ bin/run –config-file config/config_lr_1M_10K.py

    This command ran to conclusion, setting all processes that make up the Spark cluster running across the virtual machines, executing a Logistic Regression workload through a Model Training phase and a Testing phase, timing those test parts and bringing down all of the Spark JVM processes at the end of the run. The main spark–submit step executed by the above script appeared as shown in Figure 4. This was the “spark-submit” command line seen executing in the Driver virtual machine.

    Figure 4 – The command line executed as part of a test run of a Spark cluster with a machine learning test

    For a dataset used in Figure 4, containing 1 million examples with 10,000 features, the training time achieved was 23.01 seconds, using 6 Worker and 1 Driver virtual machines. This compares well with on-premise vSphere tests that were done elsewhere.

    Once the Spark processes on the Driver and Worker virtual machines were up and running, then a web browser was directed to a URL composed of the Driver program’s IP address followed by colon and port 8080.

    This brought up the Spark console that is shown in Figure 5 below. We used the “application” link on the main screen in order to see the number of executors here.

    Figure 5 – The Spark Console showing the running Executors

    The progress of all jobs and executors within Spark was monitored using this Spark console. The details of use of this console are outside our scope here.

    2. Spark-on-YARN Cluster Setup with Cloudera CDH
    The Cloudera CDH cluster creation was performed using the Cloudera Manager tool, which was firstly installed in a virtual machine of its own. The Cloudera Manager was installed using a .bin file from Cloudera’s archives

    $ wget https://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin

    Once that cloudera-manager-installer.bin file was executed, the Cloudera Manager installation proceeded on the virtual machine just as it would have on a physical machine, asking for the same inputs. On completion of that installation process, a browser window was pointed to the URL http://:7180 to give access to the Cloudera Manager console as shown in figure 6.

    Figure 6

    The process of creation of the first CDH cluster was started here by logging in as admin/admin. A straightforward CDH cluster of 1 Master node and 5 Worker nodes was constructed in the normal way in the Cloudera Manager tool – this did not differ from the method that would be used on physical systems or on on-premise vSphere. Details of the process for deciding which Hadoop processes would run on which Hadoop node can be found in the Cloudera Manager documentation: http://doc.cloudera.com

    Once the Cloudera Manager tool had been guided through the necessary steps, then the CDH cluster was created successfully, as shown in Figure 7

    Figure 7 – CDH cluster shown successfully created in the Cloudera Manager tool

    The Spark-Perf toolkit was also installed on the Cloudera Master virtual machine in a similar manner to the method of installation in the Standalone Spark case above. One adjustment was made for Spark on CDH 5.11. The version of the class JOptSimple in the file $SPARK-PERF-HOME/project/MLlibTestsBuild.scala was moved from version 4.6 to version 4.9 for compatibility reasons. Since the Spark-Perf toolkit recompiles any changed Scala sources on each run, then this came into effect on the subsequent test run.

    The Testing

    A set of MapReduce tests were initially run on the Cloudera CDH cluster to show that it was in a healthy state. TeraGen, TeraSort and TeraValidate were executed on the command line to completion. Here is the log produced by TeraSort to show its progress.

    Figure 8 – Output produced by a TeraSort job running on VMware Cloud on AWS

    Following that set of health checks on the CDH cluster, we ran the Spark-perf jobs that exercised the Machine Learning algorithms that are built into the MLLib subsystem that comes with Spark. These workloads also ran to completion successfully, both on the Cloudera CDH cluster and on the Spark Standalone cluster. The output from the Spark Console during one of these test runs is shown below. The Generalized Linear Algorithm was used to execute Logistic Regression against a significant training dataset size (1TB and above). These tests also completed satisfactorily.

    Figure 9 – Initial Part of a Spark job running

    Figure 10 – Spark console showing progress through a Logistic Regression at near completion time.

    Related and Future Work

    Experiments were conducted with alternate forms of storage to HDFS, such as GlusterFS, in a Spark cluster setup also in VMware Cloud on AWS and proven to function correctly on virtual machines on VMware Cloud on AWS. This will be further documented in a follow-on blog article later.

    Conclusion

    The VMware Cloud on AWS is proven to be a very viable platform for big data workloads. Both traditional MapReduce and more contemporary Spark/Machine Learning algorithms were tested under load on it and proven to work successfully there. The best practices derived from earlier performance work apply equally to this environment. No new tools were required at the big data platform level in order to complete this test work – the Cloudera CDH cluster and standalone Spark clusters were installed and configured in the same way as they would be in other environments. More performance analysis work will be done on this environment for big data in the future.

    This article gives an overview of the deployment steps for the Spark and Cloudera CDH distributed application platforms for big data on the VMware Cloud on AWS. This was an out-of-the-box deployment of these platforms and was not tuned for best performance – the validations done here were purely functionality tests. This exercise showed that the functionality and management of Spark on virtual machines looks very familiar whether you are deploying on VMware Cloud on AWS or on vSphere running on on-premises hardware. Early performance tests showed distinct benefits in using the powerful machines underlying the VMware Cloud on AWS as a complement to your on-premises deployment.

    This demo of Spark running a Machine Learning workload on VMware Cloud on AWS shows the Standalone Spark workload that we saw previously running on on-premises vSphere – now running on a set of 12 Worker virtual machines and 1 spark driver virtual machine on the VMware Cloud on AWS platform. A Spark cluster is started up on those 13 virtual machines from the command line and a logistic regression workload with 1 million examples and 10,000 features is executed on it and shown in the Spark console. This data is generated during the run. The entire test takes less than 3 minutes to complete and the resulting Model Training Time is shown to be 15.88 seconds. This would be something that a data scientist/developer would be interested in testing.

    Learn more about how to build the hybrid solution you need, the way you need it with VMware Cloud on AWS.