Using S3 Data with Cloudera on VMware Cloud on AWS

In a previous blog article, Using an AWS VPC Endpoint for Access to Data in S3 from Spark on VMware Cloud on AWS, we described a method for using an AWS VPC Endpoint for access to data in S3 from virtual machines running standalone Spark on VMware Cloud on AWS.

In this article, we build on that idea and show how access to S3 data may be done in the same way from a Spark program hosted on the Cloudera Distribution including Hadoop (CDH) platform, which itself is running on virtual machines on VMware Cloud on AWS.

In a separate earlier article, Running Apache Spark for Big Data on VMware Cloud on AWS – Part 1, we described the installation of Cloudera’s CDH with Spark software onto virtual machines on VMware Cloud on AWS.

Cloudera has published a Reference Architecture for CDH on AWS (independently of VMware) which mentions both S3 and Elastic Block Storage (EBS) from AWS as potential storage options for data being used in CDH. This led us to investigating whether the S3 storage mechanism could be used from CDH while running on VMware Cloud on AWS.

AWS S3 (Simple Storage Service) is one of the most popular services in use today among AWS users. It provides a reliable, global and inexpensive storage option for large quantities of data.

Cloudera CDH clusters that are hosted on VMware Cloud on AWS can use the traditional HDFS file system within their virtual machines’ guest operating systems. That data would be held in the VMDK files making up the various Worker (datanode) virtual machines in the CDH cluster. This direct-attached-storage and VMDK approach has been used for on-premises CDH on VMware vSphere installations for several years (refer here). It continues to play a role in the design that reads the “big data” from S3, as other parts of the needed data (such as the operating system and swap disks for the virtual machine) will be held in VMDK files on local storage.

There is, however, a growing interest in using the S3 object storage system as either a complement to, or even in certain cases a replacement for, HDFS in its role as a repository for big data. The latter point is particularly applicable to long term storage of data that may not be in “hot” use. Cloudera publishes in their AWS Reference Architecture the fact that Impala and Spark may use S3 data directly for interactive queries and analytics. Using Cloudera’s CDH with Spark functionality running on VMware Cloud on AWS, we now have easy access to that S3 data also. Spark is one of the subsystems that is controlled by the Hadoop or YARN scheduler, when residing in a CDH cluster.

As in the previous S3 work referred to above, we used a “VPC Endpoint for S3”, a construct available within AWS, to allow secure access to the S3 data and to avoid going over the internet gateway(s). This VPC Endpoint method ensures that the traffic between the virtual machine consuming the data and the S3 storage operates over the Amazon backbone network. That traffic does not traverse the internet gateway.

Note that a VPC Endpoint allows this type of access within the same region as that in which the SDDC exists only, not across AWS regions, at the time of writing. S3 data can be made visible across different regions, but that is not being discussed here.

VPC Endpoints are explained in the AWS documents, including this article, New – VPC Endpoint for Amazon S3, In this current article, we assume that you already:

1. Have Cloudera CDH installed and running in virtual machines on VMware Cloud on AWS. The installation process using Cloudera Manager is the same as it would be if you were installing on a set of physical hosts with IP addresses and hostnames. When asked by the Cloudera installation process for these details for the target “hosts” you simply supply the various VMs’ details instead. That is the only difference.

2. Followed the steps documented in this blog article to create a VPC Endpoint for secure access to your data in S3 from VMware Cloud on AWS.

When using VMware Cloud on AWS, you are working within a connected AWS Virtual Private Cloud or VPC. This VPC is associated with your deployed SDDC. An SDDC is a collection of host servers with the VMware software deployed on them within VMware Cloud on AWS. Your VMware Cloud on AWS system administrator will have established that VPC and the link to it at the time that the SDDC was first created.

The VPC Endpoint we refer to in this article is created on that linked VPC, by a user with the appropriate IAM access rights to do so. You will need to communicate with your VMware Cloud on AWS system administrator to allow you access to an IAM role with those VPC endpoint creation rights on the linked VPC.

You can either view your S3 buckets through the AWS Management Console or else download the AWS command line tool to your desktop machine and use it to list the contents of a bucket as seen here:

$ aws s3 ls

Note: AWS S3 bucket names are required to be globally unique and all lower case. There are also some additional rules for naming a bucket that are described here.

The Cloudera CDH environment on VMware Cloud on AWS

The creation of a CDH cluster is controlled by the Cloudera Manager tool. This process runs in a virtual machine (in our case, named “cmgr1”) that is itself installed first, before the cluster creation begins. You can download the Cloudera Manager installation binary from cloudera.com. User access to the Cloudera Manager GUI is done by pointing a browser window to http://:7182).

For this Cloudera and Spark work, a collection of Linux/CentOS 7 virtual machines was created using the VM clone operation from a master VM in the normal way in vCenter. You may use different source virtual machines for cloning the different types from if you want to.

A CDH cluster in then installed and configured onto these newly-provisioned virtual machines in exactly the same way as would be done with an on-premises virtualized or a physical cluster. Cloudera Manager does not treat a set of virtual machines differently from a set of physical hosts.


Figure 1: vCenter within VMware Cloud on AWS showing the layout of the virtual machines for a CDH cluster.

Cloudera Manager is supplied with the hostnames and IP addresses of the “hosts” or virtual machines that it is installing to. You may wish to configure the root and other device sizes on these virtual machines ahead of time, if needed. There are guidelines for these in the Cloudera reference architecture document for AWS.

Once the Cloudera CDH software installation is completed in Cloudera Manager, the virtual machines now contain the various processes or roles that are needed in the cluster (CDH Masters, Workers and Gateways).

Our example virtualized CDH cluster is composed of

  • two Master virtual machines (in our example, the VMs named cmast1, cmast2),
  • three Worker (or “datanode”) virtual machines (cw[1-3]) and
  • one Gateway virtual machine (cgw1)
  • This is a very minimally-sized cluster built to prove the concept of access to S3 – the topology of the CDH cluster may be much larger in your deployments.

    As an interesting point, in our testing, we run a set of virtual machines containing standalone Spark alongside the Cloudera CDH cluster on the same set of 4 host servers that were allocated to our SDDC. This is seen in the bottom left side of Figure 1 also.

    We need to ensure that the cluster has access to our S3 bucket and so we provide the S3 access key and secret key in the /etc/conf/spark/spark-defaults.conf file on the virtual machine in which we execute our “spark-submit” command. The file has the following contents in our test setup (We added only the last two lines with their appropriate values, omitted here).


    Figure 2: The contents of the file “/etc/spark-conf/spark-defaults.conf” on the virtual machine where the job is submitted (the spark-driver virtual machine).

    The keys shown in the final lines of the above file are created by using the AWS Console as the owner of the S3 bucket.

    The access key and secret key are obtained by logging in to the AWS Management Console and then going to the “My Security Credentials” under your login user name. Then proceed to “Continue to Security Credentials”. By clicking on the Access Keys item below you can have the capability to generate your keys and capture them for use in secure access to your S3 data.


    Figure 3: Getting your Access Keys in the AWS Management Console.

    As a best practice, it is recommended to not store the access AWS key and secret key for your user ID in a file. However, at the time of this writing, the alternate method, i.e. the use of an IAM role, is not supported.

    Certain configurations need to be done in order to have the correct Java and Spark environments available for the test. The JAVA_HOME and SPARK_HOME environment variables were set as shown in the example .bash_profile from the user’s home directory below.


    Figure 4 The .bash_profile file for the root user showing the environment variable settings needed.

    Once your CDH cluster is up and running, you can check within the Cloudera Manager tool to ensure that the Spark service is running on that cluster. If not, go ahead and start it. Wait for the Spark service to be successfully started.

    We can now submit a job to the Spark Driver virtual machine that will read data from an object in one of our S3 buckets. The Spark Driver VM in this case is our cgw1 virtual machine, but you could choose any virtual machine in the cluster to do this. The command we will use on logging in to the Spark Driver virtual machine to submit that job is as shown in Figure 5 below.

    The “–deploy-mode” option in the “spark-submit” command below refers to the location of the Spark Driver process, in our case it is in the “client” virtual machine. This is the virtual machine where the Spark job is first invoked. There is an alternate design that allows the Spark Driver to execute within an Application Master process that is created by the YARN ResourceManager in a separate virtual machine. That is referred to as “cluster” mode, but for simplicity we use “client” mode here. You would use “cluster” mode more in production systems, whereas here we are showing what a developer may use.


    Figure 5: The Spark Submit command used to run a test of the connection to S3.

    The job reads 100MB of data from an object in the named S3 buckets and counts the number of lines it finds in the file. The particular S3 object being read is identified with the “s3a://” prefix above.

    The Spark code that is executed as part of the ReadTest class shown in figure 6 is a simple Scala read of a text file of 100MB in size into memory and having read it, the program counts the number of lines in it (with the file.count() line of code). This is to ensure that a Spark “action” is taken to work around the lazy evaluation of RDDs in Spark.


    Figure 6: The Scala source for the ReadTest program for S3 access in Spark.

    Data may also be written back to the S3 bucket if required. For that purpose, the user would also need to have “hdfs” user permissions, as some local metadata is kept there. As we have shown here, you may now use your CDH-based Spark cluster on VMware Cloud on AWS to access S3 data.

    Conclusion

    Users of big data systems based on the Spark and Cloudera CDH distributed platforms want to use large quantities of data that are stored in AWS S3 buckets. The Spark platform may be run on VMware Cloud on AWS within a number of virtual machines that together make up a Cloudera CDH cluster. We have shown here we can access AWS S3 buckets for data from those CDH virtual machines and from Spark application consumers of that data within them. The popular AWS S3 storage service may now be used as an additional (or even a substitute) storage to traditional HDFS for Spark programs running on VMware Cloud on AWS.