Using an AWS VPC Endpoint for Access to Data in S3 from Spark on VMware Cloud on AWS

AWS S3 (Simple Storage Service) is one of the most popular services in use today among AWS users. It provides a reliable, global and inexpensive storage option for large quantities of data.

It is useful for users on VMware Cloud on AWS to be able to access data sources on AWS S3 in a controlled manner. We show this access being used within a Spark application context here for big data usage, but the same principles would apply to any S3-consuming programs.

We decided to use a “VPC Endpoint for S3”, a construct available within AWS, to allow secure access to the S3 data. With the endpoint in place, requests to S3 avoid the internet gateway and traffic does not leave the Amazon network.

Note that, at the time of writing, a VPC Endpoint allows this type of access only within the AWS region in which the SDDC exists, not across regions. S3 data can of course be made visible across regions, but that is not discussed here.

VPC Endpoints are explained in the AWS documents, one of which is: https://aws.amazon.com/blogs/aws/new-vpc-endpoint-for-amazon-s3/
In this article, we describe the steps needed to set up this VPC Endpoint for secure access to your data in S3 from VMware Cloud on AWS.

An S3 bucket policy is also used to allow only users who have access to the VPC Endpoint to read data in a non-public bucket.

You will need a login user ID on AWS that allows you to create an S3 bucket and place objects into the bucket. Once logged in to the AWS Management Console at http://aws.amazon.com, go to the S3 service and you will see the option to create a bucket there.

You can upload a file from your desktop computer, for example, as one object in a bucket to use for testing. As the creator of an S3 bucket, you can decide whether this bucket should be publicly accessible or not. You do this using the “Permissions” tab when you are looking at a bucket in the AWS console. For now, we will work with a bucket which is not publicly accessible. As the owner of an object within a bucket, you should be able to view the contents interactively or from an application using the correct S3 access libraries.

You can either view S3 buckets through the AWS Management Console or else download the AWS command line tool and use it to list the contents of a bucket as seen here:

$ aws s3 ls

Note: AWS S3 bucket names are required to be globally unique and all lower case. There are also some additional rules for naming a bucket that are described here: http://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html
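The naming rules can be checked locally before you try to create a bucket. The following is a minimal Python sketch covering only a subset of the rules (length of 3 to 63 characters, lower-case letters, digits, hyphens and dots, and not formatted like an IP address). The function name is hypothetical, and global uniqueness can only be confirmed by AWS itself.

```python
import ipaddress
import re

# 3-63 characters: lower-case letters, digits, hyphens and dots,
# starting and ending with a letter or digit.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name: str) -> bool:
    """Check a subset of the S3 bucket naming rules (hypothetical helper)."""
    if _BUCKET_NAME.fullmatch(name) is None:
        return False
    if ".." in name:
        # Adjacent dots are not allowed.
        return False
    try:
        ipaddress.IPv4Address(name)
    except ValueError:
        return True  # Not an IP address, so the name passes this check.
    return False  # Names formatted like an IP address are rejected.
```

For example, is_valid_bucket_name("my-spark-data") passes these checks, while "MyBucket" and "192.168.5.4" do not.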

When using VMware Cloud on AWS, you are working within a connected AWS Virtual Private Cloud or VPC. This VPC is associated with your deployed SDDC within VMware Cloud on AWS. Your system administrator will have established that link at the time that the SDDC was created.

You will also need permissions to create a VPC Endpoint for that VPC within the AWS console. If you do not already have access to AWS through a suitable IAM user ID, ask the VMware Cloud on AWS system administrator who created your SDDC to grant you a suitable Identity and Access Management (IAM) role with the required permissions within AWS.

Once logged in to the AWS console using your IAM user name, choose the AWS Services menu item and you will see all services in their categories as shown in Figure 1.


Figure 1: The Services screen within the AWS Console.

Scroll down to the lower part of the screen where the section entitled “Networking and Content Delivery” is located (Figure 2 below).


Figure 2: Networking and Content Delivery within the AWS Services screen

Click the VPC entry from that section. This will take you to the VPC Dashboard screen as shown in Figure 3 below.


Figure 3: The VPC Dashboard Screen within the AWS Console

Click on the VPC link in the middle of the screen to see some details of your connected VPC. You can also see the particular VPC that is in use by your SDDC in the VMware Cloud on AWS console if you wish to, as seen in Figure 4 (lower right corner). Take note of the VPC identity as shown in the bottom right of Figure 4 – you will need this later.


Figure 4: The VMware Cloud on AWS Console showing the Amazon VPC that is connected to an SDDC.

Going back now to the AWS console, click on the VPC entry in the list of items under the sentence “You are using the following Amazon VPC resources…” from Figure 3. There, we see the details of our connected VPC in the AWS console view as shown in Figure 5.


Figure 5: The AWS VPC Dashboard showing details of a VPC.

Click the Endpoints link in the left navigation. In Figure 6, we see that there is already a VPC Endpoint that was created earlier. Select the checkbox to the left of the Endpoint ID to see more details.


Figure 6: A VPC Endpoint seen in the AWS console

If you are creating a new VPC Endpoint, click the “Create Endpoint” button at the top of this screen. This brings up the choice of services that will be available on the Endpoint, shown in Figure 7. Choose the S3-related entry here as that is the service we want.


Figure 7: Choosing the S3 service for a new VPC Endpoint in the AWS Console

You will be taken to the bottom of the page once you have chosen the S3 service. The VPC that we noted from the VMware Cloud on AWS Console as being our connected VPC is chosen from the drop-down list labelled “VPC”.

The route table that will be used is shown in Figure 8 below. This will be set to the default value for the connected subnet. The console normally shows all route tables associated with that particular VPC. In our case, for virtual machine traffic, we need to select the default route table. The default route table can be found by looking at the route tables and selecting the one with the “default” tag.

Highlight the checkbox to the left of the Route table ID entry.


Figure 8: The VPC and routing table are displayed for a new VPC Endpoint.

Scroll down in that page to see the “Create Endpoint” button and click on it. This causes the VPC Endpoint for S3 to be created. You should see the result as shown in Figure 9.


Figure 9: A new VPC Endpoint has been created

Click the “Close” button and you will see a summary of the Endpoints that exist, including your new one, as seen in Figure 10. Select your new VPC Endpoint using the checkbox to the left of its ID, and de-select any other VPC Endpoints that are not relevant, so that you can make changes to the new Endpoint.


Figure 10: Summary of existing VPC Endpoints in the VPC Dashboard

Scroll down the left navigation to get to the Security section and click Security Groups within that section (see the left navigation entry for this in Figure 11 below).


Figure 11: Security Groups for a VPC on AWS

Choose the security group for your VPC by clicking the checkbox. Click the Inbound tab for that security group, then click Edit as seen in Figure 12.


Figure 12: Security Group – Inbound Rules tab

Click Add Another Rule as seen at the bottom of Figure 13.


Figure 13: Choosing Type of HTTPS (443) for Inbound Rule

In the Type dropdown menu, as seen at the bottom left of Figure 13, choose HTTPS (443).

In the Source box, enter the CIDR block for the logical network that the virtual machines in your SDDC are attached to. Repeat these steps of choosing the Type and the CIDR block Source for each logical network that you want to connect to (the simplest case is one).

Click Save. This process has established your VPC Endpoint for S3 within AWS.
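Before relying on the rule, it can be worth confirming that the IP address of each virtual machine really falls inside the CIDR block you entered as the Source. A minimal Python sketch using the standard ipaddress module follows. The function name and the addresses shown are illustrative assumptions, not values from this setup.

```python
import ipaddress


def vm_in_network(vm_ip: str, cidr: str) -> bool:
    """Return True if vm_ip lies inside the CIDR block of the logical network."""
    return ipaddress.ip_address(vm_ip) in ipaddress.ip_network(cidr)


# Hypothetical values: a logical network and two candidate VM addresses.
logical_network = "192.168.10.0/24"
inside = vm_in_network("192.168.10.25", logical_network)
outside = vm_in_network("192.168.11.25", logical_network)
```

A virtual machine whose address is reported as outside the block would not be covered by the inbound HTTPS rule.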

We now need to return to the VMware Cloud on AWS console and create a Compute Gateway firewall rule to allow HTTPS access to the connected AWS VPC.

Log in to the VMware Cloud on AWS Console at https://vmc.vmware.com

Navigate to your SDDC within your Org and click the Network tab. Scroll down to view the Compute Gateway.


Figure 14: Details of the Compute Gateway for an SDDC as seen in the VMware Cloud on AWS Console

Click on the Firewall Rules line within the Compute Gateway segment shown in Figure 14 so as to expand that line. Details of the firewall rules appear as seen in Figure 15.


Figure 15: Using “Add Rule” to add a new Compute Gateway firewall rule in the VMware Cloud on AWS console

Click Add Rule at the bottom of the screen shown in Figure 15 and add a new rule with the parameters seen in the entry with Rule Name “HTTPS”, substituting your own CIDR block in the Source field. Your result should be similar to the Rule Name = HTTPS item in Figure 15.

Now the virtual machines in your SDDC can access objects in the S3 buckets in their local region without going through the Internet Gateway, but by using the S3 Endpoint instead.

Using a Bucket Policy to Restrict Access

We want to keep our S3 bucket secure by making it non-public, yet accessible to users that have access to the VPC Endpoint for S3 that we just created. The virtual machines whose IP addresses fall within the CIDR block range that we gave in Figures 13 and 14 will have access to the VPC Endpoint for S3. We now set up a bucket policy within the AWS console to allow only users of the VPC Endpoint to access our bucket.

While logged in to the AWS Console, go to your S3 bucket and click on the Permissions tab. Click on the Bucket Policy entry within that screen as shown in Figure 16 below. You will now construct a bucket policy that looks similar to the one shown in Figure 16. The policy states that only users who are accessing the bucket via the VPC Endpoint for S3, which is named in the “SourceVpce” entry, may access the bucket named in the “Resource” field, with the access shown in the “Action” field.


Figure 16: The Bucket Policy Editor within the AWS Console showing a policy for S3 access via the VPC Endpoint.

Once the policy has been accepted by the Bucket Policy editor as a valid one, click Save to store it and have it take effect.
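For readers who prefer to generate the policy text rather than type it into the editor, here is a minimal Python sketch that builds one possible policy of this shape. It is not a copy of the policy in Figure 16: the bucket name, the endpoint ID, the statement ID and the choice of read-only actions are all assumptions made for illustration. The aws:SourceVpce condition key is the standard AWS key for matching a VPC Endpoint. AWS documentation also shows a variant that uses a Deny statement with StringNotEquals.

```python
import json


def vpce_only_policy(bucket: str, vpce_id: str) -> dict:
    """Build a bucket policy allowing reads only via the given VPC Endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowReadViaVpcEndpoint",  # hypothetical statement ID
                "Effect": "Allow",
                "Principal": "*",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",    # the bucket (for listing)
                    f"arn:aws:s3:::{bucket}/*",  # the objects in the bucket
                ],
                "Condition": {"StringEquals": {"aws:SourceVpce": vpce_id}},
            }
        ],
    }


# Hypothetical bucket name and VPC Endpoint ID; substitute your own.
policy_json = json.dumps(
    vpce_only_policy("example-spark-data", "vpce-1a2b3c4d"), indent=2
)
```

The resulting policy_json string can be pasted into the Bucket Policy editor, which will validate it before you save.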

Testing the VPC Endpoint for S3

To check that your VPC Endpoint for S3 is working correctly, find the URL of your target bucket in the AWS console and use the hostname there as the target of a traceroute command on one of the virtual machines in your SDDC. Because the virtual machine and the VPC Endpoint for S3 are within the same region, you should see output that is similar to that shown in Figure 17.


Figure 17: Traceroute showing the optimal route for access to S3 within the AWS Region

If you perform the same traceroute command to another host that is not in the region containing your VPC Endpoint for S3, you will see a different type of output, as shown in Figure 18. Note that here we are using “west-1” as a component of a different host name in a different Amazon region.


Figure 18: Traceroute showing the routing of a request to S3 via the Internet Gateway
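If you want to script this comparison, the sketch below builds the traceroute command for a bucket in a given region. It assumes the virtual-hosted-style hostname form (bucket.s3.region.amazonaws.com). The hostname shown for your bucket in the AWS console may use a different form, in which case use that one instead. The function names, bucket name and regions are hypothetical.

```python
def s3_hostname(bucket: str, region: str) -> str:
    """Virtual-hosted-style S3 hostname (assumed form; check your console)."""
    return f"{bucket}.s3.{region}.amazonaws.com"


def traceroute_command(bucket: str, region: str) -> list:
    """Argument list for running traceroute against the bucket's hostname."""
    return ["traceroute", s3_hostname(bucket, region)]


# Hypothetical bucket and regions: one local to the SDDC, one remote.
local_cmd = traceroute_command("example-spark-data", "us-west-2")
remote_cmd = traceroute_command("example-spark-data", "us-west-1")
```

Each list can be passed to subprocess.run on a virtual machine in the SDDC, and the two outputs compared in the same way as Figures 17 and 18.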

Setting up the Spark environment on VMware Cloud on AWS

We will assume here that you have at least two virtual machines set up to contain the Apache Spark software, one acting as the Spark Driver and one as a Spark Worker. The setup process for this is described in a separate VMware blog article: Apache Spark deployment.

We show an example of a virtualized Spark cluster below in Figure 21 with one Spark Driver and 12 Spark Worker virtual machines deployed on vSphere. Your cluster does not have to be as large as this to execute the simple test described below.

We will submit a job to the Spark Driver virtual machine that will read data from an object in our identified S3 buckets. The command we will use on logging in to the Spark Driver virtual machine to submit that job is as shown in Figure 19 below:


Figure 19: The Spark Submit command used to run a test of the connection to S3

The particular S3 object being read is identified with the “s3a://” prefix above.

The Spark code that is executed as part of the ReadTest, shown in Figure 20, simply reads a 100 MB text file into memory and counts the number of lines in it.


Figure 20: Scala source for the ReadTest program for S3 access in Spark
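The ReadTest program itself is written in Scala. As a rough Python (PySpark) analogue of the same idea, and not the code from Figure 20, the sketch below reads a text object through the s3a connector and counts its lines. The helper names, bucket and object key are hypothetical, and the SparkSession is passed in so that the functions carry no import-time dependency on a running cluster.

```python
def s3a_uri(bucket: str, key: str) -> str:
    """Build the s3a:// URI for an object in a bucket."""
    return f"s3a://{bucket}/{key}"


def count_lines(spark, uri: str) -> int:
    """Read a text object and count its lines.

    spark.read.text returns a DataFrame with one row per line of input,
    so counting the rows gives the number of lines.
    """
    return spark.read.text(uri).count()


# Usage on the Spark Driver, assuming pyspark is installed there:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("ReadTest").getOrCreate()
#   print(count_lines(spark, s3a_uri("example-spark-data", "test/sample.txt")))
```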

Figure 21 below shows a set of virtual machines that together make up a standalone Spark cluster, executing on the VMware Cloud on AWS. They are running within a Resource Pool, a VMware construct that isolates them from other virtual machines that are using the same SDDC or collection of servers running vSphere.


Figure 21: The VMware vCenter user interface showing a collection of Spark virtual machines running on VMware Cloud on AWS

In order to use the correct libraries at execution time, the configuration file $SPARK_HOME/spark-defaults.conf on your Spark Driver virtual machine should contain references to the two jar files shown below. These jar files contain classes that are used for access to the AWS Java SDK and Hadoop libraries respectively. An example of the contents of this file is shown in Figure 22 below.


Figure 22: The spark-defaults.conf file configuration, part of a Spark cluster setup

As a best practice, do not store the AWS access key and secret key for your user ID in a file. Using an IAM role is preferred instead. This will be described in a future blog article.

There should be correct entries in the $SPARK_HOME/spark-defaults.conf file for the access and secret keys shown in the last two lines above in the form given below:

spark.hadoop.fs.s3a.access.key

spark.hadoop.fs.s3a.secret.key
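One way to avoid writing the key values into the configuration file is to take them from environment variables and pass them to spark-submit as --conf PROP=VALUE arguments. The Python sketch below illustrates this under stated assumptions: the function name is hypothetical, and the variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY follow the usual AWS convention rather than anything in this setup. Command-line arguments are visible in process listings on the driver machine, so this is a convenience for testing, not a replacement for an IAM role.

```python
import os


def s3a_conf_args(env=None) -> list:
    """Build spark-submit --conf arguments for the S3A keys from the environment."""
    env = os.environ if env is None else env
    access_key = env["AWS_ACCESS_KEY_ID"]      # raises KeyError if unset
    secret_key = env["AWS_SECRET_ACCESS_KEY"]  # raises KeyError if unset
    return [
        "--conf", f"spark.hadoop.fs.s3a.access.key={access_key}",
        "--conf", f"spark.hadoop.fs.s3a.secret.key={secret_key}",
    ]
```

The returned list can be appended to the spark-submit argument list used to launch the job.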

The access key and secret key are obtained by logging in to the AWS Management Console and then going to “My Security Credentials” under your login user name. Then proceed to “Continue to Security Credentials”. Clicking on the Access Keys item, shown in Figure 23 below, lets you generate your keys and capture them for use in secure access to your S3 data.


Figure 23: Getting your Access Keys in the AWS Management Console

Once this is done, you can use your Spark cluster on VMware Cloud on AWS to access data in S3 in a secure fashion from the virtual machines in your SDDC.

Conclusion

Users of big data systems based on the Spark distributed platform will want access to large quantities of data that are stored in S3 buckets. The Spark platform may be run on VMware Cloud on AWS within a number of virtual machines making up a Spark cluster. We have shown here that using an AWS VPC Endpoint for S3 gives secure access to S3 buckets for data from those virtual machines and from Spark application consumers of that data within them. The access is shown for a non-public S3 bucket. Through the use of a bucket policy, only users that are in the VPC with the named Endpoint for S3 are allowed to access the bucket.