Women Who Code - Empowering women to excel in tech careers

Apache Spark is an open-source cluster-computing framework that was originally developed at the University of California, Berkeley's AMPLab. The codebase was later donated to the Apache Software Foundation. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

In this article, we explain why you should use Apache Spark and demonstrate how to set up a real multi-node cluster. The cluster will be hosted on AWS EC2 instances, and once we’re done, you can start playing around with Spark and processing data.

First Things First – Why Apache Spark?

At its core, Apache Spark needs a cluster manager and a distributed storage system. For cluster management, Spark supports its native standalone cluster mode, Hadoop YARN, and Apache Mesos.

As far as distributed storage goes, Spark can interface with the Hadoop Distributed File System (HDFS), MapR File System (MapR-FS), Cassandra, OpenStack Swift, Amazon S3, Kudu, and others. There are also third-party solutions built on top of existing cloud infrastructure that offer improved storage capabilities, such as Nimble predictive storage or PureStorage, or stronger backup capabilities, such as N2WS. These solutions offer more control over your storage containers and how you use and protect them.

Spark also supports a pseudo-distributed local mode that can be used for development or testing purposes where distributed storage is not needed and the local file storage system can be used to get to the same end result.

Apache Spark is a relatively recent entrant in the Big Data suite of services. It allows users to execute Big Data processing workloads that do not fit the regular MapReduce paradigm, and it accomplishes this by reusing major components of the Apache Hadoop framework. It also supports a variety of patterns that are functionally similar to Java 8 Streams, while allowing users to run these jobs on a cluster.

Imagine a scenario where you have a data processing job working with Java 8 Streams. However, you need more horsepower and memory than a single machine can normally provide. In such a case, Apache Spark is a natural fit.
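As a rough illustration of that scenario, the sketch below writes the same filter-map-sum pipeline twice: once with plain Python built-ins on a single machine, and once against Spark's RDD API. The function names and sample data are hypothetical, and the Spark variant assumes the pyspark package is installed. With master left at local[*] it runs in the local mode described above; passing a cluster URL instead should distribute the same code across the workers.

```python
from typing import Iterable


def is_even(n: int) -> bool:
    return n % 2 == 0


def square(n: int) -> int:
    return n * n


def sum_even_squares_local(numbers: Iterable[int]) -> int:
    """Single-machine version: everything runs inside this one process."""
    return sum(map(square, filter(is_even, numbers)))


def sum_even_squares_spark(numbers: Iterable[int], master: str = "local[*]") -> int:
    """Same pipeline on Spark (assumption: pyspark is installed)."""
    # Imported lazily so the local version works without Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master(master).appName("even-squares").getOrCreate()
    try:
        rdd = spark.sparkContext.parallelize(list(numbers))
        # filter/map are lazy; sum() triggers the distributed computation.
        return rdd.filter(is_even).map(square).sum()
    finally:
        spark.stop()


if __name__ == "__main__":
    print(sum_even_squares_local(range(10)))
```

The two functions share the same is_even and square helpers, which is the point: the per-element logic stays the same and only the execution engine changes.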

Set Up Your AWS Account

The process of setting up an AWS EC2 instance is quite simple and straightforward. Linux is supported as a development and deployment platform. Try the latest LTS release of Ubuntu, which is 18.04 at the time of writing this tutorial. Windows is also a supported development platform. You can also choose a third-party AMI if you’re setting up for deployment. Intel’s BigDL with Apache Spark is recommended for deployment.

However, if your aim is to experiment and play around with Spark, a standard EC2 instance with an Ubuntu AMI should do. In this tutorial, we’re going to use an open-source tool called Flintrock, which is one of the quickest ways to set up a Spark cluster on AWS.

Flintrock is a Python-based command-line tool that is in active development. The rest of this tutorial focuses on setting up clusters using Flintrock. You will need an AWS account and some basic knowledge of setting up EC2 instances in AWS to get started.

Install Flintrock

The latest version of Flintrock is 0.9.* at the time of writing. Make sure that you have Python and pip installed on your machine and then install the tool.

$ pip3 install flintrock

If you don’t have Python, you can also try the standalone version that the Flintrock project provides.

Create a new user in AWS

Flintrock needs an IAM user with programmatic access to EC2 so that it can create, modify and delete nodes on demand. For the sake of simplicity, we’re going to give the new user administrative access. For production, make sure that you narrow these permissions down to what the user actually needs.

To create a new user:

1. Head to the AWS Console and select IAM from the drop-down menu.
2. Go to the Users tab and select Add user.
3. In the Add user window, fill in the username. Let’s name it spark.
4. For the access type, select Programmatic access as well as AWS Management Console access.
5. Select Next: Permissions.
6. On the next screen, select Add user to group. You will see an option to create a new group.
7. Add a group name and select the appropriate permission for the user role.
8. Press the Create Group button.
9. Add the user to the newly created group. You will have a chance to review the settings. Click Next.

IAM will create the new user and return an Access Key ID, a Secret Access Key and a password. Make sure that you download them and keep them secure because these credentials won’t be available later. Set the two keys as environment variables so that Flintrock can use them later:

$ export AWS_ACCESS_KEY_ID=xxxxxxxxxxxxxxxxxxxx
$ export AWS_SECRET_ACCESS_KEY=yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
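Before moving on, it can help to confirm that both variables are actually visible to new processes. The following is a minimal sketch using only the Python standard library; the function name is hypothetical, and it only checks that the variables are set, not that the keys are valid.

```python
import os
from typing import List, Mapping

# The two variables exported above.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_credentials(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_aws_credentials()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("AWS credentials found in the environment.")
```

Run it from the same terminal session in which you exported the variables, since export only affects that shell and its child processes.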

Next up, you will need to generate a key pair to access the EC2 instance from your terminal. The steps are straightforward and AWS documentation has a good guide on how to do that.

Configure Flintrock

To configure Flintrock, run the following command.

$ flintrock configure

This will generate a config.yaml file and open it in your default text editor. This is what it looks like:

services:
  spark:
    version: 2.2.0

  hdfs:
    version: 2.7.5

provider: ec2

providers:
  ec2:
    key-name: spark_cluster
    identity-file: /path/to/spark_cluster.pem
    instance-type: m4.large
    region: us-east-1
    ami: ami-97785bed  # Amazon Linux, us-east-1
    user: ec2-user
    tenancy: default  # default | dedicated
    ebs-optimized: no  # yes | no
    instance-initiated-shutdown-behavior: terminate  # terminate | stop

launch:
  num-slaves: 1
  install-hdfs: True
  install-spark: True

debug: false

Replace the key-name and identity-file values with the details of the key pair that you generated in the previous section. The num-slaves property sets the number of slave (worker) nodes that Flintrock will create.
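A common stumbling block at this point is an identity file that is missing or readable by other users, in which case ssh typically refuses to use the key. The sketch below is one way to check this up front with the Python standard library; the function name is hypothetical, and the check assumes a Linux or macOS machine where file permission bits are meaningful.

```python
import os
import stat
from typing import List


def key_file_problems(path: str) -> List[str]:
    """Return human-readable problems found with an SSH private key file."""
    expanded = os.path.expanduser(path)
    if not os.path.isfile(expanded):
        return [f"{path}: file not found"]
    mode = stat.S_IMODE(os.stat(expanded).st_mode)
    # Any group or other permission bit means the key is too widely accessible.
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        return [f"{path}: permissions {oct(mode)} are too open; try chmod 400"]
    return []


if __name__ == "__main__":
    # Same placeholder path as in the config file above.
    for problem in key_file_problems("/path/to/spark_cluster.pem"):
        print(problem)
```

An empty result means the file exists and is accessible only to its owner.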

Start a Multi-Node Spark Cluster

To start a Spark cluster, run the following command. Replace multi-node-cluster-demo with the name of your cluster.

$ flintrock launch multi-node-cluster-demo

It might take a few minutes for Flintrock to finish the setup. Once it is done, you can log in to the cluster’s master node by running the following command with the same cluster name.

$ flintrock login multi-node-cluster-demo

You should see something like this:

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2017.09-release-notes/
[ec2-user@ip]$
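If you find yourself repeating these commands, a thin wrapper can keep the cluster name in one place. The sketch below is an assumption-laden convenience rather than part of Flintrock: the helper names are hypothetical, and it relies on the flintrock executable being on your PATH. It covers launch and login from this tutorial plus Flintrock's describe and destroy subcommands; destroy terminates the cluster's instances, so use it only when you are finished.

```python
import subprocess
from typing import List

# Flintrock subcommands covered by this helper; each takes a cluster name.
ACTIONS = ("launch", "describe", "login", "destroy")


def flintrock_argv(action: str, cluster_name: str) -> List[str]:
    """Build the argument list for a Flintrock command without running it."""
    if action not in ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    if not cluster_name:
        raise ValueError("cluster name must not be empty")
    return ["flintrock", action, cluster_name]


def run_flintrock(action: str, cluster_name: str) -> int:
    """Run the command and return its exit code (assumes flintrock is on PATH)."""
    return subprocess.call(flintrock_argv(action, cluster_name))


if __name__ == "__main__":
    # Only prints the command; call run_flintrock(...) to execute it.
    print(" ".join(flintrock_argv("describe", "multi-node-cluster-demo")))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems if a cluster name ever contains unusual characters.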

Running Jobs Using Pyspark/Spark-shell

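Once the above steps have been completed, we can try running a job using the Python shell provided by Apache Spark. Starting the shell requires the Spark cluster URL, which is passed through the --master option. Replace the hostname below with your master node’s address, and adjust the path to match the Spark version installed on your cluster.

$ cd ~/server
$ ./spark-2.3.0-bin-hadoop2.7/bin/pyspark --master spark://ip-170-24-10-53.us-east-1.compute.internal:7077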

After a brief pause for startup, you should see the pyspark prompt ‘>>>’.
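Once the prompt appears, a small job is a quick way to confirm that the workers respond. The following word count is a minimal sketch; the helper names and sample lines are hypothetical. It takes a SparkContext as its first argument, and in the pyspark shell one is already available under the name sc.

```python
import re
from typing import Dict, Iterable, List


def tokenize(line: str) -> List[str]:
    """Lower-case a line and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", line.lower())


def word_counts(sc, lines: Iterable[str]) -> Dict[str, int]:
    """Count words across `lines` using the given SparkContext."""
    pairs = sc.parallelize(list(lines)).flatMap(tokenize).map(lambda w: (w, 1))
    # reduceByKey sums the counts per word across all partitions.
    return dict(pairs.reduceByKey(lambda a, b: a + b).collect())
```

At the prompt, calling word_counts(sc, ["Hello Spark", "hello cluster"]) should return a dictionary of per-word counts. While the job runs, the Spark web UI on the master node is a convenient place to check that tasks are being scheduled on the workers.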

Alternatively, you can use spark-shell to run your Spark jobs.

$ spark-shell

Conclusion

This article covered the basics of setting up Apache Spark on AWS EC2 instances. It then demonstrated how to launch a cluster with a master and slaves using Flintrock, and how to connect to it with the PySpark shell or spark-shell.

How to Set Up a Multi-Node Spark Cluster on AWS