
Recently, I have been processing large data sets on a daily basis. I decided to use Hadoop MapReduce and wrote mapper and reducer scripts to process the data.
The whole process involved launching an EMR cluster, installing requirements on all the nodes, uploading files to Hadoop’s HDFS, running the job, and finally terminating the cluster (because an AWS EMR cluster is expensive).
To eliminate the manual effort, I wrote an AWS Lambda function that does the whole process automatically. The function is written in Python and uses APIs from the Boto3 library.
The Lambda function consists of three parts:
1. Instance Configuration
In this part of the AWS Lambda function, you set the configuration for your EMR cluster: the type of instances the cluster will use, the number of master and core nodes, and so on.
Instances={
    'InstanceGroups': [
        {
            'Name': 'master',
            'InstanceRole': 'MASTER',
            'InstanceType': 'm3.xlarge',
            'InstanceCount': 1,
        },
        {
            'Name': 'core',
            'InstanceRole': 'CORE',
            'InstanceType': 'm3.xlarge',
            'InstanceCount': 2,
        },
    ],
    'Ec2KeyName': KEY_PAIR  # This allows us to SSH into the nodes with the key pair
}
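One flag worth considering here (it is not in the configuration above, so treat this as a suggestion): the Instances dictionary also accepts KeepJobFlowAliveWhenNoSteps. Setting it to False tells EMR to terminate the cluster automatically once all steps finish, which fits the goal of not leaving an expensive cluster running:

Instances={
    'InstanceGroups': [...],               # the instance groups shown above
    'Ec2KeyName': KEY_PAIR,
    'KeepJobFlowAliveWhenNoSteps': False   # auto-terminate once all steps complete
}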
2. Bootstrapping the Nodes
Here you specify the S3 path to a shell script that will install all the requirements and dependencies on every master and core node while the EMR cluster is being set up.
Note: BootstrapActions is a list, so you can add multiple scripts here if needed.
BootstrapActions=[
    {
        'Name': 'Install packages',
        'ScriptBootstrapAction': {
            'Path': INITIALIZATION_SCRIPT_PATH
        }
    }
]
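If the script is not in S3 yet, a one-time upload with Boto3 is enough. A minimal sketch, assuming the script exists locally; the bucket and key names are placeholders:

import boto3

s3 = boto3.client('s3')
# Upload the bootstrap script so EMR can fetch it while provisioning the nodes.
# 'my-emr-bucket' and the key are placeholder names.
s3.upload_file('install_requirements.sh', 'my-emr-bucket', 'bootstrap/install_requirements.sh')

INITIALIZATION_SCRIPT_PATH = 's3://my-emr-bucket/bootstrap/install_requirements.sh'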
3. Running the Steps
In this part of the Lambda function you can add one or more steps. Each step can run a processing job such as Hadoop MapReduce or PySpark.
In each step you specify the name of the job, what should happen if the job fails for any reason, and the command that runs the job. I have added a step for a Hadoop MapReduce job below.
Note: Steps is also a list, so you can add multiple steps if needed.
Steps=[
    {
        'Name': 'Name of the Step',
        'ActionOnFailure': 'TERMINATE_CLUSTER',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                'hadoop-streaming',
                '-files', '{},{},{}'.format(MAPPER_PATH, REDUCER_PATH, INPUT_PATH),
                '-mapper', MAPPER_FILE,
                '-input', INPUT_PATH,
                '-output', OUTPUT_PATH,
                '-reducer', REDUCER_FILE
            ]
        }
    }
]
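Putting the three parts together, the cluster is launched with a single run_job_flow call. Below is a minimal sketch of how the pieces wire up; the cluster name, release label, log URI and IAM roles are illustrative assumptions, not the exact values from my function:

import boto3

emr = boto3.client('emr')

response = emr.run_job_flow(
    Name='automated-mapreduce-job',      # placeholder cluster name
    ReleaseLabel='emr-5.29.0',           # assumed EMR release; pick the one you need
    LogUri='s3://my-emr-bucket/logs/',   # assumed log destination
    Instances={...},                     # the Instances dictionary from part 1
    BootstrapActions=[...],              # the BootstrapActions list from part 2
    Steps=[...],                         # the Steps list from part 3
    JobFlowRole='EMR_EC2_DefaultRole',   # default EMR roles; adjust for your account
    ServiceRole='EMR_DefaultRole',
)
print('Started cluster:', response['JobFlowId'])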
All that is left now is to add a trigger for this AWS Lambda function. You may need to handle how you receive your INPUT_PATH depending on your trigger.
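For example, if the trigger is an S3 upload event, the input path can be read straight from the event payload. A minimal sketch following the standard S3 event schema; the handler body is illustrative:

def lambda_handler(event, context):
    # Assuming an S3 put-event trigger; these keys follow the standard S3 event schema.
    record = event['Records'][0]['s3']
    INPUT_PATH = 's3://{}/{}'.format(record['bucket']['name'], record['object']['key'])
    # ...build Instances, BootstrapActions and Steps, then call run_job_flow as above...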
Complete Lambda Function
You can get the complete AWS Lambda function from my GitHub repository.
Subscribe and Share
Don’t forget to subscribe and share this blog if the article helped you.
Also Read: Trigger AWS Lambda Asynchronously from API Gateway
Also Read: Coding a Tinder Bot in Python