
AWS Lambda Function to Launch EMR with Hadoop MapReduce in Python


Recently, I have been processing large data sets on a daily basis. I decided to use Hadoop MapReduce and wrote mapper and reducer scripts to process the data.

The whole process involved launching an EMR cluster, installing the requirements on all nodes, uploading files to Hadoop's HDFS, running the job, and finally terminating the cluster (because an idle EMR cluster is expensive).

To eliminate the manual effort, I wrote an AWS Lambda function that performs this whole process automatically. I wrote the function in Python, using the APIs from the Boto3 library.

The Lambda function consists of three parts:

1. Instance Configuration

In this part of the Lambda function you set the configuration for your EMR cluster: which instance type the cluster will use, how many master and core nodes it will have, and so on.

Instances={
    'InstanceGroups': [
        {'Name': 'master',
         'InstanceRole': 'MASTER',
         'InstanceType': 'm3.xlarge',
         'InstanceCount': 1,
         },
        {'Name': 'core',
         'InstanceRole': 'CORE',
         'InstanceType': 'm3.xlarge',
         'InstanceCount': 2,
         },

    ],
    'Ec2KeyName': KEY_PAIR  # This allows us to ssh with the keypair
}
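Since the whole point is to tear the cluster down once the job finishes, it can also help to set EMR's auto-termination flag in this same dict. This is a suggested addition, not part of the original snippet:

Instances={
    'InstanceGroups': [
        # ... same master and core groups as above ...
    ],
    'Ec2KeyName': KEY_PAIR,
    # Terminate the cluster automatically once all steps have finished,
    # instead of leaving it running (and billing) while idle
    'KeepJobFlowAliveWhenNoSteps': False
}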

2. Bootstrapping the nodes

Here you specify the S3 path to a shell script that installs all the requirements and dependencies on every master and core node while the EMR cluster is being set up.

Note: BootstrapActions is a list, so you can add multiple scripts here if needed.

BootstrapActions=[
    {
        'Name': 'Install packages',
        'ScriptBootstrapAction': {
            'Path': INITIALIZATION_SCRIPT_PATH
        }
    }
]
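The bootstrap script itself has to be in S3 before the cluster launches. Here is a minimal sketch of uploading it once with Boto3; the script name, bucket and key here are placeholders, not values from the original post:

import boto3

# Upload the bootstrap shell script ahead of time (assumed file/bucket/key names)
s3 = boto3.client('s3')
s3.upload_file('bootstrap.sh', 'my-bucket', 'scripts/bootstrap.sh')

# The S3 path passed to ScriptBootstrapAction above
INITIALIZATION_SCRIPT_PATH = 's3://my-bucket/scripts/bootstrap.sh'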

3. Running the Steps

In this part of the Lambda function you can add one or more steps. Each step runs a processing job, such as Hadoop MapReduce or PySpark.

In each step you specify the name of the job, what happens if the job fails for any reason, and the command that runs the job. I have added a step for a Hadoop MapReduce job below.

Note: Steps is also a list, so you can add multiple steps if needed.

Steps=[
    {'Name': 'Name of the Step',
     # Shut the cluster down if this step fails
     'ActionOnFailure': 'TERMINATE_CLUSTER',
     'HadoopJarStep': {
         # command-runner.jar lets EMR run an arbitrary command as a step
         'Jar': 'command-runner.jar',
         'Args': [
             'hadoop-streaming',
             # Ship the mapper, reducer and input files to the cluster nodes
             '-files',
             '{},{},{}'.format(MAPPER_PATH, REDUCER_PATH, INPUT_PATH),
             '-mapper', MAPPER_FILE,
             '-input', INPUT_PATH,
             '-output', OUTPUT_PATH,
             '-reducer', REDUCER_FILE
         ]}
     }
]

All that is left now is to add a trigger for this AWS Lambda function. You may need to handle how you get your INPUT_PATH depending on your trigger.

Complete Lambda Function

You can get the complete AWS Lambda function from my GitHub repository.
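For orientation, here is a minimal sketch of how the three parts above fit together in the handler, with the Instances, BootstrapActions and Steps fragments assigned to variables of the same names. The cluster name, EMR release, region, log bucket and IAM role names below are assumptions; see the repository for the real values:

import boto3

def lambda_handler(event, context):
    emr = boto3.client('emr', region_name='us-east-1')  # assumed region
    response = emr.run_job_flow(
        Name='hadoop-map-reduce',           # assumed cluster name
        ReleaseLabel='emr-5.30.0',          # assumed EMR release
        LogUri='s3://my-bucket/emr-logs/',  # assumed log location
        Applications=[{'Name': 'Hadoop'}],
        Instances=Instances,                # part 1: instance configuration
        BootstrapActions=BootstrapActions,  # part 2: bootstrap script(s)
        Steps=Steps,                        # part 3: the MapReduce job
        JobFlowRole='EMR_EC2_DefaultRole',  # default EMR instance profile
        ServiceRole='EMR_DefaultRole',      # default EMR service role
        VisibleToAllUsers=True
    )
    # The job flow id is useful for logging and later status checks
    return {'JobFlowId': response['JobFlowId']}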

Also Read: Trigger AWS Lambda Asynchronously from API Gateway

