
Recently, I have been processing large data sets on a daily basis. I decided to use Hadoop MapReduce and wrote mapper and reducer scripts to process the data.
The whole process involved launching an EMR cluster, installing requirements on all the nodes, uploading files to Hadoop’s HDFS, running the job, and finally terminating the cluster (because an AWS EMR cluster is expensive).
To eliminate the manual effort, I wrote an AWS Lambda function that does the whole process automatically. The function is written in Python and uses APIs from the Boto3 library.
The Lambda function consists of three parts:
1. Instance Configuration
In this part of the AWS Lambda function, you set the configuration for your EMR cluster: the type of instances the cluster will use, the number of master and core nodes, and so on.
Instances={
    'InstanceGroups': [
        {
            'Name': 'master',
            'InstanceRole': 'MASTER',
            'InstanceType': 'm3.xlarge',
            'InstanceCount': 1,
        },
        {
            'Name': 'core',
            'InstanceRole': 'CORE',
            'InstanceType': 'm3.xlarge',
            'InstanceCount': 2,
        },
    ],
    'Ec2KeyName': KEY_PAIR  # This allows us to SSH into the nodes with the key pair
}
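One flag worth considering here (it is not in the configuration above, so treat this as a suggestion): the Instances dictionary also accepts KeepJobFlowAliveWhenNoSteps. Setting it to False tells EMR to terminate the cluster automatically once all steps finish, which fits the goal of not leaving an expensive cluster running:

Instances={
    'InstanceGroups': [...],               # the instance groups shown above
    'Ec2KeyName': KEY_PAIR,
    'KeepJobFlowAliveWhenNoSteps': False   # auto-terminate once all steps complete
}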
2. Bootstrapping the Nodes
Here you specify the S3 path to a shell script that will install all the requirements and dependencies on every master and core node while the EMR cluster is being set up.
Note: BootstrapActions is a list, so you can add multiple scripts here if needed.
BootstrapActions=[
    {
        'Name': 'Install packages',
        'ScriptBootstrapAction': {
            'Path': INITIALIZATION_SCRIPT_PATH
        }
    }
]
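If the script is not in S3 yet, a one-time upload with Boto3 is enough. A minimal sketch, assuming the script exists locally; the bucket and key names are placeholders:

import boto3

s3 = boto3.client('s3')
# Upload the bootstrap script so EMR can fetch it while provisioning the nodes.
# 'my-emr-bucket' and the key are placeholder names.
s3.upload_file('install_requirements.sh', 'my-emr-bucket', 'bootstrap/install_requirements.sh')

INITIALIZATION_SCRIPT_PATH = 's3://my-emr-bucket/bootstrap/install_requirements.sh'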
3. Running the Steps
In this part of the Lambda function you can add one or more steps. Each step can run a processing job such as Hadoop MapReduce or PySpark.
In each step you specify the name of the job, what should happen if the job fails for any reason, and the command that runs the job. I have added a step for a Hadoop MapReduce job below.
Note: Steps is also a list, so you can add multiple steps if needed.
Steps=[
    {
        'Name': 'Name of the Step',
        'ActionOnFailure': 'TERMINATE_CLUSTER',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                'hadoop-streaming',
                '-files', '{},{},{}'.format(MAPPER_PATH, REDUCER_PATH, INPUT_PATH),
                '-mapper', MAPPER_FILE,
                '-input', INPUT_PATH,
                '-output', OUTPUT_PATH,
                '-reducer', REDUCER_FILE
            ]
        }
    }
]
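Putting the three parts together, the cluster is launched with a single run_job_flow call. Below is a minimal sketch of how the pieces wire up; the cluster name, release label, log URI and IAM roles are illustrative assumptions, not the exact values from my function:

import boto3

emr = boto3.client('emr')

response = emr.run_job_flow(
    Name='automated-mapreduce-job',      # placeholder cluster name
    ReleaseLabel='emr-5.29.0',           # assumed EMR release; pick the one you need
    LogUri='s3://my-emr-bucket/logs/',   # assumed log destination
    Instances={...},                     # the Instances dictionary from part 1
    BootstrapActions=[...],              # the BootstrapActions list from part 2
    Steps=[...],                         # the Steps list from part 3
    JobFlowRole='EMR_EC2_DefaultRole',   # default EMR roles; adjust for your account
    ServiceRole='EMR_DefaultRole',
)
print('Started cluster:', response['JobFlowId'])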
All that is left now is to add a trigger for this AWS Lambda function. You may need to handle how you receive your INPUT_PATH depending on your trigger.
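For example, if the trigger is an S3 upload event, the input path can be read straight from the event payload. A minimal sketch following the standard S3 event schema; the handler body is illustrative:

def lambda_handler(event, context):
    # Assuming an S3 put-event trigger; these keys follow the standard S3 event schema.
    record = event['Records'][0]['s3']
    INPUT_PATH = 's3://{}/{}'.format(record['bucket']['name'], record['object']['key'])
    # ...build Instances, BootstrapActions and Steps, then call run_job_flow as above...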
Complete Lambda Function
You can get the complete AWS Lambda function from my GitHub repository.
Subscribe and Share
Don’t forget to subscribe and share this blog if the article helped you.
Also Read: Trigger AWS Lambda Asynchronously from API Gateway
Also Read: Coding a Tinder Bot in Python