AWS Lambda Function

Recently, I have been working on processing large data sets on a daily basis. I decided to use Hadoop MapReduce and wrote mapper and reducer scripts to process the data.

The whole process included launching an EMR cluster, installing requirements on all the nodes, uploading files to Hadoop’s HDFS, running the job, and finally terminating the cluster (because an AWS EMR cluster is expensive to keep running).

To eliminate the manual effort, I wrote an AWS Lambda function that does the whole process automatically. I wrote the function in Python, using APIs from the Boto3 library.

The Lambda function consists of three parts:

1. Instance Configuration

In this part of the AWS Lambda function you set the configuration for your EMR cluster: what type of instances the cluster will have, how many master and core nodes it will run, and so on.

Instances={
    'InstanceGroups': [
        {'Name': 'master',
         'InstanceRole': 'MASTER',
         'InstanceType': 'm3.xlarge',
         'InstanceCount': 1,
         },
        {'Name': 'core',
         'InstanceRole': 'CORE',
         'InstanceType': 'm3.xlarge',
         'InstanceCount': 2,
         },

    ],
    'Ec2KeyName': KEY_PAIR  # This allows us to ssh with the keypair
}

2. Bootstrapping the nodes

Here you specify the S3 path to a shell script that installs all the requirements and dependencies on the master and core nodes while the EMR cluster is being set up.

Note: BootstrapActions is a list, so you can add multiple scripts here if needed.

BootstrapActions=[
    {
        'Name': 'Install packages',
        'ScriptBootstrapAction': {
            'Path': INITIALIZATION_SCRIPT_PATH
        }
    }
]

3. Running the Steps

In this part of the Lambda function you can add one or more steps. Each step runs a processing job, such as a Hadoop MapReduce or PySpark job.

In each step you specify the name of the job, what should happen if the job fails for any reason (ActionOnFailure), and the command that runs the job. I have added a step for a Hadoop MapReduce job below.

Note: Steps is also a list, so you can add multiple steps if needed.

Steps=[
    {'Name': 'Name of the Step',
     'ActionOnFailure': 'TERMINATE_CLUSTER',
     'HadoopJarStep': {
         'Jar': 'command-runner.jar',
         'Args': [
             'hadoop-streaming',
             '-files',
             '{},{},{}'.format(MAPPER_PATH, REDUCER_PATH, INPUT_PATH),
             '-mapper', MAPPER_FILE,
             '-input', INPUT_PATH,
             '-output', OUTPUT_PATH,
             '-reducer', REDUCER_FILE
         ]}
     }
]
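For reference, a Hadoop streaming mapper and reducer are plain scripts that read lines from stdin and write tab-separated key/value pairs to stdout. A minimal word-count pair, purely illustrative and not the job from this post, might look like:

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' pair for every word on every input line."""
    for line in lines:
        for word in line.strip().split():
            yield '{}\t1'.format(word)

def reducer(lines):
    """Sum the counts for each word. Hadoop streaming sorts mapper
    output by key before the reducer runs, so equal words arrive on
    consecutive lines and groupby can total them."""
    pairs = (line.strip().split('\t') for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield '{}\t{}'.format(word, sum(int(n) for _, n in group))

# In a real job these live in separate mapper.py / reducer.py files
# (the MAPPER_FILE and REDUCER_FILE above), each piping sys.stdin
# straight to sys.stdout.
```

Hadoop handles the sorting and shuffling between the two scripts; each one only ever sees a stream of lines.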

All that is left now is to add a trigger for this AWS Lambda function. You may need to adjust how you obtain your INPUT_PATH depending on your trigger.
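If the trigger is an S3 upload, for instance, the input path can be derived from the event payload. A small sketch, using the standard S3 notification event layout (the bucket and key names below are illustrative):

```python
def input_path_from_s3_event(event):
    """Build an s3:// input path from the first record of the
    S3 event that triggered the Lambda function."""
    record = event['Records'][0]
    bucket = record['s3']['bucket']['name']
    key = record['s3']['object']['key']
    return 's3://{}/{}'.format(bucket, key)

# Fragment of an event as delivered by an S3 trigger:
sample_event = {
    'Records': [
        {'s3': {'bucket': {'name': 'my-data-bucket'},
                'object': {'key': 'incoming/data.txt'}}}
    ]
}
```

The returned path can then be passed straight into the step arguments as INPUT_PATH.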

Complete Lambda Function

You can get the complete AWS Lambda function from my GitHub repository.
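If you only want the shape of it, the three parts above become keyword arguments to Boto3's run_job_flow call. A minimal sketch, assuming the Instances, BootstrapActions, and Steps values are defined as shown earlier (the cluster name, release label, log path, and IAM role names here are placeholders, not from the original function):

```python
def build_job_flow_args(instances, bootstrap_actions, steps):
    """Assemble the keyword arguments for EMR's run_job_flow call
    from the three parts described above."""
    return {
        'Name': 'lambda-launched-cluster',     # placeholder name
        'ReleaseLabel': 'emr-5.30.0',          # example EMR release
        'LogUri': 's3://my-bucket/emr-logs/',  # placeholder log path
        'Instances': dict(instances,
                          # Let the cluster terminate itself once all
                          # steps finish, so nothing is left running.
                          KeepJobFlowAliveWhenNoSteps=False),
        'BootstrapActions': bootstrap_actions,
        'Steps': steps,
        'JobFlowRole': 'EMR_EC2_DefaultRole',  # default EMR roles
        'ServiceRole': 'EMR_DefaultRole',
    }

def lambda_handler(event, context):
    # boto3 is available in the Lambda Python runtime by default.
    import boto3
    emr = boto3.client('emr')
    # Instances, BootstrapActions, and Steps are the dict/lists
    # defined in the three parts above.
    return emr.run_job_flow(
        **build_job_flow_args(Instances, BootstrapActions, Steps))
```

Setting KeepJobFlowAliveWhenNoSteps to False is what makes the cluster tear itself down after the last step, so the Lambda function does not need to terminate it explicitly.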


Subscribe and Share

Don’t forget to subscribe and share this blog if the article helped you.

