Spark on AWS Lambda: An Apache Spark runtime for AWS Lambda


Spark on AWS Lambda (SoAL) is a framework that runs Apache Spark workloads on AWS Lambda. It’s designed for both batch and event-based workloads, handling data payload sizes from 10 KB to 400 MB. This framework is ideal for batch analytics workloads from Amazon Simple Storage Service (Amazon S3) and event-based streaming from Amazon Managed Streaming for Apache Kafka (Amazon MSK) and Amazon Kinesis. The framework integrates data with platforms like Apache Iceberg, Delta Lake, Apache HUDI, Amazon Redshift, and Snowflake, offering a low-cost and scalable data processing solution. SoAL lets you run data-processing engines like Apache Spark and take advantage of the benefits of serverless architecture, like auto scaling and compute for analytics workloads.

This post highlights the SoAL architecture, provides infrastructure as code (IaC), offers step-by-step instructions for setting up the SoAL framework in your AWS account, and outlines SoAL architectural patterns for enterprises.

Solution overview

Apache Spark offers cluster mode and local mode deployments, with the former incurring latency because of cluster initialization and warmup. Although Apache Spark’s cluster-based engines are commonly used for data processing, especially with ACID frameworks, they exhibit high resource overhead and slower performance for payloads under 50 MB compared to the more efficient Pandas framework for smaller datasets. Compared to cluster mode, local mode provides faster initialization and better performance for small analytics workloads. The Apache Spark local mode on the SoAL framework is optimized for small analytics workloads, and cluster mode is optimized for larger analytics workloads, making it a versatile framework for enterprises.

We provide an AWS Serverless Application Model (AWS SAM) template, available in the GitHub repo, to deploy the SoAL framework in an AWS account. The AWS SAM template builds the Docker image, pushes it to the Amazon Elastic Container Registry (Amazon ECR) repository, and then creates the Lambda function. The AWS SAM template expedites the setup and adoption of the SoAL framework for AWS customers.

SoAL architecture

The SoAL framework provides local mode and containerized Apache Spark running on Lambda. In the SoAL framework, Lambda runs in a Docker container with Apache Spark and AWS dependencies installed. On invocation, the SoAL framework’s Lambda handler fetches the PySpark script from an S3 folder and submits the Spark job on Lambda. The logs for the Spark jobs are recorded in Amazon CloudWatch.
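
The invocation flow described above might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the handler shipped in the repository: the function names, the /tmp download path, and the `--event` argument name are hypothetical, and `SCRIPT_BUCKET` and `SCRIPT_LOCATION` (the environment variables named in the testing step of this post) are assumed to hold the bucket name and the object key of the script.

```python
import json
import os
import subprocess


def build_spark_submit_command(script_path, event):
    """Build the spark-submit argument list for a local-mode run.

    Passing the event as a JSON string under --event is an assumption
    made for this sketch; the real handler may use a different name.
    """
    return [
        "spark-submit",
        "--master", "local[*]",
        script_path,
        "--event", json.dumps(event),
    ]


def lambda_handler(event, context):
    # boto3 is assumed to be installed in the container image; it is
    # imported here so the helper above can be used without it.
    import boto3

    bucket = os.environ["SCRIPT_BUCKET"]
    key = os.environ["SCRIPT_LOCATION"]  # assumed to be the full object key
    local_path = os.path.join("/tmp", os.path.basename(key))

    # Fetch the PySpark script from S3 into the container's writable /tmp.
    boto3.client("s3").download_file(bucket, key, local_path)

    # Run the job; output written to stdout and stderr is captured by
    # CloudWatch Logs.
    subprocess.run(build_spark_submit_command(local_path, event), check=True)
    return {"status": "completed", "script": key}
```

Keeping the command construction in its own function makes it easy to check what would be submitted without invoking Spark or Amazon S3.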

For both streaming and batch tasks, the Lambda event is sent to the PySpark script as a named argument. By using a container-based image cache together with the warm instance features of Lambda, we found that the overall JVM warmup time decreased from approximately 70 seconds to under 30 seconds. We observed that the framework performs well with batch payloads up to 400 MB and streaming data from Amazon MSK and Kinesis. The per-session cost for any given analytics workload depends on the number of requests, the run duration, and the memory configured for the Lambda functions.
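
To make the cost relationship concrete, here is a rough estimator that combines the three factors named above. The function name is hypothetical and the default prices are illustrative assumptions only; check current AWS Lambda pricing for your Region and architecture before relying on any figure.

```python
def estimate_lambda_cost(requests, avg_duration_s, memory_mb,
                         price_per_million_requests=0.20,
                         price_per_gb_second=0.0000166667):
    """Rough cost: request charge plus compute charge in GB-seconds.

    The default prices are assumptions for illustration, not quoted
    from this post.
    """
    request_cost = requests / 1_000_000 * price_per_million_requests
    gb_seconds = requests * avg_duration_s * (memory_mb / 1024)
    return request_cost + gb_seconds * price_per_gb_second


# Example: 1,000 runs of 60 seconds each with 3,008 MB configured
print(round(estimate_lambda_cost(1_000, 60, 3008), 2))
```

Because compute is billed on duration multiplied by configured memory, a shorter warmup reduces the per-run cost directly.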

The following diagram illustrates the SoAL architecture.

Enterprise architecture

The PySpark script is developed in standard Spark and is compatible with the SoAL framework, Amazon EMR, Amazon EMR Serverless, and AWS Glue. If needed, you can use the PySpark scripts in cluster mode on Amazon EMR, EMR Serverless, and AWS Glue. For analytics workloads between a few KB and 400 MB, you can use the SoAL framework on Lambda; for larger workloads over 400 MB, you can run the same PySpark script on AWS cluster-based tools like Amazon EMR, EMR Serverless, and AWS Glue. The extensible script and architecture make SoAL a scalable framework for analytics workloads for enterprises. The following diagram illustrates this architecture.
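
The size-based split described in this section can also be expressed as a small routing helper. This is only a sketch: the function name and return labels are hypothetical, and the 400 MB threshold is the figure cited in this post rather than a hard service limit.

```python
SOAL_MAX_PAYLOAD_MB = 400  # upper bound cited in this post


def choose_runtime(payload_mb):
    """Return a label for where the same PySpark script could run."""
    if payload_mb <= 0:
        raise ValueError("payload size must be positive")
    if payload_mb <= SOAL_MAX_PAYLOAD_MB:
        return "soal-lambda"
    # For example Amazon EMR, EMR Serverless, or AWS Glue
    return "cluster"
```

An orchestrator could call such a helper with the object size reported by Amazon S3 and then dispatch the unchanged script to the selected target.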


To implement this solution, you need an AWS Identity and Access Management (IAM) role with permission to AWS CloudFormation, Amazon ECR, Lambda, and AWS CodeBuild.

Set up the solution

To set up the solution in an AWS account, complete the following steps:

  1. Clone the GitHub repository to local storage and change the directory within the cloned folder to the CloudFormation folder:
    git clone https://github.com/aws-samples/spark-on-aws-lambda.git

  2. Run the AWS SAM template sam-imagebuilder.yaml using the following command with the stack name and framework of your choice. In this example, the framework is Apache HUDI:
    sam deploy --template-file sam-imagebuilder.yaml --stack-name spark-on-lambda-image-builder --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM --resolve-s3 --parameter-overrides 'ParameterKey=FRAMEWORK,ParameterValue=HUDI'

The command deploys a CloudFormation stack called spark-on-lambda-image-builder. The command runs a CodeBuild project that builds and pushes the Docker image with the latest tag to Amazon ECR. The ParameterValue of the FRAMEWORK parameter selects the open-source framework (Delta Lake, Apache HUDI, or Apache Iceberg).

  3. After the stack has been successfully deployed, copy the ECR repository URI (spark-on-lambda-image-builder) that’s displayed in the output of the stack.
  4. Run the AWS SAM Lambda package with the required Region and ECR repository:
    sam deploy --template-file sam-template.yaml --stack-name spark-on-lambda-stack --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM --resolve-s3 --image-repository <accountno>.dkr.ecr.us-east-1.amazonaws.com/sparkonlambda-spark-on-lambda-image-builder --parameter-overrides 'ParameterKey=ScriptBucket,ParameterValue=<Provide the S3 bucket of the script> ParameterKey=SparkScript,ParameterValue=<Provide the S3 folder location> ParameterKey=ImageUri,ParameterValue=<accountno>.dkr.ecr.us-east-1.amazonaws.com/sparkonlambda-spark-on-lambda-image-builder:latest'

This command creates the Lambda function with the container image from the ECR repository. An output file packaged-template.yaml is created in the local directory.

  5. Optionally, to publish the AWS SAM application to the AWS Serverless Application Repository, run the following command. This lets you share the AWS SAM template through the AWS Serverless Application Repository GUI so that other developers can use it for quick deployments in the future.
    sam publish --template packaged-template.yaml

After you run this command, a Lambda function is created using the SoAL framework runtime.

  6. To test it, use PySpark scripts from the spark-scripts folder. Place the sample script and accomodations.csv dataset in an S3 folder and provide the location via the Lambda environment variables SCRIPT_BUCKET and SCRIPT_LOCATION.

After Lambda is invoked, it downloads the PySpark script from the S3 folder to the container’s local storage and runs it on the SoAL framework container using spark-submit. The Lambda event is also passed to the PySpark script.
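
On the script side, reading the event could look like the following minimal sketch. The `--event` argument name and the JSON-string encoding are assumptions made for illustration; check the sample scripts in the repository for the actual argument the framework passes.

```python
import argparse
import json


def parse_event(argv=None):
    """Read the Lambda event passed to the script as a named argument."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--event", default="{}",
                        help="Lambda event as a JSON string (assumed format)")
    # parse_known_args tolerates any extra arguments added by the launcher.
    args, _unknown = parser.parse_known_args(argv)
    return json.loads(args.event)
```

A PySpark script could call `parse_event()` at startup and use fields of the returned dictionary, such as an object key from an S3 notification, to decide which data to read.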

Clean up

Deploying an AWS SAM template incurs costs. Delete the Docker image from Amazon ECR, delete the Lambda function, and remove all the files or scripts from the S3 location. You can also use the following command to delete the stack:

    sam delete --stack-name spark-on-lambda-stack


The SoAL framework lets you run Apache Spark serverless tasks on AWS Lambda efficiently and cost-effectively. Beyond cost savings, it provides fast processing times for small to medium files. As a holistic enterprise vision, SoAL bridges the gap between big and small data processing, using the power of the Apache Spark runtime across both Lambda and other cluster-based AWS resources.

Follow the steps in this post to use the SoAL framework in your AWS account, and leave a comment if you have any questions.

About the authors

John Cherian is a Senior Solutions Architect (SA) at Amazon Web Services who helps customers with strategy and architecture for building solutions on AWS.

Emerson Antony is a Senior Cloud Architect at Amazon Web Services who helps customers with implementing AWS solutions.

Kiran Anand is a Principal AWS Data Lab Architect at Amazon Web Services who helps customers with big data and analytics architecture.


