
Real-Time Data Analysis: Building a Lambda Architecture with AWS DynamoDB and Lambda

· 6 min read
Samuel Joset

If you're here, it's probably because you have an interest in big data. This field has many applications. One of them, real-time data analysis, is a growing concern for startups.

In this article, we're going to define what a Lambda architecture is and explain how to implement it. For the sake of simplicity, we will exclusively use AWS tools. Nevertheless, the concepts we're going to discuss can be applied with any other tool.

Are you ready? Let's go!

1. Understanding Lambda Architecture

Let's take a moment to understand what a Lambda architecture is:

Lambda architecture is a data processing structure designed to handle immense amounts of data. It is devised to meet the challenges posed by large-scale data management systems.

What sets this architecture apart is its three essential parts: batch processing (Batch Layer), real-time processing (Speed Layer), and serving results (Serving Layer).

The Batch Layer is the layer where all incoming data are indexed. It handles data processing and generates views from it. These views are pre-computed and stored. The advantage of pre-computing these views is that it makes read operations extremely fast.

The Speed Layer operates in parallel with the Batch Layer. Its primary role is to bridge the gap between the most recent data received and the latest pre-computed view from the Batch Layer. In other words, it takes charge of the data that have not yet been processed by the Batch Layer and processes them in real time.

The Serving Layer combines the results of the Batch and Speed layers to provide a unified view of the data. It ensures that the most recent data are always available for analysis, even if they have not yet been processed by the Batch Layer. Thus, when a request is made, the result is always obtained from the most recent data. This is real-time data analysis at its finest!
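
To make the role of the Serving Layer concrete, here is a minimal sketch of the merge step. It assumes, purely for illustration, that the batch view is a map of per-key counts, that it records the timestamp up to which data has been processed (`computedUntil`), and that each recent event carries a `key` and a `timestamp`. None of these names come from a library or from the rest of this article.

```javascript
// Hypothetical shapes, for illustration only:
//   batchView:    { computedUntil: number, counts: { [key]: number } }
//   recentEvents: [{ key: string, timestamp: number }]
const mergeViews = (batchView, recentEvents) => {
  // Start from a copy of the pre-computed batch result.
  const merged = { ...batchView.counts };
  for (const event of recentEvents) {
    // Only count events the Batch Layer has not processed yet.
    if (event.timestamp > batchView.computedUntil) {
      merged[event.key] = (merged[event.key] || 0) + 1;
    }
  }
  return merged;
};

const batchView = { computedUntil: 1000, counts: { paris: 3, lyon: 1 } };
const recentEvents = [
  { key: 'paris', timestamp: 900 }, // already included in the batch view
  { key: 'paris', timestamp: 1100 },
  { key: 'nice', timestamp: 1200 },
];
console.log(mergeViews(batchView, recentEvents));
// → { paris: 4, lyon: 1, nice: 1 }
```

The timestamp check is what prevents an event from being counted twice, once by the Batch Layer and once by the Speed Layer.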

2. AWS DynamoDB, Lambda, and EventBridge to the Rescue

To implement our Lambda architecture, we will mainly use three AWS services: AWS DynamoDB, AWS Lambda, and AWS EventBridge.

AWS DynamoDB will be used for storage. Its scaling capacity and performance allow us to store and retrieve any amount of data, regardless of the traffic intensity. For this reason, DynamoDB will serve both the Batch Layer and the Speed Layer.

We will also use AWS Lambda to run our code. This service has the advantage of executing the code only when necessary, and automatically scaling resources based on demand. In addition, AWS Lambda allows code to be executed in response to an event, on a cron schedule, or even in response to an HTTP request.

Finally, AWS EventBridge will allow us to receive data asynchronously and trigger functions from specific events. It is via EventBridge events that we will trigger an AWS Lambda function that will execute the code responsible for storing data in DynamoDB.

3. Building the Batch Layer with AWS DynamoDB

The construction of the Batch layer involves processing and storing a significant volume of data in batches. Several points need to be addressed in this part:

  • When will the batches be created?
  • What will the format of the batches be?
  • How will the data enabling batch creation be stored?

3.1. Retrieving Data from DynamoDB

The first step in our process is to retrieve the data from DynamoDB. We will use the AWS SDK to perform this task.

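const AWS = require('aws-sdk');
const documentClient = new AWS.DynamoDB.DocumentClient();

// A single scan call returns at most 1 MB of data, so we keep scanning
// from LastEvaluatedKey until every page of the table has been read.
const fetchRecords = async (tableName) => {
  const params = { TableName: tableName };
  let items = [];
  do {
    const data = await documentClient.scan(params).promise();
    items = items.concat(data.Items);
    params.ExclusiveStartKey = data.LastEvaluatedKey;
  } while (params.ExclusiveStartKey);
  return items;
};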

3.2. Batch Processing of Data

After retrieving the data, the next step is to process them in batches. Batch processing is not just a matter of grouping data into packets. It can involve several operations depending on the needs, including aggregation, enrichment, or data cleaning.

  • Data aggregation: commonly used to summarize data. For example, sales data can be aggregated by region, by day, etc. We can also calculate new values through calculations of sums, averages, minimums, maximums, etc.

  • Data enrichment: We may also want to add additional information to our existing data from other sources. This can, for example, include adding demographic information to sales data, adding geographic information to location data, etc.

  • Data cleaning: This step should be systematic with each batch generation. It involves several operations, such as the removal of null values, the correction of data entry errors, the standardization of data formats, etc.
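
As an illustration of the cleaning and aggregation operations listed above, here is a minimal sketch. It assumes a hypothetical record shape `{ region, amount }` for sales data; the function names are made up for this example.

```javascript
// Hypothetical record shape: { region: string, amount: number }.
// Data cleaning: drop records with a missing region or a non-numeric amount.
const cleanRecords = (records) =>
  records.filter(
    (r) => r && typeof r.region === 'string' && r.region !== '' && Number.isFinite(r.amount)
  );

// Data aggregation: total amount and number of sales per region.
const aggregateByRegion = (records) => {
  const totals = {};
  for (const { region, amount } of records) {
    if (!totals[region]) totals[region] = { total: 0, count: 0 };
    totals[region].total += amount;
    totals[region].count += 1;
  }
  return totals;
};

const sales = [
  { region: 'north', amount: 10 },
  { region: 'south', amount: 5 },
  { region: 'north', amount: 7 },
  { region: null, amount: 3 }, // removed by cleaning
  { region: 'south', amount: NaN }, // removed by cleaning
];
console.log(aggregateByRegion(cleanRecords(sales)));
```

Cleaning runs before aggregation so that invalid records cannot distort the totals. An average per region can be derived from `total / count` when the view is read.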

In our example, we will group the data into batches in the simplest possible way, without performing aggregation or enrichment. However, keep in mind that the logic of batch processing can and should vary according to your needs.

// Split the records into consecutive chunks of at most batchSize items.
const createBatches = (data, batchSize) => {
  const batches = [];
  for (let i = 0; i < data.length; i += batchSize) {
    batches.push(data.slice(i, i + batchSize));
  }
  return batches;
};

3.3. Storing the Processed Data

Once our data has been grouped into batches, we need to store them again in DynamoDB.

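const storeBatchedData = async (tableName, data) => {
  const runId = Date.now();
  for (const [index, batch] of data.entries()) {
    const params = {
      TableName: tableName,
      // Item must be an object, so the batch array is wrapped in one.
      // `batchId` is an assumed partition key name: use your table's key.
      // Keep batches small: a DynamoDB item cannot exceed 400 KB.
      Item: { batchId: `${runId}-${index}`, records: batch },
    };

    try {
      await documentClient.put(params).promise();
    } catch (err) {
      console.error(`Failed to store batch ${index}`, err);
    }
  }
};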

3.4. Configuring the Lambda Trigger with AWS EventBridge

Next, in serverless.yml, we will configure the processBatchData function to be triggered at regular intervals.

functions:
  processBatchData:
    handler: handler.processBatchData
    events:
      - schedule: rate(24 hours)

In this example, our processBatchData function is triggered once a day thanks to the schedule event.

With these different steps, we have everything necessary for the operation of our Batch layer.
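
The processBatchData handler itself is not shown above. Here is a minimal sketch of how it could tie the three previous steps together. The factory function `makeProcessBatchData`, the `deps` object, and the environment variable names are assumptions made for this example; the I/O functions are passed in so that the logic can be run without AWS.

```javascript
// Same chunking helper as in section 3.2, repeated to keep the sketch self-contained.
const createBatches = (data, batchSize) => {
  const batches = [];
  for (let i = 0; i < data.length; i += batchSize) {
    batches.push(data.slice(i, i + batchSize));
  }
  return batches;
};

// Builds the handler from its I/O functions. `deps.fetchRecords` and
// `deps.storeBatchedData` are expected to behave like the functions
// from sections 3.1 and 3.3.
const makeProcessBatchData = (deps) => async () => {
  const records = await deps.fetchRecords(deps.sourceTable);
  const batches = createBatches(records, deps.batchSize);
  await deps.storeBatchedData(deps.targetTable, batches);
  return { recordCount: records.length, batchCount: batches.length };
};

// In handler.js, the wiring could look like this (the environment variable
// names are assumptions):
// module.exports.processBatchData = makeProcessBatchData({
//   fetchRecords,
//   storeBatchedData,
//   sourceTable: process.env.RAW_TABLE,
//   targetTable: process.env.BATCH_TABLE,
//   batchSize: 100,
// });
```

Returning the record and batch counts is optional, but it makes each scheduled run easy to check in the logs.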

4. Configuring the Speed Layer with AWS Lambda

We are going to set up an AWS Lambda function that will be triggered each time a new event is emitted by AWS EventBridge. This function will be tasked with receiving real-time data and storing it for later use.

First, let's add a new function to our serverless.yml file:

functions:
  RealTimeDataReceiver:
    handler: handler.realTimeDataReceiver
    events:
      - eventBridge:
          pattern:
            source:
              - "my.app"
            detail-type:
              - "DataEvent"

Next, we need to create the corresponding realTimeDataReceiver function in our handler.js file:

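// Declared once at the top of handler.js
const AWS = require('aws-sdk');
const dynamoDb = new AWS.DynamoDB.DocumentClient();

module.exports.realTimeDataReceiver = async event => {
  // EventBridge delivers the event payload in the `detail` field
  const rawData = event.detail;

  // Storing the received data in DynamoDB
  // (rawData must contain the table's primary key attribute)
  await dynamoDb.put({
    TableName: process.env.DYNAMODB_TABLE,
    Item: rawData,
  }).promise();

  return {
    statusCode: 200,
    body: JSON.stringify(rawData),
  };
};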

In this example, the realTimeDataReceiver function simply retrieves the event's data and stores it in DynamoDB for future use. It's important to note that the data processing does not occur in this function. Indeed, the Speed Layer in our setup is solely responsible for the receipt and storage of real-time data.
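
For the function to be triggered, some producer has to emit events that match the pattern declared in serverless.yml. A minimal sketch of such a producer could look like this. The helper names `buildDataEvent` and `publishDataEvent` are hypothetical, and the client is passed in as a parameter (for example `new AWS.EventBridge()` from the AWS SDK v2) rather than created here.

```javascript
// Builds one entry for the EventBridge PutEvents API. `Source` and `DetailType`
// match the pattern declared in serverless.yml; `Detail` must be a JSON string.
const buildDataEvent = (payload, eventBusName = 'default') => ({
  Source: 'my.app',
  DetailType: 'DataEvent',
  Detail: JSON.stringify(payload),
  EventBusName: eventBusName,
});

// `eventBridge` is expected to be an AWS SDK v2 EventBridge client.
// It is passed in so that the function is easy to test.
const publishDataEvent = async (eventBridge, payload) => {
  const result = await eventBridge
    .putEvents({ Entries: [buildDataEvent(payload)] })
    .promise();
  // PutEvents reports per-entry failures in the response instead of throwing.
  if (result.FailedEntryCount > 0) {
    throw new Error(`Failed to publish ${result.FailedEntryCount} event(s)`);
  }
  return result;
};
```

Since realTimeDataReceiver stores `event.detail` as the DynamoDB item, the payload should already contain the table's key attribute.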

5. Creating the Serving Layer

Now, let's tackle the third and final component of the Lambda architecture. Its sole role is to respond to client requests by providing a consolidated view of the results of the calculations carried out by the Batch and Speed layers. It's this layer that allows end users to access the processed data and the derived information generated by the architecture.

To do this, we will create the Serving Layer with another AWS Lambda function. This function will be triggered by an HTTP request and will return the corresponding results from the database.

We will modify the serverless.yml file to add the new function:

functions:
  serveData:
    handler: handler.serveData
    events:
      - http:
          path: /data
          method: get
          cors: true

Here, we have configured our new serveData function to be triggered when a GET request is sent to the "/data" URL.

Here's a pseudo-implementation of this function in the handler.js file:

// Declared once at the top of handler.js
const AWS = require('aws-sdk');
const dynamoDb = new AWS.DynamoDB.DocumentClient();

module.exports.serveData = async event => {
  const params = {
    TableName: process.env.DYNAMODB_TABLE,
    // Add your filtering parameters here, such as the user ID or date, according to the needs of your application
  };

  const data = await dynamoDb.scan(params).promise();

  return {
    statusCode: 200,
    body: JSON.stringify(data.Items),
  };
};

In this example, we have created a Lambda function that simply scans our DynamoDB table and returns the results. Keep in mind that a single scan call returns at most 1 MB of data, so a larger table requires paginating with LastEvaluatedKey. Once again, this is a very basic implementation and you will need to customize the fetch options according to your needs. But the fundamental idea remains to provide a convenient interface for retrieving the data processed by the Lambda architecture.
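
As an example of such a customization, here is a minimal sketch that builds the parameters of a DynamoDB query from the HTTP request instead of scanning the whole table. It assumes, hypothetically, that the table's partition key is named `userId` and that the client sends it as a query string parameter. The helper name `buildQueryParams` is also made up for this example.

```javascript
// Builds DocumentClient `query` parameters from an API Gateway event.
// Assumption: the table's partition key is named `userId`.
const buildQueryParams = (event, tableName) => {
  const userId = event.queryStringParameters && event.queryStringParameters.userId;
  if (!userId) return null; // the caller should answer with an HTTP 400
  return {
    TableName: tableName,
    KeyConditionExpression: '#uid = :uid',
    ExpressionAttributeNames: { '#uid': 'userId' },
    ExpressionAttributeValues: { ':uid': userId },
  };
};

console.log(buildQueryParams({ queryStringParameters: { userId: '42' } }, 'my-table'));
```

The returned object can be passed to `dynamoDb.query(params).promise()` in place of the scan. A query reads only the items that share the given partition key, which is usually faster and cheaper than a full table scan.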

6. Points of Attention

Implementing a Lambda architecture with AWS may seem straightforward on the surface, but it presents specific challenges that require particular attention:

  • Database Throttling: With AWS DynamoDB, it is crucial to understand and manage the provisioned read and write capacity. An incorrect allocation can lead to throttling, which slows down your database operations and degrades the performance of your system. DynamoDB lets you adjust these capacities dynamically, either through auto scaling of the provisioned capacity or by switching the table to on-demand mode.

  • Lambda Execution Time Limits: AWS Lambda has a maximum execution time of 15 minutes. For tasks that require a longer processing time, you may need to reorganize the logic of your application so that it can be executed in several parts, or consider other options, such as transferring these tasks to an EC2 instance.

  • Serving Layer Response Time: The Serving Layer aims to provide a unified and up-to-date view of the data. However, the complexity of merging data from the Batch Layer and the Speed Layer can lead to delays. Make sure to test and optimize the response time of this layer.

  • Monitoring: A large volume of processing also results in a large volume of logs. Set up filters and alarms so that errors in the logs do not go unnoticed.
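
For the throttling point above, one common mitigation is to retry throttled calls with exponential backoff. Here is a minimal sketch, where `withBackoff` and its options are hypothetical names. The AWS SDK already performs some retries on its own, so a wrapper like this is only useful when those are not enough.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retries `operation` when it fails with a throttling error,
// doubling the waiting time after each failed attempt.
const withBackoff = async (operation, { retries = 5, baseDelayMs = 100 } = {}) => {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await operation();
    } catch (err) {
      const throttled =
        err.code === 'ProvisionedThroughputExceededException' ||
        err.code === 'ThrottlingException';
      // Give up on other errors, or once the retry budget is spent.
      if (!throttled || attempt >= retries) throw err;
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
};
```

A DynamoDB write could then be wrapped as `withBackoff(() => dynamoDb.put(params).promise())`. The total waiting time must stay well below the Lambda execution time limit.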

7. Conclusion

The Lambda architecture isn't a universal solution for all use cases. However, when it comes to managing a large volume of incoming data and providing real-time analysis results, it often emerges as the most suitable option.