Toyota and AWS Case Study on the Collision Assistance Application

Aug 31, 2021

In this blog, we will talk about how Toyota built, refined, and deployed the Collision Assistance product with serverless services on AWS.

We also cover the issues they faced with the initial architecture, which was not fully refined, and how they resolved those issues to arrive at the final product architecture.

Building a scalable, affordable, secure, and high performing product using AWS

Toyota uses the serverless architecture from AWS to achieve scalable, affordable, secure performance.

Scalability and affordability

In the initial architecture, they use Amazon Simple Queue Service (Amazon SQS) queues, Amazon Kinesis streams, and AWS Lambda functions. These services allow data pipelines to run servers only when they are needed, which introduces cost savings.

They also process data in smaller units and run them in parallel, which allows data pipelines to scale up efficiently to handle peak traffic loads. These services allow for an architecture that can handle non-uniform traffic without needing additional application logic.
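As a rough illustration of processing data in small units, here is a minimal sketch of a Lambda handler that receives one batch of Kinesis records. The event layout (a "Records" list with a base64-encoded "kinesis" → "data" field) is how Lambda delivers Kinesis batches; the JSON payload and the "speed" field in the usage note are assumptions made for the example, not Toyota's actual message format.

```python
import base64
import json


def decode_kinesis_records(event):
    """Decode the base64 payload of every record in a Kinesis-triggered Lambda event."""
    decoded = []
    for record in event.get("Records", []):
        payload = base64.b64decode(record["kinesis"]["data"])
        # Assumption: each payload is a JSON document.
        decoded.append(json.loads(payload))
    return decoded


def handler(event, context):
    # Lambda calls this once per batch; batches from different shards
    # can be processed by separate invocations running in parallel.
    readings = decode_kinesis_records(event)
    return {"processed": len(readings)}
```

Because each invocation only sees its own small batch, no extra application logic is needed to spread the load when traffic spikes.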

Security

Collision Assistance can deliver information to customers via push notifications. This data must be encrypted because many data points the application collects are sensitive, like geolocation.

To secure this data outside its private network, Toyota uses Amazon Simple Notification Service (Amazon SNS) as the delivery mechanism. Amazon SNS provides HTTPS endpoint delivery of messages coming to topics and subscriptions.
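To make the HTTPS delivery idea concrete, the sketch below builds the parameters for an SNS subscription and refuses any endpoint that is not HTTPS. The parameter names (TopicArn, Protocol, Endpoint) are the ones the SNS Subscribe API accepts; the helper name, the topic ARN, and the URL are hypothetical and only serve the example.

```python
from urllib.parse import urlparse


def https_subscription_params(topic_arn, endpoint_url):
    """Build SNS Subscribe parameters, rejecting endpoints that are not HTTPS."""
    parsed = urlparse(endpoint_url)
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError("endpoint must be an https:// URL")
    return {
        "TopicArn": topic_arn,
        "Protocol": "https",
        "Endpoint": endpoint_url,
    }


# Hypothetical values, for illustration only.
params = https_subscription_params(
    "arn:aws:sns:us-east-1:123456789012:example-topic",
    "https://example.com/notify",
)
```

With boto3 installed and credentials configured, these parameters could be passed as `boto3.client("sns").subscribe(**params)`.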

Performance

To quantify the product’s performance, Toyota reviews the “notification delay.” This metric measures the time between the initial collision and when the customer receives a push notification from Collision Assistance. The ultimate goal is to send the push notification within minutes of a crash, so drivers have this information in near real time.
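The notification delay is simply the difference between two timestamps. A minimal sketch, assuming both times are available as timezone-aware datetimes (the function name and the sample times are made up for illustration):

```python
from datetime import datetime, timezone


def notification_delay_seconds(collision_time, received_time):
    """Seconds between the collision and the customer receiving the push notification."""
    delay = (received_time - collision_time).total_seconds()
    if delay < 0:
        raise ValueError("notification cannot be received before the collision")
    return delay


collision = datetime(2021, 8, 31, 12, 0, 0, tzinfo=timezone.utc)
received = datetime(2021, 8, 31, 12, 2, 30, tzinfo=timezone.utc)
print(notification_delay_seconds(collision, received))  # 150.0
```

A delay of 150 seconds, as in this made-up example, would fall within the "within minutes" goal.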

Initial architecture

The Kinesis stream receives vehicle data from an upstream ingestion service.
A Lambda function writes lookup data to Amazon DynamoDB for every Kinesis record.
This Lambda function filters out obvious non-crash data. If the current record (X) exceeds a certain threshold, it remains a crash candidate and is sent to Amazon SQS.
Amazon SQS sets a delivery delay so that there will be more Kinesis/DynamoDB records available when X is processed later in the pipeline.
A second Lambda function reads the data from the SQS message. It queries DynamoDB to find the Kinesis lookup data for the message before (X-1) and after (X+1) the crash candidate.
Kinesis GetRecords retrieves X-1 and X+1 because X+1 will exist after the SQS delivery delay times out.
The X-1, X, and X+1 messages are sent to the data science (DS) engine.
When a crash is accurately predicted, these results are stored in a DynamoDB table.
The push notification is sent to the vehicle owner.
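The filtering and lookup steps above can be sketched in a few lines. This is an in-memory illustration only: the "sensor_value" field, the threshold, and the integer message index standing in for X-1, X, and X+1 are assumptions, and a plain dictionary stands in for the DynamoDB lookup table.

```python
def is_crash_candidate(record, threshold):
    """Keep a record only if its sensor reading exceeds the threshold."""
    return record["sensor_value"] > threshold


def gather_window(records_by_index, index):
    """Return [X-1, X, X+1] for a candidate, or None if a neighbor is not available yet."""
    window = [records_by_index.get(i) for i in (index - 1, index, index + 1)]
    if any(r is None for r in window):
        return None
    return window


# Hypothetical messages from one vehicle, keyed by message index.
records = {
    1: {"sensor_value": 0.2},
    2: {"sensor_value": 9.5},  # candidate X
    3: {"sensor_value": 0.4},
}
candidates = [i for i, r in records.items() if is_crash_candidate(r, threshold=5.0)]
window = gather_window(records, candidates[0])
```

The delivery delay in the pipeline exists so that, by the time `gather_window` runs, the X+1 message has already arrived.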

To reduce false positives, Toyota gathers data before and after the timestamps where the extremely low thresholds are exceeded. The team then evaluates the sensor data across this timespan and discards any sets with patterns of abnormal sensor readings or other false positive conditions. Figure 2 shows the time window initially used.

Adjusting our initial architecture for better performance

Toyota's initial design worked well for processing a few sample messages and achieved the desired near real-time delivery of the push notification.

However, when the pipeline was enabled for over 1 million vehicles, certain limits were exceeded, particularly for Kinesis and Lambda integrations:

The Kinesis GetRecords API calls exceeded the allowed five requests per shard per second. With each crash candidate retrieving an X-1 and an X+1 message, the pipeline could only evaluate two candidates per shard per second, which is not cost-effective.
Additionally, the downstream SQS-reading Lambda function was limited to 10 records per second per invocation. This meant any slowdown that occurred downstream, such as during DS engine processing, could cause the queue to back up significantly.
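The "two per shard per second" figure follows from simple arithmetic: five GetRecords calls are allowed per shard per second, and each crash candidate needs two of them (one for X-1, one for X+1). A small worked example, where the shard counts are arbitrary:

```python
GET_RECORDS_LIMIT_PER_SHARD_PER_SECOND = 5  # limit stated in the text
CALLS_PER_CANDIDATE = 2                     # one call for X-1, one for X+1


def candidates_per_second(shard_count):
    """Upper bound on crash candidates that can be evaluated per second."""
    per_shard = GET_RECORDS_LIMIT_PER_SHARD_PER_SECOND // CALLS_PER_CANDIDATE
    return shard_count * per_shard


print(candidates_per_second(1))   # 2
print(candidates_per_second(10))  # 20
```

Raising the candidate rate under this design means adding shards purely to buy more read calls, which is why it was not cost-effective.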

To improve cost and performance for the Kinesis-related functionality, Toyota abandoned the DynamoDB lookup table and the GetRecords calls in favor of a Redis cache cluster on Amazon ElastiCache.

This avoids all throughput exceptions from Kinesis and lets the team focus on scaling the stream based on the incoming throughput alone. The ElastiCache cluster scales capacity by adding or removing shards, which improves performance and cost-efficiency.
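One way to picture the cache-based lookup: every incoming message is written to the cache under a predictable key, and the neighbors of a crash candidate are read back by key instead of through GetRecords. The sketch below is an assumption about how that could look, not Toyota's implementation. The key format and field names are hypothetical, and a small dictionary-backed class stands in for a Redis client so the example runs without a server.

```python
import json


class InMemoryCache:
    """Dictionary-backed stand-in with get/set, so the sketch runs without Redis."""

    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value

    def get(self, key):
        return self._store.get(key)


def cache_key(vehicle_id, index):
    # Hypothetical key format: one entry per vehicle message.
    return f"vehicle:{vehicle_id}:msg:{index}"


def store_message(cache, vehicle_id, index, message):
    cache.set(cache_key(vehicle_id, index), json.dumps(message))


def fetch_neighbors(cache, vehicle_id, index):
    """Return [X-1, X+1] for a candidate at `index`; None marks a missing neighbor."""
    neighbors = []
    for i in (index - 1, index + 1):
        raw = cache.get(cache_key(vehicle_id, i))
        neighbors.append(json.loads(raw) if raw is not None else None)
    return neighbors


cache = InMemoryCache()
for i, value in enumerate([0.2, 9.5, 0.4], start=1):
    store_message(cache, "demo-vehicle", i, {"sensor_value": value})
neighbors = fetch_neighbors(cache, "demo-vehicle", 2)
```

Since the functions only rely on `get` and `set`, a real Redis client object could be passed in place of `InMemoryCache`.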

To solve the Amazon SQS/Lambda integration issue, Toyota funneled messages directly to an additional Kinesis stream. This allows the final Lambda function to use some of the better scaling options provided to Kinesis-Lambda event source integrations, like larger batch sizes and max-parallelism.
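The scaling options mentioned here map to settings on the Lambda event source mapping for a Kinesis stream, notably BatchSize and ParallelizationFactor. The sketch below only builds and validates those settings; the ARN, the function name, and the chosen values are placeholders, and the ranges checked (batch size up to 10,000, parallelization factor from 1 to 10) reflect the documented Kinesis limits as far as I know, so verify them against current AWS documentation.

```python
def kinesis_mapping_params(stream_arn, function_name, batch_size=100, parallelization_factor=1):
    """Build parameters for a Kinesis-to-Lambda event source mapping."""
    if not 1 <= batch_size <= 10000:
        raise ValueError("batch_size must be between 1 and 10000")
    if not 1 <= parallelization_factor <= 10:
        raise ValueError("parallelization_factor must be between 1 and 10")
    return {
        "EventSourceArn": stream_arn,
        "FunctionName": function_name,
        "StartingPosition": "LATEST",
        "BatchSize": batch_size,  # records handed to one invocation
        "ParallelizationFactor": parallelization_factor,  # concurrent batches per shard
    }


# Placeholder identifiers, for illustration only.
mapping = kinesis_mapping_params(
    "arn:aws:kinesis:us-east-1:123456789012:stream/example-stream",
    "example-function",
    batch_size=500,
    parallelization_factor=5,
)
```

With boto3, a dictionary like this could be passed to `boto3.client("lambda").create_event_source_mapping(**mapping)`.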

Conclusion

Finally, the managed services and serverless components available on AWS gave Toyota many options to test and refine its architecture. This helped the team find the best fit for its use case. This flexibility in design was a key factor in designing and delivering the best architecture for the product.

With Collision Assistance, drivers can document crash damage, file an insurance claim, and get updates on the actual repair process.