PyTorch is an open-source machine learning (ML) library widely used to develop neural networks and ML models. Those models are usually trained on multiple GPU instances to speed up training, resulting in expensive training time and model sizes up to a few gigabytes. After they’re trained, these models are deployed in production to produce inferences. They can be synchronous, asynchronous, or batch-based workloads. Those endpoints must be highly scalable and resilient in order to process from zero to millions of requests. This is where AWS Lambda can be a compelling compute service for scalable, cost-effective, and reliable synchronous and asynchronous ML inferencing. Lambda offers benefits such as automatic scaling, reduced operational overhead, and pay-per-inference billing.
This post shows you how to use any PyTorch model with Lambda for scalable inferences in production with up to 10 GB of memory. This allows us to use ML models in Lambda functions up to a few gigabytes. For the PyTorch example, we use the Huggingface Transformers, open-source library to build a question-answering endpoint.
Overview of solution
Lambda is a serverless compute service that lets you run code without provisioning or managing servers. Lambda automatically scales your application by running code in response to every event, allowing event-driven architectures and solutions. The code runs in parallel and processes each event individually, scaling with the size of the workload, from a few requests per day to hundreds of thousands of workloads. The following diagram illustrates the architecture of our solution.
You can package your code and dependencies as a container image using tools such as the Docker CLI. The maximum container size is 10 GB. After the model for inference is Dockerized, you can upload the image to Amazon Elastic Container Registry (Amazon ECR). You can then create the Lambda function from the container image stored in Amazon ECR.
For this walkthrough, you should have the following prerequisites:
Implementing the solution
We use a pre-trained language model (DistilBERT) from Huggingface. Huggingface provides a variety of pre-trained language models; the model we’re using is 250 MB large and can be used to build a question-answering endpoint.
We use the AWS SAM CLI to create the serverless endpoint with an Amazon API Gateway. The following diagram illustrates our architecture.
To implement the solution, complete the following steps:
- On your local machine, run
1for the template source (AWS Quick Start Templates)
- As a package type, enter
- For the base image, enter
3 - amazon/python3.8-base.
- As a project name, enter
- Change your
lambda-pytorch-exampleand copy the following code snippets into the
The following code is an example of a requirements.txt file to run PyTorch code in Lambda. Huggingface has as a dependency PyTorch so we don’t need to add it here separately. Add the requirements to the empty requirements.txt in the folder
The following is the code for the
The following Dockerfile is an example for Python 3.8, which downloads and uses the DistilBERT language model fine-tuned for the question-answering task. For more information, see DistilBERT base uncased distilled SQuAD. You can use your custom models by copying them to the model folder and referencing it in the
Change your working directory back to
lambda-pytorch-example and copy the following content into the
Now we need to create an Amazon ECR repository in AWS and register the local Docker to it. The
repositoryUri is displayed in the output; save it for later.
Deploying the application
The following steps deploy the application to your AWS account:
sam build && sam deploy –-guided.
- For Stack Name, enter
- Choose the same Region that you created the Amazon ECR repository in.
- Enter the image repository for the function (enter the earlier saved
repositoryUriof the Amazon ECR repository).
- For Confirm changes before deploy and Allow SAM CLI IAM role creation, keep the defaults.
- For pytorchEndpoint may not have authorization defined, Is this okay?, select y.
- Keep the defaults for the remaining prompts.
AWS SAM uploads the container images to the Amazon ECR repository and deploys the application. During this process, you see a change set along with the status of the deployment. For a more detailed description about AWS SAM and container images for Lambda, see Using container image support for AWS Lambda with AWS SAM.
When the deployment is complete, the stack output is displayed. Use the InferenceApi endpoint to test your deployed application. The endpoint URL is displayed as an output during the deployment of the stack.
Overcoming a Lambda function cold start
Because the plain language model is already around 250 MB, the initial function run can take up to 25 seconds and may even exceed the maximum API timeout of 29 seconds. That time can also be reached when the function wasn’t called for some time and therefore is in a cold start mode. When the Lambda function is in a hot state, one inference run takes about 150 milliseconds.
There are multiple ways to mitigate the runtime of Lambda functions in a cold state. Lambda supports provisioned concurrency to keep the functions initialized. Another way is to create an Amazon CloudWatch event that periodically calls the function to keep it warm.
Make sure to change <API_GATEWAY_URL> to the URL of your API Gateway endpoint. In the following example code, the text is copied from the Wikipedia page on cars. You can change the question and context as you like and check the model’s answers.
The response shows the correct answer to the question:
Container image support for Lambda allows you to customize your function even more, opening up many new use cases for serverless ML. You can bring your custom models and deploy them on Lambda using up to 10 GB for the container image size. For smaller models that don’t need much computing power, you can perform online training and inference purely in Lambda. When the model size increases, cold start issues become more and more important and need to be mitigated. There is also no restriction on the framework or language with container images; other ML frameworks such as TensorFlow, Apache MXNet, XGBoost, or Scikit-learn can be used as well!
If you do require GPU for your inference, you can consider using containers services such as Amazon Elastic Container Service (Amazon ECS), Kubernetes, or deploy the model to an Amazon SageMaker endpoint
About the Author
Jan Bauer is a Cloud Application Developer at AWS Professional Services. His interests are serverless computing, machine learning, and everything that involves cloud computing.