Getting SageMaker Models to Production
About a year ago, Amazon publicly released SageMaker, its platform for training and managing machine learning models. SageMaker has an appealing value proposition: users can train models and tune hyperparameters on on-demand EC2 instances with a variety of hardware options, and trained models are stored for deployment to endpoints with IAM access control. On my team, we chose SageMaker for managing our training and deployment of a text classification algorithm built with a TensorFlow neural network. Here are some of the things I learned while creating an automated path to production.
Training and Tuning
The SageMaker documentation and tutorials provide simplistic examples as Jupyter notebooks that are manually edited and run on AWS. While this is great as a launching point, the web notebook editor has some ergonomic drawbacks we wanted to avoid: namely, a lack of reviewable changes tracked in git and a slow feedback loop.
The SageMaker Python SDK lets us drive training and hyperparameter tuning from plain Python scripts. The scripts can be run locally, while the SDK runs the computationally expensive training or tuning jobs on EC2 instances. It even provides a very high-level Estimator API for creating TensorFlow-based models.
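As an illustration, training with the TensorFlow Estimator looks roughly like the sketch below. The entry point, role ARN, instance type, and S3 paths are placeholders, and the exact constructor arguments depend on the SDK version you are running.
# train_model.py - illustrative sketch, not our exact training script
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point='train.py',  # local file containing the TensorFlow model code
    role='arn:aws:iam::123456789012:role/myapp-sagemaker-role',
    train_instance_count=1,
    train_instance_type='ml.p2.xlarge',  # GPU instance used only for the training job
    hyperparameters={'learning_rate': 0.001})

# Launches the training job on the instance type above and streams its logs locally
estimator.fit('s3://myapp-dev-bucket/training-data')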
Be aware that the default behavior of the SDK is to wait for each step in your script to complete, and those steps can take on the order of hours for training. You can disable the waiting behavior and design your scripts to be resumed at a later point. We instead found the easiest approach was to set up a t2.micro EC2 instance and work from there, backgrounding processes we wanted to run for a long time.
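If you do go the non-blocking route, it looks roughly like this, reusing the estimator from the sketch above (the training job name is a hypothetical placeholder; check your SDK version for the exact signatures):
# Kick off the training job without blocking the rest of the script
estimator.fit('s3://myapp-dev-bucket/training-data', wait=False)

# Later, possibly from a different process, re-attach to the running job by name
from sagemaker.tensorflow import TensorFlow
estimator = TensorFlow.attach('my-training-job-name')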
Deployments
The simplified workflow we worked toward for our deployments is:
- Developers and data scientists run scripts to train/experiment with any model in our Dev AWS account.
- Once we are happy with the performance of a new model, the model is deployed to QA for smoke testing.
- If the model passes, it is promoted to production.
Once a model is trained, tuned, and evaluated in step (1), we have a single QA endpoint where we can run some simple stability tests. The SageMaker SDK does provide the ability to deploy models to an HTTP endpoint with a simple deploy() call. Unfortunately, deploy() does not support updating existing endpoints, which is exactly our target use case.
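For reference, the simple path looks roughly like this (the endpoint name and instance type are placeholders):
# Stands up a brand-new HTTP endpoint serving the freshly trained model;
# there is no option here to point an existing endpoint at the new model
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.t2.medium',
    endpoint_name='myapp-qa')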
It is possible to work around this restriction relatively easily by creating a new SageMaker EndpointConfiguration with a lower-level API. This worked initially for (2), our QA deployment, but we ran into issues with workflow step (3).
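Within a single account, that workaround boils down to a couple of boto3 calls, roughly as sketched below (the model, config, and endpoint names are placeholders):
import boto3

sagemaker = boto3.client('sagemaker')

# Register a new endpoint configuration pointing at the newly trained model
sagemaker.create_endpoint_config(
    EndpointConfigName='myapp-qa-config-2018-10-01',
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': 'myapp-qa-model-2018-10-01',
        'InitialInstanceCount': 1,
        'InstanceType': 'ml.t2.medium'
    }])

# Swap the existing endpoint over to the new configuration in place
sagemaker.update_endpoint(
    EndpointName='myapp-qa',
    EndpointConfigName='myapp-qa-config-2018-10-01')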
Cross-account issues
Our production environment is isolated in its own AWS account, and the SageMaker SDK makes the implicit assumption in many areas that it is operating within a single AWS account.
When a SageMaker model is trained, a gzipped file of model artifacts is dumped into an S3 bucket to be loaded when the model is deployed. To work around the cross-account restrictions, I created a script using boto3 to:
- Copy the artifacts to a bucket in the new account
- Create a model in the new account with matching configurations
- Create a new endpoint configuration with the new model
- Update the existing endpoint with the new endpoint configuration
Below is the model promotion script. It makes some assumptions about the CloudFormation stack setup in each environment, since we are using CloudFormation for our infrastructure management.
# promote_model.py
import boto3
import botocore
import click
import re
from datetime import datetime


@click.command()
@click.option('--model-name', help='The model that should be promoted')
@click.option(
    '--source-profile',
    default='myapp-qa',
    help='Profile name in your AWS CLI configuration for the deployment source')
@click.option(
    '--target-profile',
    help='Profile name in your AWS CLI configuration for the deployment target')
@click.option(
    '--stage',
    type=click.Choice(['qa', 'production']),
    help='The target stage')
@click.option(
    '--target-stack-name', help='Target environment CloudFormation stack')
def promote_model(model_name, source_profile, target_profile, stage,
                  target_stack_name):
    naming_prefix = 'myapp-{}'.format(stage)
    target_stack_name = target_stack_name or naming_prefix
    target_endpoint_name = naming_prefix

    source_session = boto3.session.Session(profile_name=source_profile)
    target_session = boto3.session.Session(profile_name=target_profile)

    stack_outputs = get_stack_outputs(target_session, target_stack_name)
    destination_bucket = stack_outputs['MyAppBucketName']

    source_model = find_model(source_session.client('sagemaker'), model_name)
    target_sagemaker = target_session.client('sagemaker')

    source_container = source_model['PrimaryContainer']
    source_dir_s3_path = source_container['Environment'][
        'SAGEMAKER_SUBMIT_DIRECTORY']

    # Copy the model artifacts and the submitted source directory into the
    # target account's bucket so they can be loaded from that account.
    target_model_s3_path = copy_s3_object_to_bucket(
        source_container['ModelDataUrl'], destination_bucket, target_session)
    target_dir_s3_path = copy_s3_object_to_bucket(
        source_dir_s3_path, destination_bucket, target_session)

    model_created_timestamp = datetime.utcfromtimestamp(
        source_model['CreationTime'].timestamp()).strftime('%Y-%m-%d-%H-%M')
    new_model_name = '{}-model-{}'.format(naming_prefix,
                                          model_created_timestamp)
    endpoint_config_name = '{}-config-{}'.format(naming_prefix,
                                                 model_created_timestamp)

    # Recreate the model in the target account with a matching configuration,
    # pointing at the copied artifacts.
    if not find_model(target_sagemaker, new_model_name):
        target_sagemaker.create_model(
            ModelName=new_model_name,
            ExecutionRoleArn=stack_outputs['MyAppSagemakerRoleArn'],
            PrimaryContainer={
                **source_container,
                'Environment': {
                    **source_container['Environment'],
                    'SAGEMAKER_SUBMIT_DIRECTORY': target_dir_s3_path
                },
                'ModelDataUrl': target_model_s3_path
            })

    # Create an endpoint configuration for the new model.
    if not find_endpoint_config(target_sagemaker, endpoint_config_name):
        target_sagemaker.create_endpoint_config(
            EndpointConfigName=endpoint_config_name,
            ProductionVariants=[{
                'VariantName': 'AllTraffic',
                'ModelName': new_model_name,
                'InitialInstanceCount': 1,
                'InstanceType': 'ml.t2.medium'
            }])

    # Update the existing endpoint in place, or create it on the first run.
    if find_endpoint(target_sagemaker, target_endpoint_name):
        target_sagemaker.update_endpoint(
            EndpointName=target_endpoint_name,
            EndpointConfigName=endpoint_config_name)
    else:
        target_sagemaker.create_endpoint(
            EndpointName=target_endpoint_name,
            EndpointConfigName=endpoint_config_name)


def get_stack_outputs(session, stack_name):
    cloudformation = session.client('cloudformation')
    result = cloudformation.describe_stacks(StackName=stack_name)
    return {
        output['OutputKey']: output['OutputValue']
        for output in result['Stacks'][0]['Outputs']
    }


def find_model(sagemaker_client, model_name):
    try:
        return sagemaker_client.describe_model(ModelName=model_name)
    except botocore.exceptions.ClientError:
        return None


def find_endpoint(sagemaker_client, endpoint_name):
    try:
        return sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    except botocore.exceptions.ClientError:
        return None


def find_endpoint_config(sagemaker_client, endpoint_config_name):
    try:
        return sagemaker_client.describe_endpoint_config(
            EndpointConfigName=endpoint_config_name)
    except botocore.exceptions.ClientError:
        return None


def copy_s3_object_to_bucket(s3_source, bucket, session):
    s3_destination = re.sub('^s3://.+?/', 's3://{}/'.format(bucket), s3_source)
    copy_s3_object(s3_source, s3_destination, session)
    return s3_destination


def copy_s3_object(s3_source, s3_destination, session):
    """
    Copies a file between `s3://bucket/path/file.ext` locations
    """

    def _extract_s3_objects(s3_path):
        match = re.match('^s3://(.+?)/(.+)', s3_path)
        return match.groups()

    source_bucket, source_key = _extract_s3_objects(s3_source)
    destination_bucket, destination_key = _extract_s3_objects(s3_destination)
    return session.client('s3').copy_object(
        CopySource={
            'Bucket': source_bucket,
            'Key': source_key
        },
        Bucket=destination_bucket,
        Key=destination_key)


if __name__ == '__main__':
    promote_model()
The script can be run with
python promote_model.py \
--stage=qa \
--source-profile=source-aws-profile \
--target-profile=qa-aws-profile \
--model-name=sagemaker-tensorflow-2018-10-01-01-15-23-007
This will create a model called myapp-qa-model-2018-10-01-15-23-00 and deploy it to the myapp-qa endpoint in the target account. Each resource is created only if it doesn't already exist, so the script can be rerun safely.
Going forward
We are now at a point where we can deploy evaluated models to our production environment. There are still many things to improve, but our next focus is exploring the best approaches to retraining our SageMaker models with user feedback.