Rolling back AWS Elastic Container Service (ECS) Deployments

Chris Wood

5 years ago

What is ECS?

AWS Elastic Container Service (ECS) is a fast and popular way to orchestrate containerised applications in AWS’s cloud computing platform. ECS comes with autoscaling baked in and natively runs containers on AWS’s Fargate serverless compute engine. What makes ECS particularly nice to use is that it abstracts away much of the operations management that comes along with rolling your own deployment, either on-site or on cloud-based computing, such as AWS EC2.

This article assumes you know the basics of ECS terminology and architecture. If you don’t, it’s one of the many fantastic ways to deploy applications on AWS, and I recommend you check out the AWS ECS docs.

Where is the Deployment Rollback Command?

While refactoring the deployment scripts for a Resolver project managed using ECS, I noticed something was missing in the AWS ECS CLI and UI console: a rollback deployment command… 😱

But why is it missing when other services, e.g. Heroku, have these commands? It turns out there is no standard rollback command in ECS because of how ECS handles deployments. New deployments occur when:

a new service is created with a task definition
a service’s task definition is updated
a service re-deployment is forced, such as when it is updated using the aws ecs update-service command with the --force-new-deployment option passed

When a new deployment is triggered, the ECS scheduler starts new containers using the service’s task definition and stops old containers running previous versions, moving traffic to the new containers as they become healthy.
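While a deployment is in progress, the service reports both the new deployment (status PRIMARY) and the old one (status ACTIVE), which makes this hand-over visible. Here is a minimal sketch, assuming the AWS CLI is configured for the right profile and region; show_deployments is a hypothetical helper name, and the cluster and service names are placeholders:

```shell
# Hypothetical helper: list a service's deployments, one per line, with
# status, task definition, running task count, and desired task count.
show_deployments() (
  cluster="$1"
  service="$2"
  aws ecs describe-services \
    --cluster "$cluster" \
    --services "$service" \
    --query 'services[0].deployments[].[status, taskDefinition, runningCount, desiredCount]' \
    --output text
)
```

Calling show_deployments my-cluster my-service repeatedly during a deployment should show the running count of the old deployment falling as the new one's rises.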

Because new containers are created from the container image named in the task definition, deployments are very flexible. The image tag can be set to an arbitrary value, so the tag might not change between deployments even though the actual image pulled by ECS does, which makes a generic rollback command non-trivial to implement.

For example, in the project I was refactoring, the image being pulled was tagged as the latest image from a related Elastic Container Registry (ECR) repository. While pulling the latest image made infrastructure setup and deployments relatively straightforward (new images could be pushed to ECR and the service force re-deployed), it made rollbacks rather tricky. Some of the ways rollbacks were previously performed were:

deleting the latest image, retagging an old image as latest, and force re-deploying
reverting the git commits on GitHub and deploying the new image

Neither approach was quick, gave a good indication of which version of the app was deployed, or made it easy to return to the original deployment afterwards.
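To see why a mutable tag says so little about what is running, compare the digest that latest resolves to before and after a push: the tag stays the same while the digest changes. Here is a hedged sketch; image_digest_for_tag is a hypothetical helper, and its first argument is the ECR repository name (not the full repository URL):

```shell
# Hypothetical helper: print the image digest a tag currently points at.
# $1 = ECR repository name, $2 = tag (defaults to "latest").
image_digest_for_tag() (
  repository="$1"
  tag="${2:-latest}"
  aws ecr describe-images \
    --repository-name "$repository" \
    --image-ids "imageTag=$tag" \
    --query 'imageDetails[0].imageDigest' \
    --output text
)
```

Running image_digest_for_tag hello-world before and after a deploy that pushes latest should print two different sha256 digests for the same tag.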

Deploying

To roll back effectively, you must first deploy the application correctly, which means building and tagging new container images suitably.

To start, container images should be tagged with a unique identifier when they are built and pushed to the repository. Typically, the SHA-1 hash of the latest git commit is used, as it clearly indicates the state of the codebase captured in an image. The git hash can also be used to fetch commit messages for a more human-readable representation of where rollbacks will revert to.
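As a rough illustration of the last point, the tag can be read back off an image reference and handed to git log. Both function names here are hypothetical, and the sketch assumes it runs inside the application's git repository and that images are tagged with a commit hash:

```shell
# Extract the tag from an image reference such as
# 123456789012.dkr.ecr.eu-west-1.amazonaws.com/hello-world:1a2b3c4
image_tag_from_uri() (
  name="${1##*/}"                     # drop registry host and repository path
  case "$name" in
    *@*) exit 1 ;;                    # digest reference, no tag to extract
    *:*) printf '%s\n' "${name##*:}" ;;
    *)   exit 1 ;;                    # untagged reference
  esac
)

# Print "<short hash> <subject>" for the commit an image was built from.
commit_summary_for_image() (
  tag=$(image_tag_from_uri "$1") || exit 1
  git log -1 --format='%h %s' "$tag"
)
```

A tag that is not a commit hash, such as latest, will simply make git log fail, which is another reason to deploy the uniquely tagged image.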

For a sample hello world ECS application (full codebase can be seen here), an application image can be built, tagged with a unique ID, and pushed to a remote repository:

# ./bin/deploy
# …

ecr_repo="<your ecr repository url>"
unique_image_tag=$(git rev-parse --short HEAD)

docker build \
  --tag "$ecr_repo:$unique_image_tag" \
  --tag "$ecr_repo:latest" \
  .

aws ecr get-login-password --profile ecs-rollback-hello-world | \
  docker login --username AWS --password-stdin "$ecr_repo"

docker push "$ecr_repo:$unique_image_tag"
docker push "$ecr_repo:latest"

Once the updated images are pushed, the task definition can be updated with the new tag. Since task definitions are versioned, the services are then updated with the new revision number, triggering a deployment of the new tasks:

# ./bin/deploy
# …

# fetch current task definition
current_task_definition=$(
  aws ecs describe-task-definition \
    --task-definition "$task_definition_family" \
    --query '{  containerDefinitions: taskDefinition.containerDefinitions,
                family: taskDefinition.family,
                executionRoleArn: taskDefinition.executionRoleArn,
                networkMode: taskDefinition.networkMode,
                volumes: taskDefinition.volumes,
                placementConstraints: taskDefinition.placementConstraints,
                requiresCompatibilities: taskDefinition.requiresCompatibilities,
                cpu: taskDefinition.cpu,
                memory: taskDefinition.memory }'
)
current_task_definition_revision=$(
  aws ecs describe-task-definition --task-definition "$task_definition_family" \
                                   --query 'taskDefinition.revision'
)

# compare current and updated image tags
# -r makes jq print the raw string, so it can be compared without extra quotes
current_container_image="$(echo "$current_task_definition" | jq -r '.containerDefinitions[0].image')"
updated_container_image="$ecr_repo:$unique_image_tag"

if [[ "$current_container_image" == "$updated_container_image" ]]; then
  echo "Container image '$unique_image_tag' already defined in the latest task definition revision: $task_definition_family:$current_task_definition_revision"
  read -p "Are you sure you want to deploy? [y/N] " -n 1 -r
  echo
  if [[ ! $REPLY =~ ^[Yy]$ ]]; then
    exit 1
  fi
fi

# inject new image tag into task definition and update
updated_task_definition=$(
  echo "$current_task_definition" | jq --arg CONTAINER_IMAGE "$updated_container_image" '.containerDefinitions[0].image = $CONTAINER_IMAGE'
)
updated_task_definition_info=$(aws ecs register-task-definition --cli-input-json "$updated_task_definition")

# update service with new task definition revision
updated_task_definition_revision=$(echo "$updated_task_definition_info" | jq '.taskDefinition.revision')
aws ecs update-service --cluster "$ecs_cluster_name" \
                       --service "$ecs_service_name" \
                       --task-definition "$task_definition_family:$updated_task_definition_revision" \
                       >/dev/null

Rolling Back

Now that new deployments update task definitions with unique image tags, a rollback is a case of updating a service with a previous revision number and rolling back any data migrations or schema changes. Changing a service’s task definition revision number can be done in various ways:

AWS ECS UI Console

The UI console has an interface to update services:

First, visit the service page and click Edit service:

Change task definition revision number to a previous version.

Click Update to update the service and re-deploy.

N.B.: The above screenshots are taken from ‘New ECS Experience’. Similar functionality exists in the legacy UI console.

Performed manually using the CLI

With project-specific variables:

# list task definitions
aws ecs list-task-definitions \
  --family-prefix "$task_definition_family" \
  --query taskDefinitionArns \
  --sort DESC

# update service with previous task definition arn
aws ecs update-service \
  --cluster "$ecs_cluster_name" \
  --service "$ecs_service_name" \
  --task-definition "$task_definition_arn"
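If the revision to return to is simply the one before the service's current revision, the two commands above can be combined. This is only a sketch under that assumption: the previous revision number is not necessarily the previously deployed one, and it may have been deregistered. Both function names are hypothetical:

```shell
# Given "family:7" or a full task definition ARN ending in ":7",
# print the same identifier with the revision decremented.
previous_task_definition() (
  current="$1"
  revision="${current##*:}"
  case "$revision" in
    ''|*[!0-9]*) exit 1 ;;            # no numeric revision suffix
  esac
  [ "$revision" -gt 1 ] || exit 1     # revision 1 has no predecessor
  printf '%s:%s\n' "${current%:*}" "$((revision - 1))"
)

# Point a service at the revision before the one it currently uses.
rollback_one_revision() (
  cluster="$1"
  service="$2"
  current=$(aws ecs describe-services \
    --cluster "$cluster" --services "$service" \
    --query 'services[0].taskDefinition' --output text) || exit 1
  previous=$(previous_task_definition "$current") || exit 1
  aws ecs update-service \
    --cluster "$cluster" --service "$service" \
    --task-definition "$previous" >/dev/null || exit 1
  printf 'Rolled back %s to %s\n' "$service" "$previous"
)
```

For example, rollback_one_revision "$ecs_cluster_name" "$ecs_service_name" would move a service running revision 7 back to revision 6.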

Scripted

Rolling back can be scripted to reduce human error. For example, take a look at the script I made for manually rolling back deploys, which lets the user:

Select a cluster in the current AWS profile’s region.
Select one or more services from the selected cluster.
Select a task definition to roll back to from a list of the last n task definitions for each selected service, shown along with their git commit messages.
Update each service with the chosen task definition and re-deploy, rolling back the application state.

aws_ecs_rollback

Run it in the sample ECS application using ./bin/rollback_dialog
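The listing step of such a script could look roughly like the following. This is not the code from my script, just a sketch of the idea with hypothetical function names. Note that --family-prefix matches any family starting with the given string, so pass a prefix that only matches your service's family:

```shell
# Print the newest $2 task definition ARNs (default 5) for family prefix $1,
# newest first, one per line.
recent_task_definitions() (
  family="$1"
  count="${2:-5}"
  aws ecs list-task-definitions \
    --family-prefix "$family" \
    --sort DESC \
    --query 'taskDefinitionArns[]' \
    --output text |
    tr '\t' '\n' |
    head -n "$count"
)

# Print "<task definition ARN> <image>" for each recent revision, so the
# image tag (the git hash) is visible when choosing a rollback target.
list_rollback_candidates() (
  recent_task_definitions "$1" "$2" | while read -r arn; do
    image=$(aws ecs describe-task-definition \
      --task-definition "$arn" \
      --query 'taskDefinition.containerDefinitions[0].image' \
      --output text)
    printf '%s %s\n' "$arn" "$image"
  done
)
```

Running list_rollback_candidates "$task_definition_family" 3 prints the three most recent revisions next to the image each one deploys.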

Conclusion

While there are various ways to roll back ECS deployments using integrated tooling such as AWS’s CodeDeploy, a process for rolling back manually is crucial if you are working on a project without a nicely configured CI/CD pipeline, at least in the interim while the project is refactored.

Hopefully, this article helps you set one up for your project, so you can deploy new application revisions without the worry of a slow rollback in the rare case something goes wrong.

This blog is taken from Chris’s personal blog on October 11th 2021. The original post can be seen here.