Improve and simplify AWS and Kubernetes infrastructure management

Biotech · Life Science

Initial state

Erisyon is a life science tools company developing a single-molecule protein sequencer.

They are commercializing the world’s first single-molecule protein sequencer, which promises to transform the way we detect, treat, and track disease.

Their state when they met us:

Two Kubernetes clusters provisioned by Pulumi (development & production)
Parts of the system not imported into Pulumi
No pipeline to execute and review Pulumi infrastructure changes
Complex, hard-to-read JSON files for Helm charts applied by Pulumi
No staging cluster to test high-risk infrastructure changes without affecting ongoing development
No strategy for tracking and regularly upgrading all platform components

Managing the infrastructure became complex and risky.

Tech stack

Pulumi

Kubernetes

AWS

Project goals

Make it easier and safer to provision and manage infrastructure without expanding the software team

Decisions

To achieve the goals, we made several decisions:

Use GitHub Actions to run “Pulumi Preview” and validate its output, with a manual approval step before “Pulumi Up” applies infrastructure changes
Create a STAGING environment (cluster) and deploy test instances of the components running on the development and production clusters
Use the new environment to test both infrastructure and application changes before modifying production
Use ArgoCD for Helm chart deployment: Pulumi is limited to managing AWS cloud resources, while ArgoCD takes over Helm chart deployment
Use Renovate to track new Helm chart versions and manage the upgrades automatically by creating Pull Requests
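
As an illustration of the first decision, validating preview output before approval, here is a minimal Python sketch. It assumes the preview is produced with pulumi preview --json and that the document contains a top-level "changeSummary" object mapping operation names to counts. The function names, the set of operations treated as destructive, and the issue text are hypothetical and not taken from the project.

```python
import json

# Operations flagged for extra attention in this sketch (an assumption, not project policy).
DESTRUCTIVE_OPS = {"delete", "replace"}


def summarize_preview(preview_json: str) -> dict:
    """Summarize a preview document.

    Assumes a top-level "changeSummary" object that maps operation names
    (for example "create", "update", "delete", "same") to resource counts.
    """
    summary = json.loads(preview_json).get("changeSummary", {})
    # Ignore unchanged resources and zero counts.
    changes = {op: n for op, n in summary.items() if op != "same" and n > 0}
    return {
        "changes": changes,
        "has_changes": bool(changes),
        "needs_careful_review": any(op in DESTRUCTIVE_OPS for op in changes),
    }


def issue_body(result: dict) -> str:
    """Render a short plain-text body for a review issue (hypothetical format)."""
    if not result["has_changes"]:
        return "No infrastructure changes detected."
    lines = [f"- {op}: {n}" for op, n in sorted(result["changes"].items())]
    if result["needs_careful_review"]:
        lines.append("Destructive operations present; review carefully.")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = json.dumps({"changeSummary": {"create": 2, "delete": 1, "same": 10}})
    print(issue_body(summarize_preview(sample)))
```

A workflow step could feed the preview output into a script like this and post the result for reviewers, keeping the manual approval itself outside the script.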

Restrictions

Strict dependencies between the Pulumi projects require changes to be applied across the infrastructure projects in a specific order.
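
One way to make such an ordering explicit is a topological sort over the project dependencies. The sketch below uses Python's standard graphlib module; the project names and their dependencies are invented for illustration and do not describe Erisyon's actual projects.

```python
from graphlib import TopologicalSorter

# Hypothetical projects: each key depends on the projects in its set.
DEPENDENCIES = {
    "network": set(),
    "eks-cluster": {"network"},
    "node-groups": {"eks-cluster"},
    "argocd": {"eks-cluster"},
}


def apply_order(dependencies: dict) -> list:
    """Return project names so that every project comes after its dependencies.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for project in apply_order(DEPENDENCIES):
        print(project)
```

A pipeline could iterate over this list and run the preview and apply steps for one project at a time, stopping at the first failure.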

Strategy

Safeguard changes with Pulumi previews in GitHub Actions, manual approvals, and a staging cluster that gates promotion to development and production.
Automate upkeep through CI pipelines, on-demand workflows for the staging EKS nodes, and Renovate Pull Requests that track Helm chart upgrades.

The process

The process of transforming Erisyon's infrastructure was methodical and detailed:

Create the following GitHub Actions workflows:
Run “Pulumi Preview” in a Pull Request and review the output before merging a change. Continue working on the change until it’s ready.
Run “Pulumi Preview” on a merge to the main branch and create a GitHub Issue with the change details for review.
Run “Pulumi Up” when the issue (Pulumi infrastructure change) is reviewed and approved.
Deploy ArgoCD to each cluster via Pulumi and migrate all application and platform Helm chart deployments to ArgoCD (leverage ArgoCD ApplicationSets).
Create the STAGING cluster and update the ArgoCD manifests to deploy application and platform services to it. Create GitHub Actions workflows that provision and terminate the EKS node infrastructure to save cost (staging cluster services can be started within minutes when needed).
Set up the Renovate configuration and dashboard to track new Helm chart versions and automatically create Pull Requests for them (with detailed changelogs for review).
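
To show what the ArgoCD ApplicationSets step can look like, here is a rough Python sketch that builds an ApplicationSet manifest with a list generator, one element per Helm chart. The manifest field names follow the ApplicationSet resource format. The chart name, repository URL, version, namespaces, and the helper function itself are placeholders invented for this example.

```python
import json


def helm_applicationset(name: str, charts: list, cluster_server: str) -> dict:
    """Build an ApplicationSet manifest that templates one Application per chart.

    Each element of `charts` supplies the parameters referenced in the template:
    name, repoURL, chart, version, and namespace.
    """
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "ApplicationSet",
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "generators": [{"list": {"elements": charts}}],
            "template": {
                "metadata": {"name": "{{name}}"},
                "spec": {
                    "project": "default",
                    "source": {
                        "repoURL": "{{repoURL}}",
                        "chart": "{{chart}}",
                        "targetRevision": "{{version}}",
                    },
                    "destination": {
                        "server": cluster_server,
                        "namespace": "{{namespace}}",
                    },
                },
            },
        },
    }


if __name__ == "__main__":
    # Placeholder chart entry; pinned versions are the values an upgrade Pull Request would change.
    charts = [
        {
            "name": "demo-chart",
            "repoURL": "https://charts.example.com",
            "chart": "demo-chart",
            "version": "1.2.3",
            "namespace": "demo",
        }
    ]
    manifest = helm_applicationset("platform-charts", charts, "https://kubernetes.default.svc")
    # JSON is valid YAML, so this output can be committed as a manifest.
    print(json.dumps(manifest, indent=2))
```

Keeping chart versions as plain pinned values in one list makes each upgrade a small, reviewable diff.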

Results

The entire AWS infrastructure is managed using Pulumi, which runs from GitHub Actions
Helm chart deployments are managed by ArgoCD (applied on a Pull Request merge)
There is a new STAGING environment to test both infrastructure and application changes before modifying development and production
There is an automated process in place to track new Helm chart versions and raise Pull Requests to upgrade

Worth mentioning:

We did other things with Erisyon as well, such as reducing overall infrastructure cost, improving monitoring, streamlining the Kubernetes clusters' upgrades, handling Kubernetes deprecations, etc.

Before & After

Before: Manual management of infrastructure changes via Pulumi Preview and Up commands executed from a developer machine
After: Automated infrastructure management with Pulumi and GitHub Actions

Before: Complex JSON files for Helm charts deployed by Pulumi
After: Easy management and deployment of Helm charts using a GitOps approach with ArgoCD

Before: Risky infrastructure upgrades on development and production clusters
After: Staging environment to test both infrastructure and application changes