Scalable Biotech Cloud Infrastructure with Apache Airflow, Kubernetes, AWS, and Terraform

Initial state

This startup is at the forefront of developing phage cocktails and personalized treatments to target and destroy harmful bacteria in chronic diseases. Before our collaboration started, their data pipelines and batch jobs were in early stages, and infrastructure management was mostly manual and fragmented.

Infrastructure wasn't defined as code.
The deployment process was custom and not standardized.
There was no separation between environments.
Pipeline orchestration lacked consistency and automation.
Onboarding new contributors and maintaining the pipelines was difficult.

Tech stack

AWS

Project goals

Build the infrastructure for orchestrating the data-pipelines of the informatics team.
Make collaborative development and testing of data-pipelines smooth.
Automate the flow from development to production using CI/CD and GitOps practices with Argo CD.

Decisions

To achieve the goals, these are the decisions we made:

Use Terraform and Terragrunt to provision all of the infrastructure.
Use Apache Airflow to orchestrate the data-pipelines, and deploy it on Kubernetes for scaling flexibility.
Use Argo CD to continuously deploy the data-pipelines & the apps using a GitOps approach.
Create a CI/CD process & branching strategy to push changes through the environments in an organized way.
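As an illustration of the branching strategy, a CI job can derive its target environment from the branch that triggered it. The sketch below is only an assumption about how such a mapping might look: the branch names `develop` and `main` and the function `target_environment` are hypothetical and do not come from the project.

```python
from typing import Optional

# Hypothetical mapping: the real branch names used in the project are not
# stated in this article.
BRANCH_TO_ENV = {
    "develop": "dev",
    "main": "prod",
}


def target_environment(branch: str) -> Optional[str]:
    """Return the environment a branch deploys to, or None for branches
    (such as feature branches) that should only run tests."""
    return BRANCH_TO_ENV.get(branch)


# A CI step could use the result to decide whether to deploy at all.
for branch in ("develop", "main", "feature/new-dag"):
    env = target_environment(branch)
    action = f"deploy to {env}" if env else "run tests only"
    print(f"{branch}: {action}")
```

Keeping the mapping in one place means that adding an environment later only requires a new entry rather than changes to every pipeline step.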

Strategy

We created reusable Terraform modules and Helm charts for infrastructure and applications, deployed development and production environments, and ensured everything was fully GitOps-managed. Pipelines were developed and tested with dynamic storage solutions, automated syncing, and retry mechanisms to improve resilience.
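To make the retry idea concrete, here is a minimal sketch of retrying a failing step with exponential backoff in plain Python. It is not the project's implementation: the article does not say how retries were configured, and in Airflow this behavior is normally set per task through operator arguments such as `retries` and `retry_delay` rather than hand-written. The names `run_with_retries` and `base_delay` are illustrative.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    task: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `task`; on failure retry up to `retries` more times.

    The wait doubles after each failed attempt:
    base_delay, 2 * base_delay, 4 * base_delay, ...
    The `sleep` function is injectable so the logic can be tested
    without actually waiting.
    """
    attempt = 0
    while True:
        try:
            return task()
        except Exception:
            if attempt >= retries:
                # Retries exhausted: surface the last error to the caller.
                raise
            sleep(base_delay * 2 ** attempt)
            attempt += 1
```

A step that fails intermittently, for example because of a short network outage, would be wrapped as `run_with_retries(step, retries=3, base_delay=30.0)`.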

The process

We did a few things in parallel:

Created Terraform Modules (VPC, EKS, etc.)
Created a repository for the data-pipelines

Then we continued with a few more parallel phases:

Created the Kubernetes cluster in AWS and bootstrapped it
Created a sample data pipeline & tested it locally on Apache Airflow on a local Kubernetes

After that, we used Argo CD to deploy Apache Airflow on the EKS cluster, created a CI/CD process for the data-pipelines, and implemented autoscaling on the EKS cluster using Karpenter.
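For readers unfamiliar with Argo CD, a deployment like this is described by an `Application` resource that points at a Git repository and a target cluster. The sketch below builds such a manifest as a Python dictionary and prints it as JSON, which Kubernetes accepts alongside YAML. The repository URL, the `apps/airflow/<env>` path, the namespaces, and the function name are all hypothetical placeholders; the project's actual layout is not given in this article.

```python
import json


def airflow_application(env: str, repo_url: str, revision: str) -> dict:
    """Build an Argo CD Application manifest for one environment.

    All concrete values (names, paths, namespaces) are illustrative.
    """
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": f"airflow-{env}", "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,          # Git repo holding the Helm chart or manifests
                "targetRevision": revision,   # branch, tag, or commit to track
                "path": f"apps/airflow/{env}",
            },
            "destination": {
                # In-cluster API server address, i.e. the cluster Argo CD runs in
                "server": "https://kubernetes.default.svc",
                "namespace": "airflow",
            },
            "syncPolicy": {
                # Automatically apply Git changes, remove deleted resources,
                # and revert manual drift in the cluster.
                "automated": {"prune": True, "selfHeal": True},
                "syncOptions": ["CreateNamespace=true"],
            },
        },
    }


manifest = airflow_application(
    env="dev",
    repo_url="https://example.com/org/data-platform.git",  # placeholder URL
    revision="develop",
)
print(json.dumps(manifest, indent=2))
```

With automated sync enabled, merging a change to the tracked revision is what triggers the deployment, which is the core of the GitOps approach described above.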

Before & After

Before | After
Manual infrastructure setup | Infrastructure as code
No clear environments | Dedicated environments for dev and prod
Inconsistent DAG orchestration | Automated DAG deployment and testing with GitOps
Debugging pipelines was time-consuming | Improved reliability and developer experience