Initial state
This startup is at the forefront of developing phage cocktails and personalized treatments to target and destroy harmful bacteria in chronic diseases. Before our collaboration started, their data pipelines and batch jobs were in their early stages, and infrastructure management was mostly manual and fragmented.
- Infrastructure wasn't defined as code.
- The deployment process was custom and not standardized.
- There was no separation between environments.
- Pipeline orchestration lacked consistency and automation.
- Onboarding new contributors and maintaining the pipelines was difficult.
Project goals
- Build the infrastructure for orchestrating the informatics team's data pipelines.
- Make collaborative development and testing of data pipelines smooth.
- Automate the flow from development to production using CI/CD and GitOps practices with Argo CD.
Decisions
To achieve these goals, we made the following decisions:
- Use Terraform and Terragrunt to provision all of the infrastructure.
- Use Apache Airflow to orchestrate the data pipelines, and deploy it on Kubernetes for scaling flexibility.
- Use Argo CD to continuously deploy the data pipelines and the apps using a GitOps approach.
- Create a CI/CD process and branching strategy to promote changes through the environments in an organized way.
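As a rough illustration of how a branching strategy can map onto GitOps deployments, the sketch below builds one Argo CD `Application` manifest per environment, with each environment tracking its own Git branch. The repository URL, branch names, namespaces, chart path, and the `build_application` helper are all hypothetical placeholders, not the project's actual configuration; only the development/production split comes from this case study.

```python
# Hypothetical mapping of environments to Git branches and namespaces.
ENVIRONMENTS = {
    "development": {"branch": "develop", "namespace": "airflow-dev"},
    "production": {"branch": "main", "namespace": "airflow-prod"},
}

# Placeholder repository URL (assumption, not the real repository).
REPO_URL = "https://example.com/org/data-pipelines.git"


def build_application(env_name, repo_url=REPO_URL):
    """Return an Argo CD Application manifest (as a dict) for one environment."""
    env = ENVIRONMENTS[env_name]
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": f"airflow-{env_name}", "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,
                # Each environment follows its own branch.
                "targetRevision": env["branch"],
                # Hypothetical path to the Helm chart inside the repository.
                "path": "deploy/airflow",
            },
            "destination": {
                # In-cluster API server address; assumes Argo CD runs in the target cluster.
                "server": "https://kubernetes.default.svc",
                "namespace": env["namespace"],
            },
            # Automated sync keeps the cluster aligned with Git.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }
```

Under this assumed layout, merging a change into the production branch is what promotes it: Argo CD notices the new commit on the tracked branch and syncs the cluster to match.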
Strategy
We created reusable Terraform modules and Helm charts for infrastructure and applications, deployed development and production environments, and ensured everything was fully GitOps-managed. Pipelines were developed and tested with dynamic storage solutions, automated syncing, and retry mechanisms to improve resilience.
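The retry mechanisms mentioned above can be pictured with a minimal sketch of retrying a failing step with exponential backoff. This is plain Python for illustration only; the case study does not say how retries were configured, and in Airflow this is typically expressed through the task-level `retries` and `retry_delay` arguments rather than hand-written loops. The `run_with_retries` helper and its parameters are hypothetical.

```python
import time


def run_with_retries(task, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call task(); on failure, retry up to `retries` times with exponential backoff.

    The delay doubles after each failed attempt: base_delay, 2*base_delay, ...
    The `sleep` parameter is injectable so the waiting can be skipped in tests.
    """
    attempt = 0
    while True:
        try:
            return task()
        except Exception:
            if attempt >= retries:
                # Retries exhausted: surface the last error to the caller.
                raise
            sleep(base_delay * (2 ** attempt))
            attempt += 1
```

A step that fails transiently, for example because a storage endpoint is briefly unavailable, would succeed on a later attempt instead of failing the whole pipeline run.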
The process
We started with two workstreams in parallel:
- Created Terraform modules (VPC, EKS, etc.)
- Created a repository for the data pipelines
Then we continued with two more parallel workstreams:
- Created the Kubernetes cluster in AWS and bootstrapped it
- Created a sample data pipeline and tested it on Apache Airflow running on a local Kubernetes cluster
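To give a feel for what a sample pipeline exercises, here is a minimal sketch of the core idea behind a DAG orchestrator: tasks declare their upstream dependencies and are run in an order that respects them. It uses only the Python standard library rather than Airflow itself, and the task names, data, and `run_pipeline` helper are hypothetical; the actual sample pipeline is not described in this case study.

```python
from graphlib import TopologicalSorter


def extract(ctx):
    # Stand-in for reading raw input records.
    ctx["raw"] = [" a ", "b", ""]


def transform(ctx):
    # Trim whitespace and drop empty records.
    ctx["clean"] = [s.strip() for s in ctx["raw"] if s.strip()]


def load(ctx):
    # Stand-in for writing results; records how many rows were "loaded".
    ctx["loaded"] = len(ctx["clean"])


TASKS = {"extract": extract, "transform": transform, "load": load}

# Each task maps to the set of tasks that must finish before it starts.
DEPENDENCIES = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def run_pipeline(tasks, dependencies):
    """Run tasks in dependency order, sharing state through a context dict."""
    context = {}
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name](context)
    return order, context
```

Calling `run_pipeline(TASKS, DEPENDENCIES)` runs the three steps in sequence. An orchestrator such as Airflow adds scheduling, retries, and distributed execution on top of this ordering.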
After that, we used Argo CD to deploy Apache Airflow on the EKS cluster, created a CI/CD process for the data pipelines, and implemented autoscaling on the EKS cluster using Karpenter.
