Job Overview
MeteorOps is looking for a freelance Kafka troubleshooting & modernization specialist to step into an older on-prem Kafka environment that currently has no dedicated Kafka owner and limited observability. The cluster carries real-time market quote / HFT tick data at very high throughput (potentially millions of messages/sec) and feeds downstream systems, including downsampling services and a SQL Server writer, that ultimately support trading execution workflows.
The Kafka setup is 6–7 years old, deployed on VMware on-prem VMs with 10 Kafka brokers and 5 ZooKeeper nodes, running Kafka 3.0.0 (the Scala 2.13 build, kafka_2.13-3.0.0). Each broker has multiple data disks (currently stated as 7 disks of ~1TB each; prior notes mention higher disk counts, so part of the engagement will be to verify the actual layout). Historically, disk usage has sat at around 10%, but recently one or more brokers spiked toward 100%, coinciding with application-side Kafka errors and broker/topic instability (e.g., missing leader, invalid partition, impaired topic failover).
You’ll diagnose the incident and underlying risks, produce a clear findings + recommendations report, and help the engineering team implement pragmatic improvements: monitoring, tooling, operational runbooks, resilience/failover improvements, and an assessment of upgrade options (including a path away from ZooKeeper).






