Job Overview
We’re looking for a GPU Performance Engineer to help teams turn “it runs” into “it scales.” This role is for someone who can move confidently between GPU kernels, distributed training communication, and system-level bottlenecks—then translate findings into practical fixes that improve throughput, latency, and cost.
You’ll focus on performance work across CUDA code paths, NCCL-backed multi-GPU/multi-node communication, and end-to-end profiling. The goal is not theoretical optimization—it’s measurable wins: faster training steps, better GPU utilization, stable scaling efficiency, and clear evidence for what changed and why.
What you’ll actually do
You’ll profile real workloads, form hypotheses, run controlled experiments, and deliver improvements that hold up in production. You’ll work across the stack—from kernel-level tuning and memory behavior to overlap of compute/communication and network-aware scaling.
- Profiling & root cause analysis: use tools like Nsight Systems, Nsight Compute, and other CUDA profiling tools to pinpoint bottlenecks
- Distributed performance: analyze NCCL collectives, topology effects, and scaling limits across nodes
- Optimization delivery: implement and validate changes with benchmarks, regression guards, and clear documentation
How success is measured
Success looks like sustained performance gains that are easy to verify: improved step time, higher effective TFLOPS, better scaling efficiency, fewer performance regressions, and a repeatable methodology the team can keep using after your engagement.
How you’ll work with the team
You’ll collaborate closely with ML engineers, systems/platform teams, and anyone touching the training stack. You’ll be expected to communicate clearly—sharing traces, explaining tradeoffs, and recommending the next highest-leverage change instead of chasing micro-optimizations.








