Job Overview
We’re hiring an HPC Platform Engineer to own the on-prem GPU and HPC platform as an integrated whole. Where our Network, Storage, and Systems engineers go deep in their domains, this role owns the layers that hold the platform together—provisioning, GPU orchestration, and the cross-domain integration work that turns a stack of components into a dependable service.
You’ll work hands-on across bare-metal lifecycle (PXE/iPXE, MAAS, BMC/Redfish), Kubernetes with the NVIDIA GPU Operator (MIG/MPS, device plugins, topology-aware scheduling), the GPU stack itself (drivers, CUDA, NCCL, container runtimes), and the integration glue that makes scheduling, networking, storage, and compute behave predictably under real workloads.
What you’ll focus on
- Provisioning & lifecycle: bare-metal automation, OS imaging, firmware/driver management, and predictable bring-up
- GPU orchestration: Kubernetes with the NVIDIA GPU Operator, MIG/MPS, and workload scheduling
- Cross-domain integration: stitching together scheduling, networking, storage, and compute into a coherent platform
- Operational excellence: upgrades, capacity planning, runbooks, and incident response with measurable improvements
You’ll partner closely with the Network, Storage, and Systems engineers—aligning on architecture, escalating cross-domain issues, and making sure each layer’s behavior contributes to a platform users can trust. Success looks like fast, boring bring-ups, calm upgrades, and a platform that holds together under real GPU workloads—training, fine-tuning, and inference.






