HPC Platform Engineer

Apply to join our team and get matched with an HPC Platform Engineer project

Upfeat
Rockwell Automation
Iota Biosciences
D-ID
Cuma Financial
Gefen Technologies
CodeMonkey
BitWise MnM
Surpass
UnitySCM
WisePatient
Skyline Robotics
WiseCommerce
Optival
I / 001

Job Overview

We’re hiring an HPC Platform Engineer to own the on-prem GPU and HPC platform as an integrated whole. Where our Network, Storage, and Systems engineers go deep in their domains, this role owns the layers that hold the platform together—provisioning, GPU orchestration, and the cross-domain integration work that turns a stack of components into a dependable service.

You’ll work hands-on across bare-metal lifecycle (PXE/iPXE, MAAS, BMC/Redfish), Kubernetes with the NVIDIA GPU Operator (MIG/MPS, device plugins, topology-aware scheduling), the GPU stack itself (drivers, CUDA, NCCL, container runtimes), and the integration glue that makes scheduling, networking, storage, and compute behave predictably under real workloads.

What you’ll focus on

  • Provisioning & lifecycle: bare-metal automation, OS imaging, firmware/driver management, and predictable bring-up
  • GPU orchestration: Kubernetes with the NVIDIA GPU Operator, MIG/MPS, and workload scheduling
  • Cross-domain integration: stitching together scheduling, networking, storage, and compute into a coherent platform
  • Operational excellence: upgrades, capacity planning, runbooks, and incident response with measurable improvements

You’ll partner closely with the Network, Storage, and Systems engineers—aligning on architecture, escalating cross-domain issues, and making sure each layer’s behavior contributes to a platform users can trust. Success looks like fast, boring bring-ups, calm upgrades, and a platform that holds together under real GPU workloads—training, fine-tuning, and inference.

I / 002

Job Responsibilities

Success in this role means the on-prem HPC and GPU platform is delivered as a coherent, dependable service—across provisioning, scheduling, networking, storage, and GPU operations—rather than a stack of disconnected components.

  1. Keep the platform available and predictable for compute-intensive workloads at scale.
  2. Reduce operational toil through automation, repeatable processes, and clear standards.
  3. Shorten time-to-resolution for incidents with strong observability and disciplined root-cause analysis.
  4. Ship changes safely with tested procedures, validation, and clean rollback plans.

From day one

Get hands-on with the existing environment: review provisioning, scheduler configuration, GPU stack versions, networking, and storage. Validate observability and incident history, then prioritize the highest-impact reliability and automation work.

What you’ll own

  • Bare-metal provisioning and lifecycle: PXE/iPXE, MAAS or equivalent, golden images, and BMC/Redfish-based automation across heterogeneous hardware.
  • Linux cluster operations: OS, kernel, drivers, systemd, security hardening, and configuration management at scale.
  • Scheduling and orchestration: Slurm (partitions/QoS, fairshare, accounting) and/or Kubernetes (GPU Operator, device plugins, MIG/MPS, topology-aware scheduling).
  • GPU stack health: firmware, CUDA, NVIDIA Container Toolkit, DCGM telemetry, NCCL validation, and known-good state across the fleet.
  • High-performance networking awareness: InfiniBand/RoCE behavior, MTU/PFC/ECN/QoS impact on workloads, and partnering with network engineers on fabric design.
  • Storage integration: parallel filesystems (Lustre/GPFS/BeeGFS) and shared filesystems tuned for HPC I/O patterns.
  • Observability: metrics, logs, alerts, and SLOs tied to availability, utilization, job throughput, and time-to-provision.
  • Upgrades and migrations (Kubernetes, Slurm, OS, drivers, firmware) with tested rollbacks and minimal user impact.
  • Documentation: architecture decisions, runbooks, change records, and post-incident reviews with concrete follow-ups.

I / 003

Job Requirements

Requirements

  • 5+ years operating production Linux infrastructure at scale, including hands-on HPC, GPU, or performance-sensitive environments.
  • Demonstrated breadth: comfortable working across provisioning, scheduling, networking, storage, and GPU operations rather than a single silo.
  • Strong Linux fundamentals: systemd, networking, storage, kernel/driver troubleshooting, and performance debugging.
  • Comfortable with on-call rotations, change windows, and disciplined incident response.

Technical

  • Bare-metal provisioning and lifecycle automation: PXE/iPXE, MAAS or similar, image build pipelines, BMC/IPMI/Redfish, firmware/driver management.
  • Working experience with at least one HPC scheduler (e.g., Slurm) and/or Kubernetes with GPU workloads (NVIDIA GPU Operator, device plugins, MIG/MPS).
  • GPU operations: drivers, CUDA, NVIDIA Container Toolkit, DCGM-based observability, and NCCL validation/troubleshooting.
  • High-performance networking awareness: InfiniBand and/or RoCE fundamentals and how fabric behavior affects real workloads.
  • Parallel storage exposure (e.g., Lustre, GPFS, BeeGFS) and how I/O patterns interact with compute performance.
  • Infrastructure-as-code and configuration management (Ansible, Terraform) plus scripting/automation in Python and/or Go.

Experience

  • Operating GPU or HPC clusters supporting real workloads (training, fine-tuning, inference, MPI/OpenMP, scientific computing).
  • Driving upgrades and migrations safely, with measurable outcomes and clear stakeholder communication.
  • Building automation that makes the platform easier to operate over time—not just one-off scripts.
  • Fluent English for documentation, change planning, and cross-team coordination.

Bonus Points

  • Experience designing or operating multi-tenant GPU platforms or GPU-as-a-Service environments.
  • Familiarity with hybrid Slurm + Kubernetes patterns and converged HPC/AI workflows.
  • Low-level diagnostics across NUMA, IRQ affinity, PCIe topology, and NVIDIA tools (nvidia-smi, nvbandwidth, DCGM, NCCL tests).
  • Contributions to open-source HPC, Kubernetes, or GPU tooling.

End to end

Application Process

  1. Apply: Submit your CV and your LinkedIn and GitHub links via the form. We'll review your profile.
  2. Screening: If your skills align, we'll reach out for a quick conversation to understand your experience and project preferences.
  3. Get Matched: Once selected, we'll match you with a client project that fits your expertise. A brief onboarding ensures you're set up with our tools and ready to start.

Ready to join

Apply for this role

Upload your CV and links, and we'll get back to you after reviewing your profile.
