Job Overview
We’re hiring an HPC Systems Engineer to own the reliability, performance, and day-to-day operability of high-performance computing environments. This role is for someone who’s comfortable deep in Linux, understands how researchers and engineers actually use clusters, and can translate that into stable scheduling, predictable throughput, and clean automation.
You’ll focus on building and operating Slurm-based clusters end-to-end: provisioning and configuration, user onboarding, queue and partition design, fair-share policies, and troubleshooting everything from “why is my node down?” to “why is this MPI run stalling?”. You’ll also help standardize how environments are delivered (modules, containers, images, and repeatable configuration) so the cluster stays maintainable as demand grows.
What success looks like
Clusters are available, observable, and predictable: users can submit jobs with confidence, scheduling behavior matches policy, and incidents are resolved quickly with clear root cause and follow-up improvements. You’ll reduce toil through automation, keep upgrades and changes low-risk, and ensure capacity is used efficiently without sacrificing fairness.
How you’ll work
You’ll collaborate closely with infrastructure and platform stakeholders, partnering with researchers, data/ML teams, and application owners to understand workload patterns and remove bottlenecks. Expect hands-on work, pragmatic trade-offs, and a strong bias toward documentation and operational clarity.
- Slurm operations: partitions, QoS, accounting, fair-share, and troubleshooting
- Linux cluster engineering: provisioning, configuration management, patching, and hardening
- Performance & reliability: monitoring, capacity planning, and incident response






