Job Overview
As an HPC Network Engineer, you’ll own the design, tuning, and operational reliability of high-performance network fabrics where microseconds matter. This role is for someone who can translate application and cluster demands into stable, measurable network performance—then keep it that way under real workloads.
You’ll work hands-on with InfiniBand and Ethernet-based RDMA (RoCE) environments, focusing on lossless behavior, congestion control, and end-to-end latency. The work spans architecture decisions, build and validation, and pragmatic troubleshooting across switches, HCAs/NICs, cabling/optics, and host configuration. You’ll be expected to use data—telemetry, counters, packet captures, and benchmarks—to pinpoint bottlenecks and drive improvements.
What you’ll focus on
- Low-latency fabric design: topology selection, oversubscription tradeoffs, and scalable growth planning
- RDMA performance engineering: MTU, PFC/ECN, QoS, congestion control, and kernel/driver tuning
- Operational excellence: repeatable build standards, change control, upgrades, and incident response with clear root-cause analysis
- Validation and observability: performance baselines, regression testing, and actionable monitoring for fabric health
You’ll collaborate closely with HPC/cluster engineers and application teams to align fabric behavior with real job profiles (MPI collectives, storage traffic, east-west patterns). Success looks like predictable latency, high utilization without instability, and faster time-to-diagnosis when issues occur—documented in runbooks and reflected in measurable improvements over time.






