Love solving gnarly problems in AI infrastructure?
Our client is building the AI Native GPU Cloud, and we need a senior HPC Site Reliability Engineer to keep it humming.
You’ll own the reliability and performance of our cutting-edge NVIDIA-based HPC systems. Think DGX clusters, RoCE topologies, and automation pipelines built in Ansible and Terraform. If that lights you up, read on.

This role is remote (US-based) and offers the chance to shape our infrastructure from the ground up. Expect high-impact work, loads of autonomy, and collaboration with smart folks across architecture, engineering, and ops.
You’ll:
  • Set up and optimize HPC clusters and networks (think DGX, HGX, GPUDirect)
  • Debug low-level networking issues with Cisco, Juniper, and more
  • Automate configs with Ansible and Terraform
  • Monitor everything with Grafana, UFM, ELK, NetQ
  • Own 24/7 reliability, on-call, and root cause analysis
This role is perfect if you:
  • Have 6 years of experience in HPC or networking-heavy roles
  • Know BGP, EVPN, VXLAN, and RDMA inside and out
  • Have SRE experience in high-stakes environments
  • Love solving infra puzzles at scale
Bonus points for CCIE/JNCIS, InfiniBand, or cloud/HPC interconnect experience.
Sound like your kind of challenge? Hit apply and let’s talk.