

GPU Platform / SRE Engineer
Engineering
Boston / On-Site
Build the automation that keeps our grid alive. Kubernetes, Slurm, Prometheus, and NVIDIA DCGM. Focus on reliability and auto-healing.


Focus: Reliability + Automation.
Role Description
The GPU Platform/SRE Engineer builds, maintains, and troubleshoots high-performance GPU compute infrastructure. Day-to-day work includes improving system reliability, managing infrastructure, and proactively identifying performance bottlenecks.
Responsibilities
Build monitoring, alerting, and auto-healing.
Optimize GPU utilization & scheduling.
Enforce SLAs and availability targets.
Support multi-tenant isolation & security.
Tech Stack
Kubernetes / Slurm
Prometheus / Grafana
Terraform / Ansible
NVIDIA DCGM
Qualifications
Proficiency in site reliability engineering and troubleshooting.
Experience with scripting, automation, and workflow optimization.
SIGNAL: OUTLIER
We are constantly scanning for 10x engineers. If you don't fit a standard role description but can optimize GB300 clusters or architect low-latency fabrics, initiate contact immediately.
CORE ARCHITECTURE

NVIDIA GB200 & H100
Blackwell / Hopper Architecture

Kubernetes
Orchestration Layer

PyTorch
ML Framework

Rust / Go
High-Performance Systems