Focus: Reliability + Automation.

Role Description

The GPU Platform/SRE Engineer will be responsible for building, maintaining, and troubleshooting high-performance GPU-based compute infrastructure. Daily tasks include improving system reliability, managing infrastructure, and proactively identifying performance bottlenecks.

Responsibilities

  • Build monitoring, alerting, and auto-healing.

  • Optimize GPU utilization & scheduling.

  • Enforce SLAs and availability targets.

  • Support multi-tenant isolation & security.


Tech Stack

  • Kubernetes / Slurm

  • Prometheus / Grafana

  • Terraform / Ansible

  • NVIDIA DCGM

Qualifications

  • Proficiency in Site Reliability Engineering and troubleshooting.

  • Experience with scripting, automation, and workflow optimization.


Core Architecture

  • NVIDIA GB200 & H100: Blackwell / Hopper architecture

  • Kubernetes: orchestration layer

  • PyTorch: ML framework

  • Rust / Go: high-performance systems