Job Description

Senior / Staff Site Reliability Engineer – Observability & Telemetry Systems

About The Role

We’re seeking an accomplished Site Reliability Engineer with deep expertise in large-scale observability systems to help shape and operate the monitoring backbone of a global AI cloud platform. You’ll design, build, and maintain the telemetry infrastructure that ensures the performance, reliability, and visibility of systems powering advanced machine learning and high-performance computing workloads around the world.

In this role, you’ll be the technical authority driving how metrics, logs, and traces are captured, processed, and visualized across a massive distributed environment. From optimizing cost efficiency at scale to ensuring rapid root-cause analysis during incidents, you’ll build the observability systems that keep mission-critical AI workloads running smoothly and predictably.

What You’ll Do

- Architect large-scale observability systems: Design and operate telemetry pipelines for metrics, logs, and traces using modern observability stacks (Prometheus, Mimir, Loki, Tempo, Grafana) at petabyte scale.
- Ensure reliability and efficiency: Tune distributed telemetry systems for performance, cardinality control, and cost optimization while maintaining high availability across global deployments.
- Empower debugging and insight: Build tools and frameworks that give developers deep visibility into distributed ML training, inference pipelines, and infrastructure performance.
- Collaborate cross-functionally: Partner with platform, SRE, and infrastructure teams to extend observability coverage for Kubernetes clusters, Slurm schedulers, and GPU-based compute environments.
- Operational excellence: Establish SLOs, alerting policies, and observability standards that reduce noise, streamline incident response, and strengthen reliability culture across teams.
- Automate at scale: Develop clean, maintainable code in Go, Python, or Bash to extend observability tooling and automate operational workflows.

Who You Are

- 7+ years of total engineering experience, including at least 3 years building or operating large-scale observability or telemetry infrastructure (100M+ metric series, 10TB+/day logs).
- Proven expertise with the Grafana ecosystem — Prometheus, Mimir, Loki, Tempo, Grafana, and Alertmanager — in production environments.
- Hands-on proficiency with Kubernetes, including Helm, Kustomize, custom CRDs, and multi-cluster federation.
- Experience with Terraform (or Pulumi) and Infrastructure-as-Code best practices for hybrid or bare-metal provisioning.
- Strong programming ability in Go (preferred), with additional experience in Python or Bash for automation, data collection, and controller development.
- Deep knowledge of Linux internals — cgroups, namespaces, networking, and filesystem performance — plus foundational TCP/IP and TLS expertise.
- Experience defining and enforcing SLOs, SLIs, and alerting mechanisms that align engineering focus with real user impact.
- Calm and methodical under pressure — you’ve led incident response efforts, authored postmortems, and driven systemic improvements afterward.
- Communicative and collaborative — able to explain complex systems clearly and influence peers in dynamic, cross-functional environments.

Preferred Experience

- Instrumentation of GPU-heavy or HPC clusters (NVIDIA A-/H-series, NVSwitch, DGX, RoCE, RDMA).
- Observability for distributed ML workloads managed by Slurm, Ray, or Kubernetes-native batch schedulers.
- Hands-on experience with eBPF, Cilium, or Hubble for high-fidelity, low-overhead network visibility.
- Experience deploying and migrating OpenTelemetry across metrics, logs, and traces.
- Operating service meshes such as Istio or Linkerd and managing telemetry pipelines built on Envoy.
- Managing observability across distributed or multi-region environments (US/EU/APAC), optimizing for latency and cost.
- Implementing cost and resource monitoring using tools like Kubecost or Cloudability.
- Security observability overlap — integrating Falco, GuardDuty, or auditd into telemetry pipelines.
- Contributions to open-source observability projects, or thought leadership through blogs, talks, or community participation.
- Knowledge of high-performance storage systems (Ceph, Lustre, NVMe-oF) and telemetry integrations for throughput and latency analysis.
- Experience building custom backends with Kafka, ClickHouse, or VictoriaMetrics for large-scale data ingestion.

Why This Role Matters

Observability isn’t just about monitoring — it’s about empowerment. In this role, you’ll build the visibility layer that allows engineers, scientists, and operators to understand their systems at every level.