Site Reliability Engineer - AI Infrastructure

Andromeda Cluster, Inc

INTERN · Remote · US · Posted: 2026-05-11 · Until: 2026-06-10

Job Description

Location: Global Remote / San Francisco • Full-Time

About Andromeda

Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved for hyperscalers. We began with a single managed cluster, and it filled almost instantly. Since then, we've been quietly building the systems, network, and orchestration layer that make the world's AI infrastructure more accessible.

Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it's needed most. Our platform routes training and inference jobs across global supply, unlocking flexibility and efficiency in one of the fastest-growing markets on earth.

Our long-term vision is to build the liquidity layer for global AI compute: a marketplace that moves the infrastructure and workloads powering AGI much as the world's financial markets move capital. We are expanding to new frontiers to find the brightest people working in AI infrastructure, research, and engineering.

What You'll Do

- Provision, configure, and operate Kubernetes-based clusters for customers across multiple providers.
- Build automation and tooling to streamline cluster deployments and integrations.
- Debug customer issues across networking, storage, scheduling, and system layers.
- Improve the reliability and scalability of both training and inference infrastructure.
- Design and implement monitoring, alerting, and observability for critical systems.
- Collaborate with engineering and product teams to plan and deliver infrastructure for new services.
- Participate in on-call and incident response, leading postmortems and reliabili