Principal Site Reliability Engineer - Profitable AI Programmatic Advertising Platform

Three Pillars Recruiting

Full-time · Remote · US · Posted: 2026-05-11 · Until: 2026-07-10

Job Description

Principal Site Reliability Engineer

This role reports directly to the CTO and works cross-functionally with Engineering, Data Science, Machine Learning, and Product. The Principal SRE sits at the intersection of infrastructure, ML systems, and platform governance, with broad influence across teams.

They Are Evolving Toward:

- Event-driven system design
- Container deployments to customer and partner infrastructure
- Reduced architectural rigidity
- Strong internal platform standards

What “Principal” Means

The Principal Site Reliability Engineer role focuses more on architectural leverage than operational work. This role will influence infrastructure, ML operations, governance, and platform strategy.

You Will:

- Recommend architectural direction
- Reduce systemic complexity
- Introduce durable patterns
- Identify architectural risk early
- Retire services when necessary
- Define reliability standards across teams
- Shape how the AI platform is delivered to customers

Mission

As Principal Site Reliability Engineer, you will define and operate the architectural backbone of the AI platform. You will report directly to the CTO and work cross-functionally with Engineering, Data Science, Machine Learning, and Product. This role sits at the intersection of infrastructure, ML systems, and platform governance, with broad influence across teams.

In addition to building systems, you will mentor and elevate other engineers in infrastructure best practices, operational rigor, and architectural thinking. You will help establish a culture of reliability, ownership, and continuous improvement across the organization.
You will design and operate a scalable, event-driven, multi-tenant ML infrastructure platform that supports:

- Distributed ML training (Databricks + Ray)
- Containerized product delivery to external customers
- Internal event-driven services across AWS
- Centralized, state-store-driven orchestration
- Governance across adtech integrations and third-party APIs

This role is responsible not only for technical execution, but also for shaping how the company thinks about reliability, infrastructure standards, and long-term platform evolution.

What You’ll Own

Event-Driven Platform Architecture

You will define the architectural direction for the company’s event-driven platform and lead the build-out of a scalable SRE function to support it. While you will be hands-on in early design and critical implementations, you will not be operating alone. This role is expected to shape and grow the SRE capability over time.

You Will:

- Design and implement AWS event-driven systems using:
  - EventBridge / RabbitMQ (where appropriate)
  - MSK / Kafka
  - Kinesis
  - Lambda / Fargate
  - SQS / SNS / Step Functions
- Architect centralized state stores (DynamoDB, Redis, Postgres) that:
  - React to signals
  - Trigger downstream services
  - Maintain system integrity
- Establish architectural standards for:
  - Idempotency
  - Replay safety
  - Event schema governance
  - Operational clarity and traceability

As the platform evolves, you will help build and mentor an SRE team capable of operating and extending these systems. You will define what good looks like in architecture, reliability, and operational excellence, and ensure that standards scale beyond any single individual.

Ideally, you have participated in at least one event-driven system build in production and have experience evolving infrastructure from early-stage patterns to durable, team-supported systems.
Kubernetes & Control Plane Ownership

- Operate multi-cluster Kubernetes environments in production
- Understand and tune:
  - API server scaling
  - etcd performance
  - RBAC architecture
  - Admission controllers
- Implement:
  - GitOps patterns
  - Progressive delivery
  - Cluster-level security policies
  - Multi-tenant isolation

Bonus If You’ve:

- Built internal developer platforms
- Managed customer-facing container workloads
- Operated ML workloads in Kubernetes

Infrastructure as Code, Governance & CI/CD Evolution

- Define Terraform module standards
- Create reusable infrastructure primitives
- Enforce GitHub guardrails (branch protections, CI gates)
- Evolve and standardize CI/CD pipelines to support:
  - Automated infrastructure testing
  - Policy val