Director of Engineering - SRE & Operations

CVS Health

TEMPORARY Remote · US
Wellesley, MA, Town of Wellesley, US
USD 12017–24033 / month
Posted: 2026-05-11
Until: 2026-07-10

Job Description

We’re building a world of health around every individual, shaping a more connected, convenient and compassionate health experience. At CVS Health®, you’ll be surrounded by passionate colleagues who care deeply, innovate with purpose, hold themselves accountable and prioritize safety and quality in everything they do. Join us and be part of something bigger: helping to simplify health care one person, one family and one community at a time.

Position Summary

As the Director of Platform Engineering - SRE & Operations, you will guide the strategy, implementation, and ongoing maturity of reliability, availability, and operational excellence across key platforms within the DDAT organization. You will oversee the reliability of web, mobile, API, platform, and AI-enabled systems, ensuring they are resilient, scalable, secure, and cost-efficient. You will partner closely with other engineering teams across CVS Health to embed SRE best practices and strengthen the resiliency, observability, and performance of our digital ecosystem.

Responsibilities

SRE Strategy & Reliability Leadership

Contribute to and execute the SRE strategy, including definition and management of SLOs, SLIs, and error budgets.
Establish and operationalize reliability standards across web, mobile, backend services, and data workloads.
Champion a culture of reliability-by-design and continuous improvement within engineering teams.

AI-Driven Operations (AIOps) & Automation

Drive adoption of AIOps capabilities for intelligent alerting, proactive issue detection, and predictive failure mitigation.
Implement AI-assisted automation: incident triage, runbooks, root-cause analysis, and self-healing workflows.
Collaborate with the AI Platform team to integrate LLMs and machine learning models into operational processes.

Observability & Monitoring

Lead the observability roadmap spanning metrics, logs, traces, and experience monitoring.
Define and standardize tooling and operational practices using Datadog, Splunk, Prometheus, Grafana, and OpenTelemetry.
Deliver actionable dashboards and reporting for availability, performance, latency, and error budget consumption.

DevOps, CI/CD & Release Reliability

Partner with the DevEx and Cloud Engineering teams to strengthen CI/CD reliability, safety, and automation.
Promote progressive delivery (canary, blue/green, feature flags) to reduce deployment risk.
Ensure quality gates, automated rollback, and deployment safeguards are consistently applied.

Incident Management & Operational Excellence

Lead major incident response and escalation processes for critical digital platforms.
Improve MTTD and MTTR, and reduce incident recurrence through preventive engineering and automation.
Maintain operational readiness through runbooks, on-call processes, and post-incident learning.

Cloud Reliability & FinOps

Ensure cloud reliability and scalability across On-Prem, Azure, and GCP environments.
Collaborate with Finance and Platform teams to support FinOps practices, cost optimization, and capacity planning.
Optimize performance and availability across high-traffic, customer-facing platforms.

Leadership & Talent Development

Lead and develop high-performing SRE teams, including managers, engineers, and technical specialists.
Support career pathways, skill frameworks, and upskilling initiatives aligned to SRE disciplines.
Foster a culture centered on ownership, accountability, curiosity, and continuous learning.

Required Qualifications

10+ years of experience in software engineering, platform operations, or site reliability engineering.
5+ years in leadership roles managing SRE, DevOps, or platform reliability teams at scale.

Preferred Qualifications

Experience using AI/ML capabilities in operations (anomaly detection, predictive alerting, log analysis, automated remediation).
Hands-on knowledge of AIOps platforms (e.g., Datadog Watchdog, Dynatrace Davis, Splunk AI, or custom ML/LLM tooling).
Deep expertise in cloud infrastructure, distributed systems, and high-availability architectures.
Strong understanding of SRE principles, DevOps practices, and modern reliability engineering.
Experience running mission-critical digital systems with large-scale user traffic.
Effective communication and stakeholder influence skills, including with senior technology leaders.
Experience working in regulated industries (e.g., healthcare, financial services, insurance).
Demonstrated success collaborating with platform engineering, AI teams, architecture, and cro