AI Platform Reliability Engineer

Oracle

Full-time · Remote · US · Salt Lake City, UT, US · Posted: 2026-05-11 · Until: 2026-06-10

Job Description

Oracle Health is seeking an AI Platform Reliability Engineer to ensure our AI agent platform and AI-enabled analytics workflows are reliable, observable, measurable, and safe in production. This role will focus on the operational foundation for production AI systems, including monitoring, tracing, evaluation in production, rollback controls, alerting, versioning, runtime diagnostics, and quality safeguards. The engineer will also support data reliability use cases such as detection of stopped processing, data gaps, freshness issues, schema drift, and anomaly conditions that affect downstream analytics and reporting.

The ideal candidate brings strong engineering discipline in observability, release safety, and operational tooling, with the ability to apply those skills to modern AI and agent-based systems. This role is critical to maintaining trust in AI outputs and ensuring new capabilities can scale safely across Oracle Health.

Responsibilities

- Build and maintain observability, logging, tracing, and monitoring for AI agents, agent tools, and AI-enabled analytics workflows.
- Implement release, rollout, rollback, and versioning controls for prompts, models, tools, and configurations.
- Design and support production evaluation practices to detect regressions, silent failures, quality drift, and performance issues.
- Contribute to data monitoring and reliability workflows, including detection of stopped processing, data gaps, freshness issues, schema drift, and anomalies.
- Support incident response, triage, root-cause analysis, and operational reporting for AI and data reliability issues.
- Partner with architects and AI engineers to ensure systems are production-ready, measurable, and maintainable.
- Implement latency, throughput, and cost monitoring controls for AI-enabled s