IT Developer/Senior Site Reliability Engineer

Aptino

Full-time · Remote · United States · Posted: 2026-05-11 · Until: 2026-07-10

Job Description

Job Title: IT Developer / Senior Site Reliability Engineer (SRE – Agentic Operations)
Location: Dallas, TX or San Francisco, CA (Remote)
Work Schedule: Monday to Friday, 08:00 AM – 05:00 PM (PST)

Role Overview

We are looking for a Senior Site Reliability Engineer with strong development expertise to help evolve a traditional production support model into an automation-first, AI-assisted reliability platform. This role operates at a senior/staff level, focusing on reliability across distributed systems rather than a single application.

You will work on modernizing operations following migration to Microsoft Azure, combining core SRE practices with agentic AI-driven automation. Unlike conventional SRE roles, this position emphasizes building intelligent, multi-agent systems that enhance incident response, system reliability, and operational efficiency, while ensuring humans remain accountable for critical decisions.

How This Role Differs from Traditional SRE

- Begins with mastering manual SRE workflows
- Progressively introduces AI-assisted automation
- Moves toward semi-autonomous operational systems
- Maintains human-in-the-loop control for production-critical actions

Key Responsibilities

1. Production Reliability & Operations

- Design, manage, and optimize highly available production systems in Azure
- Participate in and lead on-call rotations and critical incident response
- Conduct deep root cause analysis and lead blameless post-incident reviews
- Define and maintain SLIs, SLOs, and observability frameworks
- Monitor systems using tools like Dynatrace dashboards, alerts, and tracing
- Troubleshoot issues across:
  - Java-based services
  - Kubernetes clusters
  - Cloud infrastructure
- Collaborate with engineering, platform, and security teams to reduce risk and operational overhead
- Ensure adherence to security, compliance, and regulatory standards (e.g., HIPAA)

2. AI-Driven (Agentic) Operations

This is the core differentiator of the role.

Incident Intelligence & Signal Processing

- Build systems that ingest and correlate:
  - Logs, metrics, traces
  - Alerts and monitoring signals
  - Support tickets and escalation data
- Convert raw signals into structured, actionable incident insights

Automated Triage & Investigation

- Develop AI agents that:
  - Analyze telemetry and system changes
  - Identify probable failure points
  - Recommend next actions
- Implement parallel/multi-agent workflows for faster diagnosis

Remediation & Safe Automation

- Design automation for controlled actions such as:
  - Service restarts
  - Scaling operations
  - Rollbacks and feature toggles
- Ensure all production-impacting actions follow:
  - Predefined guardrails
  - Approval workflows (human-in-the-loop)
- Gradually evolve from advisory systems → semi-autonomous execution

Communication & Post-Incident Automation

- Build agents that:
  - Generate incident updates for stakeholders
  - Draft post-incident reports
  - Standardize communication across teams
- Ensure outputs are auditable, consistent, and production-ready

Technology Environment

Core Stack

- Cloud: Microsoft Azure
- Containers: Kubernetes, Docker
- Backend: Java-based services
- CI/CD: GitHub Actions
- Observability: Dynatrace

Automation & Scripting

- Python, Bash, Ansible

AI & Automation Frameworks

- Microsoft Agent Framework
- Azure-hosted AI agents
- Multi-agent orchestration systems
- Human-in-the-loop safety models

Required Qualifications:

- 7+ years of experience in Site Reliability Engineering / Production Engineering