Job Description
Systems Reliability Engineers use a software engineering approach to architect, design, automate, monitor, and operate platforms at scale. This includes engineering reliable infrastructure with close business segment alignment, delivering services through efficient and resilient architectures, and integrating next-generation technologies to accelerate developer velocity.

This position is for a senior-level systems reliability engineer eager to play an integral role on a Platform team, leading the design and delivery of infrastructure, tooling, and platform services. The Senior SRE will partner with application and product teams to onboard technologies, design resilient architectures, build and support development pipelines, automate infrastructure, create telemetry for monitoring, and reinforce best practices for security and reliability. The Senior SRE serves as a subject matter expert in compute and platform infrastructure, sharing knowledge internally and across teams.

Responsibilities:
- Engineer for high reliability, security, and operational excellence across on-premises and cloud environments.
- Partner with application teams to design, build, and support CI/CD pipelines, infrastructure automation, and monitoring solutions.
- Develop and maintain telemetry, documentation, and best practices for performance, capacity, and incident management.
- Serve as a subject matter expert in compute and platform infrastructure, sharing knowledge internally and across teams.
- Collaborate with business partners to facilitate infrastructure migrations and evangelize DevOps best practices.
- Accelerate technology adoption across teams through consultation, onboarding, and enablement of new tools and platforms, including AI-assisted development workflows.
- Design and build reusable CI/CD pipeline components that teams across the enterprise can consume as self-service modules.
- Define module interfaces with clear contracts, versioning, and deprecation policies so consuming teams can adopt with confidence.
- Integrate security and compliance tooling (SAST, DAST, vulnerability scanning, SIEM, cloud configuration scanning) into automated pipeline workflows that reduce manual security onboarding effort.
- Partner with security teams to translate manual compliance processes into automated, auditable pipeline steps.
- Build self-service capabilities for common infrastructure needs (DNS provisioning, secrets management, infrastructure catalog) that reduce developer dependency on SRE ticket queues.
- Instrument platform services and modules with telemetry to measure adoption, reliability, and developer experience.

Basic Qualifications:
- 5+ years of experience in software engineering, systems reliability engineering, or platform/infrastructure engineering.
- 5+ years of systems administration experience with major operating system platforms.
- 5+ years of experience automating the operations of large systems.
- 3+ years of experience building developer-facing tools, platforms, or internal services.
- Demonstrated experience designing reusable components, libraries, or modules used by teams outside your own.
- Experience working in a large enterprise environment with multiple teams, competing priorities, and cross-organizational coordination.
- Strong written communication skills: able to produce module documentation, architecture proposals, and developer guides that engineers across the organization rely on.

Technical Requirements:
- Deep expertise with distributed systems, container orchestration, and cloud platforms (e.g., Kubernetes, AWS, Docker).
- Strong experience with infrastructure-as-code and automation platforms (e.g., Terraform, OpenTofu, Ansible, AWS CDK).
- Proficiency in at least one scripting or programming language (e.g., Bash, Python, Go, JavaScript/TypeScript).
- Deep expertise in Linux administration, including performance monitoring, configuration, and troubleshooting.
- Hands-on experience with CI/CD pipelines and source control systems (e.g., GitLab CI/CD, GitHub Actions), including authoring reusable pipeline components and templates.
- Solid understanding of networking fundamentals and protocols (HTTP, TLS, SSH, DNS, VPCs, load balancing).
- Experience integrating security scanning tools into CI/CD pipelines (e.g., Snyk, Checkmarx, Wiz, Semgrep, GitGuardian).
- Experience with monitoring, logging, and observability platforms (e.g., Datadog, Splunk, CloudWatch).
- Experience deploying infrastructure programmatically through APIs and SDKs.
- Ability to implement monitoring, instrumentation, and telemetry for systems and applications.
- Familiarity with API design principles: building interfaces that other teams depend