Senior Site Reliability Engineer

VeridianTech

Full-time · Remote (US) · Richardson, TX, United States · Posted: 2026-05-11 · Until: 2026-07-10

Source: original job posting on BeBee. Apply directly with the employer.

Job Description

Role: Senior Site Reliability Engineer
Locations: Richardson, TX / Raleigh, NC / Phoenix, AZ / Hartford, CT / Indianapolis, IN
Type of Hiring: FTE

Requirements:
- Bachelor's degree or foreign equivalent from an accredited institution. Three years of progressive experience in the specialty will also be considered in lieu of each year of education.
- At least 11 years of Information Technology experience.
- At least 6 years of site reliability engineering (SRE) experience in large programs, with a focus on architecting and implementing observability and automation across the entire operations lifecycle.

Observability & Monitoring:
- Implement logging, monitoring, and alerting using any one of Dynatrace, Datadog, Splunk, Nagios, Prometheus, Grafana, the ELK stack, or New Relic.
- Analyze monitoring data and golden signals to identify trends and patterns and proactively address potential problems.
- Debug and optimize code, and automate routine operational tasks.
- Improve automation and increase the system's self-healing capability.

Incident Management:
- Participate in production incidents, perform root cause analysis (RCA), and drive post-mortem improvements.
- Develop and maintain dashboards and reports to visualize system health and performance.
- Use technologies such as Ansible, Python, Terraform, PowerShell/Shell, and JSON to create automation that reduces operational toil.
- Develop automation solutions for repeated incidents and service tasks (provisioning, scaling, backup, performance management, security, capacity management, etc.) in infrastructure operations, or develop automation/optimization solutions for repeated tickets and signals in application operations.

Preferred Qualifications:
- Working knowledge of troubleshooting and providing speedy solutions in case of database failure.
- SLIs, SLOs, and error budgets.
- Event correlation and AIOps, with a deep understanding of ITSM tools.
- Microservices architecture with APIs and REST APIs.
- CI/CD tooling and best practices.
- Cloud platforms such as AWS, Azure, and Google Cloud.
- Container orchestration and practices, including Kubernetes and Docker Swarm.
- Infrastructure automation tools such as Terraform, CloudFormation, Ansible, or Puppet (any one).
- Scripting languages (any of the following): Python, JSON, Java, Node.js, PHP, PowerShell (M), or Bash/Shell/Perl.
- ITSM tools such as ServiceNow.
- Excellent communication and client-interaction skills, with exceptional written, verbal, and technical documentation skills.
- Extraordinary planning, project management, coordination, and analytical skills.
- Hands-on experience working in a Global Delivery Model with onsite/offshore resources.
- Exceptional organizational skills.
- Ability to manage and prioritize tasks efficiently.
- Readiness to demonstrate a proactive attitude.
- Solid attention to detail and excellent written and verbal communication skills are required.
- Ability to work in a team in a diverse, multi-stakeholder environment.