The Lead Site Reliability Administrator plays a critical role in ensuring the stability, scalability, and performance of cloud-based services. This position bridges the gap between development and operations by implementing automation, proactive monitoring, and robust incident management practices. The role requires deep technical expertise in cloud infrastructure, containerization, and DevOps tooling, along with strong leadership skills to guide reliability initiatives across teams. The administrator will be responsible for maintaining high availability, driving continuous improvement, and ensuring systems meet stringent service level objectives in a fast-paced, globally distributed environment.
Responsibilities
- Design and implement solutions to improve service availability, performance, and operational stability
- Automate routine tasks and processes within a cloud DevOps environment to increase efficiency
- Develop proactive monitoring and alerting systems to reduce incident frequency
- Respond to incidents in accordance with defined service level agreements
- Provide ongoing feedback to development teams on system defects, stability, and improvement opportunities
- Create and maintain runbooks and operational patterns for production application support
- Collaborate with IT and development teams to define and implement KPI monitoring and real-time transaction tracking
- Lead change validation efforts for deployments by infrastructure and development teams
- Participate in advanced troubleshooting of production issues reported by users or customers
- Take ownership of incident resolution, including root cause analysis and participation in SWAT investigations
- Work rotating shifts as required to support continuous operations
- Participate in on-call rotations to ensure 24/7/365 system coverage
Requirements
- Extensive experience with Linux systems and proficiency in scripting languages such as Shell, Python, Perl, or JavaScript
- Hands-on experience with public cloud platforms including Google Cloud, AWS, and Azure, as well as platform and orchestration technologies like Kubernetes, Cloud Foundry, and BOSH
- Operational knowledge of container and cluster-management technologies such as Docker, rkt, and Mesos, along with microservices and RESTful architectures
- Proficiency with continuous delivery and automation practices and tools such as GitOps, Ansible, Rundeck, or Argo CD
- Experience supporting middleware and Java-based applications including Apache, Tomcat, Spring, Struts, and Spark
- Familiarity with relational and NoSQL databases including Oracle, Postgres, MariaDB, and Cassandra
- Strong understanding of monitoring and observability tools such as New Relic, Dynatrace, AppDynamics, Zabbix, and check_mk, as well as logging platforms like Graylog and Kibana
- Experience with messaging and search technologies including Kafka, RabbitMQ, Solr, and Elasticsearch
- Proven ability to diagnose and resolve complex issues in high-volume environments while adhering to security and ITIL standards
- Demonstrated leadership and collaboration skills with the ability to manage multiple priorities and work across teams
Tech Stack
Linux, Shell, Python, Perl, JavaScript, Google Cloud, AWS, Azure, Kubernetes, Cloud Foundry, BOSH, Docker, rkt, Mesos, microservices, RESTful architectures, GitOps, Ansible, Rundeck, Argo CD, Apache, Tomcat, Spring, Struts, Spark
Benefits
- Comprehensive benefits package supporting physical, emotional, and financial wellbeing
- Eligibility for variable and commission-based compensation
- Vacation entitlement
- Paid time off
Compensation
$103,250 - $153,250. Compensation may vary based on the candidate’s education, experience, skills, geographical location, and alignment with internal equity and the external market.
Team
Part of a cloud DevOps organization, collaborating cross-functionally with development teams and IT business partners
- Innovation
- Creativity
- Collaboration
- AI-First
- Future-Driven
- Human-Centered
Additional Information
- This role operates in a dynamic, agile environment with frequent deployments and rapid incident response cycles.
- Candidates must be comfortable working in a high-pressure, on-call environment with mission-critical systems.
- Strong documentation and communication skills are essential for effective cross-team collaboration.
- Opportunities for professional growth and specialization in cloud-native technologies are supported.
- Regular participation in post-incident reviews and system improvement initiatives is expected.


