We are seeking an experienced, analytically minded Site Reliability Engineer to join Arista Networks on a permanent, remote basis from Ireland. In this role, you will be instrumental in building, deploying, and operating critical production systems with a focus on scalability, reliability, observability, and security, while collaborating with cross-functional teams to ensure resilient and future-ready infrastructure.
What You'll Do
- Design, build, and deploy production systems with a focus on scalability, reliability, observability, and performance, while ensuring they meet stringent security standards
- Develop and maintain comprehensive automation solutions to eliminate toil and improve operational efficiency across production environments
- Proactively monitor production systems, establish intelligent alerting strategies, and implement automated incident response mechanisms to minimise downtime
- Create and maintain detailed incident response runbooks; conduct thorough postmortem analyses following incidents to identify root causes and prevent recurrence
- Collaborate with software engineering teams to identify and resolve infrastructure bottlenecks, and design solutions that improve product deployment workflows
- Manage and optimise monitoring infrastructure using industry-standard tools, ensuring comprehensive visibility across all systems
- Plan, communicate, and execute maintenance windows on production systems with minimal disruption to service availability
- Triage platform and infrastructural issues with decisiveness and analytical rigour; engage with third-party vendors and support teams as required
- Deploy new systems and updates in a staged, risk-managed manner, ensuring safe and incremental rollouts
- Survey and adopt best practices in infrastructure and platform management to maintain secure, scalable, and fault-tolerant systems
- Study the design and implementation details of open-source systems to enhance troubleshooting capabilities and accelerate issue resolution
- Work transparently with stakeholders to communicate system status, planned maintenance, and infrastructure improvements
Technical Stack
Our technology stack includes automation with Ansible and Terraform, observability using Prometheus and Grafana, cloud platforms such as AWS, GCP, and Azure, container orchestration with Kubernetes and Docker, and CI/CD pipelines via Jenkins and GitLab.
Work Mode
This is a fully remote role, open to candidates based in Ireland.
