Responsibilities
- Own availability, latency, performance, efficiency, monitoring/observability, emergency response, and capacity planning; set and maintain SLOs, SLIs, and error budgets; and create dashboards.
- Analyze, troubleshoot, and resolve operational challenges that affect defined SLOs.
- Manage site stability, performance, reliability, and maintain uptime for production environments.
- Develop a fully automated, multi-environment observability stack based on the existing system, and extend it to predict capacity needs from usage patterns.
- Strive for automation to reduce toil and increase development velocity.
- Perform application-specific production support, incident management, change management, problem management, RCAs, and service restoration as needed.
- Identify product architecture changes that improve reliability, performance, and availability, using a data-driven approach.
- Document resolution runbooks and standard operating procedures.
- Actively look for opportunities to improve system availability and performance by applying lessons learned from monitoring and observation.
- Collaborate with software development teams on release management, help shape the future roadmap, and establish strong operational readiness across teams.
- Implement reliability and observability tools (such as New Relic, Prometheus, and Grafana).
- Collaborate with the Security team and other platform engineering teams to build reliable, maintainable, and scalable solutions that improve our security posture.
Requirements
- Strong background as an SRE supporting a 24x7, highly available production environment for a SaaS or cloud service provider.
- Solid experience with monitoring, APM, and observability tools (Splunk, New Relic, etc.).
- Experience implementing observability plans around logs, metrics, and traces.
- Experience developing software on an agile development team.
- Experience with cloud infrastructure environments, preferably AWS, and infrastructure as code (Terraform, CloudFormation).
- Extensive experience with Docker, Kubernetes, Helm, CI/CD, and configuration management tools such as Ansible and Chef.
- Strong experience with containerization technology and/or Kubernetes.
- Experience with release automation, system administration, and configuration management.
- Experience with programming languages (Java, Python, Go, etc.).
- Strong understanding of Linux, Windows, software development, systems, networking, and cloud concepts.
- Strong interpersonal and teamwork skills, including the ability to set and enforce process and to influence engineers who are not direct reports.
- Strong analytical and programming skills (Python, Go, Java, etc.).
- Deep understanding of best practices for modern cloud security.
- Proven experience building observability for security concerns, such as privilege escalations and bot detection.
Work Arrangement
Hybrid
Additional Information
- Candidates hired into fully remote roles are required to participate in an in-person interview or face-to-face meeting prior to their first day of employment.


