Responsibilities
- Develop and maintain observability solutions using platforms like Datadog, Prometheus, and Grafana
- Take a leading role in incident management, including coordinating response efforts, troubleshooting issues, and identifying follow-up actions
- Partner with product engineering teams to architect reliable systems, recover from incidents, and learn from mistakes
- Work with teams to implement and maintain SLOs, monitoring, and alerting strategies that ensure reliability at scale
- Design and implement automation and support tooling to improve system resilience, maintain operational safety, and reduce operational overhead
- Lead the development and maintenance of runbooks, alert definitions, and incident response procedures
- Participate in on-call rotations to provide 24/7 support for critical production systems
Requirements
- 6+ years of experience in Site Reliability Engineering or similar DevOps roles focused on system reliability and incident management
- Strong experience with modern monitoring stacks including Prometheus, Grafana, and Datadog
- Experience in at least one general-purpose programming language, such as Python, Go, Rust, C/C++, or Java
- Expertise with Infrastructure as Code tools, such as Terraform and Helm
- Expertise with at least one major cloud service provider (AWS, GCP, Azure)
- Strong communication skills, with the ability to lead incident response and effectively collaborate across teams
- Willingness to join on-call rotations, and experience with emergency response procedures
- A high degree of agency and a bias towards action: you identify problems and work autonomously to solve them
- Excellent problem-solving skills and a methodical approach to troubleshooting complex issues
Nice to Have
- Experience building multi-tenant, multi-cloud SaaS/DBaaS platforms
- 4+ years of hands-on experience architecting applications for cloud platforms and managing cloud-based infrastructure
- Knowledge of edge computing or mesh networking
- Experience implementing advanced observability practices (tracing, profiling) in distributed systems
- Experience working with globally distributed teams
- Proven experience in project management
Benefits
- Competitive salaries and meaningful equity
- Health, dental, vision, life, and disability insurance, plus a 401(k) and flexible spending accounts
- Flexible time off
- Atlanta and San Francisco offices are open if you ever want a place to work or meet up with teammates
Work Arrangement
Hybrid
Additional Information
- Grit.
- Curiosity.
- Adaptability.
- And a genuine spark for what we’re building.