About the Role
This role focuses on building and enhancing systems that detect, respond to, and resolve service incidents. You will work closely with multiple teams to strengthen monitoring, reduce downtime, and improve operational workflows.
Responsibilities
- Design and implement tools for detecting and managing service disruptions
- Improve incident response workflows to reduce resolution time
- Collaborate with engineering teams to identify root causes of outages
- Develop automation to minimize manual intervention during incidents
- Maintain and scale alerting and monitoring infrastructure
- Contribute to on-call rotations and post-incident reviews
- Enhance observability across distributed systems
- Support the development of runbooks and response playbooks
- Work on reliability improvements for critical backend services
- Integrate incident data into dashboards for real-time visibility
- Optimize escalation paths for faster team engagement
- Ensure compliance with incident management standards
- Refine alerting thresholds to reduce noise
- Participate in system design reviews with a focus on resilience
- Drive adoption of best practices in incident handling
- Collaborate on cross-team initiatives to improve uptime
- Investigate performance bottlenecks during high-severity events
- Build tools for incident simulation and readiness testing
- Document incident trends and recommend preventive measures
- Support integration of new services into incident management frameworks
Nice to Have
- Master’s degree in computer science or related field
- Experience with large-scale geospatial data systems
- Background in building internal developer platforms
- Contributions to open-source observability tools
- Prior work in SRE or platform engineering roles
- Experience with real-time data processing pipelines
- Knowledge of incident command frameworks
- Familiarity with post-incident analysis methodologies
Compensation
Competitive salary based on experience and location
Work Arrangement
Hybrid work model with flexible remote options
Team
You will be part of the platform reliability team, which focuses on system resilience and operational excellence
Why This Role Matters
Service reliability becomes more critical as systems grow in complexity. This role ensures incidents are detected early, managed efficiently, and resolved quickly to maintain trust and performance.
What You’ll Build
You will develop tools that automate detection, triage, and remediation of incidents, reducing human toil and increasing system resilience across the organization.
Available for qualified candidates