About the Role

This role focuses on building and enhancing systems that detect, respond to, and resolve service incidents. The engineer will work closely with multiple teams to strengthen monitoring, reduce downtime, and improve operational workflows.

Responsibilities

Design and implement tools for detecting and managing service disruptions
Improve incident response workflows to reduce resolution time
Collaborate with engineering teams to identify root causes of outages
Develop automation to minimize manual intervention during incidents
Maintain and scale alerting and monitoring infrastructure
Contribute to on-call rotations and post-incident reviews
Enhance observability across distributed systems
Support the development of runbooks and response playbooks
Work on reliability improvements for critical backend services
Integrate incident data into dashboards for real-time visibility
Optimize escalation paths for faster team engagement
Ensure compliance with incident management standards
Refine alerting thresholds to reduce noise
Participate in system design reviews with a focus on resilience
Drive adoption of best practices in incident handling
Collaborate on cross-team initiatives to improve uptime
Investigate performance bottlenecks during high-severity events
Build tools for incident simulation and readiness testing
Document incident trends and recommend preventive measures
Support integration of new services into incident management frameworks

Nice to Have

Master’s degree in computer science or a related field
Experience with large-scale geospatial data systems
Background in building internal developer platforms
Contributions to open-source observability tools
Prior work in SRE or platform engineering roles
Experience with real-time data processing pipelines
Knowledge of incident command frameworks
Familiarity with post-incident analysis methodologies

Compensation

Competitive salary based on experience and location

Work Arrangement

Hybrid work model with flexible remote options

Team

Part of the platform reliability team focused on system resilience and operational excellence

Why This Role Matters

Service reliability is critical as systems grow in complexity. This role ensures incidents are detected early, managed efficiently, and resolved quickly to maintain trust and performance.

What You’ll Build

You will develop tools that automate detection, triage, and remediation of incidents, reducing human toil and increasing system resilience across the organization.

Available for qualified candidates

Mapbox is hiring a Software Development Engineer II, Incidents
