About the Role
This role designs and maintains observability solutions that support reliable, scalable infrastructure. The engineer implements monitoring, logging, and alerting systems that improve visibility into system health and reduce mean time to resolution.
Responsibilities
- Design and deploy observability frameworks for cloud-native environments
- Integrate telemetry data from distributed services and infrastructure components
- Develop and maintain alerting rules to detect system anomalies
- Optimize log aggregation and retention strategies
- Collaborate with engineering teams to instrument applications for monitoring
- Troubleshoot performance issues using metrics and tracing tools
- Ensure observability solutions meet reliability and scalability requirements
- Automate deployment and configuration of monitoring tools
- Maintain dashboards for real-time system health insights
- Support incident response with detailed diagnostic data
- Evaluate and adopt new observability technologies
- Enforce best practices in metric naming and data collection
- Improve system reliability through proactive monitoring
- Document system architecture and observability patterns
- Participate in on-call rotations for critical systems
- Drive initiatives to reduce alert fatigue
- Work with security teams to ensure log data meets compliance requirements
- Scale monitoring infrastructure to match platform growth
- Standardize observability tooling across teams
- Mentor engineers on monitoring and debugging techniques
Nice to Have
- Experience with OpenTelemetry or similar open-source observability tools
- Background in building internal developer platforms
- Contributions to open-source observability projects
- Familiarity with service mesh technologies
- Advanced knowledge of time-series databases
Compensation
Competitive salary and benefits package
Work Arrangement
Hybrid work model with flexibility for remote work
Team
Collaborative engineering team focused on scalable infrastructure
Why This Role Matters
Effective observability is critical to maintaining system reliability as the platform scales. This role directly affects how quickly issues are detected, diagnosed, and resolved, which minimizes downtime and improves developer productivity.
What You’ll Build
You will design end-to-end observability pipelines, integrate telemetry across services, and create tooling that empowers teams to monitor and debug their systems effectively.
Available for qualified candidates