About the Role
This role involves building and scaling observability solutions that support the reliable and secure operation of distributed systems. The engineer will work closely with infrastructure and development teams to improve system visibility, incident response, and long-term reliability.
Responsibilities
- Design and implement observability pipelines for metrics, logs, and traces
- Develop alerting strategies that reduce noise and improve incident detection
- Maintain and scale monitoring platforms across hybrid and cloud environments
- Collaborate with engineering teams to instrument services effectively
- Troubleshoot complex system issues using telemetry data
- Optimize data retention and query performance for observability tools
- Define and track key reliability metrics and SLOs
- Improve incident response workflows using observability insights
- Automate operational tasks related to monitoring and alerting
- Ensure observability systems meet security and compliance standards
- Evaluate and integrate new observability technologies
- Document system architecture and monitoring practices
- Support on-call rotations and contribute to post-incident reviews
- Drive best practices in logging, metric collection, and distributed tracing
- Work with global teams across different time zones
- Identify systemic risks through data analysis
- Contribute to reliability-focused engineering initiatives
- Enhance visibility into network and node performance
- Standardize telemetry configurations across services
- Improve system health diagnostics using real-time data
- Support capacity planning with observability insights
- Promote a culture of operational excellence
- Respond to critical incidents with detailed root cause analysis
- Ensure high availability of monitoring infrastructure
- Integrate observability into CI/CD pipelines
Compensation
Competitive salary and equity package
Work Arrangement
Remote
Team
Distributed engineering team focused on infrastructure and reliability
Why This Role Matters
Observability is critical to maintaining the integrity and performance of a decentralized network. This role directly impacts system uptime, developer productivity, and user trust by ensuring issues are detected and resolved quickly.
What We Value
We prioritize technical depth, proactive problem-solving, and a commitment to long-term system health. Candidates should demonstrate curiosity, ownership, and a drive to improve complex systems.
Available for qualified candidates