About the Role
Lead the development and integration of network operations and observability frameworks to enhance monitoring, troubleshooting, and system resilience across enterprise platforms.
Responsibilities
- Architect end-to-end observability systems integrating logs, metrics, and traces
- Design scalable network monitoring solutions for hybrid environments
- Define standards for telemetry data collection and alerting
- Collaborate with engineering teams to embed observability into services
- Evaluate and implement tools for network performance analysis
- Establish incident detection and root cause analysis workflows
- Drive automation of operational checks and health assessments
- Optimize network traffic visibility across cloud and on-prem infrastructure
- Develop dashboards and reports covering system health and latency
- Ensure compliance with security and data governance policies
- Lead post-incident reviews and recommend system improvements
- Support capacity planning using performance data insights
- Integrate network telemetry with centralized monitoring platforms
- Standardize instrumentation practices across development teams
- Troubleshoot complex network and application-layer issues
- Maintain documentation for network topologies and monitoring rules
- Evaluate new observability technologies and protocols
- Provide guidance on service-level objectives (SLOs) and error budgets
- Coordinate with security teams on monitoring access controls
- Mentor engineers on best practices in network operations
- Ensure high availability of monitoring infrastructure
- Implement proactive alerting to reduce mean time to detection
- Support disaster recovery planning with monitoring coverage
- Drive adoption of SLOs and error budgets across service teams
- Contribute to on-call rotation for critical system alerts
Compensation
Competitive salary based on experience
Work Arrangement
Hybrid work model with flexible scheduling
Team
Collaborative engineering environment focused on infrastructure reliability
Technology Stack
- Uses Prometheus for metrics collection and alerting
- Leverages Grafana for visualization and dashboarding
- Implements OpenTelemetry for standardized instrumentation
- Operates on AWS with hybrid on-prem connectivity
- Employs Kubernetes for container orchestration
Culture & Values
- Emphasizes transparency in system performance reporting
- Promotes blameless postmortems after incidents
- Encourages continuous learning and tool experimentation
- Values proactive problem detection over reactive fixes
- Supports engineers in presenting technical findings
Available for qualified candidates