About the Role
This role ensures the reliability and performance of network infrastructure through proactive monitoring, troubleshooting, and coordination with engineering teams during incidents.
Responsibilities
- Monitor system health and network performance across global environments
- Respond to alerts and initiate incident resolution procedures
- Escalate technical issues to appropriate engineering teams
- Document incidents and maintain accurate status updates
- Perform root cause analysis for recurring system issues
- Support deployment of new services and infrastructure changes
- Maintain operational documentation and runbooks
- Coordinate with cross-functional teams during outages
- Ensure compliance with security and operational policies
- Conduct routine system health checks and capacity reviews
- Assist in developing automation for repetitive tasks
- Participate in on-call rotation for 24/7 coverage
- Troubleshoot network connectivity and service delivery problems
- Validate backup and failover mechanisms
- Track key performance metrics and service level indicators
- Support cloud-based infrastructure operations
- Identify potential risks to system stability
- Improve monitoring coverage and alerting accuracy
- Respond to security-related events in coordination with security teams
- Maintain familiarity with system architecture and dependencies
Nice to Have
- Certifications such as CompTIA Network+, CCNA, or equivalent
- Experience supporting real-time messaging or communication platforms
- Background in large-scale distributed systems
- Familiarity with microservices architecture
- Exposure to CI/CD pipelines and deployment automation
- Working knowledge of Kubernetes or Docker
- Experience with log analysis tools like Splunk or ELK
- Understanding of service-level objectives and error budgeting
- Prior work in a 24/7 NOC environment
- Scripting experience beyond basic automation tasks
Compensation
Competitive salary and benefits package
Work Arrangement
Hybrid work model with flexible scheduling
Team
Part of a global team supporting real-time digital infrastructure
What We Look For
Candidates should demonstrate a consistent track record of maintaining system reliability, responding effectively to incidents, and collaborating across technical teams to resolve complex issues.
Work Environment
Fast-paced operations center with real-time monitoring. The role requires vigilance, clear communication, and adherence to procedures in high-pressure situations.
Available for qualified candidates