About the Role
This role maintains and improves the reliability of large-scale systems, combining software engineering and operations expertise to support critical infrastructure and services.
Responsibilities
- Monitor system performance and respond to incidents
- Design and maintain scalable infrastructure systems
- Implement automated solutions for operational tasks
- Collaborate with development teams to improve code deployability
- Troubleshoot complex production issues
- Develop tools to enhance system observability
- Support continuous integration and delivery pipelines
- Optimize system reliability and uptime
- Participate in on-call rotation for critical systems
- Document system architecture and operational procedures
- Enforce security and compliance standards
- Contribute to disaster recovery planning
- Evaluate new technologies for operational efficiency
- Improve monitoring and alerting frameworks
- Work with distributed systems and cloud platforms
- Ensure efficient resource utilization across environments
- Drive incident post-mortem analysis and follow-up actions
- Promote best practices in configuration management
- Support containerized application deployments
- Maintain high availability for critical services
- Assist in capacity planning and forecasting
- Integrate feedback loops for system improvements
- Collaborate on performance tuning initiatives
- Support global infrastructure with low-latency requirements
- Contribute to internal knowledge sharing
Nice to Have
- Experience with large-scale production environments
- Background in open-source contributions
- Familiarity with service mesh technologies
- Knowledge of database administration
- Experience with infrastructure as code tools
- Understanding of site reliability engineering principles
- Exposure to global team collaboration
- Proficiency with automation frameworks
- Experience in agile development environments
- Strong grasp of system architecture patterns
Compensation
Competitive salary and benefits package
Work Arrangement
Hybrid work model with flexibility for remote work
Team
Collaborative engineering environment focused on system stability and performance
About the Team
- This team focuses on building resilient systems using open-source technologies.
- Engineers work closely with development and operations groups to deliver reliable services.
Technology Stack
- Primary tools include Linux, Kubernetes, Prometheus, and Git.
- Cloud platforms and containerized environments are central to operations.
Available for qualified candidates