About the Role
This role bridges development and operations by applying software engineering principles to infrastructure and operations problems. The focus is on building and maintaining reliable systems at scale.
Responsibilities
- Design and implement scalable monitoring solutions for distributed systems
- Develop automation tools to improve system reliability and reduce manual intervention
- Respond to and promptly resolve critical production incidents
- Collaborate with development teams to enhance application performance and resilience
- Drive post-incident reviews and implement corrective actions
- Optimize system performance and availability across cloud environments
- Maintain and improve CI/CD pipelines for faster and safer deployments
- Enforce best practices in configuration management and infrastructure as code
- Support capacity planning and system scalability initiatives
- Contribute to disaster recovery planning and execution
- Evaluate and integrate new technologies that improve system stability
- Ensure compliance with security and operational standards
- Mentor junior engineers and share operational knowledge
- Participate in on-call rotations for critical systems
- Improve observability through logging, tracing, and metrics collection
- Troubleshoot complex cross-system issues in production environments
- Promote a culture of blameless post-mortems and continuous improvement
- Work closely with product teams to influence system design for reliability
- Automate routine operational tasks to increase efficiency
- Monitor system health and proactively address potential failures
Nice to Have
- Master's degree in computer science or related field
- Experience supporting mission-critical enterprise systems
- Contributions to open-source projects
- Familiarity with service mesh technologies
- Knowledge of large-scale data replication and consistency models
- Experience with performance benchmarking and tuning
- Background in software development with production code contributions
- Exposure to edge computing or hybrid cloud architectures
- Certifications in cloud or systems administration
- Track record of improving system uptime and reducing incident frequency
Compensation
Competitive salary and benefits package
Work Arrangement
Hybrid remote and office-based work model
Team
Collaborative engineering team focused on system reliability and scalability
Why This Role Matters
- This position plays a key role in maintaining the stability and performance of large-scale services used by global customers.
- Engineers in this role directly influence the reliability and efficiency of core infrastructure platforms.
Technology Environment
- Work is conducted in a Linux-based, open-source environment with extensive use of cloud-native technologies.
- Primary tools include Kubernetes, Prometheus, Git, and Ansible, running on public and private cloud infrastructure.
Available for qualified candidates


