About the Role
This role maintains and improves the reliability, scalability, and performance of production systems through automation, monitoring, and cloud infrastructure.
Responsibilities
- Design and manage cloud infrastructure for high availability
- Implement and maintain CI/CD pipelines
- Monitor system performance and respond to incidents
- Automate deployment and operational workflows
- Ensure systems meet security and compliance standards
- Collaborate with development teams on service reliability
- Troubleshoot and resolve infrastructure issues
- Optimize resource utilization and cost efficiency
- Maintain documentation for systems and processes
- Support disaster recovery and backup strategies
- Lead incident response and post-mortem analysis
- Evaluate and integrate new technologies
- Improve observability through logging and metrics
- Manage configuration and infrastructure as code
- Scale systems to meet growing demand
- Enforce best practices in system design
- Participate in on-call rotations
- Drive improvements in deployment reliability
- Ensure consistency across development, staging, and production environments
- Support containerization and orchestration platforms
- Maintain network and service connectivity
- Work with monitoring tools to detect anomalies
- Implement access controls and identity management
- Contribute to capacity planning
- Promote a culture of operational excellence
Nice to Have
- Master's degree in a technical field
- Certifications in cloud platforms or DevOps practices
- Experience with large-scale distributed systems
- Background in site reliability engineering
- Familiarity with service mesh technologies
- Knowledge of database administration
- Experience with multi-region deployments
- Contributions to open-source projects
- Public speaking or technical writing experience
- Leadership in technical initiatives
Compensation
Competitive salary and benefits package
Work Arrangement
Remote-friendly with flexible scheduling
Team
Collaborative engineering team focused on scalable systems
Why This Role Matters
This position plays a critical part in ensuring system uptime and performance, directly impacting customer experience and business continuity through resilient infrastructure design and rapid incident resolution.
Technology Stack
The team uses AWS, Kubernetes, Terraform, Prometheus, Grafana, Docker, Jenkins, and GitLab for infrastructure, deployment, and monitoring, with a strong emphasis on automation and observability.
Available for qualified candidates