About the Role

This role maintains and improves the reliability, scalability, and performance of production systems through automation, monitoring, and cloud infrastructure.

Responsibilities

Design and manage cloud infrastructure for high availability
Implement and maintain CI/CD pipelines
Monitor system performance and respond to incidents
Automate deployment and operational workflows
Ensure systems meet security and compliance standards
Collaborate with development teams on service reliability
Troubleshoot and resolve infrastructure issues
Optimize resource utilization and cost efficiency
Maintain documentation for systems and processes
Support disaster recovery and backup strategies
Lead incident response and post-mortem analysis
Evaluate and integrate new technologies
Improve observability through logging and metrics
Manage configuration and infrastructure as code
Scale systems to meet growing demand
Enforce best practices in system design
Participate in on-call rotations
Drive improvements in deployment reliability
Ensure consistency across development, staging, and production
Support containerization and orchestration platforms
Maintain network and service connectivity
Work with monitoring tools to detect anomalies
Implement access controls and identity management
Contribute to capacity planning
Promote a culture of operational excellence

Nice to Have

Master's degree in a technical field
Certifications in cloud platforms or DevOps practices
Experience with large-scale distributed systems
Background in site reliability engineering
Familiarity with service mesh technologies
Knowledge of database administration
Experience with multi-region deployments
Contributions to open-source projects
Public speaking or technical writing experience
Leadership in technical initiatives

Compensation

Competitive salary and benefits package

Work Arrangement

Remote-friendly with flexible scheduling

Team

Collaborative engineering team focused on scalable systems

Why This Role Matters

This position is critical to system uptime and performance. Resilient infrastructure design and rapid incident resolution directly affect customer experience and business continuity.

Technology Stack

The team uses AWS, Kubernetes, Terraform, Prometheus, Grafana, Docker, Jenkins, and GitLab for infrastructure, deployment, and monitoring, with a strong emphasis on automation and observability.

Available for qualified candidates

Zingtree is hiring a Senior DevOps / Platform Reliability Engineer