What You'll Do
Architect and manage resilient, production-grade AWS infrastructure with a focus on security, availability, and long-term maintainability. You’ll shape the foundation that powers next-generation AI agent systems, ensuring it aligns with industry best practices and compliance standards.
Drive full lifecycle management of cloud resources using Terraform or OpenTofu, enforcing consistent patterns across teams and achieving complete infrastructure automation. Own the design and optimization of CI/CD pipelines in Harness, implementing safe deployment strategies and automated validation to support rapid iteration.
Design and operate secure environments by implementing least-privilege access, encryption at rest and in transit, secrets management, and network controls. Support machine learning initiatives by provisioning and maintaining SageMaker endpoints, Bedrock integrations, and GPU-backed compute resources.
Ensure system reliability through proactive monitoring, observability, and incident response. Define meaningful SLOs, reduce alert fatigue, and equip on-call engineers with actionable insights. Automate routine operations using Python, Bash, or Go, and contribute to disaster recovery planning with tested failover procedures.
Leverage AI coding tools like Claude Code and GitHub Copilot as productivity accelerators, while applying rigorous code review standards to all generated output to ensure correctness, security, and clarity.
Requirements
You have a Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience. At least seven years of experience in cloud infrastructure, DevOps, or SRE roles is required, including three years focused on large-scale AWS environments.
Your technical expertise spans core AWS services: EC2, VPC, IAM, RDS/Aurora, ECS/EKS, Lambda, S3, CloudWatch, Route 53, and KMS. You don’t just configure—you design systems with scalability and operational excellence in mind.
You’ve built reusable, production-ready Terraform or OpenTofu modules and managed state securely at scale. You’re experienced with CI/CD platforms, particularly Harness, though familiarity with GitHub Actions or Jenkins is also valuable.
Proficiency with Kubernetes (especially EKS), Docker, and container orchestration is essential. You write scripts in Python, Bash, or Go not for quick fixes, but to build durable operational tooling. You understand Linux internals, networking, and storage in production contexts.
You’ve used observability tools like CloudWatch, Datadog, Prometheus, or Grafana to monitor live systems and improve reliability. You’re comfortable participating in an on-call rotation and responding to critical incidents.
Communication is part of your craft—you explain complex infrastructure choices clearly to both technical and non-technical audiences. You document decisions thoroughly and lead by example, treating every system as if you’ll be the one supporting it years later.
Active use of AI-assisted development tools in your workflow is expected. You know how to prompt effectively, validate outputs critically, and integrate AI-generated code safely into production systems.
Benefits
Receive comprehensive medical coverage extending to your spouse and children. Enjoy paid time off that includes annual, casual, and sick leave, along with dedicated parental leave for both maternity and paternity. Religious observance is supported through Hajj and Umrah leave options.
- Performance-linked financial rewards
- Festival and referral bonuses
- Gratuity and leave encashment benefits
- Gym membership and wellness support
- Complimentary meals during office days
- Unlimited tea and coffee
- Transportation and mobile data allowances
- Career development funding
- Two annual team retreats: Summer and Winter Field Weeks
- Quarterly team outings
- Occasional gifts and recognition


