VSolvit is seeking a DevOps / Site Reliability Engineering (SRE) Lead to architect and manage a resilient, secure, and scalable Azure cloud environment. You will be responsible for the end-to-end design and operation of cloud-native infrastructure, CI/CD, observability, and reliability practices.
What You'll Do
- Design and operate Azure landing zones aligned with enterprise governance.
- Lead AKS production architecture including multi-region, high availability, and zero-downtime deployments.
- Define SLOs, SLIs, and error budgets across digital platforms.
- Implement production reliability frameworks including chaos testing and resilience validation.
- Architect blue-green, canary, and progressive deployment strategies.
- Build enterprise Infrastructure as Code (IaC) frameworks using Terraform, Bicep, or ARM with Azure DevOps or GitHub Actions.
- Implement a GitOps model using ArgoCD or Flux, enforce immutable infrastructure patterns, and standardize reusable infrastructure modules.
- Design secure multi-stage CI/CD pipelines with automated code quality, SAST/DAST, container scanning, and dependency vulnerability scanning.
- Integrate Azure Key Vault and managed identities into pipelines and enable automated rollback and deployment guardrails.
- Operate and optimize Azure Kubernetes Service (AKS) clusters and Helm-based releases, implementing pod security standards, workload identity, and network policies.
- Build internal platform templates for developer self-service.
- Design an enterprise observability stack using Azure Monitor, Log Analytics, Application Insights, Prometheus, and Grafana.
- Define centralized logging, distributed tracing, and alerting frameworks.
- Lead production incident response and postmortem analysis, and build real-time dashboards for leadership visibility.
- Implement Azure Policy and RBAC frameworks, and design secure multi-tenant cloud architecture.
- Integrate Defender for Cloud, Conditional Access, and Identity Federation.
- Lead cloud controls for SOC 2, ISO, and internal audits, and define a least-privilege model across subscriptions.
- Implement predictive monitoring and anomaly detection, and automate capacity scaling using telemetry insights.
- Integrate ML-based alert reduction and noise suppression, and enable self-healing infrastructure patterns.
- Lead DevOps and SRE engineers, and establish reliability KPIs and a maturity roadmap.
- Collaborate with Architecture, Security, Data, and Product teams.
- Drive platform modernization strategy and mentor teams on cloud-native best practices.
What We're Looking For
- Proven experience leading the design and implementation of resilient, enterprise-scale Azure cloud platforms.
- Deep hands-on expertise with Azure Kubernetes Service (AKS) production architecture, including multi-region and high-availability deployments.
- Strong background in Site Reliability Engineering, including defining SLOs/SLIs, error budgets, and implementing chaos testing.
- Expert-level proficiency with Infrastructure as Code tools like Terraform, Bicep, or ARM.
- Extensive experience building and operating secure, multi-stage CI/CD pipelines with integrated security scanning.
- Demonstrated skill in implementing GitOps (ArgoCD/Flux), observability stacks (Azure Monitor, Prometheus, Grafana), and cloud security controls (Defender for Cloud, RBAC).
- Experience leading cloud compliance efforts (SOC 2, ISO) and designing secure, multi-tenant architectures.
- Strong leadership skills with experience mentoring teams and collaborating across Architecture, Security, Data, and Product functions.
Technical Stack
- Azure, Azure Kubernetes Service (AKS)
- Terraform, Bicep, ARM
- Azure DevOps, GitHub Actions
- ArgoCD, Flux, Helm
- Azure Monitor, Log Analytics, Application Insights
- Prometheus, Grafana
- Azure Key Vault, Defender for Cloud
Team & Environment
You will lead DevOps and SRE engineers and collaborate closely with Architecture, Security, Data, and Product teams.
Work Mode
This position follows a hybrid work model and is based in Hyderabad, India.
VSolvit is an equal opportunity employer.


