Serve as a technical leader ensuring high system reliability, availability, and performance through collaboration with engineering and operations teams. Champion best practices in automation, monitoring, and resilient infrastructure design across production environments.
Responsibilities
- Define and manage service-level objectives and indicators, track error budgets, and guide long-term reliability planning.
- Proactively detect reliability risks and lead efforts to resolve or prevent system issues.
- Develop and promote standardized CI/CD practices and pipelines across teams.
- Deploy and scale observability systems covering metrics, logs, and distributed tracing, using tools such as OpenTelemetry, Prometheus, and Grafana.
- Create intuitive dashboards and actionable alerts with runbooks to support on-call operations.
- Lead on-call rotations and incident triage, coordinate incident response, and ensure thorough post-incident reviews with root cause analysis.
- Automate deployment recovery processes, including rollbacks and health verifications.
- Perform load and resilience testing, and manage capacity planning with cost-efficient scaling and caching strategies.
- Optimize performance of databases, message queues, and network configurations for latency and throughput.
- Reduce manual work through automation and self-service platforms, and standardize deployment and recovery workflows.
- Implement reliability controls such as chaos engineering, rate limiting, circuit breakers, and retry policies.
- Manage and secure Kubernetes clusters, container runtimes, and service mesh configurations.
- Operate infrastructure as code using tools like Terraform, Bicep, or CloudFormation, including secrets and policy management.
- Integrate security into DevOps workflows, including vulnerability scanning, dependency checks, and IAM hardening.
- Work closely with development, QA, and product teams on system design, release approaches, and production readiness.
- Develop documentation and conduct training to improve team-wide reliability practices.
- Administer and tune relational and NoSQL databases for performance and availability.
- Partner with security teams to enforce strict database access policies and controls.
Requirements
- Bachelor’s degree in a technical field such as computer science, engineering, or a related discipline.
- 4 to 8+ years of experience in SRE, DevOps, or operations roles with direct responsibility for large-scale production systems.
- Hands-on experience with at least two major cloud platforms such as AWS, Azure, or GCP, including managed services.
- Proficiency with Docker and Kubernetes, including managed services like AKS, EKS, or GKE.
- Strong background in observability, including OpenTelemetry, metrics, logs, traces, and alerting, plus experience running incident postmortems.
- Experience with Infrastructure as Code tools such as Terraform, including modules and CI integration.
- Programming skills in Python or Go, and scripting in Bash for automation and tool development.
- Knowledge of SLOs, error budgets, capacity planning, resilience testing, and progressive delivery methods.
- Excellent communication skills, ability to stay composed under pressure, and a focus on continuous improvement.
- Strong SQL expertise and hands-on experience with major databases such as PostgreSQL, MySQL, or SQL Server.
- Deep understanding of database architecture, replication, and indexing strategies.
- Familiarity with core networking concepts including DNS, TCP/IP, and load balancing.
- Experience designing reusable CI/CD pipeline templates and workflows.
- Proficiency with at least two CI/CD platforms such as Jenkins, GitHub Actions, GitLab CI, or Azure DevOps.
Nice to Have
- Advanced degree in Computer Science, Engineering, or a related technical field.
- Experience with additional programming languages or runtimes such as Java or Node.js.
- Knowledge of frontend frameworks such as React, Angular, or Vue.js.
- Hands-on experience implementing GitOps workflows using tools like ArgoCD or Flux.
- Familiarity with service mesh technologies such as Istio or Linkerd.
- Experience with advanced deployment strategies such as blue-green deployments, canary releases, and feature flags.
- Background in database CI/CD and automated schema migration tools like Flyway or Liquibase.
- Integration experience with security scanning tools such as SonarQube, OWASP tooling, or Snyk.
- Proficiency with monitoring tools such as Prometheus, Grafana, the ELK Stack, or Application Insights.
- Experience with configuration management systems like Ansible, Chef, or Puppet.
- Multi-cloud or hybrid cloud deployment experience.
- Experience building internal developer platforms to improve productivity.
- Development of CLI tools or IDE extensions to streamline developer workflows.
- Implementation of policy-as-code using tools like OPA or Sentinel.
- Cloud platform certifications from AWS, Azure, or GCP.
- Kubernetes certifications such as CKA or CKAD.
- Experience with monorepo tooling like Nx, Turborepo, or Bazel.
- Familiarity with API gateways and microservices architecture patterns.
Tech Stack
- Cloud and containers: Azure, Docker, AKS, EKS, Helm, Istio
- Observability: OpenTelemetry, Prometheus, Grafana, Azure Monitor, Log Analytics, Dynatrace, Elastic
- CI/CD and delivery: GitHub Actions, Azure DevOps Pipelines, canary deployments, blue-green deployments
- Infrastructure as code: Terraform, Terragrunt, Bicep
- Secrets and security: Vault, Azure Key Vault, SSM, Dependabot, Cosign, OPA
- Databases: PostgreSQL, MySQL


