Burlington, Massachusetts, United States Employment

Veracode is hiring a Site Reliability Engineering Manager

About the Role

Veracode is looking for a Site Reliability Engineering Manager to lead the reliability, availability, and operational excellence of our production systems. You will define and enforce reliability standards, manage production risk, and ensure services meet agreed-upon service levels.

What You'll Do

  • Lead a nine-member global Site Reliability Engineering team.
  • Set objectives and key results (OKRs) and KPIs, and manage team performance.
  • Act as the primary point of accountability for reliability concerns that span multiple teams, including DevOps, Security, Database, and Product Engineering.
  • Manage the team on-call schedule and act as the point of escalation for alerts and production incidents.
  • Create tickets, groom the backlog, and prioritize work in sprints.
  • Utilize AWS services to design scalable cloud solutions that support critical systems.
  • Partner with software engineering teams to ensure monitoring and alerting are in place for consistent, scalable, and automated service delivery.
  • Own the design and enforcement of the organization’s observability strategy.
  • Drive alert hygiene, standardization, and reduction of alert fatigue across the organization.
  • Lead efforts to automate infrastructure deployment and management using Terraform, Kubernetes, and other cloud-native tools.
  • Create automated incident response workflows to handle common infrastructure and application issues.
  • Collaborate with security teams to ensure systems adhere to industry-standard security practices.
  • Document and train engineering teams on best practices in reliability, scalability, and operational excellence.
  • Design, operate, and continuously improve on-call and incident response processes.
  • Contribute to incident and process post-mortems.
  • Ensure uptime, SLAs, and availability of critical platform components through process improvements and automation.
  • Monitor existing applications and infrastructure while improving the monitoring itself.
  • Communicate effectively with project stakeholders and management.
  • Develop and support processes to maintain uptime, SLAs, and availability of critical platform components.
  • Troubleshoot and resolve production issues related to systems, networks, and applications.

What We're Looking For

  • Bachelor's degree in Computer Science, Information Science, Engineering, or a related field, or equivalent experience.
  • 2+ years working as a manager or team lead with direct reports.
  • 5+ years working in an SRE, DevOps, Cloud Engineering, or similar role.
  • Experience with AWS and automation tools like Terraform, CloudFormation, or Ansible.
  • Hands-on experience deploying, managing, and troubleshooting Kubernetes clusters.
  • Hands-on proficiency with observability, monitoring, and alerting tools (Datadog, Sumo Logic, Prometheus, Grafana, etc.).
  • Familiarity with CI/CD pipelines and repository management tools (e.g., GitLab, Jenkins, GitHub).
  • Strong programming skills for automation (Python, Go, or similar languages).
  • Solid understanding of infrastructure as code (IaC) and GitOps methodologies.
  • Strong communication skills with the ability to collaborate effectively across different teams.
  • Ability to work in an Agile environment.
  • Proven experience in troubleshooting production environments and improving system reliability.
  • Experience with on-call/incident management systems such as PagerDuty, VictorOps, or Opsgenie.

Nice to Have

  • Experience with service meshes (e.g., Istio) to enhance application observability and security.
  • Familiarity with advanced Kubernetes features (e.g., StatefulSets, Helm, Operators).
  • Knowledge of database management and migration processes, including RDS and DMS.

Technical Stack

  • Cloud & Infrastructure: AWS, Terraform, CloudFormation, Ansible, Kubernetes
  • Monitoring & Observability: Datadog, Sumo Logic, Prometheus, Grafana
  • CI/CD & Development: GitLab, Jenkins, GitHub, Python, Go
  • Additional Tools: Istio, Helm, RDS, DMS

Team & Environment

You will lead a nine-member global team of Site Reliability Engineers.

Veracode provides employment opportunities to all applicants without regard to race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

Required Skills

AWS, Terraform, CloudFormation, Ansible, Kubernetes, Datadog, Sumo Logic, Prometheus, Grafana, GitLab, Site Reliability Engineering, DevOps, Cloud Engineering, Automation, Troubleshooting

About company
Veracode

Veracode offers industry-leading application security solutions, helping businesses secure their software through comprehensive security testing and development tools.

Job Details
Department: Engineering
Category: Infrastructure
Posted: 14 days ago