Ericsson is looking for a Senior Site Reliability Engineer to champion the reliability, availability, performance, and scalability of our mission-critical services. In this senior role, you will partner with development and operations teams to guide system design and provide leadership in incident response.
What You'll Do
- Serve as a technical leader ensuring production service reliability, scalability, and performance.
- Collaborate with development teams to embed operability and automation into system architecture.
- Lead high-severity incident response, driving resolution and coordinating stakeholder communications.
- Champion root cause analysis and postmortems; ensure remediation is implemented and verified.
- Design and maintain sophisticated monitoring, alerting, deployment, and infrastructure automation systems.
- Oversee creation and regular review of operational runbooks/playbooks; lead resilience and chaos testing exercises.
- Drive service lifecycle processes, including operational readiness, onboarding, and decommissioning.
What We're Looking For
- B.Sc. or M.Sc. degree in a relevant area, or equivalent experience.
- 7–10+ years in systems engineering, DevOps, or SRE roles, with at least 3 years in a senior/lead capacity driving reliability initiatives.
- Expert knowledge of SRE principles: SLIs, SLOs, error budgets, and reliability engineering methodologies.
- Advanced Linux systems administration and troubleshooting skills, spanning cloud (AWS/Azure/GCP) and on-premises environments.
- Extensive production experience with Kubernetes and container ecosystems (Docker, CRI).
- Proficiency with Infrastructure as Code (Terraform, CloudFormation, Ansible) and automation scripting (Python, Go, Bash).
- Strong background in designing/operating CI/CD pipelines, automated deployments, and rollout strategies (canary, blue-green).
- Expertise with observability tools such as Prometheus, Grafana, ELK/EFK, Splunk, plus distributed tracing frameworks (Jaeger, Zipkin, OpenTelemetry).
- Solid networking skills (TCP/IP, routing, load balancing) and security best practices (TLS, identity, secrets management).
- Demonstrated thought leadership in designing and operating complex distributed systems.
- Proven ability in capacity planning, performance tuning, profiling, and cost optimization at scale.
- Understanding of telecom architectures (IMS, 4G/5G core concepts) and carrier-grade availability standards.
- Ability to take command during incidents, coordinating cross-team responses in high-pressure situations.
- Structured problem-solving skills for deep root cause analysis, with actionable follow-through.
- Ability to establish operational standards, best practices, and governance for reliability engineering across teams.
- Exceptional communication skills, bridging technical and business contexts and influencing senior stakeholders.
- Mentorship and coaching for junior and mid-level engineers; fostering a culture of reliability-first thinking.
- Strategic decision-making under pressure, balancing innovation with risk management.
- Initiative to identify systemic risks and champion enterprise-grade improvements.
Nice to Have
- Experience with OSS/BSS, network management tooling, and telecom protocols.
- Knowledge of regulatory/compliance constraints in telecom deployments.
- Reliability-first, automation-first, and risk-aware approach; skilled at balancing speed and safety in delivery.
- Advanced cloud or Kubernetes certifications (AWS Professional, Azure Expert, GCP Professional, CKA/CKAD).
- SRE leadership training, or certifications in incident response or chaos engineering.
Technical Stack
- Operating Systems: Linux
- Cloud: AWS, Azure, GCP
- Containers & Orchestration: Kubernetes, Docker, CRI
- Infrastructure as Code: Terraform, CloudFormation, Ansible
- Scripting & Languages: Python, Go, Bash
- Observability: Prometheus, Grafana, ELK/EFK, Splunk, Jaeger, Zipkin, OpenTelemetry
Work Mode
This is a local position based in Ottawa, Canada.
Ericsson is proud to be an Equal Opportunity employer.




