Tecsys Inc. is looking for a Site Reliability Engineer to join our Network and Security Operations Center (NOC). In this role, you will maintain and optimize our mission-critical SaaS cloud infrastructure across AWS and Kubernetes, blending reliability engineering with incident command. Your focus will be on automation, observability, and continuous improvement to ensure high availability and performance.
What You'll Do
- Collaborate with Engineering teams to support services before go-live through system design consulting, platform development, capacity planning, and launch reviews.
- Maintain live services by measuring and monitoring availability, latency, and overall system health.
- Own observability: Enhance monitoring and alerting using Datadog; define SLOs/SLIs and create actionable dashboards.
- Drive automation: Develop and improve internal tooling, IaC frameworks, and pipelines (Terraform, GitLab CI/CD) to reduce manual intervention.
- Scale systems sustainably through automation and push for changes that improve reliability and velocity.
- Participate in the on-call rotation.
- Practice sustainable incident response and blameless postmortems. Lead post-incident reviews (RCAs) and identify long-term fixes.
- Implement and maintain monitoring, logging, alerting, and SLA reporting.
- Create and maintain technical documentation.
- Implement, maintain, and mature SRE best practices.
- Lead incidents as Incident Commander; coordinate cross-team response, manage communications, and ensure rapid service restoration.
- Support planning and deployment teams to enable stability, predictability, and scale.
- Partner with the Platform Engineering team to implement strategic initiatives and provide feedback.
- Work cross-functionally with internal teams and vendors to manage global growth, focusing on high performance, availability, and reliability.
What We're Looking For
- 5+ years in Site Reliability, Cloud, or DevOps Engineering, ideally in SaaS or large-scale production environments.
- Experience designing and deploying large-scale systems, multi-vendor platforms, and globally distributed infrastructure.
- Proven experience managing cloud infrastructure in AWS (multi-account, VPC, EC2, EKS) and Kubernetes at scale.
- Strong hands-on experience with IaC and automation (Terraform, Ansible, or similar).
- Familiarity with CI/CD pipelines and release automation (GitLab preferred, Jenkins acceptable).
- Deep understanding of monitoring and observability using Datadog (or equivalent), including metric design, log pipelines, alerting, and dashboards.
- Experience with incident management, on-call participation, escalation, and structured postmortems.
- Scripting skills in Python, Bash, Java, or equivalent for automation and diagnostics.
- Curiosity, ownership, and a bias for action.
- Basic knowledge of Java- or .NET-based development is required.
- Strong English communication skills, both written and spoken, are essential for effective correspondence with customers, business partners, and colleagues beyond Quebec.
- Must be a Canadian Citizen, Permanent Resident of Canada, or have a valid Canadian work permit.
Nice to Have
- Experience with FedRAMP (Federal Risk and Authorization Management Program) compliance is a strong asset.
Technical Stack
- AWS, Kubernetes (EKS)
- Terraform, Ansible
- GitLab CI/CD, Jenkins
- Datadog
- Python, Bash, Java, .NET
Team & Environment
You will be part of the Network and Security Operations Center (NOC), a team at the heart of platform reliability. You'll collaborate closely with Platform Engineering and other Engineering teams.
Work Mode
This role follows a hybrid work model and is based in Canada.
Tecsys is an equal opportunity employer. Accommodation is available for applicants selected for an interview.





