Toronto, Ontario, Canada Hybrid Employment

Tecsys Inc. is hiring a Site Reliability Engineer

About the Role

Tecsys Inc. is looking for a Site Reliability Engineer to join our Network and Security Operations Center (NOC). In this role, you will maintain and optimize our mission-critical SaaS cloud infrastructure across AWS and Kubernetes, blending reliability engineering with incident command. Your focus will be on automation, observability, and continuous improvement to ensure high availability and performance.

What You'll Do

  • Collaborate with Engineering teams to support services before go-live through system design consulting, platform development, capacity planning, and launch reviews.
  • Maintain live services by measuring and monitoring availability, latency, and overall system health.
  • Own observability: Enhance monitoring and alerting using Datadog; define SLOs/SLIs and create actionable dashboards.
  • Drive automation: Develop and improve internal tooling, IaC frameworks, and pipelines (Terraform, GitLab CI/CD) to reduce manual intervention.
  • Scale systems sustainably through automation and push for changes that improve reliability and velocity.
  • Participate in the on-call rotation.
  • Practice sustainable incident response and blameless postmortems. Lead post-incident reviews (RCAs) and identify long-term fixes.
  • Implement and maintain monitoring, logging, alerting, and SLA reporting.
  • Create and maintain technical documentation.
  • Implement, maintain, and mature SRE best practices.
  • Lead incidents as Incident Commander; coordinate cross-team response, manage communications, and ensure rapid service restoration.
  • Provide support for planning and deployment teams to enable stability, predictability, and scale.
  • Partner with the Platform Engineering team to implement strategic initiatives and provide feedback.
  • Work cross-functionally with internal teams and vendors to manage global growth, focusing on high performance, availability, and reliability.

What We're Looking For

  • 5+ years in Site Reliability, Cloud, or DevOps Engineering, ideally in SaaS or large-scale production environments.
  • Experience designing and deploying large-scale systems, multi-vendor platforms, and globally distributed infrastructure.
  • Proven experience managing cloud infrastructure in AWS (multi-account, VPC, EC2, EKS) and Kubernetes at scale.
  • Strong hands-on experience with IaC and automation (Terraform, Ansible, or similar).
  • Familiarity with CI/CD pipelines and release automation (GitLab preferred, Jenkins acceptable).
  • Deep understanding of monitoring and observability using Datadog (or equivalent), including metric design, log pipelines, alerting, and dashboards.
  • Experience with incident management, on-call participation, escalation, and structured postmortems.
  • Scripting skills in Python, Bash, Java, or equivalent for automation and diagnostics.
  • Curiosity, ownership, and a bias for action.
  • Basic knowledge of Java- or .NET-based development required.
  • Strong English communication skills, both written and spoken, are essential for effective correspondence with customers, business partners, and colleagues beyond Quebec.
  • Must be a Canadian Citizen, Permanent Resident of Canada, or have a valid Canadian work permit.

Nice to Have

  • Experience with FedRAMP (the Federal Risk and Authorization Management Program) compliance is a strong asset.

Technical Stack

  • AWS, Kubernetes (EKS)
  • Terraform, Ansible
  • GitLab CI/CD, Jenkins
  • Datadog
  • Python, Bash, Java, .NET

Team & Environment

You will be part of the Network and Security Operations Center (NOC), a team at the heart of platform reliability. You'll collaborate closely with Platform Engineering and other Engineering teams.

Work Mode

This role follows a hybrid work model and is based in Canada.

Tecsys is an equal opportunity employer. Accommodation is available for applicants selected for an interview.

Required Skills

AWS, Kubernetes, EKS, Terraform, Ansible, GitLab CI/CD, Jenkins, Datadog, Python, Bash, Java, Site Reliability Engineering, Infrastructure as Code, Monitoring

About company
Tecsys Inc.

Tecsys is a fast-growing innovator offering supply chain solutions to industry-leading healthcare systems, hospitals, and pharmacy businesses, as well as distributors, retailers, and 3PLs. They work with industry leaders to transform their supply chains through technology.

Job Details
Category: Infrastructure
Posted 4 months ago