Toronto, Ontario, Canada | Remote (City) | Employment

RBC Borealis is hiring a Lead Site Reliability Engineer

About the Role

RBC Borealis is looking for a Lead Site Reliability Engineer to ensure the reliability, scalability, and performance of our systems. You will apply SRE principles, including incident management and observability, and drive automation and innovation across our technical landscape.

What You'll Do

  • Work closely with Quality Engineering, DevOps, Development, IT, and Cloud teams to align SRE practices with organizational goals.
  • Design, implement, and maintain reliable and scalable systems to ensure high availability and performance.
  • Monitor system health, identify bottlenecks, and proactively resolve issues to minimize downtime.
  • Develop and maintain Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs).
  • Apply SRE principles to improve system reliability and reduce operational toil.
  • Architect, deploy, and manage cloud-based infrastructure (e.g., AWS, Azure, GCP).
  • Optimize cloud resources for cost efficiency and performance.
  • Implement Infrastructure as Code (IaC) using tools like Terraform, CloudFormation, or Pulumi.
  • Set up and configure new SaaS applications, ensuring seamless integration with existing systems.
  • Automate deployment pipelines using CI/CD tools (e.g., Jenkins, GitHub Actions, GitLab CI/CD).
  • Collaborate with cross-functional teams to ensure smooth onboarding of SaaS solutions.
  • Write clean, efficient, and maintainable code in languages such as Python, Go, Java, or Ruby.
  • Develop automation scripts for repetitive tasks, monitoring, and incident response.
  • Build and maintain tools to improve developer productivity and system reliability.
  • Lead incident response efforts, including root cause analysis and post-mortem reviews.
  • Implement robust monitoring and alerting systems using tools like Prometheus, Grafana, or Datadog.
  • Ensure effective communication and resolution during critical incidents.
  • Establish and refine incident management processes to minimize Mean Time to Recovery (MTTR).
  • Design and implement observability solutions to provide deep insights into system performance and behavior.
  • Utilize tools like Prometheus, Grafana, Datadog, or New Relic to monitor system health and detect anomalies.
  • Develop dashboards and alerts to ensure proactive issue detection and resolution.
  • Implement security best practices for cloud and SaaS environments.
  • Ensure compliance with industry standards and regulations (e.g., GDPR, SOC 2, ISO 27001).
  • Conduct regular security audits and vulnerability assessments.
  • Work closely with development, operations, and product teams to align technical solutions with business goals.
  • Document processes, workflows, and best practices to foster knowledge sharing within the team.
  • Mentor junior team members and contribute to a culture of continuous learning.

What We're Looking For

  • Proficiency in programming languages such as Python, Go, Java, or Ruby.
  • Strong understanding of cloud platforms (AWS, Azure, GCP) and their services.
  • Experience with containerization and orchestration tools (e.g., Docker, Kubernetes).
  • Hands-on experience with CI/CD pipelines and tools (e.g., Jenkins, GitHub Actions, GitLab CI/CD).
  • Knowledge of Infrastructure as Code (IaC) tools (e.g., Terraform, CloudFormation, Pulumi).
  • Proven experience applying SRE principles to improve system reliability and scalability.
  • Experience in incident management, root cause analysis, and post-mortem processes.
  • Proven experience deploying and managing SaaS applications.
  • Familiarity with SaaS integration and API management.
  • Experience with monitoring tools (e.g., Dynatrace).
  • Strong scripting skills in Bash, Python, or similar languages.
  • Experience in automating repetitive tasks and workflows.
  • Excellent problem-solving and troubleshooting abilities.
  • Strong communication and collaboration skills.

Nice to Have

  • Salesforce DevOps: Familiarity with Salesforce & Flosum for managing source-driven development and CI/CD workflows.
  • Bachelor’s degree in Computer Science, Engineering, or a related field.
  • Strategic thinker with excellent interpersonal skills to work across functions and businesses.

Technical Stack

  • Languages: Python, Go, Java, Ruby, Bash
  • Cloud: AWS, Azure, GCP
  • Containers & Orchestration: Docker, Kubernetes
  • CI/CD: Jenkins, GitHub Actions, GitLab CI/CD
  • Infrastructure as Code: Terraform, CloudFormation, Pulumi
  • Monitoring & Observability: Prometheus, Grafana, Datadog, New Relic, Dynatrace
  • Platforms: Salesforce, Flosum

Benefits & Compensation

  • A comprehensive Total Rewards Program including bonuses and flexible benefits, competitive compensation, commissions, and stock where applicable.
  • Leaders who support your development through coaching and managing opportunities.
  • Opportunities to make a difference and have a lasting impact.
  • Work in a dynamic, collaborative, progressive, and high-performing team.
  • A world-class training program in financial services.
  • Flexible work/life balance options.
  • Opportunities to do challenging work.

Work Mode

This is a local-city role based in Toronto, Canada.

We thrive on the challenge to be our best, progressive thinking to keep growing, and working together to deliver trusted advice to help our clients thrive and communities prosper. We care about each other, reaching our potential, making a difference to our communities, and achieving success that is mutual.

Required Skills
Python, Go, Java, Ruby, AWS, Azure, GCP, Docker, Kubernetes, Jenkins, GitHub Actions, GitLab CI/CD, Terraform, CloudFormation, Pulumi
About the Company
RBC Borealis

RBC Borealis, an RBC Institute for Research, is a curiosity-driven research centre dedicated to advancing the state of the art in machine learning. Established in 2016, with labs in Toronto, Montreal, Waterloo, and Vancouver, it supports academic collaborations and partners with world-class research centres in artificial intelligence, with a focus on ethical AI that helps communities thrive.

Job Details

  • Department: Software Development
  • Category: Infrastructure
  • Posted: 14 days ago