As a Staff Software Engineer in Infrastructure, you will play a pivotal role in defining the technical foundation of our platform. You'll lead the design and implementation of systems that support high-traffic services across cloud infrastructure, developer tooling, video delivery, and AI/ML workloads. Your work will directly influence the scalability, reliability, and efficiency of our production environment.
What You'll Do
- Define technical direction for core infrastructure, balancing performance, security, and developer productivity.
- Architect and deploy large-scale solutions on AWS and EKS, using Terraform and CI/CD pipelines to automate delivery.
- Build robust internal platforms and automation tools that empower engineering teams to ship faster and operate reliably.
- Improve observability, incident response, and cost management across distributed systems.
- Lead initiatives in video infrastructure, optimizing CDN strategies and encoding pipelines for global reach.
- Collaborate with data science teams to design GPU-enabled environments for training and serving machine learning models.
- Mentor engineers across the organization, contributing to design reviews and raising standards in operational practices.
- Take ownership during critical incidents, guiding resolution and driving postmortem improvements.
What We're Looking For
- 10+ years of experience building and managing production cloud systems.
- Deep knowledge of AWS, Kubernetes (EKS), networking, and Linux internals.
- Proven track record with Infrastructure as Code (Terraform), CI/CD (GitHub Actions, Argo), and Helm.
- Experience designing software-driven infrastructure, not just configuring systems.
- Strong grasp of distributed systems, cloud security, and scalability challenges.
- History of delivering measurable improvements in system reliability or developer velocity.
- Clear communicator who can drive technical consensus across teams.
Nice to Have
- Background in AI/ML infrastructure, including GPU orchestration and model serving (e.g., Triton, vLLM).
- Familiarity with autoscaling frameworks like KEDA or Karpenter.
- Experience with service meshes, multi-cluster Kubernetes, or video encoding (HLS/DASH).
Our Environment
We operate a remote-first culture with collaboration hubs in San Francisco and Kitchener, Ontario, supporting team members across several U.S. states. Our culture values craftsmanship, clarity, and long-term thinking. We believe diverse perspectives lead to better solutions and actively foster inclusion. We support continuous learning and expect engineers to lead through mentorship, technical rigor, and hands-on problem solving — especially in ambiguous, cross-functional domains.