About the Role
In this role, you will build and operate resilient, automated systems for AI workloads, ensuring high availability, observability, and rapid iteration across distributed environments.
Compensation
Competitive salary and equity package
Work Arrangement
Remote within the Americas
Team
Small, autonomous engineering team focused on AI infrastructure
What You’ll Do
- Design and implement infrastructure that natively supports AI model training and inference.
- Automate deployment pipelines for machine learning models and supporting services.
- Monitor system performance and proactively address reliability concerns.
- Collaborate with research and engineering teams to operationalize experimental systems.
- Optimize cloud resource usage for cost efficiency and scalability.
- Maintain secure, compliant environments aligned with data governance policies.
- Troubleshoot complex issues across distributed compute and storage layers.
- Develop tooling to streamline developer workflows and reduce operational overhead.
- Lead incident response and post-mortem analysis for production systems.
- Contribute to architectural decisions for long-term platform sustainability.
What We Look For
- Proven experience with cloud platforms such as AWS, GCP, or Azure.
- Strong scripting skills in Python, Bash, or similar languages.
- Familiarity with containerization and orchestration tools like Docker and Kubernetes.
- Hands-on experience with infrastructure-as-code tools such as Terraform or Pulumi.
- Deep understanding of networking, security, and identity management in cloud environments.
- Experience with CI/CD systems and automated testing frameworks.
- Knowledge of observability tools including logging, metrics, and tracing platforms.
- Background in managing GPU-accelerated workloads is highly desirable.
- Ability to debug performance bottlenecks in distributed systems.
- Clear communication skills for cross-functional collaboration.
Why This Role Stands Out
- Work directly on infrastructure that powers cutting-edge AI applications.
- Shape operational practices in a growing technical organization.
- Solve challenging scalability problems at the intersection of ML and systems engineering.
- Influence tooling and architecture decisions from an early stage.
- Operate with high autonomy and measurable impact on product delivery.
Application Process
1. Submit your resume and a brief note explaining your interest.
2. Complete a technical screening focused on real-world scenarios.
3. Participate in a pair-programming session with the engineering team.
4. Final interview with leadership to discuss alignment and expectations.