About the Role
In this role, you will lead the development and refinement of the runtime infrastructure that powers scalable AI execution, with a focus on efficiency, reliability, and performance across distributed environments.
Responsibilities
- Design and implement core components of runtime systems
- Optimize execution performance for AI and machine learning workloads
- Collaborate with cross-functional teams to define system requirements
- Diagnose and resolve complex performance bottlenecks
- Contribute to architectural decisions for scalable infrastructure
- Ensure runtime compatibility across diverse hardware environments
- Develop tooling to monitor and improve system behavior
- Lead code reviews and set engineering best practices
- Mentor junior engineers in systems programming and design
- Work closely with research teams to integrate new AI models
- Improve fault tolerance and system resilience
- Drive automation in testing and deployment pipelines
- Maintain detailed technical documentation
- Evaluate emerging technologies for runtime improvements
- Support security and compliance requirements in execution layers
- Participate in incident response and on-call rotations
- Refactor legacy components for better maintainability
- Contribute to open-source projects when applicable
- Ensure backward compatibility during system upgrades
- Collaborate on debugging low-level system issues
- Improve startup and execution latency
- Work with containerization and orchestration technologies
- Integrate observability into runtime components
- Support deployment across cloud and edge environments
- Balance feature development with technical debt reduction
Compensation
Competitive salary and equity package
Work Arrangement
Hybrid work model with flexibility for remote or on-site collaboration
Team
Part of a core engineering team focused on runtime systems and performance optimization
About the Team
The Runtime team builds the foundational execution layer that powers AI inference and training workflows. We focus on speed, efficiency, and scalability across heterogeneous environments.
Tech Stack
- Primary languages: C++, Rust
- Infrastructure: Kubernetes, Docker
- Cloud platforms: AWS, GCP
- Monitoring: Prometheus, Grafana, OpenTelemetry
- CI/CD: GitHub Actions, ArgoCD
Growth Opportunities
- Lead major system redesigns
- Present technical work to broader engineering groups
- Contribute to strategic planning for runtime evolution
- Mentor engineers across multiple teams
Sponsorship
Sponsorship is available for qualified candidates who require work authorization

