About the Role
The role involves building and managing core infrastructure systems that power AI research and deployment. Engineers will work closely with research and product teams to deliver robust, automated, and efficient platforms.
Responsibilities
- Design and deploy scalable cloud infrastructure
- Automate provisioning and configuration management
- Ensure high availability and fault tolerance
- Monitor system performance and optimize resource usage
- Implement security best practices across environments
- Support deployment pipelines for machine learning models
- Troubleshoot production issues across distributed systems
- Maintain documentation for infrastructure components
- Collaborate with engineering teams on system design
- Evaluate and integrate new infrastructure technologies
- Manage container orchestration platforms
- Enforce compliance with data protection standards
- Develop backup and disaster recovery procedures
- Optimize the cost of cloud resource consumption
- Integrate observability tools for system insights
- Support on-call rotations for critical systems
- Ensure infrastructure aligns with research workflows
- Implement access controls and identity management
- Work with distributed storage solutions
- Contribute to incident response protocols
- Improve deployment velocity through tooling
- Maintain network architecture for low-latency communication
- Support GPU-accelerated computing environments
- Drive migration from legacy to modern infrastructure
- Participate in code and design reviews
Nice to Have
- Experience supporting AI or machine learning workloads
- Background in high-performance computing
- Familiarity with large-scale data pipelines
- Knowledge of GPU cluster management
- Experience with low-latency networking
- Contributions to open-source infrastructure projects
- Advanced degree in computer science or a related field
- Certifications in cloud or systems engineering
Compensation
Competitive salary with equity and benefits
Work Arrangement
Remote with flexible hours
Team
Small, fast-moving team focused on AI infrastructure
Why This Role Matters
The infrastructure built in this role directly enables faster experimentation and deployment of AI models, accelerating research progress and product development.
Technology Stack
The team uses Kubernetes, Terraform, Prometheus, Python, and cloud-native services to manage scalable, secure, and observable systems.
Available for qualified candidates


