About the Role
Design and implement scalable systems that support AI workloads, and keep them highly available and performant through proactive monitoring, incident response, and infrastructure automation.
Responsibilities
- Develop automation tools to streamline operations and reduce manual intervention
- Monitor system performance and troubleshoot production issues
- Collaborate with development teams to improve service reliability
- Implement and maintain observability solutions, including logging and alerting
- Drive incident response and post-mortem analysis for critical outages
- Optimize infrastructure for scalability and efficiency
- Support deployment pipelines and continuous integration workflows
- Ensure systems meet security and compliance standards
- Contribute to capacity planning and resource forecasting
- Maintain documentation for systems and operational procedures
- Participate in on-call rotations for critical services
- Evaluate new technologies for infrastructure improvements
- Enhance disaster recovery and failover mechanisms
- Work closely with software engineers to refine system design
- Improve system uptime and reduce mean time to recovery (MTTR)
- Integrate infrastructure changes with minimal service disruption
- Apply software engineering principles to operations challenges
- Promote best practices in configuration management
- Support global infrastructure deployments
- Ensure alignment with long-term platform architecture goals
Nice to Have
- Master's degree in a technical discipline
- Experience supporting AI or machine learning infrastructure
- Familiarity with GPU-accelerated computing environments
- Knowledge of Kubernetes in production settings
- Experience with large-scale data processing systems
- Background in performance benchmarking and optimization
- Exposure to hardware-software co-design principles
- Contributions to open-source infrastructure projects
- Certifications in cloud or systems administration
Compensation
Competitive salary and comprehensive benefits package
Work Arrangement
Hybrid work model available
Team
Part of a high-performance engineering team focused on AI systems
Why This Role Matters
This position maintains the infrastructure that AI research and development depend on, keeping systems resilient, efficient, and scalable.
What We Value
We prioritize technical excellence, proactive problem-solving, cross-team collaboration, and continuous improvement in system design and operations.
Available for qualified candidates


