About the Role
You will build and manage reliable, scalable infrastructure that powers AI-driven platforms, keeping systems efficient, secure, and available through automation and modern DevOps practices.
Compensation
Competitive salary with equity and benefits
Work Arrangement
Hybrid remote policy with office presence options
Team
Collaborative engineering team focused on scalable systems
What You’ll Do
- Design and implement scalable cloud infrastructure for AI workloads
- Maintain system reliability, monitoring, and incident response protocols
- Automate deployment pipelines using infrastructure-as-code tools
- Collaborate with engineering teams to optimize service performance
- Ensure compliance with security and operational best practices
- Troubleshoot complex system issues across distributed environments
- Support capacity planning and cost optimization initiatives
- Improve observability through logging and metrics systems
- Manage containerized environments using Kubernetes
- Work on disaster recovery and high-availability strategies
- Integrate third-party services and APIs securely
- Participate in on-call rotations for critical systems
- Evaluate new technologies for infrastructure improvements
- Document system architecture and operational procedures
- Contribute to internal tooling for developer productivity
What We’re Looking For
- Proven experience in systems engineering or platform infrastructure
- Strong knowledge of cloud platforms such as Google Cloud or AWS
- Proficiency with configuration management and infrastructure-as-code (IaC) tools
- Experience operating production systems at scale
- Familiarity with container orchestration platforms
- Solid understanding of networking and security principles
- Background in monitoring and alerting systems
- Skill in scripting and automation with Python or a similar language
- Track record of improving system reliability
- Ability to work cross-functionally with technical teams
- Experience with CI/CD pipelines and Git workflows
- Knowledge of Linux system administration
- Understanding of distributed systems design
- Problem-solving mindset with attention to detail
- Commitment to operational excellence
Nice to Have
- Experience supporting machine learning infrastructure
- Familiarity with service mesh technologies
- Contributions to open-source infrastructure projects
- Background in high-growth startups
- Knowledge of database administration at scale
- Experience with multi-region deployments
- Understanding of compliance frameworks
- Involvement in incident post-mortems
- Exposure to real-time data processing systems
- Prior work in AI or generative technology environments
Available for qualified candidates