About the Role
This role supports internal and external users of a cloud-based machine learning environment by maintaining system reliability, diagnosing technical issues, and improving operational workflows.
Responsibilities
- Provide technical support for deployment and operation of machine learning platforms
- Troubleshoot infrastructure and pipeline issues in cloud environments
- Collaborate with engineering teams to resolve system outages
- Monitor system performance and proactively identify risks
- Document technical solutions and share knowledge across teams
- Assist in automating deployment and testing processes
- Respond to support tickets with clear and timely resolutions
- Work with logs, metrics, and tracing data to diagnose problems
- Keep CI/CD pipelines stable and reliable
- Improve system observability through tooling and alerts
- Assist customers with integration and configuration issues
- Participate in on-call rotations for incident response
- Contribute to post-mortem analyses after incidents
- Maintain up-to-date knowledge of cloud platform changes
- Support security and compliance requirements in production systems
- Optimize resource usage in cloud infrastructure
- Assist in scaling systems to meet growing demand
- Collaborate on internal tools for developer productivity
- Ensure consistency between development, staging, and production environments
- Support containerized applications running on Kubernetes
Nice to Have
- Experience supporting AI or ML platforms
- Exposure to large-scale production systems
- Familiarity with Helm, Terraform, or similar infrastructure tools
- Knowledge of gRPC and API design
- Experience with distributed tracing tools
- Background in technical writing or documentation
- Previous work in remote-first teams
- Open source contributions in DevOps or infrastructure projects
Compensation
Competitive salary and benefits package
Work Arrangement
Remote
Team
Collaborative engineering team focused on AI and machine learning infrastructure
Why This Role Matters
- This position plays a key role in ensuring platform stability and user satisfaction by bridging engineering and customer needs.
- Engineers in this role directly impact the reliability and scalability of AI infrastructure used by data science teams.
What We Offer
- Flexible remote work environment
- Opportunities for professional growth and mentorship
- Exposure to cutting-edge machine learning technologies
- Inclusive culture with a focus on collaboration and innovation