About the Role
You will build and improve the foundational systems that run machine learning models in production, working closely with data scientists and engineers to deliver robust, scalable solutions.
Responsibilities
- Develop and manage platforms that support training and deployment of machine learning models
- Collaborate with data science teams to understand their infrastructure needs
- Ensure platform reliability, scalability, and performance
- Implement monitoring and observability for ML systems
- Optimize workflows for model training and inference
- Maintain secure and compliant environments for data processing
- Support CI/CD pipelines tailored for ML components
- Troubleshoot production issues related to ML infrastructure
- Evaluate and integrate new tools and technologies for ML workflows
- Document architecture and operational procedures
- Drive automation across deployment and testing processes
- Work with distributed systems and cloud-native technologies such as Kubernetes on Google Cloud Platform
- Improve tooling for experiment tracking and model versioning
- Contribute to capacity planning for compute and storage
- Ensure efficient resource utilization across ML workloads
- Participate in incident response and on-call rotations
- Enforce best practices in configuration and deployment
- Support reproducibility of machine learning experiments
- Collaborate on data pipeline integrations
- Help define standards for model serving infrastructure
Nice to Have
- Experience with ML frameworks like TensorFlow or PyTorch
- Background in MLOps or platform engineering
- Exposure to feature stores or model registries
- Knowledge of data orchestration tools
- Familiarity with serverless architectures
- Experience with large-scale data processing
- Contributions to open-source projects
- Understanding of model monitoring and drift detection
- Prior work in regulated industries
- Advanced degree in computer science or related field
Compensation
Competitive salary and benefits package
Work Arrangement
Hybrid or remote options available
Team
You will be part of the machine learning platform team, which focuses on scalable infrastructure
Our Tech Stack
- We use Kubernetes for orchestration, Python for core services, and Terraform for infrastructure management.
- Our platform runs on Google Cloud Platform with a strong emphasis on observability and automation.
Growth and Development
- Engineers are encouraged to lead initiatives, mentor peers, and contribute to technical strategy.
- We support continuous learning through conferences, courses, and internal knowledge sharing.
Available for eligible candidates