Responsibilities
- Work on systems for multi-cluster management, model portfolio optimization, predictive scaling, control plane maintenance, model onboarding, and model performance improvement, and build APIs, SDKs, and command-line tools for deployment management.
- Evaluate and improve the reliability and scalability of existing distributed systems, APIs, data stores, and underlying infrastructure components.
- Collaborate with product teams to understand functional requirements and implement technical solutions aligned with business objectives.
- Write clean, tested, and maintainable code and infrastructure-as-code for both new and existing systems.
- Lead code and design reviews, produce technical documentation for developers, and define testing approaches to ensure system resilience and fault tolerance.