Responsibilities
- Design and implement frameworks, automation, and internal tools that promote efficiency and continuous innovation across engineering teams.
- Use Kubernetes, Docker, and Python to improve developer velocity in building and deploying ML inference applications.
- Build and maintain distributed systems through all stages of the software lifecycle, including design, coding, testing, documentation, and troubleshooting.
- Create developer-facing products and services that simplify access to and interaction with the machine learning platform.
- Work across public cloud platforms such as AWS and GCP, applying best practices in infrastructure scaling and capacity planning.
- Deploy and manage containerized applications in production using Kubernetes, service mesh, ArgoCD, and related orchestration tools.
- Collaborate with technical leads and machine learning engineers to define requirements and implement robust technical solutions.
- Take full ownership of features from concept through deployment, including the infrastructure-as-code that supports them.
- Investigate, prototype, and integrate new machine learning tools with a focus on reliability, scalability, and long-term maintainability.
- Proactively identify and resolve system issues, automate operational workflows, and enable self-service capabilities for engineering teams.
- Participate in on-call rotations to support system reliability and incident response.
Work Arrangement
Hybrid
Other
- Availability for on-call support on a rotational basis.


