Develop automated systems for scalable inference operations, including setup, configuration, updates, rollbacks, and regular maintenance, emphasizing consistency and safety.
Design and refine deployment strategies for inference tasks on Kubernetes, covering deployment methods, dynamic scaling, multi-cluster setups, GPU resource management, and secure update procedures.
Improve platform reliability through software: define and refine service level indicators (SLIs), service level objectives (SLOs), and error budgets, improve alert quality, and automate responses to recurring issues.
Manage and maintain a large-scale infrastructure of GPU and datacenter systems, supporting hardware from early testing through production deployment.

Job applications will remain open until at least February 21, 2026.
Artificial intelligence tools are used in the recruitment process.
The company is dedicated to building a diverse and inclusive workplace and adheres to equal opportunity employment practices, prohibiting discrimination based on race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability, or any other legally protected status.

NVIDIA is hiring a Senior Software Engineer – Inference Platform Infrastructure
