NVIDIA is looking for a Senior Systems Engineer, Artificial Intelligence Operations to drive improvements in AI cluster reliability and performance. You will be at the intersection of customer needs and technical innovation, working in a diverse, supportive environment where everyone is inspired to do their best work.
What You'll Do
- Gather and analyze internal and external customer requirements to improve AI cluster resiliency and design AIOps-based solutions.
- Develop automated workflows for issue detection and root cause analysis, and collaborate closely with operators to debug complex, full-stack AI cluster problems.
- Deliver compelling technical presentations and lead hands-on demos or training.
- Manage evaluation deployments (POC/POV) and ensure smooth, reliable installations by staying engaged throughout the customer journey.
What We're Looking For
- Bachelor of Science degree or equivalent experience.
- 12+ years of networking experience in enterprise or service provider environments, with strong hands-on expertise in routing and switching.
- Proficiency in scripting and automation using Python or similar languages, with strong Linux expertise.
- Proven experience in Systems Engineer or SRE roles, working directly with customers to resolve issues and ensure their success.
- Exceptional oral, written, and presentation skills for clearly communicating complex technical topics.
- Demonstrated ability to collaborate effectively across teams, partnering with operations, engineering, and product development.
Nice to Have
- Experience with data center infrastructure and cloud architectures.
- Background in network performance monitoring or observability.
- Previous experience working at a technology start-up.
Technical Stack
- Python
- Linux
NVIDIA is an equal opportunity employer.



