Responsibilities
- Develop and Maintain Production Systems: Design, implement, and improve software that powers GPU fleet lifecycle management and machine configuration at scale.
- Automate Infrastructure: Build and enhance automation frameworks for machine provisioning, configuration management, and deployment.
- Support New Product Introduction (NPI): Enable bring-up, validation, and production readiness for new server and accelerator platforms.
- Enhance Machine Lifecycle Processes: Refine workflows for bare metal provisioning, firmware updates, and system health monitoring.
- Debug Hardware and Firmware Issues: Investigate failures across BIOS, BMC, firmware, networking, storage, and boot flows.
- Collaborate Across Teams: Work closely with infrastructure, security, and product engineering teams to develop scalable and maintainable solutions.
Requirements
- 2+ years of experience working with Go (Golang) or Python in production environments.
- 2+ years of experience with configuration management tools and practices.
- Comfortable working in Linux environments and debugging issues at the OS, hardware, and networking layers.
- Able to independently troubleshoot complex systems and communicate effectively across software, infrastructure, and vendor teams.
Nice to Have
- Experience with Go in infrastructure, systems, or backend development.
- Hands-on experience with bare metal provisioning and lifecycle management, including technologies such as Redfish, BMC, IPMI, DHCP, and PXE.
- Experience diagnosing issues involving drivers, firmware, and hardware compatibility across GPU servers.
- Experience incorporating AI-assisted development tools into engineering workflows, including code generation, debugging, test development, and documentation.
- Experience building Linux distributions or managing OS customization and imaging.
- Familiarity with Ansible for system configuration and automation.
- Exposure to Kubernetes and container orchestration concepts.