Responsibilities
- Automate and maintain deployment workflows to streamline robot cell rollouts, reduce manual setup, and eliminate recurring operational issues.
- Containerise and stabilise our robotics software stack to ensure consistent, reproducible environments across development, cloud and deployed production systems.
- Standardise and manage Linux and Windows environments, establishing repeatable configuration and provisioning practices.
- Provision and maintain cloud infrastructure, including secure access, permissions and identity management (SSO, SSH), with clear ownership of reliability and recovery practices.
- Configure and manage networking infrastructure, including VPNs and secure remote access.
- Own endpoint security and system hardening across internal machines and deployed systems.
- Design and operate monitoring and alerting systems, and lead incident response, root cause analysis and preventative improvements.
- Own operational intake and prioritisation of infrastructure issues, ensuring clear communication and structured resolution.
- Maintain clear documentation of environments, incident learnings, processes and recovery procedures to ensure continuity.
Requirements
- Proven experience automating infrastructure and deployment workflows using scripting, programming and configuration management tools (e.g. Ansible, Python, Bash).
- Strong hands-on Linux systems administration experience.
- Experience with scalable containerised environments in production.
- Practical experience with cloud infrastructure provisioning, reliability and access management (AWS, GCP or Azure).
- Solid understanding of networking fundamentals, VPN configuration and secure remote access.
- Experience implementing and managing credentials, SSO and identity/access control systems.
- Demonstrated experience configuring, standardising and hardening Linux and Windows endpoints.
- Experience building monitoring and alerting systems, and using logs and metrics for structured diagnosis.
- Experience leading incident response, including clear communication, root cause analysis and preventative follow-up actions.
- Experience establishing repeatable configuration or environment management practices in production systems.
- Strong troubleshooting capability, with evidence-based investigation and durable problem resolution.
- Ability to write clear, structured technical documentation for repeatable processes, incident learnings and handovers.
- Experience working in GPU-enabled or hardware-adjacent environments (e.g. NVIDIA stack, CUDA, driver-level debugging).
Nice to Have
- Experience operating infrastructure in robotics, autonomy, AI-heavy or other hardware-adjacent environments.
- Experience using SaltStack or similar infrastructure management frameworks, from early-stage foundations through to multi-site, production-grade deployments.
- Experience with Infrastructure as Code tools (e.g. Terraform, Pulumi or similar).
- Experience designing backup, disaster recovery and resilience testing processes.
- Experience building internal tooling to improve developer productivity.
- Prior experience serving as a senior escalation point or mentoring engineers in operational best practices.
