As a Senior Solutions Architect specializing in Cloud Infrastructure and DevOps, you will play a central role in shaping how enterprise customers design, deploy, and manage advanced computing environments. Your work will center on AI and high-performance computing (HPC) platforms, with a strong emphasis on Kubernetes, GPU integration, and automated infrastructure solutions.
Key Responsibilities
- Guide clients through the design and optimization of large-scale computing systems, including implementation of monitoring, logging, and workload orchestration using Kubernetes and Linux-based schedulers
- Deliver hands-on technical support across the full stack—from hardware and operating systems to container platforms, networking, and storage
- Evaluate existing infrastructure and recommend production-ready, Kubernetes-driven container platforms integrated with enterprise storage and networking
- Develop and maintain technical methodologies, operational playbooks, and best practices for internal and customer use
- Support research initiatives and lead proof-of-concept projects to validate new architectures, features, and upgrade paths
- Produce detailed technical documentation, including runbooks, onboarding guides, and reference architectures
- Act as the primary technical advisor for key accounts, influencing long-term decisions around platform architecture and DevOps strategy
Required Qualifications
- Advanced degree in Computer Science, Engineering, Physics, Mathematics, or a related field, or equivalent professional experience
- Minimum of eight years in roles focused on cloud infrastructure, automation, and scalable system design
- Proven experience deploying and tuning HPC and AI clusters, with strong knowledge of data center networking and system architecture
- Hands-on experience deploying and optimizing NVIDIA GPU-based systems, including CUDA integration and GPU workload analysis
- Extensive Kubernetes experience, particularly in GPU and HPC environments, covering orchestration, scaling, and resource scheduling
- Strong command of Linux systems (Red Hat, Ubuntu), OS security, and networking protocols
- Experience with high-performance storage and file systems such as Lustre, GPFS, ZFS, and XFS, as well as Kubernetes-native storage solutions
- Proficiency in scripting (Python, Bash) and Infrastructure-as-Code tools like Ansible and Terraform
- Familiarity with observability platforms including Grafana, Prometheus, and Loki for building resilient, monitored systems
- Track record of designing scalable technical solutions and advising enterprise clients through architectural reviews and technical workshops
Preferred Skills
- Experience with CI/CD pipelines and automated software delivery
- Hands-on experience with the NVIDIA GPU Operator and NVIDIA Network Operator for managing GPU and network resources in Kubernetes
- Direct experience with NVIDIA Base Command Manager (BCM) for large-scale GPU cluster provisioning and management
- Working knowledge of RDMA technologies, including InfiniBand and RoCE, in AI or HPC contexts
Technology Environment
Key tools and platforms include Kubernetes, Linux (Red Hat, Ubuntu), Lustre, GPFS, ZFS, XFS, Python, Bash, Ansible, Terraform, Grafana, Prometheus, Loki, CUDA, NVIDIA GPUs, NVIDIA Base Command Manager (BCM), RDMA, InfiniBand, RoCE, and the NVIDIA GPU and Network Operators.


