Responsibilities
- Architect, implement, and sustain high-performance global networks across multiple vendors and protocols.
- Use data analysis to detect, diagnose, and resolve network issues to reduce system outages.
- Assess and propose advanced network technologies, hardware, and software platforms.
- Engage in architectural design evaluations to ensure network solutions meet business goals and are scalable, reliable, and efficient.
- Coordinate with third-party vendors and partners to validate and test selected network components.
- Build and roll out systems and automation tools to enhance network stability and performance.
- Define and apply industry-standard best practices while helping shape next-generation scalable network architectures.
- Uphold compliance with established IT governance policies and operational standards.
- Lead technical initiatives to solve complex infrastructure challenges, contributing to strategic roadmaps alongside top-tier engineering talent.
Requirements
- Minimum of 8 years of hands-on experience in building and maintaining large hybrid data center networks, excluding enterprise-only environments.
- Strong expertise in TCP/IP networking, including protocols such as BGP, OSPF, VXLAN, EVPN, and QoS.
- Proven experience creating network automation workflows using Python, Ansible, or similar infrastructure automation tools.
- Skilled in using diagnostic tools like Wireshark, tcpdump, nmap, MTR, and curl to troubleshoot connectivity, latency, and bottlenecks.
- Background in designing and operating multi-tenant network environments.
- Direct experience deploying and managing network hardware from Cisco, Arista, Juniper, and Mellanox.
- Familiarity with cloud networking platforms including AWS, GCP, and Azure.
- Extensive experience working in Linux environments, including troubleshooting and system administration.
Nice to Have
- Familiarity with RoCE and InfiniBand protocols is advantageous.
- Experience with containerization and orchestration tools such as Docker, Kubernetes, or workload managers like Slurm is beneficial.
- Knowledge of AI training workloads and their impact on network performance is a plus.
Responsibilities
- Design, deploy, manage, and maintain global multi-vendor, multi-protocol high-performance compute networks.
- Analyze data to diagnose network issues and identify their root causes, minimizing downtime.
- Evaluate and recommend network technologies, hardware, and software solutions.
- Participate in design reviews to ensure the proposed network architecture aligns with business needs and is optimized for performance, scalability, and reliability.
- Manage relationships with external vendors and partners to test and verify hardware and software selections.
- Develop and deploy systems and tools to keep all networks running reliably and efficiently.
- Establish and implement industry best practices and contribute to the design of new scalable network solutions.
- Ensure compliance with IT governance standards and best practices.
- Lead projects to address complex technical challenges, contribute directly to roadmaps, and partner with the best engineers in the industry to develop world-class solutions.
Required
- 8+ years of professional experience building, managing, and supporting large-scale hybrid data center networks (excluding enterprise networks).
- High level of proficiency with TCP/IP networking architecture and technologies such as BGP, OSPF, VXLAN, EVPN, and QoS.
- Experience developing network automation pipelines using Python, Ansible, or other languages/tools utilized in infrastructure automation.
- Proficient in using tools such as Wireshark, tcpdump, nmap, MTR, and curl to identify connectivity issues, latency problems, and network bottlenecks.
- Experience designing and supporting multi-tenant networks.
- Hands-on experience deploying and supporting network devices from Cisco, Arista, Juniper, and Mellanox.
- Experience working with cloud networks such as AWS, GCP, and Azure.
- Solid experience working in and troubleshooting within a Linux environment.
Preferred
- Knowledge of RoCE and InfiniBand protocols is a plus.
- Experience with Docker, Kubernetes, or Slurm is a plus.
- Understanding of AI training workloads and the demands they place on networks is a plus.