San Francisco | On-site | Employment | $190,000 - $250,000

Together AI is hiring a Senior Network Engineer

Responsibilities

  • Architect, implement, and sustain high-performance global networks across multiple vendors and protocols.
  • Use data analysis to detect, diagnose, and resolve network issues to reduce system outages.
  • Assess and propose advanced network technologies, hardware, and software platforms.
  • Engage in architectural design evaluations to ensure network solutions meet business goals and are scalable, reliable, and efficient.
  • Coordinate with third-party vendors and partners to validate and test selected network components.
  • Build and roll out systems and automation tools to enhance network stability and performance.
  • Define and apply industry-standard best practices while helping shape next-generation scalable network architectures.
  • Uphold compliance with established IT governance policies and operational standards.
  • Lead technical initiatives to solve complex infrastructure challenges, contributing to strategic roadmaps alongside top-tier engineering talent.

Requirements

  • Minimum of 8 years of hands-on experience in building and maintaining large hybrid data center networks, excluding enterprise-only environments.
  • Strong expertise in TCP/IP networking, including protocols such as BGP, OSPF, VXLAN, EVPN, and QoS.
  • Proven experience creating network automation workflows using Python, Ansible, or similar infrastructure automation tools.
  • Skilled in using diagnostic tools like Wireshark, tcpdump, nmap, MTR, and curl to troubleshoot connectivity, latency, and bottlenecks.
  • Background in designing and operating multi-tenant network environments.
  • Direct experience deploying and managing network hardware from Cisco, Arista, Juniper, and Mellanox.
  • Familiarity with cloud networking platforms including AWS, GCP, and Azure.
  • Extensive experience working in Linux environments, including troubleshooting and system administration.

Nice to Have

  • Familiarity with RoCE and InfiniBand protocols is advantageous.
  • Experience with containerization and orchestration tools such as Docker, Kubernetes, or workload managers like Slurm is beneficial.
  • Knowledge of AI training workloads and their impact on network performance is a plus.

Responsibilities

  • Design, deploy, manage, and maintain global multi-vendor, multi-protocol, high-performance compute networks.
  • Analyze data to diagnose network issues and identify their root causes, minimizing downtime.
  • Evaluate and recommend network technologies, hardware, and software solutions.
  • Participate in design reviews to ensure the proposed network architecture aligns with business needs and is optimized for performance, scalability, and reliability.
  • Manage relationships with external vendors and partners to test and verify hardware and software selections.
  • Develop and deploy systems and tools to keep all networks running reliably and efficiently.
  • Establish and implement industry best practices and contribute to the design of new scalable network solutions.
  • Ensure compliance with IT governance standards and best practices.
  • Lead projects that address complex technical challenges, contribute directly to roadmaps, and partner with the best engineers in the industry to develop world-class solutions.

Required

  • 8+ years of professional experience building, managing, and supporting large-scale hybrid data center networks (excluding enterprise networks).
  • High level of proficiency with TCP/IP networking architecture and technologies such as BGP, OSPF, VXLAN, EVPN, and QoS.
  • Experience developing network automation pipelines using Python, Ansible, or other languages/tools utilized in infrastructure automation.
  • Proficient in using tools such as Wireshark, tcpdump, nmap, MTR, and curl to identify connectivity issues, latency problems, and network bottlenecks.
  • Experience designing and supporting multi-tenant networks.
  • Hands-on experience deploying and supporting network devices from Cisco, Arista, Juniper, and Mellanox.
  • Experience working with cloud networks such as AWS, GCP, and Azure.
  • Solid experience working in and troubleshooting within a Linux environment.

Preferred

  • Knowledge of RoCE and InfiniBand protocols is a plus.
  • Experience with Docker, Kubernetes, or Slurm is a plus.
  • Understanding of AI training workloads and the demands they place on networks is a plus.
Required Skills
Docker, Kubernetes
About company
Together AI
Together AI is a research-driven artificial intelligence company that believes open and transparent AI systems will drive innovation. They are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models, and have contributed to leading open-source research, models, and datasets.
Job Details
Category infrastructure
Posted 8 days ago