San Francisco | On-site | Employment | $250,000 - $325,000

Together AI is hiring an Engineering Manager, Site Reliability Engineering

Responsibilities

  • Guide a team of approximately ten site reliability engineers spanning several functional areas, collaborating closely with technical leads to define strategic direction.
  • Advance the team’s transition from manual processes to automated, scalable infrastructure by identifying operational toil, setting limits on it, and driving engineering initiatives that reduce it.
  • Remain actively involved in technical work, including coding, architectural reviews, incident leadership, and key technical decisions.
  • Establish consistent coaching and feedback practices to support engineer growth, particularly in incident management, on-call behavior, and solving systemic issues.
  • Improve on-call procedures and incident response effectiveness, ensuring blameless postmortems lead to concrete engineering actions.
  • Collaborate with a peer SRE engineering manager across time zones to align on organizational practices, hiring strategies, and operational improvements.
  • Support team growth by managing hiring, performance leveling, and career progression for engineers in your region.
  • Manage capacity planning, prioritize engineering work across functional areas, and advocate for SRE priorities in cross-organizational engineering discussions.

Work Arrangement

On-site

Team

Team of 20 engineers organized into three functional areas: bare-metal and day-0/day-2 operations, inference platform, and virtual clusters platform.

Other

Relocation assistance is provided.

About the company
Together AI
Together AI is a research-driven artificial intelligence company that believes open and transparent AI systems will drive innovation. It is on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models, and has contributed to leading open-source research, models, and datasets.
Job Details
Department: Site Reliability Engineering
Category: Management
Posted: 8 days ago