Responsibilities
- Ensure high availability, monitoring, and incident management for AI infrastructure, including serving on call for AWS deployment systems, conducting root cause investigations, and leading blameless post-mortem reviews.
- Create automated systems and internal tools to simplify IT operations, reduce manual effort, and speed up deployments within CI/CD and Kubernetes platforms.
- Collaborate with infrastructure teams to enhance CI/CD systems used by IT and enterprise networking groups, and work with security and compliance units to embed monitoring tools into release pipelines.
- Improve system observability and documentation practices by defining performance indicators, deploying monitoring solutions, and producing clear, accurate technical documentation that meets best-in-class standards.
- Design and implement full-stack internal applications for AI platforms in Go or Python.
Work Arrangement
remote-first, not remote-only
Other
- The company is remote-first: remote work is the default, but in-office collaboration is not ruled out.
- Team members meet quarterly for focused, in-person work periods known as 'surges' to drive key initiatives.