Ensure high availability, monitoring, and incident management for AI infrastructure, including taking on-call duty for AWS deployment systems, conducting root cause investigations, and leading blameless post-mortem reviews.
Create automated systems and internal tools to simplify IT operations, reduce manual effort, and speed up deployments within CI/CD and Kubernetes platforms.
Collaborate with infrastructure teams to enhance CI/CD systems used by IT and enterprise networking groups, and work with security and compliance units to embed monitoring tools into release pipelines.
Improve system observability and documentation practices by establishing performance indicators, deploying monitoring solutions, and producing clear, accurate technical documentation held to a high standard.
Design and implement full-stack internal applications for AI platforms in Go or Python.

Remote-first, not remote-only

The company operates with a remote-first policy: remote work is the default, but in-office collaboration is not ruled out.
Team members meet quarterly for focused, in-person work periods known as "surges" to drive key initiatives.

Coinbase is hiring a Senior Site Reliability Engineer, Core AI Infrastructure