Responsibilities
- Respond promptly to live system issues, conduct initial diagnosis, resolve problems, and assist in post-event reviews.
- Detect repetitive manual tasks and implement automation to increase efficiency and minimize human error.
- Improve monitoring systems and observability platforms to enhance system visibility and alerting accuracy.
- Champion reliability, scalability, and performance standards throughout engineering teams.
- Develop and manage cloud infrastructure automation using tools such as Terraform, Ansible, or CloudFormation.
- Follow change control procedures to maintain high availability and minimize disruptions to production environments.
- Take part in an on-call schedule, delivering round-the-clock incident support and contributing to root cause investigations.
- Collaborate with software engineers, system architects, external vendors, and IT staff to ensure stable operations.
- Investigate and resolve security weaknesses in coordination with cybersecurity specialists.
- Keep up-to-date documentation for systems, monitoring setups, operational runbooks, and incident protocols.
Benefits
- Comprehensive medical coverage, fully funded for the employee
- 401(k) plan with employer matching
- Equity option grants
- Unlimited paid time off plus designated company holidays
- Wellness-focused programs and resources
Work Arrangement
Hybrid