Jobgether is hiring a Senior Engineer, Production Operations to shape and maintain highly reliable, scalable, and performant cloud systems supporting mission-critical services in our fast-growing fintech environment. This remote role focuses on driving operational excellence through automation, Infrastructure as Code, and robust monitoring while collaborating closely with development and security teams.
What You'll Do
- Design, implement, and maintain core cloud infrastructure and Site Reliability Engineering practices to ensure high availability and performance.
- Develop and optimize cloud infrastructure using Infrastructure as Code tools, primarily Terraform, and automation platforms.
- Collaborate with development and security teams to integrate SRE principles into the software development lifecycle.
- Design and manage monitoring, logging, and alerting solutions to provide clear visibility into system health.
- Participate in incident response, conduct root cause analyses, and contribute to blameless postmortems.
- Identify and implement architectural improvements to enhance system reliability, resilience, and efficiency.
- Automate operational tasks and processes to reduce toil and improve productivity.
- Research, evaluate, and advocate for new tools or technologies to improve operational posture.
- Enhance engineering tooling, processes, and standards for consistent and repeatable application delivery.
What We're Looking For
- 5+ years of experience in Site Reliability Engineering, Production Operations, or similar roles focused on cloud infrastructure and distributed systems.
- Proven experience architecting and maintaining highly available, secure, and scalable systems in a public cloud environment (AWS preferred).
- Strong proficiency with Infrastructure as Code tools, particularly Terraform.
- Experience automating operational tasks using scripting languages (Python, Go, Bash) and automation platforms.
- Expertise in monitoring, logging, and alerting solutions (Datadog, Prometheus, Grafana, ELK stack).
- Solid understanding of incident response best practices and troubleshooting complex production issues.
- Knowledge of distributed systems, microservices architectures, and containerization technologies (Docker, Kubernetes/EKS).
- Exceptional analytical, problem-solving, and collaboration skills, with the ability to communicate technical concepts clearly to both technical and non-technical stakeholders.
- Passion for improving system reliability, performance, and operational efficiency.
Nice to Have
- Experience with payments infrastructure or high-volume transactional systems.
- Familiarity with database technologies (PostgreSQL, Cassandra, DynamoDB).
- Experience with CI/CD pipelines and automation of software delivery.
Technical Stack
- Infrastructure as Code: Terraform
- Cloud: AWS
- Languages/Scripting: Python, Go, Bash
- Monitoring/Observability: Datadog, Prometheus, Grafana, ELK stack
- Containers/Orchestration: Docker, Kubernetes/EKS
- Databases: PostgreSQL, Cassandra, DynamoDB
Benefits & Compensation
- Competitive salary with market-based adjustments depending on location and experience.
- Discretionary performance bonus and equity rewards.
- Medical, dental, and vision coverage, plus an HSA match.
- Paid life insurance, AD&D, and disability benefits.
- Traditional 401(k) plan with company match.
- Unlimited PTO and paid company holidays, including pop-up bonus holidays.
- Professional development stipends and mental health resources.
- Fertility healthcare support and 100% paid parental and caregiving leave with additional home support services.
- Flexible work arrangements, with remote or in-office options.
- Fully stocked office kitchen, catered lunches, and occasional in-office events.
- Employee resource groups promoting inclusion and collaboration.
Work Mode
This is a remote position open to candidates located within the United States.
Jobgether is an equal opportunity employer.



