About the Role
This role focuses on maintaining and improving the reliability, scalability, and security of production systems through automation, monitoring, and incident response.
Responsibilities
- Design and manage scalable cloud infrastructure using infrastructure-as-code (IaC) tools
- Implement and maintain CI/CD pipelines for automated deployments
- Monitor system performance and respond to incidents promptly
- Ensure high availability and fault tolerance across services
- Collaborate with development teams to optimize application performance
- Enforce security best practices in infrastructure and deployment workflows
- Manage containerized environments using Kubernetes or similar platforms
- Automate operational tasks to reduce manual intervention
- Maintain comprehensive logging and observability systems
- Support disaster recovery planning and execution
- Optimize cloud resource usage and cost efficiency
- Conduct root cause analysis for production incidents
- Improve system reliability through proactive monitoring
- Participate in on-call rotations for critical systems
- Document architecture, configurations, and operational procedures
Nice to Have
- Certifications in cloud or DevOps technologies
- Experience supporting high-traffic web applications
- Knowledge of service mesh architectures
- Background in security engineering
- Familiarity with regulatory requirements for data protection
Compensation
Competitive salary based on experience and location
Work Arrangement
Remote-friendly, with hybrid or office-based work possible depending on location
Team
Collaborative engineering environment focused on reliability, automation, and continuous improvement
Our Tech Stack
We use AWS for cloud infrastructure, Kubernetes for orchestration, Terraform for provisioning, and Prometheus and Grafana for monitoring. Our pipelines are powered by GitLab CI, and we enforce security through automated scanning and compliance checks.
Why This Role Matters
System reliability directly impacts donor experience and fundraising success. This role ensures our platform remains resilient, secure, and efficient under growing demand.
Available for qualified candidates in certain regions