What You'll Do
Take ownership of system reliability for large-scale, distributed FinTech applications. You'll design and manage infrastructure that ensures consistent uptime, fast response times, and seamless scalability under heavy load.
Partner with development teams to embed reliability practices into every phase of the software lifecycle. Help shape deployment strategies, improve system resilience, and reduce incident frequency through proactive engineering.
Lead post-incident reviews to uncover root causes and implement lasting fixes. Build and maintain monitoring and alerting systems using Prometheus and Grafana to detect issues before they impact users.
Automate repetitive operational tasks using Terraform and Ansible, reducing manual effort and minimizing human error. Support continuous delivery by maintaining GitOps deployment pipelines with ArgoCD or FluxCD.
Participate in on-call rotations to respond to critical incidents, ensuring rapid resolution and minimal service disruption. Uphold security and compliance standards across all infrastructure components.
Requirements
- 2 to 4 years of hands-on experience with distributed systems, with a focus on reliability, scalability, and performance
- Proven expertise in Kubernetes and containerized environments
- Deep experience with infrastructure-as-code tools such as Terraform and Ansible
- Strong command of CI/CD and GitOps tools such as ArgoCD or FluxCD
- Extensive experience with monitoring solutions like Prometheus and Grafana
- Proficiency with cloud platforms including AWS, GCP, and Oracle Cloud
- Advanced skills in managing MongoDB and PostgreSQL databases
- Strong problem-solving abilities and a collaborative mindset
Technical Stack
Our environment runs on Kubernetes and containers across AWS, GCP, and Oracle Cloud. We use Terraform and Ansible for infrastructure as code, ArgoCD and FluxCD for GitOps delivery, Prometheus and Grafana for monitoring and alerting, and MongoDB and PostgreSQL for data storage. You'll work across these technologies to maintain and enhance system performance and reliability.
