Responsibilities
- Build and maintain scalable, highly available systems that handle high-volume workloads.
- Implement monitoring solutions with alerting mechanisms to detect and address system issues proactively.
- Automate repetitive operational tasks, including deployments, monitoring checks, and security policy enforcement.
- Improve system efficiency by identifying performance bottlenecks and applying targeted fixes.
- Manage infrastructure as code using tools such as Terraform, Ansible, or equivalent to ensure reliable and repeatable environments.
- Apply organizational security standards and controls across infrastructure and data systems.
- Evaluate system behavior and performance metrics to forecast capacity requirements.
- Troubleshoot service disruptions, conduct root cause analyses, and deploy corrective measures to prevent recurrence.