Responsibilities
- Ensure the availability, performance, and scalability of systems and services through proactive monitoring, maintenance, and capacity planning.
- Respond to, analyse, and resolve system outages and disruptions, and implement measures to prevent similar incidents from recurring.
- Develop tools and scripts to automate operational processes, reducing manual workload, increasing efficiency, and improving system resilience.
- Monitor and optimise system performance and resource usage, identify and address bottlenecks, and apply best practices for performance tuning.
- Collaborate with development teams to integrate best practices for reliability, scalability, and performance into the software development lifecycle, and work closely with other teams to ensure smooth and efficient operations.
- Stay informed of industry technology trends and innovations, and actively contribute to the organisation's technology communities to foster a culture of technical excellence and growth.
Requirements
- Hands-on experience with Elastic Stack (Elasticsearch, Kibana, Logstash/Beats).
- Strong understanding of observability and monitoring (metrics, logs, traces, APM).
- Experience defining and configuring dashboards, alerts, and SLIs/SLOs.
- Basic infrastructure-management exposure (capacity planning, performance insights, scaling, monitoring).
Nice to Have
- Experience with DevOps tools: GitLab, TeamCity, CI/CD pipelines.
- Scripting/programming in Python, Java or C#.
- Basic Linux experience.
- Exposure to additional monitoring tools (Grafana, Prometheus, Splunk, etc.).

