Responsibilities
- Design and implement automation for repetitive operational tasks, deployments, and infrastructure management, using scripting, CI/CD pipelines, and infrastructure-as-code tools to minimize errors and boost efficiency.
- Proactively enhance system availability by detecting performance bottlenecks and refining architectural components.
- Evaluate and forecast infrastructure needs such as CPU, memory, and storage to support growing user demand and ensure scalability.
- Develop and maintain auto-scaling mechanisms to manage traffic surges automatically while preserving system responsiveness.
- Monitor system health, reliability, and performance using observability platforms such as Prometheus, Grafana, or Elasticsearch.
- Respond swiftly to system outages, incidents, and failures with the goal of minimizing service disruption.
- Oversee incident documentation, stakeholder communication, and post-incident reviews to ensure transparency and accountability.
- Perform root cause analyses after system failures to determine underlying issues and prevent recurrence.
- Apply insights from incident reviews to implement corrective actions, system enhancements, and reliability best practices.
Requirements
- Minimum of three years of hands-on experience in IT operations or systems administration.
- Exceptional communication skills: articulate clearly, adapt tone for diverse audiences, resolve conflicts constructively, and communicate effectively under pressure in both German and English.
- Strong collaboration skills: work effectively across development, operations, and business teams to align goals and drive company-wide solutions.
- Passion for automation: champion automated workflows to reduce manual effort and improve team productivity.
- Analytical problem-solving ability: investigate complex technical issues, identify root causes, and develop long-term fixes.
- High degree of self-management: thrive in remote settings, organize complex tasks independently, and maintain focus without supervision.
- Proficiency with container orchestration and container technologies, particularly Kubernetes.
- Experience with CI/CD tools such as GitHub Actions workflows, and with infrastructure deployment tools such as Helm and Kustomize.
- Familiarity with cloud platforms, preferably Google Cloud, though experience with other providers is acceptable.
- Fluent written German with strong spelling and grammar skills.
Nice to Have
- Prior experience working with PHP-based applications.
Benefits
- Opportunity to shape innovative projects within a collaborative and forward-thinking team.
- Full flexibility in scheduling and work location.
- Option to work from home or from partner coworking spaces, provided you have a stable internet connection.
- Ongoing access to professional development and training opportunities.
- Employment with a profitable, self-funded German tech company that does not depend on external investors.
- Teams focused on outcomes with a culture of open, direct feedback.
- Provision of modern hardware: choice between ThinkPad or MacBook.
- Part of a cohesive, international team with members from over 40 countries.
- Annual team gatherings in various European locations.
- Immediate autonomy and responsibility from day one.
- Employer contribution to retirement savings.
- Informal workplace culture: first-name basis, no dress code, and interaction on an equal footing.
- Flexible working hours Monday through Friday with core hours from 10 AM to 4 PM.
Work Arrangement
100% remote
Your new dream job
- Automate operational workflows, deployments, and infrastructure management using code to reduce human error and increase efficiency through scripting, CI/CD pipelines, and infrastructure-as-code.
- Continuously improve system reliability and availability by identifying performance constraints and optimizing system design.
- Forecast infrastructure capacity needs (CPU, memory, storage) to ensure systems scale effectively with demand, and implement auto-scaling solutions to handle traffic fluctuations without manual input.
- Monitor system performance, availability, and health using tools such as Prometheus, Grafana, or Elasticsearch. Respond proactively to issues and manage incidents to reduce user impact. Handle incident documentation, communication, and follow-up analysis.
- Conduct root cause investigations after incidents to understand failures and implement changes that improve system resilience and prevent recurrence.
Your typical day at Digistore24
- Start the day with a team video call to review progress and plan tasks.
- Approach work in a structured manner, setting clear daily goals and routines.
- Dedicate time to advancing Site Reliability Engineering processes.
- Collaborate with team members who provide support and guidance.
- Participate in a daily stand-up to share priorities and blockers and receive actionable feedback.
- Switch off distractions for a while and focus deeply on improving auto-scaling, monitoring, and alerting.
- Test proposed solutions in real environments.
- Document successful approaches for discussion with the Head of IT Operations.
- Help a developer design a new CI/CD pipeline: discuss the requirements together, then deliver a prototype.
- Review and adjust resource allocation for an application based on current usage metrics.
- Spot an endpoint that is not yet monitored, create a ticket, and add monitoring for it with Terraform code.
This position is NOT for you if ...
- You do not align with our company values.
- You have less than three years of experience in IT operations.
- You are unable to take ownership and require constant approval for decisions.
- You struggle with planning and prioritizing your own workload.
- You are not motivated by solving complex technical challenges.
Our values
Carefully consider our core values. Are you committed to living them every day?
Other
- Initial contract duration is 12 months.
- Must align with company values.
- Must be capable of working remotely with strong self-organization.
- Must be able to independently plan and prioritize tasks.
- Must enjoy tackling complex technical problems.
- Must be willing to take initiative without consulting others on every detail.
- Must be fluent in German, with excellent spelling and grammar.
- Ability to switch between German and English as needed.
- No formal dress code.
- First-name basis across the organization.
- Core working hours: 10 AM to 4 PM, Monday to Friday.
- Modern equipment provided: ThinkPad or MacBook.