Responsibilities
- Oversee configuration and performance tuning of HPC clusters and Linux computation servers.
- Help resolve cluster performance bottlenecks and assist in structuring HPC workloads.
- Deploy, configure, and maintain Linux systems using HPC-oriented software tools.
- Enforce standardized security policies across all Linux-based infrastructure components.
- Set up and operate monitoring and system management solutions for cluster environments.
- Identify and implement automation strategies for deployment and configuration processes.
- Deliver expert guidance and support for Linux system deployments.
- Compile, install, and maintain Linux applications from source code on HPC clusters.
Requirements
- Minimum of five years of experience managing production Linux servers based on RHEL or similar distributions.
- Familiarity with widely used open-source tools including NGINX, PostgreSQL, MariaDB, and Git.
- Working knowledge of core network services such as DHCP, DNS, and NTP.
- At least three years of hands-on experience with Ansible or other infrastructure-as-code configuration tools.
- Ability to create and customize operating system images from the ground up.
- Proficiency in compiling software from source, with understanding of dependencies, libraries, and linking mechanisms.
- Proven experience managing large on-premises infrastructure remotely, including servers, storage, and networking equipment.
- Strong verbal and written communication abilities.
- Demonstrated expertise in optimizing Linux system performance.
Nice to Have
- Direct experience operating and managing production workloads in AWS cloud environments.
- Hands-on use of HPC cluster management tools such as Bright Cluster Manager, Slurm, Univa Grid Engine, EasyBuild, or Spack.
- Practical skills in scripting with Python, R, or comparable programming languages.