Responsibilities
- Act as the main on-call security contact for the Americas region.
- Manage continuous alert triage, incident handling, and root cause investigations for AI data centers in California, Tennessee, Washington, and other locations.
- Serve as the key security authority during Americas business hours, especially when Singapore headquarters is unavailable.
- Take direct ownership of high-priority incidents such as GPU cluster cryptojacking, ransomware attacks, data exfiltration, and container breakout events.
- Oversee end-to-end incident lifecycle including forensic analysis, containment, and recovery.
- Develop and maintain regional incident response procedures and operational runbooks.
- Work with global security teams to refine SIEM detection logic, automate responses via SOAR, and conduct incident response drills.
- Lead security incident management for customers, including ticket resolution, coordination with customer teams, and alignment with Sales and Customer Success on external messaging.
- Function as the primary escalation point, coordinating incident decisions with Singapore leadership, Legal, and business units during critical events.
- Write and maintain SIEM detection rules using tools like Wazuh, Splunk, or Elastic SIEM to identify threats such as unusual GPU usage, suspicious SSH access, container escapes, Kubernetes API misuse, and InfiniBand anomalies.
- Design detection coverage using the MITRE ATT&CK Cloud and Containers matrices.
- Proactively identify and close gaps in security monitoring coverage.
- Conduct hypothesis-driven threat hunting to uncover hidden threats.
- Execute at least two formal threat hunting campaigns monthly, delivering detailed reports and new detection logic.
- Develop runtime threat detection using eBPF technologies (e.g., Tetragon, Falco, Cilium) to enhance coverage beyond traditional host agents.
- Implement detection-as-code practices in the region, including version control, CI/CD pipelines, testing, and coverage tracking for detection rules.
- Lead security readiness reviews for new AI data centers in the Americas prior to deployment.
- Evaluate security configurations for network perimeters, out-of-band management, BMC/IPMI settings, KVM/QEMU baselines, GPU isolation methods, and InfiniBand key configurations.
- Drive system hardening initiatives including CIS-compliant Linux configurations, auditd setup, SSH security, privileged account controls, and firmware vulnerability tracking.
- Collaborate with engineering teams to deploy eBPF-based runtime monitoring for detecting container escapes and abnormal system calls.
- Monitor and respond to CVEs affecting NVIDIA GPU drivers, CUDA, NCCL, UFM, BMC firmware, and other critical software components.
- Manage vulnerability response and coordinate patching schedules across the Americas region.
- Lead identity and access management efforts, including deployment of jump hosts (Teleport / Boundary), just-in-time access, and session auditing.
- Oversee firewall, intrusion prevention, and web application firewall configurations for all Americas AI data centers.
- Coordinate with DDoS mitigation providers like Cloudflare Magic Transit or Arbor and develop comprehensive regional DDoS response strategies.
Work Arrangement
Remote — California, Tennessee, Washington
Other
- Must participate in a 24/7 on-call rotation during major incidents.
- Must conduct daily cross-time-zone coordination with Singapore HQ (SGT).
- Professional fluency in both English and Mandarin Chinese is required.
- Must be able to communicate effectively in English with US customers, MSSPs, law enforcement, and auditors, and in Mandarin with the Singapore HQ team and management for complex technical discussions and strategic reporting.