Responsibilities
- Manage the full deployment lifecycle of inference workloads, including setup, optimization, SLA adherence, and incident resolution.
- Deliver quantifiable gains in token generation speed, latency, and cost efficiency across various model types and usage patterns.
- Develop and maintain key infrastructure for KV cache management and request scheduling to improve system throughput.
- Design and validate split prefill/decode processing pipelines along with scalable Kubernetes-based orchestration.
- Identify and eliminate performance constraints across compute, memory, and inter-process communication layers; implement comprehensive monitoring.
- Collaborate with clients to align deployment strategies and platform enhancements with their model designs and performance needs.
- Influence platform evolution by contributing to architectural decisions focused on simplifying deployments, boosting hardware efficiency, and enabling new model support.
- Join a rotating on-call schedule, covering up to one week per month, to ensure system stability and meet service level objectives.
Compensation
$165,000 – $350,000 base salary annually, with potential equity through stock options.
Work Arrangement
Not specified
Team
Not specified
Other
- Base salary range is $165,000 – $350,000 per year, based on experience, skills, qualifications, and location.
- Total compensation may include equity in the form of stock options.
- The employer is an Equal Employment Opportunity employer.
- Applicants with arrest and conviction records will be considered in accordance with applicable laws.
- A confirmation email will be sent upon successful application submission.
- If no confirmation is received, email careers@fluidstack.io with your resume/CV, the role applied for, and the submission date to request follow-up.