Responsibilities
- Build and ship AI agents that serve real users: tool-calling LLM systems with structured output, parallel API orchestration, and streaming responses.
- Design evaluation harnesses and quality scoring: we use Langfuse and rubrics to measure safety, effectiveness, and personalization.
- Own the full loop: prototype a new agent capability, validate it with evals, deploy it to staging and production, monitor traces, and iterate.
- Improve reliability, latency, and cost through prompt caching strategies, token budgets, retry logic, and observability.
- Write the tools agents use: API integrations with Pydantic validation, exercise search over local databases, structured workout submission.
Requirements
- Strong Python skills: you've built and deployed services in large production systems.
- Experience with LangChain/LangGraph or similar agent frameworks.
- Hands-on experience with LLMs in production: prompt engineering, tool/function calling, structured output, evaluation.
- Comfort with async Python, HTTP APIs, and streaming protocols (SSE, webhooks).
- Experience with data validation and schema design (Pydantic, JSON Schema).
- Ability to debug across layers: from a broken LLM tool call to a misconfigured Terraform resource.
- Clear communication: you'll work directly with product, mobile, and backend engineers.
Nice to Have
- Familiarity with AWS (Bedrock, ECR, CloudFront, S3, Cognito) or other cloud platforms for hosting agents.
- Experience with observability and tracing tools (Langfuse, OpenTelemetry, Datadog).
- Exposure to evaluation frameworks: LLM-as-a-judge, automated scoring, dataset management.
- Infrastructure-as-code (Terraform, CDK).
Work Arrangement
Remote (Country) — continental US
Additional Information
- Remote-First Employment: open to all employees located anywhere in the continental US. No travel required.
- Flexible PTO so you can rest, recharge, and take care of life outside of work.
- Future Membership: enjoy our platform for free!