UP.Labs is seeking a Sr. AI Quality Engineer to own end-to-end quality for our AI-powered inference system. This hybrid AI QA and Product Analyst role sits at the intersection of LLM inference, event-driven backend state machines, and freight domain logic. You will define what "correct" means, build the systems to measure and enforce it, and lead deep-dive investigations into edge cases and failures.
What You'll Do
- Own end-to-end system quality for our AI-powered freight audit platform.
- Develop and maintain a quality rubric for key use cases and exception types.
- Build and curate golden datasets, including customer-specific variations.
- Own ongoing quality review in development and production: inspect high-volume outputs, diagnose failures, and convert discoveries into roadmap items.
- Define and execute regression tests for new model changes, backend logic changes, or customer-specific use cases.
- Investigate and diagnose issues across the full product stack, from email ingestion to final reporting.
- Triage quality incidents by tracing through logs, event histories, and data queries to isolate root cause.
- Produce high-signal findings reports with minimal reproduction steps, evidence, and recommended fixes.
- Build scalable quality operations, including a repeatable triage playbook and classification system.
- Define monitoring and dashboards for key quality signals like volume anomalies and exception drift.
- Partner with engineering and AI teams to improve system observability and traceability.
- Act as a product and domain translator, understanding freight billing workflows and converting customer requirements into testable rules.
- Identify systemic gaps where real-world data doesn't fit our schema and propose product changes.
What We're Looking For
- Experience in roles that blend quality assurance, investigation, and systems thinking.
- Demonstrated experience evaluating AI/LLM output quality for tasks like extraction, classification, and structured outputs.
- Strong technical ability to debug production issues: tracing through logs with tools like Datadog, ELK, or Honeycomb; analyzing and reproducing problems with SQL and/or Python; and reasoning about event-driven architectures and workflow state machines.
- Ability to write crisp requirements and acceptance criteria, translating ambiguity into concrete test cases.
- Comfort operating in messy, high-volume, edge-case-heavy environments.
Nice to Have
- Freight, logistics, audit, or billing domain experience.
- Experience designing evaluation metrics like precision/recall, drift detection, and per-customer scorecards.
- Familiarity with workflow engines, state machines, and distributed systems failure modes.
- Experience with annotation workflows, taxonomy design, and building human-in-the-loop QA processes.
Technical Stack
- SQL, Python
- Datadog, ELK, Honeycomb, OpenTelemetry/Jaeger
Team & Environment
Our culture values high ownership: you don't stop at identifying a problem; you drive it to root cause and resolution. We operate comfortably with ambiguity and edge cases, building clarity systematically. You'll need to communicate effectively across product, engineering, machine learning, and operations teams.