As the AI Test Architect, you will shape the future of quality assurance by designing and deploying a next-generation 'Quality Intelligence' system powered by generative AI. This role is central to ensuring the integrity of AI-driven features across a large-scale, cloud-native SaaS platform used by hundreds of thousands of professionals worldwide.
Key Responsibilities
- Design and implement a unified Quality Intelligence platform that leverages generative AI to forecast defect-prone areas, optimize test coverage, auto-generate test cases, and enable self-correcting test execution.
- Define and lead the adoption of an enterprise-wide AI-first testing strategy, including methods for evaluating non-deterministic outputs, monitoring model drift, and detecting hallucinations throughout the development lifecycle.
- Establish ethical and compliance standards for AI testing, aligned with evolving regulatory expectations.
- Develop rigorous evaluation frameworks for internal AI agents and generative features, including red teaming, adversarial testing, and benchmarks focused on bias, prompt injection, jailbreaking, and goal alignment.
- Build statistically sound evaluation pipelines using tools such as Langfuse, LangSmith, DeepEval, RAGAS, or Arize Phoenix, incorporating LLM-as-judge patterns and human-in-the-loop validation (a minimal evaluation sketch follows this list).
- Create test harnesses for agentic behaviors, tool use, planning logic, multi-agent simulations, and runtime observability.
- Integrate AI-powered testing into GitHub-based CI/CD workflows, enabling predictive flakiness detection, automated quality gates, and AI-generated test suites (a gate sketch follows this list).
- Design self-healing test frameworks by combining AI plugins with Playwright or Cypress, reducing maintenance overhead as UIs and models evolve (a locator-fallback sketch follows this list).
- Lead synthetic data generation, curate reference datasets, and implement AI-driven data masking to support high-fidelity, privacy-compliant testing at scale (a masking sketch follows this list).
- Collaborate with product, data science, ML engineering, and security teams to embed quality controls into AI feature development from inception.
- Train and mentor QA teams to adopt AI-augmented testing practices through workshops, documentation, and community initiatives.
- Champion AI quality standards across the organization, including dashboards that track DORA metrics alongside AI-specific indicators such as hallucination rate and red-team success rate.
- Implement telemetry systems for AI quality, including drift detection, faithfulness scoring, and compliance monitoring, integrated with platforms like Langfuse.
- Establish feedback mechanisms for model refinement, A/B testing safeguards, and proactive risk management in production environments.
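To make the evaluation-pipeline responsibility concrete, here is a minimal sketch of an LLM-as-judge check written against DeepEval's pytest-style API; the question, answer, and threshold are illustrative placeholders rather than anything specified in this posting.

    from deepeval import assert_test
    from deepeval.test_case import LLMTestCase
    from deepeval.metrics import AnswerRelevancyMetric

    def test_answer_relevancy():
        # One generated answer, scored by a judge model against the input question.
        case = LLMTestCase(
            input="What is the refund window for annual plans?",    # illustrative
            actual_output="Refunds are available within 30 days.",  # illustrative
        )
        # The metric prompts a judge LLM and fails the test below the threshold.
        assert_test(case, [AnswerRelevancyMetric(threshold=0.7)])

In a production pipeline the same cases would also be logged to a tracing platform such as Langfuse so scores can be trended over time.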
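An automated quality gate can be as simple as a script that fails the pipeline when an AI-specific metric exceeds its budget. In this hedged sketch, eval_results.json, its schema, and the 2% budget are all assumptions made for illustration.

    import json
    import sys

    THRESHOLD = 0.02  # assumed hallucination-rate budget

    # eval_results.json is a hypothetical artifact from an upstream eval job,
    # one record per test case with a boolean "hallucinated" field.
    with open("eval_results.json") as f:
        results = json.load(f)

    rate = sum(r["hallucinated"] for r in results) / len(results)
    print(f"hallucination rate: {rate:.2%}")
    sys.exit(0 if rate <= THRESHOLD else 1)  # non-zero exit blocks the merge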
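Self-healing execution can be approximated as an ordered locator fallback: in a real framework an AI plugin would propose the candidate selectors from the live DOM, whereas here they are hard-coded. A minimal Playwright-for-Python sketch, where resilient_click and its selector list are hypothetical names:

    from playwright.sync_api import Page, TimeoutError as PlaywrightTimeout

    def resilient_click(page: Page, selectors: list[str], timeout_ms: int = 3000) -> str:
        # Try each candidate locator in order; return the one that worked so the
        # framework can record which selector "healed" the step.
        for selector in selectors:
            try:
                page.locator(selector).click(timeout=timeout_ms)
                return selector
            except PlaywrightTimeout:
                continue
        raise AssertionError(f"No candidate locator matched: {selectors}")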
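For the masking responsibility, a deterministic sketch using the Faker library shows the basic shape: known PII fields are swapped for realistic synthetic values while everything else passes through. mask_record and the field names are illustrative.

    from faker import Faker

    fake = Faker()
    PII_FIELDS = {"name": fake.name, "email": fake.email, "phone": fake.phone_number}

    def mask_record(record: dict) -> dict:
        # Replace each known PII field with a synthetic value of the same kind;
        # non-PII keys are returned unchanged.
        return {k: PII_FIELDS[k]() if k in PII_FIELDS else v for k, v in record.items()}

An AI-driven version would classify fields as PII automatically instead of relying on a fixed field list.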
Required Qualifications
- Minimum of 8 years of experience in Quality Engineering or Test Architecture within cloud-native SaaS environments.
- At least 2 years focused on testing AI, ML, or LLM-based systems.
- Strong technical foundation in AWS, including serverless architectures, microservices, and infrastructure-as-code using Terraform or CloudFormation.
- Hands-on experience with GitHub Actions and the broader GitHub CI/CD ecosystem.
- Proven ability to architect and test LLM-powered applications using LangChain, LangGraph, LangSmith, or similar frameworks.
- Expertise in modern test automation tools such as Playwright or Cypress, with practical experience integrating AI-based self-healing capabilities.
- Proficiency in JavaScript/TypeScript and/or Python.
- Firm grasp of core AI concepts including transformers, embeddings, RAG architectures, and evaluation trade-offs.
- Experience with LLM evaluation platforms and capabilities such as Amazon Bedrock (Evaluations, Prompt Management, Guardrails), DeepEval, RAGAS, Arize Phoenix, or Langfuse.
- Track record of technical leadership and cross-functional influence.
Preferred Qualifications
- Background in offensive security red teaming using tools such as Cobalt Strike, Sliver, or Nmap.
- Familiarity with adversarial testing methodologies and security-focused evaluation techniques.
Technology Environment
AWS, Terraform, CloudFormation, GitHub Actions, LangChain, LangGraph, LangSmith, Playwright, Cypress, JavaScript, TypeScript, Python, Amazon Bedrock (Evaluations, Prompt Management, Guardrails), DeepEval, RAGAS, Arize Phoenix, Langfuse, Cobalt Strike, Sliver, Nmap.
Work Mode
This is a fully remote position open to candidates based in Colombia.