We are looking for an AI Quality & Evaluation Engineer to own quality planning and execution for an AI-powered chat application operating over complex law enforcement and mobile device data.
This is a highly hands-on role focused on execution rather than high-level QA strategy. You will design, build, and run automated and semi-automated tests for LLM-driven workflows, create evaluation datasets, and continuously stress the system with realistic and extreme investigative scenarios.
What you will be doing:
Design, plan, and execute quality tests for an AI chat application built on LLMs and investigative data.
Build and maintain automation frameworks for prompt regression testing, multi-turn conversations, and model upgrades.
Create and curate evaluation datasets used for regression testing, benchmarking, and model comparison.
Design complex investigative scenarios including ambiguous, incomplete, or conflicting datasets.
Execute manual exploratory testing to uncover hallucinations, reasoning failures, and edge cases.
Work closely with engineering, product, and data teams as part of the development lifecycle.
Validate release readiness and identify regressions related to prompts, models, or data pipelines.
What makes this role different:
You are evaluating AI behavior rather than fixed expected outputs.
You help define what correctness and quality mean for AI reasoning over sensitive data.
You actively invent scenarios to stretch and break the system.
Your work directly impacts trust, reliability, and investigative confidence.
Requirements:
What you should bring:
5+ years of experience in QA, test automation, or validation engineering.
Strong hands-on experience building automated tests.
Experience testing complex, data-heavy systems.
Familiarity with API testing tools (e.g., Postman).
Strong analytical, debugging, and problem-solving skills.
High attention to detail combined with the ability to see the bigger picture.
Excellent English, written and spoken.
Nice to have:
Experience testing AI, ML, or LLM-based systems.
Experience with prompt testing or NLP evaluation techniques.
Experience building synthetic or semi-synthetic datasets.
Experience working with databases (SQL).
This position is open to all candidates.