Pentagon, IC want industry to provide an ‘evaluation harness’ to standardize testing of AI systems

https://defensescoop.com/2026/03/11/ai-system-testing-dod-intelligence-agencies/

Publish Date: 2026-03-11 11:57:00

Source Domain: defensescoop.com

The Defense Department and Intelligence Community are on the hunt for an “evaluation harness” to test vendors’ AI technologies for government use.

The Pentagon’s Defense Innovation Unit, headquartered in Silicon Valley, released a solicitation Wednesday for the effort, dubbed “MYSTIC DEPOT,” which will be pursued via a commercial solutions opening contracting mechanism.

The release comes as Defense Secretary Pete Hegseth and Pentagon CTO Emil Michael are pushing the department to accelerate the widespread integration of artificial intelligence capabilities for warfighting and back-office functions.

To keep pace with the fast-moving field of AI, agencies need to be able to assess new models against defined benchmarks as they are released.

“The Department of War (DoW), in partnership with the Office of the Director of National Intelligence (ODNI), seeks an evaluation harness and government-specific benchmarks that together enable rigorous, reproducible, vendor-agnostic assessment of any AI system against government-defined criteria,” officials wrote in the solicitation, using a secondary name authorized by the Trump administration to refer to the Department of Defense. “The Government intends to use this harness across multiple programs. Solutions should be designed for broad applicability rather than single-program optimization.”

For the benchmark development portion of the program, the government seeks solutions from vendors that can be applied across unclassified, secret and top secret workflows, with a methodology that addresses requirements elicitation, task decomposition, input design, scoring criteria development, baseline establishment, validation, maintenance and “gaming resistance.”

For the evaluation harness, agencies want an “integrated infrastructure of an execution environment, tooling, and methodology” for AI system assessment that’s deployable across…
