Microsoft says its MDASH agentic security system has moved beyond benchmark testing and into active use across Windows, Azure, and identity security workflows, where it is being used to find, validate, prioritize, and help remediate software vulnerabilities.
The June 17 update matters because it is not just another claim that an AI system scored well on a security benchmark. In a new Microsoft Security post, Taesoo Kim, Microsoft’s vice president of agentic security, describes MDASH as a multi-model scanning system now plugged into Microsoft’s own engineering and security tools. Validated findings can flow into GitHub Advanced Security, Azure DevOps, and Microsoft Defender rather than sitting in a separate research report.
That workflow detail is the important shift. AI-assisted bug finding is becoming less about a model answering a challenge and more about whether an organization can connect discovery, exploitability validation, ownership, pull requests, production risk signals, and remediation. Finding a bug faster only helps if the result reaches the team that can fix it with enough context to act.
What Microsoft says MDASH found this month
Microsoft’s latest Patch Tuesday cohort includes MDASH-assisted discoveries across Windows Hyper-V, the Windows kernel, Active Directory Domain Services, Remote Desktop Client, HTTP.sys, DNS Client, and DHCP Client. The listed vulnerabilities include remote code execution, elevation of privilege, and information disclosure issues.
The highest-severity examples in the June 17 disclosure include CVE-2026-45657, a Windows kernel use-after-free remote code execution flaw with a 9.8 CVSS score, and CVE-2026-47291, an HTTP.sys integer overflow remote code execution flaw also scored at 9.8. The set also includes Hyper-V remote code execution issues, an Active Directory Domain Services stack-based buffer overflow, and a Remote Desktop Client heap-based buffer overflow.
Microsoft frames these as pre-exploitation discoveries in difficult areas of the codebase, not as evidence of attacks in the wild. That distinction matters for security teams reading the announcement: the point is not that these CVEs should replace normal Patch Tuesday prioritization, but that AI-assisted systems are starting to reach the kinds of kernel, virtualization, directory, and protocol surfaces that normally require deep manual review.
How the system works differently from a single-model scanner
MDASH is short for Microsoft Security’s multi-model agentic scanning harness. Microsoft first detailed the system in May, when it said MDASH had helped researchers find 16 Windows vulnerabilities, including four Critical remote code execution flaws across the Windows networking stack and related services. That earlier technical disclosure described a pipeline with specialized agents, configurable model panels, domain plugins, validation, deduplication, and proof stages.
The June update adds more detail about where Microsoft believes the system improved. According to the company, the biggest gains came early in the pipeline, especially in the prepare and scan stages. MDASH now draws a sharper boundary between the code under audit and its surrounding context, builds a fuller threat model around untrusted input paths, uses a more reliable call graph for reachability analysis, and routes work more selectively to specialized agents.
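To make the reachability idea concrete, here is a minimal sketch of how a call graph can be used to check whether a suspicious function is reachable from an untrusted input path. It is an illustration only, not Microsoft's implementation: the graph, the function names, and the `triage` helper are all invented for this example.

```python
from collections import deque

# Hypothetical call graph (caller -> callees). Every name here is invented.
CALL_GRAPH = {
    "recv_packet": ["parse_header", "log_event"],
    "parse_header": ["copy_payload"],
    "copy_payload": [],
    "log_event": [],
    "admin_console": ["rotate_keys"],
    "rotate_keys": [],
}


def reachable_from(graph, entry_points):
    """Return every function reachable from the entry points (breadth-first)."""
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        fn = queue.popleft()
        for callee in graph.get(fn, []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


def triage(candidates, graph, untrusted_entries):
    """Mark each candidate finding as reachable (or not) from untrusted input."""
    reachable = reachable_from(graph, untrusted_entries)
    return {fn: fn in reachable for fn in candidates}


# Assumption: "recv_packet" is the only entry point that handles untrusted input.
result = triage(["copy_payload", "rotate_keys"], CALL_GRAPH, ["recv_packet"])
print(result)  # {'copy_payload': True, 'rotate_keys': False}
```

In this toy setup, a candidate bug in `copy_payload` would be kept because attacker-controlled data can reach it, while one in `rotate_keys` would be deprioritized. A real call graph also has to handle indirect calls, callbacks, and generated code, which is where reliability becomes hard.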
Those details are more useful than the headline score because they explain what security teams should look for in AI vulnerability tools. A stronger foundation model can improve results, but Microsoft’s argument is that the durable value is the harness around the model: scope files, domain-specific plugins, validation logic, proof generation, and integrations that carry forward as models change.
The CyberGym score is impressive, but not the whole story
Microsoft says the latest MDASH version reached 96.5% on CyberGym under an “any crash” measure, up from the earlier public benchmark result. CyberGym is a large-scale evaluation built around real-world vulnerability reproduction. Its own benchmark description says it includes 1,507 vulnerabilities across 188 open-source projects, including networking, cryptography, operating-system kernel, and multimedia software.
The score is a strong signal that agentic systems can reason through known vulnerability-reproduction tasks at a level that would have looked unrealistic a short time ago. Microsoft also tested newer model configurations on the 52 cases MDASH missed, reporting projected improvements to roughly 97.8% to 98.1% if those gains held without regressions.
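The reported numbers can be sanity-checked with simple arithmetic. The sketch below assumes the 96.5% score was computed over all 1,507 CyberGym tasks and that the 52 misses are the complete set of failures. The article's figures are consistent with that assumption, but it is not confirmed here.

```python
TOTAL_TASKS = 1507  # CyberGym size, per its benchmark description
MISSED = 52         # cases Microsoft says MDASH missed

solved = TOTAL_TASKS - MISSED
baseline = solved / TOTAL_TASKS
print(f"baseline: {baseline:.1%}")  # baseline: 96.5%


def rate_after(recovered):
    """Score if `recovered` previously missed tasks are solved with no regressions."""
    return (solved + recovered) / TOTAL_TASKS


def recoveries_matching(reported_pct):
    """Recovered-task counts whose score rounds to the reported percentage string."""
    return [k for k in range(MISSED + 1)
            if f"{rate_after(k) * 100:.1f}" == reported_pct]


print(recoveries_matching("97.8"))  # [19]
print(recoveries_matching("98.1"))  # [23, 24]
```

Under these assumptions, the projected range corresponds to recovering roughly 19 to 24 of the 52 missed tasks, which leaves well over half of the misses unsolved even in the optimistic case.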
Still, the miss analysis is the more practical part for defenders. Of the 52 missed CyberGym tasks, Microsoft attributed 34 to the prove stage, where the system failed to generate a working proof of concept. Ten were validate-stage failures, where intended findings were rejected as false positives, and eight were scan-stage misses. In other words, noticing suspicious code is no longer the main bottleneck. The harder work is proving reachability, handling complex input formats, matching the evaluation or build environment, and turning a candidate finding into a reproducible security issue.
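A quick breakdown shows how lopsided the failure distribution is. The counts below come straight from the figures above, and the percentages are derived from them.

```python
# Missed CyberGym tasks by pipeline stage, as reported in the article.
MISSES_BY_STAGE = {"prove": 34, "validate": 10, "scan": 8}

total_missed = sum(MISSES_BY_STAGE.values())
shares = {stage: count / total_missed for stage, count in MISSES_BY_STAGE.items()}

print(total_missed)  # 52
for stage, share in shares.items():
    print(f"{stage:<8} {share:.1%}")
# prove    65.4%
# validate 19.2%
# scan     15.4%
```

Roughly two out of three misses happened after the system had already found and accepted a candidate, at the point where it had to produce a working proof.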
Microsoft’s examples are concrete. Structured file formats such as fonts, PDFs, IVF/AV1 video, and WPG images can make proof generation difficult because inputs must satisfy format validation before they reach the vulnerable code path. Build complexity and environment mismatches can also stop an agent from producing a proof even when the underlying reasoning is close.
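A toy example helps show why structured formats are hard for proof generation. The format below is invented for illustration (the `TOYF` magic bytes, the length field, and the checksum are all assumptions, and it is far simpler than a real font or PDF parser), but it captures the basic problem: an input that fails validation never reaches the deeper code path.

```python
import random
import struct

MAGIC = b"TOYF"  # invented file format, not a real one


def parse(data: bytes) -> str:
    """Return how far the input got: 'rejected' or 'deep_path'."""
    if len(data) < 7 or data[:4] != MAGIC:
        return "rejected"
    (length,) = struct.unpack(">H", data[4:6])  # big-endian payload length
    payload = data[6:-1]
    if len(payload) != length:
        return "rejected"
    if sum(payload) % 256 != data[-1]:  # trailing one-byte checksum
        return "rejected"
    return "deep_path"  # stands in for the code a proof would need to reach


def build_valid(payload: bytes) -> bytes:
    """Build an input that satisfies every validation check above."""
    checksum = bytes([sum(payload) % 256])
    return MAGIC + struct.pack(">H", len(payload)) + payload + checksum


# Format-unaware attempt: 10,000 random 16-byte inputs.
rng = random.Random(0)
random_hits = sum(
    parse(bytes(rng.getrandbits(8) for _ in range(16))) == "deep_path"
    for _ in range(10_000)
)

# Format-aware attempt: one input built to the rules.
aware_result = parse(build_valid(b"hello"))
print(random_hits, aware_result)
```

Random inputs have about a one-in-four-billion chance of even matching the magic bytes, while a single format-aware input gets through. Real formats stack many such checks, so an agent has to understand the structure before it can demonstrate a crash.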
Why this matters for enterprise security teams
For enterprises, MDASH points toward a new security tooling question: not simply “which model found the bug,” but “how does the system move from code analysis to triage and fix?” Microsoft says MDASH findings can appear as code scanning alerts in GitHub Advanced Security, surface in Azure DevOps as pipeline gates and work items, and show up in Microsoft Defender with threat intelligence and runtime signals.
That matters because large organizations already struggle with alert volume. AI vulnerability discovery could make that better by finding deeper issues earlier, or worse by creating a flood of speculative reports. The difference will come from validation, prioritization, access controls, evidence quality, and whether remediation work lands inside the tools developers and security teams already use.
Microsoft also tied MDASH to its broader Build 2026 security announcements. In a June 2 post, the company said the expanded MDASH preview includes Microsoft Defender integration and that GitHub Code Security integration can enrich code findings with production signals such as internet exposure and data sensitivity. That is the enterprise control layer buyers should watch: whether AI-discovered vulnerabilities become ranked, owned engineering work rather than another dashboard.
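As a rough illustration of what ranking findings with production signals could look like, consider the sketch below. The `Finding` fields, the multipliers, and the `priority` function are hypothetical and chosen only for this example. Microsoft has not published how Defender or GitHub weigh these signals.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    id: str
    cvss: float
    validated: bool          # did the finding survive validation?
    internet_exposed: bool   # hypothetical production signal
    sensitive_data: bool     # hypothetical production signal


def priority(f: Finding) -> float:
    """Toy ranking: unvalidated findings score zero; signals scale the CVSS base."""
    if not f.validated:
        return 0.0
    score = f.cvss
    if f.internet_exposed:
        score *= 1.5  # assumed weight, for illustration only
    if f.sensitive_data:
        score *= 1.2  # assumed weight, for illustration only
    return score


findings = [
    Finding("A", cvss=9.8, validated=True, internet_exposed=False, sensitive_data=False),
    Finding("B", cvss=7.5, validated=True, internet_exposed=True, sensitive_data=True),
    Finding("C", cvss=9.0, validated=False, internet_exposed=True, sensitive_data=True),
]

ranked = sorted(findings, key=priority, reverse=True)
print([f.id for f in ranked])  # ['B', 'A', 'C']
```

In this sketch a lower-severity bug on an internet-facing service that handles sensitive data outranks a higher-severity bug on an isolated one, and a speculative report never jumps the queue. Any real weighting would need to be tuned against an organization's own risk model.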
The next test is real-world ambiguity
MDASH is arriving at a moment when AI-assisted vulnerability discovery is becoming strategically important and politically sensitive. The same class of systems that can help defenders find flaws sooner can also make offensive discovery cheaper. That makes integration, access control, and disclosure handling as important as raw benchmark performance.
The near-term question is whether systems like MDASH can keep precision high when they leave curated benchmark tasks and run against constantly changing proprietary code. Real programs come with incomplete documentation, unusual build systems, generated code, stale components, and business constraints that do not fit cleanly into a benchmark harness.
Microsoft is clearly trying to make the case that AI bug hunting is becoming production infrastructure. The most credible part of that case is not the 96.5% CyberGym number by itself. It is the combination of deeper Microsoft deployment, new real CVE discoveries, transparent failure analysis, and workflow integrations that bring findings into Defender, GitHub, and Azure DevOps. If that approach holds up outside Microsoft’s own stack, AI vulnerability discovery could become a normal part of secure software development rather than a specialized research event.