Microsoft Vulnerability Severity and Content Classifications for AI Systems

Microsoft is committed to earning and maintaining the trust of our customers in how we develop and deploy Artificial Intelligence (AI) systems. This includes safeguarding customers from vulnerabilities in our software, services, and devices, and managing the risks of producing harmful AI-generated content, including potential exploitation of those vulnerabilities.

AI Security Vulnerabilities

Safeguarding our customers from vulnerabilities in our AI systems involves providing timely security updates and guidance when such vulnerabilities are reported to Microsoft. The following tables outline Microsoft severity classification for common vulnerability types for AI systems, based on the Microsoft Security Response Center (MSRC) advisory rating. MSRC uses this information as guidelines to triage reported issues and determine severity levels, with consideration also given to the ease of exploitation.

Inference Manipulation

This category consists of vulnerabilities that could be exploited to manipulate the model’s response to individual inference requests, without modifying the model itself.
The severity of the vulnerability is determined based on the resulting security impact.
Content-related issues are assessed separately. See categories listed in the Content-related issues section.

Vulnerability	Description	Security Impact	Severity
Prompt Injection	The ability to inject instructions that cause the model to generate unintended output resulting in a specific security impact.	Allows an attacker to exfiltrate another user’s data or perform privileged actions on behalf of another user, requiring no user interaction (e.g., zero click).	Critical
	Example: In an instruction-tuned language model, a textual prompt from an untrusted source contradicts the system prompt and is incorrectly prioritized above the system prompt, causing the model to change its behavior.	Allows an attacker to exfiltrate another user’s data or perform privileged actions on behalf of another user, requiring some user interaction (e.g., one or more clicks).	Important
	References: Greshake et al. 2023, Rehberger 2023	Allows an attacker to influence or manipulate the generated output.	Content-related issues
Input Perturbation	The ability to perturb otherwise valid inputs in a way that causes the model produces incorrect outputs. Also known as model evasion or adversarial examples.	Allows an attacker to exfiltrate another user’s data or perform privileged actions on behalf of another user, requiring no user interaction (e.g., zero click).	Critical
	Example: In an image classification model, an attacker perturbs the input image such that it is misclassified by the model.	Allows an attacker to exfiltrate another user’s data or perform privileged actions on behalf of another user, requiring some user interaction (e.g., one or more clicks).	Important
	References: Szegedy et al. 2013, Biggio & Roli 2018	Allows an attacker to influence or manipulate the generated output.	Content-related issues

Inferential Information Disclosure

This category consists of vulnerabilities that could be exploited to infer information about a model’s training data, architecture and weights, or inference-time input data.
Inferential information disclosure vulnerabilities specifically involve inferring information using the model itself (e.g., through the legitimate inference interface). Vulnerabilities that obtain information in other ways (e.g., storage account misconfiguration) are assessed using existing definitions (e.g., “Information Disclosure”).
Evaluation of these vulnerabilities is based on the level of confidence or accuracy attainable by a potential attacker. A vulnerability in this category is only applicable if an attacker can obtain a sufficient level of confidence or accuracy.
The severity of the vulnerability is determined based on the classification of the impacted data, using the data classification definitions from the Microsoft Vulnerability Severity Classification for Online Services.

Targeting training data

For vulnerabilities targeting the training data, the severity is determined by the classification of this data.

Vulnerability	Description	Data classification of training data	Severity
Membership Inference	The ability to infer whether specific data records, or groups of records, were part of the model’s training data.	Highly Confidential or Confidential	Moderate
	Example: An attacker guesses potential data records and then uses the outputs of the model to infer whether these were part of the training dataset, thus confirming the attacker’s guess. References: Carlini et al. 2022, Ye et al. 2022	General or Public	Low
Attribute Inference	The ability to infer sensitive attributes of one or more records that were part of the training data.	Highly Confidential or Confidential	Important
	Example: An attacker knows part of a data record that was used for training and then uses the outputs of the model to infer the unknown attributes of that record.	General	Moderate
	References: Fredrikson et al. 2014, Salem et al. 2023	Public	Low
Training Data Reconstruction	The ability to reconstruct individual data records from the training dataset.	Highly Confidential or Confidential	Important
	Example: An attacker can generate a sufficiently accurate copy of one or more records from the training data, which would not have been possible without access to the model.	General	Moderate
	References: Fredrikson et al. 2015, Balle et al. 2022	Public	Low
Property Inference	The ability to infer sensitive properties about the training dataset.	Highly Confidential or Confidential	Moderate
	Example: An attacker can infer what proportion of data records in the training that belong to a sensitive class, which would not have been possible without access to the model. References: Zhang et al. 2021, Chase et al. 2021	General or Public	Low

Targeting model architecture/weights

For vulnerabilities targeting the model itself, the severity is determined by the classification of the model architecture/weights.

Vulnerability	Description	Data classification of model architecture/weights	Severity
Model Theft	The ability to infer or extract the architecture or weights of the trained model.	Highly Confidential or Confidential	Critical
	Example: An attacker is able to create a functionally equivalent copy of the target model using only inference responses from this model.	General	Important
	References: Jagielski et al. 2020,  Zanella-Béguelin et al. 2021	Public	Low

Targeting prompt/inputs

For vulnerabilities targeting the inference-time inputs, the severity is determined by the classification of these inputs.

The ability to extract or reconstruct the system prompt(s) of a service, by interacting with the service or otherwise, is not considered to be a vulnerability.

Vulnerability	Description	Data classification of system prompts/user input	Severity
Input Extraction	The ability to extract or reconstruct other users’ inputs to the model.	Highly Confidential or Confidential	Important
	Example: In an instruction-tuned language model, an attacker uses a specially crafted input that causes the model to reveal (part of) another user’s input to the attacker.	General of Public	Low

Content-Related Issues

AI-generated content may occasionally diverge from Microsoft's Responsible AI Principles due to system malfunctions or malicious use. The table below outlines key areas of concern identified during our safety review process. These categories are not exhaustive and may evolve as our understanding and standards develop.

At this stage, content-related issues are classified as either In Scope or Out of Scope for investigation, without severity ratings. Scope determination focuses primarily on finished AI services—those that are publicly accessible or integrated into Microsoft products—rather than on experimental or internal-only systems.

Context matters. When evaluating whether a content-related issue falls within scope, we consider factors such as the depth of the model’s response, availability of information, and nature of user interaction. This is an emerging space, and we actively welcome submissions related to safety and societal harm. These inputs help inform our standards, guide product conversations, and shape iterations of this bug bar.

Category	Description
Chemical, Biological, Radiological or Nuclear (CBRN)	AI-generated content related to the harmful use of CBRN that demonstrates information not readily available publicly.
Information Integrity	AI-generated content that is false, misleading, or incomplete, created with the intent to deceive or manipulate public understanding.
Election Integrity	AI-generated content that misrepresents, distorts, or fabricates information related to elections, with the intent of misleading others.
Sexual Content	AI-generated content that depicts or promotes sexual violence, exploitation, or graphic sexual material.
Hate Speech	AI-generated content that promotes hate speech, incites violence, or encourages discrimination, harassment, or harmful actions against individuals or groups based on ethnicity, race, religion, gender, sexual orientation, disability, or any other protected characteristic.
Self-Harm	AI-generated content that promotes, glorifies, or encourages self-harm, suicide, or behaviors that may lead to physical or psychological injury.
Inciting Violence	AI-generated content that incites, glorifies, or encourages violence, including threats or calls to action that could result in real-world harm.
Inappropriate Language	AI-generated content that includes offensive, profane, or abusive language, including slurs, insults, or derogatory remarks.
Fairness	AI-generated content or behavior that results in unequal treatment, biased outcomes, or reduced quality of service for individuals based on demographic characteristics, including marginalized or protected groups.
Malicious Uses	AI-generated content or capabilities that significantly reduce the effort, expertise, or resources required to engage in harmful, unlawful, or unethical activities—particularly when such content reveals or synthesizes information, instructions, or tools that are not widely or readily available to the public.
Child Sexual Exploitation and Abuse Material (CSEAM)	AI-generated content that depicts, promotes, or facilitates child sexual exploitation or abuse. Microsoft Security Response Center is unable to accept reports of this nature. Please use the appropriate in-product reporting features to report this content or file a report at Report a Concern.

Microsoft recognizes that this list may not encompass all vulnerability types or content-related issues, and that new vulnerabilities or issues may be discovered over time. We are committed to continuously updating this list based on trends identified through submitted cases and actively participating in multilateral efforts aimed at building consensus on priority vulnerabilities and issues. We reserve the right to classify any vulnerabilities or issues not currently covered by this document at our discretion. These classifications may be updated as needed. Examples are provided for illustrative purposes only. Any penetration testing against Microsoft systems must adhere to the Microsoft Security Testing Rules of Engagement.