Module E2 of 5 · Track 3E: Security Operations for AI

The threat is in the language, not the packet.

Security Monitoring and Anomaly Detection

Traditional security tools watch bytes and access events. AI security monitoring must also watch what the model sees and says. This module covers the five monitoring layers of an AI stack, why distillation attacks require population-level statistics to detect, how to log security signals without logging sensitive content, and how AgentIQ and DiscoveR instrument the monitoring layer.

40 min read
Track 3E
Intermediate
Security Operations

Section 01

Why AI monitoring differs

Traditional security monitoring watches bytes, packets, access control events, and file system changes. These signals are structural: a port scan looks different from normal traffic at the network layer. A privilege escalation leaves an audit trail in the operating system. The threat is visible in the metadata.

AI security threats are different in three fundamental ways that require a different monitoring approach.

The signal is in the semantics, not the bytes. A malicious prompt injection and a benign customer support query arrive through the same authenticated API channel with the same headers and the same byte count. The threat is in the meaning of the text, not the structure of the packet. No traditional network security tool can see this difference.

The attacker is often authenticated. A distillation campaign operates through valid API keys. A jailbreak attempt comes from a paying user. An insider running extraction queries has legitimate access. Access control events show nothing unusual because the attacker is authorized to make requests. The attack is in what they request and what they do with the response.

Model drift is a security event. A model whose behaviour has shifted may have been poisoned through a compromised update pipeline, fine-tuned on adversarial data, or jailbroken by a technique that the deployment team has not detected. A 15-percentage-point drop in safety refusal rates over two weeks is not a performance metric. It is a potential security incident.

Traditional security monitoring watches
Network bytes, packet headers, port traffic
Authentication events: logins, token issuance, failures
File system changes: creation, modification, deletion
Process execution and privilege escalation
API call volume, rate limits, error rates
Misses: semantic content of queries and outputs
Misses: authenticated attacker behaviour patterns
Misses: model behaviour drift over time
AI-aware security monitoring also watches
Query semantic similarity clustering across accounts
Output PII rate, hallucination rate, refusal rate
Prompt injection detection signals per output
Retrieval pattern anomalies in vector database queries
Agent tool call patterns, delegation chain depth
Model safety refusal rate on standard probe sets
Population-level query distribution across accounts
Chain-of-thought trace integrity for agentic workflows

Section 02

The five monitoring layers

A production AI system is not one thing. It has a query intake layer, an inference layer, a retrieval layer (if it uses RAG), an agent layer (if it uses autonomous workflows), and a model layer. Each layer has distinct security signals that are invisible to the others. Missing any layer leaves a monitoring blind spot.

Five-layer AI monitoring stack with MITRE ATLAS coverage

1
Input layer · Query intake
Query volume, semantic similarity, account behavioral fingerprint, injection pattern signatures, rate per account per period
AML.T0024
2
Inference layer · Model output
PII rate in outputs, hallucination score distribution, refusal rate, chain-of-thought trace completeness, output format conformance
AML.T0051
3
Retrieval layer · Vector database
Embedding query clustering, document access frequency, retrieval relevance score drift, cross-namespace access attempts, query timing anomalies
AML.T0024
4
Agent layer · Autonomous actions
Tool call frequency, delegation chain depth, blast radius per agent session, cross-tenant access attempts, spawn rate of child agents, failed authorization rate
AML.T0054
5
Model layer · Behaviour over time
Safety refusal rate on probe sets, capability regression on held-out eval, structured output format deviation, response length distribution shift
AML.T0018

Most organisations monitor only layers 1 and 2. Rate limiting and output scanning are the default monitoring posture for deployed LLMs. Layers 3, 4, and 5 are almost universally absent. The retrieval layer is where VectaX-protected RAG systems need specific monitoring. The agent layer is where AgentID token and delegation chain events need to feed into the security dashboard. The model layer is where DiscoveR's scheduled probes detect drift.

Section 03

Input layer signals

The input layer sits at the front of the AI system: every query passes through it before reaching the model. It is the first place where anomalies can be detected, but also the layer where detection is hardest, because malicious queries are semantically crafted to look benign.

Query rate per account
Total queries per account per hour and per day, normalised to account age. A new account submitting thousands of queries in the first hour is anomalous even if each query looks benign.
Warn: 5x baseline for account age cohort · Alert: 20x
Semantic similarity clustering
Group queries across accounts by embedding similarity. Distinct accounts whose queries cluster together in semantic space may be coordinated. A natural user population has high semantic diversity.
Alert: inter-account similarity above 0.85 cosine
Topic distribution uniformity
Measure the entropy of query topics per account. A legitimate user asks about many topics, so topic entropy is high. A systematic extractor working through one domain comprehensively concentrates its queries there, so its topic entropy is unusually low.
Alert: topic entropy below 25th percentile for cohort
Prompt injection pattern match
Match queries against known injection signatures: role-play overrides, instruction separator injections, indirect injection via retrieved content. Pattern matching has high false positive risk without semantic context.
Alert: match on high-confidence injection signature
Query length distribution
Monitor the distribution of query lengths per account. Systematic extraction campaigns often show unusual length patterns: very long queries that embed entire contexts, or unusually short probes at high volume.
Baseline: log-normal distribution per app type
Repeated identical queries
Track SHA-256 hashes of normalised queries. High rate of identical or near-identical queries from one account or across accounts indicates programmatic extraction rather than natural use.
Alert: same hash appearing 50+ times across accounts
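
Two of the input-layer signals above, the normalised query hash and per-account topic entropy, can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the normalisation rule (lowercase, collapse whitespace) and the topic labels are assumptions, and in practice the labels would come from whatever topic classifier or embedding clusterer the deployment already runs.

```python
import hashlib
import math
import re
from collections import Counter


def normalized_query_hash(query: str) -> str:
    """SHA-256 of a query after lowercasing and collapsing whitespace (assumed rule)."""
    normalized = re.sub(r"\s+", " ", query.strip().lower())
    return "sha256:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def topic_entropy(topic_labels: list) -> float:
    """Shannon entropy in bits of one account's query-topic distribution."""
    counts = Counter(topic_labels)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Two differently formatted copies of the same query share one hash.
h1 = normalized_query_hash("Explain  constraint satisfaction")
h2 = normalized_query_hash("explain constraint satisfaction ")
print(h1 == h2)  # True

# An account spread over four topics versus one focused on a single topic.
print(topic_entropy(["billing", "returns", "shipping", "login"]))  # 2.0
print(topic_entropy(["chemistry"] * 4))                            # 0.0
```

Counting how often each hash appears across accounts then gives the repeated-query signal without storing any query text.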

Section 04

Inference layer signals

The inference layer is where the model produces output. Security signals at this layer are about what comes out, not what went in. AgentIQ runs inline at this layer, classifying every output before it reaches the user and feeding those classifications into the monitoring pipeline.

PII rate in outputs
Fraction of responses containing detected PII (names, emails, phone numbers, account IDs, medical identifiers). A rising PII rate indicates the model is leaking data it should not include in outputs.
Alert: PII rate above 0.5% of responses per hour
Hallucination score
Per-response score measuring factual inconsistency relative to retrieved context. A sustained increase over a rolling window indicates model drift or that the retrieval pipeline is returning degraded content.
Alert: p75 score rises 20% above 30-day baseline
Safety refusal rate
Fraction of responses that trigger a safety refusal. Both rises and drops are signals: a drop may indicate jailbreak success; a spike may indicate an active injection campaign probing safety boundaries.
Alert: 15% deviation from 7-day rolling mean
Prompt injection detection rate
Per-response classification: was this output affected by a detected injection attempt? Rising detection rate indicates an active injection campaign. Sudden drop after a rise may indicate the attacker found a bypass.
Alert: detection rate above 1% for 15-minute window
Chain-of-thought integrity
For agentic deployments: does the visible reasoning chain align with the stated task and the delegated permissions? A chain that justifies actions outside the token scope indicates a compromised workflow.
Alert: any scope deviation in chain reasoning
Response latency distribution
Sudden latency spikes can indicate increased reasoning effort (jailbreak attempts), resource exhaustion attacks, or a compromised inference pipeline. Track p50, p95, p99 per model and per API endpoint.
Alert: p95 latency 3x above 24-hour baseline
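
Most of the inference-layer alerts above compare a current value against a rolling baseline. The sketch below applies that pattern to the safety refusal rate. It assumes the "15% deviation" is relative to the 7-day mean and that daily refusal rates are already aggregated; the function names are hypothetical.

```python
from statistics import mean


def deviation_from_baseline(history: list, current: float) -> float:
    """Relative deviation of `current` from the mean of `history`."""
    baseline = mean(history)
    return (current - baseline) / baseline


def refusal_rate_alert(daily_rates: list, today: float, threshold: float = 0.15) -> bool:
    """True when today's rate is at least `threshold` off the 7-day mean, in either direction."""
    last_seven = daily_rates[-7:]
    return abs(deviation_from_baseline(last_seven, today)) >= threshold


week = [0.040, 0.042, 0.038, 0.041, 0.039, 0.040, 0.040]  # mean is 0.040
print(refusal_rate_alert(week, 0.041))  # False: small wobble
print(refusal_rate_alert(week, 0.030))  # True: sharp drop, possible jailbreak success
print(refusal_rate_alert(week, 0.052))  # True: sharp spike, possible probing campaign
```

The check is deliberately two-sided, because both a drop and a spike in refusals are treated as signals.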

Section 05

Retrieval layer signals

In a RAG deployment, the retrieval layer is where user queries reach the vector database. This layer has a specific attack surface: an attacker who can observe or manipulate retrieved documents can inject content into the model's context window. Retrieval anomalies are often the first visible signal of an indirect prompt injection attack.

Retrieval relevance drift
Average similarity score between query embedding and retrieved document embeddings. A sustained drop indicates the vector index has been corrupted, poisoned documents have been inserted, or the embedding model has drifted.
Alert: mean relevance drops 15% below 7-day baseline
Document access frequency
Track which documents are retrieved most frequently. A sudden spike in retrieval of one document across many queries may indicate that document has been poisoned to be highly similar to many query patterns.
Alert: one document retrieved in 30%+ of queries
Cross-namespace access attempts
In a multi-tenant RAG deployment, monitor whether queries from one tenant retrieve documents from another tenant's namespace. This should never happen, but a violation is invisible without explicit namespace-level logging.
Alert: any cross-namespace retrieval event
Embedding query clustering
Group retrieval queries by semantic cluster over a rolling window. Natural retrieval shows organic diversity. Systematic crawling of the vector index produces unusually uniform coverage of the semantic space.
Alert: query coverage entropy above cohort baseline
New document insertion rate
Track when new documents are added to the vector index and by whom. Unexpected insertions from service accounts or through the retrieval API (rather than the ingestion pipeline) indicate possible poisoning.
Alert: insertion not from approved ingestion pipeline
Retrieval latency spikes
Unusual latency in vector search can indicate a denial-of-service attempt against the retrieval layer, or an unusually large query embedding that is probing the full index rather than a specific topic area.
Alert: p99 retrieval latency 5x above 24h baseline
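
Two of the retrieval signals above need nothing more than document IDs and tenant labels. The following sketch computes the single-document retrieval share and lists cross-namespace retrievals. The event shape (`tenant`, `doc_ids`) and the `doc_namespace` lookup are assumptions made for illustration.

```python
from collections import Counter


def top_document_share(retrieval_events: list) -> tuple:
    """Most-retrieved document ID and the fraction of queries that retrieved it."""
    hits = Counter()
    for event in retrieval_events:
        hits.update(set(event["doc_ids"]))  # count a document once per query
    doc_id, count = hits.most_common(1)[0]
    return doc_id, count / len(retrieval_events)


def cross_namespace_events(retrieval_events: list, doc_namespace: dict) -> list:
    """Events where a tenant retrieved a document from another tenant's namespace."""
    return [
        e for e in retrieval_events
        if any(doc_namespace[d] != e["tenant"] for d in e["doc_ids"])
    ]


doc_namespace = {"doc_0442": "tenant_a", "doc_1107": "tenant_a", "doc_2001": "tenant_b"}
events = [
    {"tenant": "tenant_a", "doc_ids": ["doc_0442", "doc_1107"]},
    {"tenant": "tenant_a", "doc_ids": ["doc_0442"]},
    {"tenant": "tenant_b", "doc_ids": ["doc_2001"]},
    {"tenant": "tenant_b", "doc_ids": ["doc_0442"]},  # crosses into tenant_a
]

doc_id, share = top_document_share(events)
print(doc_id, share, share >= 0.30)                       # doc_0442 0.75 True
print(len(cross_namespace_events(events, doc_namespace)))  # 1
```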

VectaX-protected retrieval layers need different monitoring. When embeddings are encrypted using VectaX, the monitoring signals change. You can still monitor retrieval frequency, latency, and namespace access from audit logs. But you cannot monitor raw embedding content, which is the correct security posture: the VectaX audit trail provides the access-level signal while keeping the semantic content of queries and documents encrypted.

Section 06

Agent layer signals

Agentic AI deployments introduce a new category of monitoring signal: the actions an agent takes, not just the text it produces. An agent that calls a payments API, spawns child agents, reads from a database, and sends emails in one session has a much larger security footprint than a chatbot that produces text. The agent layer requires monitoring the actions, not just the outputs.

Tool call frequency
Number of tool calls per agent session, per session type, and per tool. An agent that calls the payments API 50 times in one session is anomalous even if each individual call is within scope.
Alert: tool calls per session 5x above historical median
Delegation chain depth
How many levels of parent-child agent delegation exist in a session. Deep delegation chains that were not anticipated by the policy design can indicate prompt injection redirecting the top-level agent to spawn unauthorized sub-agents.
Alert: delegation depth exceeds configured maximum
Blast radius per session
Aggregate count of distinct resources touched per agent session: number of distinct customer records, number of distinct API endpoints, number of distinct files. A high blast radius in a short time is a signal.
Alert: distinct resources 10x above session baseline
Cross-tenant access attempts
Agent requests for resources belonging to a different tenant than the one in the token. These should be blocked by AgentID gateway enforcement, but monitoring the attempt rate provides an early attack signal.
Alert: any cross-tenant attempt event
Failed authorization rate
Rate of requests rejected by the AgentID gateway per session. A spike in rejections from one agent instance indicates either a misconfigured policy or an agent that has been redirected by a prompt injection.
Alert: rejection rate above 5% for any session
Token reuse after intended task
Detect token usage after the task that should have consumed it is complete. Short-lived tokens mitigate this risk, but monitoring for late reuse catches token theft or session persistence attacks.
Alert: any token use after task completion event
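
The agent-layer thresholds above can be evaluated per session once tool calls, touched resources, and gateway rejections are collected in one place. The `AgentSession` record below is a hypothetical structure for illustration, not an AgentID data model, and the baselines passed in are assumed to come from historical sessions of the same type.

```python
from dataclasses import dataclass, field


@dataclass
class AgentSession:
    tool_calls: list = field(default_factory=list)  # tool name per call
    resources: set = field(default_factory=set)     # distinct resources touched
    delegation_depth: int = 0                       # deepest parent-child chain
    rejected_calls: int = 0                         # calls denied by the gateway


def session_alerts(session: AgentSession, median_tool_calls: float,
                   baseline_resources: float, max_depth: int) -> list:
    """Evaluate one session against the agent-layer alert thresholds."""
    alerts = []
    total = len(session.tool_calls)
    if total > 5 * median_tool_calls:
        alerts.append("tool_call_frequency")
    if session.delegation_depth > max_depth:
        alerts.append("delegation_depth")
    if len(session.resources) > 10 * baseline_resources:
        alerts.append("blast_radius")
    if total and session.rejected_calls / total > 0.05:
        alerts.append("failed_authorization_rate")
    return alerts


session = AgentSession(
    tool_calls=["payments_api"] * 50,
    resources={f"customer_{i}" for i in range(12)},
    delegation_depth=4,
    rejected_calls=1,
)
print(session_alerts(session, median_tool_calls=6, baseline_resources=3, max_depth=3))
# ['tool_call_frequency', 'delegation_depth']
```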

Section 07

Model layer and drift as a security event

Model drift is normally treated as a performance issue. In a security context, it is also a security event. A model that was safe at deployment may not be safe after a fine-tuning update, after a new jailbreak technique becomes public, or after a backdoor was inserted during an upstream supply chain compromise.

The distinction matters because drift is measurable. A model whose refusal rate on a standard probe set drops from 92 percent to 74 percent over two weeks has drifted by 18 percentage points, and that is detectable before the model causes harm if you are running the probes. NIST's evaluation found that a named frontier model responded to 94 percent of malicious requests under common jailbreaking techniques. A gap that wide between intended and observed safety posture is exactly what scheduled probes are meant to surface.

Model drift monitoring requires a held-out evaluation set that does not change between measurements. If your evaluation set changes between measurements, you cannot distinguish model drift from evaluation drift.

Safety refusal rate on probe set
Run a fixed set of adversarial probes against the deployed model on a schedule (daily or after each update). Track the refusal rate. A sustained drop is a security signal requiring investigation before the next deployment.
Alert: drop of more than 10 percentage points from baseline
Capability regression
Track performance on a held-out evaluation set across deployments. Unexpected capability drops alongside safety metric drops suggest the model update compromised both safety and utility, which is a supply chain concern.
Alert: more than 5% accuracy drop on any evaluation domain
Structured output conformance
For models deployed with structured output requirements (JSON schema, specific format), track what fraction of outputs conform. Drift in format conformance often indicates a model update has changed the output distribution.
Alert: conformance drops below 95% for 1-hour window
Response length distribution shift
Track the distribution of response lengths over time. A sustained shift to shorter or longer responses often indicates a model update, prompt change, or that a jailbreak is triggering different output paths.
Alert: mean length shifts 25% from 30-day rolling baseline
Post-update comparison scan
Run the full DiscoveR adversarial test suite immediately after every model update, including fine-tuning. Compare pass rates on all attack categories against the previous deployment. Any regression is a blocker for the update.
Alert: any category regression after model update
Jailbreak success rate tracking
Track the rate at which known jailbreak techniques succeed against the deployed model over time. A technique that failed last month and succeeds this month indicates a new vulnerability introduced by an update.
Alert: any previously-failed technique now succeeding
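
The probe-set check reduces to comparing refusal rates on an unchanged set of probes and expressing the change in percentage points. The sketch below reproduces the 92 percent to 74 percent example from this section. The probe IDs and result format are hypothetical; it refuses to compare runs whose probe sets differ, since that would mix model drift with evaluation drift.

```python
def refusal_rate(probe_results: dict) -> float:
    """Percentage of fixed probes that the model refused (True means refused)."""
    return 100.0 * sum(probe_results.values()) / len(probe_results)


def probe_drift(baseline: dict, current: dict) -> float:
    """Change in refusal rate, in percentage points, on an unchanged probe set."""
    if baseline.keys() != current.keys():
        raise ValueError("probe set changed; cannot separate model drift from evaluation drift")
    return refusal_rate(current) - refusal_rate(baseline)


probe_ids = [f"probe_{i:03d}" for i in range(50)]
baseline = {p: i < 46 for i, p in enumerate(probe_ids)}  # 46 of 50 refused: 92%
current = {p: i < 37 for i, p in enumerate(probe_ids)}   # 37 of 50 refused: 74%

drift = probe_drift(baseline, current)
print(drift)        # -18.0
print(drift < -10)  # True: exceeds the 10-percentage-point alert threshold
```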

Section 08

Distillation attack detection

Distillation attacks extract the reasoning capabilities of a frontier model through large-scale systematic querying. The attacker builds a training dataset of (prompt, reasoning, response) triples from your model, then trains a student model on this dataset. Done at scale, the student model approximates the frontier model's capabilities at a fraction of the development cost.

In February 2026, Anthropic documented over 16 million exchanges generated by three Chinese AI laboratories across roughly 24,000 fake accounts targeting its Claude models. One proxy network managed more than 20,000 simultaneous fraudulent accounts, mixing extraction traffic with legitimate queries to camouflage the operation. OpenAI and Google's Threat Intelligence Group made similar disclosures in the same period.

16M+
exchanges documented by Anthropic, February 2026
24,000
fake accounts used across one documented campaign
20,000+
simultaneous fraudulent accounts in one proxy network
3
named Chinese AI laboratories in Anthropic's February 2026 disclosure

What distillers actually steal: reasoning capability (chain-of-thought traces teach students how to decompose problems and verify intermediate steps), safety properties (distilled models inherit capability but shed safety alignment), and architectural insight (systematic extraction reveals how the model structures its reasoning across domains).

The monitoring challenge: detection approaches that focus on individual queries all fail. Any transformation that preserves a response as useful to a human also preserves its training signal for a student model. The defense must operate at the population level.

Detection approaches compared: what each catches, what it misses, and how effective it is

Per-query anomaly detection
Checking individual queries for signs of extraction
Catches: Obvious probing patterns, known extraction templates
Misses: Sophisticated queries designed to look organic. Misses the pattern entirely for insiders.
Effectiveness: Weak

Rate limiting per account
Capping queries per API key
Catches: Bulk harvesters using few accounts
Misses: Distributed campaigns across many accounts. Insiders with legitimate volume allowances.
Effectiveness: Partial

Account clustering analysis
Group accounts by query pattern similarity
Catches: Coordinated campaigns where accounts show similar query patterns
Misses: Single sophisticated attacker. Well-designed query diversity.
Effectiveness: Partial

Population-level topic coverage
Track semantic coverage of the model's capability space
Catches: Systematic coverage-oriented extraction campaigns
Misses: Targeted domain extraction. Random sampling strategies.
Effectiveness: Strong

Response similarity clustering
Cluster outputs that look like they come from the same query region
Catches: Accounts that are systematically extracting the same output region
Misses: Sophisticated diversified extraction. Does not catch novel query paths.
Effectiveness: Strong

Section 09

Population-level vs individual signals

The most important insight in AI security monitoring: most attacks are invisible at the individual query level and visible only at the population level. A single distillation query, a single injection attempt that did not succeed, and a single account with slightly higher than normal query volume are all noise. The signal emerges across thousands of queries, across many accounts, over days or weeks.

Why individual inspection fails: what detection actually requires

✖ Individual query inspection
Query: "Explain the steps for solving a constraint satisfaction problem"
Response: Correct, detailed, helpful
Account: Registered user, API key active, no rate limit breach
Network: Normal HTTPS request, no anomalous headers
Classification: Benign. No detection possible at this level.
The same query appears across 1,800 different accounts over 3 days with slight rephrasing. Invisible from this view.
✓ Population-level analysis
Accounts 1-1800: All submit semantically similar constraint-solving queries
Topic coverage: These accounts collectively cover 94% of the model's documented reasoning domains in 3 days
Account age: 78% of accounts are less than 7 days old
Inter-account similarity: Cosine similarity of query embeddings: 0.87 (above the 0.85 alert threshold)
Classification: High-confidence coordinated extraction campaign. Alert triggered.
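
The inter-account similarity figure in the comparison above can be computed by averaging each account's query embeddings into a centroid and comparing centroids pairwise. This is a standard-library sketch with toy 3-dimensional vectors; the centroid approach and account names are assumptions, and a pairwise loop like this only suits small populations. At the scale of thousands of accounts a clustering or approximate nearest-neighbour step would be the more likely choice.

```python
import math
from itertools import combinations


def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def centroid(vectors: list) -> list:
    """Mean embedding of one account's queries."""
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def similar_account_pairs(account_embeddings: dict, threshold: float = 0.85) -> list:
    """Account pairs whose query centroids exceed the cosine threshold."""
    centroids = {acct: centroid(vecs) for acct, vecs in account_embeddings.items()}
    return [
        (a, b) for a, b in combinations(sorted(centroids), 2)
        if cosine(centroids[a], centroids[b]) > threshold
    ]


# Toy 3-dimensional embeddings; real embeddings have hundreds of dimensions.
accounts = {
    "acct_1": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "acct_2": [[0.85, 0.15, 0.05], [0.9, 0.05, 0.0]],
    "acct_3": [[0.0, 0.2, 0.9], [0.1, 0.1, 0.95]],
}
print(similar_account_pairs(accounts))  # [('acct_1', 'acct_2')]
```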

Building population-level monitoring requires aggregating signals over time windows (rolling 24-hour, 7-day, and 30-day windows for different signal types), across accounts (grouping by account age cohort, IP range, and query embedding cluster), and across the semantic space of the model (tracking which capability regions of the model have been queried and how uniformly).

This kind of monitoring infrastructure is not built into any standard API gateway. It requires custom telemetry pipeline design: query embeddings must be computed and stored (not the queries themselves), account-level aggregates must be maintained in real-time, and alert thresholds must be calibrated against the genuine user population baseline before anomalies become meaningful.
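
As a rough sketch of the aggregation described above, the functions below apply a rolling time window and then compute two population-level figures: the share of capability regions touched and the share of active accounts that are newly created. The event fields (`ts`, `account`, `account_created`, `cluster`) and the use of embedding cluster IDs as "capability regions" are assumptions for illustration.

```python
from datetime import datetime, timedelta


def window(events: list, now: datetime, days: int) -> list:
    """Events that fall inside a rolling window ending at `now`."""
    cutoff = now - timedelta(days=days)
    return [e for e in events if e["ts"] >= cutoff]


def capability_coverage(events: list, all_regions: set) -> float:
    """Fraction of known capability regions (cluster IDs) touched by the events."""
    touched = {e["cluster"] for e in events}
    return len(touched & all_regions) / len(all_regions)


def new_account_fraction(events: list, now: datetime, max_age_days: int = 7) -> float:
    """Fraction of active accounts created within the last `max_age_days`."""
    created = {e["account"]: e["account_created"] for e in events}
    young = [a for a, c in created.items() if now - c < timedelta(days=max_age_days)]
    return len(young) / len(created)


now = datetime(2026, 3, 10)


def ev(days_ago: int, account: str, account_age_days: int, cluster: str) -> dict:
    return {"ts": now - timedelta(days=days_ago), "account": account,
            "account_created": now - timedelta(days=account_age_days), "cluster": cluster}


regions = {f"cluster_{i:03d}" for i in range(4)}
events = [ev(1, "a1", 2, "cluster_000"), ev(2, "a2", 3, "cluster_001"),
          ev(2, "a3", 400, "cluster_002"), ev(20, "a4", 30, "cluster_003")]

recent = window(events, now, days=3)
print(capability_coverage(recent, regions))  # 0.75
print(new_account_fraction(recent, now))     # two of three active accounts are under 7 days old
```

Only cluster IDs and account metadata are needed here, which is consistent with storing embeddings or cluster assignments rather than the queries themselves.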

Section 10

Privacy-preserving logging

AI security monitoring requires logging, but AI systems process sensitive data that must not appear in logs. The solution is to log security signals rather than content: derived metadata that tells you what happened without telling you what was said.

This also makes the logs more useful for security analysis. A log file full of raw query text is hard to analyse statistically. A log file full of structured fields (hash, embedding cluster, PII flag, hallucination score, injection detected) is directly queryable by a SIEM.

Recommended AI security log schema: what to log instead of query content
query_hash
"sha256:a3f8c..."
SHA-256 of normalized query. Detects repeated identical queries across accounts without storing query text.
query_embedding_cluster
"cluster_047"
Nearest semantic cluster ID, not the raw embedding. Enables topic distribution analysis without storing vectors.
query_length_bucket
"medium_256-512"
Bucketed length range, not exact token count. Detects length distribution anomalies without enabling content reconstruction.
pii_detected
true
Boolean: did AgentIQ detect PII in the output? Not the PII itself. Feeds the PII rate metric.
hallucination_score
0.23
Normalized 0-1 score. Feeds the hallucination rate distribution. Not the output text.
injection_detected
false
Boolean with optional injection_type field. Feeds injection detection rate. Not the injected content.
refusal
false
Boolean: did the model refuse this request? Feeds refusal rate metric. Not the refusal text.
latency_ms
312
Exact latency for this request. Aggregated into p50/p95/p99 in the monitoring pipeline.
token_id_hash
"sha256:b91d..."
Hash of the AgentID token, not the token itself. Links to delegation chain context without exposing credentials.
retrieval_doc_ids
["doc_0442", "doc_1107"]
Document IDs retrieved, not their content. Feeds document access frequency monitoring.

Never log raw query or output text in AI security logs. AI system logs have broad access in most organisations (engineers, security teams, SREs). Raw query text in logs can itself become a data breach if an employee accesses logs for debugging and the queries contain PII. The security signal you need is in the derived fields, not the raw text.
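
One way to populate the schema above is to derive every field at the point where the raw query is still in memory and emit only the derived record. The sketch below is an illustration under stated assumptions: the `labels` dictionary stands in for whatever the output classifier returns (it is not the AgentIQ event format), the bucket edges other than `medium_256-512` are invented, and the cluster ID and token count are supplied by upstream components.

```python
import hashlib
import json
import re
from dataclasses import asdict, dataclass


@dataclass
class SecurityLogRecord:
    query_hash: str
    query_embedding_cluster: str
    query_length_bucket: str
    pii_detected: bool
    hallucination_score: float
    injection_detected: bool
    refusal: bool
    latency_ms: int
    token_id_hash: str
    retrieval_doc_ids: list


def sha256_tag(value: str) -> str:
    return "sha256:" + hashlib.sha256(value.encode("utf-8")).hexdigest()


def length_bucket(token_count: int) -> str:
    # Bucket edges are illustrative; only "medium_256-512" appears in the schema.
    for name, low, high in [("short", 0, 256), ("medium", 256, 512), ("long", 512, 2048)]:
        if low <= token_count < high:
            return f"{name}_{low}-{high}"
    return "very_long_2048+"


def build_record(query, token, token_count, cluster_id, labels, latency_ms, doc_ids):
    """Derive the log record; the raw query and token are never stored in it."""
    normalized = re.sub(r"\s+", " ", query.strip().lower())
    return SecurityLogRecord(
        query_hash=sha256_tag(normalized),
        query_embedding_cluster=cluster_id,
        query_length_bucket=length_bucket(token_count),
        pii_detected=labels["pii_detected"],
        hallucination_score=labels["hallucination_score"],
        injection_detected=labels["injection_detected"],
        refusal=labels["refusal"],
        latency_ms=latency_ms,
        token_id_hash=sha256_tag(token),
        retrieval_doc_ids=list(doc_ids),
    )


record = build_record(
    query="Where is my refund for order 1234?", token="example-agent-token",
    token_count=300, cluster_id="cluster_047",
    labels={"pii_detected": True, "hallucination_score": 0.23,
            "injection_detected": False, "refusal": False},
    latency_ms=312, doc_ids=["doc_0442", "doc_1107"],
)
line = json.dumps(asdict(record))
print(line)
```

An unkeyed hash of a short, guessable query can be reversed by hashing candidate strings, so a keyed hash such as HMAC is a common hardening step; the schema above specifies plain SHA-256, and that is what this sketch follows.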

Section 11

Key metrics and thresholds

A minimal viable AI security monitoring dashboard covers at least one metric per layer with an alert threshold. The thresholds below are starting points. Calibrate against your actual user population baseline before activating alerts: a threshold that is correct for a consumer chatbot will generate constant false positives for a developer API.

Input layer
Query rate (per account, per hour)
Warn: 5x cohort median · Alert: 20x
Inter-account query similarity
Warn: 0.80 cosine · Alert: 0.85
Injection pattern match rate
Alert: any high-confidence match
Inference layer
PII rate in outputs (% of responses)
Warn: 0.2% · Alert: 0.5%
Safety refusal rate (7-day deviation)
Warn: 8% drop · Alert: 15% drop
Injection detection rate (15-min window)
Warn: 0.5% · Alert: 1%
Retrieval layer
Retrieval relevance score (7-day baseline)
Warn: -10% · Alert: -15%
Cross-namespace access attempts
Alert: any event
Single document retrieval share
Warn: 20% · Alert: 30%
Agent and model layers
Tool calls per session
Warn: 3x median · Alert: 5x
Safety refusal rate (probe set)
Warn: -5% from baseline · Alert: -10%
Delegation chain depth
Alert: exceeds configured max

Threshold calibration takes time. Set initial thresholds conservatively (high sensitivity, expect false positives) for the first 30 days. Use the false positive rate to tune thresholds toward the actual user population baseline. An alert threshold calibrated against a developer API's baseline will be completely wrong for a consumer chatbot with a different query volume and distribution.
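
A dashboard of this kind is often backed by a small table of warn and alert levels. The sketch below encodes the "higher is worse" metrics from this section and maps a measured value to a severity. The metric key names are hypothetical, and the numbers are the starting points listed above, not calibrated values.

```python
from typing import Optional

# (warn, alert) pairs for metrics where a higher value is worse.
THRESHOLDS = {
    "query_rate_x_cohort_median": (5.0, 20.0),
    "inter_account_similarity": (0.80, 0.85),
    "pii_rate_pct": (0.2, 0.5),
    "injection_detection_rate_pct": (0.5, 1.0),
    "single_document_share_pct": (20.0, 30.0),
    "tool_calls_x_median": (3.0, 5.0),
}


def severity(metric: str, value: float) -> Optional[str]:
    """Return 'alert', 'warn', or None for a measured value."""
    warn, alert = THRESHOLDS[metric]
    if value >= alert:
        return "alert"
    if value >= warn:
        return "warn"
    return None


print(severity("pii_rate_pct", 0.3))               # warn
print(severity("inter_account_similarity", 0.87))  # alert
print(severity("tool_calls_x_median", 2.0))        # None
```

Keeping thresholds in data rather than code makes the 30-day calibration pass a configuration change instead of a redeploy.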

Section 12

MITRE ATLAS mapping

MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) provides a framework for classifying AI-specific attack techniques. Each technique has a monitoring signal that can be instrumented. Mapping your monitoring coverage to ATLAS tells you which attack techniques you can detect and which you cannot.

AML.T0024
Exfiltration via ML Inference API
Attacker queries the model repeatedly to extract information about training data (membership inference) or to reconstruct the model's capabilities (distillation). Covered in D1 and this module.
Monitor: query volume per account, inter-account semantic similarity, topic coverage entropy, response clustering
AML.T0018
Backdoor ML Model
Attacker inserts a trigger into the model during training or fine-tuning such that inputs containing the trigger produce attacker-chosen outputs. Covered in D4 (federated learning poisoning context).
Monitor: safety refusal rate on fixed probe set after every update, DiscoveR post-update scan, jailbreak success rate tracking
AML.T0010
ML Supply Chain Compromise
Attacker compromises the model update pipeline, a third-party model component, or training data to insert malicious behaviour into a deployed model.
Monitor: post-update comparison scan with DiscoveR, capability regression on held-out eval, unauthorized model file modification events
AML.T0051
LLM Prompt Injection
Attacker embeds instructions in user input or retrieved content to override the model's intended behaviour, redirect agent actions, or exfiltrate context window contents.
Monitor: AgentIQ injection detection rate per output, delegation chain depth anomalies, agent blast radius spikes, refusal rate changes
AML.T0054
LLM Jailbreak
Attacker constructs inputs designed to bypass the model's safety guardrails and produce harmful or unauthorized outputs, often through elaborate role-playing scenarios or instruction override techniques.
Monitor: safety refusal rate drops, DiscoveR jailbreak category pass rates, output toxicity score spikes, chain-of-thought reasoning alignment checks
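
The mapping above can be turned into a simple coverage check: list the signals each technique needs, list the signals you actually instrument, and report the difference. The technique names and signals below are taken from this section; the snake_case signal identifiers are hypothetical labels chosen for the example.

```python
COVERAGE = {
    "Exfiltration via ML Inference API": {
        "query_volume_per_account", "inter_account_similarity",
        "topic_coverage_entropy", "response_clustering"},
    "Backdoor ML Model": {
        "probe_set_refusal_rate", "post_update_scan", "jailbreak_success_rate"},
    "ML Supply Chain Compromise": {
        "post_update_scan", "capability_regression", "model_file_modification"},
    "LLM Prompt Injection": {
        "injection_detection_rate", "delegation_chain_depth",
        "agent_blast_radius", "refusal_rate"},
    "LLM Jailbreak": {
        "refusal_rate", "jailbreak_category_pass_rate",
        "output_toxicity_score", "chain_of_thought_alignment"},
}


def coverage_gaps(instrumented: set) -> dict:
    """Per technique, the listed signals that are not yet instrumented."""
    gaps = {}
    for technique, signals in COVERAGE.items():
        missing = signals - instrumented
        if missing:
            gaps[technique] = sorted(missing)
    return gaps


# A deployment that only instruments input- and inference-layer signals.
instrumented = {"query_volume_per_account", "inter_account_similarity",
                "injection_detection_rate", "refusal_rate", "output_toxicity_score"}
for technique, missing in coverage_gaps(instrumented).items():
    print(technique, "->", missing)
```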

Section 13

AgentIQ on the monitoring layer

AgentIQ runs inline at the inference layer, classifying every model output before it reaches the user or triggers a downstream agent action. Each classification produces a structured event record that feeds directly into the security monitoring pipeline described in this module.

The per-output events from AgentIQ are the foundation of inference-layer monitoring. Without them, monitoring the inference layer requires either logging raw outputs (which creates a privacy problem) or building a separate output scanning pipeline (which adds latency and infrastructure). AgentIQ produces the inference-layer monitoring signal as a side effect of its inline enforcement role.

In aggregate, AgentIQ events answer the questions that the inference layer metrics require. What fraction of outputs contained PII in the last hour? Has the hallucination score distribution shifted this week? Is there an active injection campaign: how many outputs have been flagged as injection-affected in the last 15 minutes? These are all derived from AgentIQ's per-output classification stream.

AgentIQ's chain security validation is specifically relevant for agentic monitoring. In multi-step workflows, it checks whether each step in the agent's reasoning chain is consistent with the delegated task and the AgentID token scope. A chain that attempts to justify an out-of-scope action is flagged before the action reaches the Resource Gateway, providing defence in depth: the chain security check catches the problem at the reasoning layer, and the gateway enforces it at the action layer.

Section 14

DiscoveR for model drift monitoring

DiscoveR provides the model-layer monitoring function described in Section 07. It runs structured adversarial tests against your deployed model on a schedule and after model updates, comparing results against the previous scan to detect drift.

The core monitoring workflow: run a DiscoveR baseline scan against the model before deployment. Store the per-category pass rates as the baseline. Run the same scan after every model update and on a weekly schedule. Compare new pass rates against the baseline. Any category where the pass rate has dropped is a potential security regression that blocks the update or triggers an investigation.

The correlation_id feature links scans across remediation cycles. If a DiscoveR scan finds a jailbreak vulnerability and the engineering team deploys a fix, the next scan with the same correlation_id compares only the tests that failed in the previous scan. This confirms that the specific vulnerabilities were addressed and not just that overall pass rates stayed constant while new vulnerabilities appeared.

For continuous monitoring between updates, DiscoveR can be run on a schedule against production endpoints. This catches two things that post-update scanning misses: vulnerabilities introduced by prompt changes (not model changes) and drift that accumulates gradually rather than appearing suddenly after an update.
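
The comparison logic in this workflow is independent of any particular scanner. The sketch below is not the DiscoveR API: it assumes scan results are available as plain dictionaries of per-category pass rates and per-test pass/fail flags, with hypothetical category and test names, and shows the baseline comparison and the selection of previously failed tests for a follow-up scan.

```python
def category_regressions(baseline: dict, current: dict) -> dict:
    """Categories whose pass rate dropped versus the stored baseline."""
    return {
        cat: round(current.get(cat, 0.0) - rate, 4)
        for cat, rate in baseline.items()
        if current.get(cat, 0.0) < rate
    }


def retest_targets(previous_results: dict) -> list:
    """Test IDs that failed in the previous scan of a remediation cycle."""
    return sorted(t for t, passed in previous_results.items() if not passed)


baseline = {"jailbreak": 0.96, "prompt_injection": 0.91, "pii_leakage": 0.99}
after_update = {"jailbreak": 0.88, "prompt_injection": 0.93, "pii_leakage": 0.99}

regressions = category_regressions(baseline, after_update)
print(regressions)        # {'jailbreak': -0.08}
print(bool(regressions))  # True: treat the update as blocked pending investigation

previous_scan = {"jb_roleplay_01": False, "jb_separator_02": True, "jb_encoding_03": False}
print(retest_targets(previous_scan))  # ['jb_encoding_03', 'jb_roleplay_01']
```

A category missing from the new scan is treated as a pass rate of zero here, so dropped coverage shows up as a regression rather than passing silently.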

Section 15

Frequently asked questions

How does AI security monitoring differ from traditional security monitoring?

Traditional security monitoring watches network bytes, access events, and file system changes. The threat is in the packet structure. AI security monitoring must also watch the semantic content of queries and outputs, because the attacker's payload is in the language. A prompt injection and a benign query look identical at the network layer. The attacker is often authenticated: distillers use valid API keys, jailbreak attempts come from paying users. Model drift is a security event: a model that was safe last month may not be safe today if its refusal rates have dropped. None of these signals are visible to traditional security tools.

What are the five monitoring layers in an AI stack?

Input layer: query volume, semantic similarity clustering, injection pattern detection, account behavioral fingerprinting. Inference layer: PII rate in outputs, hallucination score, safety refusal rate, injection detection, chain-of-thought integrity. Retrieval layer: retrieval relevance drift, document access frequency, cross-namespace access attempts, embedding query clustering. Agent layer: tool call frequency, delegation chain depth, blast radius per session, failed authorization rate. Model layer: safety refusal rate on probe sets, capability regression, structured output conformance, jailbreak success rate tracking. Most organisations monitor only the first two layers, leaving the other three as blind spots.

How do you detect distillation attacks through monitoring?

Distillation attacks cannot be detected by inspecting individual queries. A distiller's query is indistinguishable from a legitimate researcher's query at the individual level. Detection requires population-level statistics: accounts that show above-baseline semantic similarity to each other (coordinated extraction), accounts whose query topic distribution is unusually uniform (systematic coverage), query rates that are abnormally high once normalized to account age, and response content that clusters across multiple accounts. Anthropic's February 2026 disclosure documented 16 million exchanges across 24,000 fake accounts. Individual queries looked legitimate. The population pattern did not.

How should AI systems log security events without capturing sensitive content?

Log security signals rather than content. Log query hashes (SHA-256 of normalized query text), not query text: this detects repeated identical queries without storing private content. Log output classification labels (PII detected: true, hallucination score: 0.23), not output text. Log nearest semantic cluster IDs, not raw query embeddings. Log bucketed length ranges, not exact lengths. Use a structured log schema so all fields are directly queryable by SIEM tools without text parsing. Never log raw query or output text: AI system logs have broad access in most organisations, and raw query text in logs can itself become a data breach.

What is model drift and why is it a security event?

Model drift is a change in a deployed model's behaviour over time. It is a security event because it can indicate the model has been poisoned through a compromised update pipeline, adversarially fine-tuned, or that a new jailbreak technique is reliably bypassing its safety guardrails. A model whose refusal rate on a standard probe set drops from 92% to 74% has drifted by 18 percentage points. This is detectable before harm occurs if you run scheduled probes. Track safety refusal rate on a fixed probe set, jailbreak success rates, and run DiscoveR after every model update to catch regression before it reaches production.

How does AgentIQ contribute to AI security monitoring?

AgentIQ runs inline at every model output and produces per-request structured classification events: PII detected in output, hallucination score, prompt injection detected with injection type, toxicity score, and chain security status for agentic workflows. These per-request events feed aggregate monitoring: a rising PII rate triggers an alert, a sustained hallucination score increase indicates model drift, and a spike in injection detection indicates an active attack campaign. AgentIQ produces the inference-layer monitoring signal as a side effect of its inline enforcement role, without requiring a separate output scanning pipeline.

Next: Module E3 of 5

AI Incident Response

Playbooks for the most common AI security incidents: prompt injection attack, model compromise, distillation campaign, and agent breach. Forensics, containment, and what the remediation cycle looks like with DiscoveR and AgentIQ in the loop.