Q: What are the five layers of an AI stack that need monitoring?

The five layers are: input layer (query volume, semantic clustering, account behavior, injection patterns), inference layer (output PII rate, hallucination rate, refusal rate, chain-of-thought trace integrity), retrieval layer (embedding query patterns, document access frequency, retrieval relevance drift), agent layer (tool call frequency, delegation chain depth, cross-tenant access attempts, action blast radius), and model layer (performance metric drift, safety refusal rate changes, structured output format deviation). Traditional security tools monitor only the input layer. AI-aware security operations monitor all five.

Security Monitoring and Anomaly Detection Across AI Stacks | Track 3E

Q: How does AI security monitoring differ from traditional security monitoring?

Traditional security monitoring focuses on network bytes, packet headers, access control events, and file system changes. AI security monitoring must also watch the semantic content of queries and outputs, because the attacker's payload is in the language, not the packet. A malicious prompt injection and a benign customer support query arrive through the same authenticated API channel with the same headers and the same byte count. The threat is semantic, not structural. Additionally, the attacker may be an authenticated user (an insider running distillation queries, or a jailbreak attempt from a paying customer), so access control events alone cannot catch the attack.

Q: How should AI systems log security events without capturing sensitive data?

Privacy-preserving AI logging captures signals without capturing content. Log query hashes (SHA-256 of the normalized query) rather than query text: this detects repeated identical queries without storing private content. Log output classification labels (PII detected: true/false, hallucination score: 0.3, injection detected: false) rather than output text. Log timing distributions (p50, p95, p99 latency) rather than individual request timestamps, which can be used to correlate with specific user behaviour. Use a structured log schema with machine-readable fields so the signals feed directly into SIEM rules without requiring text parsing.

Q: How does AgentIQ contribute to AI security monitoring?

AgentIQ runs inline at every model output and provides per-request classification signals: PII detected in output (true/false with confidence), hallucination score, prompt injection detected (true/false with injection type), toxicity score, and chain security status for multi-step agent workflows. These per-request signals feed into aggregate monitoring: a rising PII rate in outputs over a rolling window triggers an alert; a hallucination score increase sustained over time indicates model drift; a spike in injection detection indicates an active attack campaign. AgentIQ provides the continuous per-output signal layer that populates the inference monitoring dashboard.

Section 01

Why AI monitoring differs

Traditional security monitoring watches bytes, packets, access control events, and file system changes. These signals are structural: a port scan looks different from normal traffic at the network layer. A privilege escalation leaves an audit trail in the operating system. The threat is visible in the metadata.

AI security threats are different in three fundamental ways that require a different monitoring approach.

The signal is in the semantics, not the bytes. A malicious prompt injection and a benign customer support query arrive through the same authenticated API channel with the same headers and the same byte count. The threat is in the meaning of the text, not the structure of the packet. No traditional network security tool can see this difference.

The attacker is often authenticated. A distillation campaign operates through valid API keys. A jailbreak attempt comes from a paying user. An insider running extraction queries has legitimate access. Access control events show nothing unusual because the attacker is authorized to make requests. The attack is in what they request and what they do with the response.

Model drift is a security event. A model whose behaviour has shifted may have been poisoned through a compromised update pipeline, fine-tuned on adversarial data, or jailbroken by a technique that the deployment team has not detected. A 15-percentage-point drop in safety refusal rates over two weeks is not a performance metric. It is a potential security incident.

Traditional security monitoring watches

Network bytes, packet headers, port traffic

Authentication events: logins, token issuance, failures

File system changes: creation, modification, deletion

Process execution and privilege escalation

API call volume, rate limits, error rates

Misses: semantic content of queries and outputs

Misses: authenticated attacker behaviour patterns

Misses: model behaviour drift over time

AI-aware security monitoring also watches

Query semantic similarity clustering across accounts

Output PII rate, hallucination rate, refusal rate

Prompt injection detection signals per output

Retrieval pattern anomalies in vector database queries

Agent tool call patterns, delegation chain depth

Model safety refusal rate on standard probe sets

Population-level query distribution across accounts

Chain-of-thought trace integrity for agentic workflows

Section 02

The five monitoring layers

A production AI system is not one thing. It has a query intake layer, an inference layer, a retrieval layer (if it uses RAG), an agent layer (if it uses autonomous workflows), and a model layer. Each layer has distinct security signals that are invisible to the others. Missing any layer leaves a monitoring blind spot.

Five-layer AI monitoring stack with MITRE ATLAS coverage

1

Input layer · Query intake

Query volume, semantic similarity, account behavioral fingerprint, injection pattern signatures, rate per account per period

AML.T0024

2

Inference layer · Model output

PII rate in outputs, hallucination score distribution, refusal rate, chain-of-thought trace completeness, output format conformance

AML.T0051

3

Retrieval layer · Vector database

Embedding query clustering, document access frequency, retrieval relevance score drift, cross-namespace access attempts, query timing anomalies

AML.T0024

4

Agent layer · Autonomous actions

Tool call frequency, delegation chain depth, blast radius per agent session, cross-tenant access attempts, spawn rate of child agents, failed authorization rate

AML.T0054

5

Model layer · Behaviour over time

Safety refusal rate on probe sets, capability regression on held-out eval, structured output format deviation, response length distribution shift

AML.T0018

Most organisations monitor only layers 1 and 2. Rate limiting and output scanning are the default monitoring posture for deployed LLMs. Layers 3, 4, and 5 are almost universally absent. The retrieval layer is where VectaX-protected RAG systems need specific monitoring. The agent layer is where AgentID token and delegation chain events need to feed into the security dashboard. The model layer is where DiscoveR's scheduled probes detect drift.

Section 03

Input layer signals

The input layer sits at the front of the AI system: every query passes through it before reaching the model. It is the first place where anomalies can be detected, but also the layer where detection is hardest, because malicious queries are semantically crafted to look benign.

Query rate per account

Total queries per account per hour and per day, normalised to account age. A new account submitting thousands of queries in the first hour is anomalous even if each query looks benign.

Alert: 20x baseline for account age cohort (warn at 5x)

Semantic similarity clustering

Group queries across accounts by embedding similarity. Distinct accounts whose queries cluster together in semantic space may be coordinated. A natural user population has high semantic diversity.

Alert: inter-account similarity above 0.85 cosine

Topic distribution uniformity

Measure the entropy of query topics per account. A legitimate user asks about many topics, so their topic entropy is relatively high. A systematic extractor concentrates on one domain and covers it comprehensively, so its topic entropy is unusually low.

Alert: topic entropy below 25th percentile for cohort

Prompt injection pattern match

Match queries against known injection signatures: role-play overrides, instruction separator injections, indirect injection via retrieved content. Pattern matching has high false positive risk without semantic context.

Alert: match on high-confidence injection signature

Query length distribution

Monitor the distribution of query lengths per account. Systematic extraction campaigns often show unusual length patterns: very long queries that embed entire contexts, or unusually short probes at high volume.

Baseline: log-normal distribution per app type

Repeated identical queries

Track SHA-256 hashes of normalised queries. High rate of identical or near-identical queries from one account or across accounts indicates programmatic extraction rather than natural use.

Alert: same hash appearing 50+ times across accounts
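The topic-entropy signal above can be sketched in a few lines. This is a minimal illustration rather than a production detector: it assumes every query has already been assigned a topic label upstream (for example, a semantic cluster ID), and the function names `topic_entropy` and `low_entropy_accounts` are hypothetical. It flags accounts whose entropy falls below the cohort's 25th percentile.

```python
import math
from collections import Counter
from statistics import quantiles


def topic_entropy(topic_labels):
    """Shannon entropy (in bits) of one account's query topic labels."""
    counts = Counter(topic_labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p * log2(1/p) summed over topics; 0.0 when all queries share one topic.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def low_entropy_accounts(topics_by_account):
    """Return accounts whose topic entropy is below the cohort's 25th percentile."""
    entropies = {
        account: topic_entropy(labels)
        for account, labels in topics_by_account.items()
    }
    if len(entropies) < 2:
        return []  # a percentile needs at least two accounts
    first_quartile = quantiles(list(entropies.values()), n=4)[0]
    return sorted(a for a, h in entropies.items() if h < first_quartile)
```

An account that only ever asks about one topic scores 0 bits, while an account spreading four queries over four topics scores 2 bits. The percentile is computed within whatever cohort you pass in, so grouping accounts by age or plan before calling it is left to the caller.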

Section 04

Inference layer signals

The inference layer is where the model produces output. Security signals at this layer are about what comes out, not what went in. AgentIQ runs inline at this layer, classifying every output before it reaches the user and feeding those classifications into the monitoring pipeline.

PII rate in outputs

Fraction of responses containing detected PII (names, emails, phone numbers, account IDs, medical identifiers). A rising PII rate indicates the model is leaking data it should not include in outputs.

Alert: PII rate above 0.5% of responses per hour

Hallucination score

Per-response score measuring factual inconsistency relative to retrieved context. A sustained increase over a rolling window indicates model drift or that the retrieval pipeline is returning degraded content.

Alert: p75 score rises 20% above 30-day baseline

Safety refusal rate

Fraction of responses that trigger a safety refusal. Both rises and drops are signals: a drop may indicate jailbreak success; a spike may indicate an active injection campaign probing safety boundaries.

Alert: 15% deviation from 7-day rolling mean

Prompt injection detection rate

Per-response classification: was this output affected by a detected injection attempt? Rising detection rate indicates an active injection campaign. Sudden drop after a rise may indicate the attacker found a bypass.

Alert: detection rate above 1% for 15-minute window

Chain-of-thought integrity

For agentic deployments: does the visible reasoning chain align with the stated task and the delegated permissions? A chain that justifies actions outside the token scope indicates a compromised workflow.

Alert: any scope deviation in chain reasoning

Response latency distribution

Sudden latency spikes can indicate increased reasoning effort (jailbreak attempts), resource exhaustion attacks, or a compromised inference pipeline. Track p50, p95, p99 per model and per API endpoint.

Alert: p95 latency 3x above 24-hour baseline
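Several of the inference-layer signals above are rates over a rolling window. The sketch below shows one way to maintain such a rate from per-output boolean flags, using the 0.5% per hour PII threshold as the example. The class name `RollingRate` and the `min_events` guard (which suppresses alerts on tiny samples) are assumptions added for illustration, not part of any product API.

```python
from collections import deque


class RollingRate:
    """Fraction of flagged events inside a sliding time window."""

    def __init__(self, window_seconds, alert_rate, min_events=200):
        self.window_seconds = window_seconds
        self.alert_rate = alert_rate
        self.min_events = min_events  # assumed guard against small-sample noise
        self._events = deque()        # (timestamp, flagged) in arrival order
        self._flagged = 0

    def observe(self, timestamp, flagged):
        """Record one output classification and return the current rate."""
        flagged = bool(flagged)
        self._events.append((timestamp, flagged))
        self._flagged += flagged
        cutoff = timestamp - self.window_seconds
        # Evict events that have aged out of the window.
        while self._events and self._events[0][0] <= cutoff:
            _, old_flag = self._events.popleft()
            self._flagged -= old_flag
        return self.rate()

    def rate(self):
        return self._flagged / len(self._events) if self._events else 0.0

    def should_alert(self):
        return len(self._events) >= self.min_events and self.rate() > self.alert_rate
```

A PII monitor would be created as `RollingRate(window_seconds=3600, alert_rate=0.005)` and fed one `observe(timestamp, pii_detected)` call per output. The same structure fits the injection detection rate with a 900-second window and a 0.01 threshold.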

Section 05

Retrieval layer signals

In a RAG deployment, the retrieval layer is where user queries reach the vector database. This layer has a specific attack surface: an attacker who can observe or manipulate retrieved documents can inject content into the model's context window. Retrieval anomalies are often the first visible signal of an indirect prompt injection attack.

Retrieval relevance drift

Average similarity score between query embedding and retrieved document embeddings. A sustained drop indicates the vector index has been corrupted, poisoned documents have been inserted, or the embedding model has drifted.

Alert: mean relevance drops 15% below 7-day baseline

Document access frequency

Track which documents are retrieved most frequently. A sudden spike in retrieval of one document across many queries may indicate that document has been poisoned to be highly similar to many query patterns.

Alert: one document retrieved in 30%+ of queries

Cross-namespace access attempts

In a multi-tenant RAG deployment, monitor whether queries from one tenant retrieve documents from another tenant's namespace. This should never happen, but without explicit namespace-level logging it is invisible when it does.

Alert: any cross-namespace retrieval event

Embedding query clustering

Group retrieval queries by semantic cluster over a rolling window. Natural retrieval shows organic diversity. Systematic crawling of the vector index produces unusually uniform coverage of the semantic space.

Alert: query coverage entropy well above cohort baseline

New document insertion rate

Track when new documents are added to the vector index and by whom. Unexpected insertions from service accounts or through the retrieval API (rather than the ingestion pipeline) indicate possible poisoning.

Alert: insertion not from approved ingestion pipeline

Retrieval latency spikes

Unusual latency in vector search can indicate a denial-of-service attempt against the retrieval layer, or unusually broad queries that probe the full index rather than a specific topic area.

Alert: p99 retrieval latency 5x above 24h baseline
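The document access frequency signal above reduces to a share calculation over logged document IDs. The following is a hedged sketch, assuming each query's retrieval is logged as a list of document IDs; the function name `dominant_documents` is hypothetical, and the 20% and 30% defaults are the starting-point thresholds from this module.

```python
from collections import Counter


def dominant_documents(retrievals, warn_share=0.20, alert_share=0.30):
    """Flag documents that appear in an unusually large share of queries.

    retrievals: one list of document IDs per query.
    Returns {doc_id: (share, level)} for documents at or above warn_share.
    """
    total_queries = len(retrievals)
    if total_queries == 0:
        return {}
    # Count a document at most once per query, even if it was returned twice.
    hits = Counter(doc for docs in retrievals for doc in set(docs))
    flagged = {}
    for doc_id, count in hits.items():
        share = count / total_queries
        if share >= alert_share:
            flagged[doc_id] = (share, "alert")
        elif share >= warn_share:
            flagged[doc_id] = (share, "warn")
    return flagged
```

Some documents are legitimately popular (a pricing page, a top FAQ entry), so in practice the interesting event is a document whose share jumps relative to its own history, not just one that is high.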

VectaX-protected retrieval layers need different monitoring. When embeddings are encrypted using VectaX, the monitoring signals change. You can still monitor retrieval frequency, latency, and namespace access from audit logs. But you cannot monitor raw embedding content, which is the correct security posture: the VectaX audit trail provides the access-level signal while keeping the semantic content of queries and documents encrypted.

Section 06

Agent layer signals

Agentic AI deployments introduce a new category of monitoring signal: the actions an agent takes, not just the text it produces. An agent that calls a payments API, spawns child agents, reads from a database, and sends emails in one session has a much larger security footprint than a chatbot that produces text. The agent layer requires monitoring the actions, not just the outputs.

Tool call frequency

Number of tool calls per agent session, per session type, and per tool. An agent that calls the payments API 50 times in one session is anomalous even if each individual call is within scope.

Alert: tool calls per session 5x above historical median

Delegation chain depth

How many levels of parent-child agent delegation exist in a session. Deep delegation chains that were not anticipated by the policy design can indicate prompt injection redirecting the top-level agent to spawn unauthorized sub-agents.

Alert: delegation depth exceeds configured maximum

Blast radius per session

Aggregate count of distinct resources touched per agent session: number of distinct customer records, number of distinct API endpoints, number of distinct files. A high blast radius in a short time is a signal.

Alert: distinct resources 10x above session baseline

Cross-tenant access attempts

Agent requests for resources belonging to a different tenant than the one in the token. These should be blocked by AgentID gateway enforcement, but monitoring the attempt rate provides an early attack signal.

Alert: any cross-tenant attempt event

Failed authorization rate

Rate of requests rejected by the AgentID gateway per session. A spike in rejections from one agent instance indicates either a misconfigured policy or an agent that has been redirected by a prompt injection.

Alert: rejection rate above 5% for any session

Token reuse after intended task

Detect token usage after the task that should have consumed it is complete. Short-lived tokens mitigate this risk, but monitoring for late reuse catches token theft or session persistence attacks.

Alert: any token use after task completion event
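Two of the agent-layer signals above, delegation chain depth and blast radius, can be tracked with a small per-session accumulator. This is an illustrative sketch only: the `AgentSession` class and its field names are hypothetical, and it assumes each agent action is reported with a resource identifier and the delegation depth at which it occurred.

```python
from dataclasses import dataclass, field


@dataclass
class AgentSession:
    session_id: str
    max_delegation_depth: int            # configured maximum from policy
    baseline_distinct_resources: float   # historical per-session baseline
    resources: set = field(default_factory=set)
    deepest_delegation: int = 0

    def record_action(self, resource_id, delegation_depth):
        """Register one tool call or resource access made in this session."""
        self.resources.add(resource_id)
        self.deepest_delegation = max(self.deepest_delegation, delegation_depth)

    def alerts(self):
        """Return the names of the thresholds this session has crossed."""
        found = []
        if self.deepest_delegation > self.max_delegation_depth:
            found.append("delegation_depth_exceeded")
        # Blast radius: distinct resources more than 10x the session baseline.
        if len(self.resources) > 10 * self.baseline_distinct_resources:
            found.append("blast_radius_exceeded")
        return found
```

Storing resource identifiers in a set means repeated access to the same record does not inflate the blast radius; repeated calls are covered separately by the tool call frequency signal.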

Section 07

Model layer and drift as a security event

Model drift is normally treated as a performance issue. In a security context, it is also a security event. A model that was safe at deployment may not be safe after a fine-tuning update, after a new jailbreak technique becomes public, or after a backdoor was inserted during an upstream supply chain compromise.

The distinction matters: a model whose refusal rate on a standard probe set drops from 92 percent to 74 percent over two weeks has drifted by 18 percentage points. That is detectable before the model causes harm if you are running the probes. The stakes are high: NIST's evaluation found that a named frontier model responded to 94 percent of malicious requests under common jailbreaking techniques. A deployed model can sit far from its intended safety posture, and only fixed, repeated probes show whether yours is moving in that direction.

Model drift monitoring requires a held-out evaluation set that does not change between measurements. If your evaluation set changes between measurements, you cannot distinguish model drift from evaluation drift.

Safety refusal rate on probe set

Run a fixed set of adversarial probes against the deployed model on a schedule (daily or after each update). Track the refusal rate. A sustained drop is a security signal requiring investigation before the next deployment.

Alert: drop of more than 10 percentage points from baseline

Capability regression

Track performance on a held-out evaluation set across deployments. Unexpected capability drops alongside safety metric drops suggest the model update compromised both safety and utility, which is a supply chain concern.

Alert: more than 5% accuracy drop on any evaluation domain

Structured output conformance

For models deployed with structured output requirements (JSON schema, specific format), track what fraction of outputs conform. Drift in format conformance often indicates a model update has changed the output distribution.

Alert: conformance drops below 95% for 1-hour window

Response length distribution shift

Track the distribution of response lengths over time. A sustained shift to shorter or longer responses often indicates a model update, prompt change, or that a jailbreak is triggering different output paths.

Alert: mean length shifts 25% from 30-day rolling baseline

Post-update comparison scan

Run the full DiscoveR adversarial test suite immediately after every model update, including fine-tuning. Compare pass rates on all attack categories against the previous deployment. Any regression is a blocker for the update.

Alert: any category regression after model update

Jailbreak success rate tracking

Track the rate at which known jailbreak techniques succeed against the deployed model over time. A technique that failed last month and succeeds this month indicates a new vulnerability introduced by an update.

Alert: any previously-failed technique now succeeding
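The probe-set refusal check above is simple arithmetic once probe outcomes are recorded. The sketch below assumes each probe run is stored as a list of booleans (True when the model refused), in the same probe order every time; the function names are hypothetical. It reports the drop in percentage points, matching the 10-point alert threshold.

```python
def refusal_rate(probe_results):
    """Refusal rate as a percentage. probe_results: True if the probe was refused."""
    if not probe_results:
        raise ValueError("probe set is empty")
    return 100.0 * sum(probe_results) / len(probe_results)


def refusal_drop_points(baseline_results, current_results):
    """Drop in refusal rate in percentage points (positive means fewer refusals)."""
    if len(baseline_results) != len(current_results):
        # A changed probe set makes the two runs incomparable.
        raise ValueError("probe set changed between runs")
    return refusal_rate(baseline_results) - refusal_rate(current_results)


# Example from the text: 92 of 100 probes refused at baseline, 74 of 100 now.
baseline = [True] * 92 + [False] * 8
current = [True] * 74 + [False] * 26
drop = refusal_drop_points(baseline, current)
needs_investigation = drop > 10  # alert threshold from this section
```

The length check is a crude stand-in for the fixed-evaluation-set requirement; a stricter version would compare probe identifiers rather than counts.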

Section 08

Distillation attack detection

Distillation attacks extract the reasoning capabilities of a frontier model through large-scale systematic querying. The attacker builds a training dataset of (prompt, reasoning, response) triples from your model, then trains a student model on this dataset. Done at scale, the student model approximates the frontier model's capabilities at a fraction of the development cost.

In February 2026, Anthropic documented over 16 million exchanges generated by three Chinese AI laboratories across roughly 24,000 fake accounts targeting its Claude models. One proxy network managed more than 20,000 simultaneous fraudulent accounts, mixing extraction traffic with legitimate queries to camouflage the operation. OpenAI and Google's Threat Intelligence Group made similar disclosures in the same period.

16M+

exchanges documented by Anthropic, February 2026

24,000

fake accounts used across one documented campaign

20,000+

simultaneous fraudulent accounts in one proxy network

3

named Chinese AI laboratories in Anthropic's February 2026 disclosure

What distillers actually steal: reasoning capability (chain-of-thought traces teach students how to decompose problems and verify intermediate steps), safety properties (distilled models inherit capability but shed safety alignment), and architectural insight (systematic extraction reveals how the model structures its reasoning across domains).

The monitoring challenge: detection approaches that focus on individual queries all fail. Any transformation that preserves a response as useful to a human also preserves its training signal for a student model. The defense must operate at the population level.

Detection approaches compared: what each catches, what it misses, and how effective it is

Per-query anomaly detection

Checking individual queries for signs of extraction.

Catches: obvious probing patterns, known extraction templates

Misses: sophisticated queries designed to look organic; misses the pattern entirely for insiders

Effectiveness: Weak

Rate limiting per account

Capping queries per API key.

Catches: bulk harvesters using few accounts

Misses: distributed campaigns across many accounts; insiders with legitimate volume allowances

Effectiveness: Partial

Account clustering analysis

Group accounts by query pattern similarity.

Catches: coordinated campaigns where accounts show similar query patterns

Misses: a single sophisticated attacker; well-designed query diversity

Effectiveness: Partial

Population-level topic coverage

Track semantic coverage of the model's capability space.

Catches: systematic coverage-oriented extraction campaigns

Misses: targeted domain extraction; random sampling strategies

Effectiveness: Strong

Response similarity clustering

Cluster outputs that look like they come from the same query region.

Catches: accounts that are systematically extracting the same output region

Misses: sophisticated diversified extraction; does not catch novel query paths

Effectiveness: Strong


Section 09

Population-level vs individual signals

The most important insight in AI security monitoring: most attacks are invisible at the individual query level and visible only at the population level. A single distillation query, a single injection attempt that did not succeed, and a single account with slightly higher than normal query volume are all noise. The signal emerges across thousands of queries, across many accounts, over days or weeks.

Why individual inspection fails: what detection actually requires

✖ Individual query inspection

Query: "Explain the steps for solving a constraint satisfaction problem"

Response: Correct, detailed, helpful

Account: Registered user, API key active, no rate limit breach

Network: Normal HTTPS request, no anomalous headers

Classification: Benign. No detection possible at this level.

The same query appears in 1,800 different accounts over 3 days with slight rephrasing. Invisible from this view.

✓ Population-level analysis

Accounts 1-1800: All submit semantically similar constraint-solving queries

Topic coverage: These accounts collectively cover 94% of the model's documented reasoning domains in 3 days

Account age: 78% of accounts are less than 7 days old

Inter-account similarity: Cosine similarity of query embeddings: 0.87 (above the 0.85 alert threshold)

Classification: High-confidence coordinated extraction campaign. Alert triggered.

Building population-level monitoring requires aggregating signals over time windows (rolling 24-hour, 7-day, and 30-day windows for different signal types), across accounts (grouping by account age cohort, IP range, and query embedding cluster), and across the semantic space of the model (tracking which capability regions of the model have been queried and how uniformly).

This kind of monitoring infrastructure is not built into any standard API gateway. It requires custom telemetry pipeline design: query embeddings must be computed and stored (not the queries themselves), account-level aggregates must be maintained in real time, and alert thresholds must be calibrated against the genuine user population baseline before anomalies become meaningful.
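The population-level signals in the example above (inter-account similarity and account age) can be sketched in plain Python. This is an illustration under stated assumptions: embeddings are stored per account as lists of floats, each account is summarised by the centroid of its query embeddings, and all function names are hypothetical. Pairwise comparison is quadratic in the number of accounts, so a real pipeline would likely cluster or sample first.

```python
import math
from itertools import combinations


def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_inter_account_similarity(embeddings_by_account):
    """Mean pairwise cosine similarity between per-account query centroids."""
    centroids = [centroid(vs) for vs in embeddings_by_account.values() if vs]
    pairs = list(combinations(centroids, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def young_account_fraction(age_days_by_account, max_age_days=7):
    """Fraction of accounts younger than max_age_days."""
    if not age_days_by_account:
        return 0.0
    young = sum(1 for age in age_days_by_account.values() if age < max_age_days)
    return young / len(age_days_by_account)
```

Neither number is decisive alone. The example in this section becomes high-confidence because high similarity, young accounts, and broad topic coverage all appear in the same group of accounts.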

Section 10

Privacy-preserving logging

AI security monitoring requires logging, but AI systems process sensitive data that must not appear in logs. The solution is to log security signals rather than content: derived metadata that tells you what happened without telling you what was said.

This also makes the logs more useful for security analysis. A log file full of raw query text is hard to analyse statistically. A log file full of structured fields (hash, embedding cluster, PII flag, hallucination score, injection detected) is directly queryable by a SIEM.

Recommended AI security log schema: what to log instead of query content

query_hash

"sha256:a3f8c..."

SHA-256 of normalized query. Detects repeated identical queries across accounts without storing query text.

query_embedding_cluster

"cluster_047"

Nearest semantic cluster ID, not the raw embedding. Enables topic distribution analysis without storing vectors.

query_length_bucket

"medium_256-512"

Bucketed length range, not exact token count. Detects length distribution anomalies without enabling content reconstruction.

pii_detected

true

Boolean: did AgentIQ detect PII in the output? Not the PII itself. Feeds the PII rate metric.

hallucination_score

0.23

Normalized 0-1 score. Feeds the hallucination rate distribution. Not the output text.

injection_detected

false

Boolean with optional injection_type field. Feeds injection detection rate. Not the injected content.

refusal

false

Boolean: did the model refuse this request? Feeds refusal rate metric. Not the refusal text.

latency_ms

312

Exact latency for this request. Aggregated into p50/p95/p99 in the monitoring pipeline.

token_id_hash

"sha256:b91d..."

Hash of the AgentID token, not the token itself. Links to delegation chain context without exposing credentials.

retrieval_doc_ids

["doc_0442", "doc_1107"]

Document IDs retrieved, not their content. Feeds document access frequency monitoring.

Never log raw query or output text in AI security logs. AI system logs have broad access in most organisations (engineers, security teams, SREs). Raw query text in logs can itself become a data breach if an employee accesses logs for debugging and the queries contain PII. The security signal you need is in the derived fields, not the raw text.
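As a rough sketch of how a record in the schema above might be assembled, the code below builds one log entry from derived fields only. The normalization rules (Unicode NFKC, lowercasing, whitespace collapsing), the bucket boundaries other than `medium_256-512`, and the function names are all assumptions made for illustration; the classification values are taken as inputs from whatever classifier produces them.

```python
import hashlib
import json
import unicodedata

# Illustrative boundaries; only "medium_256-512" appears in the schema above.
LENGTH_BUCKETS = [(256, "short_0-256"), (512, "medium_256-512"), (2048, "long_512-2048")]


def normalize_query(text):
    """Assumed normalization: Unicode NFKC, lowercase, collapsed whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).lower().split())


def query_hash(text):
    digest = hashlib.sha256(normalize_query(text).encode("utf-8")).hexdigest()
    return f"sha256:{digest}"


def length_bucket(token_count):
    for upper, label in LENGTH_BUCKETS:
        if token_count < upper:
            return label
    return "xlong_2048+"


def build_log_record(query_text, token_count, cluster_id, signals, retrieval_doc_ids):
    """Build a log record of derived fields; the query text itself is never stored."""
    return {
        "query_hash": query_hash(query_text),
        "query_embedding_cluster": cluster_id,
        "query_length_bucket": length_bucket(token_count),
        "pii_detected": bool(signals["pii_detected"]),
        "hallucination_score": round(float(signals["hallucination_score"]), 2),
        "injection_detected": bool(signals["injection_detected"]),
        "refusal": bool(signals["refusal"]),
        "latency_ms": int(signals["latency_ms"]),
        "retrieval_doc_ids": list(retrieval_doc_ids),
    }


record = build_log_record(
    "What is the balance on account 4417?", 300, "cluster_047",
    {"pii_detected": True, "hallucination_score": 0.234,
     "injection_detected": False, "refusal": False, "latency_ms": 312},
    ["doc_0442", "doc_1107"],
)
log_line = json.dumps(record)  # one machine-readable line per request
```

One caveat on the design: an unkeyed hash of a short or guessable query can be matched by hashing candidate queries, so a keyed hash such as HMAC-SHA-256 is a common hardening step if that matters for your threat model.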

Section 11

Key metrics and thresholds

A minimal viable AI security monitoring dashboard covers at least one metric per layer with an alert threshold. The thresholds below are starting points. Calibrate against your actual user population baseline before activating alerts: a threshold that is correct for a consumer chatbot will generate constant false positives for a developer API.

Input layer

Query rate (per account, per hour)

Warn: 5x cohort median · Alert: 20x

Inter-account query similarity

Warn: 0.80 cosine · Alert: 0.85

Injection pattern match rate

Alert: any high-confidence match

Inference layer

PII rate in outputs (% of responses)

Warn: 0.2% · Alert: 0.5%

Safety refusal rate (7-day deviation)

Warn: 8% drop · Alert: 15% drop

Injection detection rate (15-min window)

Warn: 0.5% · Alert: 1%

Retrieval layer

Retrieval relevance score (7-day baseline)

Warn: -10% · Alert: -15%

Cross-namespace access attempts

Alert: any event

Single document retrieval share

Warn: 20% · Alert: 30%

Agent and model layers

Tool calls per session

Warn: 3x median · Alert: 5x

Safety refusal rate (probe set)

Warn: -5 percentage points from baseline · Alert: -10 percentage points

Delegation chain depth

Alert: exceeds configured max

Threshold calibration takes time. Set initial thresholds conservatively (high sensitivity, expect false positives) for the first 30 days. Use the false positive rate to tune thresholds toward the actual user population baseline. An alert threshold calibrated against a developer API's baseline will be completely wrong for a consumer chatbot with a different query volume and distribution.
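Because the thresholds are meant to be recalibrated, it helps to keep them as data rather than hard-coding them into alert logic. The sketch below encodes a few of the starting-point values from this section in a table and classifies a metric reading as ok, warn, or alert. The metric names and the `classify` function are hypothetical, and the values should be replaced with your own calibrated baselines.

```python
# metric name: (warn threshold, alert threshold, direction that is bad)
THRESHOLDS = {
    "query_rate_x_cohort_median": (5.0, 20.0, "above"),
    "inter_account_similarity": (0.80, 0.85, "above"),
    "pii_rate_pct": (0.2, 0.5, "above"),
    "injection_detection_rate_pct": (0.5, 1.0, "above"),
    "retrieval_relevance_change_pct": (-10.0, -15.0, "below"),
    "single_doc_retrieval_share_pct": (20.0, 30.0, "above"),
    "tool_calls_x_median": (3.0, 5.0, "above"),
}


def classify(metric, value):
    """Return 'alert', 'warn', or 'ok' for one metric reading."""
    warn, alert, direction = THRESHOLDS[metric]
    if direction == "above":
        if value >= alert:
            return "alert"
        if value >= warn:
            return "warn"
    else:  # lower values are worse, e.g. a relevance score that fell
        if value <= alert:
            return "alert"
        if value <= warn:
            return "warn"
    return "ok"
```

For example, `classify("pii_rate_pct", 0.3)` falls between the warn and alert levels, and `classify("retrieval_relevance_change_pct", -20.0)` is past the alert level. Keeping the table separate from the logic means a recalibration after the first 30 days is a data change, not a code change.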

Section 12

MITRE ATLAS mapping

MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) provides a framework for classifying AI-specific attack techniques. Each technique has a monitoring signal that can be instrumented. Mapping your monitoring coverage to ATLAS tells you which attack techniques you can detect and which you cannot.

AML.T0024

Exfiltration via ML Inference API

Attacker queries the model repeatedly to extract information about training data (membership inference) or to reconstruct the model's capabilities (distillation). Covered in D1 and this module.

Monitor: query volume per account, inter-account semantic similarity, topic coverage entropy, response clustering

AML.T0018

Backdoor ML Model

Attacker inserts a trigger into the model during training or fine-tuning such that inputs containing the trigger produce attacker-chosen outputs. Covered in D4 (federated learning poisoning context).

Monitor: safety refusal rate on fixed probe set after every update, DiscoveR post-update scan, jailbreak success rate tracking

AML.T0010

ML Supply Chain Compromise

Attacker compromises the model update pipeline, a third-party model component, or training data to insert malicious behaviour into a deployed model.

Monitor: post-update comparison scan with DiscoveR, capability regression on held-out eval, unauthorized model file modification events

AML.T0051

LLM Prompt Injection

Attacker embeds instructions in user input or retrieved content to override the model's intended behaviour, redirect agent actions, or exfiltrate context window contents.

Monitor: AgentIQ injection detection rate per output, delegation chain depth anomalies, agent blast radius spikes, refusal rate changes

AML.T0054

LLM Jailbreak

Attacker constructs inputs designed to bypass the model's safety guardrails and produce harmful or unauthorized outputs, often through elaborate role-playing scenarios or instruction override techniques.

Monitor: safety refusal rate drops, DiscoveR jailbreak category pass rates, output toxicity score spikes, chain-of-thought reasoning alignment checks
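One way to turn this mapping into a coverage check is to list the signals each technique needs and subtract the signals you have actually instrumented. The sketch below keys the mapping by technique name and uses hypothetical signal identifiers derived from the "Monitor:" lines above; it is an illustration of the bookkeeping, not an official ATLAS artifact.

```python
ATLAS_SIGNALS = {
    "Exfiltration via ML Inference API": {
        "query_volume_per_account", "inter_account_similarity",
        "topic_coverage_entropy", "response_clustering",
    },
    "Backdoor ML Model": {
        "probe_set_refusal_rate", "post_update_scan", "jailbreak_success_rate",
    },
    "ML Supply Chain Compromise": {
        "post_update_scan", "capability_regression", "model_file_modification_events",
    },
    "LLM Prompt Injection": {
        "injection_detection_rate", "delegation_chain_depth",
        "agent_blast_radius", "refusal_rate",
    },
    "LLM Jailbreak": {
        "refusal_rate", "jailbreak_category_pass_rate",
        "output_toxicity_score", "chain_of_thought_alignment",
    },
}


def coverage_gaps(instrumented):
    """Map each technique to the signals it needs that are not yet instrumented."""
    gaps = {}
    for technique, required in ATLAS_SIGNALS.items():
        missing = required - set(instrumented)
        if missing:
            gaps[technique] = sorted(missing)
    return gaps
```

A team that has only deployed rate limiting and output scanning would pass something like `{"query_volume_per_account", "refusal_rate", "injection_detection_rate"}` and see every technique still listed with gaps, which is the blind-spot picture this section describes.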

Section 13

AgentIQ on the monitoring layer

AgentIQ runs inline at the inference layer, classifying every model output before it reaches the user or triggers a downstream agent action. Each classification produces a structured event record that feeds directly into the security monitoring pipeline described in this module.

The per-output events from AgentIQ are the foundation of inference-layer monitoring. Without them, monitoring the inference layer requires either logging raw outputs (which creates a privacy problem) or building a separate output scanning pipeline (which adds latency and infrastructure). AgentIQ produces the inference-layer monitoring signal as a side effect of its inline enforcement role.

In aggregate, AgentIQ events answer the questions that the inference-layer metrics pose. What fraction of outputs contained PII in the last hour? Has the hallucination score distribution shifted this week? How many outputs have been flagged as injection-affected in the last 15 minutes, and does that amount to an active injection campaign? All of these are derived from AgentIQ's per-output classification stream.

AgentIQ's chain security validation is specifically relevant for agentic monitoring. In multi-step workflows, it checks whether each step in the agent's reasoning chain is consistent with the delegated task and the AgentID token scope. A chain that attempts to justify an out-of-scope action is flagged before the action reaches the Resource Gateway, providing defence in depth: the chain security check catches the problem at the reasoning layer, and the gateway enforces it at the action layer.

Section 14

DiscoveR for model drift monitoring

DiscoveR provides the model-layer monitoring function described in Section 07. It runs structured adversarial tests against your deployed model on a schedule and after model updates, comparing results against the previous scan to detect drift.

The core monitoring workflow: run a DiscoveR baseline scan against the model before deployment. Store the per-category pass rates as the baseline. Run the same scan after every model update and on a weekly schedule. Compare new pass rates against the baseline. Any category where the pass rate has dropped is a potential security regression that blocks the update or triggers an investigation.
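The comparison step in this workflow can be sketched independently of any scanner. The code below assumes scan results have been exported as a plain mapping from attack category to pass rate (in percent); it does not use or represent the DiscoveR API, and the function names are hypothetical.

```python
def regressions(baseline, current, tolerance=0.0):
    """Return {category: drop} for categories whose pass rate fell.

    A category present in the baseline but missing from the new scan maps
    to None, because its status can no longer be verified.
    """
    found = {}
    for category, base_rate in baseline.items():
        if category not in current:
            found[category] = None
            continue
        drop = base_rate - current[category]
        if drop > tolerance:
            found[category] = drop
    return found


def update_allowed(baseline, current):
    """Gate a model update: any category regression blocks it."""
    return not regressions(baseline, current)


baseline_scan = {"jailbreak": 95, "prompt_injection": 90, "pii_leak": 99}
post_update_scan = {"jailbreak": 80, "prompt_injection": 92, "pii_leak": 99}
blocked_by = regressions(baseline_scan, post_update_scan)
```

Comparing per category rather than on the overall average is the point: in this example the prompt injection pass rate improved, which would partly mask the jailbreak regression in an aggregate score.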

The correlation_id feature links scans across remediation cycles. If a DiscoveR scan finds a jailbreak vulnerability and the engineering team deploys a fix, the next scan with the same correlation_id compares only the tests that failed in the previous scan. This confirms that the specific vulnerabilities were fixed, rather than relying on an overall pass rate that could stay constant while new vulnerabilities appear.
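The idea of retesting only previously failed tests can be illustrated with a small sketch. How DiscoveR implements correlation_id internally is not shown here; the test IDs and the mapping of test ID to pass/fail are hypothetical.

```python
def retest_summary(previous_results, retest_results):
    """Compare only the tests that failed in the previous scan.

    Both arguments map a test ID to True (passed) or False (failed).
    A previously failed test that is absent from the retest is treated
    as still failing, because the fix has not been confirmed.
    """
    previously_failed = sorted(
        test_id for test_id, passed in previous_results.items() if not passed
    )
    fixed = [t for t in previously_failed if retest_results.get(t) is True]
    still_failing = [t for t in previously_failed if retest_results.get(t) is not True]
    return {"fixed": fixed, "still_failing": still_failing}


# Hypothetical results from two scans that share one correlation_id.
previous = {"jb-001": False, "jb-002": False, "pii-001": True}
retest = {"jb-001": True, "jb-002": False}
summary = retest_summary(previous, retest)
```

Here summary reports jb-001 as fixed and jb-002 as still failing, regardless of how the overall pass rate moved.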

For continuous monitoring between updates, DiscoveR can be run on a schedule against production endpoints. This catches two things that post-update scanning misses: vulnerabilities introduced by prompt changes (not model changes) and drift that accumulates gradually rather than appearing suddenly after an update.

Section 15

Frequently asked questions

How does AI security monitoring differ from traditional security monitoring?

Traditional security monitoring watches network bytes, access events, and file system changes, on the assumption that the threat is visible in the packet structure. AI security monitoring must also watch the semantic content of queries and outputs, because the attacker's payload is in the language. A prompt injection and a benign query look identical at the network layer. The attacker is often authenticated: distillers use valid API keys, and jailbreak attempts come from paying users. Model drift is also a security event: a model that was safe last month may not be safe today if its refusal rates have dropped. None of these signals is visible to traditional security tools.

What are the five monitoring layers in an AI stack?

Input layer: query volume, semantic similarity clustering, injection pattern detection, account behavioral fingerprinting. Inference layer: PII rate in outputs, hallucination score, safety refusal rate, injection detection, chain-of-thought integrity. Retrieval layer: retrieval relevance drift, document access frequency, cross-namespace access attempts, embedding query clustering. Agent layer: tool call frequency, delegation chain depth, blast radius per session, failed authorization rate. Model layer: safety refusal rate on probe sets, capability regression, structured output conformance, jailbreak success rate tracking. Most organisations monitor only the first two layers, leaving the other three as blind spots.

How do you detect distillation attacks through monitoring?

Distillation attacks cannot be detected by inspecting individual queries. A distiller's query is indistinguishable from a legitimate researcher's query at the individual level. Detection requires population-level statistics: accounts that show above-baseline semantic similarity to each other (coordinated extraction), accounts whose query topic distribution is unusually uniform (systematic coverage), accounts whose query rate is abnormally high relative to account age, and response content that clusters across multiple accounts. Anthropic's February 2026 disclosure documented 16 million exchanges across 24,000 fake accounts. Individual queries looked legitimate. The population pattern did not.
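Two of these population-level signals, topic uniformity and query rate normalized by account age, can be sketched as below. This is an assumption-laden illustration: the topic labels are presumed to come from an upstream classifier, the thresholds are placeholders rather than recommended values, and requiring both signals together is one possible choice.

```python
import math
from collections import Counter


def topic_uniformity(topic_labels, num_topics):
    """Normalized Shannon entropy of an account's query topics, from 0 to 1.

    A value near 1.0 means the queries cover all `num_topics` topics almost
    evenly (systematic coverage); a value near 0 means they concentrate on
    a few topics, which is typical of organic use.
    """
    if num_topics < 2 or not topic_labels:
        return 0.0
    total = len(topic_labels)
    counts = Counter(topic_labels)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(num_topics)


def queries_per_day_of_age(query_count, account_age_days):
    """Query volume normalized by account age, with age floored at one day."""
    return query_count / max(account_age_days, 1)


def flag_accounts(accounts, num_topics, uniformity_threshold=0.9, rate_threshold=1000):
    """Return sorted IDs of accounts that exceed both illustrative thresholds.

    `accounts` maps an account ID to {"topics": [...], "age_days": int},
    where "topics" holds one topic label per query.
    """
    flagged = []
    for account_id, info in accounts.items():
        uniformity = topic_uniformity(info["topics"], num_topics)
        rate = queries_per_day_of_age(len(info["topics"]), info["age_days"])
        if uniformity >= uniformity_threshold and rate >= rate_threshold:
            flagged.append(account_id)
    return sorted(flagged)
```

The cross-account signals (semantic similarity between accounts and response content clustering) need embeddings and are not covered by this sketch.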

How should AI systems log security events without capturing sensitive content?

Log security signals rather than content. Log query hashes (SHA-256 of the normalized query text) instead of query text: this detects repeated identical queries without storing private content. Log output classification labels (PII detected: true, hallucination score: 0.23) instead of output text. Log nearest semantic cluster IDs instead of raw query embeddings. Log bucketed length ranges instead of exact lengths. Use a structured log schema so all fields are directly queryable by SIEM tools without text parsing. Never log raw query or output text: AI system logs are broadly accessible in most organisations, and raw query text in logs can itself become a data breach.
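A minimal sketch of a log record built along these lines, using only the Python standard library. The normalization rules, bucket edges, and field names are assumptions chosen for illustration, not a prescribed schema.

```python
import hashlib
import json
import unicodedata


def normalize_query(text):
    """Assumed normalization: Unicode NFKC, lowercase, collapsed whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).lower().split())


def query_hash(text):
    """SHA-256 hex digest of the normalized query text."""
    return hashlib.sha256(normalize_query(text).encode("utf-8")).hexdigest()


def length_bucket(text, edges=(64, 256, 1024, 4096)):
    """Map the character length to a coarse range such as '64-255'."""
    n = len(text)
    lower = 0
    for edge in edges:
        if n < edge:
            return f"{lower}-{edge - 1}"
        lower = edge
    return f"{lower}+"


def build_log_record(query, pii_detected, hallucination_score,
                     injection_detected, cluster_id):
    """Build a structured record that carries signals but no raw text."""
    return {
        "query_sha256": query_hash(query),
        "query_length_bucket": length_bucket(query),
        "semantic_cluster_id": cluster_id,  # ID only, never the embedding
        "pii_detected": pii_detected,
        "hallucination_score": round(hallucination_score, 2),
        "injection_detected": injection_detected,
    }


record = build_log_record(
    "  What is my ACCOUNT balance? ", False, 0.23, False, "cluster-0042"
)
print(json.dumps(record, sort_keys=True))
```

One caveat on the design: a plain SHA-256 of a short or predictable query can be matched against a dictionary of guessed queries. Where that matters, a keyed hash such as HMAC-SHA-256 with a secret held outside the logging system is a common alternative that still detects repeats.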

What is model drift and why is it a security event?

Model drift is a change in a deployed model's behaviour over time. It is a security event because it can indicate that the model has been poisoned through a compromised update pipeline, that it has been adversarially fine-tuned, or that a new jailbreak technique is reliably bypassing its safety guardrails. A model whose refusal rate on a standard probe set drops from 92% to 74% has drifted by 18 percentage points. This is detectable before harm occurs if you run scheduled probes. Track the safety refusal rate on a fixed probe set, track jailbreak success rates, and run DiscoveR after every model update to catch regressions before they reach production.

How does AgentIQ contribute to AI security monitoring?

AgentIQ runs inline on every model output and produces per-request structured classification events: PII detected in output, hallucination score, prompt injection detected (with injection type), toxicity score, and chain security status for agentic workflows. These per-request events feed aggregate monitoring: a rising PII rate triggers an alert, a sustained increase in hallucination score indicates model drift, and a spike in injection detections indicates an active attack campaign. AgentIQ produces the inference-layer monitoring signal as a side effect of its inline enforcement role, without requiring a separate output scanning pipeline.


AgentIQ instruments the inference layer. DiscoveR monitors the model layer.