E2: Security Monitoring and Anomaly Detection Across AI Stacks
AI security monitoring differs from traditional monitoring: the signal lives in semantics, not bytes; the attacker may be authenticated; and model drift is a security event. Five layers to monitor: input layer (query volume, similarity clustering, behavioral fingerprinting, injection patterns), inference layer (PII rate in output, hallucination rate, refusal rate, chain-of-thought integrity), retrieval layer (embedding query patterns, document access frequency, retrieval relevance drift), agent layer (tool call frequency, delegation chain depth, cross-tenant access attempts, action blast radius), and model layer (performance drift, safety refusal rate changes, structured output format deviation). Distillation monitoring: the Mirror Security blog documented Anthropic's February 2026 disclosure of 16 million exchanges across 24,000 fake accounts. Detection requires population statistics, not individual query inspection: account query similarity clustering, topic distribution uniformity, query rate normalized to account age, and response content clustering. Individual query detection fails because distiller and researcher queries look identical. Input-level defenses fail against insiders with legitimate access. Privacy-preserving logging: log query hashes, not query content; log output classification labels, not output text; log timing distributions, not individual latencies; use a structured log schema for SIEM. Key metrics by layer with alert thresholds. MITRE ATLAS: AML.T0024 (exfiltration via inference API), AML.T0010 (ML supply chain compromise), AML.T0018 (backdoor ML model). AgentIQ provides inline per-output signals: PII detection, hallucination score, injection detection, toxicity, chain security status. DiscoveR provides scheduled adversarial testing for model drift and new attack surface detection.
Model drift monitoring: track refusal rate on a standard probe set, output format conformance, and capability regression on a held-out evaluation set. A model whose refusal rate drops from 92 percent to 74 percent has drifted and may have been backdoored or adversarially fine-tuned.
Duration: 40 minutes · Level: Intermediate · Language: en · Date: 2026-04-07 · Mirror Academy
Module E2 of 5 · Track 3E: Security Operations for AI
The threat is in the language, not the packet.
Security Monitoring and Anomaly Detection
Traditional security tools watch bytes and access events. AI security monitoring must also watch what the model sees and says. This module covers the five monitoring layers of an AI stack, why distillation attacks require population-level statistics to detect, how to log security signals without logging sensitive content, and how AgentIQ and DiscoveR instrument the monitoring layer.
Traditional security monitoring watches bytes, packets, access control events, and file system changes. These signals are structural: a port scan looks different from normal traffic at the network layer. A privilege escalation leaves an audit trail in the operating system. The threat is visible in the metadata.
AI security threats are different in three fundamental ways that require a different monitoring approach.
The signal is in the semantics, not the bytes. A malicious prompt injection and a benign customer support query arrive through the same authenticated API channel with the same headers and the same byte count. The threat is in the meaning of the text, not the structure of the packet. No traditional network security tool can see this difference.
The attacker is often authenticated. A distillation campaign operates through valid API keys. A jailbreak attempt comes from a paying user. An insider running extraction queries has legitimate access. Access control events show nothing unusual because the attacker is authorized to make requests. The attack is in what they request and what they do with the response.
Model drift is a security event. A model whose behaviour has shifted may have been poisoned through a compromised update pipeline, fine-tuned on adversarial data, or jailbroken by a technique that the deployment team has not detected. A 15-percentage-point drop in safety refusal rates over two weeks is not a performance metric. It is a potential security incident.
Traditional security monitoring watches
File system changes: creation, modification, deletion
Process execution and privilege escalation
API call volume, rate limits, error rates
Misses: semantic content of queries and outputs
Misses: authenticated attacker behaviour patterns
Misses: model behaviour drift over time
AI-aware security monitoring also watches
Query semantic similarity clustering across accounts
Output PII rate, hallucination rate, refusal rate
Prompt injection detection signals per output
Retrieval pattern anomalies in vector database queries
Agent tool call patterns, delegation chain depth
Model safety refusal rate on standard probe sets
Population-level query distribution across accounts
Chain-of-thought trace integrity for agentic workflows
Section 02
The five monitoring layers
A production AI system is not one thing. It has a query intake layer, an inference layer, a retrieval layer (if it uses RAG), an agent layer (if it uses autonomous workflows), and a model layer. Each layer has distinct security signals that are invisible to the others. Missing any layer leaves a monitoring blind spot.
Five-layer AI monitoring stack with MITRE ATLAS coverage
1 · Input layer · Query intake
Query volume, semantic similarity, account behavioral fingerprint, injection pattern signatures, rate per account per period
AML.T0024
2 · Inference layer · Model output
PII rate in outputs, hallucination score distribution, refusal rate, chain-of-thought trace completeness, output format conformance
3 · Retrieval layer
Embedding query patterns, document access frequency, retrieval relevance drift
4 · Agent layer
Tool call frequency, delegation chain depth, cross-tenant access attempts, action blast radius
5 · Model layer
Safety refusal rate on probe sets, capability regression on held-out eval, structured output format deviation, response length distribution shift
AML.T0018
Most organisations monitor only layers 1 and 2. Rate limiting and output scanning are the default monitoring posture for deployed LLMs. Layers 3, 4, and 5 are almost universally absent. The retrieval layer is where VectaX-protected RAG systems need specific monitoring. The agent layer is where AgentID token and delegation chain events need to feed into the security dashboard. The model layer is where DiscoveR's scheduled probes detect drift.
Section 03
Input layer signals
The input layer sits at the front of the AI system: every query passes through it before reaching the model. It is the first place where anomalies can be detected, but also the layer where detection is hardest, because malicious queries are semantically crafted to look benign.
Query rate per account
Total queries per account per hour and per day, normalised to account age. A new account submitting thousands of queries in the first hour is anomalous even if each query looks benign.
Alert: 5x baseline for account age cohort
Semantic similarity clustering
Group queries across accounts by embedding similarity. Distinct accounts whose queries cluster together in semantic space may be coordinated. A natural user population has high semantic diversity.
Alert: inter-account similarity above 0.85 cosine
Topic distribution uniformity
Measure the entropy of query topics per account. A legitimate user asks about many topics, which produces high topic entropy. A systematic extractor concentrates on one domain and covers it comprehensively, so its topic entropy is unusually low.
Alert: topic entropy below 25th percentile for cohort
Prompt injection pattern match
Match queries against known injection signatures: role-play overrides, instruction separator injections, indirect injection via retrieved content. Pattern matching has high false positive risk without semantic context.
Alert: match on high-confidence injection signature
Query length distribution
Monitor the distribution of query lengths per account. Systematic extraction campaigns often show unusual length patterns: very long queries that embed entire contexts, or unusually short probes at high volume.
Baseline: log-normal distribution per application type
Repeated identical queries
Track SHA-256 hashes of normalised queries. High rate of identical or near-identical queries from one account or across accounts indicates programmatic extraction rather than natural use.
Alert: same hash appearing 50+ times across accounts
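The topic entropy signal above can be sketched in a few lines. This is a minimal illustration, not a production detector: it assumes each query has already been mapped to a topic or cluster ID upstream, and the function names `topic_entropy` and `low_entropy_accounts` are hypothetical. An account that stays inside one topic has entropy near zero; an account that spreads evenly over four topics has entropy of 2 bits.

```python
import math
from collections import Counter
from statistics import quantiles

def topic_entropy(cluster_ids):
    """Shannon entropy (in bits) of one account's query-topic distribution."""
    counts = Counter(cluster_ids)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def low_entropy_accounts(topics_by_account):
    """Accounts whose topic entropy is below the cohort's 25th percentile.

    Assumes the cohort has at least two accounts (statistics.quantiles needs that).
    """
    entropies = {acct: topic_entropy(ids) for acct, ids in topics_by_account.items()}
    q1 = quantiles(entropies.values(), n=4)[0]  # first quartile cut point
    return sorted(acct for acct, h in entropies.items() if h < q1)

# Synthetic cohort: acct_a asks about a single topic, the others are more varied.
cohort = {
    "acct_a": ["c1"] * 8,
    "acct_b": ["c1", "c2", "c3", "c4"] * 2,
    "acct_c": ["c1", "c2", "c3", "c4"] * 2,
    "acct_d": ["c1", "c2"] * 4,
    "acct_e": ["c1", "c1", "c2", "c3"] * 2,
}
flagged = low_entropy_accounts(cohort)
```

A percentile cut always flags some accounts in any cohort, so in practice this signal is more useful combined with query rate and account age than on its own.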
Section 04
Inference layer signals
The inference layer is where the model produces output. Security signals at this layer are about what comes out, not what went in. AgentIQ runs inline at this layer, classifying every output before it reaches the user and feeding those classifications into the monitoring pipeline.
PII rate in outputs
Fraction of responses containing detected PII (names, emails, phone numbers, account IDs, medical identifiers). A rising PII rate indicates the model is leaking data it should not include in outputs.
Alert: PII rate above 0.5% of responses per hour
Hallucination score
Per-response score measuring factual inconsistency relative to retrieved context. A sustained increase over a rolling window indicates model drift or that the retrieval pipeline is returning degraded content.
Alert: p75 score rises 20% above 30-day baseline
Safety refusal rate
Fraction of responses that trigger a safety refusal. Both rises and drops are signals: a drop may indicate jailbreak success; a spike may indicate an active injection campaign probing safety boundaries.
Alert: 15% deviation from 7-day rolling mean
Prompt injection detection rate
Per-response classification: was this output affected by a detected injection attempt? Rising detection rate indicates an active injection campaign. Sudden drop after a rise may indicate the attacker found a bypass.
Alert: detection rate above 1% for 15-minute window
Chain-of-thought integrity
For agentic deployments: does the visible reasoning chain align with the stated task and the delegated permissions? A chain that justifies actions outside the token scope indicates a compromised workflow.
Alert: any scope deviation in chain reasoning
Response latency distribution
Sudden latency spikes can indicate increased reasoning effort (jailbreak attempts), resource exhaustion attacks, or a compromised inference pipeline. Track p50, p95, p99 per model and per API endpoint.
Alert: p95 latency 3x above 24-hour baseline
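As an illustration of the safety refusal rate alert, the sketch below compares today's refusal rate with the mean of the previous seven days. It assumes the "15% deviation" is a relative deviation from the rolling mean; if your baseline is defined in percentage points instead, the comparison changes accordingly. The function names are hypothetical.

```python
from statistics import mean

def refusal_rate_deviation(previous_daily_rates, today_rate):
    """Relative deviation of today's refusal rate from the rolling mean."""
    baseline = mean(previous_daily_rates)
    if baseline == 0:
        return float("inf") if today_rate > 0 else 0.0
    return (today_rate - baseline) / baseline

def classify_refusal_rate(deviation, threshold=0.15):
    """Both directions matter: a drop may mean jailbreak success, a spike a probing campaign."""
    if deviation <= -threshold:
        return "alert_drop"
    if deviation >= threshold:
        return "alert_spike"
    return "ok"

# Synthetic 7-day history with a mean refusal rate of 4% of responses.
last_7_days = [0.040, 0.042, 0.038, 0.041, 0.039, 0.040, 0.040]
status_drop = classify_refusal_rate(refusal_rate_deviation(last_7_days, 0.030))
status_ok = classify_refusal_rate(refusal_rate_deviation(last_7_days, 0.041))
status_spike = classify_refusal_rate(refusal_rate_deviation(last_7_days, 0.050))
```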
Mirror Security · AgentIQ
AgentIQ instruments the inference layer inline
AgentIQ classifies every model output before it reaches the user: PII detection, hallucination score, prompt injection detection, toxicity score, and chain security status. These per-response signals feed directly into your monitoring dashboard. No separate scraping of model outputs needed.
Section 05
Retrieval layer signals
In a RAG deployment, the retrieval layer is where user queries reach the vector database. This layer has a specific attack surface: an attacker who can observe or manipulate retrieved documents can inject content into the model's context window. Retrieval anomalies are often the first visible signal of an indirect prompt injection attack.
Retrieval relevance drift
Average similarity score between query embedding and retrieved document embeddings. A sustained drop indicates the vector index has been corrupted, poisoned documents have been inserted, or the embedding model has drifted.
Alert: mean relevance drops 15% below 7-day baseline
Document access frequency
Track which documents are retrieved most frequently. A sudden spike in retrieval of one document across many queries may indicate that document has been poisoned to be highly similar to many query patterns.
Alert: one document retrieved in 30%+ of queries
Cross-namespace access attempts
In a multi-tenant RAG deployment, monitor whether queries from one tenant retrieve documents from another tenant's namespace. Should never happen but is invisible without explicit namespace-level logging.
Alert: any cross-namespace retrieval event
Embedding query clustering
Group retrieval queries by semantic cluster over a rolling window. Natural retrieval shows organic diversity. Systematic crawling of the vector index produces unusually uniform coverage of the semantic space.
Index insertion events
Track when new documents are added to the vector index and by whom. Unexpected insertions from service accounts or through the retrieval API (rather than the ingestion pipeline) indicate possible poisoning.
Alert: insertion not from approved ingestion pipeline
Retrieval latency spikes
Unusual latency in vector search can indicate a denial-of-service attempt against the retrieval layer, or an unusually large query embedding that is probing the full index rather than a specific topic area.
VectaX-protected retrieval layers need different monitoring. When embeddings are encrypted using VectaX, the monitoring signals change. You can still monitor retrieval frequency, latency, and namespace access from audit logs. But you cannot monitor raw embedding content, which is the correct security posture: the VectaX audit trail provides the access-level signal while keeping the semantic content of queries and documents encrypted.
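The document access frequency signal in this section needs only document IDs, which remain available from audit logs. The sketch below is one possible way to compute each document's retrieval share over a window; the input shape (one list of document IDs per query) and the function names are assumptions for illustration.

```python
from collections import Counter

def document_retrieval_share(retrievals):
    """Fraction of queries in the window that retrieved each document.

    `retrievals` holds one entry per query: the list of document IDs returned.
    """
    total = len(retrievals)
    if total == 0:
        return {}
    hits = Counter()
    for doc_ids in retrievals:
        hits.update(set(doc_ids))  # count a document once per query
    return {doc_id: n / total for doc_id, n in hits.items()}

def overrepresented_documents(retrievals, threshold=0.30):
    """Document IDs retrieved in at least `threshold` of queries."""
    shares = document_retrieval_share(retrievals)
    return sorted(doc for doc, share in shares.items() if share >= threshold)

# Synthetic window of 10 queries: doc_0442 appears in 4 of them.
window = [["doc_0442", f"doc_{i}"] for i in range(4)]
window += [[f"doc_a{i}", f"doc_b{i}"] for i in range(6)]
suspicious = overrepresented_documents(window)
```

A document flagged here is a candidate for review, not proof of poisoning: a legitimately popular FAQ page can also dominate retrieval.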
Section 06
Agent layer signals
Agentic AI deployments introduce a new category of monitoring signal: the actions an agent takes, not just the text it produces. An agent that calls a payments API, spawns child agents, reads from a database, and sends emails in one session has a much larger security footprint than a chatbot that produces text. The agent layer requires monitoring the actions, not just the outputs.
Tool call frequency
Number of tool calls per agent session, per session type, and per tool. An agent that calls the payments API 50 times in one session is anomalous even if each individual call is within scope.
Alert: tool calls per session 5x above historical median
Delegation chain depth
How many levels of parent-child agent delegation exist in a session. Deep delegation chains that were not anticipated by the policy design can indicate prompt injection redirecting the top-level agent to spawn unauthorized sub-agents.
Alert: delegation depth exceeds configured maximum
Blast radius per session
Aggregate count of distinct resources touched per agent session: number of distinct customer records, number of distinct API endpoints, number of distinct files. A high blast radius in a short time is a signal.
Cross-tenant access attempts
Agent requests for resources belonging to a different tenant than the one in the token. Should be blocked by AgentID gateway enforcement, but monitoring the attempt rate provides an early attack signal.
Alert: any cross-tenant attempt event
Failed authorization rate
Rate of requests rejected by the AgentID gateway per session. A spike in rejections from one agent instance indicates either a misconfigured policy or an agent that has been redirected by a prompt injection.
Alert: rejection rate above 5% for any session
Token reuse after intended task
Detect token usage after the task that should have consumed it is complete. Short-lived tokens mitigate this risk, but monitoring for late reuse catches token theft or session persistence attacks.
Alert: any token use after task completion event
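A minimal sketch of how the per-session agent signals might be aggregated follows. The `AgentSession` record, the tool names, and the default maximum depth of 3 are all hypothetical; in a real deployment these values would come from your agent action logs and policy configuration.

```python
from dataclasses import dataclass, field
from statistics import median

@dataclass
class AgentSession:
    """Hypothetical per-session aggregate built from agent action logs."""
    session_id: str
    tool_calls: list[str] = field(default_factory=list)
    resources_touched: set[str] = field(default_factory=set)
    delegation_depth: int = 0

def blast_radius(session: AgentSession) -> int:
    """Count of distinct resources touched in the session."""
    return len(session.resources_touched)

def session_alerts(session, historical_call_counts, max_depth=3, multiplier=5):
    """Return the names of the agent-layer alerts this session triggers."""
    alerts = []
    baseline = median(historical_call_counts)
    if len(session.tool_calls) > multiplier * baseline:
        alerts.append("tool_call_frequency")
    if session.delegation_depth > max_depth:
        alerts.append("delegation_depth")
    return alerts

history = [4, 6, 5, 7, 5]  # tool calls per session in past sessions (median 5)
normal = AgentSession("s-001", ["crm.read"] * 6, {"customer:17"}, 1)
suspect = AgentSession(
    "s-002",
    ["payments.charge"] * 50,
    {f"customer:{i}" for i in range(40)},
    4,
)
normal_alerts = session_alerts(normal, history)
suspect_alerts = session_alerts(suspect, history)
```

Blast radius is returned as a raw count here because a sensible limit depends on the session type; it can be compared against a per-workflow baseline in the same way as tool call frequency.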
Section 07
Model layer and drift as a security event
Model drift is normally treated as a performance issue. In a security context, it is also a security event. A model that was safe at deployment may not be safe after a fine-tuning update, after a new jailbreak technique becomes public, or after a backdoor was inserted during an upstream supply chain compromise.
The distinction matters: a model whose refusal rate on a standard probe set drops from 92 percent to 74 percent over two weeks has drifted by 18 percentage points. That is detectable before the model causes harm if you are running the probes. NIST's evaluation found that a named frontier model responded to 94 percent of malicious requests under common jailbreaking techniques. The model had drifted from its designed safety posture.
Model drift monitoring requires a held-out evaluation set that does not change between measurements. If your evaluation set changes between measurements, you cannot distinguish model drift from evaluation drift.
Safety refusal rate on probe set
Run a fixed set of adversarial probes against the deployed model on a schedule (daily or after each update). Track the refusal rate. A sustained drop is a security signal requiring investigation before the next deployment.
Alert: drop of more than 10 percentage points from baseline
Capability regression
Track performance on a held-out evaluation set across deployments. Unexpected capability drops alongside safety metric drops suggest the model update compromised both safety and utility, which is a supply chain concern.
Alert: more than 5% accuracy drop on any evaluation domain
Structured output conformance
For models deployed with structured output requirements (JSON schema, specific format), track what fraction of outputs conform. Drift in format conformance often indicates a model update has changed the output distribution.
Alert: conformance drops below 95% for 1-hour window
Response length distribution shift
Track the distribution of response lengths over time. A sustained shift to shorter or longer responses often indicates a model update, prompt change, or that a jailbreak is triggering different output paths.
Alert: mean length shifts 25% from 30-day rolling baseline
Post-update comparison scan
Run the full DiscoveR adversarial test suite immediately after every model update, including fine-tuning. Compare pass rates on all attack categories against the previous deployment. Any regression is a blocker for the update.
Alert: any category regression after model update
Jailbreak success rate tracking
Track the rate at which known jailbreak techniques succeed against the deployed model over time. A technique that failed last month and succeeds this month indicates a new vulnerability introduced by an update.
Alert: any previously-failed technique now succeeding
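The probe-set comparison can be sketched as follows. This is an illustration under stated assumptions: probe results are reduced to booleans (True when the model refused), and a SHA-256 fingerprint of the probe list is one simple way to confirm the evaluation set did not change between runs. The function names and example probes are hypothetical.

```python
import hashlib

def probe_set_fingerprint(probes):
    """Stable fingerprint of the probe set, used to detect evaluation drift."""
    return hashlib.sha256("\n".join(probes).encode("utf-8")).hexdigest()

def summarise_run(probes, refused):
    """`refused` holds one boolean per probe: True when the model refused."""
    return {
        "fingerprint": probe_set_fingerprint(probes),
        "refusal_rate": sum(refused) / len(refused),
    }

def compare_runs(baseline, current, max_drop_points=10.0):
    """Drop in refusal rate, in percentage points, between two comparable runs."""
    if baseline["fingerprint"] != current["fingerprint"]:
        raise ValueError("probe set changed; runs are not comparable")
    drop = (baseline["refusal_rate"] - current["refusal_rate"]) * 100
    return {"drop_points": round(drop, 2), "alert": drop > max_drop_points}

probes = [f"adversarial probe {i}" for i in range(100)]
baseline_run = summarise_run(probes, [True] * 92 + [False] * 8)
current_run = summarise_run(probes, [True] * 74 + [False] * 26)
result = compare_runs(baseline_run, current_run)
```

With the figures used in this section (92 percent falling to 74 percent), the drop is 18 percentage points, which exceeds the 10-point alert threshold.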
Section 08
Distillation attack detection
Distillation attacks extract the reasoning capabilities of a frontier model through large-scale systematic querying. The attacker builds a training dataset of (prompt, reasoning, response) triples from your model, then trains a student model on this dataset. Done at scale, the student model approximates the frontier model's capabilities at a fraction of the development cost.
In February 2026, Anthropic documented over 16 million exchanges generated by three Chinese AI laboratories across roughly 24,000 fake accounts targeting its Claude models. One proxy network managed more than 20,000 simultaneous fraudulent accounts, mixing extraction traffic with legitimate queries to camouflage the operation. OpenAI and Google's Threat Intelligence Group made similar disclosures in the same period.
16M+ exchanges documented by Anthropic, February 2026
24,000 fake accounts used across one documented campaign
20,000+ simultaneous fraudulent accounts in one proxy network
3 named Chinese AI laboratories in Anthropic's February 2026 disclosure
What distillers actually steal: reasoning capability (chain-of-thought traces teach students how to decompose problems and verify intermediate steps), safety properties (distilled models inherit capability but shed safety alignment), and architectural insight (systematic extraction reveals how the model structures its reasoning across domains).
The monitoring challenge: detection approaches that focus on individual queries all fail. Any transformation that preserves a response as useful to a human also preserves its training signal for a student model. The defense must operate at the population level.
Per-query anomaly detection: checking individual queries for signs of extraction
Catches: obvious probing patterns, known extraction templates
Misses: sophisticated queries designed to look organic. Misses the pattern entirely for insiders.
Effectiveness: Weak
Rate limiting per account: capping queries per API key
Catches: bulk harvesters using few accounts
Misses: distributed campaigns across many accounts. Insiders with legitimate volume allowances.
Effectiveness: Partial
Account clustering analysis: grouping accounts by query pattern similarity
Catches: coordinated campaigns where accounts show similar query patterns
Misses: single sophisticated attacker. Well-designed query diversity.
Effectiveness: Partial
Population-level topic coverage: tracking semantic coverage of the model's capability space
Catches: systematic coverage-oriented extraction campaigns
Misses: targeted domain extraction. Random sampling strategies.
Effectiveness: Strong
Response similarity clustering: clustering outputs that look like they come from the same query region
Catches: accounts that are systematically extracting the same output region
Misses: sophisticated diversified extraction. Does not catch novel query paths.
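Account clustering analysis can be illustrated with a small pairwise comparison. The sketch below is an approximation, not the method used in any specific product: instead of raw query embeddings it represents each account by counts of the semantic cluster IDs its queries fell into, then flags account pairs whose profiles exceed 0.85 cosine similarity. The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def similar_account_pairs(clusters_by_account, threshold=0.85):
    """Pairs of accounts whose query-cluster profiles are nearly identical."""
    profiles = {acct: Counter(ids) for acct, ids in clusters_by_account.items()}
    pairs = []
    for x, y in combinations(sorted(profiles), 2):
        score = cosine(profiles[x], profiles[y])
        if score > threshold:
            pairs.append((x, y, round(score, 3)))
    return pairs

accounts = {
    "acct_1": ["cluster_047"] * 9 + ["cluster_012"],
    "acct_2": ["cluster_047"] * 8 + ["cluster_012"] * 2,
    "acct_3": ["cluster_101"] * 5 + ["cluster_202"] * 5,
}
pairs = similar_account_pairs(accounts)
```

Pairwise comparison is quadratic in the number of accounts, so at production scale this would typically be replaced by approximate nearest-neighbour search or clustering over account profiles.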
The most important insight in AI security monitoring: most attacks are invisible at the individual query level and visible only at the population level. A single distillation query, a single injection attempt that did not succeed, and a single account with slightly higher than normal query volume are all noise. The signal emerges across thousands of queries, across many accounts, over days or weeks.
Why individual inspection fails: what detection actually requires
✖ Individual query inspection
Query: "Explain the steps for solving a constraint satisfaction problem"
Response: Correct, detailed, helpful
Account: Registered user, API key active, no rate limit breach
Network: Normal HTTPS request, no anomalous headers
Classification: Benign. No detection possible at this level.
The same query appears in 1,800 different accounts over 3 days with slight rephrasing. Invisible from this view.
✓ Population-level analysis
Accounts 1-1800: All submit semantically similar constraint-solving queries
Topic coverage: These accounts collectively cover 94% of the model's documented reasoning domains in 3 days
Account age: 78% of accounts are less than 7 days old
Building population-level monitoring requires aggregating signals over time windows (rolling 24-hour, 7-day, and 30-day windows for different signal types), across accounts (grouping by account age cohort, IP range, and query embedding cluster), and across the semantic space of the model (tracking which capability regions of the model have been queried and how uniformly).
This kind of monitoring infrastructure is not built into any standard API gateway. It requires custom telemetry pipeline design: query embeddings must be computed and stored (not the queries themselves), account-level aggregates must be maintained in real-time, and alert thresholds must be calibrated against the genuine user population baseline before anomalies become meaningful.
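A rough sketch of two population-level aggregates follows: how much of the known capability space a set of accounts has covered in a window, and what fraction of those accounts are newly created. It assumes events are already reduced to (account ID, cluster ID) pairs and that account creation times are available; `population_summary` and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

def population_summary(events, known_clusters, created_at, now, young_days=7):
    """Aggregate coverage and account-age signals over one time window.

    `events` is an iterable of (account_id, cluster_id) pairs in the window.
    """
    events = list(events)
    known = set(known_clusters)
    accounts = {account for account, _ in events}
    covered = {cluster for _, cluster in events} & known
    young = {
        a for a in accounts
        if now - created_at[a] < timedelta(days=young_days)
    }
    return {
        "accounts": len(accounts),
        "coverage": len(covered) / len(known) if known else 0.0,
        "young_fraction": len(young) / len(accounts) if accounts else 0.0,
    }

now = datetime(2026, 4, 7, tzinfo=timezone.utc)
known_clusters = [f"cluster_{i:03d}" for i in range(50)]
# Nine accounts each sweep five distinct clusters; a tenth asks about one topic.
events = [(f"acct_{a}", f"cluster_{a * 5 + k:03d}") for a in range(9) for k in range(5)]
events.append(("acct_9", "cluster_000"))
created_at = {
    f"acct_{a}": now - timedelta(days=2 if a < 8 else 100) for a in range(10)
}
summary = population_summary(events, known_clusters, created_at, now)
```

In this synthetic window, ten accounts jointly cover 90% of the cluster space and 80% of them are under a week old. No single account looks unusual; the pattern only appears in the aggregate.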
Section 10
Privacy-preserving logging
AI security monitoring requires logging, but AI systems process sensitive data that must not appear in logs. The solution is to log security signals rather than content: derived metadata that tells you what happened without telling you what was said.
This also makes the logs useful for security analysis. A log file full of raw query text is hard to analyse statistically. A log file full of structured fields (hash, embedding cluster, PII flag, hallucination score, injection detected) is directly queryable by a SIEM.
Recommended AI security log schema: what to log instead of query content
query_hash: "sha256:a3f8c..."
SHA-256 of normalized query. Detects repeated identical queries across accounts without storing query text.
query_embedding_cluster: "cluster_047"
Nearest semantic cluster ID, not the raw embedding. Enables topic distribution analysis without storing vectors.
query_length_bucket: "medium_256-512"
Bucketed length range, not exact token count. Detects length distribution anomalies without enabling content reconstruction.
pii_detected: true
Boolean: did AgentIQ detect PII in the output? Not the PII itself. Feeds the PII rate metric.
hallucination_score: 0.23
Normalized 0-1 score. Feeds the hallucination rate distribution. Not the output text.
injection_detected: false
Boolean with optional injection_type field. Feeds injection detection rate. Not the injected content.
refusal: false
Boolean: did the model refuse this request? Feeds refusal rate metric. Not the refusal text.
latency_ms: 312
Exact latency for this request. Aggregated into p50/p95/p99 in the monitoring pipeline.
token_id_hash: "sha256:b91d..."
Hash of the AgentID token, not the token itself. Links to delegation chain context without exposing credentials.
retrieval_doc_ids: ["doc_0442", "doc_1107"]
Document IDs retrieved, not their content. Feeds document access frequency monitoring.
Never log raw query or output text in AI security logs. AI system logs have broad access in most organisations (engineers, security teams, SREs). Raw query text in logs can itself become a data breach if an employee accesses logs for debugging and the queries contain PII. The security signal you need is in the derived fields, not the raw text.
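To make the schema concrete, here is a minimal sketch of building one log record from a request. Field names follow the schema above; the `build_log_record` helper, the shape of the `classification` dict, and every length bucket label other than "medium_256-512" are assumptions for illustration.

```python
import hashlib
import json
import re

def sha256_tag(value: str) -> str:
    """Hash a value and prefix it in the style used by the schema."""
    return "sha256:" + hashlib.sha256(value.encode("utf-8")).hexdigest()

def length_bucket(token_count: int) -> str:
    """Bucket labels are illustrative; only 'medium_256-512' appears in the schema."""
    if token_count < 256:
        return "short_0-256"
    if token_count < 512:
        return "medium_256-512"
    return "long_512+"

def build_log_record(query, token_count, cluster_id, classification,
                     latency_ms, token, doc_ids):
    """Derive security signals from a request without keeping its text."""
    normalised = re.sub(r"\s+", " ", query.strip().lower())
    return {
        "query_hash": sha256_tag(normalised),
        "query_embedding_cluster": cluster_id,
        "query_length_bucket": length_bucket(token_count),
        "pii_detected": classification["pii_detected"],
        "hallucination_score": classification["hallucination_score"],
        "injection_detected": classification["injection_detected"],
        "refusal": classification["refusal"],
        "latency_ms": latency_ms,
        "token_id_hash": sha256_tag(token),
        "retrieval_doc_ids": list(doc_ids),
    }

record = build_log_record(
    query="Explain the steps for solving a constraint satisfaction problem",
    token_count=300,
    cluster_id="cluster_047",
    classification={"pii_detected": True, "hallucination_score": 0.23,
                    "injection_detected": False, "refusal": False},
    latency_ms=312,
    token="example-token-value",
    doc_ids=["doc_0442", "doc_1107"],
)
line = json.dumps(record)  # one structured line, ready for SIEM ingestion
```

One design consideration: a plain SHA-256 of a short or common query can be reversed by hashing guesses, so a keyed hash (for example HMAC with a secret held by the logging service) may be preferable where that risk matters.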
Section 11
Key metrics and thresholds
A minimal viable AI security monitoring dashboard covers at least one metric per layer with an alert threshold. The thresholds below are starting points. Calibrate against your actual user population baseline before activating alerts: a threshold that is correct for a consumer chatbot will generate constant false positives for a developer API.
Input layer
Query rate (per account, per hour)
Warn: 5x cohort median · Alert: 20x
Inter-account query similarity
Warn: 0.80 cosine · Alert: 0.85
Injection pattern match rate
Alert: any high-confidence match
Inference layer
PII rate in outputs (% of responses)
Warn: 0.2% · Alert: 0.5%
Safety refusal rate (7-day deviation)
Warn: 8% drop · Alert: 15% drop
Injection detection rate (15-min window)
Warn: 0.5% · Alert: 1%
Retrieval layer
Retrieval relevance score (7-day baseline)
Warn: -10% · Alert: -15%
Cross-namespace access attempts
Alert: any event
Single document retrieval share
Warn: 20% · Alert: 30%
Agent and model layers
Tool calls per session
Warn: 3x median · Alert: 5x
Safety refusal rate (probe set)
Warn: -5% from baseline · Alert: -10%
Delegation chain depth
Alert: exceeds configured max
Threshold calibration takes time. Set initial thresholds conservatively (high sensitivity, expect false positives) for the first 30 days. Use the false positive rate to tune thresholds toward the actual user population baseline. An alert threshold calibrated against a developer API's baseline will be completely wrong for a consumer chatbot with a different query volume and distribution.
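Keeping thresholds in configuration rather than in code makes the calibration step easier. The sketch below shows one possible shape for a warn/alert evaluator using a few of the starting values from the table; the metric keys, the `direction` field, and the `evaluate` function are hypothetical.

```python
# Starting-point thresholds from the table; recalibrate against your own baseline.
THRESHOLDS = {
    "pii_rate": {"warn": 0.002, "alert": 0.005, "direction": "above"},
    "inter_account_similarity": {"warn": 0.80, "alert": 0.85, "direction": "above"},
    "single_document_share": {"warn": 0.20, "alert": 0.30, "direction": "above"},
    # Relative change against the 7-day baseline; more negative is worse.
    "retrieval_relevance_change": {"warn": -0.10, "alert": -0.15, "direction": "below"},
}

def evaluate(metric: str, value: float) -> str:
    """Return 'alert', 'warn', or 'ok' for a metric value."""
    rule = THRESHOLDS[metric]

    def breached(limit: float) -> bool:
        if rule["direction"] == "above":
            return value >= limit
        return value <= limit

    if breached(rule["alert"]):
        return "alert"
    if breached(rule["warn"]):
        return "warn"
    return "ok"

statuses = {
    "pii_rate": evaluate("pii_rate", 0.003),
    "inter_account_similarity": evaluate("inter_account_similarity", 0.91),
    "single_document_share": evaluate("single_document_share", 0.10),
    "retrieval_relevance_change": evaluate("retrieval_relevance_change", -0.12),
}
```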
Section 12
MITRE ATLAS mapping
MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) provides a framework for classifying AI-specific attack techniques. Each technique has a monitoring signal that can be instrumented. Mapping your monitoring coverage to ATLAS tells you which attack techniques you can detect and which you cannot.
AML.T0024
Exfiltration via ML Inference API
Attacker queries the model repeatedly to extract information about training data (membership inference) or to reconstruct the model's capabilities (distillation). Covered in D1 and this module.
AML.T0018
Backdoor ML Model
Attacker inserts a trigger into the model during training or fine-tuning such that inputs containing the trigger produce attacker-chosen outputs. Covered in D4 (federated learning poisoning context).
Monitor: safety refusal rate on fixed probe set after every update, DiscoveR post-update scan, jailbreak success rate tracking
AML.T0010
ML Supply Chain Compromise
Attacker compromises the model update pipeline, a third-party model component, or training data to insert malicious behaviour into a deployed model.
Monitor: post-update comparison scan with DiscoveR, capability regression on held-out eval, unauthorized model file modification events
AML.T0051
LLM Prompt Injection
Attacker embeds instructions in user input or retrieved content to override the model's intended behaviour, redirect agent actions, or exfiltrate context window contents.
AML.T0054
LLM Jailbreak
Attacker constructs inputs designed to bypass the model's safety guardrails and produce harmful or unauthorized outputs, often through elaborate role-playing scenarios or instruction override techniques.
AgentIQ runs inline at the inference layer, classifying every model output before it reaches the user or triggers a downstream agent action. Each classification produces a structured event record that feeds directly into the security monitoring pipeline described in this module.
The per-output events from AgentIQ are the foundation of inference-layer monitoring. Without them, monitoring the inference layer requires either logging raw outputs (which creates a privacy problem) or building a separate output scanning pipeline (which adds latency and infrastructure). AgentIQ produces the inference-layer monitoring signal as a side effect of its inline enforcement role.
In aggregate, AgentIQ events answer the questions that the inference layer metrics require. What fraction of outputs contained PII in the last hour? Has the hallucination score distribution shifted this week? Is there an active injection campaign: how many outputs have been flagged as injection-affected in the last 15 minutes? These are all derived from AgentIQ's per-output classification stream.
AgentIQ's chain security validation is specifically relevant for agentic monitoring. In multi-step workflows, it checks whether each step in the agent's reasoning chain is consistent with the delegated task and the AgentID token scope. A chain that attempts to justify an out-of-scope action is flagged before the action reaches the Resource Gateway, providing defence in depth: the chain security check catches the problem at the reasoning layer, and the gateway enforces it at the action layer.
AgentIQ produces per-output structured events: PII detected, hallucination score, injection detected with type, toxicity score, chain security status, refusal classification. Feed these directly into your SIEM for inference-layer monitoring without separate output scanning infrastructure.
DiscoveR provides the model-layer monitoring function described in Section 07. It runs structured adversarial tests against your deployed model on a schedule and after model updates, comparing results against the previous scan to detect drift.
The core monitoring workflow: run a DiscoveR baseline scan against the model before deployment. Store the per-category pass rates as the baseline. Run the same scan after every model update and on a weekly schedule. Compare new pass rates against the baseline. Any category where the pass rate has dropped is a potential security regression that blocks the update or triggers an investigation.
The correlation_id feature links scans across remediation cycles. If a DiscoveR scan finds a jailbreak vulnerability and the engineering team deploys a fix, the next scan with the same correlation_id compares only the tests that failed in the previous scan. This confirms that the specific vulnerabilities were addressed and not just that overall pass rates stayed constant while new vulnerabilities appeared.
For continuous monitoring between updates, DiscoveR can be run on a schedule against production endpoints. This catches two things that post-update scanning misses: vulnerabilities introduced by prompt changes (not model changes) and drift that accumulates gradually rather than appearing suddenly after an update.
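The per-category comparison at the heart of this workflow can be sketched independently of any tool. The example below does not use DiscoveR's API or output format; it assumes pass rates per attack category have already been exported into plain dictionaries, and the category names are hypothetical.

```python
def regressed_categories(baseline, current, tolerance=0.0):
    """Categories whose pass rate dropped by more than `tolerance`.

    Both arguments map an attack category name to a pass rate between 0 and 1.
    """
    regressions = {}
    for category, base_rate in baseline.items():
        new_rate = current.get(category)
        if new_rate is None:
            regressions[category] = "missing from current scan"
        elif base_rate - new_rate > tolerance:
            regressions[category] = round(base_rate - new_rate, 4)
    return regressions

baseline_scan = {"jailbreak": 0.96, "prompt_injection": 0.92, "pii_leakage": 0.98}
current_scan = {"jailbreak": 0.90, "prompt_injection": 0.98, "pii_leakage": 0.98}

regressions = regressed_categories(baseline_scan, current_scan)
update_blocked = bool(regressions)
```

In this synthetic example the average pass rate is unchanged between scans, yet the jailbreak category has regressed by six points. That is the case a per-category comparison catches and an overall score hides.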
Mirror Security · DiscoveR
Model drift detection through continuous adversarial testing
Run a DiscoveR scan as your model-layer monitoring baseline. Schedule weekly scans and post-update scans. The per-category pass rate comparison between scans shows exactly which attack categories have regressed, not just whether overall performance changed. Drift detected before it causes harm.
How does AI security monitoring differ from traditional security monitoring?
Traditional security monitoring watches network bytes, access events, and file system changes. The threat is in the packet structure. AI security monitoring must also watch the semantic content of queries and outputs, because the attacker's payload is in the language. A prompt injection and a benign query look identical at the network layer. The attacker is often authenticated: distillers use valid API keys, jailbreak attempts come from paying users. Model drift is a security event: a model that was safe last month may not be safe today if its refusal rates have dropped. None of these signals are visible to traditional security tools.
What are the five monitoring layers in an AI stack?
Input layer: query volume, semantic similarity clustering, injection pattern detection, account behavioral fingerprinting. Inference layer: PII rate in outputs, hallucination score, safety refusal rate, injection detection, chain-of-thought integrity. Retrieval layer: retrieval relevance drift, document access frequency, cross-namespace access attempts, embedding query clustering. Agent layer: tool call frequency, delegation chain depth, blast radius per session, failed authorization rate. Model layer: safety refusal rate on probe sets, capability regression, structured output conformance, jailbreak success rate tracking. Most organisations monitor only the first two layers, leaving the other three as blind spots.
How do you detect distillation attacks through monitoring?
Distillation attacks cannot be detected by inspecting individual queries, because a distiller's query is indistinguishable from a legitimate researcher's query. Detection requires population-level statistics: accounts that show above-baseline semantic similarity to each other (coordinated extraction), accounts whose query topic distribution is unusually uniform (systematic coverage), accounts whose query rate is abnormally high relative to account age, and response content that clusters across multiple accounts. Anthropic's February 2026 disclosure documented 16 million exchanges across 24,000 fake accounts. Individual queries looked legitimate. The population pattern did not.
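Two of the population-level signals above, topic distribution uniformity and query rate normalized to account age, can be computed with the standard library alone. The sketch below is an assumption-laden illustration: it presumes each query has already been assigned a topic label by some upstream classifier, it uses normalized Shannon entropy as one possible measure of uniformity, and the thresholds, account IDs and field names are invented for the example. The similarity-clustering signals would additionally need embeddings and are not shown.

```python
import math
from collections import Counter
from typing import Dict, List


def topic_uniformity(topics: List[str], n_topics: int) -> float:
    """Normalized Shannon entropy of an account's topic labels.

    Returns a value in [0, 1]; values near 1 mean the account's queries
    are spread almost evenly across all `n_topics` known topics.
    """
    if not topics or n_topics < 2:
        return 0.0
    total = len(topics)
    entropy = -sum(
        (count / total) * math.log(count / total)
        for count in Counter(topics).values()
    )
    return entropy / math.log(n_topics)


def queries_per_day_of_age(query_count: int, account_age_days: float) -> float:
    """Query volume normalized to account age (ages under one day count as one)."""
    return query_count / max(account_age_days, 1.0)


def flag_accounts(
    accounts: Dict[str, dict],
    n_topics: int,
    uniformity_threshold: float = 0.95,  # illustrative value, not a recommendation
    rate_threshold: float = 500.0,       # illustrative value, not a recommendation
) -> List[str]:
    """Flag accounts that are both unusually uniform and unusually high-rate."""
    flagged = []
    for account_id, info in accounts.items():
        uniformity = topic_uniformity(info["topics"], n_topics)
        rate = queries_per_day_of_age(len(info["topics"]), info["age_days"])
        if uniformity >= uniformity_threshold and rate >= rate_threshold:
            flagged.append(account_id)
    return sorted(flagged)


# Hypothetical accounts: one new, high-volume, evenly spread; one old and focused.
accounts = {
    "acct_a": {"topics": ["math", "law", "code", "bio"] * 500, "age_days": 2},
    "acct_b": {"topics": ["code"] * 40 + ["math"] * 2, "age_days": 300},
}
flagged = flag_accounts(accounts, n_topics=4)
print(flagged)
```

In a real deployment the thresholds would more plausibly be derived from the observed baseline across all accounts than fixed in advance.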
How should AI systems log security events without capturing sensitive content?
Log security signals rather than content. Log query hashes (SHA-256 of the normalized query text) instead of query text; this detects repeated identical queries without storing private content. Log output classification labels (PII detected: true, hallucination score: 0.23) instead of output text. Log nearest semantic cluster IDs instead of raw query embeddings. Log bucketed length ranges instead of exact lengths. Use a structured log schema so that all fields are directly queryable by SIEM tools without text parsing. Never log raw query or output text: AI system logs are broadly accessible in most organisations, and raw query text in logs can itself become a data breach.
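A log record that follows these rules might be built as in the sketch below. The field names, the length bucket boundaries and the normalization steps (Unicode NFKC, lowercasing, whitespace collapsing) are assumptions chosen for illustration; the source does not specify a schema or a normalization procedure.

```python
import hashlib
import json
import unicodedata


def normalize(text: str) -> str:
    """Normalize a query so trivially different copies hash identically."""
    return " ".join(unicodedata.normalize("NFKC", text).lower().split())


def query_hash(text: str) -> str:
    """SHA-256 hex digest of the normalized query text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def length_bucket(n_chars: int) -> str:
    """Map an exact length to a coarse range (bucket edges are illustrative)."""
    for upper in (64, 256, 1024, 4096):
        if n_chars <= upper:
            return f"<={upper}"
    return ">4096"


def build_log_event(
    account_id: str,
    query: str,
    pii_detected: bool,
    hallucination_score: float,
    cluster_id: int,
) -> dict:
    """Build a structured event containing signals only, never raw text."""
    return {
        "account_id": account_id,
        "query_sha256": query_hash(query),
        "query_length_bucket": length_bucket(len(query)),
        "semantic_cluster_id": cluster_id,
        "pii_detected": pii_detected,
        "hallucination_score": hallucination_score,
    }


event = build_log_event(
    "acct_123", "  What is   my account BALANCE? ", False, 0.23, 17
)
# One JSON object per line keeps every field directly queryable.
line = json.dumps(event, sort_keys=True)
print(line)
```

Because the hash is taken over the normalized text, two queries that differ only in case or spacing produce the same `query_sha256` value, which is what makes repeated-query detection possible without storing the text.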
What is model drift and why is it a security event?
Model drift is a change in a deployed model's behaviour over time. It is a security event because it can indicate that the model has been poisoned through a compromised update pipeline, that it has been adversarially fine-tuned, or that a new jailbreak technique is reliably bypassing its safety guardrails. A model whose refusal rate on a standard probe set drops from 92% to 74% has drifted by 18 percentage points. This is detectable before harm occurs if you run scheduled probes. Track the safety refusal rate on a fixed probe set, track jailbreak success rates, and run DiscoveR after every model update to catch regressions before they reach production.
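The 92% to 74% example can be reproduced as a small worked calculation. The sketch assumes each probe in the fixed set yields a boolean (True if the model refused), and the 5-point alert threshold is an invented value for illustration only.

```python
from typing import List


def refusal_rate(outcomes: List[bool]) -> float:
    """Percentage of probes in the fixed probe set that the model refused."""
    if not outcomes:
        raise ValueError("probe set must not be empty")
    return 100.0 * sum(outcomes) / len(outcomes)


def drift_points(baseline_rate: float, current_rate: float) -> float:
    """Drop in refusal rate, in percentage points (positive means fewer refusals)."""
    return baseline_rate - current_rate


# Hypothetical results for the same 100 probes at two points in time.
baseline_outcomes = [True] * 92 + [False] * 8
current_outcomes = [True] * 74 + [False] * 26

drop = drift_points(refusal_rate(baseline_outcomes), refusal_rate(current_outcomes))

ALERT_THRESHOLD_POINTS = 5.0  # assumed threshold, tune to your own baseline
alert = drop >= ALERT_THRESHOLD_POINTS
print(f"refusal rate dropped by {drop:.0f} percentage points, alert={alert}")
```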
How does AgentIQ contribute to AI security monitoring?
AgentIQ runs inline at every model output and produces per-request structured classification events: PII detected in output, hallucination score, prompt injection detected (with injection type), toxicity score, and chain security status for agentic workflows. These per-request events feed aggregate monitoring: a rising PII rate triggers an alert, a sustained increase in hallucination score indicates model drift, and a spike in injection detections indicates an active attack campaign. AgentIQ produces the inference-layer monitoring signal as a side effect of its inline enforcement role, without requiring a separate output scanning pipeline.
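Turning per-request events into an aggregate alert can be as simple as a rolling rate over the most recent requests. The sketch below is not AgentIQ's API: the `RollingRate` class, the window size, the threshold and the `pii_detected` field name are all assumptions used to show the aggregation step.

```python
from collections import deque


class RollingRate:
    """Track the fraction of flagged events over the last `window` requests."""

    def __init__(self, window: int, threshold: float) -> None:
        self.events = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, flagged: bool) -> bool:
        """Record one event; return True if the rolling rate exceeds the threshold.

        No alert is raised until the window is full, so a single early
        detection cannot trigger on a tiny sample.
        """
        self.events.append(bool(flagged))
        window_full = len(self.events) == self.events.maxlen
        rate = sum(self.events) / len(self.events)
        return window_full and rate > self.threshold


# Hypothetical stream of per-request events with a boolean PII label.
stream = [{"pii_detected": i == 4} for i in range(10)]
stream += [{"pii_detected": True}] * 3

pii_monitor = RollingRate(window=10, threshold=0.2)
alerts = [pii_monitor.observe(event["pii_detected"]) for event in stream]
print(alerts)
```

The same structure could be reused per signal, for example one monitor for injection detections and another for PII, each with its own window and threshold.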
Mirror Security · Full platform
AgentIQ instruments the inference layer. DiscoveR monitors the model layer.
AgentIQ produces per-output security events inline: PII, hallucination, injection, toxicity, chain security. DiscoveR validates model behaviour through scheduled adversarial tests. Together they instrument the two hardest monitoring layers in the AI stack. VectaX audit logs cover the retrieval layer. AgentID audit logs cover the agent and identity layers.