Multi-Agent Security Architecture Deep Dive: Implementing Output Sandboxing, Data Flow Diagrams, and Human-in-the-Loop

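The most dangerous security fallacy in multi-agent systems is not "the model will hallucinate" but "I trust my own agents." In reality, the Agent A you wrote calls Agent B, Agent B calls an external API, and the external API's response is fed straight back to Agent A for execution without being sanitised. You think you are writing a collaboration workflow; you are actually opening a backdoor for arbitrary code execution. Nine times out of ten, the insecurity of a multi-agent system is not a problem with the LLM itself but with developers not realising that every link between agents is a hostile network boundary.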

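This article comes at the problem from three angles: how Output Sandboxing stops malicious output from spreading, how a Data Flow Diagram helps you see security holes at design time, and why Human-in-the-Loop is not as simple as "have someone look one more time" but requires designing an effective mechanism for allocating attention.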

Output Sandboxing: Don't Trust Any Agent's Output

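Most people's understanding of Output Sandboxing stops at "validate the JSON schema" or "filter keywords." These are necessary but nowhere near sufficient. Real Output Sandboxing means assuming every agent's output is malicious until it is proven safe.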

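The pattern I use in production is a three-layer sandbox. The first layer is Structural Validation: output must conform to a predefined schema, and that means semantic constraints, not just type checks. For example, if an agent claims it needs to write a file, the path must match a whitelist regex, must not contain .., and must not point to /etc, /proc, or any sensitive location that a symlink could resolve to. The second layer is the Semantic Sandbox: agent output is executed inside a stripped-down runtime. If code has to be executed, for example, don't run it directly in the host Python; spawn a subprocess with no network, no persistent storage, and a memory limit. We use Firecracker microVMs for this layer of isolation: each agent call gets its own short-lived VM, destroyed as soon as the call is done. The third layer is Output Classification: an independent classifier model (it can even be a simple heuristic engine) judges whether the output shows abnormal patterns. Obvious red flags include a sudden attempt to import os or subprocess, or a base64-encoded string in the output that has nothing to do with the task.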
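The path rules in the first layer can be sketched roughly as follows. This is a minimal illustration, not the production implementation: the workspace root `/srv/agent-workspace`, the character allowlist, and the function name `validate_write_path` are all assumptions made for the example.

```python
import os
import re

# Hypothetical policy values, chosen only for illustration.
ALLOWED_ROOT = "/srv/agent-workspace"
ALLOWED_CHARS = re.compile(r"^[A-Za-z0-9_\-./]+$")
DENIED_PREFIXES = ("/etc", "/proc")


def validate_write_path(requested: str, root: str = ALLOWED_ROOT) -> str:
    """Return the resolved path if it is safe to write to, else raise ValueError."""
    if not ALLOWED_CHARS.fullmatch(requested):
        raise ValueError("path contains characters outside the allowlist")
    if ".." in requested.split("/"):
        raise ValueError("path traversal segment '..' is not allowed")

    # realpath follows symlinks, so a link that points outside the
    # workspace is caught by the containment check below.
    root_real = os.path.realpath(root)
    resolved = os.path.realpath(os.path.join(root_real, requested))

    if os.path.commonpath([root_real, resolved]) != root_real:
        raise ValueError("resolved path escapes the workspace root")
    for prefix in DENIED_PREFIXES:
        if resolved == prefix or resolved.startswith(prefix + "/"):
            raise ValueError("resolved path points at a denied location")
    return resolved
```

The check runs on the resolved path rather than the raw string, which is what makes the symlink rule enforceable. Note that a path validated this way can still be swapped for a symlink between the check and the write, so the write itself should happen inside the isolated runtime of the second layer.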

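The latency overhead of these three layers is roughly 50-150ms, which is perfectly acceptable for most agent workflows. The key point is that this sandbox is not optional: it is mandatory middleware on every agent-to-agent call, as natural as the middleware you write for an API gateway.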
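To make the "mandatory middleware" idea concrete, here is one possible sketch that wraps an agent call in a decorator and applies two heuristic checks of the kind the third layer describes. The names `sandboxed`, `OutputRejected`, `check_imports`, and `check_base64`, as well as the 40-character threshold, are assumptions for this example and do not come from any particular agent framework.

```python
import base64
import functools
import re
from typing import Callable, List

# A check takes the agent output and returns a list of findings.
Check = Callable[[str], List[str]]

SUSPICIOUS_IMPORT = re.compile(
    r"^\s*(?:import|from)\s+(os|subprocess)\b", re.MULTILINE
)
# Assumed threshold: runs of 40+ base64 characters are worth a look.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{40,}={0,2}")


class OutputRejected(Exception):
    """Raised when an agent output fails a sandbox check."""


def check_imports(output: str) -> List[str]:
    return [f"imports {m.group(1)}" for m in SUSPICIOUS_IMPORT.finditer(output)]


def check_base64(output: str) -> List[str]:
    findings = []
    for match in BASE64_RUN.finditer(output):
        try:
            base64.b64decode(match.group(0), validate=True)
        except ValueError:  # binascii.Error is a ValueError subclass
            continue
        findings.append("long base64-decodable string")
    return findings


def sandboxed(checks: List[Check]):
    """Decorator: run every check on the wrapped agent's output."""
    def decorator(agent_call: Callable[..., str]) -> Callable[..., str]:
        @functools.wraps(agent_call)
        def wrapper(*args, **kwargs) -> str:
            output = agent_call(*args, **kwargs)
            findings = [f for check in checks for f in check(output)]
            if findings:
                raise OutputRejected("; ".join(findings))
            return output
        return wrapper
    return decorator
```

Any function that returns another agent's output would be decorated with `@sandboxed([check_imports, check_base64])`, so a caller cannot obtain the output without it passing the checks. These heuristics are deliberately crude; they stand in for whatever classifier the third layer actually uses.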

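In practice the biggest challenge is not the technology but the developer mindset. Your agent framework may be LangGraph, CrewAI, or a DAG scheduler you wrote yourself, but underneath the rule is the same: as long as one agent's output can directly influence another agent's execution path, you need a sandbox. No exceptions.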

Data Flow Diagram: Visualising Untrusted Boundaries

Before you say "my system is secure," draw a Data Flow Diagram. Not a formal UML-style diagram, just a simple block diagram: each agent is a box, each arrow is a data flow, each external system is a cylinder. Then take a hard look at it.

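I do this in every design review, and almost every time I see the same problem: a unidirectional arrow has turned into bidirectional trust. You drew Agent A -> Agent B, but Agent B's output can call Agent A's functions? Then that arrow is actually bidirectional, while your threat model only considered one direction. Another common pattern is data flow bypass: you say user input goes through a sanitisation layer, but tracing the flow shows that some agent can call the database directly and skip sanitisation. These things are hard to see in code review, but once drawn they are visible at a glance.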
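Both problems can also be checked mechanically once the diagram is written down as a list of edges. The sketch below is one way to do that under simple assumptions: node names are plain strings, and `bidirectional_pairs` and `bypass_paths` are hypothetical helper names invented for this example.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (source, target)


def bidirectional_pairs(edges: Set[Edge]) -> Set[Tuple[str, ...]]:
    """Pairs of nodes that can each send data to the other."""
    return {
        tuple(sorted((a, b)))
        for a, b in edges
        if a != b and (b, a) in edges
    }


def bypass_paths(
    edges: Set[Edge], source: str, sink: str, guard: str
) -> List[List[str]]:
    """All simple paths from source to sink that never pass through guard."""
    graph: Dict[str, List[str]] = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)

    found: List[List[str]] = []

    def walk(node: str, path: List[str]) -> None:
        if node == guard:
            return  # this path is sanitised, so it is not a bypass
        if node == sink:
            found.append(path)
            return
        for nxt in sorted(graph.get(node, [])):
            if nxt not in path:  # avoid walking in cycles
                walk(nxt, path + [nxt])

    walk(source, [source])
    return found
```

For example, with edges `user -> sanitiser -> agent_a`, `agent_a <-> agent_b`, `agent_b -> database`, and `user -> agent_c -> database`, the first function reports the `agent_a`/`agent_b` pair and the second reports the path through `agent_c` as a sanitisation bypass.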

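Concretely: use Mermaid.js or draw.io and label every agent's input/output ports clearly. Don't write "data"; write the specific type: UserMessage, SQLQuery, FileWriteRequest, ExternalHTTPResponse. These type labels force you to think about what is actually crossing each boundary.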

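Then annotate every arrow with three attributes: Encryption (is it encrypted in transit?), Validation (is there a schema check?), and Sandbox (does the output go through the output sandbox?). Every arrow whose Validation or Sandbox field is empty tells you where you need to patch.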
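If the diagram is also kept as data, the empty fields can be listed automatically. The following is a small sketch of that idea; the `Flow` dataclass, its field names, and the `missing_controls` helper are assumptions for illustration, not part of Mermaid.js or draw.io.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Flow:
    """One arrow in the diagram, with its payload type and controls."""
    source: str
    target: str
    payload_type: str
    encryption: Optional[str] = None  # e.g. "TLS"
    validation: Optional[str] = None  # e.g. "schema check"
    sandbox: Optional[str] = None     # e.g. "microVM"


def missing_controls(flows: Iterable[Flow]) -> List[str]:
    """List every arrow that has at least one empty control field."""
    report = []
    for flow in flows:
        missing = [
            name
            for name in ("encryption", "validation", "sandbox")
            if not getattr(flow, name)
        ]
        if missing:
            report.append(
                f"{flow.source} -> {flow.target} "
                f"[{flow.payload_type}]: missing {', '.join(missing)}"
            )
    return report
```

Running `missing_controls` in CI against the checked-in flow list is one way to keep the diagram from drifting away from the code, though whether that is worth the upkeep depends on how often the flows change.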

In my experience, a DFD with all the labels filled in takes about 30-45 minutes to draw. But those 45 minutes can save you this month's security incident response time. Don't cut corners.

Human-in-the-Loop: Attention Is a Scarce Resource, Don't Waste It

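Human-in-the-Loop (HITL) is the most misunderstood security pattern. Most implementations look like this: "When an agent wants to perform a sensitive operation, send a message to Slack and wait for someone to approve." Then you find that on day one your teammates read carefully, by day three they start blind-approving, and after a week HITL is a pure rubber stamp with no real security value.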

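This phenomenon is called alert fatigue, and its root cause is not human weakness but a flaw in your HITL design. Effective HITL is not about asking people to "look one more time"; it is about designing an interface that lets the reviewer reach the right judgment with low cognitive load.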

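My approach is to split HITL requests into three levels. Level 1, Log Only: low-risk operations (such as read-only queries or API calls to known domains) need no human approval but are logged for audit. Level 2, Structured Approval: a human must approve, but instead of a natural-language description they get a pre-filled form showing all the relevant context: which agent initiated the request, what the target resource is, what the specific action is, and whether similar actions have shown anomalies in the past. The reviewer's job is to scan this form, not to read the agent's reasoning chain. Level 3, Constrained Execution: the human does not approve directly but picks from a few predefined options. For example, if an agent wants to delete a database record, the Level 3 options are "1) don't delete, 2) soft delete, 3) only write an audit log." This constraint forces the user to decide within a safe range.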
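A routing function for the three levels might look like the sketch below. The action sets, the rule that any prior anomaly escalates a read to Level 2, and the names `ActionRequest`, `classify`, and `build_review` are assumptions for illustration; a real policy would be driven by the risk labels on the DFD.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict


class Tier(Enum):
    LOG_ONLY = 1
    STRUCTURED_APPROVAL = 2
    CONSTRAINED_EXECUTION = 3


@dataclass(frozen=True)
class ActionRequest:
    agent: str
    action: str
    target: str
    prior_anomalies: int = 0  # anomalies seen for similar past actions


# Hypothetical policy tables.
READ_ONLY_ACTIONS = {"read", "query"}
DESTRUCTIVE_ACTIONS = {"delete", "drop"}


def classify(request: ActionRequest) -> Tier:
    if request.action in DESTRUCTIVE_ACTIONS:
        return Tier.CONSTRAINED_EXECUTION
    if request.action in READ_ONLY_ACTIONS and request.prior_anomalies == 0:
        return Tier.LOG_ONLY
    return Tier.STRUCTURED_APPROVAL


def build_review(request: ActionRequest) -> Dict[str, Any]:
    """Build the structured payload shown to the reviewer (or just logged)."""
    tier = classify(request)
    review: Dict[str, Any] = {
        "tier": tier.name,
        "agent": request.agent,
        "action": request.action,
        "target": request.target,
        "prior_anomalies": request.prior_anomalies,
    }
    if tier is Tier.LOG_ONLY:
        review["needs_human"] = False
    elif tier is Tier.STRUCTURED_APPROVAL:
        review["needs_human"] = True
        review["options"] = ["approve", "reject"]
    else:
        # Level 3: the reviewer can only pick a predefined safe outcome.
        review["needs_human"] = True
        review["options"] = ["keep_record", "soft_delete", "audit_log_only"]
    return review
```

Note that the Level 3 payload contains no option for a hard delete at all, which is the point of the constraint: the unsafe choice is not something the reviewer can click by accident.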

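The core insight of this three-level scheme: humans should not do verification, humans should make decisions. Verification is the machine's job. Checking the schema, checking the path whitelist, checking the anomaly score: all of that can be automated. Humans should be involved in decisions that depend on business context, trade-offs, and exceptional cases.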

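Production numbers: after we implemented the three-level HITL, our false positive approval rate (that is, a human approving a request that should have been rejected) dropped from 37% to 4%. Not because people got smarter, but because the interface pushes them to do the right thing.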

Tying the Three Together: A Concrete Architecture

Output Sandboxing, the Data Flow Diagram, and Human-in-the-Loop are not three independent practices but three facets of the same security architecture.

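My current standard stack works like this. Early in development, use a DFD to map out all data flows and trust boundaries; this stage surfaces roughly 60% of potential problems. Then add an output sandbox at every boundary: structural validation runs at the agent framework level, the semantic sandbox runs at the container/VM level, and classification runs as a sidecar. Finally, for high-risk flows (the arrows marked red in the DFD), add the three-level HITL so that every high-risk operation crossing a boundary has a human making a decision rather than doing verification.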

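This stack has been running for more than a year and has handled agent calls on the order of millions, with zero real security incidents. Not because we are clever, but because we assume we will definitely make mistakes and then use architecture to contain them.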

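To founders and developers in Hong Kong, my advice is: don't wait for an incident before doing security. The bigger your multi-agent system grows, the more expensive it is to patch. Before the next sprint starts, draw a DFD, add an output sandbox, and redesign your HITL flow. It takes less than a week, and what you save is not only security incidents but also developer productivity, because you no longer have to worry that an agent you wrote will blow up the whole system.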