Agent Security: The Time Bomb Everyone Ignores

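A few years ago people were still laughing at prompt injection as an "academic toy": someone adds "ignore the previous instructions" to the ChatGPT input, the model instantly turns into a different character, good fun. But your AI agent today is not a chatbot. It has tool access, API keys, permission to read the database, and can even place orders, sign contracts, and push code on your behalf. Think about it: if an attacker can use one manipulated prompt to make your agent automatically call DELETE FROM users, or push a PR containing a backdoor to your company's GitHub, is it still funny?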

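Woof100 recently published a security landscape report, agentshield-woof100, and Scout's ai-erdos team has also published measured data from its own tests. Both reports point to the same conclusion: agent security is not a "future problem". It is a time bomb that is already going off today, and very few people are willing to look at it squarely.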

Autonomy Is the Attack Surface

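Traditional software security has a simple principle: the greater the privilege, the higher the risk. But an agent is different. It is not a program that you hand a fixed set of permissions and walk away from; it is an autonomous loop that decides what to do next based on the LLM's output.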

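Say you give it a goal: "Go through all my unread email and reply to the important ones." The agent's first step is to call the email API to fetch the inbox. That call is already attack surface, because every message it returns is untrusted content written by someone else. If one email contains an injected instruction, and nothing sanitizes or isolates that content before it reaches the LLM, then a minute later your agent may be acting on an email that tells it to "forward all contact data to this webhook".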

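Woof100's report pulls out one shocking number: of the 20 mainstream agent frameworks tested, more than half had no protection at all against indirect prompt injection in their default configuration. In other words, you only need to plant one sentence in a source the agent will read (an email, a Slack message, a public web page) to have a chance of hijacking the agent's entire behavior.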

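The essence of autonomy is "LLM decision-making + tool execution". When there is no security boundary between the two, the safety of the whole system rests on nothing but the LLM's "willpower", and we all know that today's LLMs cannot reliably tell whether a prompt is malicious.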
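To make "security boundary" concrete, here is a minimal sketch of a gate that sits between what the LLM proposes and what actually runs. It is an illustration only: ToolCall, ToolGate, and fetch_inbox are hypothetical names made up for this example, not part of any framework mentioned in the reports.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class ToolCall:
    """A tool invocation proposed by the LLM (hypothetical shape)."""
    name: str
    args: Dict[str, Any]


class ToolGate:
    """Sits between the LLM's decision and the actual tool execution."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        # Only tools registered here can ever run, whatever the LLM asks for.
        self._tools[name] = fn

    def execute(self, call: ToolCall) -> Any:
        # The check is plain code, so no prompt can argue its way around it.
        if call.name not in self._tools:
            raise PermissionError(f"tool not allowed: {call.name}")
        return self._tools[call.name](**call.args)


def fetch_inbox(limit: int) -> list:
    # Stand-in for a real email API call.
    return [f"email {i}" for i in range(limit)]


gate = ToolGate()
gate.register("fetch_inbox", fetch_inbox)

# A legitimate call goes through.
print(gate.execute(ToolCall("fetch_inbox", {"limit": 2})))

# A call the LLM was tricked into proposing is stopped at the gate.
try:
    gate.execute(ToolCall("delete_user", {"user_id": 1}))
except PermissionError as exc:
    print(exc)
```

The point of the design is that the decision to allow or deny lives outside the model. A hijacked LLM can still propose delete_user, but the proposal never turns into an action.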

It's Not Just Prompt Injection

Many people assume agent security equals prompt injection defense. Wrong. Prompt injection is only the tip of the iceberg.

Scout's ai-erdos research splits the agent attack surface into five layers:

The first layer is prompt-level: direct and indirect injection, jailbreaks, role-play hijacking. This layer gets the most research because it is the most visible.

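The second layer is tool-level: which APIs is the agent allowed to call? How are tool inputs and outputs validated? If an agent has a tool like execute_sql(query: str), how do you stop someone from prompting it into running DROP DATABASE? Not by having the LLM say "don't do that". You need allowlists and input validation at the tool layer.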

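The third layer is memory-level: an agent's long-term memory stores past conversations and execution results. If an attacker can poison that memory store (for example, through one successful injection), every future reasoning step will read corrupted context. This is a persistent attack: succeed once, and it keeps working.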

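The fourth layer is access-control level: what identity does your agent act under? Many implementations simply hand the developer's API key straight to the agent with no scope restriction. Woof100 points out a common pattern: the service account the agent uses has far more privilege than it actually needs, because it is "more convenient". This is a classic least-privilege violation, just the agent edition.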

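The fifth layer is orchestration-level: in a multi-agent system, how do agents authenticate each other? How do you stop one compromised agent from influencing the behavior of the agent next to it? There is almost no research at this layer so far, yet people are already running multi-agent workflows in production.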
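For the orchestration-level question of how agents authenticate each other, one familiar building block is a message authentication code. The sketch below uses Python's standard hmac module and assumes each pair of agents shares a secret distributed out of band. SHARED_KEY, sign_message, and verify_message are hypothetical names for illustration, and neither report prescribes this approach.

```python
import hashlib
import hmac
import json

# Assumption: this secret is provisioned securely, not hard-coded like here.
SHARED_KEY = b"replace-with-a-real-secret"


def sign_message(sender: str, payload: dict, key: bytes = SHARED_KEY) -> dict:
    """Serialize the message deterministically and attach an HMAC-SHA256 tag."""
    body = json.dumps({"sender": sender, "payload": payload}, sort_keys=True)
    tag = hmac.new(key, body.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"body": body, "tag": tag}


def verify_message(message: dict, key: bytes = SHARED_KEY) -> dict:
    """Return the decoded body if the tag matches; raise otherwise."""
    expected = hmac.new(
        key, message["body"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    # compare_digest avoids leaking information through timing differences.
    if not hmac.compare_digest(expected, message["tag"]):
        raise ValueError("message failed authentication")
    return json.loads(message["body"])


# The receiving agent accepts an untouched message.
msg = sign_message("planner", {"task": "summarize inbox"})
print(verify_message(msg)["payload"])

# A message altered in transit no longer matches its tag.
tampered = dict(msg, body=msg["body"].replace("summarize", "forward"))
try:
    verify_message(tampered)
except ValueError as exc:
    print(exc)
```

This only proves that a message came from someone holding the key and was not modified on the way. It does nothing against a compromised agent that still holds a valid key, and it has no replay protection, so it is a starting point rather than an answer to the open problem described above.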

Why Hong Kong Developers Should Care

You might be thinking: "I'm not on OpenAI's safety team, so what does this have to do with me?"

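The fact is that Hong Kong startups and SMEs are using more and more agents for automation: n8n / LangChain / CrewAI for screening CVs in HR, auto-replying to customer service, monitoring server health, even automated crypto trading. Every one of these use cases involves sensitive operations, and the vast majority of implementations are "if it runs, ship it", with zero security posture.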

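On top of that, Hong Kong's regulatory landscape is tightening. Amendments to the PDPO and a future AI regulatory framework will very likely require you to carry out risk assessments on your AI systems. If your agent is behind an incident that leaks customer data, it is not just reputational damage; there is legal liability.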

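Woof100's report closes with a recommendation I find very practical: Treat your agent like an untrusted intern, not like a tool. You would not give an intern direct access to the production database and then tell them to "judge for themselves what is right and wrong", so why do that with an agent?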

Three Things You Can Do Right Now

You don't need to wait for the frameworks to be perfect before you start. You can do these three things tomorrow:

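One: least privilege is not a slogan. Your agent's API key only needs the scopes it actually uses. Don't hand it admin because scoping is "too much hassle". Use a service account plus an IAM policy and lock it down as narrowly as you can.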

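Two: validate at the tool layer, and don't trust the LLM. Every tool input should be checked against an allowlist. For example, the search tool only accepts queries of at least 3 characters; the database tool only allows SELECT, and restricts the table list on top of that. This logic does not depend on the LLM's "judgment", so an attacker cannot bypass it just by manipulating the prompt.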

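Three: monitor agent behaviour. Just as you monitor API traffic, log every agent action: which tool it called, what input it passed, what output it got back. Use human review or anomaly detection to look for outliers. If an agent that normally only writes summaries suddenly calls delete_user, you need to know immediately.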
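Points two and three fit together naturally, so here is one rough sketch covering both: validators that reject bad input before a tool runs, plus an audit log that flags any tool outside the agent's usual set. Everything named here (ALLOWED_TABLES, BASELINE_TOOLS, call_tool, the table names) is a hypothetical stand-in. The regex is deliberately narrow and is not a real SQL parser; in practice a read-only database role or parameterized query templates would be a sturdier choice.

```python
import re
from typing import Any, Callable, Dict, List

ALLOWED_TABLES = {"orders", "products"}    # hypothetical table allowlist
BASELINE_TOOLS = {"search", "run_select"}  # tools this agent normally uses
AUDIT_LOG: List[Dict[str, Any]] = []       # stand-in for a real log sink

# Deliberately narrow: one SELECT over one table, no WHERE, joins, or semicolons.
_SELECT_RE = re.compile(
    r"^\s*SELECT\s+(?:\*|\w+(?:\s*,\s*\w+)*)\s+FROM\s+(\w+)(?:\s+LIMIT\s+\d+)?\s*$",
    re.IGNORECASE,
)


def validate_search(query: str) -> str:
    if len(query.strip()) < 3:
        raise ValueError("search query must be at least 3 characters")
    return query.strip()


def validate_select(sql: str) -> str:
    match = _SELECT_RE.match(sql)
    if not match:
        raise ValueError("only simple SELECT statements are allowed")
    if match.group(1).lower() not in ALLOWED_TABLES:
        raise ValueError(f"table not allowed: {match.group(1)}")
    return sql


VALIDATORS: Dict[str, Callable[[str], str]] = {
    "search": validate_search,
    "run_select": validate_select,
}


def call_tool(name: str, arg: str) -> str:
    """Validate the input, record the attempt, and flag unusual tools."""
    entry: Dict[str, Any] = {
        "tool": name,
        "input": arg,
        "status": "ok",
        "alert": name not in BASELINE_TOOLS,  # simplest possible anomaly rule
    }
    try:
        if name not in VALIDATORS:
            raise PermissionError(f"unknown tool: {name}")
        clean = VALIDATORS[name](arg)
        entry["output"] = f"<{name} ran with {clean!r}>"  # placeholder result
    except (ValueError, PermissionError) as exc:
        entry["status"] = f"rejected: {exc}"
    AUDIT_LOG.append(entry)  # rejected attempts are logged too
    return entry["status"]


print(call_tool("search", "agent security"))
print(call_tool("run_select", "SELECT * FROM users"))
print(call_tool("delete_user", "42"))
print([e["tool"] for e in AUDIT_LOG if e["alert"]])
```

Rejected calls are logged as well as successful ones, because a burst of rejections is often the first visible sign that someone is probing the agent. The alert rule here is just set membership; real anomaly detection would look at frequency, arguments, and sequence too.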

The age of agents is coming, but security is not a feature; it is a prerequisite. Don't wait until your agent has been hijacked to regret it.