How Hong Kong startups can cut costs 40% through infra efficiency

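Nine out of ten Hong Kong startup founders will tell you: "Our infra costs are high, but there's no way around it. If you want speed, you have to spend." I have heard this countless times, and every time it strikes me as a pity, because they have the causality backwards. Fast and cheap are not a trade-off; they are two sides of the same thing. The faster your infrastructure, the less time it sits idle, and the more output you get from every dollar. Norway's recent move to a 2PB Huawei all-flash storage system is a clean demonstration, and Hong Kong startups can apply the same logic to save as much as 40%.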

The real lesson from Norway's 2PB Huawei all-flash storage

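In 2025 Norway decided to replace traditional HDD-based data centre infrastructure with a 2PB Huawei all-flash storage system. On the surface this is a national-scale project with nothing to do with a Hong Kong startup, but the core logic is identical: Norway found that the I/O bottleneck was not just a speed problem but a cost problem. HDD-based storage looks cheap, but its high latency and low throughput leave CPUs and GPUs waiting for data much of the time, and that waiting time is money.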

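After the move to all-flash, IOPS rose 20-fold and latency fell from milliseconds to microseconds, so the same compute resources could handle three times more workload. That multiplier effect is the point. For a Hong Kong startup the takeaway is direct: the real cost of your infra is not the monthly cloud bill but how much actual value each unit of compute produces. If storage latency is your bottleneck, you are burning cash on idle compute.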

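The irony is that I have seen many startups pay a six-figure cloud bill every month yet refuse to spend one month's worth of infra cost on optimising storage and networking. They think "optimising infra" is something only big companies need to do, but in reality the smaller the startup, the more efficiency matters, because you have no cash cushion to absorb waste. The point of the Norway case is that they were not chasing speed for its own sake: faster data access translates directly into lower total cost of ownership. That mindset shift, from "what does this cost per month" to "what does this cost per unit of useful work", is what Hong Kong startups need most urgently.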
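To make the "per unit of useful work" framing concrete, here is a minimal sketch that compares two storage setups by what each completed unit of work costs rather than by the monthly bill. The `InfraOption` class, its fields, and every number below are hypothetical illustrations, not measurements from the Norway project or any real deployment.

```python
from dataclasses import dataclass


@dataclass
class InfraOption:
    name: str
    monthly_cost: float   # storage + compute spend per month (hypothetical)
    busy_fraction: float  # share of paid compute time doing useful work, 0..1
    peak_units: float     # work units per month if compute never waited on I/O


def cost_per_useful_unit(option: InfraOption) -> float:
    """Monthly cost divided by the work actually completed."""
    useful_units = option.peak_units * option.busy_fraction
    if useful_units <= 0:
        raise ValueError("option completes no useful work")
    return option.monthly_cost / useful_units


# Illustrative numbers only, not measurements.
hdd = InfraOption("hdd-backed", monthly_cost=100_000, busy_fraction=0.35, peak_units=10_000)
flash = InfraOption("all-flash", monthly_cost=120_000, busy_fraction=0.80, peak_units=10_000)

for opt in (hdd, flash):
    print(f"{opt.name}: {cost_per_useful_unit(opt):.2f} per unit of useful work")
```

With these made-up inputs the all-flash option costs 20% more per month but roughly half as much per unit of useful work, which is the comparison the monthly bill hides.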

Managing GPU idle time in practice

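When it comes to infra efficiency, the worst waste usually comes from GPU idle time. AI startups are hit hardest: you finish training a model and the GPUs wait for the next job, sometimes for hours, sometimes for days. The monthly GPU bill is fixed, but utilisation may be only 30-40%, and that gap is money burned for nothing. Projects like Mesh LLM offer a neat demonstration: use dynamic scheduling and workload aggregation to maximise GPU utilisation. The idea is to stop treating GPUs as dedicated resources and treat them as a shared pool, allocated dynamically according to workload.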

In practice there are four concrete things to do. First, implement preemption-aware scheduling. Different workloads have different priorities, so a training job can preempt a lower-priority inference job, but you need a scheduler that knows how to handle priorities. Kubernetes with the Volcano scheduler can do this. Second, adopt elastic training. Not every training job needs the full cluster: scale the cluster size dynamically to match each job's needs, and when utilisation drops below a threshold, automatically scale down or reassign the GPUs to inference. Third, implement GPU sharing and time-slicing. NVIDIA MPS and MIG let a single GPU serve multiple workloads; the key is that your orchestration layer has to support these features. Fourth, build a job queue and batching system. Batch small inference requests and training jobs together to reduce GPU context-switching overhead. Mesh LLM's batching strategy roughly doubled throughput.
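The first practice, preemption-aware scheduling over a shared pool, can be sketched with a toy in-memory scheduler. This is not how Volcano or Mesh LLM implement it; `Job`, `GpuPool`, and the priority numbers are hypothetical names chosen for illustration, and a real scheduler would also handle checkpointing, gang scheduling, and node placement.

```python
import heapq
import itertools
from dataclasses import dataclass


@dataclass(eq=False)  # identity-based equality, so list.remove() hits the exact job
class Job:
    name: str
    priority: int  # lower number = more important
    gpus: int


class GpuPool:
    """Toy shared GPU pool with priority-based preemption."""

    def __init__(self, total_gpus: int) -> None:
        self.total = total_gpus
        self.running: list[Job] = []
        self.waiting: list[tuple[int, int, Job]] = []  # min-heap on (priority, seq)
        self._seq = itertools.count()  # tie-breaker so Job objects are never compared

    def free(self) -> int:
        return self.total - sum(j.gpus for j in self.running)

    def _enqueue(self, job: Job) -> None:
        heapq.heappush(self.waiting, (job.priority, next(self._seq), job))

    def submit(self, job: Job) -> list[str]:
        """Start `job` if possible; return the names of jobs preempted for it."""
        # Only strictly less important jobs may be evicted, least important first.
        victims = sorted(
            (j for j in self.running if j.priority > job.priority),
            key=lambda j: j.priority,
            reverse=True,
        )
        if job.gpus > self.free() + sum(v.gpus for v in victims):
            self._enqueue(job)  # cannot fit even after preemption, so it waits
            return []
        preempted = []
        for victim in victims:
            if self.free() >= job.gpus:
                break
            self.running.remove(victim)
            self._enqueue(victim)
            preempted.append(victim.name)
        self.running.append(job)
        return preempted

    def finish(self, name: str) -> list[str]:
        """Mark a running job as done and start waiting jobs that now fit."""
        self.running = [j for j in self.running if j.name != name]
        started, still_waiting = [], []
        while self.waiting:
            item = heapq.heappop(self.waiting)
            if item[2].gpus <= self.free():
                self.running.append(item[2])
                started.append(item[2].name)
            else:
                still_waiting.append(item)
        for item in still_waiting:
            heapq.heappush(self.waiting, item)
        return started


pool = GpuPool(total_gpus=8)
pool.submit(Job("train-a", priority=5, gpus=6))
pool.submit(Job("batch-inference", priority=7, gpus=2))
evicted = pool.submit(Job("urgent-train", priority=1, gpus=2))
resumed = pool.finish("urgent-train")
print(evicted, resumed)
```

In this run the urgent training job evicts only the lowest-priority job it needs, and that job is resumed as soon as the GPUs are released, so the pool never holds idle capacity while work is waiting.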

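The most practical advice: install a monitoring stack (DCGM plus Prometheus) and watch the GPU utilisation timeline. If utilisation stays below 60% over the long run, at least 30% of your spend is already pure waste. Use auto-scaling and spot instances to handle peaks instead of paying for dedicated GPUs to sit idle.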
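Once utilisation samples are being collected, a few lines are enough to turn the timeline into a rough waste estimate. The sketch below assumes the samples are percentages exported at a fixed interval (for example from the `DCGM_FI_DEV_GPU_UTIL` gauge that dcgm-exporter typically exposes to Prometheus); the function name, the sample values, and the monthly cost are hypothetical.

```python
from statistics import mean


def summarise_utilisation(
    samples: list[float],
    monthly_gpu_cost: float,
    threshold: float = 60.0,
) -> dict[str, float]:
    """Summarise GPU utilisation samples (0-100) taken at a fixed interval."""
    if not samples:
        raise ValueError("need at least one sample")
    avg = mean(samples)
    # Share of samples where the GPU was running below the chosen threshold.
    below = sum(1 for s in samples if s < threshold) / len(samples)
    # Naive upper bound: treats every unused percentage point as wasted spend.
    idle_cost = monthly_gpu_cost * (1 - avg / 100)
    return {
        "avg_util_pct": avg,
        "share_below_threshold": below,
        "idle_cost": idle_cost,
    }


# Made-up samples: busy during training bursts, near idle in between.
samples = [90, 85, 10, 5, 0, 40, 95, 20]
report = summarise_utilisation(samples, monthly_gpu_cost=100_000)
print(report)
```

The idle-cost figure is deliberately crude: some headroom is needed for peaks, and a utilisation percentage says nothing about how well the busy time is used, so treat it as a ceiling on recoverable spend rather than a precise number.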

An efficiency culture is the real long-term competitive advantage

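Infra efficiency is not only a technical problem but an organisational culture problem. I have noticed a pattern: startups that are good at saving on infra tend to be efficient everywhere else too. The reason is simple. Infra efficiency requires a precise understanding of what you actually need, and that precision spills over into product decisions and hiring decisions. Once you start monitoring GPU utilisation, you naturally start monitoring developer productivity, CAC, and everything else. It is the same muscle: data-driven efficiency.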

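Hong Kong startups share a common failing: the belief that growth is everything and cost can wait. That mindset still worked in 2021, but the market in 2026 no longer lets you burn cash that way. Investors no longer look at pure growth; they look at unit economics and efficiency ratios. I know of one Hong Kong AI startup that spent two weeks on infra optimisation (migrating to more efficient storage, implementing GPU sharing, setting up auto-scaling) and brought its infra cost down from 800,000 to 480,000 a month, a 40% saving, with no impact on product speed or quality. How did they do it? They were willing to stop and think through what they actually needed.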

Efficiency is not a constraint but an advantage. The more efficient you are, the more you can afford to experiment. The more you experiment, the faster you find product-market fit. That loop is the ultimate value of infra efficiency.

Five things you can do right now

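Rather than reading this, hitting like, and going back to over-provisioning, do the following now. First, run a full infra audit this week: use a cost analysis tool to see the utilisation rate of each resource. GPU utilisation below 50%? Storage latency higher than your compute can tolerate? Then you already have room to improve. Second, adopt per-unit-of-work costing: stop looking at the monthly cloud bill and start tracking cost per training run, per inference request, and per user request. Those are the metrics that actually help you make decisions. Third, implement GPU time-slicing and sharing: on Kubernetes, install the Volcano scheduler; on SLURM, use its own GPU sharing support. Either way, make sure GPU resources are shared rather than dedicated. Fourth, review your storage architecture: is there a bottleneck between storage and compute? Do you need all-flash? Not necessarily, but you should at least know your I/O pattern and where the bottleneck is. In many cases a caching layer in between already solves most of the problem. Fifth, put efficiency into your OKRs: set an infra efficiency target every quarter, such as "raise GPU utilisation from 40% to 70%" or "cut cost per inference by 30%", and then chase that number.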
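The second and fifth items fit together: once per-unit costs are tracked, checking a quarterly target is one comparison. The sketch below reuses the 800,000 and 480,000 monthly figures from the example earlier in this article; the `MonthlyUsage` fields, the 50/50 cost split between training and inference, and the run and request counts are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class MonthlyUsage:
    infra_cost: float           # total monthly infra spend
    training_runs: int          # completed training runs (hypothetical count)
    inference_requests: int     # served inference requests (hypothetical count)
    training_cost_share: float  # fraction of spend attributed to training (assumed known)


def unit_costs(u: MonthlyUsage) -> dict[str, float]:
    """Break a monthly bill into cost per training run and per 1,000 inferences."""
    if u.training_runs <= 0 or u.inference_requests <= 0:
        raise ValueError("need at least one training run and one inference request")
    training_cost = u.infra_cost * u.training_cost_share
    inference_cost = u.infra_cost - training_cost
    return {
        "per_training_run": training_cost / u.training_runs,
        "per_1k_inference": inference_cost * 1000 / u.inference_requests,
    }


def meets_target(baseline: float, current: float, reduction_target: float) -> bool:
    """True if `current` is at least `reduction_target` (e.g. 0.30) below `baseline`."""
    return current <= baseline * (1 - reduction_target)


before = unit_costs(MonthlyUsage(800_000, 40, 20_000_000, 0.5))
after = unit_costs(MonthlyUsage(480_000, 40, 20_000_000, 0.5))

# OKR from the text: "cut cost per inference by 30%".
hit = meets_target(before["per_1k_inference"], after["per_1k_inference"], 0.30)
print(before, after, hit)
```

Attributing shared spend to training versus inference is the hard part in practice; a fixed share is the simplest possible assumption and is worth replacing with tagged billing data once that is available.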

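Remember one principle: infra efficiency is a competitive advantage, not a cost centre. While your competitors are still wasting 40% of their infra spend, the same budget lets you do about two-thirds more than they can (1 / 0.6 ≈ 1.67 times as many experiments, trained models, and users served). That gap only widens over time. To compete in a global market, a Hong Kong startup has to get the most output out of limited resources. The Norway case proves the point, and Mesh LLM's practice confirms it. What are you waiting for? Your GPUs may be sitting idle right now.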