/ Cloudflare for SaaS
dual-write to production.

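The Cloudflare for SaaS product is already enabled on the Cloudflare side, but VibeHost does not have a single line of integration code. This brief describes wiring that path up to the point where "the data is in place and traffic can be switched at any time", with no impact on users or existing traffic along the way.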

Medium scope. Seven PRs. Caddy stays untouched.

Metric	Value	Note
CF API on prod	reachable	vibehost.com · vibehost.space
Fallback origin	unset	CF code 1551
Verified domains	2	backfill targets
Code touching CF Hostnames	0 lines	what this brief sets out to fix

· · · · ·

01 Where we are now

Running hack/check-cf-for-saas.sh against both prod zones gives the status below. In short: usable, half configured, not wired up.

Check	vibehost.com	vibehost.space	Interpretation
Token verify	active	active	runtime token is valid
Zone status	Free Website	Free Website	quota=100 hostnames, HTTP DCV only
Custom Hostnames API	reachable	reachable	SaaS really is enabled, which was unexpected
Fallback origin	1551 missing	1551 missing	no origin = the CF edge does not know where to send traffic
cname.<zone> record	missing	missing	the target customers will point their CNAME at

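The production database has five custom_domains rows: 2 verified (in active use), 2 unverified (created, but DNS was never set up), and 1 soft-deleted. All traffic currently goes through Caddy + Let's Encrypt on GCE, so CF Analytics sees none of it.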

· · · · ·

02 Scope of this slice

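Doing

On successful verify and on soft-delete, the API dual-writes to CF Custom Hostnames and Workers KV.
Add a dispatcher Worker deployed at cname.vibehost.space. In P1 it deliberately returns only a 503 stub: it takes no traffic, which prevents an accidental cutover.
The DB schema gains six cf_* columns (migration 0019).
Pulumi adds the fallback origin, the KV namespace, and the CNAME record.
Backfill the 2 existing verified domains into CF (zero action required from users).
A reconciler cron retries the dirty rows that fail-open leaves behind.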

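Not doing

No traffic cutover: users' DNS still uses TXT verification and their CNAME still points at Caddy.
No CLI changes and no dashboard UI changes.
Caddy is not dismantled.
Access gates (visibility / password / share-link) are not wired into the dispatcher. That is left to the "large scope"; the deliberate 503 in dispatcher P1 is an engineering guarantee that this work must happen before any later cutover.
No CF cert renewal webhook.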

· · · · ·

03 Architecture

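New components are marked with ★. The two dual-write fan-out paths (CF Hostnames API + Workers KV) are fail-open: either both succeed, or a failure on either one does not block the main flow and is repaired later by the reconciler.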

                    ┌─────────────────────────────────────┐
                    │ apps/api  (existing)                │
                    │                                     │
                    │ routes/custom-domains.ts            │
                    │   add() / verify() / remove()       │
                    │       │                             │
                    │       ▼                             │
                    │ services/custom-domains.ts          │
                    │   commitVerify  ─────┐              │
                    │   softDelete    ─────┤  fail-open   │
                    │   commitReclaim ─────┘  dual-write  │
                    │                  │                  │
                    │   ┌──────────────┴──────────────┐   │
                    │   ▼                             ▼   │
                    │ services/cloudflare/                │
                    │   ★ custom-hostname.ts             │
                    │   ★ kv-hostname-map.ts             │
                    │                                     │
                    │ ★ jobs/cf-sync-reconciler.ts        │
                    │   cron */5 * * * *                  │
                    └────────┬─────────────────┬──────────┘
                             │ HTTPS           │ HTTPS
                             ▼                 ▼
                     ┌──────────────┐  ┌────────────────┐
                     │ CF Custom    │  │ Workers KV     │
                     │ Hostnames    │  │ HOSTNAME_KV    │
                     │ API          │  │ host → meta    │
                     └──────┬───────┘  └────────┬───────┘
                            │                   │
                            │ ssl/dcv lifecycle │ read
                            ▼                   ▼
                 ┌──────────────────────────────────────┐
                 │ CF Edge                              │
                 │  fallback_origin = dispatcher worker │
                 │  cname.vibehost.space  ─→  ★         │
                 │                              │       │
                 │             ┌────────────────┘       │
                 │             ▼                        │
                 │  ★ apps/dispatcher (Worker)          │
                 │    1. host = X-Forwarded-Host        │
                 │    2. kv.getWithMetadata(host)       │
                 │    3. env.DISPATCHER.get(name)       │
                 │       .fetch(req)                    │
                 └──────────────────────────────────────┘

Legend: API pod · CF data plane · CF edge / dispatcher · ★ = new in this slice
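As a reading aid for the dispatcher box in the diagram, here is a minimal TypeScript sketch of its three numbered steps. It is not the P1 worker (which only returns 503), and it leaves out the access gates entirely, so it is not something to ship. The `KvLike`, `DispatchLike`, and `Env` interfaces and the `route` function are hypothetical stand-ins for the real Workers bindings; lowercasing the host and falling back to the URL hostname when `X-Forwarded-Host` is absent are assumptions.

```typescript
// Hypothetical local stand-ins for the Workers bindings, so the sketch type-checks on its own.
interface HostMeta { workerName: string; channelId: string }

interface KvLike {
  getWithMetadata(key: string): Promise<{ value: string | null; metadata: HostMeta | null }>;
}

interface DispatchLike {
  get(name: string): { fetch(req: Request): Promise<Response> };
}

interface Env { HOSTNAME_KV: KvLike; DISPATCHER: DispatchLike }

export async function route(req: Request, env: Env): Promise<Response> {
  // Step 1: take the customer hostname from X-Forwarded-Host.
  // Assumption: fall back to the URL hostname and normalise to lowercase.
  const host = (req.headers.get("x-forwarded-host") ?? new URL(req.url).hostname).toLowerCase();

  // Step 2: one KV round trip; the routing data lives in the key's metadata.
  const { metadata } = await env.HOSTNAME_KV.getWithMetadata(host);
  if (metadata === null) {
    return new Response("unknown hostname", { status: 404 });
  }

  // Step 3: hand the request to the worker named in the metadata.
  return env.DISPATCHER.get(metadata.workerName).fetch(req);
}
```

Keeping `workerName` and `channelId` in KV metadata rather than in the value is what lets step 2 stay a single lookup.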

· · · · ·

04 Flow · verify + dual-write

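Key design: the main transaction commits first; the dual-write runs outside the transaction and is fail-open. If CF or KV hiccups, the user's row still has verified_at and Caddy keeps serving as usual; the row simply carries cf_sync_state='failed' until the next reconciler run repairs it.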

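01 CLI

vibehost domain add blog.example.com
Triggered by the user; the CLI is unchanged and the TXT flow stays as it is.

02 API · POST

SQL: INSERT custom_domains (verify_token=…, cf_sync_state='pending')
RES: returns verifyToken; the user then sets the TXT record in their DNS console.

03 API · verify

DNS: TXT lookup compared against verify_token; the flow continues only on a match.

04 Main TX

SQL: BEGIN; UPDATE row SET verified_at = now(); COMMIT;
Once this commits, the user's verify has succeeded, and nothing that happens afterwards should affect it.

05 Dual-write (fail-open)

CF: cf.create(hostname) → returns cf_id, ssl_status
KV: kv.put(hostname, {workerName, channelId})
SQL: UPDATE row SET cf_id, cf_ssl_status, cf_sync_state='ok', cf_synced_at
If any step throws → catch → the row is marked cf_sync_state='failed' + cf_last_error, and the main flow still returns ok.

06 API · response

RES: {ok: true, verified: true}
Traffic keeps flowing through Caddy. About 60s later ssl_status at the CF edge turns active, but zero traffic arrives until the large scope switches DNS.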

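Why fail-open

The user chose "no feature flag" and "shipping means shipping". That means the dual-write's fault tolerance has to absorb every transient CF/KV error at the service layer: a user's verify must not fail because the CF API hiccupped. The cost is that dirty rows (state=failed) need a reconciler to repair them. Fail-open without a reconciler is a bug factory.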
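The verify flow above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual services/custom-domains.ts: the row is a plain in-memory object standing in for the SQL updates, and `DomainRow`, `DualWriteDeps`, and `verifyAndDualWrite` are hypothetical names. The point it shows is that the main commit happens first and that a failure in either fan-out call is caught and recorded on the row instead of reaching the caller.

```typescript
type SyncState = "pending" | "ok" | "failed";

// In-memory stand-in for a custom_domains row (assumption: the real code issues SQL).
interface DomainRow {
  hostname: string;
  verifiedAt: Date | null;
  cfId: string | null;
  cfSslStatus: string | null;
  cfSyncState: SyncState;
  cfLastError: string | null;
}

interface HostMeta { workerName: string; channelId: string }

// Hypothetical injected dependencies wrapping the CF Hostnames API and Workers KV.
interface DualWriteDeps {
  cfCreate(hostname: string): Promise<{ id: string; sslStatus: string }>;
  kvPut(hostname: string, meta: HostMeta): Promise<void>;
}

export async function verifyAndDualWrite(
  row: DomainRow,
  meta: HostMeta,
  deps: DualWriteDeps,
): Promise<{ ok: true; verified: true }> {
  // Step 04: main transaction. After this point the verify has succeeded.
  row.verifiedAt = new Date();

  // Step 05: dual-write outside the transaction, fail-open.
  try {
    const cf = await deps.cfCreate(row.hostname);
    row.cfId = cf.id; // kept even if the KV write below fails
    row.cfSslStatus = cf.sslStatus;
    await deps.kvPut(row.hostname, meta);
    row.cfSyncState = "ok";
  } catch (err) {
    // Record the failure for the reconciler; never surface it to the user.
    row.cfSyncState = "failed";
    row.cfLastError = err instanceof Error ? err.message : String(err);
  }

  // Step 06: the response is the same on both paths.
  return { ok: true, verified: true };
}
```

Writing `cfId` before the KV call matches the error matrix row where KV fails after CF succeeds: the row ends up `failed` but still carries the id.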

· · · · ·

05 Flow · delete & reclaim

Soft delete

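01 Main TX

SQL: BEGIN; UPDATE row SET deleted_at = now(); COMMIT;

02 Reverse cleanup (fail-open)

KV: kv.delete(hostname), so the edge returns 404 as soon as the delete propagates
CF: cf.delete(cf_id), which frees the record that was taking up quota
On failure → log; reconciler query 2 scans deleted_at IS NOT NULL AND cf_id IS NOT NULL and retries the delete.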
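A minimal sketch of the soft-delete path, again with an in-memory row in place of SQL. `DeletedRow`, `CleanupDeps`, and `softDeleteWithCleanup` are hypothetical names. One assumption worth calling out: the sketch clears `cfId` after a successful CF delete, since otherwise reconciler query 2 (deleted_at IS NOT NULL AND cf_id IS NOT NULL) would keep matching the row.

```typescript
interface DeletedRow {
  hostname: string;
  cfId: string | null;
  deletedAt: Date | null;
}

// Hypothetical injected dependencies.
interface CleanupDeps {
  kvDelete(hostname: string): Promise<void>;
  cfDelete(cfId: string): Promise<void>;
  log(message: string): void;
}

export async function softDeleteWithCleanup(row: DeletedRow, deps: CleanupDeps): Promise<void> {
  // Step 01: main transaction. The delete is final from the user's point of view.
  row.deletedAt = new Date();

  // Step 02: reverse cleanup, fail-open. KV goes first so the edge stops resolving the host.
  try {
    await deps.kvDelete(row.hostname);
    if (row.cfId !== null) {
      await deps.cfDelete(row.cfId);
      row.cfId = null; // assumption: clearing cf_id takes the row out of reconciler query 2
    }
  } catch (err) {
    // Leave cf_id in place so the reconciler can find the row and retry.
    deps.log(`cf cleanup failed for ${row.hostname}: ${err instanceof Error ? err.message : String(err)}`);
  }
}
```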

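Reclaim · Vercel-style: taking a hostname back across workspaces

The existing schema already supports reclaim (the reclaim_of column, plus commitReclaim(), which promotes the new row and soft-deletes the old row in a single TX). Here is how the dual-write hooks into it:

01 Main TX

SQL: existing commitReclaim logic: soft-delete the old normal row, then clear reclaim_of to promote the new row.

02 New row · CF

CF: cf.create(hostname) → 1414/1415 (hostname already exists in the account) → cf.findByHostname(hostname) to fetch the existing id
SQL: UPDATE newRow SET cf_id, cf_sync_state='ok'
KV: kv.put(hostname, {new workerName, new channelId}), last-write-wins

03 Old row · CF

SQL: UPDATE oldRow SET cf_id = NULL
cf.delete is not called: the hostname is still in use and the record still belongs to VibeHost; only the app behind it changed. The old row's cf_id is set to NULL so that the reconciler does not delete this record by mistake later.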
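The reclaim steps can be sketched like this. It is a simplified illustration: rows are in-memory objects, the fail-open try/catch is left out for brevity, and `CfClient`, `createOrAdopt`, and `syncReclaim` are hypothetical names (only `CfHostnameError`, `create`, and `findByHostname` appear in this brief, and the error class shape here is assumed). The sketch sets the sync state to 'ok' only after the KV write, which is a choice made here so that a KV failure cannot leave the row marked ok.

```typescript
// Assumed shape: an error carrying the numeric CF error code.
class CfHostnameError extends Error {
  code: number;
  constructor(code: number, message: string) {
    super(message);
    this.name = "CfHostnameError";
    this.code = code;
  }
}

interface CfClient {
  create(hostname: string): Promise<{ id: string }>;
  findByHostname(hostname: string): Promise<{ id: string } | null>;
}

interface ReclaimRow {
  hostname: string;
  cfId: string | null;
  cfSyncState: "pending" | "ok" | "failed" | "blocked";
}

interface HostMeta { workerName: string; channelId: string }

// Create the hostname, or adopt the existing record when CF reports 1414/1415.
async function createOrAdopt(cf: CfClient, hostname: string): Promise<string> {
  try {
    return (await cf.create(hostname)).id;
  } catch (err) {
    if (err instanceof CfHostnameError && (err.code === 1414 || err.code === 1415)) {
      const existing = await cf.findByHostname(hostname);
      if (existing !== null) return existing.id;
    }
    throw err; // 1416, 5xx, network: handled by the caller's error path
  }
}

export async function syncReclaim(
  newRow: ReclaimRow,
  oldRow: ReclaimRow,
  meta: HostMeta,
  cf: CfClient,
  kvPut: (hostname: string, meta: HostMeta) => Promise<void>,
): Promise<void> {
  // Step 02: the new row takes over the existing CF record.
  newRow.cfId = await createOrAdopt(cf, newRow.hostname);
  await kvPut(newRow.hostname, meta); // last-write-wins over the old owner's metadata
  newRow.cfSyncState = "ok";

  // Step 03: detach the old row without calling cf.delete.
  oldRow.cfId = null;
}
```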

· · · · ·

06 Error matrix

Every branch is covered by a test (§7 Testing).

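Scenario	Outcome
`cf.create` 5xx / network	row → cf_sync_state='failed', `cf_last_error`, `cf_retry_count++`. The verify main flow returns ok. The reconciler retries within 5 minutes.
`cf.create` 1414 / 1415 (hostname already in this account)	`findByHostname` fetches the id → written to the row. Sync state → ok. Happens on reclaim and on duplicate adds.
`cf.create` 1416 (hostname is in someone else's CF account)	row → cf_sync_state='blocked'. The reconciler skips it; manual intervention is required.
KV put fails but CF create succeeded	row → cf_sync_state='failed', `cf_id` is still written. The reconciler reruns; the CF step takes the 1414 branch, which makes it idempotent.
Pulumi fallback origin not yet active	Does not affect the dual-write. The CF record and cert are still issued; the only effect is that the edge cannot serve traffic, which matters only once DNS is actually switched. The medium scope never switches traffic, so this is harmless.
Token lacks SSL/TLS:Edit	Every cf call returns 401/403 → every row goes failed. The Pulumi post-deploy smoke blocks this outright, so the deploy does not pass.
Reclaim while the CF hostname still has the old owner's KV metadata	After `commitReclaim`, `kv.put` overwrites with the new metadata: last-write-wins, consistent at the edge within ≤60s.
Soft delete followed by a failed CF delete	KV is already deleted → the edge returns 404 as soon as the delete propagates. CF still holds the record → reconciler query 2 retries the delete.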
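The `cf.create` rows of the matrix reduce to a small pure function, sketched below. `CreateOutcome` and `classifyCreateError` are hypothetical names; the numeric codes are the ones listed in the matrix, and `null` stands for a 5xx or network failure that carries no CF error code (an assumption about how the client reports those).

```typescript
// What the caller should do after cf.create fails.
type CreateOutcome =
  | { action: "adopt" } // look the hostname up and reuse its id
  | { action: "mark"; state: "failed" | "blocked"; lastError: string; bumpRetry: boolean };

export function classifyCreateError(cfCode: number | null, message: string): CreateOutcome {
  // 1414 / 1415: hostname already exists in this account, so recover via findByHostname.
  if (cfCode === 1414 || cfCode === 1415) {
    return { action: "adopt" };
  }
  // 1416: hostname belongs to another CF account. Retrying cannot help, so park the row.
  if (cfCode === 1416) {
    return { action: "mark", state: "blocked", lastError: message, bumpRetry: false };
  }
  // Everything else (5xx, network, auth) is treated as retryable by the reconciler.
  return { action: "mark", state: "failed", lastError: message, bumpRetry: true };
}
```

Keeping the classification free of I/O makes each matrix row a one-line unit test.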

· · · · ·

07 PR breakdown

Seven PRs. Each one goes through the existing ship-feature flow: CI claude-review ≥ 7.5 → squash merge → staging E2E → prod.

P1 · feat(infra): cf-for-saas fallback origin + dispatcher worker scaffold
Pulumi creates the KV namespace, PUTs fallback_origin, and adds the CNAME. The dispatcher worker deliberately returns only a 503 stub.

Risk: low

Add cloudflare.WorkersKvNamespace vibehost-hostname-map-{env}, wired to the api Deployment and the dispatcher binding.
Add cloudflare.CustomHostnameFallbackOrigin on vibehost.space, origin = dispatcher worker host.
Add cloudflare.Record: proxied CNAME cname.vibehost.space → dispatcher.
Add apps/dispatcher/, which deliberately returns only 503 "cf-for-saas dispatcher: not yet serving traffic" plus a log line.
Pulumi post-deploy smoke: GET /custom_hostnames?per_page=1 must succeed, otherwise the whole deploy fails.
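For reference, a 503 stub of the kind described for apps/dispatcher/ could be as small as the sketch below. The response body is the string quoted above; the `dispatcher` constant name, the log wording, and the content-type header are assumptions.

```typescript
const STUB_BODY = "cf-for-saas dispatcher: not yet serving traffic";

// Module-style Worker: every request gets the same 503 until the access gates are ported.
const dispatcher = {
  async fetch(request: Request): Promise<Response> {
    // Log line so that any unexpected traffic reaching the stub is visible.
    console.log(`dispatcher stub hit: ${new URL(request.url).hostname}`);
    return new Response(STUB_BODY, {
      status: 503,
      headers: { "content-type": "text/plain; charset=utf-8" },
    });
  },
};

export default dispatcher;
```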

P2 · feat(api): custom-hostname + kv-hostname-map services
Just two new files plus unit tests. Not wired into the verify flow; written ahead of time for P4.

Risk: none

CloudflareCustomHostnameService · create / get / findByHostname / delete.
HostnameKvMap · put / delete (uses KV metadata, so the dispatcher gets everything in one round trip).
CfHostnameError distinguishes 1414/1415/1416 from everything else.
fetch mock matrix: 200 / 1414 / 1415 / 1416 / 5xx / network.

P3 · feat(db): custom_domains cf sync columns
Migration 0019_custom_domains_cf_sync.sql. Six nullable columns with sentinel defaults.

Risk: none

cf_id text, cf_ssl_status text, cf_sync_state text DEFAULT 'pending', cf_last_error text, cf_synced_at timestamptz, cf_retry_count int DEFAULT 0.
schema.ts updated to match; tsc passes.
Index on (cf_sync_state) WHERE deleted_at IS NULL, for the reconciler.

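P4 · feat(api): dual-write custom domains to cloudflare
The first PR that actually calls CF. The three hook points commitVerify / softDelete / commitReclaim are wired to the P2 services.

Risk: medium

Deploy to staging first → create one domain by hand → check the CF dashboard for the record and confirm that ssl_status progresses.
Unit tests cover the success path plus the three catch branches: 5xx fail-open, 1414 recover, 1416 blocked.
The commitReclaim case (1414 + KV overwrite) gets its own separate test.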

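P5 · feat(api): cf-sync reconciler cron job
Every 5 minutes it scans cf_sync_state='failed' rows and retries them, and scans for orphaned cf_id values and deletes them.

Risk: low

Follows the existing BullMQ style; no new infrastructure.
Capped at cf_retry_count < 10; past that it stops and waits for a human.
The blocked state is always skipped.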
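The selection rules above can be sketched as a pure function over rows. In practice this filtering would more likely be done in SQL (the partial index from migration 0019 exists for that), so treat this only as an executable statement of the rules. `ReconcileRow` and `planReconcile` are hypothetical names.

```typescript
type SyncState = "pending" | "ok" | "failed" | "blocked";

interface ReconcileRow {
  hostname: string;
  cfSyncState: SyncState;
  cfRetryCount: number;
  cfId: string | null;
  deletedAt: Date | null;
}

const MAX_RETRIES = 10; // cap from the PR description: cf_retry_count < 10

export function planReconcile(rows: ReconcileRow[]): {
  retry: ReconcileRow[];
  orphanDelete: ReconcileRow[];
} {
  // Query 1: live rows whose dual-write failed and that still have retries left.
  // Rows in the blocked state never match, so they are skipped by construction.
  const retry = rows.filter(
    (r) => r.deletedAt === null && r.cfSyncState === "failed" && r.cfRetryCount < MAX_RETRIES,
  );

  // Query 2: soft-deleted rows that still point at a CF record.
  const orphanDelete = rows.filter((r) => r.deletedAt !== null && r.cfId !== null);

  return { retry, orphanDelete };
}
```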

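P6 · chore(api): backfill existing verified custom domains
A one-off script: run it with --dry-run first, then for real. 2 rows each on staging and prod.

Risk: low

Uses the same helper as the verify-time dual-write, so both paths behave consistently.
Idempotent: rows that already have cf_id are skipped.
A symmetric --cleanup mode is included in case a rollback is needed.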
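A rough sketch of the dry-run and idempotency behaviour described above. `VerifiedRow`, `SyncFn`, and `backfill` are hypothetical names, and `sync` stands in for the shared verify-time helper, which is assumed to be fail-open and therefore not to throw. The --cleanup mode is not shown.

```typescript
interface VerifiedRow {
  hostname: string;
  cfId: string | null;
}

// Stand-in for the shared dual-write helper (assumed not to throw).
type SyncFn = (row: VerifiedRow) => Promise<void>;

interface BackfillReport {
  wouldSync: string[]; // dry-run only: rows that a real run would write
  synced: string[];
  skipped: string[];   // rows that already have cf_id
}

export async function backfill(
  rows: VerifiedRow[],
  sync: SyncFn,
  dryRun: boolean,
): Promise<BackfillReport> {
  const report: BackfillReport = { wouldSync: [], synced: [], skipped: [] };
  for (const row of rows) {
    // Idempotency: never touch a row that is already linked to a CF record.
    if (row.cfId !== null) {
      report.skipped.push(row.hostname);
      continue;
    }
    // Dry run: report the row but make no external calls.
    if (dryRun) {
      report.wouldSync.push(row.hostname);
      continue;
    }
    await sync(row);
    report.synced.push(row.hostname);
  }
  return report;
}
```

Running the real pass twice should report every row as skipped the second time, which is a cheap check that the script is safe to re-run.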

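P7 · docs: cf-for-saas adr + dispatcher access-gate gap
Records in an ADR that the dispatcher has no access gates yet; this must be fixed before the next phase switches traffic.

Risk: none

Status record: on the CF side the records are complete, certs are issued, and KV is populated; traffic is still on Caddy.
Explicit list of "large scope prerequisites": porting the three access-gate checks, changing the CLI to CNAME instructions, DNS guidance for existing domains, and retiring Caddy.
Add a ticket to the roadmap.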

· · · · ·

08 Hard-rule alignment

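VibeHost's CLAUDE.md lists nine hard rules. This change touches three of them:

Rule #1 · No builds in the API hot path

This change is only HTTP calls and DB writes: no build, no wrangler subprocess, no npm install. Not applicable.

Rule #4 · Secrets stay out of the repo

Reuses the existing vibehost:cfApiToken Pulumi config key. If the current token's scope lacks SSL/TLS:Edit, the token is replaced in place; no new env var is introduced. The Pulumi smoke step confirms the scope at deploy time.

Rule #8 · Access gates are AND, not OR

In P1–P6 the dispatcher worker does not evaluate visibility / password / share-link. Those three checks live in the existing custom-domain-proxy.ts middleware and have not been ported to the dispatcher. The deliberate 503 stub in P1 is an engineering guarantee that this work must happen before any cutover (otherwise switching traffic over would bypass the access gates entirely). P7 records this gap in the ADR.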

· · · · ·

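09 Out of scope · what the large scope will do

Once the medium scope has shipped and the data on the CF side is complete, the "large scope" will:

Port the three access gates (visibility / password / share-link) from middleware/custom-domain-proxy.ts to the dispatcher worker (or a sidecar).
Switch the DNS instructions in the CLI / dashboard from TXT → CNAME (or run both, with TXT marked as legacy).
Add a CF cert renewal lifecycle webhook receiver.
Retire Caddy on GCE.
If needed, upgrade the zone to SSL-for-SaaS Advanced (quota > 100 hostnames).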
