Filesystem · 2026-05-31

KyteStore Filesystem Metadata Design Choices

This article was generated with a large language model from the actual code implementation.

KyteStore metadata design principles: avoid depending on an external standalone database system whenever possible, so metadata does not bottleneck on network hops or a central database; keep normal-path operations fast and local to metadata shards; decentralize metadata across DataServer nodes to avoid single-node throughput and availability limits.

Outline

Filesystem logical structure
Filesystem metadata and POSIX operation semantics
Why cross-shard operations are hard
Main design choices and KyteStore's decision
KyteStore inode bucket architecture and 4K bucket sizing
3-replica WAL and batch group commit
Cross-bucket rename transaction model
Performance and horizontal scaling

1. Filesystem Logical Structure

To understand filesystem metadata, separate names, file entities, and data blocks. In a typical Linux / Ext4 model, the VFS layer uses dentries to represent pathname bindings, while Ext4 stores on-disk directory entries that map names to inode numbers. Inodes hold attributes such as type, permissions, size, and timestamps, and extents map logical file ranges to physical data blocks.

Figure: Ext4-style filesystem logical structure

In this diagram, the upper layer is filesystem metadata, while the lower path shows how a pathname eventually reaches data blocks. User reads and writes ultimately access data blocks, but the system must first resolve the dentry, load the inode, and follow extents. If the metadata path is slow, the data path cannot be fully utilized.

2. What Filesystem Metadata Does

Filesystem metadata is more than filenames. A normal open("/a/b/c") walks path components, resolves dentry to inode, loads inode attributes, and eventually uses extent maps to find data. Directory listing needs a stable view. mkdir, unlink, and rename must update visibility, inode lifecycle, and recovery state.
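
The path walk described above can be sketched in a few lines. This is a minimal illustration, assuming two in-memory tables; the names `dentries`, `inodes`, and `resolve` are hypothetical and are not KyteStore APIs.

```python
# Hypothetical in-memory tables used only for illustration.
ROOT_INODE = 1

# Dentry table: (parent_inode, name) -> child inode number.
dentries = {
    (1, "a"): 10,
    (10, "b"): 20,
    (20, "c"): 30,
}

# Inode table: inode number -> attributes.
inodes = {
    1: {"type": "dir"},
    10: {"type": "dir"},
    20: {"type": "dir"},
    30: {"type": "file", "size": 4096},
}


def resolve(path):
    """Walk an absolute path one component at a time and return the inode number."""
    current = ROOT_INODE
    for name in (part for part in path.split("/") if part):
        if inodes[current]["type"] != "dir":
            raise NotADirectoryError(path)
        key = (current, name)
        if key not in dentries:
            raise FileNotFoundError(path)
        current = dentries[key]  # one dentry lookup per path component
    return current


ino = resolve("/a/b/c")
print(ino, inodes[ino])
```

Each path component costs one dentry lookup plus one inode read, which is why lookup latency tends to dominate workloads with deep paths and many small files.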

In AI and lakehouse workloads, files may be large, but they may also be extremely numerous. Large files stress the data path. Small files, checkpoint publish patterns, and repeated directory scans often hit the metadata path first. KyteStore therefore treats lookup, create, and rename as first-class performance paths instead of delegating them to a central database by default.

POSIX Metadata Op	Core Semantics	Common Implementation
lookup / open	Walk path components and resolve parent/name dentries to inodes.	Directory hash table, B+Tree / LSM index, or centralized KV mapping.
getattr / stat	Read inode type, permissions, size, mtime, link count, and related attributes.	Inode table, local metadata shard, metadata service, or materialized local view.
readdir / list	List dentries under a directory with cursor or pagination semantics.	Single-shard scan for small directories, fanout plus merge for large directories.
create / mkdir	Create a new dentry under the parent and allocate a child inode.	Single-shard transaction, WAL-first mutation, or dentry shard plus inode range allocator.
unlink / rmdir	Remove a dentry and update inode lifecycle; rmdir also validates directory emptiness.	Dentry tombstone, link count, background GC, usually protected by WAL.
rename	Atomically switch old dentry to new dentry while preserving inode identity.	Single WAL record in one shard; lock/intent plus transaction WAL across shards.

A dentry is the name binding inside a directory: under a parent inode, a specific name points to a child inode. For example, <parent_inode=100, name="ckpt-42"> -> inode 822. Dentries answer the question "which object does this path name currently point to" and are the main objects touched by lookup, readdir, rename, and unlink.

An inode represents the file or directory entity itself: type, permissions, size, mtime, link count, and extent maps belong to the inode. Multiple names can point to the same inode, and rename should keep inode identity unchanged. This is the foundation for POSIX open file handles, hard links, and atomic rename.

Mainstream filesystems separate dentries and inodes because names and entities have different lifecycles. Names are created, removed, and moved frequently; entities carry attributes, data locations, and already-open file state. This separation lets rename update only the name binding, unlink remove a name before reclaiming the entity by reference count, and directory sharding scale the dentry namespace independently.
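
To make the name/entity split concrete, here is a small sketch assuming a single in-memory namespace. The `Namespace` class and its methods are hypothetical; target replace and open-handle tracking are deliberately left out.

```python
class Namespace:
    """Toy model: dentries bind names to inodes, inodes carry the entity state."""

    def __init__(self):
        self.dentries = {}   # (parent_inode, name) -> inode number
        self.inodes = {}     # inode number -> {"nlink": int, "size": int}
        self.next_ino = 822

    def create(self, parent, name):
        if (parent, name) in self.dentries:
            raise FileExistsError(name)
        ino = self.next_ino
        self.next_ino += 1
        self.inodes[ino] = {"nlink": 1, "size": 0}
        self.dentries[(parent, name)] = ino
        return ino

    def link(self, ino, parent, name):
        """Add a second name for an existing inode (hard link)."""
        if (parent, name) in self.dentries:
            raise FileExistsError(name)
        self.dentries[(parent, name)] = ino
        self.inodes[ino]["nlink"] += 1

    def rename(self, old_parent, old_name, new_parent, new_name):
        """Move the name binding only; the inode is untouched."""
        if (new_parent, new_name) in self.dentries:
            raise FileExistsError(new_name)
        ino = self.dentries.pop((old_parent, old_name))
        self.dentries[(new_parent, new_name)] = ino
        return ino

    def unlink(self, parent, name):
        """Remove the name first, then reclaim the entity when no names remain."""
        ino = self.dentries.pop((parent, name))
        self.inodes[ino]["nlink"] -= 1
        if self.inodes[ino]["nlink"] == 0:
            del self.inodes[ino]  # a real system would defer this while handles are open


ns = Namespace()
ino = ns.create(100, "ckpt-42.tmp")
assert ns.rename(100, "ckpt-42.tmp", 100, "ckpt-42") == ino  # same inode after rename
ns.link(ino, 100, "latest")
ns.unlink(100, "ckpt-42")
print(ns.dentries, ns.inodes)
```

After the last step the inode is still alive because one name, `latest`, still points to it.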

3. The Hard Part: Cross-Shard Operations

With one metadata node, rename correctness is straightforward, but the model does not scale. KyteStore routes dentries by parent inode + name into dentry buckets, allowing large directories to spread across inode buckets. Lookup and create can scale out, but rename becomes harder.

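Cross-bucket does not necessarily mean cross-directory: two names in the same directory can hash into different buckets. POSIX rename also requires inode identity to remain unchanged, so it cannot be emulated by creating a new inode that points to the old extents and then deleting the old inode. That shortcut would break inode numbers, open file handles, concurrent reads and writes, and crash recovery semantics.

The hard part of a metadata system is therefore not adding, removing, or updating a single dentry. It is letting users see one atomic switch when the old and new bindings live in different buckets, possibly on different DS nodes, and still being able to complete or roll back from durable state after a coordinator crash.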

From a transaction perspective, cross-shard means that one user operation becomes multiple shard-local sub-operations. If one shard succeeds while another fails, the system enters an intermediate state: the old name may already be removed while the new name is not created, or the new name may be visible while the old name still exists. The failure may be a network timeout, DS crash, coordinator crash, or process exit after a WAL append but before the response. Without transaction state, fencing, and recovery rules, recovery cannot safely decide whether to roll back or roll forward.
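
The intermediate state described above can be reproduced with a toy model. This is a sketch under stated assumptions: the two dictionaries stand in for two shards, and `naive_rename` is a hypothetical, deliberately unsafe rename with no transaction record. It is not KyteStore code.

```python
# Two hypothetical shards, each holding part of the dentry namespace.
old_shard = {(100, "ckpt-42.tmp"): 822}
new_shard = {}


def naive_rename(crash_after_delete=False):
    """Unsafe cross-shard rename: two independent steps, nothing durable in between."""
    ino = old_shard.pop((100, "ckpt-42.tmp"))   # step 1: remove the old name on shard A
    if crash_after_delete:
        return                                   # coordinator dies before step 2
    new_shard[(100, "ckpt-42")] = ino            # step 2: add the new name on shard B


naive_rename(crash_after_delete=True)

# Inode 822 is now reachable by no name, and nothing durable says what was intended.
print(old_shard, new_shard)
```

Because no record of the intended rename survives the crash, recovery has no basis for choosing between restoring the old name and creating the new one.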

Figure: Cross-shard partial failure risk

4. Main Design Choices

Centralized management: Redis or FDB stores filename-to-inode mappings, inode attributes, and extent maps. This is simple to implement and has clear transaction boundaries, but lookup, create, readdir, and rename all inherit the central layer's throughput ceiling and tail latency.

Decentralized transactions: metadata shards serve their own local indexes, while cross-shard operations converge through intents, locks, WAL, and a recovery protocol. Scalability is better, but the implementation complexity concentrates in fencing and failure recovery.

Reduced semantics: reject cross-directory rename, reject target replace, or weaken readdir and directory mutation. This ships quickly, but exposes storage implementation boundaries to users and upper-layer frameworks.

KyteStore's choice: keep the control plane small, holding only owners, epochs, ranges, and fencing. Move the hot path into DS inode buckets, use local indexes for throughput, and use a multi-replica WAL for recovery.

KyteStore does not put full file metadata in MetaServer/FDB. MS/FDB tracks FS bindings, the inode bucket owner map, owner epochs, inode range allocation, and a small number of transaction locks. High-frequency dentry, inode attribute, and extent map state belongs to the inode bucket owner in the DS FSSubsystem. In other words, KyteStore combines a small, stable centralized control plane with a decentralized metadata data plane that scales out by bucket.

5. KyteStore Inode Buckets

The recovery and scaling unit is the inode bucket. Each bucket has its own owner epoch, WAL, checkpoint, materialized index, and recovery boundary. FE/SDK or DS path RPC uses DirectoryLayout to map parent inode + filename to a dentry bucket and then sends the operation to the owner DS.

This keeps MS/FDB out of the normal file I/O hot path: lookup, getattr, readdir, create, unlink, and mkdir mainly hit the DS local materialized index. MS/FDB participates only in owner changes, range allocation, bucket materialization, and cross-DS transaction locking.

DirectoryLayout has two levels: a stable virtual bucket index produced by hash(parent_inode, filename, seed) % virtual_bucket_count, and a mapping from virtual bucket ranges to physical buckets. Physical buckets have owners, WALs, checkpoints, and local indexes. A directory split changes only the virtual-to-physical mapping, while failover or rebalance changes only physical bucket owners, so the two concerns stay decoupled.
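
The two-level routing can be sketched as follows. This is an illustration under assumptions: the BLAKE2b hash, the range representation, and the names `virtual_bucket`, `DirectoryLayout.physical_bucket`, `owners`, and `route` are all hypothetical; the source only specifies the shape hash(parent_inode, filename, seed) % virtual_bucket_count and the range mapping.

```python
import bisect
import hashlib

VIRTUAL_BUCKET_COUNT = 4096


def virtual_bucket(parent_inode, filename, seed=0, count=VIRTUAL_BUCKET_COUNT):
    """Level 1: stable hash of (parent_inode, filename, seed) modulo the bucket count."""
    data = f"{seed}:{parent_inode}:{filename}".encode()
    digest = hashlib.blake2b(data, digest_size=8).digest()  # assumed hash function
    return int.from_bytes(digest, "big") % count


class DirectoryLayout:
    """Level 2: contiguous virtual bucket ranges mapped to physical buckets."""

    def __init__(self, range_starts, physical_buckets):
        # range_starts[i] is the first virtual index served by physical_buckets[i].
        assert range_starts[0] == 0 and range_starts == sorted(range_starts)
        assert len(range_starts) == len(physical_buckets)
        self.range_starts = range_starts
        self.physical_buckets = physical_buckets

    def physical_bucket(self, vb):
        return self.physical_buckets[bisect.bisect_right(self.range_starts, vb) - 1]


layout = DirectoryLayout([0, 2048], ["pb-0", "pb-1"])
owners = {"pb-0": "ds-1", "pb-1": "ds-2"}   # physical bucket -> owner DS


def route(parent_inode, filename):
    vb = virtual_bucket(parent_inode, filename)
    pb = layout.physical_bucket(vb)
    return vb, pb, owners[pb]


print(route(100, "ckpt-42"))
```

In this sketch a split would replace `layout` with finer ranges while `virtual_bucket` stays unchanged, and a failover would only update `owners`.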

The default single-directory layout uses 4K virtual buckets. This does not mean materializing 4,096 independent DS owners up front; it reserves a fine enough hash granularity for the directory namespace. For a directory with one billion files, the average load is about 1,000,000,000 / 4,096 ≈ 244,141 dentries per virtual bucket, which is a reasonable level for the local RocksDB/LocalMetaKV index behind a physical bucket. Larger directories can use a larger bucket count configured at system startup.

Directory Size	Virtual Buckets	Avg Dentries / Bucket	Tradeoff
100M files	4,096	about 24,414	Low pressure for common large directories.
1B files	4,096	about 244,141	Target scale for the default layout, spreadable across DS owners.
10B files	4,096	about 2,441,406	Heavy per-bucket index; increase bucket count for this scale.
10B files	65,536	about 152,588	Better for very large directories, at the cost of more layout and fanout management.
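
The averages in the table follow from a single division, assuming the hash spreads dentries uniformly across virtual buckets. The helper name below is illustrative.

```python
def avg_dentries_per_bucket(files, virtual_buckets):
    """Average dentries per virtual bucket under a uniform hash (assumption)."""
    return round(files / virtual_buckets)


scenarios = [
    (100_000_000, 4_096),
    (1_000_000_000, 4_096),
    (10_000_000_000, 4_096),
    (10_000_000_000, 65_536),
]

for files, buckets in scenarios:
    avg = avg_dentries_per_bucket(files, buckets)
    print(f"{files:>14,} files / {buckets:>6,} buckets = about {avg:,} dentries per bucket")
```

Real directories will deviate from the average when names hash unevenly, so these figures are planning estimates rather than per-bucket guarantees.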

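Design constraint: to keep routing, migration, and crash recovery simple, KyteStore does not support changing the virtual bucket count after the system starts. Once the modulus in hash(parent_inode, name) % virtual_bucket_count changes, existing dentries route to new locations, so an online change would require migrating or rewriting the directory's entire dentry mapping. KyteStore currently fixes the bucket count and performs split and rebalance by adjusting the mapping from virtual bucket ranges to physical buckets.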
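
A quick experiment shows why the modulus is frozen. This is a sketch that assumes a BLAKE2b-based hash and synthetic file names; `bucket_of` and `moved_fraction` are hypothetical helpers.

```python
import hashlib


def bucket_of(parent_inode, name, count):
    """Stable hash of (parent_inode, name) modulo the virtual bucket count."""
    digest = hashlib.blake2b(f"{parent_inode}:{name}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % count


def moved_fraction(names, old_count, new_count, parent_inode=100):
    """Fraction of names whose virtual bucket changes when the count changes."""
    moved = sum(
        bucket_of(parent_inode, n, old_count) != bucket_of(parent_inode, n, new_count)
        for n in names
    )
    return moved / len(names)


names = [f"file-{i}" for i in range(10_000)]
doubling = moved_fraction(names, 4096, 8192)     # roughly half of the names move
arbitrary = moved_fraction(names, 4096, 5000)    # nearly all of the names move
print(f"4096 -> 8192: {doubling:.1%} rerouted")
print(f"4096 -> 5000: {arbitrary:.1%} rerouted")
```

Even the friendliest case, doubling the count, reroutes about half of the existing dentries, which is the migration cost the fixed-count design avoids.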

6. 3-Replica WAL and Batch Group Commit

KyteStore metadata mutations follow a strict order: validate against the local index, build a WAL record, append it to a durable log, and only then apply it to the materialized index. If a DS crashes after WAL durability but before index apply, recovery loads the latest checkpoint and replays the WAL.
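
The ordering above can be modeled with a small in-memory bucket. This is a minimal sketch, assuming a Python list as a stand-in for the durable log; the `Bucket` class, its record layout, and the `crash_before_apply` switch are hypothetical and exist only to exercise the recovery path.

```python
class Bucket:
    """Toy inode bucket: WAL first, then apply to the materialized index."""

    def __init__(self):
        self.wal = []                 # stand-in for the durable, replicated log
        self.index = {}               # materialized index: (parent, name) -> inode
        self.applied_lsn = 0
        self.checkpoint = (0, {})     # (lsn, index snapshot)

    def create(self, parent, name, ino, crash_before_apply=False):
        key = (parent, name)
        if key in self.index:                         # 1. validate against the index
            raise FileExistsError(name)
        record = {"lsn": len(self.wal) + 1, "op": "create", "key": key, "ino": ino}
        self.wal.append(record)                       # 2-3. build and durably append
        if crash_before_apply:
            return None                               # simulated crash after WAL durability
        self._apply(record)                           # 4. apply only after durability
        return record["lsn"]

    def _apply(self, record):
        if record["op"] == "create":
            self.index[record["key"]] = record["ino"]
        self.applied_lsn = record["lsn"]

    def take_checkpoint(self):
        self.checkpoint = (self.applied_lsn, dict(self.index))

    def recover(self):
        """Load the checkpoint, then replay every WAL record after it."""
        lsn, snapshot = self.checkpoint
        self.index = dict(snapshot)
        self.applied_lsn = lsn
        for record in self.wal:
            if record["lsn"] > lsn:
                self._apply(record)


bucket = Bucket()
bucket.create(100, "ckpt-41", 821)
bucket.take_checkpoint()
bucket.create(100, "ckpt-42", 822, crash_before_apply=True)
missing_before = (100, "ckpt-42") not in bucket.index
bucket.recover()
print(missing_before, bucket.index)
```

The second create is durable but not yet applied when the simulated crash happens; replaying the WAL on top of the checkpoint brings the index back in line with the log.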

The WAL is not stored in external Redis or FDB. It relies on KyteStore's own object write path: metadata WAL segments are written as hidden internal objects in the WRITE_BUFFER namespace, and underneath, ChunkSubSystem's AppendChunkReplicated sends them through the ChunkServer multi-replica path. The current benchmark uses a 3-replica configuration, so every committed metadata mutation lands in a replicated WAL, and periodic checkpoints bound the cost of replay during recovery.

If every create produced its own object write, latency would be amplified by small I/O. KyteStore therefore uses a DS-level WAL with group commit: multiple logical metadata records are packed into one physical WAL block and written to the same active segment object. Each record still carries its own inode_bucket_id, bucket-local LSN, and owner epoch, which preserves per-bucket ordering and recovery boundaries while spreading the cost of one WAL I/O across a batch of requests.

Conceptually, foreground metadata operations first enter a batch WAL block, then the block is appended to three Chunk replicas. Only after all replicas confirm the append does the DS apply the mutations to the local RocksDB metadata index. Recovery rebuilds the same state from durable data: load the checkpoint, replay the durable WAL records, and rebuild the materialized index.

Important detail: group commit does not wait artificially for a target batch size. A worker drains requests already queued; while one WAL I/O is running, new requests accumulate and are committed in the next batch. This keeps amortized latency low without adding unnecessary foreground delay.
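
The drain-based batching can be illustrated with a single-threaded model. This is a sketch under assumptions: `GroupCommitter`, `record`, and the `flush` callback are hypothetical, and arrivals during an in-flight WAL I/O are simulated by submitting from inside the first flush.

```python
from collections import deque


class GroupCommitter:
    """Hypothetical single-threaded model of a DS-level WAL worker."""

    def __init__(self, flush):
        self.queue = deque()
        self.flush = flush          # callable that durably writes one batch

    def submit(self, record):
        self.queue.append(record)

    def run_once(self):
        """Drain everything queued right now and write it as one batch."""
        if not self.queue:
            return 0
        batch = list(self.queue)
        self.queue.clear()
        self.flush(batch)           # requests submitted meanwhile wait for the next round
        return len(batch)


def record(bucket_id, lsn):
    # Each logical record keeps its own bucket id, bucket-local LSN, and owner epoch.
    return {"inode_bucket_id": bucket_id, "lsn": lsn, "owner_epoch": 1}


batch_sizes = []


def flush(batch):
    batch_sizes.append(len(batch))
    if len(batch_sizes) == 1:
        # Simulate five requests arriving while the first WAL I/O is in flight.
        for lsn in range(1, 6):
            committer.submit(record(bucket_id=9, lsn=lsn))


committer = GroupCommitter(flush)
committer.submit(record(bucket_id=7, lsn=1))
committer.submit(record(bucket_id=7, lsn=2))
while committer.run_once():
    pass
print(batch_sizes)  # [2, 5]
```

The batch size is never configured: it grows on its own whenever the previous flush takes longer, which is the behavior the paragraph above describes.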

7. Cross-Bucket Rename Transactions

Same-bucket rename remains the fast path: one WAL record moves the dentry while keeping the inode unchanged, without touching MS locks. Only when old and new dentries fall into different buckets does KyteStore enter the transaction slow path.

The latest model treats cross-bucket rename as a recoverable metadata transaction. The WAL carries a transaction descriptor with txn_id, participating buckets, affected dentries, affected inodes, the old dentry, and the new dentry. The state machine has the states prepared, commit decided, applied, aborted, and finished. The index layer installs dentry/inode fences for pending transactions, and checkpoints also store pending transactions, so recovery can roll forward or abort deterministically.
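
A possible shape for the descriptor and its state machine is sketched below. The article names the states and descriptor fields but not the transition edges or the per-state recovery rule, so `ALLOWED`, `RECOVERY_ACTION`, and the `RenameTxn` class are assumptions made for illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class TxnState(Enum):
    PREPARED = "prepared"
    COMMIT_DECIDED = "commit decided"
    APPLIED = "applied"
    ABORTED = "aborted"
    FINISHED = "finished"


# Assumed legal transitions between the named states.
ALLOWED = {
    TxnState.PREPARED: {TxnState.COMMIT_DECIDED, TxnState.ABORTED},
    TxnState.COMMIT_DECIDED: {TxnState.APPLIED},
    TxnState.APPLIED: {TxnState.FINISHED},
    TxnState.ABORTED: {TxnState.FINISHED},
    TxnState.FINISHED: set(),
}

# Assumed recovery rule, keyed by the last durable state found in the WAL.
RECOVERY_ACTION = {
    TxnState.PREPARED: "abort",
    TxnState.COMMIT_DECIDED: "roll forward",
    TxnState.APPLIED: "release fences and finish",
    TxnState.ABORTED: "release fences and finish",
    TxnState.FINISHED: "nothing to do",
}


@dataclass
class RenameTxn:
    txn_id: str
    participants: tuple       # bucket ids taking part in the rename
    old_dentry: tuple         # (parent_inode, name)
    new_dentry: tuple         # (parent_inode, name)
    inode: int                # inode identity stays the same throughout
    state: TxnState = TxnState.PREPARED

    def advance(self, target):
        if target not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.value} -> {target.value}")
        self.state = target


txn = RenameTxn("txn-1", (7, 9), (100, "ckpt-42.tmp"), (100, "ckpt-42"), 822)
txn.advance(TxnState.COMMIT_DECIDED)
print(RECOVERY_ACTION[txn.state])
```

Under these assumed rules, the commit decision is the point of no return: before it, recovery aborts; after it, recovery always completes the rename.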

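The key guardrail is that transaction complexity stays in the rename slow path. Create, unlink, mkdir, lookup, and readdir do not touch MS locks and do not degrade because cross-bucket rename exists, and same-bucket rename continues to use the single-record WAL fast path. MS locks are used only for fencing and carry no file metadata.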

8. Performance Results

The latest FS metadata benchmark uses bthread coroutines to issue roughly 2,000 to 4,000 concurrent requests to a single DataServer, running metadata mutations continuously. avg_us/op is an amortized value derived from total throughput, while Batch WAL P50/P99 measures the completion latency of a WAL batch flush, so the two numbers are measured at different granularities.

With multiple DS nodes, metadata QPS scales with the number of inode bucket owners. DS nodes share no metadata write bottleneck and do not coordinate with each other on ordinary create/unlink/mkdir paths, so reaching millions of metadata QPS across a cluster is relatively straightforward. The millisecond-level WAL latency mainly comes from the three-replica Chunk write: a batch WAL flush can return only after every replica append succeeds, while group commit amortizes that cost across many foreground metadata operations.

OP	QPS	avg_us/op	Batch WAL P50 (us)	Batch WAL P99 (us)
create	379,324	2.64	2,114	6,175
unlink	270,512	3.70	5,973	12,690
rename (cross-directory)	237,571	4.21	3,881	13,024
mkdir	230,950	4.33	5,299	13,327
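
Because avg_us/op is derived from total throughput, it can be recomputed from the QPS column as 1,000,000 / QPS. The short check below uses the table's QPS values; the helper name is illustrative.

```python
# QPS values copied from the table above.
qps_by_op = {
    "create": 379_324,
    "unlink": 270_512,
    "rename (cross-directory)": 237_571,
    "mkdir": 230_950,
}


def avg_us_per_op(qps):
    """Amortized cost per operation in microseconds, derived from throughput."""
    return 1_000_000 / qps


for op, qps in qps_by_op.items():
    print(f"{op:<26} {qps:>8,} QPS -> {avg_us_per_op(qps):.2f} us/op")
```

The result is an amortized cost under high concurrency, not the latency a single request observes; each request still waits for the batch WAL flush it belongs to.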

The data shows that an individual WAL batch flush still takes milliseconds, but because the group commit batches are large enough, foreground operations see a microsecond-level amortized cost. This is why KyteStore does not treat durable writes and high throughput as an either/or choice: reliability comes from the three-replica WAL, while throughput comes from batching, bthread concurrency, and bucket/DS-level parallelism.

This article focuses on metadata design. The data path, extent maps, and remote object backend integration will be covered separately.