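Within a single dentry bucket, rename remains a fast path: it writes one WAL record, deletes the old dentry, and writes the new dentry, while the inode stays unchanged and no MS lock is touched. Only when the old bucket and the new bucket differ does rename enter the transaction slow path.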
Generated with a large language model based on the actual code implementation.
KyteStore metadata design principles:
Avoid depending on an external standalone database system whenever possible, so metadata does not bottleneck on network hops or a central database.
Keep normal-path operations fast and local to metadata shards.
Decentralize metadata across DataServer nodes to avoid single-node throughput and availability limits.
Outline
Filesystem logical structure
Filesystem metadata and POSIX operation semantics
Why cross-shard operations are hard
Main design choices and KyteStore's decision
KyteStore inode bucket architecture and 4K bucket sizing
3-replica WAL and batch group commit
Cross-bucket rename transaction model
Performance and horizontal scaling
1. Filesystem Logical Structure
To understand filesystem metadata, separate names, file entities, and data blocks. In a typical Linux / Ext4 model, the VFS layer uses dentries to represent pathname bindings, while Ext4 stores on-disk directory entries that map names to inode numbers. Inodes hold attributes such as type, permissions, size, and timestamps, and extents map logical file ranges to physical data blocks.
Figure: Ext4-style filesystem logical structure
In this diagram, the upper layer is filesystem metadata, while the lower path shows how a pathname eventually reaches data blocks. User reads and writes ultimately access data blocks, but the system must first resolve the dentry, load the inode, and follow extents. If the metadata path is slow, the data path cannot be fully utilized.
2. What Filesystem Metadata Does
Filesystem metadata is more than filenames. A normal open("/a/b/c") walks path components, resolves dentry to inode, loads inode attributes, and eventually uses extent maps to find data. Directory listing needs a stable view. mkdir, unlink, and rename must update visibility, inode lifecycle, and recovery state.
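The path walk described above can be sketched in a few lines. This is a minimal illustration, not KyteStore code: the in-memory `dentries` and `inodes` tables, the inode numbers, and the `resolve` helper are all hypothetical.

```python
# Hypothetical in-memory tables; a real system would query shard-local indexes.
ROOT_INO = 1

# (parent inode, name) -> child inode
dentries = {(1, "a"): 2, (2, "b"): 3, (3, "c"): 4}

# inode number -> attributes (extents only matter for regular files)
inodes = {
    1: {"type": "dir"},
    2: {"type": "dir"},
    3: {"type": "dir"},
    4: {"type": "file", "size": 8192, "extents": [(0, 8192, "blk-17")]},
}


def resolve(path):
    """Walk path components from the root and return the final inode number."""
    ino = ROOT_INO
    for name in [part for part in path.split("/") if part]:
        if inodes[ino]["type"] != "dir":
            raise NotADirectoryError(path)
        key = (ino, name)
        if key not in dentries:
            raise FileNotFoundError(path)
        ino = dentries[key]  # one dentry lookup per component
    return ino


ino = resolve("/a/b/c")
print(ino, inodes[ino]["extents"])
```

Every component costs one dentry lookup before any data block is touched, which is why lookup latency bounds how quickly the data path can start.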
In AI and lakehouse workloads, files may be large, but they may also be extremely numerous. Large files stress the data path. Small files, checkpoint publish patterns, and repeated directory scans often hit the metadata path first. KyteStore therefore treats lookup, create, and rename as first-class performance paths instead of delegating them to a central database by default.
| POSIX Metadata Op | Core Semantics | Common Implementation |
| --- | --- | --- |
| lookup / open | Walk path components and resolve parent/name dentries to inodes. Read inode type, permissions, size, mtime, link count, and related attributes. | Inode table, local metadata shard, metadata service, or materialized local view. |
| readdir / list | List dentries under a directory with cursor or pagination semantics. | Single-shard scan for small directories, fanout plus merge for large directories. |
| create / mkdir | Create a new dentry under the parent and allocate a child inode. | Single-shard transaction, WAL-first mutation, or dentry shard plus inode range allocator. |
| unlink / rmdir | Remove a dentry and update inode lifecycle; rmdir also validates directory emptiness. | Dentry tombstone, link count, background GC, usually protected by WAL. |
| rename | Atomically switch old dentry to new dentry while preserving inode identity. | Single WAL record in one shard; lock/intent plus transaction WAL across shards. |
A dentry is the name binding inside a directory: under a parent inode, a specific name points to a child inode. For example, <parent_inode=100, name="ckpt-42"> -> inode 822. Dentries answer the question "which object does this path name currently point to" and are the main objects touched by lookup, readdir, rename, and unlink.
An inode represents the file or directory entity itself: type, permissions, size, mtime, link count, and extent maps belong to the inode. Multiple names can point to the same inode, and rename should keep inode identity unchanged. This is the foundation for POSIX open file handles, hard links, and atomic rename.
Mainstream filesystems separate dentries and inodes because names and entities have different lifecycles. Names are created, removed, and moved frequently; entities carry attributes, data locations, and already-open file state. This separation lets rename update only the name binding, unlink remove a name before reclaiming the entity by reference count, and directory sharding scale the dentry namespace independently.
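A small model makes the dentry/inode split concrete. The sketch below is illustrative only: the `Namespace` class and its methods are hypothetical, it ignores target replacement and open file handles, and it reclaims an inode as soon as its link count reaches zero.

```python
from dataclasses import dataclass, field


@dataclass
class Inode:
    ino: int
    kind: str        # "file" or "dir"
    size: int = 0
    nlink: int = 0   # number of dentries pointing at this inode


@dataclass
class Namespace:
    inodes: dict = field(default_factory=dict)    # ino -> Inode
    dentries: dict = field(default_factory=dict)  # (parent_ino, name) -> ino

    def link(self, parent, name, inode):
        """Bind a name to an inode and count the reference."""
        self.inodes[inode.ino] = inode
        self.dentries[(parent, name)] = inode.ino
        inode.nlink += 1

    def rename(self, old_parent, old_name, new_parent, new_name):
        """Move only the name binding; the inode is not touched."""
        ino = self.dentries.pop((old_parent, old_name))
        self.dentries[(new_parent, new_name)] = ino
        return ino

    def unlink(self, parent, name):
        """Remove a name; reclaim the entity only when no names remain."""
        ino = self.dentries.pop((parent, name))
        inode = self.inodes[ino]
        inode.nlink -= 1
        if inode.nlink == 0:
            del self.inodes[ino]
        return ino


ns = Namespace()
ns.link(100, "ckpt-42", Inode(ino=822, kind="file", size=4096))
ns.rename(100, "ckpt-42", 100, "ckpt-latest")
print(ns.dentries)     # the name changed
print(ns.inodes[822])  # the inode number and attributes did not
```

Because rename never rewrites the inode, anything keyed by inode number keeps working across the move.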
3. The Hard Part: Cross-Shard Operations
With one metadata node, rename correctness is straightforward, but the model does not scale. KyteStore routes dentries by parent inode + name into dentry buckets, allowing large directories to spread across inode buckets. Lookup and create can scale out, but rename becomes harder.
Cross-bucket does not necessarily mean cross-directory. Two names in the same directory can hash into different buckets. POSIX rename also requires inode identity to remain unchanged. Creating a new inode that points to old extents and then deleting the old inode would break inode numbers, open file handles, concurrent access, and crash recovery semantics.
From a transaction perspective, cross-shard means that one user operation becomes multiple shard-local sub-operations. If one shard succeeds while another fails, the system enters an intermediate state: the old name may already be removed while the new name is not created, or the new name may be visible while the old name still exists. The failure may be a network timeout, DS crash, coordinator crash, or process exit after a WAL append but before the response. Without transaction state, fencing, and recovery rules, recovery cannot safely decide whether to roll back or roll forward.
Figure: Cross-shard partial failure risk
4. Main Design Choices
Centralized: Redis or FDB stores filename-to-inode, inode attributes, and extents. This is easy to reason about, but lookup, create, readdir, and rename inherit central write throughput and tail latency.
Decentralized: Metadata shards serve local indexes, while cross-shard operations use intents, locks, WAL, and recovery. Scalability is better, but fencing and recovery become the hard parts.
Reduced Semantics: Reject cross-directory rename, target replace, or complex directory mutation. This ships quickly, but exposes storage implementation boundaries to users and frameworks.
KyteStore Choice: Keep the control plane small: owner, epoch, ranges, and fencing. Move hot metadata to DS inode buckets, use local indexes for throughput, and replicated WAL for recovery.
KyteStore does not put full file metadata in MetaServer/FDB. MS/FDB tracks FS bindings, inode bucket owners, owner epochs, inode range allocation, and lightweight transaction locks. Hot dentry, inode attribute, and extent state belongs to the DS FSSubsystem owner. In other words, KyteStore combines a small centralized control plane with a decentralized metadata data plane.
5. KyteStore Inode Buckets
The recovery and scaling unit is the inode bucket. Each bucket has its own owner epoch, WAL, checkpoint, materialized index, and recovery boundary. FE/SDK or DS path RPC uses DirectoryLayout to map parent inode + filename to a dentry bucket and then sends the operation to the owner DS.
This keeps MS/FDB out of normal file I/O: lookup, getattr, readdir, create, unlink, and mkdir mainly hit the DS local materialized index. MS/FDB participates in owner transfer, range allocation, bucket materialization, and cross-DS transaction locking.
DirectoryLayout has two levels: a stable virtual bucket index produced by hash(parent_inode, filename, seed) % virtual_bucket_count, and a mapping from virtual bucket ranges to physical buckets. Physical buckets have owners, WALs, checkpoints, and local indexes. Directory split changes virtual-to-physical mapping; failover or rebalance changes physical bucket owners.
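The two-level routing can be sketched as follows. This is a hedged illustration rather than the actual DirectoryLayout: the BLAKE2b hash, the byte encoding of the key, the range representation, and the bucket names `pb-0` and `pb-1` are all assumptions chosen to make the example runnable.

```python
import bisect
import hashlib
import struct

VIRTUAL_BUCKET_COUNT = 4096  # fixed once the system starts


def virtual_bucket(parent_inode, filename, seed=0):
    """Level 1: a stable virtual bucket index for (parent_inode, filename, seed)."""
    # blake2b is a stand-in; the hash used by the real system is not specified here.
    payload = struct.pack("<QQ", parent_inode, seed) + filename.encode("utf-8")
    digest = hashlib.blake2b(payload, digest_size=8).digest()
    return int.from_bytes(digest, "little") % VIRTUAL_BUCKET_COUNT


class DirectoryLayout:
    """Level 2: map contiguous virtual bucket ranges to physical buckets."""

    def __init__(self, range_starts, physical_buckets):
        # range_starts[i] is the first virtual bucket served by physical_buckets[i].
        assert range_starts[0] == 0 and range_starts == sorted(range_starts)
        assert len(range_starts) == len(physical_buckets)
        self.range_starts = list(range_starts)
        self.physical_buckets = list(physical_buckets)

    def physical_bucket(self, vbucket):
        i = bisect.bisect_right(self.range_starts, vbucket) - 1
        return self.physical_buckets[i]

    def route(self, parent_inode, filename, seed=0):
        return self.physical_bucket(virtual_bucket(parent_inode, filename, seed))


before_split = DirectoryLayout([0], ["pb-0"])
after_split = DirectoryLayout([0, 2048], ["pb-0", "pb-1"])

vb = virtual_bucket(100, "ckpt-42")
print(vb, before_split.route(100, "ckpt-42"), after_split.route(100, "ckpt-42"))
```

In this sketch a split only edits the range table. The virtual bucket index of every existing dentry stays the same, which is the property that lets a split avoid rehashing names.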
The default single-directory layout uses 4K virtual buckets. For a directory with one billion files, the average load is about 1,000,000,000 / 4096 = 244,141 dentries per virtual bucket, which is a reasonable level for the local RocksDB/LocalMetaKV index behind a physical bucket. Larger directories can use a larger bucket count configured at system startup.
| Directory Size | Virtual Buckets | Avg Dentries / Bucket | Tradeoff |
| --- | --- | --- | --- |
| 100M files | 4,096 | about 24,414 | Low pressure for common large directories. |
| 1B files | 4,096 | about 244,141 | Target scale for the default layout, spreadable across DS owners. |
| 10B files | 4,096 | about 2,441,406 | Heavy per-bucket index; increase bucket count for this scale. |
| 10B files | 65,536 | about 152,588 | Better for very large directories, at the cost of more layout and fanout management. |
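The per-bucket averages are plain division, assuming the hash spreads names uniformly across virtual buckets. The helper name below is hypothetical; the inputs are the directory sizes and bucket counts from the table.

```python
def avg_dentries_per_bucket(files, virtual_buckets):
    """Average load per virtual bucket under a uniform hash (rounded)."""
    return round(files / virtual_buckets)


scenarios = [
    (100_000_000, 4_096),
    (1_000_000_000, 4_096),
    (10_000_000_000, 4_096),
    (10_000_000_000, 65_536),
]

for files, buckets in scenarios:
    print(f"{files:>14,} files / {buckets:>6,} buckets = "
          f"{avg_dentries_per_bucket(files, buckets):,} dentries per bucket")
```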
Design constraint: to keep routing, migration, and crash recovery simple, KyteStore does not support changing the virtual bucket count after the system starts. Changing the modulo would route existing dentries to new locations unless all related dentries are migrated or rewritten.
6. 3-Replica WAL And Batch Group Commit
KyteStore metadata mutations follow a strict order: validate against the local index, build a WAL record, append it to a durable log, and only then apply it to the materialized index. If a DS crashes after WAL durability but before index apply, recovery loads the latest checkpoint and replays the WAL.
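The ordering and the crash case can be modeled in a few lines. This is a simplified sketch under stated assumptions: a Python list stands in for the durable log, a dict stands in for the materialized index, and the `BucketState` class and its `crash_before_apply` flag are hypothetical test scaffolding.

```python
class BucketState:
    def __init__(self):
        self.wal = []              # stand-in for the durable log
        self.index = {}            # materialized index: (parent, name) -> ino
        self.checkpoint = ({}, 0)  # (index snapshot, last LSN it covers)

    def create(self, parent, name, ino, crash_before_apply=False):
        key = (parent, name)
        if key in self.index:                        # 1. validate against the index
            raise FileExistsError(name)
        record = {"lsn": len(self.wal) + 1, "op": "create",
                  "key": key, "ino": ino}            # 2. build the WAL record
        self.wal.append(record)                      # 3. durable append
        if crash_before_apply:
            return                                   # simulated crash after durability
        self._apply(record)                          # 4. apply to the index

    def _apply(self, record):
        if record["op"] == "create":
            self.index[record["key"]] = record["ino"]

    def take_checkpoint(self):
        last_lsn = self.wal[-1]["lsn"] if self.wal else 0
        self.checkpoint = (dict(self.index), last_lsn)

    def recover(self):
        snapshot, ckpt_lsn = self.checkpoint
        self.index = dict(snapshot)                  # load the latest checkpoint
        for record in self.wal:
            if record["lsn"] > ckpt_lsn:             # replay only newer records
                self._apply(record)


bucket = BucketState()
bucket.create(100, "a", 801)
bucket.take_checkpoint()
bucket.create(100, "b", 802, crash_before_apply=True)
index_before_recovery = dict(bucket.index)
bucket.recover()
print(index_before_recovery, bucket.index)
```

The record for `b` was durable but never applied, and replaying the WAL on top of the checkpoint restores it.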
The WAL is not stored in external Redis or FDB. It is written as hidden objects in a KyteStore WRITE_BUFFER namespace. Under the hood, ChunkSubSystem uses AppendChunkReplicated to write through ChunkServer's multi-replica path. The latest benchmark path uses 3 replicas, so committed metadata mutations are protected by a local replicated WAL before checkpoint compaction.
To avoid turning every create into a tiny object write, KyteStore uses a DS-level WAL with group commit. Multiple logical metadata records are packed into one physical WAL block in an active segment object. Each record still carries its own inode_bucket_id, bucket-local LSN, and owner epoch, preserving per-bucket ordering and recovery boundaries.
Conceptually, foreground metadata operations first enter a batch WAL block, then the block is appended to three Chunk replicas. Only after all replicas confirm the append does the DS apply the mutation to the local RocksDB metadata index. Recovery reconstructs the same state from durable data: it loads the latest checkpoint, replays the durable WAL records, and rebuilds the materialized index.
Important detail: group commit does not wait artificially for a target batch size. A worker drains requests already queued; while one WAL I/O is running, new requests accumulate and are committed in the next batch. This keeps amortized latency low without adding unnecessary foreground delay.
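The "drain what is already queued" behavior can be sketched with a standard queue. This is a single-threaded illustration, not the DS implementation: `GroupCommitter`, `drain_batch`, and the `max_records` cap are hypothetical, and appending to a list stands in for one replicated WAL append.

```python
import queue


def drain_batch(pending, max_records=1024):
    """Block for the first record, then take only what is already queued."""
    batch = [pending.get()]                      # wait until there is any work
    while len(batch) < max_records:
        try:
            batch.append(pending.get_nowait())   # never sleeps to fill the batch
        except queue.Empty:
            break
    return batch


class GroupCommitter:
    def __init__(self):
        self.pending = queue.Queue()
        self.next_lsn = {}   # inode_bucket_id -> next bucket-local LSN
        self.blocks = []     # physical WAL blocks appended so far

    def submit(self, inode_bucket_id, owner_epoch, payload):
        lsn = self.next_lsn.get(inode_bucket_id, 1)
        self.next_lsn[inode_bucket_id] = lsn + 1
        self.pending.put({"inode_bucket_id": inode_bucket_id, "lsn": lsn,
                          "owner_epoch": owner_epoch, "payload": payload})

    def flush_once(self):
        block = drain_batch(self.pending)
        self.blocks.append(block)   # stand-in for one replicated append
        return block


committer = GroupCommitter()
committer.submit(7, 3, "create a")
committer.submit(7, 3, "create b")
committer.submit(9, 1, "mkdir d")
committer.flush_once()               # three records share one physical block

committer.submit(7, 3, "unlink a")   # arrived while the previous I/O was running
committer.submit(9, 1, "create e")
committer.flush_once()               # committed together in the next block

print([len(block) for block in committer.blocks])
```

Batch size adapts to load in this model: a quiet system flushes single-record blocks immediately, while a busy one accumulates larger blocks for free during each in-flight append.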
7. Cross-Bucket Rename Transactions
Same-bucket rename remains the fast path: one WAL record moves the dentry while keeping the inode unchanged, without touching MS locks. Only when old and new dentries fall into different buckets does KyteStore enter the transaction slow path.
The latest model treats cross-bucket rename as a recoverable metadata transaction. The WAL carries a transaction descriptor with txn_id, participants, affected dentries, affected inodes, old dentry, and new dentry. The phase state moves through prepared, commit decided, applied, aborted, and finished. Pending transactions install local dentry/inode fences and are included in checkpoints, so recovery can roll forward or abort deterministically.
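The phase progression can be expressed as a small state machine. The phase names come from the description above, but the allowed transitions and the recovery rule here are assumptions made for illustration (a transaction with a durable commit decision rolls forward, one that only reached prepared aborts). The actual recovery protocol may consult other participants or MS locks.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    PREPARED = "prepared"
    COMMIT_DECIDED = "commit_decided"
    APPLIED = "applied"
    ABORTED = "aborted"
    FINISHED = "finished"


# Assumed transition graph; the source lists the phases but not the edges.
ALLOWED = {
    Phase.PREPARED: {Phase.COMMIT_DECIDED, Phase.ABORTED},
    Phase.COMMIT_DECIDED: {Phase.APPLIED},
    Phase.APPLIED: {Phase.FINISHED},
    Phase.ABORTED: {Phase.FINISHED},
    Phase.FINISHED: set(),
}


@dataclass
class RenameTxn:
    txn_id: str
    participants: tuple   # inode bucket ids taking part
    old_dentry: tuple     # (parent_inode, name)
    new_dentry: tuple     # (parent_inode, name)
    inode: int            # identity preserved across the rename
    phase: Phase = Phase.PREPARED

    def advance(self, next_phase):
        if next_phase not in ALLOWED[self.phase]:
            raise ValueError(f"{self.phase.value} -> {next_phase.value} not allowed")
        self.phase = next_phase


def recovery_action(txn):
    """Decide what recovery does with a transaction found in WAL or checkpoint."""
    if txn.phase is Phase.PREPARED:
        return "abort"           # no durable commit decision yet (assumed rule)
    if txn.phase is Phase.COMMIT_DECIDED:
        return "roll_forward"    # decision is durable, finish applying it
    if txn.phase in (Phase.APPLIED, Phase.ABORTED):
        return "finish"          # release fences and mark finished
    return "none"


txn = RenameTxn("txn-1", (12, 40), (100, "ckpt-42.tmp"), (100, "ckpt-42"), 822)
print(recovery_action(txn))
txn.advance(Phase.COMMIT_DECIDED)
print(recovery_action(txn))
```

The point of the sketch is that the decision depends only on the recorded phase, so replaying the same log always reaches the same outcome.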
The key guardrail is that transaction complexity stays in the rename slow path. Create, unlink, mkdir, lookup, and readdir do not touch MS locks, and same-bucket rename continues to use the single-record WAL fast path.
8. Performance Results
The latest FS metadata benchmark uses bthread coroutines to issue roughly 2,000 to 4,000 concurrent metadata requests to a single DataServer. avg_us/op is an amortized value derived from total throughput, while Batch WAL P50/P99 measures the completion latency of a WAL batch flush, so the two numbers intentionally describe different granularities.
With multiple DS nodes, metadata QPS scales by inode bucket owner. Ordinary create/unlink/mkdir paths do not share a cross-DS write bottleneck, so reaching millions of metadata QPS is relatively straightforward. The high WAL latency mainly comes from the three-replica Chunk write: a batch WAL can return only after every replica append succeeds, while group commit amortizes that cost across many foreground operations.
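| OP | QPS | avg_us/op | Batch WAL P50 (µs) | Batch WAL P99 (µs) |
| --- | --- | --- | --- | --- |
| create | 379,324 | 2.64 | 2,114 | 6,175 |
| unlink | 270,512 | 3.70 | 5,973 | 12,690 |
| rename, cross-directory | 237,571 | 4.21 | 3,881 | 13,024 |
| mkdir | 230,950 | 4.33 | 5,299 | 13,327 |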
The data shows that individual WAL batch flushes are still in milliseconds, but group commit makes the foreground operation cost microsecond-level on average. This is the core tradeoff: reliability comes from the three-replica WAL, while throughput comes from batching, bthread concurrency, and bucket/DS-level parallelism.
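Since avg_us/op is amortized from throughput, it can be reproduced from the QPS column as one second divided by operations per second. The snippet below uses only the QPS figures reported in the table; the helper name is hypothetical.

```python
def avg_us_per_op(qps):
    """Amortized cost per operation in microseconds: 1 second / QPS."""
    return 1_000_000 / qps


reported_qps = {
    "create": 379_324,
    "unlink": 270_512,
    "rename, cross-directory": 237_571,
    "mkdir": 230_950,
}

for op, qps in reported_qps.items():
    print(f"{op:<24} {avg_us_per_op(qps):.2f} us/op")
```

This is why a single create can sit behind a WAL flush that takes milliseconds while the amortized figure stays in microseconds: many operations complete per flush, and the average divides wall-clock time by all of them.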
This article focuses on metadata design. The data path, extent maps, and remote object backend integration will be covered separately.