Generated with a large language model based on the actual code implementation.
For a storage system, reliability is not simply whether processes stay alive. The harder requirement is that the system must not return an incorrect data state during failures. KyteStore's local chaos runner encodes that rule in its oracle: retryable unavailable responses are acceptable during a fault window, but successfully written objects must not disappear, deleted objects must not reappear, and LIST must not omit objects that are present in the model ledger.
Outline
What chaos testing means
How chaos testing differs from unit and integration tests
How KyteStore's chaos test works
Current limitations in KyteStore's chaos testing
1. What Chaos Testing Means
Chaos testing is a controlled experiment. It defines steady state and verifiable invariants, injects disturbances such as process exits, node offline events, and rolling updates, and then uses automated checks to decide whether the system still satisfies its contract. The goal is not random disruption; it is to bring realistic failures into a reproducible and diagnosable test environment.
This is especially important for distributed storage. Correctness spans the FrontEnd (FE), MetaServer, DataServer, replication, recovery, and sometimes a remote object backend. A module can pass its own tests and still fail when restart, failover, topology refresh, and foreground writes overlap.
[Figure: Controlled Experiment Loop]
A useful chaos test needs three properties. Faults must be controlled, so the test does not become unexplained noise. The oracle must be strict enough that data loss cannot be explained away as a transient error. The artifacts must make failures reproducible, with ledgers, operation history, fault history, and logs kept for diagnosis.
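The loop described in this section can be sketched in a few lines. This is a minimal illustration, not KyteStore's runner: the `Experiment` class and its callback names are hypothetical, and the toy "system" is a dictionary.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Experiment:
    """One controlled chaos experiment. All names are illustrative."""
    check_invariants: Callable[[], List[str]]  # returns violation messages
    inject_fault: Callable[[], str]            # returns a fault description
    wait_recovered: Callable[[], bool]         # True once the system converged
    fault_history: List[str] = field(default_factory=list)
    violations: List[str] = field(default_factory=list)

    def run(self, rounds: int) -> bool:
        # Steady state must hold before the first fault is injected.
        self.violations.extend(self.check_invariants())
        for _ in range(rounds):
            if self.violations:
                break
            # Inject one controlled disturbance and keep it for diagnosis.
            self.fault_history.append(self.inject_fault())
            if not self.wait_recovered():
                self.violations.append("recovery did not converge")
                break
            # Re-check the invariants after every fault.
            self.violations.extend(self.check_invariants())
        return not self.violations


# Toy system: one stored object that must survive every fault.
store = {"a": b"payload"}
exp = Experiment(
    check_invariants=lambda: [] if "a" in store else ["object a disappeared"],
    inject_fault=lambda: "restart ds-1",
    wait_recovered=lambda: True,
)
passed = exp.run(rounds=3)
```

The run stops at the first violation, so the recorded fault history ends at the fault that most likely caused it.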
2. How Chaos Testing Differs from Unit and Integration Tests
Unit tests, integration tests, and chaos tests serve different layers. Unit tests verify functions, classes, and local state machines. Integration tests verify API semantics across a small group of components. Chaos tests verify that the running system still preserves high-level invariants under real faults and continuous traffic.
[Figure: Test Layers and Coverage Boundary]
| Test Type | Main Target | Fault Model | Oracle | Finds |
| --- | --- | --- | --- | --- |
| Unit test | Single function, class, or local module. | Mocks, failpoints, boundary inputs. | Internal assertions and return values. | Algorithm bugs, state transition bugs, boundary issues. |
| Integration test | API behavior across a small component set. | Controlled dependencies and limited exception paths. | | |
KyteStore still relies on unit and integration tests for fast diagnosis. Chaos testing is the final system-level layer: it validates object storage semantics through real server processes, real FE HTTP traffic, and a real local cluster topology.
3. How KyteStore's Chaos Test Works
The local runner lives at scripts/tests/chaos/local_chaos.py. By default it starts a local cluster with one FDB, three MetaServers, three DataServers, and three FrontEnds, plus a 2x node pool for recovery and scale scenarios. Workload traffic goes through the FE HTTP surface instead of bypassing product code paths.
[Figure: KyteStore Local Chaos Pipeline]
The core is a model ledger. Every successful write records key, generation, size, checksum, and object state in SQLite. Test bodies are deterministically generated from seed, key, and generation, so later GET, HEAD, Range GET, and LIST verification can be strict. When an overwrite times out or returns an ambiguous result, the ledger keeps candidate versions and converges later using actually visible bytes.
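The ledger idea can be illustrated with a small sketch. It assumes a SHA-256 based body generator, a single `objects` table, and the state names `committed`, `candidate`, and `retired`. None of these are taken from `local_chaos.py`, whose actual schema and generator may differ.

```python
import hashlib
import sqlite3
from typing import Optional


def make_body(seed: int, key: str, generation: int, size: int) -> bytes:
    """Derive an object body from (seed, key, generation) only."""
    out = bytearray()
    counter = 0
    while len(out) < size:
        material = f"{seed}:{key}:{generation}:{counter}".encode()
        out += hashlib.sha256(material).digest()
        counter += 1
    return bytes(out[:size])


def open_ledger() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE objects (key TEXT, generation INTEGER, size INTEGER,"
        " checksum TEXT, state TEXT, PRIMARY KEY (key, generation))"
    )
    return db


def record(db, seed, key, generation, size, state):
    """Store one version; state is 'committed' or 'candidate' (ambiguous)."""
    checksum = hashlib.sha256(make_body(seed, key, generation, size)).hexdigest()
    db.execute(
        "INSERT OR REPLACE INTO objects VALUES (?, ?, ?, ?, ?)",
        (key, generation, size, checksum, state),
    )


def converge(db, key: str, visible: bytes) -> Optional[int]:
    """Pick the version whose checksum matches the bytes a GET returned."""
    seen = hashlib.sha256(visible).hexdigest()
    rows = db.execute(
        "SELECT generation FROM objects WHERE key = ? AND checksum = ?"
        " AND state IN ('committed', 'candidate')",
        (key, seen),
    ).fetchall()
    if not rows:
        return None  # visible bytes match no known version: an oracle failure
    winner = max(g for (g,) in rows)
    db.execute(
        "UPDATE objects SET state = CASE WHEN generation = ? THEN 'committed'"
        " ELSE 'retired' END WHERE key = ?",
        (winner, key),
    )
    return winner
```

Because the body is a pure function of seed, key, and generation, the verifier never needs to store payloads: it can regenerate any expected body, or any byte range of it, on demand.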
[Figure: Object Ledger and Accepted Responses]
On the fault side, the default mix is DataServer-heavy. It covers restart, kill-start, stop-start, update, rolling update, offline-ds, and dead-ds for DataServers, and restart, kill/start, stop/start, and update for MetaServers and FrontEnds. The runner executes one chaos action at a time and gates disruptive DS actions on recovery, so the single-DS failure contract is not conflated with multiple independent simultaneous failures.
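A serialized, seeded fault schedule of this kind might look like the sketch below. The weights, the shortened MS/FE action list, and the `run_faults` signature are assumptions made for illustration. Only the action names and the "DataServer-heavy, one at a time, gated on recovery" behavior come from the description above.

```python
import random
from typing import Callable, List, Tuple

# Illustrative weights only: DataServer actions dominate the mix.
FAULT_MIX: List[Tuple[str, str, int]] = [
    ("ds", "restart", 4), ("ds", "kill-start", 4), ("ds", "stop-start", 3),
    ("ds", "update", 2), ("ds", "rolling update", 2),
    ("ds", "offline-ds", 2), ("ds", "dead-ds", 2),
    ("ms", "restart", 1), ("ms", "update", 1),
    ("fe", "restart", 1), ("fe", "update", 1),
]


def run_faults(
    seed: int,
    count: int,
    execute: Callable[[str, str], None],
    wait_ds_recovered: Callable[[], None],
) -> List[Tuple[str, str]]:
    """Run `count` faults strictly one after another and return the history."""
    rng = random.Random(seed)  # seeded, so a failing schedule can be replayed
    weights = [w for _, _, w in FAULT_MIX]
    history: List[Tuple[str, str]] = []
    ds_repair_pending = False
    for _ in range(count):
        role, action, _weight = rng.choices(FAULT_MIX, weights=weights, k=1)[0]
        if role == "ds":
            # Gate: never stack a DS disruption on an unrepaired one.
            if ds_repair_pending:
                wait_ds_recovered()
            ds_repair_pending = True
        execute(role, action)
        history.append((role, action))
    return history
```

Seeding the generator means the same seed reproduces the same fault order, which is what makes a failed run diagnosable.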
| Module | Current Implementation | Why It Matters |
| --- | --- | --- |
| Cluster bootstrap | Generates local config, defaults to 3MS / 3DS / 3FE and a 2x node pool. | The test does not depend on a hand-built environment, and failures are reproducible. |
| Workload | 24 workers run PUT, overwrite, GET, HEAD, Range, DELETE, LIST, and related object operations by default. | Covers normal object read/write and enumeration semantics. |
| State ledger | model.sqlite stores object states, operations, faults, invariants, and failure records. | Correctness moves from log inspection to a queryable model. |
| Recovery gate | After DS disruption, waits for process state, registration, namespace health, and write-buffer under-replicated chunks to converge. | The next DS fault is injected only after the previous repair has completed. |
| Artifacts | Keeps command logs, verification summaries, metrics snapshots, report.md, and the local run directory. | Failures can be traced from object-level evidence to cluster-level state. |
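The recovery gate is essentially a bounded poll over several health predicates. The following sketch assumes each condition (process state, registration, namespace health, under-replicated chunk count) is exposed as a boolean callable. The function name and parameters are hypothetical.

```python
import time
from typing import Callable, Dict


def wait_for_recovery(
    checks: Dict[str, Callable[[], bool]],
    timeout_s: float,
    interval_s: float = 0.5,
) -> Dict[str, bool]:
    """Poll every check until all pass or the deadline expires.

    Returns the last observed status per check, so that a timeout
    report can name exactly which condition failed to converge.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = {name: bool(check()) for name, check in checks.items()}
        if all(status.values()) or time.monotonic() >= deadline:
            return status
        time.sleep(interval_s)


# Toy usage: under-replicated chunks drain by one per poll.
remaining = {"chunks": 3}


def chunks_repaired() -> bool:
    remaining["chunks"] = max(0, remaining["chunks"] - 1)
    return remaining["chunks"] == 0


status = wait_for_recovery(
    {"process_up": lambda: True, "under_replicated_zero": chunks_repaired},
    timeout_s=5.0,
    interval_s=0.0,
)
```

Returning the per-check status instead of a single boolean keeps a timed-out gate diagnosable.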
KyteStore Practice Rules
Validate user-visible semantics. The runner uses the FE HTTP object interface and checks externally visible GET, HEAD, LIST, and write behavior.
Wrong data is worse than unavailability. Retryable unavailable is acceptable during a fault window; wrong 404, wrong bytes, or a deleted object returning 2xx fails the run.
Every successful write enters the ledger. Object bodies are deterministic, so later verification can compare size, checksum, and range data exactly.
Fault injection needs recovery gates. For DS failures, KyteStore waits for namespace health and replica repair state before injecting the next DS disruption.
Keep long runs sustainable. Live bytes, live object count, and delete watermarks prevent test data from growing without bound.
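The "wrong data is worse than unavailability" rule can be expressed as a small verdict function. This sketch is illustrative: the `Verdict` names, the use of HTTP 503 as the retryable status, and the choice to flag unavailability outside a fault window as a violation are assumptions, not details confirmed by the runner.

```python
from enum import Enum
from typing import Optional, Set


class Verdict(Enum):
    OK = "ok"
    RETRYABLE = "retryable"
    VIOLATION = "violation"


# Assumption: 503 stands in for "retryable unavailable".
RETRYABLE_STATUSES = {503}


def judge_get(
    expected_state: str,              # "live" or "deleted", from the ledger
    expected_checksum: Optional[str],
    status: int,
    body_checksum: Optional[str],
    in_fault_window: bool,
) -> Verdict:
    if status in RETRYABLE_STATUSES:
        # Unavailability is tolerated only while a fault is in progress.
        return Verdict.RETRYABLE if in_fault_window else Verdict.VIOLATION
    if expected_state == "live":
        # A wrong 404 or wrong bytes is data loss, never a transient.
        ok = status == 200 and body_checksum == expected_checksum
        return Verdict.OK if ok else Verdict.VIOLATION
    if expected_state == "deleted":
        # A deleted object must not come back with a 2xx response.
        return Verdict.OK if status == 404 else Verdict.VIOLATION
    raise ValueError(f"unknown ledger state: {expected_state}")


def missing_from_list(ledger_live_keys: Set[str], listed_keys: Set[str]) -> Set[str]:
    """Keys the ledger says are live but LIST did not return."""
    return ledger_live_keys - listed_keys
```

Note that the fault window never changes the verdict for a wrong 404 or wrong bytes: only unavailability is forgiven.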
4. Current Limitations in KyteStore's Chaos Testing
The current local chaos test covers process-level faults, DS lifecycle transitions, rolling updates, object semantics, and recovery gates, but it is not yet a complete distributed fault experimentation platform. The largest gaps are network and disk IO. In production, many failures are not clean process exits; they are latency, one-way packet loss, slow fsync, disk full, IO errors, or cgroup pressure.
[Figure: Current Coverage and Next Directions]
| Gap | Why It Matters | Next Direction |
| --- | --- | --- |
| Network faults | Distributed systems often fail through latency, loss, half-open links, and one-way reachability issues. | Add tc/netem, iptables, or eBPF-level injection for FE-MS, MS-DS, and DS-DS links. |
| Disk IO faults | WAL, RocksDB, and chunk append depend on local storage; slow fsync or disk-full events directly affect writes. | Add disk throttling, delay, ENOSPC, EIO, and fsync jitter injection. |
| Resource pressure | Foreground requests compete with recovery, compaction, and replication for CPU, memory, and IO. | Use cgroups, stress tools, and workload profiles to test recovery convergence under contention. |
| Precise failpoints | Process-level faults do not reliably hit specific commit points such as after WAL persistence but before response. | Add a runtime failpoint control plane with pause, error, delay, and one-shot triggers. |
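To make the "precise failpoints" direction concrete, here is a hypothetical sketch of what a runtime failpoint registry with pause, error, delay, and one-shot triggers could look like. Nothing here exists in KyteStore today: the class names, the failpoint name `after_wal_before_response`, and the `handle_put` stand-in are all invented for illustration.

```python
import threading
import time
from dataclasses import dataclass, field
from typing import Dict, List


class FailpointError(RuntimeError):
    """Raised when an armed 'error' failpoint is hit."""


@dataclass
class Failpoint:
    action: str                      # "error", "delay", or "pause"
    delay_s: float = 0.0
    one_shot: bool = False
    resume: threading.Event = field(default_factory=threading.Event)


class FailpointRegistry:
    def __init__(self) -> None:
        self._armed: Dict[str, Failpoint] = {}
        self._lock = threading.Lock()

    def arm(self, name: str, fp: Failpoint) -> None:
        with self._lock:
            self._armed[name] = fp

    def hit(self, name: str) -> None:
        with self._lock:
            fp = self._armed.get(name)
            if fp is not None and fp.one_shot:
                del self._armed[name]  # fires once, then disarms itself
        if fp is None:
            return
        if fp.action == "delay":
            time.sleep(fp.delay_s)
        elif fp.action == "pause":
            fp.resume.wait()           # blocks until the test calls fp.resume.set()
        elif fp.action == "error":
            raise FailpointError(name)


def handle_put(registry: FailpointRegistry, wal: List[str], key: str) -> str:
    wal.append(key)                    # stands in for WAL persistence
    registry.hit("after_wal_before_response")
    return "200 OK"
```

Arming a one-shot `error` at this point produces exactly the case a process kill rarely hits: the write is durable, but the client never sees a success response, so the ledger must treat it as an ambiguous candidate.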
These gaps do not reduce the value of the current runner, but they define the next level of coverage. KyteStore already treats object semantic correctness as a hard gate. The next step is to extend the fault model from process lifecycle events to network, disk, and resource degradation, so the tests better match real production failure modes.
Jepsen: a well-known practice of fault injection and model-based verification for distributed systems.
This article focuses on KyteStore's current local chaos testing framework. Coverage of network, disk IO, and resource-pressure faults will be added as the test framework evolves.