Understanding Storage Distribution

Introduction

When you create a VM on OpenShift Virtualization with ODF, the experience is simple: you define a DataVolume, the VM boots, and the disk “just works.” But between “I created a VM” and “where does the data actually live?” there is an invisible layer of distributed storage doing something fundamentally different from what traditional VM platforms do.

On VMware, a VM’s disk is a VMDK file sitting on a single datastore, hosted by a specific storage array or disk group. You can point to it. On ODF, the same 20 GiB disk is split into thousands of small objects, spread across every node in the cluster, with multiple replicas on different hardware. No single node holds the complete disk. No single disk failure loses any data.

This document walks through real output from a 3-node, 24-OSD cluster to show exactly what that distribution looks like — first at the cluster level, then traced through a single VM’s disk.

Cluster-Wide View

The show-node-distribution.sh script provides a top-down view of how storage is distributed across the cluster. The following sections walk through its output from a live cluster with 100 cloned VMs.

Per-Node Storage Summary

  Host                                                      Capacity       Used  %Used  OSDs
  kube-...-000001d7                                        23.29 TiB  288.7 GiB   1.2%     8
  kube-...-000002b1                                        23.29 TiB  275.6 GiB   1.2%     8
  kube-...-00000311                                        23.29 TiB  271.4 GiB   1.1%     8

  TOTAL                                                    69.86 TiB  835.7 GiB   1.2%    24

This is the highest-level view: three nodes, each contributing 23.29 TiB of raw capacity via eight OSDs (Object Storage Daemons — the Ceph processes that own physical disks). Usage is nearly identical across all three nodes — 271 to 289 GiB each — because Ceph’s CRUSH algorithm distributes data by design, not by accident.

The total cluster holds ~70 TiB of raw capacity with 836 GiB used (1.2%). Even with 100 cloned VMs and their golden image, usage is minimal because CoW clones share data with the parent.

Per-OSD Breakdown

Each node’s eight OSDs carry a roughly equal share of the data:

  kube-...-000001d7
    osd.1        2.91 TiB   36.2 GiB   1.2%   PGs: 22
    osd.4        2.91 TiB   38.7 GiB   1.3%   PGs: 25
    osd.7        2.91 TiB   36.0 GiB   1.2%   PGs: 22
    osd.10       2.91 TiB   39.0 GiB   1.3%   PGs: 24
    osd.13       2.91 TiB   38.5 GiB   1.3%   PGs: 24
    osd.16       2.91 TiB   33.0 GiB   1.1%   PGs: 20
    osd.19       2.91 TiB   31.5 GiB   1.1%   PGs: 19
    osd.22       2.91 TiB   35.9 GiB   1.2%   PGs: 22

  kube-...-000002b1
    osd.2        2.91 TiB   32.1 GiB   1.1%   PGs: 19
    osd.5        2.91 TiB   35.9 GiB   1.2%   PGs: 22
    ...

  kube-...-00000311
    osd.0        2.91 TiB   28.6 GiB   1.0%   PGs: 18
    osd.3        2.91 TiB   29.5 GiB   1.0%   PGs: 18
    ...

An OSD is a Ceph daemon responsible for a single physical disk (or partition). Each OSD manages a set of Placement Groups (PGs) — logical buckets that Ceph uses to distribute data. Objects are assigned to PGs via a hash, and PGs are mapped to OSDs by the CRUSH algorithm.

The PG counts here range from 17 to 25 per OSD. That slight variation is normal — CRUSH optimizes for even distribution but does not guarantee identical PG counts. What matters is that no single OSD is dramatically overloaded compared to its peers.

Balance Assessment

  Highest node utilization:  1.2%
  Lowest  node utilization:  1.1%
  Spread:                    0.1 percentage points

  WELL BALANCED
  Data is evenly distributed across all nodes.
  No action needed.

A 0.1 percentage-point spread across three nodes means the data is almost perfectly balanced. In practice, you would start investigating imbalance if the spread exceeded a few percentage points, which can happen with very small pools or unusual CRUSH rules.

PG Distribution

  OSD           PGs  Distribution
  osd.0          18  ██████████████████
  osd.1          22  ██████████████████████
  osd.4          25  █████████████████████████
  osd.9          17  █████████████████
  osd.12         24  ████████████████████████
  ...

  Total PGs across OSDs: 512  (each PG is replicated, so counted per replica)
  Average PGs per OSD:   21.3
  PG spread:             Moderate (38% variance)

This section visualizes how the 512 PG replicas in this pool are spread across OSDs. The bar chart makes hotspots easy to spot — in this cluster, the bars are all roughly the same length, confirming even distribution. The “38% variance” sounds high but is normal for a pool of this size; the absolute difference between the busiest OSD (25 PGs) and the quietest (17 PGs) is small.

Pool Activity

  Client I/O:
    Read:  6.9 MiB/s  (1627 ops/s)
    Write: 366.5 MiB/s  (517 ops/s)
  Recovery:  None (cluster is clean)

A snapshot of live I/O at the moment the script ran. The write-heavy pattern (366 MiB/s writes vs. 6.9 MiB/s reads) is characteristic of the drift simulation phase, where each VM is writing unique data. “Recovery: None” confirms the cluster is healthy — no OSDs are down, no data is being rebalanced.

Summary

  Balance:   Storage is well balanced across your 3 nodes.
             Each node is carrying a similar share of the data.
  Hotspots:  No hotspots detected. PG distribution is even across OSDs.
  Capacity:  1.2% used — plenty of headroom.
             69.05 TiB available across the cluster.
  Pool:      Pool 'nrt-2' is healthy with no active recovery.

The script distills the raw numbers into actionable observations. For a cluster running 100 cloned VMs with simulated drift, 1.2% usage and perfect balance is exactly what you would expect from CoW clones on a well-configured Ceph cluster.

Single-VM Deep Dive

The show-vm-placement.sh script traces a single VM’s disk from the Kubernetes layer all the way down to individual RADOS objects on specific OSDs and nodes. The following output traces clone-vm-001.

The Mapping Chain

  VM                  clone-vm-001
   └─ PVC             clone-vm-001-disk
       └─ PV           pvc-793342cd-d14a-4cda-9aa2-2b39d4a88b2c
           └─ RBD Image   nrt-2/csi-vol-7e7002ab-fe50-4d4f-bfd5-4509bacab74f

  Disk size:        20.00 GiB
  Object size:      4.0 MiB  (each RADOS object)
  Total objects:    5,120
  Actual usage:     5.20 GiB

Four layers of abstraction connect the VM to physical storage:

VM — the KubeVirt VirtualMachine resource.
PVC — the PersistentVolumeClaim, which is how Kubernetes requests storage.
PV — the PersistentVolume, bound to the PVC by the CSI driver.
RBD Image — the actual Ceph block device image in the nrt-2 pool.

The disk is 20 GiB in size but only consumes 5.20 GiB (26%) of actual storage. The remaining 74% is shared with the golden image parent via copy-on-write — it exists as pointers, not as duplicated data.

Ceph splits this 20 GiB disk into 5,120 objects of 4 MiB each. Each object is an independent unit that can be placed on any OSD in the cluster.

Clone Lineage

  Parent image:  nrt-2/csi-vol-...-temp@csi-vol-...
       │
       ▼  (CoW snapshot)
  This clone:    nrt-2/csi-vol-7e7002ab-fe50-4d4f-bfd5-4509bacab74f

This confirms the clone is a CoW child of the golden image. The parent is an RBD snapshot (the @ delimiter denotes a snapshot in Ceph). When the VM reads a block it has never written to, Ceph follows the parent pointer to serve the data from the golden image’s snapshot — no duplication required. Only when the VM writes to a block does Ceph copy that 4 MiB chunk into the clone’s own storage.

Data Anatomy

  Object naming:
    Prefix:  rbd_data.c4efa441f544
    Pattern: rbd_data.c4efa441f544.<16-hex-digit offset>

  Examples:
    rbd_data.c4efa441f544.0000000000000000  <- first 4.0 MiB of disk
    rbd_data.c4efa441f544.0000000000000a00  <- middle of disk
    rbd_data.c4efa441f544.00000000000013ff  <- last chunk of disk

Every object has a deterministic name: a prefix unique to this RBD image, followed by a hex offset identifying which 4 MiB slice of the disk it represents. Ceph hashes these names to determine which PG (and therefore which OSDs) each object lands on. This is why data distribution is automatic — the hash function ensures a roughly even spread without any manual placement decisions.

Sample Placement Trace

The script samples 20 of the 5,120 objects to show where they physically land:

  Object                             PG           Primary    Replicas         Node
  rbd_data.c4efa441f544.00000000..   6.b00c5805   osd.22     osd.6            kube-...-000001d7
  rbd_data.c4efa441f544.00000000..   6.7ec23c01   osd.2      osd.3            kube-...-000002b1
  rbd_data.c4efa441f544.00000000..   6.b029b42b   osd.0      osd.16           kube-...-00000311
  rbd_data.c4efa441f544.00000000..   6.2e48f850   osd.3      osd.1            kube-...-00000311
  rbd_data.c4efa441f544.00000000..   6.8d660014   osd.6      osd.4            kube-...-00000311
  rbd_data.c4efa441f544.00000000..   6.8ca0fe23   osd.23     osd.9            kube-...-000002b1
  ...

Each row is one 4 MiB chunk of the VM’s disk. Notice:

PG — the placement group this object belongs to. Each PG hash is different, so objects land on different OSDs.
Primary — the OSD that handles reads and writes for this object. The primaries are scattered across many different OSDs (0, 2, 3, 4, 5, 6, 9, 12, 14, 15, 19, 22, 23).
Replicas — additional OSDs holding copies of this object on different nodes, ensuring data survives a node failure.
Node — the node hosting the primary OSD. All three nodes appear as primary for different objects.

This is the core insight: a single VM’s 20 GiB disk is not “on” any one node. It is spread across the entire cluster.

Node Coverage

  Node                                                     Primary  + Replica  Total
  kube-...-000001d7                                              6          8     14
  kube-...-000002b1                                              6          4     10
  kube-...-00000311                                              8          8     16

  This VM's data touches ALL 3 nodes in the cluster.
  Unique PGs in sample: 20 | OSDs touched: 19 of 24

From just 20 sampled objects, the data already touches 19 of the 24 OSDs and all 3 nodes. Extrapolate to the full 5,120 objects, and the disk is effectively spread across every OSD in the cluster. There is no concept of “this VM lives on node 2” — the VM’s compute runs on one node, but its storage is everywhere.

How Drift Affects the Efficiency Metric

The test report shows storage efficiency declining as clones accumulate unique writes (drift). Each drift level writes a new file of random data to every clone; previous files are kept, so the data is additive. Efficiency naturally declines as clones diverge from the golden image, but CoW continues to provide significant savings at every drift level.

The progression

Phase	PVCs	Stored (GB)	Full-Clone Cost (GB)	Efficiency
After cloning (no drift)	101	5.73	579	101.0x
+200 MB drift (1%)	101	32.4	606	18.7x
+1 GB drift (5%)	101	116	689	5.9x
+2 GB drift (10%)	101	217	790	3.6x
+5 GB drift (25%)	101	524	1,097	2.1x

The efficiency ratio drops with each drift level because each clone is writing more unique data that cannot be shared. But notice the Full-Clone Cost column grows too — this reflects the fact that drift data would exist regardless of cloning strategy.

How the formula works

The report calculates efficiency as:

drift_total     = actual_stored - post_clone_stored     # data added since cloning finished
full_clone_cost = pvc_count × baseline_stored + drift_total
efficiency      = full_clone_cost / actual_stored

First term (clone cost): The cost of making full copies of the golden image for every clone — 101 × 5.734 GB = 579 GB. This represents the data that CoW avoids duplicating.
Second term (drift): The total new data written since cloning finished. This data is unique to each clone and would exist whether clones are CoW or full copies, so it appears in both the numerator and denominator.
Denominator (actual stored): The real storage consumed by the pool.

At 25% drift, actual storage is 524 GB but full copies would cost 1,097 GB. The 2.1x ratio means CoW is still saving roughly half the storage even after significant divergence.

What CoW is saving at 25% drift

At the highest drift level, CoW saves ~573 GB (1,097 − 524). Here is why:

75% of each clone’s golden-image data was never written to. Those blocks are still shared with the parent and stored exactly once.
Only the 25% of blocks that each clone actually modified required new storage.
The drift data (~518 GB across all clones) is unique and must be stored regardless of cloning strategy — but the unmodified golden-image blocks are still shared.

VMware linked clones behave the same way

This pattern is not unique to Ceph or ODF. VMware linked clones exhibit the same behavior: delta disks grow with writes, and the ratio of shared-to-unique data shrinks over time. The storage savings from any CoW mechanism are greatest immediately after cloning and diminish as clones diverge from the parent. The key question is not “what is the efficiency ratio?” but “how much divergence do you expect in your workload?” — and that depends on the use case, not the storage platform.

Key Takeaways

Data is distributed, not localized. A single VM’s disk is split into thousands of 4 MiB objects, hashed into placement groups, and spread across every node and OSD in the cluster. No single node holds a complete copy of any VM’s disk.
No single point of failure. Every object is replicated across multiple nodes. A node failure does not cause data loss — Ceph serves reads from surviving replicas and automatically re-replicates to restore the desired replica count.
CoW clones share data. A cloned VM starts by pointing to the golden image’s data. Only blocks the clone writes to consume additional storage. In the example above, a 20 GiB disk uses only 5.20 GiB (26%) of actual storage — the other 74% is shared with the parent at zero cost.
This is different from VMware’s storage model. On VMware, a VM’s VMDK sits on a single datastore backed by a specific storage device. On ODF, data is automatically spread and replicated across the cluster. There is no single-datastore bottleneck, and storage capacity scales by adding nodes rather than expanding individual arrays.
Balance is automatic. Ceph’s CRUSH algorithm distributes data without manual intervention. The cluster in this example shows a 0.1 percentage-point spread across three nodes — nearly perfect balance with no tuning required.