VMware Cloud Native Storage: Be aware of dangling virtual disks and snapshots!

Introduction

When using VMware Cloud Native Storage, you can end up with unmanaged dangling virtual disks and snapshots.
This blog will cover the Kubernetes storage and VMware CNS basics and how to identify and remove dangling disks and snapshots.

VMware Cloud Native Storage (CNS)

With vSphere version 6.7 U3, VMware introduced Cloud Native Storage (or VMware CNS for short): providing virtual storage capabilities for Kubernetes, using a Container Storage Interface (CSI).

In a nutshell: VMware CNS provides a Container Storage Interface (CSI) plugin which maps Kubernetes Persistent Volumes (PVs) to VMware virtual disks (for example: disks hosted on VSAN), leveraging Storage Policy Based Management (SPBM) for operational convenience.
VMware CNS also provides the vSphere admin the operational handles to manage these virtual disks from the well-known vSphere interface.
With VMware CNS, both Kubernetes users and vSphere admins can operate/manage the same storage objects using their own well-known toolset, which allows a separation of duties to be implemented.

Virtual disks deployed and managed by VMware CNS are called First Class Disks (FCDs) and are stored together with normal VM virtual disks on the same (VSAN) datastore.
FCDs are deployed in the "FCD" top-level folder.

FCDs have the same VMDK file format as their VM virtual disk equivalents: the only difference is that FCDs are managed by VMware CNS and not via a VM's (VMX) configuration file.

Mapping PVs and FCDs using vSphere CSI plugin

Kubernetes is managed via API-resource objects, allowing every aspect to be programmatically managed.
In this case PVs and PVCs are API-resource objects, providing programmatic storage capabilities to Kubernetes PODs.
A PVC dictates which PV is connected to a POD and how, where the PV contains (a reference to) the actual storage.
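
To make the PV/PVC/POD relationship a bit more tangible, here is a minimal sketch (all names and the container image are hypothetical placeholders): the PVC claims 1Gi of storage and the POD mounts that claim; which PV satisfies the claim depends on static provisioning or a StorageClass, as explained below.

# Hypothetical example: a PVC claiming 1Gi of storage and a POD mounting that claim.
# All names and the image are placeholders; no storageClassName is set, so the
# cluster's default StorageClass (or a matching static PV) is used.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: example-pvc
EOF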

Instead of PVs and PVCs, VMware CNS works with Container Volumes and (FCD) backing disks.

PVs can be either statically or dynamically configured:
When statically configured, the Kubernetes cluster administrator pre-provisions PVs, which can be claimed using PVCs by Kubernetes users.
For the dynamic creation of PVs a StorageClass policy is used, which defines the underlying storage system:
In the case of vSphere CSI, a StorageClass policy is created which is configured with the vSphere CSI driver, a reference to a VSAN datastore or Virtual Volume (VVOL), and a reference to the vSphere storage policy.
Additionally, for VolumeSnapshots (which are virtual disk snapshots) a VolumeSnapshotClass policy is created by the Kubernetes cluster administrator, which defines the VolumeSnapshotContent DeletionPolicy configuration.
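
As a minimal sketch of what this could look like for the vSphere CSI driver (the class names and the SPBM storage policy name are placeholders for your environment, so verify the parameters against your CSI driver documentation):

# Hypothetical example: a StorageClass backed by the vSphere CSI driver and a matching
# VolumeSnapshotClass; names and the SPBM storage policy are placeholders.
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vsphere-csi-sc
provisioner: csi.vsphere.vmware.com
reclaimPolicy: Delete
parameters:
  storagepolicyname: "vSAN Default Storage Policy"
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: vsphere-csi-snapclass
driver: csi.vsphere.vmware.com
deletionPolicy: Delete
EOF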

The vSphere CSI driver syncs configurations between vSphere CNS and Kubernetes objects: The vSphere CSI driver itself is not in the data path.

Where PVs are Kubernetes objects, Container Volumes are vSphere objects; they are highly integrated but hosted on different systems: the configuration is synchronized between both systems by the vSphere CSI driver.
PVs and Container Volumes share the same configuration information: only the FCD (which is the Container Volume backing object) contains the actual data and is solely linked to a Container Volume.

It's the responsibility of the vSphere CSI driver and its sidecar helper containers to map and sync configuration objects: more on these sidecar helper containers later.
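
If you want to see these sidecars for yourself, you can list the containers of the CSI controller deployment. The namespace and deployment name below assume a default vanilla vSphere CSI installation and may differ in your environment:

# List the containers (the CSI driver plus sidecar helpers such as csi-provisioner and
# csi-snapshotter) of the vSphere CSI controller deployment.
# Namespace and deployment name may differ per installation.
kubectl -n vmware-system-csi get deployment vsphere-csi-controller \
  -o jsonpath='{.spec.template.spec.containers[*].name}'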

Reclaiming Policy

For the context of this blog I want to highlight the StorageClass reclaiming policy.
A reclaiming policy dictates what to do with the FCD when a PV is no longer in use by a K8s user: it can be configured as either retain or delete.
When set to "retain" and a PV is no longer needed, the PV and Container Volume are removed but the FCD is retained and therefore remains available for future use.
An unmanaged "dangling FCD" can consume datastore capacity for an undefined period of time, as the VMDK file is left on the datastore with no PV/Container Volume referencing it.
When the reclaiming policy is set to "delete", an unused FCD is deleted from the datastore, cleaning up the datastore automatically.
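
A quick way to check (and, if needed, correct) this from the Kubernetes side is sketched below; the PV name is a made-up placeholder:

# Show the reclaim policy of all StorageClasses and PVs.
kubectl get storageclass
kubectl get pv

# Hypothetical example: switch an existing PV from "Retain" to "Delete",
# so the FCD is cleaned up when the PV is released.
kubectl patch pv pvc-1234abcd-0000-0000-0000-000000000000 \
  -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'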

Snapshots

When the vSphere CSI plugin is installed into your Kubernetes cluster, you have the ability to configure VolumeSnapshots: which are FCD disk-based snapshots.

The vSphere CSI driver has a sidecar helper container, called the csi-snapshotter. This snapshot controller syncs Kubernetes VolumeSnapshot (and VolumeSnapshotContent) objects to FCD-snapshots using VMware CNS:
When a VolumeSnapshot is created within Kubernetes, the snapshot controller instructs the vSphere CSI driver to create a snapshot (VolumeSnapshotContent).
The vSphere CSI driver communicates with VMware CNS, an FCD-snapshot is created in the background, and the snapshot information is returned to the vSphere CSI driver, which updates the VolumeSnapshot and VolumeSnapshotContent objects within Kubernetes.
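
A minimal VolumeSnapshot sketch that triggers this flow (the snapshot, class and PVC names are placeholders):

# Hypothetical example: snapshot the PVC "example-pvc" using the VolumeSnapshotClass
# "vsphere-csi-snapclass"; VMware CNS creates the FCD-snapshot in the background.
cat <<'EOF' | kubectl apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: example-snapshot
spec:
  volumeSnapshotClassName: vsphere-csi-snapclass
  source:
    persistentVolumeClaimName: example-pvc
EOF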

So, comparable to the relationship between PVs and CNS Container Volumes: within Kubernetes, VolumeSnapshotContent objects are related to FCD-snapshots.
Also, comparable to the reclaiming policy, a VolumeSnapshot has a DeletionPolicy which can be configured as either delete or retain using the VolumeSnapshotClass policy (as explained at the beginning of this blog).
When configured as delete and the VolumeSnapshot object is deleted, the VolumeSnapshot object, its corresponding VolumeSnapshotContent object and the underlying FCD snapshot are automatically deleted as well.
When configured as retain and the VolumeSnapshot object is deleted, the VolumeSnapshot object is deleted but the VolumeSnapshotContent object and the underlying FCD snapshot are retained: basically leaving them in an unmanageable state, consuming unnecessary datastore capacity.
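
From the Kubernetes side you can spot such retained objects, for example by listing all VolumeSnapshotContents together with their deletion policy:

# List all VolumeSnapshotContents with their deletion policy and the VolumeSnapshot
# they belong to; "Retain" entries whose VolumeSnapshot no longer exists are
# candidates for manual cleanup.
kubectl get volumesnapshotcontent \
  -o custom-columns=NAME:.metadata.name,POLICY:.spec.deletionPolicy,SNAPSHOT:.spec.volumeSnapshotRef.name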

Be aware that these disk (redo log) snapshots have a negative effect on the performance of the underlying storage systems: this VMware blog perfectly explains how redo log snapshots work and what their impact is.

The best practice is to have a maximum of 3 snapshots per PV/FCD for no longer than 72 hours!
The vSphere CSI configuration can enforce a maximum number of snapshots per PV.
So, ideally no more than 3 snapshots should exist per PV/FCD, right? Well ...
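
The exact knob depends on your vSphere CSI driver version; in recent versions the snapshot limit is typically set in the [Snapshot] section of csi-vsphere.conf (for example the global-max-snapshots-per-block-volume parameter), which lives in the vSphere CSI config secret. Treat the command below as a sketch assuming a default vanilla installation and verify the names against your driver's documentation:

# Inspect the current vSphere CSI configuration to see whether a snapshot limit is set.
# Secret name, namespace and parameter name are assumptions based on a default install.
kubectl -n vmware-system-csi get secret vsphere-config-secret \
  -o jsonpath='{.data.csi-vsphere\.conf}' | base64 -d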

Things are about to get ugly

To quickly summarize, we have two platforms: Kubernetes and VMware, which are stitched together via the vSphere CSI plugin and VMware CNS.
When they are working correctly, everyone is happy as both system configurations are in full sync and both Kubernetes admins/users and vSphere admins can do what they need .. but there are scenarios where it's not working as expected and things can become ugly.

I want to emphasize that I'm talking about a corner-case here.
VMware CNS is a good solution and, in general, works as expected.
But there are some cases which require some love and attention.

CAUTION! When the ReclaimingPolicy of the StorageClass and/or the DeletionPolicy of the VolumeSnapshotClass are configured to "retain", the storage can fill up with dangling FCDs/snapshots consuming capacity in an unmanaged manner.
For a vSphere admin this can be a daunting situation as the vSphere CNS graphical interface does not provide any insight into these dangling FCDs and snapshots!
Also vSphere CNS is not able to delete dangling FCDs when they have snapshots attached to them!

It's therefore recommended to set both the ReclaimingPolicy and DeletionPolicy to delete, preventing you from ending up in an unwanted situation!

Also, I've encountered some stability problems with the vSphere CSI drivers (version 2.7) when a large (excessive) number of objects are simultaneously modified/removed: I'm working together with VMware to get to the bottom of this. Expect an update about this soon!

Fixing stuff with GOVC

GOVC is a lightweight vSphere CLI client, which is especially helpful for these troubleshooting use-cases.

First download the compiled GOVC version for your operating system using the above link.
Second, configure the following environmental variables:

  • GOVC_URL, containing the URL of your vCenter server.
  • GOVC_USERNAME, containing the vSphere administrator username.
  • GOVC_PASSWORD, containing the password (I know, this isn't very secure .. don't shoot the messenger)
  • GOVC_DATASTORE, containing the (VSAN) datastore name.

And you are ready! Yup, it was that easy!

We can optimize this manual process by creating a SetGovcEnv.sh file (for a Linux OS), containing the following information (please change it to match your environment):

# set env vars for govc
export GOVC_URL=https://vcsa.my.domain
export GOVC_USERNAME=my_username
export GOVC_INSECURE=false
export GOVC_DATASTORE=myVsanDatastore
echo -n Password:
read -s password
export GOVC_PASSWORD=$password

Execute ". ./SetGovcEnv.sh", enter your password and you're ready (even faster)!

Check your vCenter Server connection by executing the following CLI command:

govc about

You should receive an instant response with version information from the vCenter Server.
Note the speed at which the command returns its information: it's fast!
Now let's start some troubleshooting.

First task: Identify the dangling FCDs.
Dangling FCDs are disks without a Container Volume attached to them, consuming unnecessary datastore capacity.
PS ALIGN WITH YOUR KUBERNETES USERS BEFORE REMOVING DANGLING FCDs!

Retrieve a list of FCDs using the following command:

govc disk.ls

It will return a list of FCD backing disk-ids and corresponding PV names.
Be aware that this list contains all FCD disks (whether they are attached to a CNS Container Volume or not)!

The dynamically allocated Kubernetes disk names can be quickly identified, as the naming convention begins with "pvc-<GUID>".
There can be situations where you have multiple container platforms where FCDs don't have this naming convention, therefore it's a good practice to filter for all pvc-named disks using grep:

govc disk.ls | grep pvc

Now run the following command to retrieve the list of Container Volumes:

govc volume.ls

It will return a list of Container Volumes and corresponding FCD/PV-names.
When you have dangling FCDs, you will see that the Container Volume list is shorter than the FCD disk list.
So let's compare the list of disks and volumes:

IFS=$'\n' # set Internal Field Separator to newline because govc volume.ls prints results as newlines.
# get all the volumes and save them in an array
volumes=($(govc volume.ls))

# Get all the disks on the datastore,
# filter out only the disks containing the word 'pvc' (all our Kubernetes disks are called pvc-{guid}),
# save only the first field of the output (disk id {guid})
# and loop over every disk id.
govc disk.ls | grep pvc | awk '{print $1}' | while read DISK; do
    # Echo the volume array and check if the disk id is included in it.
    printf "%s\n" "${volumes[@]}" | grep -q "$DISK"
    # If grep is unsuccessful, the disk id is not attached to a Container Volume.
    if [ $? -ne 0 ]; then
        # print the disk id of the dangling FCD.
        echo "$DISK"
    fi
done

I want to thank Cormac Hogan, as the writer of the original code: his blog post was really helpful!
My work-wife (who wants to stay anonymous) and I have optimized his script so that the govc volume.ls command is executed only once: saving you time!

Now we have the list of disk-ids of the dangling FCDs which can be removed.
But before we can remove these disks, we have to remove attached snapshots first: you cannot remove FCD disks when they have snapshots attached to them.
Using the disk-id, we can check for attached snapshots on the corresponding disk using the following command:

govc disk.snapshot.ls <disk-id>

When the disk has attached snapshots, it will return a list of snapshot-ids and snapshot-names.
PS When Velero is being used for Kubernetes workload backups, the snapshots are named: "AstrolabeSnapshot"
Using the following command you can remove/consolidate the snapshot:

govc disk.snapshot.rm <disk-id> <snapshot-id>

Depending on the size of the snapshot, this can take from a couple of seconds up to several hours!
Govc will return after the removal of the snapshot OR after 30 minutes when the consolidation/removal process is still busy in the background.
After 30 minutes you will see a failed task within the vSphere client, but the task is still running on the ESXi host mentioned in the failed task!
Any subsequent snapshot remove task will fail with a warning, stating that the disk is locked.
You can monitor the process by logging into the mentioned ESXi host and using the vim-cmd command to retrieve the current status.
This process is quite cumbersome and it's easier to execute the following command to see if the snapshot removal has already finished:

govc disk.snapshot.ls <disk-id> | grep <snapshot-id>

When the snapshot removal process has finished, you won't receive a response; otherwise it is still busy with the consolidation/removal process.
My recommendation: be patient!

After removing all snapshots, we can continue with removing the dangling FCD using the following command:

govc disk.rm <disk-id>

All together this is still a cumbersome process, so let's combine all previous steps into one script.
Create a delete_disk_and_snapshots.sh bash shell script, containing the following code:

#!/bin/bash
# Delete a disk from vSphere including all snapshots (if there are no snapshots, only the disk)
# usage: ./delete_disk_and_snapshots.sh {DISKID}
# tip: call from the command line with an ampersand at the end (&) to run in the background and start multiple

# function that waits until the snapshot is removed from the disk.snapshot.ls list.
# If the vSphere task fails because of a timeout, this function ensures we will not continue
# until the snapshot is completely removed.
function wait_until_snapshot_delete_complete() {
    disk=$1
    snapshot=$2

    while true
    do
        govc disk.snapshot.ls $disk | grep -q $snapshot
        if [ $? -ne 0 ]; then
            return;
        else
            echo "snapshot not yet deleted, waiting for next check."
            sleep 60
        fi
    done
}
IFS=$'\n' # set Internal Field Separator to newline because govc prints results as newlines.
disk=$1 # take the first call argument of the script
snapshots=($(govc disk.snapshot.ls $disk | awk '{print $1}')) # Get all the snapshots of the disk-id and save them in an array.

# Loop over every snapshot for deletion
for snapshot in "${snapshots[@]}"
do
    # start the delete task of the snapshot. (this call waits until either the deletion is completed or an error occurs (e.g. a timeout after 30 minutes))
    govc disk.snapshot.rm $disk $snapshot

    # Wait until the snapshot is removed completely in case of an operation timeout.
    wait_until_snapshot_delete_complete $disk $snapshot
done
# Finally, remove the disk now that there are no more snapshots attached to it.
govc disk.rm $disk

This script will remove all attached snapshots from the provided disk-id sequentially and, when all snapshots have been removed, it will remove the dangling FCD disk from the datastore!
Be aware: use this script at your own risk!
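
As a sketch of how the identification and deletion steps can be chained together (assuming you saved the comparison script from earlier as find_dangling_fcds.sh, a file name I'm making up here), you could run something like this:

# Hypothetical example: collect the dangling disk-ids first, review them with your
# Kubernetes users, then feed them one by one to the delete script.
./find_dangling_fcds.sh > dangling_disks.txt
cat dangling_disks.txt            # review this list before deleting anything!
while read -r disk; do
    ./delete_disk_and_snapshots.sh "$disk" &   # run in the background; mind the number of parallel tasks
done < dangling_disks.txt
wait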

Conclusion

Currently, from a vSphere storage point of view, we have to deal with virtual disks which can consume storage capacity when Kubernetes StorageClass and VolumeSnapshot policies are not defined correctly.
Solving this problem requires some govc and scripting knowledge, as there is no out-of-the-box solution available which can deal with these "dangling FCDs and snapshots".
I recommend regularly checking your environment for dangling FCDs and aligning the outcome with your Kubernetes team.
Use the gained knowledge to keep your environment clean from unwanted dangling FCDs, freeing up some storage space at the same time.

Again thanks to Cormac Hogan and my (anonymous) work-wife, who made this blog post possible!
Please leave a comment: It's much appreciated!

 

===== UPDATE 29/11/2023 =====

I've been working with VMware for a long time on this case and they have provided a tool which can help with identifying these dangling disks, named CNS Manager.

The CNS Manager is hosted on the Kubernetes platform and connects to vSphere CNS.
You can download the latest CNS Manager at https://github.com/vmware-samples/cloud-native-storage-self-service-manager

6 Comments

  1. halin
    November 2, 2023

    hi, expert. follow the compare script , I got some dangling disk IDs, but I use the delete script , it show me error the follow message , please tell me what reason. ?
    ./delete_disk_and_snapshots.sh 047cffcd-ac21-48d6-adbb-f97680d4c926
    Deleting 047cffcd-ac21-48d6-adbb-f97680d4c926…Error: The object or item referred to could not be found.
    govc: The object or item referred to could not be found.

    Reply
    1. admin
      November 29, 2023

      maybe you are using the snapshotid instead of the diskid with the delete script?

      Reply
  2. Jim McCann
    November 2, 2023

    Thanks for your time in writing up this article. Be nice to see a example kube config.

    Reply
  3. Jim McCann
    November 2, 2023

    Wanted to say sometimes you can run into issues with disk.ls

    govc disk.ls -R -ds $DATASTORE

    should address the issue

    Reply
    1. admin
      November 29, 2023

      You can also define the GOVC_DATASTORE global variable as a solution (which is actually the same solution as you provided).

      Reply
  4. Jeff Towery
    June 7, 2024

    Do you have an update or resolution to your note regarding “…stability problems with the vSphere CSI drivers (version 2.7) when a large (excessive) number of objects are simultaneously modified/removed: I’m working with VMware together, to get to the bottom of this. Expect an update about this soon!”

    Cheers,
    Jeff Towery

    Reply
