Linstor, Ceph and Vitastor performance on Proxmox 8
Rationale
Why an SDS (Software Defined Storage) system
Almost a year ago now, I started dipping my toes into the waters of Kubernetes. In that exploration I quickly learned I needed a place to store data outside of the Kubernetes cluster. Longhorn works fine, but I wanted my data to be stored outside of the VMs running the Kubernetes cluster. As my main reason for using Kubernetes over Docker is high availability, that storage should also be highly available.
And while I was at it, I also wanted to use this storage to get high availability in Proxmox for some of my virtual machines.
Initially I wanted to use Ceph, as it is built into Proxmox and gets a lot of attention in the homelab community. However, when researching Ceph I noticed most benchmarks are done on really high-end systems. That made me question whether it would work on low-end hardware.
So far I haven't found a definitive answer to that question. Initially I only found a blog post from 2022 stating you need "1 core per 1000-3000 IOPS"; that is one core of an AMD EPYC 7742 64C/128T! That puts "really high-end systems" in perspective. I then found this article: Comparing Ceph, LINSTOR, Mayastor, and Vitastor storage performance in Kubernetes, which again shows that Ceph needs lots of resources, but again on high-end hardware (AMD Ryzen 9 3900).
Through that article I found Linstor and Vitastor. Linstor uses DRBD (Distributed Replicated Block Device) under the hood; Linstor itself is "just" a handy CLI to configure and monitor DRBD. It promises to be much faster than Ceph, and for my small cluster the possible management overhead should be fine. Vitastor is more in line with Ceph, the difference being that it uses block storage at its base instead of object storage and has no web UI (yet).
Why this performance testing/tuning
When I initially started testing (Linstor), I gathered 6 random SSDs I had lying around. I installed two in each node and configured them in a software RAID-0. Local performance on the first node I tested was amazing (it happened to be a node with 2 Samsung 850 EVO 250GB drives, more about the drives below).
However, the overall performance was disappointing. Read performance was great, but write performance was awful. I tried tuning the DRBD configuration (see the sources below), but that did not fix the slow writes.
But before I get side-tracked too much let me first explain my cluster layout and the tests I ran. At the end I will show the performance I eventually got out of my setup and my plans moving forward.
- https://kb.linbit.com/troubleshooting-performance
- https://kb.linbit.com/benchmarking-network-throughput
- https://kb.linbit.com/tuning-drbd-for-write-performance
- https://kb.linbit.com/tuning-drbds-resync-controller
- https://linbit.com/drbd-user-guide/drbd-guide-9_0-en/#ch-throughput
- https://serverfault.com/questions/740311/drbd-terrible-sync-performance-on-10gige
- https://linbit.com/blog/drbd-read-balancing/
Cluster setup
- 3 storage nodes: jpl-proxmox7, jpl-proxmox8, jpl-proxmox9
- 1 diskless node: jpl-proxmox6
Storage nodes
- CPU: Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz
- RAM: 16GB DDR4 2400 MT/s
- Nic: 2x 10gbit connections (Mellanox ConnectX-3 Pro, MT27520 / MCX312B-XCCT):
  - subnet 10.33.80.0/24 for Proxmox VM storage traffic (Ceph public)
  - subnet 192.168.100.0/24 for storage sync traffic (Ceph private)
Diskless node
- CPU: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
- RAM: 128GB DDR4 2133 MT/s
- Nic 1: 40gbit connection (Mellanox ConnectX-3, MT27500 / MCX354A-FCBT):
  - subnet 10.33.80.0/24 for Proxmox VM storage traffic
- Nic 2: 10gbit connections (Mellanox ConnectX-3 Pro, MT27520 / MCX312B-XCCT):
  - subnet 10.33.80.0/24 for Proxmox VM storage traffic (Ceph public)
  - subnet 192.168.100.0/24 for storage sync traffic (Ceph private)
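For reference, the subnets above are just statically addressed interfaces in /etc/network/interfaces on each node. A minimal sketch for a storage node, assuming hypothetical port names (enp1s0f0/enp1s0f1) and host addresses; whether you use plain ports, VLANs or bridges depends on your setup:
# /etc/network/interfaces (fragment), hypothetical names and addresses
auto enp1s0f0
iface enp1s0f0 inet static
    address 10.33.80.17/24       # Proxmox VM storage traffic (Ceph public)

auto enp1s0f1
iface enp1s0f1 inet static
    address 192.168.100.17/24    # storage sync traffic (Ceph private)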
Testing individual drives
The drives
- SanDisk_SD7SB2Q512G1001: SanDisk X300 512GB
- KINGSTON_SA400S37240G: Kingston SA400 240GB
- MTFDDAK256MAM-1K12: Micron C400 256GB
- Samsung_SSD_850_EVO_250GB: Samsung 850 EVO 250GB
The setup
- I installed two SSDs in each storage node.
- On both disks I created a single partition, on which the LVM volume groups linstor_vga and linstor_vgb were created. As the names suggest, linstor_vga was created on sda and linstor_vgb on sdb.
sudo sgdisk -n 1:0:0 /dev/sda
sudo sgdisk -n 1:0:0 /dev/sdb
sudo vgcreate linstor_vga /dev/sda1
sudo vgcreate linstor_vgb /dev/sdb1
- In each volume group I then created a thinpool named thinpool:
sudo lvcreate -l 100%FREE -T linstor_vga/thinpool
sudo lvcreate -l 100%FREE -T linstor_vgb/thinpool
- In these thinpools, 10GiB volumes testa and testb were created and formatted as ext4:
sudo lvcreate -V 10GiB -T linstor_vga/thinpool -n testa
sudo lvcreate -V 10GiB -T linstor_vgb/thinpool -n testb
sudo mkfs.ext4 /dev/linstor_vga/testa
sudo mkfs.ext4 /dev/linstor_vgb/testb
- These volumes were then mounted and the systems were ready for testing
sudo mkdir /mnt/testa
sudo mkdir /mnt/testb
sudo mount /dev/linstor_vga/testa /mnt/testa
sudo mount /dev/linstor_vgb/testb /mnt/testb
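Before benchmarking it's worth sanity-checking the layout. A quick verification sketch using only the names from the steps above:
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT /dev/sda /dev/sdb   # partitions and mounts
sudo vgs linstor_vga linstor_vgb                        # both volume groups exist
sudo lvs -a -o lv_name,vg_name,lv_size,pool_lv          # thinpools and thin volumes
df -hT /mnt/testa /mnt/testb                            # ext4 volumes are mounted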
Test script
I used a script by Lawrence Systems which I slightly modified to be able to pass the blocksize as an argument.
#!/bin/bash
# This script requires fio, bc and jq

# Directory to test and block size are passed as arguments
TEST_DIR=$1
BS=$2

# Parameters for the tests; these should be representative of the workload you want to simulate
IOENGINE="libaio" # IO engine
IODEPTH="16"      # IO depth sets how many I/O requests a single job can handle at once
DIRECT="1"        # Direct IO: 0 is buffered through RAM (may skew results), 1 is unbuffered
NUMJOBS="5"       # Number of jobs is how many independent I/O streams are sent to the storage
FSYNC="0"         # Fsync: 0 leaves flushing up to Linux, 1 forces write commits to disk
NUMFILES="5"      # Number of files is the number of independent I/O threads/processes that FIO will spawn
FILESIZE="1G"     # File size for the tests, you can use: K M G

# Check if directory and block size are provided
if [ -z "$TEST_DIR" ] || [ -z "$BS" ]; then
    echo "Usage: $0 <directory> <blocksize>"
    exit 1
fi

# Function to perform FIO test and display average output
perform_test() {
    RW_TYPE=$1

    echo "Running $RW_TYPE test with block size $BS, ioengine $IOENGINE, iodepth $IODEPTH, direct $DIRECT, numjobs $NUMJOBS, fsync $FSYNC, using $NUMFILES files of size $FILESIZE on $TEST_DIR"

    # Initialize variables to store cumulative values
    TOTAL_READ_IOPS=0
    TOTAL_WRITE_IOPS=0
    TOTAL_READ_BW=0
    TOTAL_WRITE_BW=0

    for ((i=1; i<=NUMFILES; i++)); do
        TEST_FILE="$TEST_DIR/fio_test_file_$i"

        # Running FIO for each file and parsing output
        OUTPUT=$(fio --name=test_$i \
            --filename=$TEST_FILE \
            --rw=$RW_TYPE \
            --bs=$BS \
            --ioengine=$IOENGINE \
            --iodepth=$IODEPTH \
            --direct=$DIRECT \
            --numjobs=$NUMJOBS \
            --fsync=$FSYNC \
            --size=$FILESIZE \
            --group_reporting \
            --output-format=json)

        # Accumulate values
        TOTAL_READ_IOPS=$(echo "$OUTPUT" | jq '.jobs[0].read.iops + '"$TOTAL_READ_IOPS")
        TOTAL_WRITE_IOPS=$(echo "$OUTPUT" | jq '.jobs[0].write.iops + '"$TOTAL_WRITE_IOPS")
        TOTAL_READ_BW=$(echo "$OUTPUT" | jq '(.jobs[0].read.bw / 1024) + '"$TOTAL_READ_BW")
        TOTAL_WRITE_BW=$(echo "$OUTPUT" | jq '(.jobs[0].write.bw / 1024) + '"$TOTAL_WRITE_BW")
    done

    # Calculate averages
    AVG_READ_IOPS=$(echo "$TOTAL_READ_IOPS / $NUMFILES" | bc -l)
    AVG_WRITE_IOPS=$(echo "$TOTAL_WRITE_IOPS / $NUMFILES" | bc -l)
    AVG_READ_BW=$(echo "$TOTAL_READ_BW / $NUMFILES" | bc -l)
    AVG_WRITE_BW=$(echo "$TOTAL_WRITE_BW / $NUMFILES" | bc -l)

    # Format and print averages, omitting 0 results
    [ "$(echo "$AVG_READ_IOPS > 0" | bc)" -eq 1 ] && printf "Average Read IOPS: %'.2f\n" $AVG_READ_IOPS
    [ "$(echo "$AVG_WRITE_IOPS > 0" | bc)" -eq 1 ] && printf "Average Write IOPS: %'.2f\n" $AVG_WRITE_IOPS
    [ "$(echo "$AVG_READ_BW > 0" | bc)" -eq 1 ] && printf "Average Read Bandwidth (MB/s): %'.2f\n" $AVG_READ_BW
    [ "$(echo "$AVG_WRITE_BW > 0" | bc)" -eq 1 ] && printf "Average Write Bandwidth (MB/s): %'.2f\n" $AVG_WRITE_BW
}

# Run tests
perform_test randwrite
perform_test randread
perform_test write
perform_test read
perform_test readwrite

# Clean up test files
for ((i=1; i<=NUMFILES; i++)); do
    rm "$TEST_DIR/fio_test_file_$i"
done
I created another script to invoke the above script with different inputs:
#!/bin/bash
# this wrapper calls drbd-test.sh, which requires fio, bc and jq
echo -e '\n\nSTART 4K A \n'
./drbd-test.sh /mnt/testa 4K
echo -e '\n\nSTART 128K A \n'
./drbd-test.sh /mnt/testa 128K
echo -e '\n\nSTART 1M A \n'
./drbd-test.sh /mnt/testa 1M
echo -e '\n\nSTART 4M A \n'
./drbd-test.sh /mnt/testa 4M
#echo -e '\n\nSTART 4K B \n'
#./drbd-test.sh /mnt/testb 4K
#echo -e '\n\nSTART 128K B \n'
#./drbd-test.sh /mnt/testb 128K
#echo -e '\n\nSTART 1M B \n'
#./drbd-test.sh /mnt/testb 1M
#echo -e '\n\nSTART 4M B \n'
#./drbd-test.sh /mnt/testb 4M
Before running the script, first install dependencies:
sudo apt install bc jq fio
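For reference, with the parameters above a single 4K random-write pass boils down to one plain fio call per test file, roughly:
fio --name=test_1 --filename=/mnt/testa/fio_test_file_1 \
    --rw=randwrite --bs=4K --ioengine=libaio --iodepth=16 \
    --direct=1 --numjobs=5 --fsync=0 --size=1G \
    --group_reporting --output-format=json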
Results
- SanDisk_SD7SB2Q512G1001 (ok-ish: read speeds are just a bit slower than the Samsungs)
- KINGSTON_SA400S37240G (this one is really sloooooow)
- MTFDDAK256MAM-1K12 (bad 4K performance, other than that ok-ish: read speeds are just a bit slower than the Samsungs)
- Samsung_SSD_850_EVO_250GB (the winner, and I happen to have 3, how convenient)
These results show the write bottleneck clearly. The Kingston was holding the whole array back.
Testing Linstor performance
The setup
- I used the Samsung_SSD_850_EVO_250GB drives for these tests, one in each node.
- On each disk I created a single partition with an LVM volume group linstor_vg.
- In each volume group I then created a thinpool test.
- This thinpool was used in Linstor as a storage-pool. On top of that a resource-group pve-rg was created, which was imported into Proxmox (see the sketch after this list).
- I created two VMs with a minimal Debian install, one on jpl-proxmox6 (remote) and one on jpl-proxmox7 (local). I then ran the benchmarks from within the VMs to simulate the actual use-case.
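A rough sketch of that Linstor/Proxmox wiring, written from memory rather than from my shell history, so treat the storage-pool name and placement count as assumptions:
# Register the LVM thinpool on each node as a Linstor storage-pool
linstor storage-pool create lvmthin jpl-proxmox7 pool_ssd linstor_vg/test
linstor storage-pool create lvmthin jpl-proxmox8 pool_ssd linstor_vg/test
linstor storage-pool create lvmthin jpl-proxmox9 pool_ssd linstor_vg/test

# Resource-group with 3-way replication, plus a volume-group for it
linstor resource-group create pve-rg --storage-pool pool_ssd --place-count 3
linstor volume-group create pve-rg

# /etc/pve/storage.cfg entry for the linstor-proxmox plugin (controller IP is an example)
# drbd: linstor
#     content images, rootdir
#     controller 10.33.80.17
#     resourcegroup pve-rg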
Single disk (vm running locally)
Linstor is (or at least should be) smart enough to know the VM is running on one of the storage nodes. It can thus write directly to disk, with no network in between, which should be fast. So this test represents the best-case scenario when using DRBD as the storage backend.
Single disk (vm running remotely)
When running the VM on a diskless node, all data has to be read and written over the network. For reading this should be faster as all 3 nodes can be accessed in a RAID-0 like manner.
Hiccups
Although the performance of Linstor is great, I had a few hiccups that don't make me feel confident about using Linstor in production.
- I couldn't create a VM backed by Linstor on the diskless node. I could also not migrate a disk to Linstor. The only way was to create/migrate a VM on/to one of the storage nodes, move the disk to Linstor and then migrate it back to the diskless node. A bit of a hassle. This seems to be fixed by properly configuring the network/firewall: I had used two subnets initially (public/private), but it turns out that using just a single vlan/subnet makes it work.
- I often couldn't boot a VM on the diskless node after having shut it down (see logs below). I also couldn't migrate the VM. The only way to recover it was by removing it, recreating it (from backup) on a storage node and migrating it back to the diskless node. That is, if there was a backup!
- I also had a weird networking issue, see below. This might be completely caused by me and my specific setup, although this wasn't an issue with Vitastor. However, if someone knows how to fix/mitigate it, please let me know: joep@joeplaa.com.
Diskless node issues
Digging into the logs I found that some of the drbd-options settings were incompatible. Because the initial import of the VM was slow, I had set on-congestion pull-ahead. That option only works if protocol A is also used. However, diskless nodes only work with protocol C. This is nowhere in the docs; only when you dig into the logs will you find some clues. The error message in Proxmox (below) is also useless: it only mentions "failed due to an unknown exception", very helpful.
blockdev: cannot open /dev/drbd/by-res/pm-315bfa3c/0: No such file or directory
kvm: -drive file=/dev/drbd/by-res/pm-315bfa3c/0,if=none,id=drive-scsi0,discard=on,format=raw,cache=none,aio=io_uring,detect-zeroes=unmap: Could not open '/dev/drbd/by-res/pm-315bfa3c/0': No such file or directory
NOTICE
Intentionally removing diskless assignment (pm-315bfa3c) on (jpl-proxmox6).
It will be re-created when the resource is actually used on this node.
API Return-Code: 500. Message: Could not delete diskless resource pm-315bfa3c on jpl-proxmox6, because:
[{"ret_code":53739522,"message":"Node: jpl-proxmox6, Resource: pm-315bfa3c preparing for deletion.","details":"Node: jpl-proxmox6, Resource: pm-315bfa3c UUID is: 66b9f107-6ce3-441b-a808-177073229bed","obj_refs":{"RscDfn":"pm-315bfa3c","Node":"jpl-proxmox6"},"created_at":"2024-11-22T12:17:37.013075607Z"},{"ret_code":53739523,"message":"Preparing deletion of resource on 'jpl-proxmox6'","obj_refs":{"RscDfn":"pm-315bfa3c","Node":"jpl-proxmox6"},"created_at":"2024-11-22T12:17:37.025377733Z"},{"ret_code":-4611686018373647386,"message":"(Node: 'jpl-proxmox8') Failed to adjust DRBD resource pm-315bfa3c","error_report_ids":["674051B6-DA5B0-000023"],"obj_refs":{"RscDfn":"pm-315bfa3c","Node":"jpl-proxmox6"},"created_at":"2024-11-22T12:17:37.100380166Z"},{"ret_code":-4611686018373647386,"message":"(Node: 'jpl-proxmox7') Failed to adjust DRBD resource pm-315bfa3c","error_report_ids":["674051A7-AAAC5-000023"],"obj_refs":{"RscDfn":"pm-315bfa3c","Node":"jpl-proxmox6"},"created_at":"2024-11-22T12:17:37.115180897Z"},{"ret_code":-4611686018373647386,"message":"(Node: 'jpl-proxmox9') Failed to adjust DRBD resource pm-315bfa3c","error_report_ids":["674051B2-B5BBE-000023"],"obj_refs":{"RscDfn":"pm-315bfa3c","Node":"jpl-proxmox6"},"created_at":"2024-11-22T12:17:37.119553561Z"},{"ret_code":-4611686018373647386,"message":"Deletion of resource 'pm-315bfa3c' on node 'jpl-proxmox6' failed due to an unknown exception.","details":"Node: jpl-proxmox6, Resource: pm-315bfa3c","error_report_ids":["67405641-00000-000005"],"obj_refs":{"RscDfn":"pm-315bfa3c","Node":"jpl-proxmox6"},"created_at":"2024-11-22T12:17:37.122856887Z"}]
at /usr/share/perl5/PVE/Storage/Custom/LINSTORPlugin.pm line 594.
...more stack trace...
TASK ERROR: start failed: QEMU exited with code 1
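For completeness: the offending settings are ordinary DRBD net options that LINSTOR pushes into its generated resource files. Setting and reverting them on the resource-group looks roughly like the sketch below; the flag names are from memory, so verify them with linstor resource-group drbd-options --help before relying on this.
# What I had set to speed up the initial sync (only valid together with protocol A):
linstor resource-group drbd-options pve-rg --protocol A --on-congestion pull-ahead

# The way back, which is what a diskless node needs (protocol C, default congestion policy):
linstor resource-group drbd-options pve-rg --protocol C --on-congestion block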
Network issue
Initially I had a 40gbit nic in the diskless node. This allowed read speeds of up to 1500 MB/s from the three storage nodes. Nice! However, write speeds from the diskless node were abysmal.
I found a problem in my switch (Brocade ICX6610-24): a lot of packets got dropped when pushing data from the 40gbit link to a 10gbit link (i.e. when writing data from the diskless node to a storage node). I tried flow control and limiting bandwidth, but nothing worked.
Egress queues:
Queue counters Queued packets Dropped Packets
0 47483309 718783
1 0 0
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
I then switched the 40gbit nic for a 10gbit one. Now writing was better, but reading was slow (unfortunately I lost that data). Again a lot of packets got dropped in my switch, now the other way around: 3x 10gbit ==> 1x 10gbit. I tried creating an LACP bond of 2 links, but that did nothing.
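When chasing issues like this it helps to take the storage layer out of the equation and benchmark the raw network path first (this is essentially what the LINBIT network-throughput article in the sources does). A minimal iperf3 check between the diskless node and a storage node; the address below is a placeholder for the storage node's IP on the sync subnet:
# On a storage node, start the server:
iperf3 -s

# From the diskless node, test both directions:
iperf3 -c 192.168.100.17 -P 4 -t 30        # diskless -> storage (the slow write path)
iperf3 -c 192.168.100.17 -P 4 -t 30 -R     # storage -> diskless (the read path)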
Testing Vitastor performance
The setup
- I used the same Samsung_SSD_850_EVO_250GB drives, one in each node.
- On each disk a single, default OSD was created.
- On the three OSDs I created a single pool testpool (see the sketch after this list).
- This pool was then imported into Proxmox.
- I re-used the two VMs with the minimal Debian install, one on jpl-proxmox6 (remote) and one on jpl-proxmox7 (local). I then re-ran the benchmarks from within the VMs.
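Roughly, the OSD and pool creation looks like the sketch below. vitastor-disk prepare is the documented way to turn a disk into an OSD; the pool definition goes into etcd, and the pg_size/pg_minsize/pg_count values are illustrative, so check the Vitastor docs before copying them:
# On each storage node: turn the SSD into a single default OSD (/dev/sdX is a placeholder)
vitastor-disk prepare /dev/sdX

# Define a replicated pool over the three OSDs (etcd endpoint is a placeholder)
etcdctl --endpoints=http://192.168.100.17:2379 put /vitastor/config/pools \
  '{"1":{"name":"testpool","scheme":"replicated","pg_size":3,"pg_minsize":2,"pg_count":256}}'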
Of note
- The IO delay of the storage host will be ~1 core (in my case 25%). This is expected behavior, as the author of Vitastor wrote to me in an email:
...it just represents how the Linux kernel calculates iowait for io_uring threads — it counts any thread waiting for io_uring events as 100% cpu iowait even though it's not waiting for any disk event, it just waits for the network. This change was actually made in Linux 6.1.39 and 6.4.4; before these versions it wasn't counting io_uring as 1 complete core iowait. Here are the discussions:
https://bbs.archlinux.org/viewtopic.php?id=287343
https://bugzilla.kernel.org/show_bug.cgi?id=217700
https://lore.kernel.org/lkml/538065ee-4130-6a00-dcc8-f69fbc7d7ba0@kernel.dk/
Quote from Jens Axboe (io_uring author): Just read the first one, but this is very much expected. It's now just correctly reflecting that one thread is waiting on IO. IO wait being 100% doesn't mean that one core is running 100% of the time, it just means it's WAITING on IO 100% of the time.
- Use enterprise SSDs, but this goes for all storage solutions. The author also advised:
Also I highly recommend to not use the whole disk for OSD if it’s a desktop SSD — you better should manually create an empty 50-100 GB partition, blkdiscard it, leave it empty and use the rest of space for OSD. Otherwise you may hit performance issues after some time, because Vitastor doesn’t support SSD TRIM yet and desktop SSDs don’t have extra overprovisioned space.
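Following that advice on a desktop SSD would look roughly like this; the 50G reservation and the device name are placeholders, and I believe vitastor-disk prepare also accepts a partition, but verify that against the docs:
# Reserve an empty partition as manual overprovisioning space, then TRIM it once
sudo sgdisk -n 1:0:+50G /dev/sdX   # partition 1: stays empty forever
sudo sgdisk -n 2:0:0 /dev/sdX      # partition 2: the rest of the disk, for the OSD
sudo blkdiscard /dev/sdX1          # discard the reserved space so the SSD can reuse it

# Then create the OSD on the second partition instead of the whole disk
vitastor-disk prepare /dev/sdX2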
Testing Ceph performance
The setup
- I used the same Samsung_SSD_850_EVO_250GB drives, one in each node.
- On each disk a single, default OSD was created.
- On the three OSDs I created a single pool testpool (see the sketch after this list).
- This pool was then imported into Proxmox.
- I re-used the two VMs with the minimal Debian install, one on jpl-proxmox6 (remote) and one on jpl-proxmox7 (local). I then re-ran the benchmarks from within the VMs.
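Since Ceph is built into Proxmox, this setup is a couple of pveceph commands (or a few clicks in the GUI). A sketch, assuming /dev/sdX is the Samsung on each node and 3/2 replication:
# On each storage node: create an OSD on the SSD
pveceph osd create /dev/sdX

# On one node: create the replicated pool and expose it as Proxmox storage
pveceph pool create testpool --size 3 --min_size 2 --add_storages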
Results
After many hours of testing I have more questions than answers. Here are some of my findings. For everyone interested, I'll attach the raw data and graphs in an Excel sheet below.
Performance penalty / overhead
Every SDS has some overhead compared to direct disk access. For writing, this overhead results in a performance "penalty" of up to an order of magnitude (30 MB/s instead of 300 MB/s). Here Linstor > Vitastor > Ceph.
Read performance can be close to, or even better than, local performance, as reads can be done from multiple machines in parallel. Here Vitastor > Linstor > Ceph.
Remote vs local
All of these solutions can run a virtual machine on a remote (diskless) machine; however, the results are not good. Probably the latency in my network is too high.
Linstor LVM-thin vs ZFS
LVM-thin is faster than ZFS. The insane read numbers when running locally are the in-memory ARC at work.
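If you want the local ZFS numbers to be less dominated by cached reads, the ARC can be capped. A sketch; the 4 GiB value is arbitrary:
# Limit the ZFS ARC to 4 GiB (value in bytes); persists across reboots
echo "options zfs zfs_arc_max=4294967296" | sudo tee /etc/modprobe.d/zfs.conf
sudo update-initramfs -u

# Or apply immediately without a reboot
echo 4294967296 | sudo tee /sys/module/zfs/parameters/zfs_arc_max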
Thoughts
1. Is an SDS in my homelab really a good option?
   a. It seems so much slower than what my storage is capable of. Next step: test NFS / iSCSI storage.
   b. Do I really need HA storage? Or am I better off moving the important stuff to a hoster / cloud provider?
2. Why are these solutions so slow when writing?
   a. Is it my network?
   b. Is my hardware too slow?
   c. Or should/can I tweak some settings (let me know: joep@joeplaa.com)? Linstor has an asynchronous mode, protocol A, which also has the option on-congestion pull-ahead. Does that make a big enough difference? When enabling it, the remote option becomes unavailable though, as that requires protocol C.
- Ceph was the easiest to install in Proxmox. However, it is also the slowest. Vitastor reading was by far the best on my systems, while writing was just a little better with Linstor. Both work, but need some patience and CLI skills.
- The debug info from the Linstor plugin in Proxmox is useless. You really have to go into Linstor itself and read the error logs. That eventually showed me why I couldn't import or create a remote VM.
- I find Linstor's network interfaces confusing. I can't get it working properly with two subnets.
- Vitastor seems promising. I couldn't figure out why it wouldn't work on my machines initially, but the developer replied quickly to my email and gave me some detailed answers.
- Live migration requires allow-two-primaries to be enabled. This mode only works when protocol C is used, which answers question 2c. See: https://github.com/LINBIT/linstor-proxmox/issues/37.