Linstor, Ceph and Vitastor performance on Proxmox 8
Rationale
Why an SDS (Software Defined Storage) system
Almost a year ago now, I started dipping my toes into the waters of Kubernetes. In that exploration I quickly learned I needed a place to store data outside of the Kubernetes cluster. Longhorn works fine, but I wanted my data to be stored outside of the VMs running the Kubernetes cluster. As my main reason for using Kubernetes over Docker is high availability, that storage should also be highly available.
And while I was at it, I also wanted to use this storage to get high availability in Proxmox for some of my virtual machines.
Initially I wanted to use Ceph, as it is built into Proxmox and gets a lot of attention in the homelab community. However, when researching Ceph I noticed most benchmarks are done on really high-end systems. That made me question whether it would work on low-end hardware.
So far I haven't found a definitive answer to that question. Initially I only found a blog post from 2022 stating you need "1 core per 1000-3000 IOPS"; that is one core of an AMD EPYC 7742 64C/128T! That puts "really high-end systems" in perspective. I then found this article: Comparing Ceph, LINSTOR, Mayastor, and Vitastor storage performance in Kubernetes, which again shows that Ceph needs lots of resources, but again on high-end hardware (AMD Ryzen 9 3900).
Through that article I found Linstor and Vitastor. Linstor uses DRBD (Distributed Replicated Block Device) under the hood; Linstor itself is "just" a handy CLI to configure and monitor DRBD. It promises to be much faster than Ceph, and for my small cluster the possible management overhead should be fine. Vitastor is more in line with Ceph, the difference being that it uses block storage at its base instead of object storage and has no web UI (yet).
Why this performance testing/tuning
When I initially started testing (Linstor), I gathered 6 random SSDs I had lying around. I installed two in each node and configured them in a software RAID-0. Local performance on the first node I tested was amazing (it happened to be a node with 2 Samsung 850 EVO 250GB drives, more about the drives below).
However, the overall performance was disappointing. Read performance was great, but write performance was awful. I tried tuning the DRBD configuration (see the sources below), but that did not fix the slow writes.
But before I get side-tracked too much let me first explain my cluster layout and the tests I ran. At the end I will show the performance I eventually got out of my setup and my plans moving forward.
- https://kb.linbit.com/troubleshooting-performance
- https://kb.linbit.com/benchmarking-network-throughput
- https://kb.linbit.com/tuning-drbd-for-write-performance
- https://kb.linbit.com/tuning-drbds-resync-controller
- https://linbit.com/drbd-user-guide/drbd-guide-9_0-en/#ch-throughput
- https://serverfault.com/questions/740311/drbd-terrible-sync-performance-on-10gige
- https://linbit.com/blog/drbd-read-balancing/
Cluster setup
- 3 storage nodes: jpl-proxmox7, jpl-proxmox8, jpl-proxmox9
- 1 diskless node: jpl-proxmox6
Storage nodes
- CPU: Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz
- RAM: 16GB DDR4 2400 MT/s
- Nic: 2x 10gbit connections (Mellanox ConnectX-3 Pro, MT27520 / MCX312B-XCCT):
  - subnet 10.33.80.0/24 for Proxmox VM storage traffic (Ceph public)
  - subnet 192.168.100.0/24 for storage sync traffic (Ceph private)
Diskless node
- CPU: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
- RAM: 128GB DDR4 2133 MT/s
- Nic 1: 40gbit connection (Mellanox ConnectX-3, MT27500 / MCX354A-FCBT):
  - subnet 10.33.80.0/24 for Proxmox VM storage traffic
- Nic 2: 10gbit connections (Mellanox ConnectX-3 Pro, MT27520 / MCX312B-XCCT):
  - subnet 10.33.80.0/24 for Proxmox VM storage traffic (Ceph public)
  - subnet 192.168.100.0/24 for storage sync traffic (Ceph private)
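For reference, the subnets above are just statically addressed interfaces in /etc/network/interfaces on each node. A minimal sketch for a storage node, assuming hypothetical port names (enp1s0f0/enp1s0f1) and host addresses; whether you use plain ports, VLANs or bridges depends on your setup:
# /etc/network/interfaces (fragment), hypothetical names and addresses
auto enp1s0f0
iface enp1s0f0 inet static
    address 10.33.80.17/24       # Proxmox VM storage traffic (Ceph public)

auto enp1s0f1
iface enp1s0f1 inet static
    address 192.168.100.17/24    # storage sync traffic (Ceph private)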
Testing individual drives
The drives
- SanDisk_SD7SB2Q512G1001: SanDisk X300 512GB
- KINGSTON_SA400S37240G: Kingston SA400 240GB
- MTFDDAK256MAM-1K12: Micron C400 256GB
- Samsung_SSD_850_EVO_250GB: Samsung 850 EVO 250GB
The setup
- I installed two SSDs in each storage node.
- On both disks I created a single partition, on which the LVM volume groups linstor_vga and linstor_vgb were created. As the names suggest, linstor_vga was created on sda and linstor_vgb on sdb.
sudo sgdisk -n 1:0:0 /dev/sda
sudo sgdisk -n 1:0:0 /dev/sdb
sudo vgcreate linstor_vga /dev/sda1
sudo vgcreate linstor_vgb /dev/sdb1
- In each volume group I then created a thinpool named thinpool:
sudo lvcreate -l 100%FREE -T linstor_vga/thinpool
sudo lvcreate -l 100%FREE -T linstor_vgb/thinpool
- In these thinpools, 10GiB volumes testa and testb were created and formatted as ext4:
sudo lvcreate -V 10GiB -T linstor_vga/thinpool -n testa
sudo lvcreate -V 10GiB -T linstor_vgb/thinpool -n testb
sudo mkfs.ext4 /dev/linstor_vga/testa
sudo mkfs.ext4 /dev/linstor_vgb/testb
- These volumes were then mounted and the systems were ready for testing
sudo mkdir /mnt/testa
sudo mkdir /mnt/testb
sudo mount /dev/linstor_vga/testa /mnt/testa
sudo mount /dev/linstor_vgb/testb /mnt/testb
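Before benchmarking it's worth sanity-checking the layout. A quick verification sketch using only the names from the steps above:
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT /dev/sda /dev/sdb   # partitions and mounts
sudo vgs linstor_vga linstor_vgb                        # both volume groups exist
sudo lvs -a -o lv_name,vg_name,lv_size,pool_lv          # thinpools and thin volumes
df -hT /mnt/testa /mnt/testb                            # ext4 volumes are mounted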
Test script
I used a script by Lawrence Systems which I slightly modified to be able to pass the blocksize as an argument.
#!/bin/bash
# This script requires fio, bc and jq

# Directory to test and block size are passed as arguments
TEST_DIR=$1
BS=$2

# Parameters for the tests; these should be representative of the workload you want to simulate
IOENGINE="libaio" # IO engine
IODEPTH="16"      # IO depth sets how many I/O requests a single job can handle at once
DIRECT="1"        # Direct IO: 0 is buffered through RAM (may skew results), 1 is unbuffered
NUMJOBS="5"       # Number of jobs is how many independent I/O streams are sent to the storage
FSYNC="0"         # Fsync: 0 leaves flushing up to Linux, 1 forces write commits to disk
NUMFILES="5"      # Number of files is the number of independent I/O threads/processes that FIO will spawn
FILESIZE="1G"     # File size for the tests, you can use: K M G

# Check if directory and block size are provided
if [ -z "$TEST_DIR" ] || [ -z "$BS" ]; then
    echo "Usage: $0 <directory> <blocksize>"
    exit 1
fi

# Function to perform FIO test and display average output
perform_test() {
    RW_TYPE=$1

    echo "Running $RW_TYPE test with block size $BS, ioengine $IOENGINE, iodepth $IODEPTH, direct $DIRECT, numjobs $NUMJOBS, fsync $FSYNC, using $NUMFILES files of size $FILESIZE on $TEST_DIR"

    # Initialize variables to store cumulative values
    TOTAL_READ_IOPS=0
    TOTAL_WRITE_IOPS=0
    TOTAL_READ_BW=0
    TOTAL_WRITE_BW=0

    for ((i=1; i<=NUMFILES; i++)); do
        TEST_FILE="$TEST_DIR/fio_test_file_$i"

        # Running FIO for each file and parsing output
        OUTPUT=$(fio --name=test_$i \
            --filename=$TEST_FILE \
            --rw=$RW_TYPE \
            --bs=$BS \
            --ioengine=$IOENGINE \
            --iodepth=$IODEPTH \
            --direct=$DIRECT \
            --numjobs=$NUMJOBS \
            --fsync=$FSYNC \
            --size=$FILESIZE \
            --group_reporting \
            --output-format=json)

        # Accumulate values
        TOTAL_READ_IOPS=$(echo "$OUTPUT" | jq '.jobs[0].read.iops + '"$TOTAL_READ_IOPS")
        TOTAL_WRITE_IOPS=$(echo "$OUTPUT" | jq '.jobs[0].write.iops + '"$TOTAL_WRITE_IOPS")
        TOTAL_READ_BW=$(echo "$OUTPUT" | jq '(.jobs[0].read.bw / 1024) + '"$TOTAL_READ_BW")
        TOTAL_WRITE_BW=$(echo "$OUTPUT" | jq '(.jobs[0].write.bw / 1024) + '"$TOTAL_WRITE_BW")
    done

    # Calculate averages
    AVG_READ_IOPS=$(echo "$TOTAL_READ_IOPS / $NUMFILES" | bc -l)
    AVG_WRITE_IOPS=$(echo "$TOTAL_WRITE_IOPS / $NUMFILES" | bc -l)
    AVG_READ_BW=$(echo "$TOTAL_READ_BW / $NUMFILES" | bc -l)
    AVG_WRITE_BW=$(echo "$TOTAL_WRITE_BW / $NUMFILES" | bc -l)

    # Format and print averages, omitting 0 results
    [ "$(echo "$AVG_READ_IOPS > 0" | bc)" -eq 1 ] && printf "Average Read IOPS: %'.2f\n" $AVG_READ_IOPS
    [ "$(echo "$AVG_WRITE_IOPS > 0" | bc)" -eq 1 ] && printf "Average Write IOPS: %'.2f\n" $AVG_WRITE_IOPS
    [ "$(echo "$AVG_READ_BW > 0" | bc)" -eq 1 ] && printf "Average Read Bandwidth (MB/s): %'.2f\n" $AVG_READ_BW
    [ "$(echo "$AVG_WRITE_BW > 0" | bc)" -eq 1 ] && printf "Average Write Bandwidth (MB/s): %'.2f\n" $AVG_WRITE_BW
}

# Run tests
perform_test randwrite
perform_test randread
perform_test write
perform_test read
perform_test readwrite

# Clean up test files
for ((i=1; i<=NUMFILES; i++)); do
    rm "$TEST_DIR/fio_test_file_$i"
done
I created another script to invoke the above script with different inputs:
#!/bin/bash
# this wrapper calls drbd-test.sh, which requires fio, bc and jq
echo -e '\n\nSTART 4K A \n'
./drbd-test.sh /mnt/testa 4K
echo -e '\n\nSTART 128K A \n'
./drbd-test.sh /mnt/testa 128K
echo -e '\n\nSTART 1M A \n'
./drbd-test.sh /mnt/testa 1M
echo -e '\n\nSTART 4M A \n'
./drbd-test.sh /mnt/testa 4M
#echo -e '\n\nSTART 4K B \n'
#./drbd-test.sh /mnt/testb 4K
#echo -e '\n\nSTART 128K B \n'
#./drbd-test.sh /mnt/testb 128K
#echo -e '\n\nSTART 1M B \n'
#./drbd-test.sh /mnt/testb 1M
#echo -e '\n\nSTART 4M B \n'
#./drbd-test.sh /mnt/testb 4M
Before running the script, first install dependencies:
sudo apt install bc jq fio
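For reference, with the parameters above a single 4K random-write pass boils down to one plain fio call per test file, roughly:
fio --name=test_1 --filename=/mnt/testa/fio_test_file_1 \
    --rw=randwrite --bs=4K --ioengine=libaio --iodepth=16 \
    --direct=1 --numjobs=5 --fsync=0 --size=1G \
    --group_reporting --output-format=json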
Results
- SanDisk_SD7SB2Q512G1001 (ok-ish: read speeds are just a bit slower than the Samsungs)
- KINGSTON_SA400S37240G (this one is really sloooooow)
- MTFDDAK256MAM-1K12 (bad 4K performance, other than that ok-ish: read speeds are just a bit slower than the Samsungs)
- Samsung_SSD_850_EVO_250GB (the winner, and I happen to have 3, how convenient)
These results show the write bottleneck clearly. The Kingston was holding the whole array back.
Testing Linstor performance
The setup
- I used the Samsung_SSD_850_EVO_250GB drives for these tests, one in each node.
- On each disk I created a single partition with an LVM volume group linstor_vg.
- In each volume group I then created a thinpool test.
- This thinpool was used in Linstor as a storage-pool. On top of that a resource-group pve-rg was created, which was imported into Proxmox (see the sketch after this list).
- I created two VMs with a minimal Debian install, one on jpl-proxmox6 (remote) and one on jpl-proxmox7 (local). I then ran the benchmarks from within the VMs to simulate the actual use-case.
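A rough sketch of that Linstor/Proxmox wiring, written from memory rather than from my shell history, so treat the storage-pool name and placement count as assumptions:
# Register the LVM thinpool on each node as a Linstor storage-pool
linstor storage-pool create lvmthin jpl-proxmox7 pool_ssd linstor_vg/test
linstor storage-pool create lvmthin jpl-proxmox8 pool_ssd linstor_vg/test
linstor storage-pool create lvmthin jpl-proxmox9 pool_ssd linstor_vg/test

# Resource-group with 3-way replication, plus a volume-group for it
linstor resource-group create pve-rg --storage-pool pool_ssd --place-count 3
linstor volume-group create pve-rg

# /etc/pve/storage.cfg entry for the linstor-proxmox plugin (controller IP is an example)
# drbd: linstor
#     content images, rootdir
#     controller 10.33.80.17
#     resourcegroup pve-rg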
Single disk (vm running locally)
Linstor is (or at least should be) smart enough to know the VM is running on one of the storage nodes. It can thus write directly to disk, with no network in between, which should be fast. So this test represents the best-case scenario when using DRBD as the storage backend.
Single disk (vm running remotely)
When running the VM on a diskless node, all data has to be read and written over the network. For reading this should be faster as all 3 nodes can be accessed in a RAID-0 like manner.
Hiccups
Although the performance of Linstor is great, I had a few hiccups that don't make me feel confident about using Linstor in production.
- I couldn't create a VM backed by Linstor on the diskless node. I could also not migrate a disk to Linstor. The only way was to create/migrate a VM on/to one of the storage nodes, move the disk to Linstor and then migrate it back to the diskless node. A bit of a hassle. This seems to be fixed by properly configuring the network/firewall: I had used two subnets initially (public/private), but it turns out that using just a single vlan/subnet makes it work.
- I often couldn't boot a VM on the diskless node after having shut it down (see logs below). I also couldn't migrate the VM. The only way to recover it was by removing it, recreating it (from backup) on a storage node and migrating it back to the diskless node. That is, if there was a backup!
- I also had a weird networking issue, see below. This might be completely caused by me and my specific setup, although this wasn't an issue with Vitastor. However, if someone knows how to fix/mitigate it, please let me know: joep@joeplaa.com.
Diskless node issues
Digging into the logs I found that some of the drbd-options settings were incompatible. Because the initial import of the VM was slow, I had set on-congestion pull-ahead. That option only works if protocol A is also used. However, diskless nodes only work with protocol C. This is nowhere in the docs; only when you dig into the logs will you find some clues. The error message in Proxmox (below) is also useless: it only mentions "failed due to an unknown exception", very helpful.
blockdev: cannot open /dev/drbd/by-res/pm-315bfa3c/0: No such file or directory
kvm: -drive file=/dev/drbd/by-res/pm-315bfa3c/0,if=none,id=drive-scsi0,discard=on,format=raw,cache=none,aio=io_uring,detect-zeroes=unmap: Could not open '/dev/drbd/by-res/pm-315bfa3c/0': No such file or directory
NOTICE
Intentionally removing diskless assignment (pm-315bfa3c) on (jpl-proxmox6).
It will be re-created when the resource is actually used on this node.
API Return-Code: 500. Message: Could not delete diskless resource pm-315bfa3c on jpl-proxmox6, because:
[{"ret_code":53739522,"message":"Node: jpl-proxmox6, Resource: pm-315bfa3c preparing for deletion.","details":"Node: jpl-proxmox6, Resource: pm-315bfa3c UUID is: 66b9f107-6ce3-441b-a808-177073229bed","obj_refs":{"RscDfn":"pm-315bfa3c","Node":"jpl-proxmox6"},"created_at":"2024-11-22T12:17:37.013075607Z"},{"ret_code":53739523,"message":"Preparing deletion of resource on 'jpl-proxmox6'","obj_refs":{"RscDfn":"pm-315bfa3c","Node":"jpl-proxmox6"},"created_at":"2024-11-22T12:17:37.025377733Z"},{"ret_code":-4611686018373647386,"message":"(Node: 'jpl-proxmox8') Failed to adjust DRBD resource pm-315bfa3c","error_report_ids":["674051B6-DA5B0-000023"],"obj_refs":{"RscDfn":"pm-315bfa3c","Node":"jpl-proxmox6"},"created_at":"2024-11-22T12:17:37.100380166Z"},{"ret_code":-4611686018373647386,"message":"(Node: 'jpl-proxmox7') Failed to adjust DRBD resource pm-315bfa3c","error_report_ids":["674051A7-AAAC5-000023"],"obj_refs":{"RscDfn":"pm-315bfa3c","Node":"jpl-proxmox6"},"created_at":"2024-11-22T12:17:37.115180897Z"},{"ret_code":-4611686018373647386,"message":"(Node: 'jpl-proxmox9') Failed to adjust DRBD resource pm-315bfa3c","error_report_ids":["674051B2-B5BBE-000023"],"obj_refs":{"RscDfn":"pm-315bfa3c","Node":"jpl-proxmox6"},"created_at":"2024-11-22T12:17:37.119553561Z"},{"ret_code":-4611686018373647386,"message":"Deletion of resource 'pm-315bfa3c' on node 'jpl-proxmox6' failed due to an unknown exception.","details":"Node: jpl-proxmox6, Resource: pm-315bfa3c","error_report_ids":["67405641-00000-000005"],"obj_refs":{"RscDfn":"pm-315bfa3c","Node":"jpl-proxmox6"},"created_at":"2024-11-22T12:17:37.122856887Z"}]
at /usr/share/perl5/PVE/Storage/Custom/LINSTORPlugin.pm line 594.
...more stack trace...
TASK ERROR: start failed: QEMU exited with code 1
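For completeness: the offending settings are ordinary DRBD net options that LINSTOR pushes into its generated resource files. Setting and reverting them on the resource-group looks roughly like the sketch below; the flag names are from memory, so verify them with linstor resource-group drbd-options --help before relying on this.
# What I had set to speed up the initial sync (only valid together with protocol A):
linstor resource-group drbd-options pve-rg --protocol A --on-congestion pull-ahead

# The way back, which is what a diskless node needs (protocol C, default congestion policy):
linstor resource-group drbd-options pve-rg --protocol C --on-congestion block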
Network issue
Initially I had a 40gbit nic in the diskless node. This allowed read speeds of up to 1500 MB/s from the three storage nodes. Nice! However, write speeds from the diskless node were abysmal.
I found a problem in my switch (Brocade ICX6610-24): a lot of packets got dropped when pushing data from the 40gbit link to a 10gbit link (i.e. when writing data from the diskless node to a storage node). I tried flow control and limiting bandwidth, but nothing worked.
Egress queues:
Queue counters Queued packets Dropped Packets
0 47483309 718783
1 0 0
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
I then switched the 40gbit nic for a 10gbit one. Now writing was better, but reading was slow (unfortunately I lost that data). Again a lot of packets got dropped in my switch, now the other way around: 3x 10gbit ==> 1x 10gbit. I tried creating an LACP bond of 2 links, but that did nothing.
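When chasing issues like this it helps to take the storage layer out of the equation and benchmark the raw network path first (this is essentially what the LINBIT network-throughput article in the sources does). A minimal iperf3 check between the diskless node and a storage node; the address below is a placeholder for the storage node's IP on the sync subnet:
# On a storage node, start the server:
iperf3 -s

# From the diskless node, test both directions:
iperf3 -c 192.168.100.17 -P 4 -t 30        # diskless -> storage (the slow write path)
iperf3 -c 192.168.100.17 -P 4 -t 30 -R     # storage -> diskless (the read path)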
Testing Vitastor performance
The setup
- I used the same Samsung_SSD_850_EVO_250GB drives, one in each node.
- On each disk a single, default OSD was created.
- On the three OSDs I created a single pool testpool (see the sketch after this list).
- This pool was then imported into Proxmox.
- I re-used the two VMs with the minimal Debian install, one on jpl-proxmox6 (remote) and one on jpl-proxmox7 (local). I then re-ran the benchmarks from within the VMs.
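Roughly, the OSD and pool creation looks like the sketch below. vitastor-disk prepare is the documented way to turn a disk into an OSD; the pool definition goes into etcd, and the pg_size/pg_minsize/pg_count values are illustrative, so check the Vitastor docs before copying them:
# On each storage node: turn the SSD into a single default OSD (/dev/sdX is a placeholder)
vitastor-disk prepare /dev/sdX

# Define a replicated pool over the three OSDs (etcd endpoint is a placeholder)
etcdctl --endpoints=http://192.168.100.17:2379 put /vitastor/config/pools \
  '{"1":{"name":"testpool","scheme":"replicated","pg_size":3,"pg_minsize":2,"pg_count":256}}'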
Of note
- The IO delay of the storage host will be ~1 core (in my case 25%). This is expected behavior, as the author of Vitastor wrote to me in an email:
...it just represents how the Linux kernel calculates iowait for io_uring threads — it counts any thread waiting for io_uring events as 100% cpu iowait even though it's not waiting for any disk event, it just waits for the network. This change was actually made in Linux 6.1.39 and 6.4.4; before these versions it wasn't counting io_uring as 1 complete core iowait. Here are the discussions:
https://bbs.archlinux.org/viewtopic.php?id=287343
https://bugzilla.kernel.org/show_bug.cgi?id=217700
https://lore.kernel.org/lkml/538065ee-4130-6a00-dcc8-f69fbc7d7ba0@kernel.dk/
Quote from Jens Axboe (io_uring author): Just read the first one, but this is very much expected. It's now just correctly reflecting that one thread is waiting on IO. IO wait being 100% doesn't mean that one core is running 100% of the time, it just means it's WAITING on IO 100% of the time.
- Use enterprise SSDs, but this goes for all storage solutions. The author also advised:
Also I highly recommend to not use the whole disk for OSD if it’s a desktop SSD — you better should manually create an empty 50-100 GB partition, blkdiscard it, leave it empty and use the rest of space for OSD. Otherwise you may hit performance issues after some time, because Vitastor doesn’t support SSD TRIM yet and desktop SSDs don’t have extra overprovisioned space.
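Following that advice on a desktop SSD would look roughly like this; the 50G reservation and the device name are placeholders, and I believe vitastor-disk prepare also accepts a partition, but verify that against the docs:
# Reserve an empty partition as manual overprovisioning space, then TRIM it once
sudo sgdisk -n 1:0:+50G /dev/sdX   # partition 1: stays empty forever
sudo sgdisk -n 2:0:0 /dev/sdX      # partition 2: the rest of the disk, for the OSD
sudo blkdiscard /dev/sdX1          # discard the reserved space so the SSD can reuse it

# Then create the OSD on the second partition instead of the whole disk
vitastor-disk prepare /dev/sdX2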
Testing Ceph performance
The setup
- I used the same Samsung_SSD_850_EVO_250GB drives, one in each node.
- On each disk a single, default OSD was created.
- On the three OSDs I created a single pool testpool (see the sketch after this list).
- This pool was then imported into Proxmox.
- I re-used the two VMs with the minimal Debian install, one on jpl-proxmox6 (remote) and one on jpl-proxmox7 (local). I then re-ran the benchmarks from within the VMs.
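Since Ceph is built into Proxmox, this setup is a couple of pveceph commands (or a few clicks in the GUI). A sketch, assuming /dev/sdX is the Samsung on each node and 3/2 replication:
# On each storage node: create an OSD on the SSD
pveceph osd create /dev/sdX

# On one node: create the replicated pool and expose it as Proxmox storage
pveceph pool create testpool --size 3 --min_size 2 --add_storages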
Results
After many hours of testing I have more questions than answers. Here are some of my findings. For everyone interested, I'll attach the raw data and graphs in an Excel sheet below.
Performance penalty / overhead
Every SDS has some overhead compared to direct disk access. For writing, this overhead results in a performance "penalty" of up to an order of magnitude (30 MB/s instead of 300 MB/s). Here Linstor > Vitastor > Ceph.
Read performance can be close to, or even better than, local performance, as reads can be done from multiple machines in parallel. Here Vitastor > Linstor > Ceph.
Remote vs local
All of these solutions can run a virtual machine on a remote (diskless) machine; however, the results are not good. Probably the latency in my network is too high.
Linstor LVM-thin vs ZFS
LVM-thin is faster than ZFS. The insane read numbers when running locally are the in-memory ARC at work.
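If you want the local ZFS numbers to be less dominated by cached reads, the ARC can be capped. A sketch; the 4 GiB value is arbitrary:
# Limit the ZFS ARC to 4 GiB (value in bytes); persists across reboots
echo "options zfs zfs_arc_max=4294967296" | sudo tee /etc/modprobe.d/zfs.conf
sudo update-initramfs -u

# Or apply immediately without a reboot
echo 4294967296 | sudo tee /sys/module/zfs/parameters/zfs_arc_max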
Thoughts
1. Is an SDS in my homelab really a good option?
   a. It seems so much slower than what my storage is capable of. Next step: test NFS / iSCSI storage.
   b. Do I really need HA storage? Or am I better off moving the important stuff to a hoster / cloud provider?
2. Why are these solutions so slow when writing?
   a. Is it my network?
   b. Is my hardware too slow?
   c. Or should/can I tweak some settings (let me know: joep@joeplaa.com)? Linstor has an asynchronous mode, protocol A, which also has the option on-congestion pull-ahead. Does that make a big enough difference? When enabling it, the remote option becomes unavailable though, as that requires protocol C.
- Ceph was the easiest to install in Proxmox. However, it is also the slowest. Vitastor reading was by far the best on my systems, while writing was just a little better with Linstor. Both work, but need some patience and CLI skills.
- The debug info from the Linstor plugin in Proxmox is useless. You really have to go into Linstor itself and read the error logs. That eventually showed me why I couldn't import or create a remote VM.
- I find Linstor's network interfaces confusing. I can't get it working properly with two subnets.
- Vitastor seems promising. I couldn't figure out why it wouldn't work on my machines initially, but the developer replied quickly to my email and gave me some detailed answers.
- Live migration requires allow-two-primaries to be enabled. This mode only works when protocol C is used, which answers question 2c. See: https://github.com/LINBIT/linstor-proxmox/issues/37.