I have a decent amount of data stored away for backups and archives. While this data could easily be stored in a cloud platform, I take enjoyment in having my own copy of it.
I considered a few different technical options for storing the data (wanting to build, not buy), including:
- Minio (an S3 clone written in Go by the creator of GlusterFS) is great conceptually but still feels under development. It uses the underlying filesystem for storage and provides an abstraction layer on top.
- Ceph (a scalable, S3-compatible, multi-purpose object and file store). A single-node Ceph cluster has a lot of moving parts. I find Ceph’s approach in general too ambitious: block, object, and file system storage, a custom raw partition format (BlueStore), and a somewhat legacy C/C++ codebase together make the system fairly fragile.
- HDFS (the Hadoop distributed filesystem). Although I’ve done a few Hadoop implementations, I do not have the scale of data needed, and I wanted to use a single machine, for which HDFS is probably even less suitable than Ceph.
Compared to other traditional filesystems such as ext4, btrfs, etc., ZFS is still considered the best for long-term data storage. It has inbuilt checksumming and error-healing capabilities, COW (copy-on-write) allowing for instantaneous snapshots with rollback support, inbuilt RAID and data mirroring options, and easy transmission of backups via ZFS send / receive.
A good and honest evaluation of ZFS can be found on Stephen Foskett’s blog.
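To make the "checksumming and error healing" idea concrete, here is a toy model in Python. It is only a sketch of the concept, not how ZFS is implemented: the `ToyMirror` class, the block names, and the use of SHA-256 are all my own illustrative assumptions. Every block is written to two "disks" and its checksum is kept separately; on read, a copy that fails verification is repaired from one that passes.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return a hex digest used to detect silent corruption."""
    return hashlib.sha256(data).hexdigest()


class ToyMirror:
    """Hypothetical two-way mirror: two copies per block plus a stored checksum."""

    def __init__(self) -> None:
        self.disks = [{}, {}]   # one dict per "disk": block id -> bytes
        self.checksums = {}     # block id -> expected digest

    def write(self, block_id: str, data: bytes) -> None:
        self.checksums[block_id] = checksum(data)
        for disk in self.disks:
            disk[block_id] = data

    def read(self, block_id: str) -> bytes:
        expected = self.checksums[block_id]
        for disk in self.disks:
            data = disk[block_id]
            if checksum(data) == expected:
                # Good copy found: overwrite any copy that fails verification.
                for other in self.disks:
                    if checksum(other[block_id]) != expected:
                        other[block_id] = data
                return data
        raise IOError(f"all copies of {block_id!r} are corrupt")


mirror = ToyMirror()
mirror.write("photo", b"holiday pictures")
mirror.disks[0]["photo"] = b"holiday pictureX"   # simulate bit rot on one disk
recovered = mirror.read("photo")                 # served from disk 1, disk 0 repaired
```

The point of the model is that the checksum lets the reader tell which of the two copies is the bad one, something a plain RAID 1 mirror without checksums cannot decide on its own.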
After selecting ZFS my next task was installation.
The target server was an HP Gen 8 Microserver. Nice things about this server are its compact size, 4 x 3.5” hard drive bays, and support for ECC RAM. Not so nice are HP’s fan controllers, which spin at high noise unless particular RPM packages are installed, forcing me to use CentOS rather than Ubuntu (Ubuntu is generally nicer for ZFS on Linux as it includes ZFS in its core repositories).
I chose a mirrored pool with two hard drives (essentially RAID 1) for simplicity. Setting up a ZFS mirrored pool was relatively painless. There are some ZFS systemd units required for automounting and features around ZED (the ZFS Event Daemon), but nothing of much complexity.
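For reference, the setup boils down to a handful of commands. The Python sketch below only builds and prints the argument lists; it does not run anything. The pool name `tank`, the dataset `tank/archive`, the backup target, the snapshot name, and the device paths are hypothetical placeholders, and the systemd unit names are the ones I would expect from the ZFS on Linux packages, so check what your distribution actually ships.

```python
import shlex
from typing import List


def create_mirror_pool(pool: str, dev_a: str, dev_b: str) -> List[str]:
    """zpool create <pool> mirror <dev_a> <dev_b>: a two-disk mirror."""
    return ["zpool", "create", pool, "mirror", dev_a, dev_b]


def snapshot(dataset: str, name: str) -> List[str]:
    """zfs snapshot <dataset>@<name>: a point-in-time copy-on-write snapshot."""
    return ["zfs", "snapshot", f"{dataset}@{name}"]


def rollback(dataset: str, name: str) -> List[str]:
    """zfs rollback <dataset>@<name>: return the dataset to that snapshot."""
    return ["zfs", "rollback", f"{dataset}@{name}"]


def send(dataset: str, name: str) -> List[str]:
    """zfs send <dataset>@<name>: write the snapshot as a stream to stdout."""
    return ["zfs", "send", f"{dataset}@{name}"]


def receive(target: str) -> List[str]:
    """zfs receive <target>: read a stream from stdin into the target dataset."""
    return ["zfs", "receive", target]


# Assumed unit names; verify with `systemctl list-unit-files 'zfs*'`.
ENABLE_UNITS = ["systemctl", "enable", "zfs-import-cache.service",
                "zfs-mount.service", "zfs-zed.service", "zfs.target"]

commands = [
    create_mirror_pool("tank", "/dev/sdb", "/dev/sdc"),  # hypothetical devices
    ["zfs", "create", "tank/archive"],
    ENABLE_UNITS,
    snapshot("tank/archive", "before-cleanup"),
    rollback("tank/archive", "before-cleanup"),
]
for argv in commands:
    print(shlex.join(argv))

# A backup is a send piped into a receive, possibly over SSH on another host.
print(shlex.join(send("tank/archive", "before-cleanup")), "|",
      shlex.join(receive("backup/archive")))
```

Keeping the commands as argument lists makes it easy to hand them to `subprocess.run` later if I ever want to script the whole thing.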
My next goal was to expose the data volumes securely, on the assumption that I would use NFS.
There are two recent versions of NFS, NFSv3 and NFSv4, with different security options. NFSv3 essentially only allows whitelisting IP ranges, which wasn’t enough security. NFSv4 allows for a full Kerberos implementation. Kerberos in general requires either Microsoft AD or an MIT KDC.
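To show why an IP whitelist felt too weak, here is a small Python sketch of the kind of check it amounts to. The network range and addresses are hypothetical, and the exports line in the comment is only an example of the usual syntax, not my actual configuration.

```python
import ipaddress

# Roughly what an /etc/exports entry such as
#   /tank/archive 192.168.1.0/24(rw,sync)
# expresses: the only question asked is "where does the request come from?"
ALLOWED_NETWORKS = [ipaddress.ip_network("192.168.1.0/24")]  # hypothetical LAN


def is_allowed(client_ip: str) -> bool:
    """Return True if the client address falls inside a whitelisted range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)


print(is_allowed("192.168.1.50"))   # a laptop on the LAN
print(is_allowed("10.0.0.5"))       # a host outside the whitelisted range
```

Nothing in this check identifies the user: any device that ends up with an address in the range passes, which is the gap Kerberos is meant to close.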
FreeIPA is an equivalent Linux product providing AD-style functionality for identity management, including DNS, Kerberos (via MIT KDC), and a Certificate Authority. I went through the process of setting up a secure FreeIPA server along with NFSv4. This included generating the necessary keytabs and joining my laptop to a domain. Although this was successful, the approach added far too much complexity and I decided to roll back. I chose instead to expose the filesystem via SSH. Nautilus, the GNOME file manager, has inbuilt SSH filesystem integration, which makes managing data a breeze.
- ZFS works great as a long term filesystem.
- Copy-on-write and snapshots in general are wonderful features for filesystems.
- Legacy protocols such as NFS and SMB are buggy and complex with security as a bolt-on.
- Object storage (S3 compatible) systems are the future in this area but operate at a different abstraction layer compared to regular filesystems.