ZFS
ZFS is the Zettabyte File System.
Links
- OpenZFS - http://open-zfs.org
- Tuning Guide - https://web.archive.org/web/20161223004915/http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
- Hardware recommendations - http://blog.zorinaq.com/?e=10
- Mac ZFS - http://code.google.com/p/maczfs/
- Shadow migration feature - http://docs.oracle.com/cd/E23824_01/html/821-1448/gkkud.html
- Speed tuning - http://icesquare.com/wordpress/how-to-improve-zfs-performance/
- ZFS RAID levels - https://web.archive.org/web/20201120053331/http://www.zfsbuild.com/2010/05/26/zfs-raid-levels/
- http://en.wikipedia.org/wiki/ZFS
- http://wiki.freebsd.org/ZFSQuickStartGuide
- http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
- http://zfsguru.com
- http://zfsonlinux.org/faq.html
- https://web.archive.org/web/20190603150811/http://www.oracle.com/technetwork/articles/servers-storage-admin/o11-113-size-zfs-dedup-1354231.html
- http://wiki.freebsd.org/ZFSTuningGuide#Deduplication
- Corruption / failure to import - https://github.com/zfsonlinux/zfs/issues/2457
- https://www.percona.com/blog/2018/05/15/about-zfs-performance/
- https://wiki.freebsd.org/ZFSTuningGuide
- https://freebsdfoundation.org/blog/raid-z-expansion-feature-for-zfs
- https://www.binwang.me/2023-12-14-ZFS-Profiling-on-Arch-Linux.html
- https://despairlabs.com/blog/posts/2024-10-27-openzfs-dedup-is-good-dont-use-it
Tips
Memory
- For normal operation, roughly 1 GB of RAM per TB of disk space is a reasonable rule of thumb.
- For dedup operation, roughly 5 GB of RAM per TB of addressable disk space is a reasonable rule of thumb (see the worked example after this list).
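A quick back-of-the-envelope check using those rules of thumb (the 8 TB pool size here is purely illustrative):
## illustrative only: RAM rules of thumb for a hypothetical 8 TB pool
pool_tb=8
echo "normal: ~$(( pool_tb * 1 )) GB RAM ; dedup: ~$(( pool_tb * 5 )) GB RAM"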
Log devices
- Use a log device (SLOG) if you have lots of synchronous writes; see the sketch after this list.
- Mirror it. On older pool versions, losing the log device could render the whole pool unimportable; even on newer versions you lose any in-flight synchronous writes.
- Speed and latency matter most, not size: the log is flushed with each transaction group, roughly every 5 seconds by default.
- Get SLC flash if possible, otherwise MLC.
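A minimal sketch of adding a mirrored log device, assuming ada0p4 and ada1p4 are spare low-latency partitions:
## add a mirrored SLOG to the tank zpool
zpool add tank log mirror ada0p4 ada1p4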
L2ARC cache devices
- Use if you have lots of reads.
- Size does matter: a bigger device can cache more data, speeding up a larger share of reads.
- Speed and latency matter.
- Mirroring the L2ARC does not matter; if it fails, reads simply fall back to the spinning disks.
- Too big a device can consume ARC memory for L2ARC headers and hurt performance. See https://wiki.freebsd.org/ZFSTuningGuide and the check after this list.
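On ZFS on Linux you can sanity-check L2ARC behavior from the ARC kstats (a sketch; stat names as found in /proc/spl/kstat/zfs/arcstats):
## show L2ARC size, hit and miss counters
grep -E '^l2_(size|hits|misses)' /proc/spl/kstat/zfs/arcstats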
Good explanation: https://web.archive.org/web/20160324170916/https://blogs.oracle.com/brendan/entry/test
zdb
Show the potential savings of turning on dedupe on zpool tank
zdb -S tank
Show transactions and human-readable dates in the zdb history
Use zdb -e for pools that are not mounted.
zdb -hh tank \
| egrep 'txg|time' \
| while read -r _ a b ; do
  if [ "$a" = "time:" ] ; then
    date -d "@${b}" "+$a %F %T"
  else
    echo "$a $b"
  fi
done
zpool
Create a zpool and its base filesystem
zpool create -f -o cachefile=/tmp/zpool.cache zpoolname /dev/ada1 #create a zpool
Add a cache device to a pool
## add ada0p3 as a cache device to the tank zpool
zpool add tank cache ada0p3
Show all configured zpool options for a given zpool
zpool get all tank
Show history of all operations on a given pool
## show history of operations on the pool, e.g. snapshots, attribute changes
zpool history
Show real time statistics on a given zpool
## show per-device statistics every 1 second
zpool iostat -v 1
Show basic information about all imported zpools
## show zpool space info, deduplication ratio and health
zpool list
Show deduplication tables
## show deduplication table entries; multiply the entry count by the in-core size per entry to estimate DDT RAM consumption
zpool status -D z2
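A sketch of that arithmetic, assuming the summary line prints as "dedup: DDT entries N, size XB on disk, YB in core" (the "in core" figure is the per-entry RAM cost):
zpool status -D z2 \
| awk '/DDT entries/ { gsub(",","",$4) ; sub("B","",$9) ; printf "%.0f MiB of RAM for the DDT\n", $4 * $9 / 1048576 }'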
Import a pool by different disk path
You can change the paths your pool is imported from. This is useful if you created your zpool using /dev/sdN when you should have used /dev/disk/by-id/, which is deterministic. The -d option lets you specify a directory to look within for the given pool's devices.
zpool import -d /dev/disk/by-id/ "$ZPOOL_NAME"
You may find that your pool was imported using links from this path that are not desirable, because there are several options available. For instance, you may find that your pool was imported using wwn links (e.g. wwn-0x5000cca22eca1056) that are not very user friendly compared to a link that shows the model and serial number (e.g. scsi-SATA_HGST_HMS5C4141BM_PM1302LAGR5A0F). Because these links are managed by udev and are created when the disk is seen by the system, either at boot or at insertion, and because nothing else should be referencing these symlinks, they are safe to delete. Export your pool, delete the unwanted symlinks for the devices in your pool, leaving only the symlinks you want to use, then run zpool import -d once again, as sketched below.
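A sketch of that cleanup, assuming the pool is named tank and the wwn-* links are the unwanted ones:
zpool export tank
## udev recreates these links at the next boot or disk rescan, so deleting them is safe
rm /dev/disk/by-id/wwn-*
zpool import -d /dev/disk/by-id/ tank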
Replace a disk in a zpool
## Replace the first disk with the second in the tank pool
zpool replace -f tank /dev/disk/by-id/ata-ST3000DM001-9YN166_W1F09CW9 /dev/disk/by-id/ata-ST3000DM001-9YN166_Z1F0N9S7
Real example
$ zpool replace -f tank /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1334PCJY9ASS /dev/disk/by-id/ata-HGST_HUH728080ALE600_VKHA6YDX
$ zpool status
  pool: home
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Sun Dec 10 00:24:07 2017
config:

        NAME                                             STATE     READ WRITE CKSUM
        home                                             ONLINE       0     0     0
          ata-M4-CT064M4SSD2_0000000012170908F759-part4  ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Jan 8 19:57:45 2018
        47.1M scanned out of 13.7T at 6.72M/s, 592h39m to go
        11.5M resilvered, 0.00% done
config:

        NAME                                           STATE     READ WRITE CKSUM
        tank                                           DEGRADED     0     0     0
          raidz1-0                                     DEGRADED     0     0     0
            replacing-0                                UNAVAIL      0     0     0
              ata-HGST_HDN724040ALE640_PK1334PCJY9ASS  UNAVAIL      0     1     0  corrupted data
              ata-HGST_HUH728080ALE600_VKHA6YDX        ONLINE       0     0     0  (resilvering)
            ata-HGST_HDN724040ALE640_PK2334PEHG8LAT    ONLINE       0     0     0
            ata-HGST_HDN724040ALE640_PK2334PEHGD37T    ONLINE       0     0     0
            ata-HGST_HDN724040ALE640_PK2338P4H3TJPC    ONLINE       0     0     0

errors: No known data errors
Expand a zpool in place after replacing disks with larger disks
Expansion happens automatically if you have done zpool set autoexpand=on tank. If you did not do that and you find your pool has not expanded, you can perform the following:
List the absolute paths of your devices with something like:
zpool list -v -PH | awk '$1 ~ "^\/dev\/" {gsub("-part1","",$1) ; print $1 ;}'
Then go through your device list and run
zpool online -e tank <disk-name> # do the expansion
zpool list -v tank # check the EXPANDSZ column for the disk
After doing this for each device, your pool should be expanded.
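The two steps can be rolled into one loop (a sketch, assuming the same -part1 naming used above):
pool=tank
zpool list -v -PH "$pool" \
| awk '$1 ~ "^/dev/" { gsub("-part1","",$1) ; print $1 }' \
| while read -r disk ; do
  zpool online -e "$pool" "$disk" # expand onto each backing device
done
zpool list -v "$pool" # EXPANDSZ should now be zero and SIZE larger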
zfs
Show differences between current filesystem state and snapshot state
## the first argument must be the snapshot
zfs diff tank@snap tank
Show configured properties for a filesystem
zfs get all
Show custom filesystem attributes
## show custom attributes that override inherited attributes
zfs get all -s local tank
Show an overview of all mounted zfs filesystems
## show disk space including free physical disk space and mount info
zfs list
Show specified fields of each filesystem
## show the listed fields of all filesystems
zfs list -t all -o name,referenced,used,written,creation,userused@root
Show only snapshots
zfs list -t snapshot
Show space consumed by file owner
zfs userspace tank
Disable atime updates for a filesystem
zfs set atime=off tank
Set compression to lz4 for a filesystem
zfs set compression=lz4 tank
Set deduplication to enabled for a filesystem
zfs set dedup=on tank
Set a filesystem to readonly
zfs set readonly=on zpoolname/dataset
Set a filesystem to allow NFS sharing
zfs set sharenfs=on tank
Create a dataset
## create a dataset 'sole' on zpool 'tank'
zfs create tank/sole
Destroy multiple snapshots
zfs destroy tank@20130413-weekly,20130420-weekly,20130428-weekly,20130505-weekly
zfs send / receive
Replicate a zpool (use the latest snapshot name as the source) to a blank zpool:
zfs send -v -D -R tank@20120907-oldest | zfs receive -F -v z2
- -D enables a deduplicated stream.
- -R enables a recursive send of all snapshots and filesystems up to that point.
- -F enables deletion of any snapshots on the target that don't exist on the sender.
- -v enables verbose mode.
Recursively zfs send a filesystem to a remote host and receive it as a new dataset
zfs send -v -D -R z1@20120907-oldest | ssh otherhost zfs receive -v z2/z1
Show summary of what would be sent
This shows an entire dataset up to the given snapshot
zfs send -n -v -D -R tank@20140531-monthly
Show the space differences between two snapshots
zfs send -n -v -D -i tank@20140531-monthly tank@20141031-monthly
Show the amount of new space consumed by each monthly snapshot
zfs list -o name \
| grep 'tank@.*monthly' \
| while read -r X ; do
  [[ ! $a =~ .*monthly ]] && a=$X || zfs send -n -v -D -i "$a" "$X" && a=$X
done 2>&1 | grep send
Complex examples
Create a raidz called tank
Create a raidz pool from 4 disks and set some properties:
pool=tank
zpool create -f -o ashift=12 "${pool}" raidz /dev/disk/by-id/scsi-SATA_HGST_HDN724040A_PK2338P4H*-part1
zfs set dedup=on "${pool}"
zpool set listsnapshots=on "${pool}"
zfs set atime=off "${pool}"
zfs set compression=lz4 "${pool}"
Create a case insensitive raidz3 out of 50 files
pool=tank
for X in {1..50} ; do mkfile -n 2g "${pool}.${X}" ; done # mkfile is Solaris/macOS; on Linux, truncate -s 2G works
sudo zpool create -O casesensitivity=insensitive ${pool} raidz3 "${PWD}/${pool}".{1..50}
Troubleshooting
Mount a pool that is giving you trouble
zpool import -o failmode=continue -o readonly=on zpool_name
This helped me get read access to a pool that was kernel panicking with the following error when I tried to import it normally:
Dec 7 14:48:40 localhost kernel: PANIC: blkptr at ffff8803fddb4200 DVA 0 has invalid OFFSET 294940902907904
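With the pool imported read-only, you can still copy data off with zfs send, since sending an existing snapshot is a read-only operation (a sketch; the snapshot name and the rescue/zpool_name destination on backuphost are placeholders):
zfs list -r -t snapshot zpool_name # find an existing snapshot to rescue
zfs send -R zpool_name@some-existing-snap | ssh backuphost zfs receive -F rescue/zpool_name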
ZFS on Mac OS X
Create a ZFS partition on /dev/disk3
## Must eject device in Disk Utility first
diskutil partitiondisk /dev/disk3 GPTFormat ZFS %noformat% 100% # strange syntax, but works
zpool create backups1 /dev/disk3s2 # create the zpool
mdutil -i off /Volumes/backups1 # required on MacZFS since spotlight does not function
ZFS on Linux
- If you get module errors, try reloading the module and refreshing the shared library cache (verification commands below):
modprobe zfs ; ldconfig
- If you get permission denied, check your SELinux settings.
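Quick checks to confirm the modules actually built and loaded (a sketch, assuming the stock DKMS-based packaging):
dkms status # spl/zfs should be listed as installed for the running kernel
lsmod | grep zfs # confirm the kernel module is loaded
modinfo zfs | head -n 3 # shows the module version and path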
CentOS 6 Repository
sudo yum install -y epel-release # assumes later CentOS 6 where epel is provided upstream
sudo yum localinstall --nogpgcheck http://archive.zfsonlinux.org/epel/zfs-release.el6.noarch.rpm
sudo yum install zfs -y
Reinstalling when things fail
#!/bin/bash -x
yum install -y kernel-devel-$(uname -r)
zfs_version=0.6.5.4
dkms remove -m zfs -v "${zfs_version}" --all
dkms remove -m spl -v "${zfs_version}" --all
dkms add -m spl -v "${zfs_version}" --force
dkms add -m zfs -v "${zfs_version}" --force
dkms install -m spl -v "${zfs_version}" --force
dkms install -m zfs -v "${zfs_version}" --force
Inspect the rpm for what scripts it runs
This is useful for debugging failures after kernel upgrade.
rpm -q --scripts zfs-dkms
Building on CentOS 6
yum groupinstall "Development tools" && yum install -y libuuid-devel zlib-devel bc lsscsi mdadm parted kernel-debug
## For spl, then again for zfs:
./configure && make && make rpm && rpm -i *64.rpm