ZFS¶
ZFS is the Zettabyte File System.
Links¶
- OpenZFS - http://open-zfs.org
- Tuning Guide - https://web.archive.org/web/20161223004915/http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
- Hardware recommendations - http://blog.zorinaq.com/?e=10
- Mac ZFS - http://code.google.com/p/maczfs/
- Shadow migration feature - http://docs.oracle.com/cd/E23824_01/html/821-1448/gkkud.html
- Speed tuning - http://icesquare.com/wordpress/how-to-improve-zfs-performance/
- ZFS RAID levels - https://web.archive.org/web/20201120053331/http://www.zfsbuild.com/2010/05/26/zfs-raid-levels/
- http://en.wikipedia.org/wiki/ZFS
- http://wiki.freebsd.org/ZFSQuickStartGuide
- http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
- http://zfsguru.com
- http://zfsonlinux.org/faq.html
- https://web.archive.org/web/20190603150811/http://www.oracle.com/technetwork/articles/servers-storage-admin/o11-113-size-zfs-dedup-1354231.html
- http://wiki.freebsd.org/ZFSTuningGuide#Deduplication
- Corruption / failure to import - https://github.com/zfsonlinux/zfs/issues/2457
- https://www.percona.com/blog/2018/05/15/about-zfs-performance/
- https://wiki.freebsd.org/ZFSTuningGuide
- https://freebsdfoundation.org/blog/raid-z-expansion-feature-for-zfs
- https://www.binwang.me/2023-12-14-ZFS-Profiling-on-Arch-Linux.html
- https://despairlabs.com/blog/posts/2024-10-27-openzfs-dedup-is-good-dont-use-it
- https://www.zfshandbook.com
Tips¶
- As of zfs 2.3, -j provides JSON output for many zfs commands.
Memory¶
- For normal operation, 1 GB of memory per TB of disk space is a suitable rule of thumb.
- For dedup operation, 5 GB of memory per TB of addressable disk space is a suitable rule of thumb.
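The rules of thumb above reduce to simple arithmetic. A minimal sketch, assuming whole-number TB inputs; the helper name zfs_ram_gb is hypothetical and not part of any ZFS tool:

```shell
# Hypothetical helper: estimate RAM in GB from pool size in TB.
# Usage: zfs_ram_gb <size_tb> [dedup]
zfs_ram_gb() {
  local tb=$1 per_tb=1                  # 1 GB per TB for normal operation
  [ "${2:-}" = "dedup" ] && per_tb=5    # 5 GB per TB when dedup is on
  echo $(( tb * per_tb ))
}
```

For example, zfs_ram_gb 12 prints 12 and zfs_ram_gb 12 dedup prints 60.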
Log devices¶
- Use a log device if you have lots of writes.
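- Mirror it. On old pool versions, losing the log device could cost you the whole pool; on current versions you can still lose the most recent synchronous writes.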
- Speed and latency are most important, not size. Log flushes every 5 seconds.
- Get SLC if possible, otherwise MLC
L2ARC cache devices¶
- Use if you have lots of reads.
- Size matters: a bigger device can cache more data, so more reads are served quickly.
- Speed and latency matter.
- Mirroring L2ARC does not matter because if it fails, reads come from the spinning disks.
- Too big of a device can suck up resources and cause poor performance. See https://wiki.freebsd.org/ZFSTuningGuide
Good explanation: https://web.archive.org/web/20160324170916/https://blogs.oracle.com/brendan/entry/test
zdb¶
Show the potential savings of turning on dedupe on zpool tank¶
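A sketch of one way to do this, assuming the pool is named tank as in the heading. zdb -S simulates deduplication and prints a DDT histogram with an estimated dedup ratio, without changing the pool. The wrapper name dedup_savings is hypothetical:

```shell
# Hypothetical wrapper: simulate dedup on a pool without enabling it.
# Usage: dedup_savings [pool]   (defaults to tank)
dedup_savings() {
  zdb -S "${1:-tank}"
}
```

On large pools this can take a long time and use a lot of memory.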
Show transactions and human readable dates in the zdb history¶
Use zdb -e for pools that are not imported.
zdb -hh tank \
  | grep -E 'txg|time' \
  | while read -r _ a b ; do
      if [ "$a" = "time:" ] ; then
        # convert epoch seconds to a readable date (GNU date syntax)
        date -d "@$b" "+$a %F %T"
      else
        echo "$a $b"
      fi
    done
zpool¶
Create a zpool and its base filesystem¶
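A minimal sketch, assuming you pass the pool name followed by a vdev specification. zpool create also creates the pool's root filesystem and mounts it at /<pool> by default. The wrapper name create_pool is hypothetical:

```shell
# Hypothetical wrapper: create a pool from one or more devices.
# Usage: create_pool <pool> [vdev-type] <device>...
create_pool() {
  local pool=$1
  shift
  zpool create "$pool" "$@"
}
```

For example, create_pool tank mirror followed by two /dev/disk/by-id/ paths creates a two-way mirror.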
Add a cache device to a pool¶
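A minimal sketch, assuming an existing pool and a spare device. The cache keyword tells zpool add to attach the device as L2ARC. The wrapper name add_cache is hypothetical:

```shell
# Hypothetical wrapper: attach a device to a pool as an L2ARC cache.
# Usage: add_cache <pool> <device>
add_cache() {
  zpool add "$1" cache "$2"
}
```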
Show all configured zpool options for a given zpool¶
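A minimal sketch, assuming a pool named tank by default. The wrapper name show_pool_props is hypothetical:

```shell
# Hypothetical wrapper: print every pool-level property and its source.
# Usage: show_pool_props [pool]   (defaults to tank)
show_pool_props() {
  zpool get all "${1:-tank}"
}
```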
Show history of all operations on a given pool¶
Show real time statistics on a given zpool¶
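A minimal sketch, assuming a 5 second refresh interval is wanted. -v breaks the numbers down per vdev. The wrapper name pool_iostat is hypothetical:

```shell
# Hypothetical wrapper: print per-vdev I/O statistics repeatedly.
# Usage: pool_iostat <pool> [interval_seconds]   (interval defaults to 5)
pool_iostat() {
  zpool iostat -v "$1" "${2:-5}"
}
```

Press Ctrl-C to stop the output.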
Show basic information about all imported zpools¶
Show additional columns in zpool status output¶
$ zpool status -c upath,model,size,temp,pwr_cyc
  pool: z4
 state: ONLINE
  scan: scrub in progress since Mon Sep 15 12:36:25 2025
        1.92T / 11.5T scanned at 1.13G/s, 906G / 11.5T issued at 535M/s
        0B repaired, 7.68% done, 05:47:18 to go
config:
        NAME        STATE     READ WRITE CKSUM  upath     model                 size  temp  pwr_cyc
        z4          ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sde     ONLINE       0     0     0  /dev/sde  HGST HUH728080ALE600  7.3T  50    115
            sda     ONLINE       0     0     0  /dev/sda  HGST HUH728080ALE600  7.3T  48    113
            sdc     ONLINE       0     0     0  /dev/sdc  HGST HUH728080ALE600  7.3T  51    113
            sdd     ONLINE       0     0     0  /dev/sdd  HGST HUH728080ALE600  7.3T  49    113
errors: No known data errors
You can view the available columns with zpool status -c.
Show deduplication tables¶
# Show deduplication table entries. DDT consumption in MiB is entries * size-per-entry-in-bytes / 1024 / 1024
zpool status -D z2
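The DDT arithmetic above can be sketched as a small helper. This assumes you read the entry count and the per-entry size in bytes from the zpool status -D output yourself; the helper name ddt_mib is hypothetical, and the result is truncated to whole MiB by integer division:

```shell
# Hypothetical helper: DDT consumption in MiB.
# Usage: ddt_mib <entries> <bytes_per_entry>
ddt_mib() {
  echo $(( $1 * $2 / 1024 / 1024 ))
}
```

For example, ddt_mib 1000000 320 prints 305.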
Import a pool by different disk path¶
You can change the paths your pool is imported from. This is useful if you created your zpool using /dev/sdN when you should have used /dev/disk/by-id/, whose names stay stable across reboots. The -d option lets you specify a directory to search for the given pool's devices.
A directory such as /dev/disk/by-id/ often holds several links per disk, so your pool may be imported using links you do not want. For instance, you may find that your pool was imported using wwn links (e.g. wwn-0x5000cca22eca1056) that are not very user friendly compared to a link that shows the model and serial number (e.g. scsi-SATA_HGST_HMS5C4141BM_PM1302LAGR5A0F). Because these links are managed by udev and are recreated when the disk is seen by the system, either at boot or at insertion, and because nothing else should be referencing these symlinks, they are safe to delete. Export your pool, then delete unwanted symlinks for the devices related to your pool, leaving only the symlinks you want to use, then run zpool import -d once again.
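The export and re-import steps can be sketched as follows, assuming /dev/disk/by-id is the directory you want and that nothing is using the pool. The wrapper name reimport_by_path is hypothetical:

```shell
# Hypothetical wrapper: re-import a pool using device links from one directory.
# Usage: reimport_by_path <pool> [directory]   (defaults to /dev/disk/by-id)
reimport_by_path() {
  local pool=$1 dir=${2:-/dev/disk/by-id}
  # import only runs if the export succeeded
  zpool export "$pool" &&
    zpool import -d "$dir" "$pool"
}
```

Delete any unwanted symlinks between the export and the import if you need to steer which links get used.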
Replace a disk in a zpool¶
# Replace the first disk with the second in the tank pool
zpool replace -f tank /dev/disk/by-id/ata-ST3000DM001-9YN166_W1F09CW9 /dev/disk/by-id/ata-ST3000DM001-9YN166_Z1F0N9S7
Real example¶
$ zpool replace -f tank /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1334PCJY9ASS /dev/disk/by-id/ata-HGST_HUH728080ALE600_VKHA6YDX
$ zpool status
  pool: home
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Sun Dec 10 00:24:07 2017
config:
        NAME                                             STATE     READ WRITE CKSUM
        home                                             ONLINE       0     0     0
          ata-M4-CT064M4SSD2_0000000012170908F759-part4  ONLINE       0     0     0
errors: No known data errors

  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Jan 8 19:57:45 2018
        47.1M scanned out of 13.7T at 6.72M/s, 592h39m to go
        11.5M resilvered, 0.00% done
config:
        NAME                                           STATE     READ WRITE CKSUM
        tank                                           DEGRADED     0     0     0
          raidz1-0                                     DEGRADED     0     0     0
            replacing-0                                UNAVAIL      0     0     0
              ata-HGST_HDN724040ALE640_PK1334PCJY9ASS  UNAVAIL      0     1     0  corrupted data
              ata-HGST_HUH728080ALE600_VKHA6YDX        ONLINE       0     0     0  (resilvering)
            ata-HGST_HDN724040ALE640_PK2334PEHG8LAT    ONLINE       0     0     0
            ata-HGST_HDN724040ALE640_PK2334PEHGD37T    ONLINE       0     0     0
            ata-HGST_HDN724040ALE640_PK2338P4H3TJPC    ONLINE       0     0     0
errors: No known data errors
Expand a zpool in place after replacing disks with larger disks¶
Expansion happens automatically if you have done zpool set autoexpand=on tank. If you did not do that and you find your pool has not expanded, you can perform the following:
List the absolute paths of your devices with something like:
Then go through your device list and run
zpool online -e tank <disk-name> # do the expansion
zpool list -v tank # check the EXPANDSZ column for the disk
After doing all of these your pool should be expanded.
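The steps above can be sketched as a loop, assuming you already have the device names (zpool status -P tank prints full device paths). The helper name expand_pool is hypothetical:

```shell
# Hypothetical helper: expand every listed device in a pool, then show the result.
# Usage: expand_pool <pool> <device>...
expand_pool() {
  local pool=$1 dev
  shift
  for dev in "$@"; do
    zpool online -e "$pool" "$dev"   # do the expansion for this device
  done
  zpool list -v "$pool"              # check the EXPANDSZ column
}
```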
zfs¶
Show differences between current filesystem state and snapshot state¶
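A minimal sketch, assuming a snapshot name such as tank/home@yesterday (a placeholder). With a single snapshot argument, zfs diff compares that snapshot against the current state of its filesystem and marks each path with +, -, M or R. The wrapper name snap_diff is hypothetical:

```shell
# Hypothetical wrapper: list files changed since a snapshot was taken.
# Usage: snap_diff <dataset@snapshot>
snap_diff() {
  zfs diff "$1"
}
```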
Show configured properties for a filesystem¶
Show custom filesystem attributes¶
Show an overview of all mounted zfs filesystems¶
Show specified fields of each filesystem¶
# Show the listed fields of all filesystems
zfs list -t all -o name,referenced,used,written,creation,userused@root
Show only snapshots¶
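A minimal sketch, assuming you want the snapshots under one dataset sorted by creation time. The wrapper name list_snaps is hypothetical:

```shell
# Hypothetical wrapper: list only snapshots under a dataset, oldest first.
# Usage: list_snaps [dataset]   (defaults to tank)
list_snaps() {
  zfs list -t snapshot -r -o name,used,creation -s creation "${1:-tank}"
}
```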
Show space consumed by file owner¶
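A minimal sketch, assuming a dataset such as tank/home (a placeholder). zfs userspace reports the space charged to each user that owns files in the dataset. The wrapper name space_by_owner is hypothetical:

```shell
# Hypothetical wrapper: show space used per file owner in a dataset.
# Usage: space_by_owner <dataset>
space_by_owner() {
  zfs userspace "$1"
}
```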
Disable atime updates for a filesystem¶
Set compression to lz4 for a filesystem¶
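A minimal sketch, assuming a dataset such as tank/data (a placeholder). Only data written after the change is compressed, so the second command is there to watch the ratio over time. The wrapper name enable_lz4 is hypothetical:

```shell
# Hypothetical wrapper: turn on lz4 compression and show the current ratio.
# Usage: enable_lz4 <dataset>
enable_lz4() {
  zfs set compression=lz4 "$1" &&
    zfs get compression,compressratio "$1"
}
```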
Set deduplication to enabled for a filesystem¶
Set a filesystem to readonly¶
Set a filesystem to allow NFS sharing¶
Create a dataset¶
Destroy multiple snapshots¶
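A sketch of one way to do this, assuming the snapshots form a contiguous range on one dataset. zfs destroy accepts dataset@first%last to cover an inclusive range. The wrapper name destroy_snap_range is hypothetical, and it is deliberately a dry run:

```shell
# Hypothetical wrapper: preview destroying an inclusive range of snapshots.
# -n makes this a dry run and -v lists what would be destroyed;
# drop -n once the list looks right.
# Usage: destroy_snap_range <dataset> <first_snap> <last_snap>
destroy_snap_range() {
  zfs destroy -n -v "$1@$2%$3"
}
```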
zfs send / receive¶
Replicate a zpool (use the latest snapshot name as the source) to a blank zpool:
- -D requests a deduplicated stream. Current OpenZFS releases accept this flag for compatibility but send a regular stream.
- -R enables a recursive send of all snapshots and filesystems up to that point.
- -F enables deletion of any snapshots on the target that don't exist on the sender.
- -v enables verbose mode.
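A minimal sketch of the replication, assuming a source snapshot such as tank@latest and an empty target pool (both placeholders). -D is left out here because current OpenZFS releases ignore it. The wrapper name replicate_pool is hypothetical:

```shell
# Hypothetical wrapper: replicate a pool snapshot, with all descendants, into another pool.
# Usage: replicate_pool <source_pool@snapshot> <target_pool>
replicate_pool() {
  local src_snap=$1 dst_pool=$2
  # -R on send is recursive; -F on receive rolls back and prunes the target; -v is verbose
  zfs send -R "$src_snap" | zfs receive -F -v "$dst_pool"
}
```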
Recursively zfs send a filesystem to a remote host and receive it as a new dataset¶
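A minimal sketch, assuming ssh access to the remote host with enough privilege to run zfs receive there. The host and dataset names in the usage line are placeholders, and the wrapper name send_to_remote is hypothetical:

```shell
# Hypothetical wrapper: recursive send over ssh into a new remote dataset.
# Usage: send_to_remote <dataset@snapshot> <host> <remote_dataset>
send_to_remote() {
  local snap=$1 host=$2 target=$3
  zfs send -R "$snap" | ssh "$host" zfs receive -v "$target"
}
```

For example: send_to_remote tank/home@today backuphost backup/home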
Show summary of what would be sent¶
This shows an entire dataset up to the given snapshot
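A minimal sketch, assuming a snapshot such as tank/home@today (a placeholder). -n makes the send a dry run and -v prints the estimated stream size, so nothing is transferred. The wrapper name send_estimate is hypothetical:

```shell
# Hypothetical wrapper: estimate the size of a full send without sending anything.
# Usage: send_estimate <dataset@snapshot>
send_estimate() {
  zfs send -n -v "$1"
}
```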
Show the space differences between two snapshots¶
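A sketch of one way to do this, assuming both snapshots belong to the same dataset. A dry-run incremental send (-n -v -i) reports the estimated size of the data that changed between them. The wrapper name snap_delta is hypothetical:

```shell
# Hypothetical wrapper: estimate how much data changed between two snapshots.
# Usage: snap_delta <dataset@older> <dataset@newer>
snap_delta() {
  zfs send -n -v -i "$1" "$2"
}
```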
Show the amount of new space consumed by each monthly snapshot¶
zfs list -H -t snapshot -o name | grep 'tank@.*monthly' | while read -r X ; do [[ -n $a ]] && zfs send -n -v -D -i "$a" "$X" ; a=$X ; done 2>&1 | grep send
Complex examples¶
Create a raidz called tank¶
Create a raidz pool from 4 disks and set some properties:
pool=tank
zpool create -f -o ashift=12 "${pool}" raidz /dev/disk/by-id/scsi-SATA_HGST_HDN724040A_PK2338P4H*-part1
zfs set dedup=on "${pool}"
zpool set listsnapshots=on "${pool}"
zfs set atime=off "${pool}"
zfs set compression=lz4 "${pool}"
Create a case insensitive raidz3 out of 50 files¶
pool=tank
for X in {1..50} ; do mkfile -n 2g ${pool}.$X ; done ;
sudo zpool create -O casesensitivity=insensitive ${pool} raidz3 "${PWD}/${pool}".{1..50}
Troubleshooting¶
Mount a pool that is giving you trouble¶
This helped me get read access to a pool that was kernel panicking with the following error when I tried to import it normally:
Dec 7 14:48:40 localhost kernel: PANIC: blkptr at ffff8803fddb4200 DVA 0 has invalid OFFSET 294940902907904
ZFS on Mac OS X¶
Create a ZFS partition on /dev/disk3¶
# Must eject device in Disk Utility first
diskutil partitiondisk /dev/disk3 GPTFormat ZFS %noformat% 100% # strange syntax, but works
zpool create backups1 /dev/disk3s2 # create the zpool
mdutil -i off /Volumes/backups1 # required on MacZFS since spotlight does not function
ZFS on Linux¶
- If you get module errors: modprobe zfs ; ldconfig
- If you get permission denied, check SELinux settings.
CentOS 6 Repository¶
sudo yum install -y epel-release # assumes later CentOS 6 where epel is provided upstream
sudo yum localinstall --nogpgcheck http://archive.zfsonlinux.org/epel/zfs-release.el6.noarch.rpm
sudo yum install zfs -y
Reinstalling when things fail¶
#!/bin/bash -x
yum install -y "kernel-devel-$(uname -r)"
zfs_version=0.6.5.4
dkms remove -m zfs -v "${zfs_version}" --all
dkms remove -m spl -v "${zfs_version}" --all
dkms add -m spl -v "${zfs_version}" --force
dkms add -m zfs -v "${zfs_version}" --force
dkms install -m spl -v "${zfs_version}" --force
dkms install -m zfs -v "${zfs_version}" --force
Inspect the rpm for what scripts it runs¶
This is useful for debugging failures after kernel upgrade.
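A minimal sketch, assuming the package you care about is installed and named zfs-dkms (an assumption; substitute whatever rpm -qa | grep zfs shows). rpm -q --scripts prints the install and uninstall scriptlets. The wrapper name rpm_scripts is hypothetical:

```shell
# Hypothetical wrapper: print the scriptlets an installed rpm runs.
# Usage: rpm_scripts [package]   (defaults to zfs-dkms, an assumed package name)
rpm_scripts() {
  rpm -q --scripts "${1:-zfs-dkms}"
}
```

For an rpm file that is not installed, rpm -qp --scripts followed by the file path does the same.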