ZFS

ZFS is the Zettabyte File System.

Links

OpenZFS - http://open-zfs.org
Tuning Guide - https://web.archive.org/web/20161223004915/http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
Hardware recommendations - http://blog.zorinaq.com/?e=10
Mac ZFS - http://code.google.com/p/maczfs/
Shadow migration feature - http://docs.oracle.com/cd/E23824_01/html/821-1448/gkkud.html
Speed tuning - http://icesquare.com/wordpress/how-to-improve-zfs-performance/
ZFS RAID levels - https://web.archive.org/web/20201120053331/http://www.zfsbuild.com/2010/05/26/zfs-raid-levels/
http://en.wikipedia.org/wiki/ZFS
http://wiki.freebsd.org/ZFSQuickStartGuide
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
http://zfsguru.com
http://zfsonlinux.org/faq.html
https://web.archive.org/web/20190603150811/http://www.oracle.com/technetwork/articles/servers-storage-admin/o11-113-size-zfs-dedup-1354231.html
http://wiki.freebsd.org/ZFSTuningGuide#Deduplication
Corruption / failure to import - https://github.com/zfsonlinux/zfs/issues/2457
https://www.percona.com/blog/2018/05/15/about-zfs-performance/
https://wiki.freebsd.org/ZFSTuningGuide
https://freebsdfoundation.org/blog/raid-z-expansion-feature-for-zfs
https://www.binwang.me/2023-12-14-ZFS-Profiling-on-Arch-Linux.html

Tips

Memory

For normal operation, about 1 GB of memory per TB of disk space is a suitable rule of thumb.
For dedup operation, about 5 GB of memory per TB of addressable disk space is a suitable rule of thumb.
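
The rule of thumb above is simple arithmetic, so it can be scripted as a rough sketch. The helper name ram_gb and the 24 TB pool size are illustrative assumptions, not part of any ZFS tool.

```shell
# Estimate RAM in GB from pool size in TB, using the rule of thumb above:
# 1 GB per TB for normal operation, 5 GB per TB when dedup is enabled.
# ram_gb is a hypothetical helper, not a ZFS command.
ram_gb() {
  tb=$1
  dedup=${2:-off}
  if [ "$dedup" = "on" ]; then
    echo $(( tb * 5 ))
  else
    echo $(( tb * 1 ))
  fi
}

ram_gb 24       # prints 24
ram_gb 24 on    # prints 120
```

Treat the result as a starting point for sizing, not a hard requirement.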

Log devices

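Use a log device if you have lots of synchronous writes.
Mirror it. On old pool versions (before 19), losing the log device meant losing the whole pool; on newer versions you can still lose the most recent synchronous writes.
Speed and latency are most important, not size, because the log is flushed every 5 seconds.
Get SLC if possible, otherwise MLC.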
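
To see why size matters so little, here is a rough estimate of how much data sits on the log between flushes. The helper name slog_mb and the 500 MB/s write rate are assumptions for illustration; real pools may hold more than one interval's worth of writes, so leave generous headroom.

```shell
# Rough upper bound on data held by a log device between flushes.
# $1: sustained synchronous write rate in MB/s (hypothetical figure)
# $2: flush interval in seconds (defaults to the 5 seconds noted above)
# slog_mb is a hypothetical helper, not a ZFS command.
slog_mb() {
  rate_mb_s=$1
  interval_s=${2:-5}
  echo $(( rate_mb_s * interval_s ))
}

slog_mb 500    # prints 2500, i.e. about 2.5 GB in flight at 500 MB/s
```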

l2arc Cache devices

Use if you have lots of reads.
Size does matter: a bigger device can cache more data, so more reads are served quickly.
Speed and latency matter.
Mirroring l2arc does not matter because if it fails, reads come from the spinning disks.
Too big a device can consume resources (its index is kept in RAM) and cause poor performance. See https://wiki.freebsd.org/ZFSTuningGuide

Good explanation: https://web.archive.org/web/20160324170916/https://blogs.oracle.com/brendan/entry/test

zdb

Show the potential savings of turning on dedupe on zpool tank

https://web.archive.org/web/20130217052412/http://hub.opensolaris.org/bin/view/Community+Group+zfs/dedup

zdb -S tank

Show transactions and human readable dates in the zdb history

Use zdb -e for pools that are not imported.

zdb -hh tank \
| grep -E 'txg|time' \
| while read -r _ a b ; do
  if [ "$a" = "time:" ] ; then
    date -d "@$b" "+$a %F %T" ;
  else
    echo "$a  $b" ;
  fi ;
done

zpool

Create a zpool and its base filesystem

zpool create -f -o cachefile=/tmp/zpool.cache zpoolname /dev/ada1 #create a zpool

Add a cache device to a pool

## add ada0p3 as a cache device to the tank zpool
zpool add tank cache ada0p3

Show all configured zpool options for a given zpool

zpool get all tank

Show history of all operations on a given pool

## show history of operations on the pool, eg: snapshots, attribute changes
zpool history tank

Show real time statistics on a given zpool

## show per-device statistics every 1 second
zpool iostat -v tank 1

Show basic information about all imported zpools

## show zpool space info, deduplication ratio and health
zpool list

Show deduplication tables

## show deduplication table entries. Take entries * size / 1024 / 1024 to calculate DDT consumption
zpool status -D z2
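
The DDT calculation in the comment above can be sketched as a small function. The helper name ddt_mib and the sample numbers are hypothetical; read the real entry count and per-entry size from your own zpool status -D output.

```shell
# Approximate DDT size in MiB: entries * bytes-per-entry / 1024 / 1024.
# Both inputs are read by hand from the "zpool status -D" output.
# ddt_mib is a hypothetical helper, not a ZFS command.
ddt_mib() {
  entries=$1
  size=$2
  echo $(( entries * size / 1024 / 1024 ))
}

ddt_mib 4000000 320   # hypothetical numbers; prints 1220
```

Integer division rounds down, so the result slightly underestimates the real figure.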

Import a pool by different disk path

You can change the paths your pool is imported from. This is useful if you created your zpool using /dev/sdN when you should have used /dev/disk/by-id/, which is deterministic. The -d option lets you specify a directory to look within for the given pool's devices.

zpool import -d /dev/disk/by-id/ "$ZPOOL_NAME"

The directory may hold several links per disk, so you may find that your pool was imported using links you do not want. For instance, wwn links (e.g. wwn-0x5000cca22eca1056) are less user friendly than links that show the model and serial number (e.g. scsi-SATA_HGST_HMS5C4141BM_PM1302LAGR5A0F). These links are managed by udev and are created when the disk is seen by the system, either at boot or at insertion, and nothing else should be referencing them, so they are safe to delete. Export your pool, delete the unwanted symlinks for the devices in your pool, leaving only the symlinks you want to use, then run zpool import -d once again.
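
Before deleting anything, it helps to see which links point at a given disk. The following is a minimal sketch; links_for_device is a hypothetical helper name, and /dev/sda in the usage comment is an assumed device. It only reads and prints, it deletes nothing.

```shell
# Print every symlink in directory $1 that resolves to device $2.
# links_for_device is a hypothetical helper; it only lists, it never deletes.
links_for_device() {
  dir=$1
  target=$(readlink -f "$2")
  for link in "$dir"/*; do
    # Skip anything that is not a symlink (including an unmatched glob).
    [ -L "$link" ] || continue
    if [ "$(readlink -f "$link")" = "$target" ]; then
      echo "$link"
    fi
  done
}

# Example (device name is an assumption):
# links_for_device /dev/disk/by-id /dev/sda
```

Review the printed list, then remove only the links you do not want the import to use.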

Replace a disk in a zpool

## Replace the first disk with the second in the tank pool
zpool replace -f tank /dev/disk/by-id/ata-ST3000DM001-9YN166_W1F09CW9 /dev/disk/by-id/ata-ST3000DM001-9YN166_Z1F0N9S7

Real example

$ zpool replace -f tank /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1334PCJY9ASS /dev/disk/by-id/ata-HGST_HUH728080ALE600_VKHA6YDX
$ zpool status
  pool: home
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Sun Dec 10 00:24:07 2017
config:

        NAME                                             STATE     READ WRITE CKSUM
        home                                             ONLINE       0     0     0
          ata-M4-CT064M4SSD2_0000000012170908F759-part4  ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Jan  8 19:57:45 2018
    47.1M scanned out of 13.7T at 6.72M/s, 592h39m to go
    11.5M resilvered, 0.00% done
config:

        NAME                                           STATE     READ WRITE CKSUM
        tank                                           DEGRADED     0     0     0
          raidz1-0                                     DEGRADED     0     0     0
            replacing-0                                UNAVAIL      0     0     0
              ata-HGST_HDN724040ALE640_PK1334PCJY9ASS  UNAVAIL      0     1     0  corrupted data
              ata-HGST_HUH728080ALE600_VKHA6YDX        ONLINE       0     0     0  (resilvering)
            ata-HGST_HDN724040ALE640_PK2334PEHG8LAT    ONLINE       0     0     0
            ata-HGST_HDN724040ALE640_PK2334PEHGD37T    ONLINE       0     0     0
            ata-HGST_HDN724040ALE640_PK2338P4H3TJPC    ONLINE       0     0     0

errors: No known data errors

Expand a zpool in place after replacing disks with larger disks

Expansion happens automatically if you have done zpool set autoexpand=on tank. If you did not do that and you find your pool has not expanded, you can perform the following:

List the absolute paths of your devices with something like:

zpool list -v -PH | awk '$1 ~ "^/dev/" {sub("-part1$","",$1) ; print $1 ;}'

Then go through your device list and run

zpool online -e tank <disk-name> # do the expansion
zpool list -v tank # check the EXPANDSZ column for the disk

After doing all of these your pool should be expanded.
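
For pools with many disks, the per-device step can be generated rather than typed. This sketch prints the commands instead of running them so they can be reviewed first; expand_cmds is a hypothetical helper name and the device paths in the usage line are placeholders.

```shell
# Read device paths on stdin and print one "zpool online -e" command per device.
# Printing instead of executing lets you review the commands before running them.
# expand_cmds is a hypothetical helper name.
expand_cmds() {
  pool=$1
  while read -r dev; do
    [ -n "$dev" ] && echo "zpool online -e $pool $dev"
  done
  return 0
}

# Placeholder device paths; feed in your real device list instead.
printf '%s\n' /dev/disk/by-id/ata-EXAMPLE_A /dev/disk/by-id/ata-EXAMPLE_B | expand_cmds tank
```

Once the output looks right, pipe it to sh to run the commands.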

zfs

Show differences between current filesystem state and snapshot state

zfs diff tank@snap tank

Show configured properties for a filesystem

zfs get all tank

Show custom filesystem attributes

## show custom attributes that override inherited attributes
zfs get -s local all tank

Show an overview of all mounted zfs filesystems

## show disk space including free physical disk space and mount info
zfs list

Show specified fields of each filesystem

## show the listed fields of all filesystems
zfs list -t all -o name,referenced,used,written,creation,userused@root

Show only snapshots

zfs list -t snapshot

Show space consumed by file owner

zfs userspace tank

Disable atime updates for a filesystem

zfs set atime=off tank

Set compression to lz4 for a filesystem

zfs set compression=lz4 tank

Set deduplication to enabled for a filesystem

zfs set dedup=on tank

Set a filesystem to readonly

zfs set readonly=on zpoolname/dataset

Share a filesystem over NFS

zfs set sharenfs=on tank

Create a dataset

## create a dataset 'sole' on zpool 'tank'
zfs create tank/sole

Destroy multiple snapshots

zfs destroy tank@20130413-weekly,20130420-weekly,20130428-weekly,20130505-weekly

zfs send / receive

Replicate a zpool (use the latest snapshot name as the source) to a blank zpool:

zfs send -v -D -R tank@20120907-oldest | zfs receive -F -v z2

-D enables a deduplicated stream. This flag is deprecated in OpenZFS 2.0 and later.
-R enables a recursive send of all snapshots and filesystems up to that point.
-F enables deletion of any snapshots on the target that don't exist on the sender.
-v enables verbose mode.

Recursively zfs send a filesystem to a remote host and receive it as a new dataset

zfs send -v -D -R z1@20120907-oldest | ssh otherhost zfs receive -v z2/z1

Show summary of what would be sent

This shows an entire dataset up to the given snapshot

zfs send -n -v -D -R tank@20140531-monthly

Show the space differences between two snapshots

zfs send -n -v -D -i tank@20140531-monthly tank@20141031-monthly

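Show the amount of new space consumed by each monthly snapshot

## list snapshots explicitly, then estimate each incremental send from the previous monthly
prev=""
zfs list -H -t snapshot -o name | grep 'tank@.*monthly' | while read -r snap ; do
  [ -n "$prev" ] && zfs send -n -v -D -i "$prev" "$snap"
  prev=$snap
done 2>&1 | grep send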

Complex examples

Create a raidz called tank

Create a raidz pool from 4 disks and set some properties:

pool=tank
zpool create -f -o ashift=12 "${pool}" raidz /dev/disk/by-id/scsi-SATA_HGST_HDN724040A_PK2338P4H*-part1
zfs set dedup=on "${pool}"
zpool set listsnapshots=on "${pool}"
zfs set atime=off "${pool}"
zfs set compression=lz4 "${pool}"

Create a case insensitive raidz3 out of 50 files

pool=tank
## mkfile ships with macOS and Solaris; on Linux use truncate -s 2G instead
for X in {1..50} ; do mkfile -n 2g "${pool}.${X}" ; done
sudo zpool create -O casesensitivity=insensitive ${pool} raidz3 "${PWD}/${pool}".{1..50}

Troubleshooting

Mount a pool that is giving you trouble

zpool import -o failmode=continue -o readonly=on zpool_name

This helped me get read access to a pool that was kernel panicking with the following error when I tried to import it normally:

Dec  7 14:48:40 localhost kernel: PANIC: blkptr at ffff8803fddb4200 DVA 0 has invalid OFFSET 294940902907904

ZFS on Mac OS X

http://openzfsonosx.org

Create a ZFS partition on /dev/disk3

## Must eject device in Disk Utility first
diskutil partitiondisk /dev/disk3 GPTFormat ZFS %noformat% 100% # strange syntax, but works
zpool create backups1 /dev/disk3s2 # create the zpool
mdutil -i off /Volumes/backups1 # required on MacZFS since spotlight does not function

ZFS on Linux

If you get module errors: modprobe zfs ; ldconfig
If you get permission denied, check SELinux settings.

CentOS 6 Repository

sudo yum install -y epel-release # assumes later CentOS 6 where epel is provided upstream
sudo yum localinstall --nogpgcheck http://archive.zfsonlinux.org/epel/zfs-release.el6.noarch.rpm
sudo yum install zfs -y

Reinstalling when things fail

#!/bin/bash -x
yum install -y kernel-devel-$(uname -r)
zfs_version=0.6.5.4
dkms remove  -m zfs -v "${zfs_version}" --all
dkms remove  -m spl -v "${zfs_version}" --all
dkms add     -m spl -v "${zfs_version}" --force
dkms add     -m zfs -v "${zfs_version}" --force
dkms install -m spl -v "${zfs_version}" --force
dkms install -m zfs -v "${zfs_version}" --force

Inspect the rpm for what scripts it runs

This is useful for debugging failures after kernel upgrade.

rpm -q --scripts zfs-dkms

Building on CentOS 6

yum groupinstall "Development tools" && yum install -y libuuid-devel zlib-devel bc lsscsi mdadm parted kernel-debug
## For spl, then again for zfs:
./configure && make && make rpm && rpm -i *64.rpm
