ZFS stands for Zettabyte File System. ZFS is a 128-bit file system that was first
introduced in June 2006 with the Solaris 10 6/06 release. ZFS supports up to 256
quadrillion zettabytes of storage, which means there is effectively no limit on the
number of file systems or on the number of files and directories that can exist on ZFS.
ZFS does not replace the traditional file system (UFS in Solaris), and it does not
improve the existing UFS technology. Instead, it is a new approach to managing
data in Solaris. ZFS is more robust, scalable and easier to administer than
traditional file systems, but it will take some time for it to capture the market and
replace UFS, the most stable file system for Solaris to date.
ZFS Pools
ZFS pools are storage pools used to manage physical disks/storage. With the traditional
UFS file system we used to partition the disk and then create file systems on
the slices. In ZFS the approach is completely different: we create a
pool of block devices (disks), and the file systems are created from the pool. It
means whatever space is free in the pool can be used to create file systems as
required. You can think of pools as the disk groups used in VxVM.
ZFS requirements
ZFS Terminology
Checksum
A 256-bit hash of the data in a file system block, used to verify data integrity.
Clone
A file system whose initial contents are identical to the contents of a snapshot.
Dataset
A generic name for the following ZFS entities: clones, file systems, snapshots and
volumes. Each dataset is identified by a unique name in the ZFS namespace.
Default FS
A file system that is created by default when using Solaris Live Upgrade to migrate
from UFS to a ZFS root FS. The current set of default file systems is /, /usr, /opt and /var.
ZFS FS
A ZFS dataset that is mounted within the standard system namespace and behaves like
other traditional file systems.
Mirror
A virtual device, also called a RAID-1 device, that stores identical copies of data on
two or more disks.
Pool
A logical group of block devices describing the layout and physical characteristics of
the available storage. Space for datasets is allocated from a pool. Also called a
storage pool or simply a pool.
RAID-Z
A virtual device that stores data and parity on multiple disks, similar to RAID-5.
Resilvering
The process of transferring data from one device to another device. For example, when a
mirror component is taken offline and then later put back online, the data from the
up-to-date mirror component is copied to the newly restored mirror component. This
process is called mirror resynchronization in traditional volume management products.
Shared FS
The set of file systems that are shared between the alternate boot environment and the
primary boot environment. This set includes file systems such as /export and the area
reserved for swap. Shared file systems might also contain zone roots.
Snapshot
A read-only copy of a file system or volume at a given point in time.
Virtual Device
A logical device in a pool, which can be a physical device, a file or a collection of devices.
Volume
A dataset used to emulate a physical device. For example, you can create a ZFS volume
as a swap device.
1.) RAID-0 : Data is distributed (striped) across one or more disks. There is no redundancy in
RAID-0; if any disk fails, all data is lost. That is the reason RAID-0 is the least
preferred.
2.) RAID-1 : Two exact copies of the data are maintained on separate disks. There will
be no data loss as long as at least one mirror survives. This is the most commonly
used RAID level in any volume manager.
Note: You can use the -f option if zpool create reports errors during pool creation.
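The pool itself would have been created with something along these lines (a minimal sketch; the pool name and disk match the output shown below):
# zpool create mypool c1t0d0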
# df -h mypool
Filesystem size used avail capacity Mounted on
mypool 67G 21K 67G 1% /mypool
This output shows a pool named mypool and a ZFS file system /mypool
which can be used to store data. ZFS creates this mount point directory automatically.
We can check the space availability by using the zpool list command as shown below.
# zpool list
NAME SIZE ALLOC FREE CAP HEALTH ALTROOT
mypool 68G 124K 68.0G 0% ONLINE
Any errors in the zpool can be checked with the zpool status command, as shown below:
# zpool status
pool: mypool
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
c1t0d0 ONLINE 0 0 0
errors: No known data errors
Now I will create a new ZFS file system in the same zpool. As zpool list
shows that the pool has 68 GB of space, we can use the free space to
create a new file system.
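A minimal sketch of the likely creation command, assuming the dataset name shown in the output below:
# zfs create mypool/myfs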
# df -h /mypool/myfs
Filesystem size used avail capacity Mounted on
mypool/myfs 67G 21K 67G 1% /mypool/myfs
This is how a simple ZFS file system is created. Next I will show how to change
some of the file system properties that are most commonly used in ZFS.
I will create a new ZFS file system named mypool/myquotefs and set its quota to
20 GB. This will prevent this file system from using up all the space in the pool
and starving the other file systems.
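A hedged sketch of the likely commands (the dataset name and quota value are taken from the output below):
# zfs create mypool/myquotefs
# zfs set quota=20G mypool/myquotefs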
# df -h /mypool/myquotefs
Filesystem size used avail capacity Mounted on
mypool/myquotefs 20G 21K 20G 1% /mypool/myquotefs
Here I will change the file system mount point to the desired name by changing the ZFS
mountpoint property.
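A minimal sketch of that property change, assuming the mount point shown in the output below:
# zfs set mountpoint=/test mypool/myfs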
# df -h /test
Filesystem size used avail capacity Mounted on
mypool/myfs 67G 21K 67G 1% /test
#
zfs list will list all the active ZFS file systems and volumes on the server:
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 140K 66.9G 23K /mypool
mypool/myfs 21K 66.9G 21K /test
mypool/myquotefs 21K 20.0G 21K /mypool/myquotefs
The -r option is used to list the datasets in a zpool recursively; it is followed by the
pool name.
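For example, a minimal sketch using the pool created above:
# zfs list -r mypool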
# zpool status
pool: mypool
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
c1t0d0 ONLINE 0 0 0
errors: No known data errors
Note: To check all of the properties you can use zfs get all; I will show its output
later in the post.
We can rename a ZFS file system using the zfs rename command. The syntax is:
# zfs rename <old-dataset-name> <new-dataset-name>
Mirroring in ZFS:
=================
Next I will demonstrate mirroring of a ZFS file system and operations such as taking a
disk out of a ZFS pool and inserting a disk into the pool. You will clearly notice the
status while inserting and removing a disk. You will also see how to check the
ZFS parameters and how to change them using the zfs set command. I will also try
to show you how to destroy a ZFS file system and a ZFS pool. This should give you a basic
platform for getting up to speed with ZFS, which is going to be the primary
file system in Solaris 11.
Note: You will also find the procedure to change a disk under ZFS in the post below.
You need to take the disk offline from the respective pool, replace the disk with a new
one, bring the replaced disk online in the pool again, and monitor it until it is
completely synced.
Note: Both disks should be of the same size, or the new disk that we are going to
mirror onto should be larger than the existing one.
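The mirror shown in the status output below would have been created by attaching a second disk to the existing single-disk pool; a hedged sketch of that command (device names taken from the output):
# zpool attach mypool c1t0d0 c1t2d0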
# zpool status
pool: mypool
state: ONLINE
scrub: resilver completed after 0h0m with 0 errors on Sun Oct 9 22:56:23 2011
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
c1t0d0 ONLINE 0 0 0
c1t2d0 ONLINE 0 0 0 143K resilvered
errors: No known data errors
# zpool list
NAME SIZE ALLOC FREE CAP HEALTH ALTROOT
mypool 68G 156K 68.0G 0% ONLINE
# zpool iostat
# zpool status -x
all pools are healthy
# zpool status
pool: mypool
state: ONLINE
scrub: resilver completed after 0h0m with 0 errors on Sun Oct 9 22:56:23 2011
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
c1t0d0 ONLINE 0 0 0
c1t2d0 ONLINE 0 0 0 143K resilvered
errors: No known data errors
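The next status output shows the pool reduced to a single disk again, which suggests the original disk was detached; a hedged sketch of the likely command:
# zpool detach mypool c1t0d0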
# zpool status
pool: mypool
state: ONLINE
scrub: resilver completed after 0h0m with 0 errors on Sun Oct 9 22:56:23 2011
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
c1t2d0 ONLINE 0 0 0 143K resilvered
errors: No known data errors
#
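The mirror is then re-created by attaching c1t0d0 back to the remaining disk; a hedged sketch of the likely command:
# zpool attach mypool c1t2d0 c1t0d0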
# zpool status
pool: mypool
state: ONLINE
scrub: resilver completed after 0h0m with 0 errors on Sun Oct 9 23:02:54 2011
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
c1t2d0 ONLINE 0 0 0
c1t0d0 ONLINE 0 0 0 154K resilvered
errors: No known data errors
#
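To demonstrate the degraded state, one side of the mirror is taken offline; a hedged sketch of the likely command:
# zpool offline mypool c1t2d0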
# zpool status
pool: mypool
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using zpool online or replace the device with
zpool replace.
scrub: resilver completed after 0h0m with 0 errors on Sun Oct 9 23:02:54 2011
config:
NAME STATE READ WRITE CKSUM
mypool DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
c1t2d0 OFFLINE 0 0 0
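The disk is then brought back online and resilvers automatically; a hedged sketch of the likely command:
# zpool online mypool c1t2d0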
# zpool status
pool: mypool
state: ONLINE
scrub: resilver completed after 0h0m with 0 errors on Sun Oct 9 23:03:57 2011
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
c1t2d0 ONLINE 0 0 0 24K resilvered
c1t0d0 ONLINE 0 0 0
errors: No known data errors
#
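A scrub can be run to verify all the data in the pool; a hedged sketch of the likely command:
# zpool scrub mypool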
# zpool status
pool: mypool
state: ONLINE
scrub: scrub completed after 0h0m with 0 errors on Sun Oct 9 23:06:37 2011
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
c1t2d0 ONLINE 0 0 0
c1t0d0 ONLINE 0 0 0
errors: No known data errors
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
Note: I am now changing the inherited mountpoint property so that the file systems that
ZFS mounts automatically are no longer visible. An example is given below:
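A hedged sketch of the likely command (in the zfs list output that follows, the pool and mypool/myquotefs show a mountpoint of none, while the explicitly set /test mount point is preserved):
# zfs set mountpoint=none mypool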
Note: I suggest you go through the ZFS file system property index table, which you can
easily find on Google or on the Sun site. That will give you a better idea of all the
parameters that can be changed and the effect they have.
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 164K 66.9G 23K none
mypool/myfs 21K 66.9G 21K /test
mypool/myquotefs 21K 20.0G 21K none
# df -k
Filesystem kbytes used avail capacity Mounted on
/dev/vx/dsk/bootdg/rootvol
8262869 5102426 3077815 63% /
/devices 0 0 0 0% /devices
ctfs 0 0 0 0% /system/contract
proc 0 0 0 0% /proc
mnttab 0 0 0 0% /etc/mnttab
swap 23798088 1624 23796464 1% /etc/svc/volatile
objfs 0 0 0 0% /system/object
sharefs 0 0 0 0% /etc/dfs/sharetab
/platform/sun4u-us3/lib/libc_psr/libc_psr_hwcap1.so.1
8262869 5102426 3077815 63% /platform/sun4u-us3/lib/libc_psr.so.1
/platform/sun4u-us3/lib/sparcv9/libc_psr/libc_psr_hwcap1.so.1
8262869 5102426 3077815 63% /platform/sun4u-us3/lib/sparcv9/libc_psr.so.1
fd 0 0 0 0% /dev/fd
swap 23796480 16 23796464 1% /tmp
swap 23796528 64 23796464 1% /var/run
swap 23796464 0 23796464 0% /dev/vx/dmp
swap 23796464 0 23796464 0% /dev/vx/rdmp
/dev/vx/dsk/bootdg/var_crash
20971520 71784 19593510 1% /var/crash
mypool/myfs 70189056 21 70188892 1% /test
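The mount points are then restored; a hedged sketch of one possible command (the default /mypool mount point reappears in the next listing):
# zfs set mountpoint=/mypool mypool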
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 164K 66.9G 23K /mypool
mypool/myfs 21K 66.9G 21K /test
mypool/myquotefs 21K 20.0G 21K /mypool/myquotefs
# df -k
Filesystem kbytes used avail capacity Mounted on
/dev/vx/dsk/bootdg/rootvol
8262869 5102427 3077814 63% /
/devices 0 0 0 0% /devices
ctfs 0 0 0 0% /system/contract
proc 0 0 0 0% /proc
mnttab 0 0 0 0% /etc/mnttab
swap 23796840 1624 23795216 1% /etc/svc/volatile
objfs 0 0 0 0% /system/object
sharefs 0 0 0 0% /etc/dfs/sharetab
/platform/sun4u-us3/lib/libc_psr/libc_psr_hwcap1.so.1
8262869 5102427 3077814 63% /platform/sun4u-us3/lib/libc_psr.so.1
/platform/sun4u-us3/lib/sparcv9/libc_psr/libc_psr_hwcap1.so.1
8262869 5102427 3077814 63% /platform/sun4u-us3/lib/sparcv9/libc_psr.so.1
fd 0 0 0 0% /dev/fd
swap 23795232 16 23795216 1% /tmp
swap 23795280 64 23795216 1% /var/run
swap 23795216 0 23795216 0% /dev/vx/dmp
swap 23795216 0 23795216 0% /dev/vx/rdmp
/dev/vx/dsk/bootdg/var_crash
20971520 71784 19593510 1% /var/crash
mypool/myfs 70189056 21 70188892 1% /test
mypool 70189056 23 70188892 1% /mypool
mypool/myquotefs 20971520 21 20971499 1% /mypool/myquotefs
#
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 164K 66.9G 23K /mypool
mypool/myfs 21K 66.9G 21K /test
mypool/myquotefs 21K 20.0G 21K /mypool/myquotefs
I hope this post will help beginners get up to speed with ZFS. I will try to
cover the more complex ZFS tasks in my coming posts. This is just an overview, and I
think it will be useful for many Solaris administrators. :-)
In our earlier post ( To get Started with ZFS ), Yogesh discussed various
ZFS pool and file system operations. In this post I will demonstrate the
redundancy capabilities of the different ZFS pool types and the recovery procedures
for disk failure scenarios. I performed this lab on Solaris 11, but the
instructions are the same for Solaris 10.
1. Simple and Striped Pool ( equivalent to RAID-0; data is non-redundant)
2. Mirrored Pool ( equivalent to RAID-1)
3. Raidz pool ( equivalent to single-parity RAID-5; can withstand up to one
disk failure)
4. Raidz-2 pool ( equivalent to dual-parity RAID-5; can withstand up to two disk
failures)
5. Raidz-3 pool ( equivalent to triple-parity RAID-5; can withstand up to three
disk failures)
A RAIDZ configuration with N disks of size X with P parity disks can hold
approximately (N-P)*X bytes and can withstand P device(s) failing before data
integrity is compromised.
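For example, the rzpool used later in this post is a raidz1 pool (P = 1) built from four disks (N = 4) of roughly 2 GB each, so it can hold approximately (4 - 1) * 2 GB, or about 6 GB (reported as 5.8G), and it can survive a single disk failure.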
Disk Configuration:
root@gurkulunix3:~# echo|format
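This first scenario uses a simple, non-redundant pool; a hedged sketch of how it might have been created and checked, using the pool and device names that appear in the status output below:
# zpool create poolnr c3t2d0 c3t3d0
# zpool status poolnr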
  pool: poolnr
 state: ONLINE
  scan: none requested
config:

NAME STATE
poolnr ONLINE
c3t2d0 ONLINE
c3t3d0 ONLINE
root@gurkulunix3:~# echo|format
Searching for disks...done

  pool: poolnr
 state: UNAVAIL
status: One or more devices are faulted in response to persistent errors. There
        are insufficient replicas for the pool to continue functioning.
action: Destroy and re-create the pool from a backup source. Manually marking
        the device repaired using 'zpool clear' may allow some data to be recovered.
  scan: none requested
config:

NAME STATE
poolnr UNAVAIL 0 insufficient replicas
c3t2d0 FAULTED
c3t6d0 ONLINE
From the above scenario, we can see that a simple ZFS pool cannot withstand any disk
failure.
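The next scenario uses a mirrored pool; a hedged sketch of how it and its test file system might have been created, based on the names that appear in the output below:
# zpool create mpool mirror c3t4d0 c3t7d0
# zfs create mpool/mtestfs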
root@gurkulunix3:~# echo|format
Searching for disks...done

mpool 2.0G 32K 2.0G 1% /mpool
mpool/mtestfs 2.0G 31K 2.0G 1% /mpool/mtestfs

NAME STATE
mpool ONLINE
mirror-0 ONLINE
c3t4d0 ONLINE
c3t7d0 ONLINE
root@gurkulunix3:~# echo|format
Searching for disks...done

NAME STATE
mpool DEGRADED
mirror-0 DEGRADED
c3t4d0 ONLINE
c3t7d0 UNAVAIL 0 cannot open
After physically replacing the failed disk (placing a new disk in the same location):
root@gurkulunix3:~# echo|format
Searching for disks...done
>>> Label the new disk with an SMI label (a requirement for attaching it to a ZFS pool)
root@gurkulunix3:~# format -L vtoc -d c3t7d0
Searching for disks...done
selecting c3t7d0
[disk formatted]
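After labeling, the new disk is put back into the mirror; a hedged sketch of the likely command:
# zpool replace mpool c3t7d0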
NAME STATE
mpool ONLINE
mirror-0 ONLINE
c3t4d0 ONLINE
c3t7d0 ONLINE
Single and Double Disk Failure Scenarios for ZFS Raid-Z Pool
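The raidz pool used in this scenario would have been built from four disks; a hedged sketch of its creation, based on the pool, device and file system names shown in the output below:
# zpool create rzpool raidz1 c3t2d0 c3t3d0 c3t4d0 c3t7d0
# zfs create rzpool/r5testfs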
root@gurkulunix3:~# echo|format
Searching for disks...done

NAME STATE
rzpool ONLINE
raidz1-0 ONLINE
c3t2d0 ONLINE
c3t3d0 ONLINE
c3t4d0 ONLINE
c3t7d0 ONLINE

rzpool/r5testfs 5.8G 575M 5.3G 10% /rzpool/r5testfs
root@gurkulunix3:/downloads# cd /rzpool/r5testfs/
root@gurkulunix3:/rzpool/r5testfs# ls -l
total 1176598
-rw-r--r--   1 root     root
root@gurkulunix3:/rzpool/r5testfs#
root@gurkulunix3:~# echo|format
Searching for disks...done
Specify disk (enter its number): Specify disk (enter its number):
NAME STATE
rzpool DEGRADED
raidz1-0 DEGRADED
c3t2d0 ONLINE
c3t3d0 ONLINE
c3t4d0 ONLINE
c3t7d0 UNAVAIL 0 cannot open

rzpool/r5testfs 5.8G 575M 5.3G 10% /rzpool/r5testfs
root@gurkulunix3:~# cd /rzpool/r5testfs
root@gurkulunix3:/rzpool/r5testfs# ls -l
total 1176598
-rw-r--r--   1 root     root
root@gurkulunix3:/rzpool/r5testfs#
After replacing the failed disk with a new disk in the same location:
root@gurkulunix3:~# echo|format
Searching for disks...done
state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid. Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using zpool replace.
see: http://www.sun.com/msg/ZFS-8000-4J
scan: none requested
config:
NAME STATE
rzpool DEGRADED
raidz1-0 DEGRADED
c3t2d0 ONLINE
c3t3d0 ONLINE
c3t4d0 ONLINE
c3t7d0 FAULTED 0 0 0 corrupted data   <<== the state changed to FAULTED because the
zpool can now see the new disk, but it has no/corrupted data
errors: No known data errors
Replacing the Failed Disk Component in the Zpool
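A hedged sketch of the likely replace command, after which the pool resilvers:
# zpool replace rzpool c3t7d0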
scan: resilvered 192M in 0h1m with 0 errors on Sun Sep 16 11:50:49 2012
config:
NAME STATE
rzpool ONLINE
raidz1-0 ONLINE
c3t2d0 ONLINE
c3t3d0 ONLINE
c3t4d0 ONLINE
c3t7d0 ONLINE
root@gurkulunix3:~# echo|format
Searching for disks...done

NAME STATE
rzpool UNAVAIL 0 insufficient replicas
raidz1-0 UNAVAIL 0 insufficient replicas
c3t2d0 ONLINE
c3t3d0 ONLINE
c3t4d0 UNAVAIL 0 cannot open
c3t7d0 UNAVAIL 0 cannot open
Conclusion: the /rzpool/r5testfs file system is no longer available for use, and the
zpool cannot be recovered from this state.
The RAIDZ2 and RAIDZ3 disk failure scenarios are too long to include here; I will cover
them in a separate post.
The ZFS file system is a new kind of file system that fundamentally changes the
way file systems are administered, with the features described below:
ZFS uses the concept of storage pools to manage physical storage. Historically,
file systems were constructed on top of a single physical device. To address
multiple devices and provide for data redundancy, the concept of a volume
manager was introduced to provide a representation of a single device so that file
systems would not need to be modified to take advantage of multiple devices.
This design added another layer of complexity and ultimately prevented certain
file system advances because the file system had no control over the physical
placement of data on the virtualized volumes.
Transactional Semantics
ZFS is a transactional file system, which means that the file system state is
always consistent on disk. In a transactional file system, data is managed using
copy-on-write semantics. Data is never overwritten, and any sequence of
operations is either entirely committed or entirely ignored. Thus, the file system
can never be corrupted through accidental loss of power or a system crash.
Although the most recently written pieces of data might be lost, the file system
itself will always be consistent. In addition, synchronous data (written using the
O_DSYNC flag) is always guaranteed to be written before returning, so it is never lost.
Checksums and Self-Healing Data
With ZFS, all data and metadata is verified using a user-selectable checksum
algorithm. In addition, ZFS provides for self-healing data. ZFS supports storage
pools with varying levels of data redundancy. When a bad data block is detected,
ZFS fetches the correct data from another redundant copy and repairs the bad
data, replacing it with the correct data.
Unparalleled Scalability
ZFS is a 128-bit file system, which allows 256 quadrillion zettabytes of storage. All
metadata is allocated dynamically, so there is no need to preallocate inodes or
otherwise limit the scalability of the file system when it is first created. All the
algorithms have been written with scalability in mind. Directories can have up to
2^48 (256 trillion) entries, and no limit exists on the number of file systems or the
number of files that can be contained within a file system.
ZFS Snapshots
A snapshot is a read-only copy of a file system or volume at a given point in time, and
it initially consumes no extra space in the pool. As data within the active dataset
changes, the snapshot consumes disk space by continuing to reference the old data. As a
result, the snapshot prevents that data from being freed back to the pool.
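A minimal sketch of working with snapshots, using the hypothetical dataset and snapshot names mypool/myfs and @snap1:
# zfs snapshot mypool/myfs@snap1        (create a snapshot)
# zfs list -t snapshot                  (list snapshots)
# zfs rollback mypool/myfs@snap1        (roll the file system back to the snapshot)
# zfs destroy mypool/myfs@snap1         (remove the snapshot and free its space)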
#zpool create raid-pool-1 raidz1 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0
#zpool create raid-pool-1 raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0
#zfs mount -a
#zfs umount -a
#zfs list
# zpool status -x
# zpool list
#zpool iostat 2
#zpool iostat -v 2
#zpool online
#zpool clear
#zpool import
#zpool import -d
List snapshots: # zfs list -t snapshot
Destroy a clone: # zfs destroy <clone-name>