3 - File Systems
By Peter Baer Galvin
For Usenix
Last Revision April 2009
Overview
Objectives
Choosing the most appropriate file system(s)
Save time
UFS / SDS
Veritas FS / VM (not in detail)
ZFS
Or...
Use VirtualBox
Use your own system
Use a remote machine you have legitimate access to
Consider
ISV support
Priorities
Sometimes it fails
Many limits
Many features lacking (compared to ZFS)
# metadb -a /dev/dsk/c0t0d0s5
# metadb -a /dev/dsk/c0t0d0s6
# metadb -a /dev/dsk/c0t1d0s5
# metadb -a /dev/dsk/c0t1d0s6
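The metadb commands create the SVM state database replicas; the mirror itself is built with metainit, metaroot, and metattach. A minimal sketch for the root slice, assuming illustrative metadevice names (d10/d11/d12) and root on s0 of each disk:
# metainit -f d11 1 1 c0t0d0s0
# metainit d12 1 1 c0t1d0s0
# metainit d10 -m d11
# metaroot d10
(reboot, then attach the second submirror)
# metattach d10 d12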
Now the root disk is mirrored, and commands such as Solaris upgrade, Live Upgrade, and boot understand that.
Transactional operation
Keeps things always consistent on disk
Removes almost all constraints on I/O order
Allows us to get huge performance wins
[Figure: traditional model: each filesystem sits on its own volume (stripe or mirror)]
Abstraction: malloc/free
No partitions to manage
Grow/shrink automatically
All bandwidth always available
All storage in the pool is shared
[Figure: pooled model: many filesystems share a single storage pool]
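To make the malloc/free analogy concrete, a small illustration (pool and filesystem names are made up): new filesystems need no partitioning, draw from the shared pool, and can be bounded with quotas or guaranteed space with reservations:
# zfs create mypool/home
# zfs create mypool/home/alice
# zfs set quota=10g mypool/home/alice
# zfs set reservation=5g mypool/home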
zfs mount [-vO] [-o opts] <-a | filesystem>
zfs unmount [-f] <-a | filesystem|mountpoint>
zfs share <-a | filesystem>
zfs unshare [-f] <-a | filesystem|mountpoint>
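For example (dataset name illustrative), mounting and sharing are property-driven, and the subcommands above apply them, e.g. all at once with -a:
# zfs set mountpoint=/export/home tank/home
# zfs set sharenfs=on tank/home
# zfs mount -a
# zfs share -a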
zfs/fs-name The name of the filesystem. If the special filesystem name "//" is used, the system snapshots only filesystems with the ZFS user property "com.sun:auto-snapshot:<label>" set to true. So, to take frequent snapshots of tank/timf, run the following zfs command:
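# zfs set com.sun:auto-snapshot:frequent=true tank/timf
(Here "frequent" is the <label> for this snapshot schedule.)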
zfs/interval The unit of time used by "zfs/period". When set to none, we don't take automatic snapshots, but leave an SMF instance available for users to manually fire the method script whenever they want, which is useful for snapshotting on system events.
zfs/keep How many snapshots to retain; e.g. setting this to "4" keeps only the four most recent snapshots. When each new snapshot is taken, the oldest is destroyed. If a snapshot has been cloned, the service drops to maintenance mode when attempting to destroy that snapshot. Setting this to "all" keeps all snapshots.
zfs/period How often you want to take snapshots, in intervals set according to "zfs/interval" (e.g. every 10 days).
zfs/backup-lock You shouldn't need to change this; it should be set to "unlocked" by default. We use it to indicate when a backup is running.
zfs/label An optional label that can be used to differentiate this set of snapshots from others. If multiple schedules are running on the same machine, using a distinct label for each schedule is needed; otherwise one schedule could remove snapshots taken by another schedule according to its snapshot-retention policy (see "zfs/keep").
http://blogs.sun.com/erwann/resource/menu-location.png
On Thu, Nov 17, 2005 at 05:21:36AM -0800, Jim Lin wrote:
> Does ZFS reorganize (ie. defrag) the files over time?
Not yet.
Believe me, we've thought about this a lot. There is a lot we can do to
improve performance, and we're just getting started.
full backups
incremental backups
http://www.youtube.com/watch?v=CN6iDzesEs0&fmt=18
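ZFS implements both styles with zfs send and zfs receive; a minimal sketch (pool, filesystem, and snapshot names illustrative):
Full backup:
# zfs snapshot tank/home@monday
# zfs send tank/home@monday | zfs receive backup/home
Incremental since monday:
# zfs snapshot tank/home@tuesday
# zfs send -i monday tank/home@tuesday | zfs receive backup/home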
Objectives:
Requirements:
Identify the storage available for adding to the ZFS pool using the format(1M) command. Your output will vary from that shown here:
# format
Searching for disks...done
AVAILABLE DISK SELECTIONS:
0. c0t2d0
/pci@0,0/pci1022,7450@2/pci1000,3060@3/sd@2,0
1. c0t3d0
/pci@0,0/pci1022,7450@2/pci1000,3060@3/sd@3,0
Specify disk (enter its number): ^D
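The lab text stops at disk identification; a plausible next step, assuming the two disks above and the pool name used later in this section, would be:
# zpool create mypool mirror c0t2d0 c0t3d0
# zpool status mypool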
Objectives:
Requirements:
Two servers (SPARC or x64 based) - one from the previous example - running the OpenSolaris OS.
# svcs nfs/server
STATE STIME FMRI
disabled 6:49:39 svc:/network/nfs/server:default
# svcadm enable nfs/server
Share the ZFS filesystem over NFS:
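Setting the sharenfs property both shares the filesystem now and persists the share across reboots (filesystem name matching the client mount below):
# zfs set sharenfs=on mypool/myfs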
# svcs nfs/client
STATE STIME FMRI
disabled 6:47:03 svc:/network/nfs/client:default
# svcadm enable nfs/client
Mount the shared filesystem on the client:
# mkdir /mountpoint
# mount -F nfs x4100:/mypool/myfs /mountpoint
# df -h /mountpoint
Filesystem size used avail capacity Mounted on
x4100:/mypool/myfs 9.8G 18K 9.8G 1% /mountpoint
Objectives:
Configure a CIFS share on one machine (from the previous example) and make it available on the other machine.
Requirements:
# svcs smb/server
STATE STIME FMRI
disabled 6:49:39 svc:/network/smb/server:default
# svcadm enable smb/server
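The sharing step itself is a ZFS property (filesystem name inferred from the mount used below; the default CIFS share name replaces "/" with "_"):
# zfs set sharesmb=on mypool/myfs2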
Because we have not explicitly named the share, we can examine the default name assigned to it using the following command:
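One possibility, assuming the sharemgr CLI available on OpenSolaris:
# sharemgr show -vp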
Step 5: Edit the file /etc/pam.conf to support creation of an encrypted version of the user's password for CIFS.
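The addition is the standard pam_smb_passwd line; after adding it, re-set each user's password so an SMB-style hash is generated (username illustrative):
other   password required       pam_smb_passwd.so.1     nowarn
# passwd myuser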
Step 8: Make a mount point on the client and mount the CIFS resource from the server.
Mount the resource across the network and check it using the following command sequence:
# mkdir /mountpoint2
# mount -F smbfs //root@x4100/mypool_myfs2 /mountpoint2
Password: *******
# df -h /mountpoint2
Filesystem size used avail capacity Mounted on
//root@x4100/mypool_myfs2 9.8G 18K 9.8G 1% /mountpoint2
# df -n
/ : ufs
/mountpoint : nfs
/mountpoint2 : smbfs
Objectives
Requirements:
Step 1: Start the SCSI Target Mode Framework (STMF) and verify it.
Use the following commands to start up and check the service on the host that provides the target:
# svcs stmf
STATE STIME FMRI
disabled 19:15:25 svc:/system/device/stmf:default
# svcadm enable stmf
# stmfadm list-state
Operational Status: online
Config Status : initialized
Use the following command to ensure that the target mode framework can see the HBA ports:
# stmfadm list-target -v
Target: wwn.210000E08B909221
Operational Status: Online
Alias : qlt0,0
Sessions : 4
Initiator: wwn.210100E08B272AB5
Alias: ute198:qlc1
Logged in since: Thu Mar 27 16:38:30 2008
Initiator: wwn.210100E08B296A60
Alias: ute198:qlc3
Logged in since: Thu Mar 27 16:38:30 2008
Initiator: wwn.210000E08B072AB5
Alias: ute198:qlc0
Logged in since: Thu Mar 27 16:38:30 2008
Initiator: wwn.210000E08B096A60
Alias: ute198:qlc2
Logged in since: Thu Mar 27 16:38:30 2008
Target: wwn.210100E08BB09221
Operational Status: Online
Provider Name : qlt
Alias : qlt1,0
Sessions : 4
Initiator: wwn.210100E08B272AB5
Alias: ute198:qlc1
Logged in since: Thu Mar 27 16:38:30 2008
Initiator: wwn.210100E08B296A60
Alias: ute198:qlc3
Logged in since: Thu Mar 27 16:38:30 2008
Initiator: wwn.210000E08B072AB5
Alias: ute198:qlc0
Logged in since: Thu Mar 27 16:38:30 2008
Initiator: wwn.210000E08B096A60
Alias: ute198:qlc2
Logged in since: Thu Mar 27 16:38:30 2008
Use ZFS to create a volume (zvol) for use as the storage behind the target:
# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
mypool 68G 94K 68.0G 0% ONLINE -
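The volume creation itself (size chosen to match the 5 GB LU shown below):
# zfs create -V 5g mypool/myvol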
The zvol becomes the SCSI logical unit (disk) behind the target:
# sbdadm create-lu /dev/zvol/rdsk/mypool/myvol
Created the following LU:
GUID DATA SIZE SOURCE
6000ae4093000000000047f3a1930007 5368643584 /dev/zvol/rdsk/mypool/myvol
# stmfadm list-lu -v
LU Name: 6000AE4093000000000047F3A1930007
Step 5: Find the initiator HBA ports to which to map the LUs.
Discover HBA ports on the initiator host using the following command:
# fcinfo hba-port
HBA Port WWN: 25000003ba0ad303
Port Mode: Initiator
Port ID: 1
OS Device Name: /dev/cfg/c5
Manufacturer: QLogic Corp.
Model: 2200
Firmware Version: 2.1.145
FCode/BIOS Version: ISP2200 FC-AL Host Adapter Driver:
Type: L-port
State: online
Supported Speeds: 1Gb
Current Speed: 1Gb
Node WWN: 24000003ba0ad303
. . .
Step 6: Create a host group and add the world-wide names (WWNs) of the initiator host HBA ports to it.
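The group-creation commands themselves are not shown; a sketch using the initiator port WWN from the fcinfo output above and the group name used in the add-view command that follows:
# stmfadm create-hg mygroup
# stmfadm add-hg-member -g mygroup wwn.25000003BA0AD303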
With the host group created, you're now ready to export the logical unit. This is accomplished by adding a view entry to the logical unit using this host group, as shown in the following command:
# stmfadm add-view -h mygroup 6000AE4093000000000047F3A1930007
First, force the devices on the initiator host to be rescanned with a simple script:
#!/bin/ksh
# For each local HBA port WWN, query SCSI target information (-s) on
# its remote ports, forcing the system to rediscover exported LUs.
fcinfo hba-port | grep "^HBA" | awk '{print $4}' | while read ln
do
    fcinfo remote-port -p $ln -s >/dev/null 2>&1
done
The disk exported over FC should then appear in the format list:
# format
Searching for disks...done
c6t6000AE4093000000000047F3A1930007d0: configured with capacity of 5.00GB
...
partition> p
Current partition table (default):
Total disk cylinders available: 20477 + 2 (reserved cylinders)
partition>
Can upgrade by using Live Upgrade (LU) to mirror the root to a second disk (ZFS pool) and upgrading there, then booting there.
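A hedged sketch of those steps, assuming a new root pool named rpool on the second disk and the BE name used with luactivate below (disk and media paths illustrative):
# zpool create rpool c0t1d0s0
# lucreate -n zfsBE -p rpool
# luupgrade -u -n zfsBE -s /path/to/install_image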
Run luactivate on the newly upgraded alternate BE so that when the system is rebooted, it will be the new primary BE:
# luactivate zfsBE
A lot of comparisons have been done, and will continue to be done, between ZFS and other filesystems. People
tend to focus on performance, features, and CLI tools as they are easier to compare. I thought I'd take a moment
to look at differences in the code complexity between UFS and ZFS. It is well known within the kernel group that
UFS is about as brittle as code can get. 20 years of ongoing development, with feature after feature being
bolted on tends to result in a rather complicated system. Even the smallest changes can have wide ranging
effects, resulting in a huge amount of testing and inevitable panics and escalations. And while SVM is
considerably newer, it is a huge beast with its own set of problems. Since ZFS is both a volume manager and a
filesystem, we can use this script written by Jeff to count the lines of source code in each component. Not a true
measure of complexity, but a reasonable approximation to be sure. Running it on the latest version of the gate
yields:
The numbers are rather astounding. Having written most of the ZFS CLI, I found the most horrifying number to
be the 162,000 lines of userland code to support SVM. This is more than twice the size of all the ZFS code
(kernel and user) put together! And in the end, ZFS is about 1/5th the size of UFS and SVM. I wonder what
those ZFS numbers will look like in 20 years...
performance
• Enterprise Grade Flash
> 3-5 year lifetime
[Figure: access-latency hierarchy on a log scale (100 ms down to 1 ns): tape, HDD, flash/SSD, DRAM, CPU]
7410: 288 x 3.5" SATA-II disks; up to 287TB* total storage; Hybrid Storage Pool with read- and write-optimized SSD
7210: 48 x 3.5" SATA-II disks; up to 46TB total storage; Hybrid Storage Pool with write-optimized SSD
7110: 16 x 2.5" SAS disks, 2.3TB; standard storage pool (SSD is not used)
[Kasper] Kasper and McClellan, Automating Solaris Installations, SunSoft Press, 1995
DTrace
http://users.tpg.com.au/adsln4yb/dtrace.html
http://www.solarisinternals.com/si/dtrace/index.php
http://www.sun.com/bigadmin/content/dtrace/