
Getting started with Sun ZFS file system

Sun ZFS is a new kind of file system: a fundamentally new approach to data management. It is the brainchild of Jeff Bonwick, Sun Microsystems Chief Technical Officer of Storage Technologies, who spent years working on the Solaris Virtual Memory system. ZFS applies virtual memory concepts to storage.

When you add a DIMM, you don't partition it, you don't allocate it, and when it is replaced you don't fsck it. Memory management is something we all take for granted because complex software masks management away. ZFS was born with the intent to bring the same advantages to storage.

At first glance, the most striking feature of ZFS is that it combines the volume manager (which virtualizes disks, typically via RAID) and the file system into a single piece of software.

Disks are formed into a storage "pool" using a single simple command:

root@quadra ~$ zpool create mypool c4t0d0 c10t0d0 c11t0d0

root@quadra ~$ zpool list    
NAME     SIZE   USED  AVAIL  CAP  HEALTH  ALTROOT
mypool  2.77G   144K  2.77G   0%  ONLINE  -

root@quadra ~$ zpool status
 pool: mypool
 state: ONLINE
 scrub: none requested
config:

    NAME       STATE   READ  WRITE  CKSUM
    mypool     ONLINE     0      0      0
      c4t0d0   ONLINE     0      0      0
      c10t0d0  ONLINE     0      0      0
      c11t0d0  ONLINE     0      0      0

errors: No known data errors

root@quadra ~$ zfs list 
NAME     USED  AVAIL  REFER  MOUNTPOINT
mypool  73.5K  2.72G    18K  /mypool

root@quadra ~$ df -h | grep mypool
mypool        2.8G  18K 2.8G  1% /mypool

Using the simple command "zpool create" we specify a name for our pool ("mypool") and then the drives we want to assign to that pool, in this case three. This creates a "dynamic stripe," akin to RAID 0 with the added bonus that the stripe width is not set in stone.

The remaining commands in the example explore our new creation.

* "zpool list" will give you a quick summary of available pools
* "zpool status" will output information about the pool configuration, status and errors
* "zfs list" will show us all our available datasets
* "df -h" is our old and trusty friend that displays information about mounted file systems
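For scripting, the same pool summary can be requested in a machine-friendly form. The following is a minimal sketch, assuming a "zpool" that supports the "-H" (no header, tab-separated) and "-o" (column selection) options; the function name "unhealthy_pools" is made up for illustration.

```shell
#!/bin/sh
# Print the name of every pool whose health is not ONLINE.
# Input: lines of "name<TAB>health", the format produced by
#   zpool list -H -o name,health
unhealthy_pools() {
    awk -F '\t' '$2 != "ONLINE" { print $1 }'
}

# Only query the system if the zpool command is actually installed.
if command -v zpool >/dev/null 2>&1; then
    zpool list -H -o name,health | unhealthy_pools
fi
```

An empty result means every pool reported ONLINE, so the function drops easily into a cron job that mails its output.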

This is the fantastic part. Notice carefully what has happened and what we didn't need to do! The disks were partitioned, the RAID was set up and made ready, the file system was created and it was mounted.

Using a traditional Logical Volume Manager (LVM) we would need to partition the disks, create physical volumes, group them into a volume group, create a logical volume, create a file system on it, create a mount point and finally mount the volume.

Furthermore, Veritas Volume Manager or Logical Volume Manager commands are long, complex and frustrating!

With ZFS this was one simple and easy-to-understand command without any hard work. If you don't appreciate the full gravity of how remarkable this is, you should spend more time setting up LVM or VxVM and then come back.

Drawing from the memory paradigm, now that the pool is in place we won't touch it again unless we want to add or replace disks in the pool. Really and truly, that is it.
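Adding or replacing a disk is also a one-line job. The sketch below wraps "zpool add" and "zpool replace" in a dry-run helper so that nothing touches a real pool by accident. The device names c12t0d0 and c13t0d0 and the helper names are assumptions for illustration, not part of the example system.

```shell
#!/bin/sh
# Dry-run by default: each command is printed instead of executed.
# Set DRY_RUN=0 to run for real (needs root and an existing pool).
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Widen the dynamic stripe by adding one more disk to the pool.
grow_pool() {
    run zpool add "$1" "$2"
}

# Swap an old disk for a new one; ZFS resilvers onto the new device.
replace_disk() {
    run zpool replace "$1" "$2" "$3"
}

grow_pool mypool c12t0d0             # c12t0d0: hypothetical new disk
replace_disk mypool c4t0d0 c13t0d0   # c13t0d0: hypothetical replacement
```

Run as written, the script only prints the two commands, which makes it easy to review them before setting DRY_RUN=0.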

File systems: Data sets and properties
ZFS changes the way we think of a file system. Traditionally we are limited to one file system per volume. That seems reasonable until you consider that individual file systems may need different mount options or mount points. In the past, the way to solve this was to create smaller volumes with a separate file system for each purpose, but given the complexity of managing them you can only get so granular. Each file system also carries considerable overhead in disk consumption.

In ZFS we instead think of a pool containing multiple data sets. "Data set" is a generic term; for all intents and purposes a data set is just what you consider a file system to be. It can have any mount point you wish, and it can enable or disable the usual mount options, such as turning "atime" on or off or setting read-only. The difference is that data sets are extremely lightweight, and mount options are replaced with "data set properties."

Furthermore, data sets can be nested to form hierarchical management structures. In this way, data sets now become a point of administrative control for assigning quotas, mount points, compression, etc. Essentially you can create hundreds or thousands of "file systems" on a single system, perhaps one for each user's home directory. The more data sets, the more control you have.

Let us create some data sets using the "zfs create" command and change the mount point for them using the ZFS "mountpoint" property.

root@quadra ~$ zfs create mypool/home
root@quadra ~$ zfs create mypool/home/user001
root@quadra ~$ zfs create mypool/home/user002
root@quadra ~$ zfs create mypool/home/user003
root@quadra ~$ zfs set mountpoint=/myhome mypool/home
root@quadra ~$ zfs list
NAME                  USED  AVAIL  REFER  MOUNTPOINT
mypool                186K  2.72G    19K  /mypool
mypool/home            76K  2.72G    22K  /myhome
mypool/home/user001    18K  2.72G    18K  /myhome/user001
mypool/home/user002    18K  2.72G    18K  /myhome/user002
mypool/home/user003    18K  2.72G    18K  /myhome/user003

Notice that we've created nested data sets, and when I changed the mount point for "mypool/home" it also trickled down to all its children. This recursive behavior is known as "inheritance." I could easily override any one of the children, but when dealing with large numbers of data sets this makes life much simpler.
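Overriding one child and later handing it back to its parent might look like the sketch below. It uses "zfs set", "zfs inherit" and "zfs get -r" behind a dry-run helper; the path /export/user003 and the function names are hypothetical, chosen only to show the idea.

```shell
#!/bin/sh
# Dry-run by default: each command is printed instead of executed.
# Set DRY_RUN=0 to run for real (needs root and the pool from the article).
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Give one child its own mount point; its SOURCE then reads "local".
override_mountpoint() {
    run zfs set "mountpoint=$2" "$1"
}

# Remove the local value so the child follows its parent again.
restore_inherited() {
    run zfs inherit mountpoint "$1"
}

# List each data set under a parent with the source of its mount point.
show_sources() {
    run zfs get -r -o name,value,source mountpoint "$1"
}

override_mountpoint mypool/home/user003 /export/user003   # hypothetical path
show_sources mypool/home
restore_inherited mypool/home/user003
```

The SOURCE column is what tells you whether a value was set locally, inherited from a parent, or left at its default.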

Now let us uncover some of ZFS's deep magic by looking at ZFS data set properties on one of these data sets:

root@quadra ~$ zfs get all mypool/home
NAME         PROPERTY         VALUE                  SOURCE
mypool/home  type             filesystem             -
mypool/home  creation         Wed Dec 31 14:01 2008  -
mypool/home  used             76K                    -
mypool/home  available        2.72G                  -
mypool/home  referenced       22K                    -
mypool/home  compressratio    1.00x                  -
mypool/home  mounted          yes                    -
mypool/home  quota            none                   default
mypool/home  reservation      none                   default
mypool/home  recordsize       128K                   default
mypool/home  mountpoint       /myhome                local
mypool/home  sharenfs         off                    default
mypool/home  checksum         on                     default
mypool/home  compression      off                    default
mypool/home  atime            on                     default
mypool/home  devices          on                     default
mypool/home  exec             on                     default
mypool/home  setuid           on                     default
mypool/home  readonly         off                    default
mypool/home  zoned            off                    default
mypool/home  snapdir          hidden                 default
mypool/home  aclmode          groupmask              default
mypool/home  aclinherit       restricted             default
mypool/home  canmount         on                     default
mypool/home  shareiscsi       off                    default
mypool/home  xattr            on                     default
mypool/home  copies           1                      default
mypool/home  version          3                      -
mypool/home  utf8only         off                    -
mypool/home  normalization    none                   -
mypool/home  casesensitivity  sensitive              -
mypool/home  sharesmb         off                    default

Here we have a variety of useful knobs to turn. The first several properties are informational, such as creation time, space used, available and referenced. Here is a short list of what some of these properties are and how to use them:

* quota: Space quotas can be imposed on any data set, and they apply to its descendants as well. Simply "zfs set quota=10g mypool/home/user001" and that user can never use more than 10 GB of disk.
* reservation: Similar to a quota, but reservations are "pre-allocated." The disk space is removed from common use so that the space is guaranteed.
* mountpoint: Place where the data set is mounted. Mount points are created by ZFS on your behalf.
* compression: Just "zfs set compression=on mypool" and all new data will be compressed! Define this for everything or on a case-by-case basis.
* atime: If atime is "on," every time you access a file its access time stamp is updated, which can cause a lot of unwanted write activity. Use "zfs set atime=off mypool" to disable it.
* readonly: Want to lock away archive data? Just "zfs set readonly=on mypool/home/user002" and the user can look but not touch.
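With one data set per user, properties such as quota are usually applied in a loop. The following sketch prints one "zfs set quota=..." command per user so the list can be reviewed before anything changes; the function name "quota_commands" is an assumption, while the data set names are the ones used in the example above.

```shell
#!/bin/sh
# Print one "zfs set quota=..." command per user.
#   $1 = quota (for example 10g)
#   $2 = parent data set
#   remaining arguments = user names
quota_commands() {
    quota=$1
    parent=$2
    shift 2
    for user in "$@"; do
        printf 'zfs set quota=%s %s/%s\n' "$quota" "$parent" "$user"
    done
}

quota_commands 10g mypool/home user001 user002 user003
# Append "| sh" to the line above to run the printed commands as root.
```

Printing first and piping to "sh" second is a deliberately cautious pattern: a typo in a user name shows up on screen before it reaches the pool.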

Three really exciting options above are "sharenfs", "shareiscsi", and "sharesmb". By simply turning the property on (zfs set sharenfs=on mypool/home) you've exported that data set, and any children, via NFS. No fuss, no muck: turn it on and you're done. The same applies to iSCSI or CIFS ("smb").

Volumes
Block volumes can be created just as easily as file systems. With these volumes we can create legacy file systems (UFS, VxFS, etc.) or share iSCSI block volumes.

root@quadra ~$ zfs create mypool/volumes
root@quadra ~$ zfs create -V 500m mypool/volumes/volume001
root@quadra ~$ zfs list
NAME                       USED  AVAIL  REFER  MOUNTPOINT
mypool                     500M  2.23G    20K  /mypool
mypool/home                 76K  2.23G    22K  /myhome
mypool/home/user001         18K  2.23G    18K  /myhome/user001
mypool/home/user002         18K  2.23G    18K  /myhome/user002
mypool/home/user003         18K  2.23G    18K  /myhome/user003
mypool/volumes             500M  2.23G    18K  /mypool/volumes
mypool/volumes/volume001   500M  2.72G    16K  -

You can see that creating a block volume data set is done in the same way as creating a file system data set. We simply add "-V" followed by the desired size.

If we'd added the "-s" flag after "create," we would have created a "sparse" volume, a technique better known as thin provisioning. Thin provisioning means that we've defined a volume size but the blocks are not taken from the pool until they are actually written. In this way, we could create dozens of block volumes even if we didn't have enough space for them right now. Because resizing file systems can be complex, this allows us to oversize for the future even if the disk isn't available at the moment.

Here is a grotesque example on my little pool with only 2.2 GB available:

root@quadra ~$ zfs create -s -V 1t mypool/volumes/megavol  
root@quadra ~$ zfs list
NAME                       USED  AVAIL  REFER  MOUNTPOINT
mypool                     500M  2.23G    20K  /mypool
...
mypool/volumes             500M  2.23G    18K  /mypool/volumes
mypool/volumes/megavol      16K  2.23G    16K  -
mypool/volumes/volume001   500M  2.72G    16K  -

Notice I created a 1 TB volume, but it's only consuming 16K.
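To put a number on how far this sparse volume is overcommitted, a little shell arithmetic is enough. This is only a worked example: the function name is made up, and the 2283 MB figure is the 2.23 GB reported by "zfs list" above converted to megabytes and rounded down.

```shell
#!/bin/sh
# Integer overcommit ratio: provisioned size divided by the space
# that is really available. Both arguments are in megabytes.
overcommit_ratio() {
    echo $(( $1 / $2 ))
}

# 1 TB volume = 1024 * 1024 MB; the pool has roughly 2283 MB available.
overcommit_ratio $((1024 * 1024)) 2283
```

The script prints 459, meaning the volume promises about 459 times more space than the pool could deliver today. That is fine as long as real disks are added before the data arrives.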

Snapshots and cloning
ZFS makes things easy to create and manage as we've seen, but it also brings enterprise-grade features down to the average user. The best example is that of snapshots and cloning.

We can create a snapshot using the "zfs snapshot" command, and following the data set name with an "@" and the desired snapshot name.

root@quadra ~$ zfs snapshot mypool/home/user001@snap01
root@quadra ~$ zfs list
NAME                         USED  AVAIL  REFER  MOUNTPOINT
mypool                       500M  2.23G    20K  /mypool
mypool/home                   76K  2.23G    22K  /myhome
mypool/home/user001           18K  2.23G    18K  /myhome/user001
mypool/home/user001@snap01      0      -    18K  -

The "@" character specifies a snapshot, followed by the name. Snapshots are lightweight and created instantaneously.
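Because snapshots are so cheap, they are often taken on a schedule with a time stamp in the name. The sketch below builds such a name and prints a recursive snapshot command; it assumes the installed "zfs snapshot" supports the "-r" option, and the function name and date format are illustrative choices.

```shell
#!/bin/sh
# Build "dataset@YYYYMMDD-HHMMSS" so that snapshot names sort
# chronologically when listed alphabetically.
snapshot_name() {
    echo "$1@$(date +%Y%m%d-%H%M%S)"
}

# Print (not run) a recursive snapshot command that would cover
# mypool/home and every data set beneath it.
echo "zfs snapshot -r $(snapshot_name mypool/home)"
```

Dropping the "echo" turns the last line into the real command, which must then be run as root.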

One of the advantages of snapshots is the ability to cherry-pick files out of them.

In the "/myhome/user001" mount point we'll find a hidden ".zfs" directory that does not show up in a directory listing but gives us access to the snapshot contents:

root@quadra ~$ cd /myhome/user001
root@quadra user001$ ls -alh
total 3.0K
drwxr-xr-x 2 root root 2 Dec 31 14:01 .
drwxr-xr-x 5 root root 5 Dec 31 14:01 ..
root@quadra user001$ cd .zfs
root@quadra .zfs$ ls -l
total 0
dr-xr-xr-x 2 root root 2 Dec 31 14:01 snapshot
root@quadra .zfs$ cd snapshot/
root@quadra snapshot$ ls -l
total 2
drwxr-xr-x 2 root root 2 Dec 31 14:01 snap01

Here we can traverse the file system as it appeared at the snapshot's point in time and recover files by simply copying them out.
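Copying a file back out can be wrapped in a small helper. This is a minimal sketch that relies only on the ".zfs/snapshot" layout shown above; the function name and the file name in the usage comment are hypothetical.

```shell
#!/bin/sh
# Copy one file out of a snapshot back into the live file system.
#   $1 = mount point of the data set
#   $2 = snapshot name
#   $3 = file path relative to the mount point
restore_from_snapshot() {
    src="$1/.zfs/snapshot/$2/$3"
    dst="$1/$3"
    if [ ! -e "$src" ]; then
        echo "not in snapshot: $src" >&2
        return 1
    fi
    # Recreate the parent directory if it was deleted, then copy the
    # file while preserving its mode and time stamps.
    mkdir -p "$(dirname "$dst")" && cp -p "$src" "$dst"
}

# Example (notes.txt is a hypothetical file name):
#   restore_from_snapshot /myhome/user001 snap01 notes.txt
```

The helper overwrites the live copy of the file, so copy it elsewhere first if the current version might still be needed.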

Snapshots are used for many things, but let's look at cloning. For example, if you are working on a project and want to create a copy of it for another user so he doesn't mess with your work, no problem! Create a snapshot and clone it!

root@quadra ~$ zfs create mypool/project 
root@quadra ~$ zfs create mypool/project/working
root@quadra ~$ zfs snapshot mypool/project/working@now
root@quadra ~$ zfs clone mypool/project/working@now mypool/project/working-copy
root@quadra ~$ zfs list
NAME                          USED  AVAIL  REFER  MOUNTPOINT
mypool                        500M  2.23G    21K  /mypool
mypool/project                 37K  2.23G    19K  /mypool/project
mypool/project/working         18K  2.23G    18K  /mypool/project/working
mypool/project/working@now       0      -    18K  -
mypool/project/working-copy      0  2.23G    18K  /mypool/project/working-copy
...

You can take that clone and NFS or CIFS share it, or do whatever you like!
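A clone stays tied to the snapshot it was created from, and that snapshot cannot be destroyed while the clone exists, so it helps to know which data sets are clones. The sketch below filters the output of "zfs list -H -o name,origin" for that purpose, assuming a "zfs" that supports those options; the function name "list_clones" is made up for illustration.

```shell
#!/bin/sh
# Print "clone <- origin snapshot" for every data set that is a clone.
# Input: lines of "name<TAB>origin" as produced by
#   zfs list -H -o name,origin
# Data sets that are not clones report "-" in the origin column.
list_clones() {
    awk -F '\t' '$2 != "-" { print $1 " <- " $2 }'
}

# Only query the system if the zfs command is actually installed.
if command -v zfs >/dev/null 2>&1; then
    zfs list -H -o name,origin | list_clones
fi
```

On the example pool this would be expected to report mypool/project/working-copy together with its origin, mypool/project/working@now.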

ZFS brings enterprise storage capabilities to any system of any size. All the examples I used were performed using three 1 GB USB sticks. Administration is simple, easy to understand and extremely fast. I hope this article has given you that warm fuzzy feeling that will help you get started using this amazingly powerful open source technology in your environment.

ABOUT THE AUTHOR: Ben Rockwood is the director of systems at cloud computing infrastructure company Joyent Inc. A Solaris expert and Sun evangelist, he lives just outside of Silicon Valley, Calif., with his smokin' hot wife Tamarah and their three children. Read his blog at cuddletech.com.

This was first published in January 2009
