»
S
I
D
E
B
A
R
«
Analyzing iSCSI on ZFS (Playing With ZIL)
April 4th, 2009 by admin

In the previous entry, I showed some performance numbers that I was able to achieve from my ZFS backed iSCSI targets. The read performance that I was able to achieve were great, unfortunately I was only able to touch max around 1MB/s on writes.

After much discussions, I was inclining towards the theory that the culprit is the high latency and low IO performance of my 5 x SATA-II RAID-Z, which is shared between the data, metadata and the ZFS Intent Log. ZFS Intent Log is the logging system for ZFS file system, responsible for maintaining the file system consistent in the event of crashes etc. ZIL is also used by the file system for providing synchronous writes, for POSIX compliancy. For a good performing system it is recommended that the device that hold’s the log is very fast (in terms of latency for handling many small random writes). Neelakanth’s blog entry has some very good info on ZIL.

Playing with ZFS Intent Log (ZIL)

So as next step, I will try to identify and more importantly quantify the kind of IO operations that are taking place during writes.

Typically Solid State Disks (SSD) with high transactional performance are recommended to be used for the ZFS Intent Logs. The only problem is that the cost of buying an expensive SSD disk would be prohibitively expensive esp. for my home brew NAS.

I still want to not only see how much a SSD device can really help (which is an easy part), but also to optimize the performance of my iSCSI targets for write performance.

In order to have fast Apple Time Machine backups I would need faster iSCSI devices. This would get better once Apple moves to ZFS as their native file system and the backup would truly mean sending the incremental snapshots using “zfs send”.

So I started by monitoring the activity on the logging subsystem using “dtrace”. Dtrace is one of the most exposing tool for realtime analysis, that I have come across.

So I planned a small experiment, where I would write (and re-write) a 128MB file with 256k block size using IOzone on the iSCSI device. And during this time would monitor all the activities on the ZIL. This requires setting two probes on the Solaris workstation before starting the IOzone test.

Setting Probes on OpenSolaris

Probe 1: To collect information on all the ZIL activity.

# dtrace -n zil\*:entry'{@[probefunc]=count();}' -n zio_free_blk:entry'{@[probefunc]=count();}' 
dtrace: description 'zil*:entry' matched 44 probes
dtrace: description 'zio_free_blk:entry' matched 1 probe
^C

  zil_itx_clean                                                     4
  zil_clean                                                        16
  zil_header_in_syncing_context                                    82
  zil_sync                                                         82
  zil_vdev_compare                                               1152
  zil_flush_vdevs                                               31617
  zil_commit_writer                                             32457
  zil_lwb_write_done                                            32764
  zil_lwb_write_start                                           32764
  zio_free_blk                                                  32764
  zil_add_block                                                 32769
  zil_commit                                                    32933
  zil_lwb_commit                                                33106
  zil_itx_assign                                                33585
  zil_itx_create                                                33585
  zil_lwb_write_init                                            34253

Probe 2: To monitor block size of the log writes.

# dtrace -n ::zil_lwb_commit:entry'{@[1] = quantize((lwb_t*)args[2]->lwb_sz);}'
dtrace: description '::zil_lwb_commit:entry' matched 1 probe
^C

        1
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |                                         58
            8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   31202
           16384 |@@                                       1643
           32768 |                                         203
           65536 |                                         0        

Note: Notice the ^C character right below the probe; Once the IOzone run is finished the probe must be interrupted for getting the results.

04042009-iscsi-blog-2

Executing the IOzone test on the client: The following test would basically write 256MB of data to the iSCSI device.

$ ~/Downloads/iozone -r256k -s128m -i0 -w 
     Iozone: Performance Test of File I/O
                Version $Revision: 3.321 $
                Compiled for 32 bit mode.
                Build: macosx
         .
         .
         .
        Record Size 256 KB
        File size set to 131072 KB
        Setting no_unlink
        Command line used: /Users/abisen/Downloads/iozone -r256k -s128m -i0 -w
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.

              KB  reclen   write rewrite
          131072     256     872     869   

iozone test complete.

I still need to completely understand the output from the two ZFS probes, but it is clearly visible that for writing 256MB of data it requires a lot of activity on the logging subsystem. Now that we have quantified the number of commits and block size of commits on the ZIL. It would be very helpful to somehow see how our storage subsystem is performing from IOPS and latency perspective.

I performed a quick test again, using the same IOzone benchmarking tool and iostat. But this time I executed the benchmark locally on the Solaris workstation. Alternatively I also created a small ZFS file system on an old SanDisk Cruzr (512MB) USB Flash Disk. And ran the same test on this file system as well. Just to get an idea on how a USB Flash would fare against my 5 x SATA-II RAID-Z volume.

The video below shows what I noticed, Not only the service time (latency) on the USB Flash Disk was 4-5 times better than 5 x SATA HDD’s, even the IOPS that the old little USB flash could sustain were far greater than what 5 SATA disks could achieve.

So a USB Flash Disk might have some potential for helping in improving the write performance of a home brew NAS. I am hesitant on trying this out right away because, it’s easy to add a new device to a ZFS file system, but I have not been able to find any information on deleting/removing a device from a Zpool.

Next steps:

  • Testing the performance by disabling the ZIL completely. This would give an idea on how fast can these iSCSI device get if ZIL is taken out of the picture. A ZFS file system should never be used with ZIL disabled, but it’s a neat feature for testing if separating ZIL would give any performance benefits.
  • Testing the performance of the ZFS file system by using a USB based Flash Disk.

5 Responses  
Michael Joyce writes:
April 29th, 2009 at 10:55 am

Have you ever looked at ZFS using DDR-based NVRAM as the stable storage? I’m wondering if you might find the same 4-5x performance improvement over flash that you found when you compared flash to SATA RAID.

admin writes:
April 29th, 2009 at 11:34 am

I think NVRAM backed ZIL would have a huge performance gains much better than the flash disk in almost all of the IO workloads. Unfortunately for my home brew system it would be prohibitively expensive to get one of these NVRAM devices :)

Ty Morton writes:
May 6th, 2009 at 7:54 pm

Great blog article you have here. I have a similar setup and recently purchased a couple of SSD’s for the purpose. A few things that might help you out along the way (from my experience).

1. ZIL drives CAN NOT be removed from a zpool. This is a known issue and the only way to do it is to zpool destroy!

2. I purchased a couple of 32GB SSD’s with reasonable read/write performance. Initially I had a single raidz pool and used one for L2ARC and the other for ZIL/SLOG. I ended up rebuilding that pool to get rid of the ZIL device because it appeared to me that it was underutilized (I was only looking at the output from zpool iostat -v pool 1). I am starting to have second thoughts though and may order a couple of more SSD’s.

3. According to what I have read and witnessed on my home ZFS server it is best to get as large of a drive as possible for L2ARC (trust me it will use it quickly) and the ZIL really can be small even 10GB is plenty. most of the time my usage for the ZIL was under 1mb (again according to iostat).

My plan is to add 2 more 64Gb or larger disks with really good read throughput and use the two 32GB disks as ZIL.

There is also a blog post somewhere (someone from Nexenta I think) that tried to use some sort of RAM drive and had some really bad experience’s regarding data reliability. I believe he may have ended up loosing the pool all together.

I hope to post some more information on my Blog in the near future when I get a few more kinks worked out and settle on my final configuration. The dtrace information you have here is great though and I look forward to running it with my testing.

One last question I have is what iSCSI target are you using? Is it the default that comes with opensolaris or the Comstar target software?

admin writes:
May 9th, 2009 at 5:26 pm

Thanks Ty, for the feedback. I am glad this post was helpful.

> 1. ZIL drives CAN NOT be removed from a zpool. This is a known issue and the only way to do it is to zpool destroy!

Yes that’s one reason why I did not try the ZIL extensively. And it’s also something that I think should be changed.

> One last question I have is what iSCSI target are you using? Is it the default that comes with opensolaris or the Comstar target software?

I was using the standard iscsitadm (target) as this is the only target supported by the OSX initiators that I was playing with initially. I got hold of the ATTO’s Xtend SAN later which apparently works with the Amber Road (iSCSI Target). And the performance (esp write) was much better than the globalSAN iSCSI target.

Ross Walker writes:
July 10th, 2009 at 11:32 am

Unfortunately in my tests with iscsitgt, it never even touches the ZIL, as it uses async IO instead of direct IO, so the performance you get is what you get.

I do iSCSI off of soft-partitions backed by HW RAID controllers with big battery backed write-back caches and that performs well, but then again we are running our Exchange/SQL databases off of it, so we need any little extra we can get.

For a home network though, you can get good “enough” iSCSI performance with good disks and good controllers (25-30MB/s for simple copy operations more for backup operations).

-Ross

Leave a Reply

»  Substance: WordPress   »  Style: Ahren Ahimsa