This is the second in a series of articles on Sun Microsystems' Unified Storage System, also known as Amber Road. Previous post(s):
So in the first post in this series I discussed the USS and gave a basic overview of the hardware. In this post I'll discuss the disk components in more detail and look at how flash (SSD) drives and ZFS are combined to produce a commodity storage device.
Traditional storage arrays permit the configuration of multiple disk types within a single array, ranging from solid state disks (SSDs) through fast fibre channel drives to slower, high-capacity SATA drives. The USS operates a slightly different model: all drives in the array are high-capacity SATA. SSDs are then used in combination with the ZFS file system to improve read and write performance, acting as a read cache and a write log respectively.
How ZFS & SSD Are Used
OK, I’m not going to post a long essay on ZFS (although I may in the future), but it’s worth looking at the basic concepts in order to understand how ZFS affects USS performance. ZFS (originally called the “Zettabyte File System”) is a high-performance, high-capacity filesystem introduced into Solaris about three years ago. It is more resilient than UFS, not requiring filesystem checks after a system crash, and it integrates the features of a standard filesystem and a volume manager, pooling physical disks into groups from which filesystems can then be created. ZFS supports RAID protection, including RAID-1 mirroring and RAID-Z, its own implementation of RAID-5. RAID-Z doesn’t suffer the write penalty of traditional RAID-5 because ZFS uses a copy-on-write (COW) approach, writing data to new locations rather than overwriting it in place, so there is no read-modify-write cycle on partial stripe updates.
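To illustrate the pooling and RAID-Z concepts, here’s a minimal sketch using the standard Solaris ZFS command line. The pool, filesystem and device names are hypothetical examples, and the USS wraps all of this in its own management interface rather than exposing these commands directly.

```
# Create a pool named "tank" with RAID-Z protection across three disks
# (device names are hypothetical)
zpool create tank raidz c1t0d0 c1t1d0 c1t2d0

# Filesystems are carved straight out of the pool -- no separate
# volume management step is needed
zfs create tank/projects
zfs create tank/home

# Check the pool's layout and protection level
zpool status tank
```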
ZFS has two features, both relevant to the USS, that improve performance. First, disk reads are held in a cache called the ARC (Adaptive Replacement Cache). Second, synchronous writes are journalled (or logged) in the ZIL (ZFS Intent Log). The ZIL provides resilience in the event of a system crash, but it also offers an opportunity to increase write performance. Have a look at the graphic on the right, which is used heavily in the Sun documentation on the USS; it shows how RAM and disk would be allocated in a traditional storage pool. The USS model implements the ARC for cached reads, held in RAM; the L2ARC, a second-level ARC that extends the ARC onto read-biased SSDs; and the ZIL, which is stored on write-biased SSDs.
The L2ARC improves reads by creating an intermediate tier of read cache between main memory and disk. The ZIL improves writes by logging them to SSD and periodically flushing them to physical disk; in the event of a system crash, integrity is still maintained because the ZIL sits on non-volatile storage.
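In standard ZFS, these two tiers map onto “cache” (L2ARC) and “log” (ZIL) devices that can be added to a pool. The sketch below shows roughly how that looks from the ZFS command line; the device names are hypothetical, and the USS configures the equivalent behind its own interface.

```
# Add a read-biased SSD to the pool as an L2ARC (cache) device
zpool add tank cache c2t0d0

# Add write-biased SSDs as a mirrored ZIL (log) device, so the loss
# of a single SSD doesn't lose logged writes
zpool add tank log mirror c2t1d0 c2t2d0
```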
In the USS, SATA drives form the main disk pool, while STEC SSDs are used for the L2ARC and ZIL. The model I reviewed had 36GB of ZIL cache, deployed as two 18GB SSD modules in standard disk enclosures. The current implementation of the USS only allows a single disk pool, which means all data has to be protected with the same RAID level. This is an annoying restriction, but I expect it will change in a future release, as multiple pools are a standard ZFS feature.
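For comparison, plain ZFS has no such restriction: nothing stops you creating more than one pool, each with its own RAID protection (hypothetical pool and device names again).

```
# A mirrored pool for latency-sensitive data...
zpool create fastpool mirror c3t0d0 c3t1d0

# ...and a separate RAID-Z pool for bulk capacity
zpool create bulkpool raidz c4t0d0 c4t1d0 c4t2d0 c4t3d0
```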
Why SSD and SATA?
It’s worth touching on why the USS is different from a traditional storage device. In a typical general-purpose storage array, some of the LUNs presented to hosts will be very active, some moderately active and some almost entirely inactive. If LUN activity is plotted on a graph with the busiest LUNs on the left, the least active on the right and the Y-axis showing each LUN’s activity in IOPS, the profile of a normal system follows the “Long Tail” model. This variation in activity is why savings can be made by operating a tiered model in a large storage array, placing LUNs on the appropriate tier of storage based on their activity level.
However, the trouble with taking I/O profile snapshots is that they’re just that: a snapshot. They represent I/O activity at a single point in time. Take a sample at another time of day, or on another day of the week, and a different profile results, possibly showing a very different set of busy LUNs from those highlighted previously. One option is to average the profiles over a suitable interval, say a day, a week or a month. Whilst this will show which LUNs are busiest on average, it will also mask any peaks in I/O demand, as they are smoothed out over the period; the shorter these peaks are, the less likely they are to be noticed.
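To make the point concrete, here’s a trivial sketch, assuming a hypothetical file of per-interval IOPS samples for one LUN, showing how an average can hide a short peak.

```
# iops.txt holds one IOPS sample per line (hypothetical data), e.g. a
# LUN that idles around 100 IOPS but spikes to 3500 for one interval.
# The spike barely moves the average, yet it's exactly the peak a
# tiering decision would need to know about.
awk '{ sum += $1; if ($1 > max) max = $1 }
     END { printf "average: %.0f IOPS, peak: %d IOPS\n", sum/NR, max }' iops.txt
```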
Deploying tiering has one other problem: determining the amount of storage required in each tier. The ratio of tiers required may well change over time as an array grows. Perhaps the consumers of storage on the array realise that tier-1 storage is expensive and ask for more tier-2, or a new project comes along that needs a large volume of tier-0 SSD. Traditional arrays are typically inflexible when it comes to physically swapping tiers of storage on demand.
The USS provides one alternative to tiering around the Long Tail model. By accepting writes into SSD and destaging them to SATA later, it ensures that high-performance, non-volatile storage is available at the time of the write and for multiple subsequent reads. Fronting disk access with SSD means high performance is provided to LUNs dynamically, as and when it is needed.
Now, it would be possible to overwhelm the SSD write cache by flooding a USS array with writes, but the same is true of the cache in any array; the question is at what point the USS would fail to keep up. Unfortunately, I wasn’t able to generate enough workload in my testing to overwhelm the 7210 I tested, but I can say that the array coped easily with everything I threw at it. Clearly there’s still a requirement to manage the ratio of SSD to SATA based on the workload profile of the array.
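On a plain ZFS system, that balance between the SSD and SATA layers can be watched with zpool iostat, which breaks activity out per device, including the cache and log devices; the USS exposes similar information through its analytics, which are the subject of the next post. A minimal sketch, again assuming the hypothetical pool from earlier:

```
# Per-device I/O statistics every 5 seconds; the "log" and "cache"
# sections show how hard the ZIL and L2ARC SSDs are working relative
# to the SATA disks in the pool
zpool iostat -v tank 5
```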
Value Proposition
So what’s the value of using SATA and SSD in combination as the USS does? There are several benefits:
- All data is stored on cheap, high capacity SATA drives, reducing the overall cost of the solution.
- I/O performance demands are managed by a small incremental cost in SSD.
- Variations in I/O workload are managed dynamically, removing the need to implement multiple storage tiers and significantly reducing management overhead.
- Array expansion is simplified — there’s no need to spend time planning how additional storage should be assigned to an array by tier.
Next time I’ll look at the analytics provided by the USS and how they allow detailed device reporting.