Modern Infrastructure

In-line deduplication for a smaller data footprint

Big data and massive data volumes are now staples in a data center, but dedupe and compression are storage techniques that diminish the overwhelming data trail.

It's not just for backup anymore. Dedupe and compression are must-haves for primary storage too.

Deduplication is a compression technique that minimizes data by identifying repetitive patterns, removing the duplicates and leaving a pointer to the stored copy in their place. The pointer is derived from a hash computed over each chunk of data of a given size.
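
To make the hash-and-pointer idea concrete, here is a minimal sketch in Python. It assumes fixed 4 KB chunks and SHA-256 as the hash; both are illustrative choices, and the function names `dedupe` and `rehydrate` are hypothetical rather than taken from any product mentioned in this article.

```python
import hashlib

def dedupe(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and keep one copy of each unique chunk."""
    store = {}      # digest -> chunk bytes (the single stored copy)
    pointers = []   # one digest per logical chunk, in original order
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)   # store only the first occurrence
        pointers.append(digest)           # duplicates become pointers
    return store, pointers

def rehydrate(store, pointers) -> bytes:
    """Rebuild the original data by following the pointers."""
    return b"".join(store[digest] for digest in pointers)

# Three identical chunks of "A" followed by one chunk of "B".
data = b"A" * 4096 * 3 + b"B" * 4096
store, pointers = dedupe(data)
print(len(pointers), "logical chunks,", len(store), "stored chunks")
```

Four logical chunks collapse to two stored ones here, and `rehydrate(store, pointers)` returns the original bytes. A production engine would also need persistence, reference counting and protection against hash collisions, none of which this sketch attempts.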

Chances are you've been deduplicating backup and archive data for years, but the question is: Should you apply data reduction and efficiency techniques such as dedupe and compression to primary storage as well?

Not all dedupe is created equal. IT professionals should weigh their options before adding it to their primary storage environment.

When to dedupe

Dedupe data when it is first created. Then, all downstream operations -- backup, replication, archive and any attending network transfers -- can take advantage of that diminished data footprint, said Arun Taneja, founder of Taneja Group Inc., storage consultants in Hopkinton, Mass. 

"Years ago, I wrote that I understand why dedupe is [often done] on the backup device. But if there are no limitations, dedupe should be done at the time of data creation, and should be stored in its dehydrated format for its entire lifecycle," Taneja said. The only exception should be when the user or application needs to access the data.

But deduping primary data is hard to accept because you're tampering with your main data set, said George Crump, lead analyst at Storage Switzerland LLC, a storage consulting firm. "With backup, if the dedupe didn't work, it didn't matter because you weren't messing with production data," he said. "With primary storage, it matters, and you will want to understand how it impacts performance, reliability and data integrity."

Only a small number of primary storage arrays currently offer data deduplication as part of their product. Fewer than 5% of disk arrays support true in-line dedupe and compression, said Tom Cook, CEO at Permabit, a provider of data efficiency technology. But that figure will likely increase to 25% in the next 18 months, and 75% in 36 months, Cook said.

Dedupe for dummies

Space savings from deduplication can be significant, depending on the type of data and the size of the chunks used by the deduplication engine. Text files and virtual desktop infrastructure environments, for example, benefit from high deduplication rates, on the order of 40:1. Video, meanwhile, may be compressible, but not dedupable. Storage vendors cite a 6:1 deduplication rate as a good average. Coupled with compression of those same blocks, data centers can easily achieve 10:1 space savings with these techniques.
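
The ratios above translate into capacity as follows. This is a back-of-the-envelope sketch: the helper names are hypothetical, and the last function assumes dedupe and compression ratios simply multiply, which is a simplification rather than something the vendors quoted here have stated.

```python
def space_saved_pct(ratio: float) -> float:
    """Percentage of capacity saved at a given reduction ratio (e.g. 6 for 6:1)."""
    return (1.0 - 1.0 / ratio) * 100.0

def physical_gb(logical_gb: float, ratio: float) -> float:
    """Physical capacity needed to hold logical_gb at a given reduction ratio."""
    return logical_gb / ratio

def implied_compression(combined_ratio: float, dedupe_ratio: float) -> float:
    """Compression ratio implied by a combined figure, assuming ratios multiply."""
    return combined_ratio / dedupe_ratio

for label, ratio in [("average dedupe", 6), ("dedupe plus compression", 10), ("VDI and text", 40)]:
    print(f"{label}: {ratio}:1 saves {space_saved_pct(ratio):.1f}%, "
          f"1,000 GB logical needs {physical_gb(1000, ratio):.0f} GB physical")

print(f"implied compression on top of 6:1 dedupe: {implied_compression(10, 6):.2f}:1")
```

Note how quickly the returns flatten: most of the benefit arrives by 6:1, and going from 10:1 to 40:1 adds only a few percentage points of saved capacity.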

Achieving those kinds of space savings has obvious appeal, but dedupe can be very processor-intensive. That's less of an issue in secondary storage environments, but it can be a show stopper in primary storage environments, said Dave Russell, Gartner vice president and distinguished analyst for storage technologies and strategy.

"The real fear is that the application will be stuck waiting for a storage write, or even read," Russell said. "If performance [of the primary storage array] is a concern, then you have to do it post-process," once the data has already been written.

Not deduping in real time also allows providers to play with the algorithms to maximize potential data reduction rates. Quantum's DXi-Series of backup appliances, for instance, uses a variable block-size deduplication algorithm, which can be three times more efficient than a fixed-block-size approach, said Casey Burns, senior product marketing manager for Quantum's data center products.
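
Why variable block sizes can find more duplicates is easiest to see with a toy example. The sketch below is not Quantum's algorithm; it is a generic content-defined chunker, with an assumed 48-byte window, a CRC32 hash recomputed at each position for clarity (real engines typically use a rolling hash), and a cut wherever the low 10 bits of the hash are zero. It compares how many chunks still match after a single byte is inserted at the front of the data.

```python
import random
import zlib

WINDOW = 48      # bytes examined at each position (illustrative value)
MASK = 0x3FF     # cut when the low 10 hash bits are zero, about 1 KB on average
FIXED = 1024     # fixed block size used for comparison

def fixed_chunks(data: bytes, size: int = FIXED):
    """Cut at fixed offsets, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def variable_chunks(data: bytes, window: int = WINDOW, mask: int = MASK):
    """Cut wherever the hash of the preceding `window` bytes matches the mask."""
    chunks, start = [], 0
    for end in range(window, len(data)):
        # The boundary depends only on local content, not on absolute offset.
        if zlib.crc32(data[end - window:end]) & mask == 0:
            chunks.append(data[start:end])
            start = end
    chunks.append(data[start:])   # whatever remains after the last cut
    return chunks

def shared(a, b) -> int:
    """Number of distinct chunks the two chunk lists have in common."""
    return len(set(a) & set(b))

rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
shifted = b"X" + original        # one byte inserted at the front

fixed_shared = shared(fixed_chunks(original), fixed_chunks(shifted))
variable_shared = shared(variable_chunks(original), variable_chunks(shifted))
print("fixed-size chunks still matching:   ", fixed_shared)
print("variable-size chunks still matching:", variable_shared)
```

With fixed blocks, the one-byte insertion shifts every block boundary, so nothing lines up with the stored copy. With content-defined boundaries, only the first chunk changes and the rest are recognized as duplicates. The size of the real-world advantage depends on the data and is not something this toy measures.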

Primary storage plays by another set of rules

So while post-process deduplication has its advantages, it does require that you set aside a so-called landing zone -- the extra capacity to write data before it is deduped. "With post-process, you have to have all the space you'd normally be using at your disposal," Russell said. That can defeat the purpose of the deduplication process, especially with regard to high-priced primary storage capacity, notably flash.

Not surprisingly, primary storage providers, particularly all-flash array players, have been among the first to push for doing in-line dedupe at the time of data creation.

Support for in-line dedupe is becoming the entry fee in the all-flash and hybrid flash array market. That makes sense given flash's high per-gigabyte cost, said Storage Switzerland's Crump. "When you think about the fact that hard disk drives cost well under $1 per gigabyte, the value of deduplication lessens," Crump said. "But with flash costing $8 to $9 per gigabyte, making that five to ten times more efficient with deduplication makes a lot of sense."
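
Crump's arithmetic can be checked in a few lines. The sketch below simply divides the raw per-gigabyte price by the reduction ratio; the function name is hypothetical, and the prices and ratios are the ones quoted above, not current market figures.

```python
def effective_cost_per_gb(raw_cost_per_gb: float, reduction_ratio: float) -> float:
    """Cost per gigabyte of logical data after data reduction."""
    return raw_cost_per_gb / reduction_ratio

for raw_cost in (8.0, 9.0):          # flash prices quoted in the article
    for ratio in (5, 10):            # "five to ten times more efficient"
        cost = effective_cost_per_gb(raw_cost, ratio)
        print(f"${raw_cost:.2f}/GB flash at {ratio}:1 -> ${cost:.2f} per logical GB")
```

At those ratios, flash lands in roughly the same per-gigabyte range as the sub-$1 hard disk figure Crump cites, which is the crux of the all-flash vendors' pitch.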

Dedupe by the numbers

  • <5%: Disk arrays shipping today that support in-line dedupe and compression
  • 75%: Predicted share of disk arrays shipping with in-line dedupe and compression in three years
  • 6:1: Average rate of dedupe
  • 40:1: Approximate dedupe rate for VDI environments and text files
  • 10:1: Space savings when compression is added to dedupe
  • <$1: The per-gigabyte cost of a hard drive
  • $8 to $9: The per-gigabyte cost of flash

Flash array providers that offer in-line deduplication make up a new generation of what Crump calls "ankle-biters," nipping at the heels of established tier-one storage providers. They include the likes of Pure Storage, Nimble and Tegile, to name a few.

Any flash providers that don't currently have dedupe will likely adopt it soon. Violin Memory, meanwhile, expects to ship in-line deduplication and compression on its Concerto 7000 All Flash Array later this year.

Not only does dedupe make better use of expensive flash resources, it's also relatively easy to implement. Compared with most primary storage vendors, "the flash guys are better at dedupe because input/output operations per second are almost free on flash," said Jesse St. Laurent, vice president of product strategy at SimpliVity Inc., a hyperconverged infrastructure provider.

SimpliVity offers in-line dedupe and compression based on custom embedded silicon in its product. One SimpliVity customer, the City of Arvada, Colo., reports that it sees storage efficiency rates of 13.5:1, with performance that is on par with or better than its Cisco UCS server and Dell Compellent storage environments, said CIO Ron Czarnecki.

The last frontier

Traditional storage vendors have been slower to add in-line deduplication. NetApp offered in-line deduplication in 2007 with ASIS as a standard feature on all its FAS arrays. However, primary storage deduplication on NetApp arrays comes with important caveats, said Taneja. Using it in-line "will bring performance down to its knees," he said, so it should only be used post-process. NetApp's 16-bit deduplication algorithm also "just is not large enough to have a very low probability of collision," he said, raising the possibility (albeit very small) that two different chunks will produce the same hash.
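
How hash width affects collision risk can be estimated with the standard birthday-bound approximation. The sketch below is generic: it assumes an ideal, uniformly distributed hash and does not model NetApp's or any other vendor's implementation, which may add safeguards such as a byte-by-byte comparison when hashes match. The function name and the chunk counts are illustrative.

```python
import math

def collision_probability(num_chunks: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_chunks distinct chunks
    share a hash value, for an ideal hash_bits-wide hash (birthday bound)."""
    space = 2.0 ** hash_bits
    exponent = -num_chunks * (num_chunks - 1) / (2.0 * space)
    return -math.expm1(exponent)   # 1 - e^exponent, accurate for tiny values

for bits, chunks in [(16, 1_000), (64, 10**9), (256, 10**12)]:
    p = collision_probability(chunks, bits)
    print(f"{bits}-bit hash, {chunks:,} chunks: collision probability ~ {p:.3g}")
```

The takeaway is that the risk depends on both the width of the hash and the number of chunks being tracked: a narrow hash used on its own saturates almost immediately, while a wide cryptographic hash keeps the odds negligible even at very large chunk counts.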

NetApp's competitors are catching up. EMC offers block deduplication on its VNX series and post-process dedupe on its Isilon scale-out storage. Dell Compellent and EqualLogic arrays both have primary dedupe, as do HP 3PAR StoreServ arrays. Hitachi Data Systems has dedupe on its network-attached storage and unified storage arrays through an original equipment manufacturer deal with Permabit, while IBM offers dedupe on its Storwize array and SAN Volume Controller.

But those products have yet to make their mark, leaving room for deduplication appliances that front-end existing storage arrays, said Gartner's Russell. In theory, that allows shops to get more serviceable life out of the storage they already have, he said. "Some will embrace that," depending on how open they are to new technology.

Permabit's new SANblox, an in-line deduplication appliance based on the company's Albireo Index Engine technology, can be inserted in front of a traditional Fibre Channel SAN array. SANblox brings in-line deduplication to legacy storage arrays on an as-needed basis.

SANblox gives tier-one storage players an easy way to quickly add quality in-line dedupe to their existing offerings, said Storage Switzerland's Crump. "Anecdotally, the flash vendors are taking market share away from the big guys. This gives them a way to stop the bleeding beyond just giving away storage," he said.

Further out, Intel chips will provide all the horsepower needed to do in-line deduplication in software, predicted Taneja, obviating the need for proprietary silicon or appliances.

"The next round of Intel chips will have more than enough power for dedupe," Taneja said, and it will be an "integrated feature of every primary storage arrays and converged system." In other words, "Three years from now, we won't be discussing this anymore."
