Data deduplication and single instance storage (SIS) technology have both been around for several years, but industry observers say they are emerging as an important way to contain data center costs and
The idea behind deduplication, or deduping for short, is simple enough. When writing data that has been written before, rather than write it again, simply leave a pointer to the original data. If 50 people in a company receive an email with a 3 MB attachment, each mailbox gets a shortcut to the single original file, and 3 MB of server space is consumed instead of 150 MB.
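The email example can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the `SingleInstanceStore` class, its method names and the use of SHA-256 as the content fingerprint are all assumptions made for the example.

```python
import hashlib


class SingleInstanceStore:
    """Toy store (hypothetical): one copy per distinct payload, plus pointers."""

    def __init__(self):
        self.blobs = {}     # content digest -> payload bytes (stored once)
        self.pointers = {}  # (owner, name) -> content digest

    def put(self, owner, name, payload):
        digest = hashlib.sha256(payload).hexdigest()
        # Write the payload only if this content has not been seen before.
        self.blobs.setdefault(digest, payload)
        # Every owner still gets a pointer to the shared copy.
        self.pointers[(owner, name)] = digest

    def get(self, owner, name):
        return self.blobs[self.pointers[(owner, name)]]

    def stored_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())


MB = 1024 * 1024
attachment = bytes(3 * MB)  # stand-in for the 3 MB attachment

store = SingleInstanceStore()
for i in range(50):
    store.put(f"user{i}", "attachment.bin", attachment)

logical_mb = 50 * len(attachment) // MB   # what 50 separate copies would use
physical_mb = store.stored_bytes() // MB  # what the store actually holds
print(logical_mb, physical_mb)  # prints: 150 3
```

Each of the 50 recipients can still read the attachment through their own pointer, but the payload itself is held only once.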
Deduping can happen on two levels. File-level dedupe aims to eliminate identical files either within or across systems. For example, Windows servers all have the same executables, .dll files and so forth, and storing them only once can shrink the size of backups.
Block-level deduplication, on the other hand, works below the file level, and attempts to find patterns of ones and zeroes. It can use either fixed-length block sizes or logical, variable-sized blocks, according to Avamar founder Jedidiah Yueh, whose company's Axion product uses block-level deduping to reduce the space consumed by disk-based backups.
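A rough sketch of the fixed-length variant follows. It is not a description of how Axion or any other product works: the 4 KB block size, the function names and the SHA-256 fingerprint are assumptions chosen for illustration, and variable-sized chunking is not shown.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def dedupe_blocks(data, block_store):
    """Split data into fixed-size blocks and store each distinct block once.

    Returns the ordered list of block digests needed to rebuild the data.
    """
    recipe = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        block_store.setdefault(digest, block)  # keep only the first copy
        recipe.append(digest)
    return recipe


def rebuild(recipe, block_store):
    """Reassemble the original bytes from the stored blocks."""
    return b"".join(block_store[digest] for digest in recipe)


block_store = {}
original = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE
edited = b"A" * BLOCK_SIZE * 3 + b"C" * BLOCK_SIZE  # only the last block differs

recipe_original = dedupe_blocks(original, block_store)
recipe_edited = dedupe_blocks(edited, block_store)

# Eight blocks were written in total, but only three distinct ones are stored.
print(len(recipe_original) + len(recipe_edited), len(block_store))  # prints: 8 3
```

Unlike file-level dedupe, the two files here are not identical, yet the second one adds only a single new block to the store.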
Other vendors that use block-based deduping technologies include Diligent Technologies, Data Domain and Symantec (a recent entrant) with its Veritas NetBackup PureDisk. For the time being, analysts say their technology is generally pretty similar.
But which is better depends on the type of information and infrastructure at hand, says Marc Staimer, president and CDS of Dragon Slayer Consulting in Beaverton, Oregon.
"I would say the net effect is marginally different between them," Staimer said. For one thing, it depends on the data. "PowerPoint dedupes really well. Compressed data doesn't. Encrypted can't. You want encryption after deduping, not before…" Furthermore, some algorithms do a really good job deduping, but that may require a more complex database, or can be slower. "It's all over the map," he says.
Whatever the technology, the end result is a 20-70% savings in disk usage, and compression ratios of upwards of 20 to 1 in some cases, analysts say.
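Percentage savings and "N to 1" ratios describe the same thing in two ways, and the conversion is simple arithmetic. The short sketch below uses hypothetical helper names and assumes that a ratio means original size divided by stored size.

```python
def savings_from_ratio(ratio):
    """Fraction of disk saved for a given N-to-1 reduction ratio."""
    return 1 - 1 / ratio


def ratio_from_savings(savings):
    """N-to-1 reduction ratio for a given fraction of disk saved."""
    return 1 / (1 - savings)


# A 20 to 1 ratio means about 95% of the original space is saved.
print(round(savings_from_ratio(20) * 100))  # prints: 95

# Savings of 20% and 70% correspond to ratios of 1.25 to 1 and about 3.33 to 1.
print(round(ratio_from_savings(0.20), 2), round(ratio_from_savings(0.70), 2))
```

On that assumption, a 20 to 1 result sits well above the 20-70% range, which is consistent with it being reported only for some cases.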
Some users see more.
Using Avamar Axion, "Last night it was 1% new data, it backed up 258 gigs in two hours and 15 minutes. With Veritas [Backup Exec] 9.0, what we were using, it would have been 18 hours," said Michael Fair, network administrator at St. Peter's Hospital in Albany, N.Y., who added that his system has a capacity of about 2.5 terabytes. Fair also likes how Axion dedupes on the client, minimizing the amount of traffic that gets sent over the network.
Those kinds of results can even make storage endeavors such as WAN vaulting affordable, as well as disk-based backup and archiving.
SIS adoption can shrink disk usage and ease hardware expenses, but some, like Claude Lorenson, group product manager for Windows storage technologies at Microsoft, place more importance on reducing management costs, such as the administrator effort spent manually tracking duplicate files.
And while single instancing is an established technology within Exchange, it will eventually find its way to other applications as well, he said.
"Years from now, SIS will be the default way of managing multiple copies of files and will be an integral part of most users' backup and archiving strategy," Lorenson wrote in an email. "Having fewer files also will make the backup process more efficient."
Deduping fears unfounded
Despite its promise, some users may be wary of dedupe technology, fearing data loss or corruption.
Some of the cryptographic hash algorithms employed by block-level deduping schemes, for example MD5, have been cracked. Furthermore, it is feasible (although statistically unlikely) that the same MD5 hash can refer to two different sets of content.
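One commonly discussed safeguard against such a collision is to compare the actual bytes whenever two blocks produce the same hash, and to treat them as duplicates only if the bytes match. The sketch below illustrates that idea; the article's sources do not say whether any of the named products work this way, and the function names and the suffix scheme for colliding keys are assumptions made for the example.

```python
import hashlib


def md5_digest(block):
    return hashlib.md5(block).hexdigest()


def store_block(block, block_store, digest_fn=md5_digest):
    """Store a block under its digest, checking content when a digest matches.

    Returns the key under which the block's content is stored.
    """
    digest = digest_fn(block)
    key = digest
    suffix = 0
    while key in block_store:
        if block_store[key] == block:
            return key  # same bytes: a true duplicate, nothing new is written
        # Same digest, different bytes: a collision. Try the next slot.
        suffix += 1
        key = f"{digest}-{suffix}"
    block_store[key] = block
    return key


# Force collisions with a deliberately weak digest (the block's length),
# since real MD5 collisions are impractical to produce in a short example.
def weak_digest(block):
    return str(len(block))


blocks = {}
key_a = store_block(b"aaaa", blocks, weak_digest)
key_b = store_block(b"bbbb", blocks, weak_digest)        # collides with key_a
key_a_again = store_block(b"aaaa", blocks, weak_digest)  # true duplicate
print(key_a, key_b, key_a_again)  # prints: 4 4-1 4
```

The trade-off is an extra read and comparison on every apparent duplicate, in exchange for never pointing at the wrong content.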
But Lorenson and others agreed that data loss as a result of storing fewer files should not be an issue, and that there is no widespread fear among users.
Not surprisingly, guaranteeing data integrity is at the top of the list for every dedupe vendor. "It's a 110% focus," said Neville Yates, CTO of Diligent Technologies. "You can't go to somebody with multiple petabytes and say 'trust me I'm 99% sure.'"
"In general, it's not a real danger," Staimer said. "The possibility exists, but the software has the ability to make sure that corruption in a file can be fixed. Call it autonomous self-healing."
Must have or nice-to-have?
Analysts are split on the importance of deduplication. Robert Gray, research VP for worldwide storage systems research at IDC, downplayed the role of deduplication and SIS.
"I don't think it's an inevitability," Gray said. "Keeping extra copies of things is not the most desirable but it's not the most killer either. I think it's going to happen, but I think it'll be just one part of data management, and other aspects will have more payoff than say simply removing 45 or 50 PowerPoint files…. It's nice but not necessary."
But according to Staimer, major vendors will be forced to implement the feature within their product suites, should they hope to compete.
"Nothing too near-term but probably next year," Staimer said. "I would say that at that point in time they'd better have it. I can see customers saying 'No dedupe? Ok, next.' It's such a huge cost issue."
Let us know what you think about the story; e-mail: Joe Spurr, News Writer