Hadoop is fairly flexible, which simplifies capacity planning for clusters. Still, it's important to consider factors such as IOPS and compression rates.
The first rule of Hadoop cluster capacity planning is that Hadoop can accommodate changes. If you overestimate your storage requirements, you can scale the cluster down. If you need more storage than you budgeted for, you can start out with a small cluster and add nodes as your data set grows.
Another best practice for Hadoop cluster capacity planning is to consider data redundancy needs. One of the advantages of storing data in a Hadoop cluster is that it replicates data, which protects against data loss. These replicas consume storage space, which you must factor into your capacity planning efforts. If you estimate that you will have 5 TB of data, and you opt for the Hadoop default of three replicas, your cluster must accommodate 15 TB of data.
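The replication math above is simple enough to fold into a planning script. A minimal sketch, assuming the Hadoop default replication factor of 3 (the `dfs.replication` setting); the function name is illustrative:

```python
def raw_capacity_needed(data_tb: float, replication: int = 3) -> float:
    """Raw cluster storage needed when each block is stored `replication` times.
    Replication factor of 3 is the HDFS default (dfs.replication)."""
    return data_tb * replication

# 5 TB of data with three replicas requires 15 TB of raw capacity:
print(raw_capacity_needed(5))  # 15.0
```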
You must also consider overhead and compression in your Hadoop capacity planning efforts. Compression plays a major role in the amount of storage space that your Hadoop cluster will require, so it is essential to consider the type of data that it will store. Scientific data, Docker containers and compressed media lack significant redundancy, so they don't always benefit from compression. On the other hand, some data types might yield compression rates of 80% or more, especially text-heavy data.
Conduct experiments with compression
As you plan your cluster's storage requirements, experiment with a data sample to see how well it compresses. If you can compress a small data set by 50%, then you can project that the production data footprint will also be cut in half, although you might want to base your estimate on a slightly smaller percentage just to play it safe.
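That projection, including the "play it safe" padding, can be sketched as follows. The 5-percentage-point safety margin is an illustrative assumption, not a rule:

```python
def projected_footprint(raw_tb: float, sample_ratio: float,
                        margin: float = 0.05) -> float:
    """Project the compressed footprint of `raw_tb` of data from a sample
    that compressed to `sample_ratio` of its original size, padding the
    ratio by a safety margin (an assumed 5 points here)."""
    return raw_tb * min(sample_ratio + margin, 1.0)

# A sample that compresses by 50%, projected conservatively at 55%:
print(projected_footprint(10, 0.50))  # 5.5
```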
A Hadoop cluster might store more than one type of data, in which case you must treat compression as a percentage of a percentage.
Suppose that a particular data set is compressible to 40% of its original size in a lab setting, and assume that data set makes up 20% of the cluster's overall 10 TB of storage. If you do the math, 20% of 10 TB is 2 TB. When you shrink that data to 40% of its original size, this particular data set occupies 0.8 TB of space without accounting for overhead or redundancy.
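The percentage-of-a-percentage arithmetic above works out like this:

```python
total_tb = 10.0      # cluster's overall data set
share = 0.20         # this data set's share of the total
compress_to = 0.40   # compresses to 40% of its original size

subset_tb = total_tb * share              # 20% of 10 TB -> 2.0 TB
compressed_tb = subset_tb * compress_to   # 40% of 2 TB -> 0.8 TB
print(compressed_tb)  # 0.8
```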
Account for IOPS
Storage requirements are also defined in terms of IOPS. A highly transactional data set requires the cluster's storage to handle far more IOPS than a large but relatively static data set does.
If you expect the cluster's data set to be read/write-intensive, you must account for IOPS in your Hadoop cluster capacity planning efforts. The easiest way to accommodate a high number of IOPS is to use flash storage, but that may not always be the best choice because of cost and write durability issues.
Rather than using flash, you can spread storage across more disks. If you estimate that your Hadoop cluster will require 12 TB of storage, then you can get more IOPS from four 3 TB drives than you would from three 4 TB drives.
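Because random IOPS scale roughly with spindle count, the comparison above can be sketched as follows. The figure of 100 IOPS per spinning disk is an illustrative assumption (in the ballpark of a 7,200 RPM drive), not a measured value:

```python
def aggregate_iops(num_drives: int, iops_per_drive: int = 100) -> int:
    """Rough aggregate random-IOPS estimate for a set of identical drives.
    100 IOPS per disk is an assumed placeholder, not a benchmark result."""
    return num_drives * iops_per_drive

# Same 12 TB of capacity, two layouts:
print(aggregate_iops(4))  # four 3 TB drives  -> 400
print(aggregate_iops(3))  # three 4 TB drives -> 300
```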