A Short History of Data Storage
Computers are an incredible technology, but they can do very little of substance without data storage. To do useful work, computers need some form of persistent media: a repository that stores the results of data processing for long-term retention (by which we mean more than a few milliseconds).
Modern computer systems use a form of the von Neumann Architecture, named after the design proposed by John von Neumann in 1945. The von Neumann Architecture has four main components: a processing unit, a memory unit, an input device and an output device (in many cases, the input and output devices are the same hardware).
Both data and instructions are loaded from the input device, with the results of calculations stored on the output device. System memory has always been volatile, with its contents lost when power is removed, so persistent storage is needed to retain any new data created during processing.
Modern computing systems generally implement a modified von Neumann Architecture in which data and instructions are stored in separate caches within the processor, but the design proposed eighty years ago remains at the heart of every computing device we use today (except perhaps for quantum machines).
Paper, Tape, Disk and Flash
Of course, the first persistent storage bore little resemblance to that in use today. Early computing systems used paper tape as input, with metal tape and eventually magnetic tape following later. IBM introduced the first hard disk drives in 1956, with a massive (for the time) 5MB capacity and a monthly rental of $20,000.
Disk drives miniaturised in the 1970s and 1980s, settling on a form factor developed by Al Shugart (the founder of Seagate) that is still in use today. Hard disk drives, or HDDs, quickly became commodity devices, with a rapid increase in capacity and an expansion of performance capabilities (which we will discuss later).
NAND flash was invented by Toshiba in the 1980s, but only became commercially practical for enterprise computing in the mid-2000s. Apple drove down NAND prices by using the technology in consumer iPod devices, with EMC introducing the first Enterprise Flash Drives in 2007.
For the following half decade there was a Cambrian explosion of flash systems and designs, led by Pure Storage, Violin Memory, Hitachi Data Systems, IBM (following the acquisition of Texas Memory Systems) and others. However, as in nature, many of these vendors and products died off, leaving only a few survivors that dominate the market today.
The Next Generation
In 2015, Intel attempted to leapfrog NAND flash with the introduction of Optane, based on 3D XPoint technology. This new storage medium offered persistence at close to system memory speeds and capacities. Unfortunately, the company couldn’t fully commercialise the technology and it was discontinued in 2022. Micron, Intel’s 3D XPoint co-developer, never released a widely available commercial product based on the technology.
Today there are many other persistent storage technologies either available or in development, many using the phase-change technique employed in 3D XPoint. However, despite this wealth of new media choices, the death of earlier storage media, including disk and tape, has been frequently predicted but has yet to occur. We will look in depth at the reasons for this in a later chapter.
Systems
So far, we have mainly discussed storage media in the form of hard disk and solid-state drives. However, large-scale computing environments demand resiliency and high availability that a single drive can’t provide. As a result, we need to take the base storage components and build storage systems that include the advanced data services needed for enterprise computing.
IBM built the first “SLED” disk subsystems (as we mentioned earlier) during the reign of the mainframe. These “Single Large Expensive Disks” dominated the market until the development of RAID (Redundant Array of Inexpensive Disks) data protection. From this point on, storage vendors could build systems from racks of inexpensive disks, using software to implement data protection in the event of a drive failure. As a side note, it’s worth highlighting that if a SLED failed, there was a risk of data loss, so frequent backups were essential.
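To make the idea concrete, the sketch below illustrates the XOR parity principle behind RAID 5-style protection. It is a minimal illustration in Python rather than any vendor's implementation: real arrays stripe data and rotate parity across many drives, but the principle of rebuilding a lost block from the survivors is the same.

```python
# Minimal sketch of the XOR parity idea behind RAID 5-style protection.
# This is an illustration, not any vendor's implementation: real arrays
# stripe data and rotate parity across many drives in hardware or firmware.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

# Three data blocks striped across three drives, with parity on a fourth.
data_blocks = [b"AAAAAAAA", b"BBBBBBBB", b"CCCCCCCC"]
parity = xor_blocks(data_blocks)

# Simulate the failure of the second drive and rebuild its block
# by XOR-ing the surviving data blocks with the parity block.
surviving = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(surviving)

assert rebuilt == data_blocks[1]
print("Rebuilt block:", rebuilt)
```

The trade-off is that every write also requires the parity to be updated, which is one reason the various RAID schemes balance capacity overhead against write performance differently.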
Storage systems, variously known as arrays or appliances, were first introduced by EMC Corporation in 199X. The hardware advanced quickly in the 2000s, with the introduction of the Fibre Channel protocol, initially using copper and then with optical interconnects. Fibre Channel drove the development of SANs, or Storage Area Networks, with the capability to attach hundreds of individual servers to a single storage array.
With data concentrated in one system, storage vendors were able to add features such as data optimisation and reduction, along with data protection technologies such as snapshots, while guaranteeing a more consistent performance level for all applications. Eventually, quality of service, or QoS, was introduced to tier performance levels.
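As one example of such a data service, the sketch below shows the basic idea behind a snapshot: a point-in-time view of a volume that shares unchanged data with the live copy. It is a simplified, assumed model in Python (the class and method names are illustrative), not how any particular array implements the feature.

```python
# Simplified model of a point-in-time snapshot (names are illustrative,
# not any particular array's implementation). A snapshot freezes a copy of
# the volume's block map; the underlying data is shared with the live volume
# until a block is overwritten.

from typing import Dict, List, Optional

class Volume:
    def __init__(self) -> None:
        self.blocks: Dict[int, bytes] = {}   # logical block address -> data
        self.snapshots: List[Dict[int, bytes]] = []

    def write(self, lba: int, data: bytes) -> None:
        # New writes only change the live block map; frozen maps are untouched.
        self.blocks[lba] = data

    def snapshot(self) -> int:
        # Copy the map (metadata), not the data blocks themselves.
        self.snapshots.append(dict(self.blocks))
        return len(self.snapshots) - 1

    def read(self, lba: int, snap_id: Optional[int] = None) -> Optional[bytes]:
        source = self.blocks if snap_id is None else self.snapshots[snap_id]
        return source.get(lba)

vol = Volume()
vol.write(0, b"original")
snap = vol.snapshot()
vol.write(0, b"updated")

print(vol.read(0))        # b'updated'  -- the live volume
print(vol.read(0, snap))  # b'original' -- the point-in-time copy
```

Because the snapshot only references existing data, it can be created almost instantly and consumes additional space only as the live volume diverges from it.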
The benefit of centralising storage through a SAN was the removal of hard disk drives from individual servers. This reduced the maintenance overhead (imagine someone having to do a daily trawl of servers replacing failed drives), reduced mistakes (such as replacing the wrong drive or choosing the wrong server), improved efficiency and improved performance.
Software-Defined
The appliance model worked well for centralised storage during the 2000s. However, storage systems were large, power-hungry and heavy, mainly based on custom components and shipped in custom 42U racks that demanded 3-phase power supplies.
At the heart of all storage systems, software controlled the delivery of data to applications and the reading and writing of data to disk. Logically, there was no reason why storage arrays couldn’t be built using standard servers, and this option emerged in the late 2000s as an architectural choice for IT teams.
Software-Defined Storage, or SDS, enabled the decoupling of storage software from the underlying hardware. End users could choose to buy the hardware independently, purchasing only the software from the vendor. The ability to deploy SDS was, in part, due to the ever-improving reliability of servers and storage components.
However, storage vendors always did more than simply bundle components and software. They guaranteed compatibility between the software and hardware, which had been rigorously tested in vendor labs. With SDS, the customer became responsible for that task, which could be a challenge: SDS deployments can surface “corner-case” issues, for example with particular combinations of hardware components and firmware.
We will dig into SDS more in a later chapter.
Modern Storage
The storage industry has remained relatively stable over the last decade and a half, with the appliance model continuing to dominate the market. Most modern deployments are all-flash systems, with some hybrid configurations. Purely HDD-based systems still exist, but are increasingly relegated to relatively inactive or archive data (such as backups).
Although block-based protocol appliances dominated early storage systems, both object and file storage now figure heavily in the mix within a modern data centre. Arguably, the growth in these systems is greater than for block, as unstructured data feeds modern applications such as AI.
We may yet see further advances in storage systems as the performance demands of AI and data analytics continue to increase. Media evolution continues apace, with new technologies such as ceramic and DNA storage to consider.
The Public Cloud
Finally, we should mention the availability of storage in the public cloud, both as an integral part of workloads running in the cloud and as a destination for data from on-premises systems.
The types of storage available in the public cloud mirror the options available on-premises. Customers can choose to consume block, file and object-based protocols, although block-based devices are not generally made available outside of the cloud computing environment.
The end user has no visibility into how the storage resources are delivered. Instead, storage is purchased by capacity, with variable charges for increased availability and performance.
The public cloud pioneered the mass adoption of object storage, initially with AWS Simple Storage Service (S3), whose API has become the de facto standard adopted by the industry. Both file and object storage in the public cloud can be accessed by on-premises systems, as the protocols operate over standard Ethernet connections. Note that there is no technical reason why block-based storage couldn’t also be accessed remotely using iSCSI or NVMe/TCP, but the latency introduced over a wide-area connection makes the scenario impractical.
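To illustrate how simple the object model is to consume, here is a hedged example of writing and reading an object through the S3 API using the boto3 library. The bucket name and object key are hypothetical, and AWS credentials are assumed to be configured in the environment; many other cloud and on-premises object stores expose a compatible API.

```python
# Hedged example: writing and reading an object via the S3 API with boto3.
# The bucket name and key are hypothetical, and AWS credentials are assumed
# to be configured in the environment (e.g. via ~/.aws/credentials).

import boto3

s3 = boto3.client("s3")
BUCKET = "example-company-archive"   # hypothetical bucket name

# Store an object: data is addressed by a key rather than a file path or LBA.
s3.put_object(Bucket=BUCKET, Key="reports/2024/q1.csv", Body=b"id,value\n1,42\n")

# Retrieve it over HTTPS from any client with network access to the endpoint.
response = s3.get_object(Bucket=BUCKET, Key="reports/2024/q1.csv")
print(response["Body"].read().decode())
```

Because the whole exchange happens over HTTP(S), the same code runs unchanged from a cloud instance or an on-premises server, which is part of why object storage works well as a bridge between the two environments.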
Future Challenges
Much of the design of storage systems has remained the same for decades, especially the protocols used by application servers to store and retrieve data. This position has been maintained for good reason, as data storage systems need to deliver 100% reliable access to store and retrieve information.