A Data Storage Taxonomy
Data storage is a surprisingly large and complex field. Everything starts with media, from which complex systems are built. Beyond the hardware, there is a vast array of processes, best practices and procedures for operating those systems. Putting all these pieces into a structured taxonomy is no mean feat.
The Storage Hierarchy
A good starting point is to look at the "Storage Hierarchy", a graphical representation of the types of storage media available for consumption across the enterprise. A typical visualisation of the hierarchy is shown in Figure X. Media are rated with respect to their performance and capacity, with the slowest and most capacious devices at the bottom and the fastest (and, arguably, lowest-capacity) devices at the top.
The hierarchy exists because the cost of media also varies. In a similar vein, the cheapest (per GB) devices are generally at the bottom of the stack, while the most expensive are at the top. Ideally, system memory would be used to store all programs and data, but that design is impractical and too expensive. RAM (Random Access Memory) has no inherent resiliency and, of course, its contents are lost when the power goes off. NAND flash storage, for example, is fast and has low latency, but is relatively expensive compared to hard disk drives and far more expensive than tape media. If data isn't being accessed, then keeping it on (relatively) expensive storage makes no commercial sense.
With these multiple vectors in mind, data storage has developed into a series of tiers that are used to store data throughout its lifetime. Historically, we assumed that data would be most active just after creation, diminishing in activity and value over time. As we will discuss later, this model is no longer accurate due to the pervasiveness of analytics, including machine learning and AI.
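As a rough illustration of tiering, the sketch below places data according to how recently it was accessed rather than when it was created. It is a minimal sketch, not a recommended policy: the tier names simply echo the media discussed above, and the idle-time thresholds and the `choose_tier` function are hypothetical values and names chosen for the example.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds, chosen only for illustration: (max idle time, tier).
# Tiers are ordered from fastest/most expensive to slowest/cheapest.
TIER_POLICY = [
    (timedelta(days=7), "flash"),
    (timedelta(days=90), "hard disk"),
]
COLDEST_TIER = "tape"


def choose_tier(last_access: datetime, now: datetime) -> str:
    """Return the tier for data last accessed at `last_access`."""
    idle = now - last_access
    for max_idle, tier in TIER_POLICY:
        if idle <= max_idle:
            return tier
    # Nothing matched: the data has been idle longer than every threshold.
    return COLDEST_TIER


now = datetime(2024, 1, 31)
print(choose_tier(datetime(2024, 1, 30), now))  # idle for 1 day
print(choose_tier(datetime(2023, 12, 1), now))  # idle for 61 days
print(choose_tier(datetime(2022, 1, 1), now))   # idle for over two years
```

Keying the policy on last access rather than creation date means that old data which becomes active again, for example when pulled into an analytics job, would be placed back on a faster tier.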
The idea of a pyramidal structure for the hierarchy is also a point of debate. The diminishing size of each layer is meant to represent the relative quantities of each media type deployed in a traditional enterprise. However, we may see NAND flash overtake hard disk drives across the enterprise. Arguably, for many businesses, that point has already been passed. The traditional pyramid may turn out to be more of an hourglass, with the legacy technology of hard disk drives in the squeezed middle. Alternatively, the hierarchy may be different for every business and require multiple sample visualisations.
Another way to view storage media and systems is as a continuum. Within the NAND flash market, for example, there is now a wide range of products that address differing requirements for endurance, performance, capacity and cost. A similar distribution used to exist within the hard drive market, but the rapid evolution of solid-state storage has pushed that market to compete on capacity and cost alone.
Systems and media can be placed on a linear continuum or on a 2D representation, with multiple choices of axes representing some of the characteristics we have already discussed. We will use the concept of the continuum within this documentation as a way to explain the range of solutions and products now available in the market.
A Storage Systems Taxonomy
Storage systems form the vast majority of data storage deployed in the enterprise. Over the last four decades, we have seen storage move from centralised mainframe connectivity, initially as directly attached SLEDs (Single Large Expensive Disks), through a highly distributed DAS (direct-attached storage) model where storage was installed in individual servers, to the current market where centralised or SAN (storage area network) storage is the norm.
Figure X shows our mapping of the Data Storage Kingdom (part of the Infrastructure Domain). We divide the Kingdom into two divisions - Products and Process. In turn, Products is subdivided into three Classes - Hardware-Centric products, Software-Centric products and Service-Centric products. Taking the Hardware-Centric Class as an example, this is divided into three Families - Storage Media, Storage Systems and Storage Peripherals.
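The levels named above can also be written down as a small nested data structure. This is only a sketch of the mapping as described in the text: the `TAXONOMY` name and the `paths` helper are hypothetical, and levels that the text does not enumerate are deliberately left empty rather than filled in.

```python
# Nested mapping: Kingdom -> Divisions -> Classes -> Families.
# Only entries named in the text are filled in; empty containers mark
# levels the text does not enumerate.
TAXONOMY = {
    "Data Storage": {
        "Products": {
            "Hardware-Centric": [
                "Storage Media",
                "Storage Systems",
                "Storage Peripherals",
            ],
            "Software-Centric": [],
            "Service-Centric": [],
        },
        "Process": {},
    }
}


def paths(node, prefix=()):
    """Yield every path from the root down to each leaf or empty level."""
    if isinstance(node, dict):
        if not node:
            yield prefix
        for name, child in node.items():
            yield from paths(child, prefix + (name,))
    else:  # a list of Families
        if not node:
            yield prefix
        for family in node:
            yield prefix + (family,)


for path in paths(TAXONOMY):
    print(" > ".join(path))
```

Walking the structure prints one line per branch, which makes it easy to check that each product or process has exactly one place in the taxonomy.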
Products and Process
We subdivide the Data Storage Kingdom into two divisions - Products and Process. Products covers the physical and logical storage products sold to businesses, whether as complete systems or as components and media. Process refers to the operational aspects of implementing and managing data storage, including best practices, standard operational processes and system administration.
Products are divided into three Classes.
- Hardware-Centric - products and solutions sold as physical hardware. This can include storage media (hard drives, SSDs), storage peripherals (RAID cards, SmartNICs) and storage systems (appliances or storage arrays).
- Software-Centric - software-based solutions, which can include the operating system deployed on storage hardware, software-defined storage (SDS) and file systems.
- Service-Centric - solutions offered to customers and consumed through a cloud-like model, either in the public cloud or on-premises.
Each of these categories is now defined in more detail.
Hardware-Centric Products
Over the last sixty years, most of the data storage industry has been focused on selling physical products. These include storage media (hard drives, SSDs, tape), storage systems (appliances or arrays) and storage peripherals (such as RAID cards). Customers choose to acquire or lease-purchase hardware products, depending on the cost and the financial options available to amortise that cost over time.
Software-Centric Products
When first brought to market, storage appliances typically incorporated software that was directly integrated into the hardware. Modern systems generally separate out the software component from the hardware as the industry has gradually transitioned from bespoke hardware designs to using off-the-shelf components. In turn, the emergence of software-defined storage in the late 2000s enabled businesses to deploy self-purchased hardware with storage software to build custom solutions. Software-defined storage remains a large part of the market, particularly with open-source solutions.
Service-Centric Products
The public cloud gave businesses the ability to consume data storage resources without the need to build and deploy physical storage infrastructure. Modern "Storage-as-a-Service" (STaaS) solutions are generally accepted to have started with the introduction of Amazon Web Services' Simple Storage Service (S3) in 2006. Data is stored and accessed through a simple web-based (HTTP) interface without the customer needing to know how or where the data is stored. STaaS solutions now encompass data backup, file and object storage and databases, charging customers through hardware-abstracted metrics such as capacity, performance and availability.
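The shape of that interface can be sketched with a toy, in-memory stand-in. This is not the S3 API or any vendor's client library: `ToyObjectStore` and its methods are hypothetical names, and the only point illustrated is that callers address data by bucket and key, using operations that correspond loosely to HTTP PUT, GET and DELETE, with no visibility into where or how the bytes are held.

```python
class ToyObjectStore:
    """In-memory stand-in for an object storage service (illustrative only)."""

    def __init__(self):
        # The caller never sees this dict, mirroring how a service hides
        # where and how objects are physically stored.
        self._objects = {}

    def put(self, bucket: str, key: str, data: bytes) -> None:
        """Loosely analogous to an HTTP PUT: create or overwrite an object."""
        self._objects[(bucket, key)] = bytes(data)

    def get(self, bucket: str, key: str) -> bytes:
        """Loosely analogous to an HTTP GET: return the whole object."""
        try:
            return self._objects[(bucket, key)]
        except KeyError:
            raise KeyError(f"no such object: {bucket}/{key}") from None

    def delete(self, bucket: str, key: str) -> None:
        """Loosely analogous to an HTTP DELETE; a missing key is ignored here."""
        self._objects.pop((bucket, key), None)


store = ToyObjectStore()
store.put("reports", "2024/q1.csv", b"region,total\n")
print(store.get("reports", "2024/q1.csv"))
```

Note that the key "2024/q1.csv" only looks like a path. In this sketch, as in object storage generally, the namespace within a bucket is flat and the key is an opaque string.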
The centralised or ICDA (Integrated Cached Disk Array) systems started as a collection of commodity hard disk drives, linked together with compute capability and system memory. EMC Corporation released the first ICDA (Symmetrix) in 1990, using RAID-1 mirroring. System memory provided the capability to implement both read and write caches, mitigating some of the performance challenges of spinning media. For example, write I/O could be accumulated in a protected system memory cache, then flushed to disk later in a more efficient pattern. Meanwhile, the application receives acknowledgement that the I/O has completed as soon as the data is in cache, dramatically improving performance.
The first storage systems supported block-based protocols, specifically SCSI, with a limited number of fixed connections. In 1992, Network Appliance, Inc. (now NetApp, Inc.) developed a storage appliance supporting NAS (Network Attached Storage) protocols. Object storage became popular during the late 2000s, introducing a third way to interact with centralised storage.
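The write-caching behaviour described above can be sketched as a toy write-back cache. This is a simplified model, not a description of how Symmetrix or any real array is implemented: the `WriteBackCache` class, its `flush_threshold` parameter and the block numbers are hypothetical, and the sketch assumes the cache memory is protected against power loss, as the text states.

```python
class WriteBackCache:
    """Toy write-back cache in front of a slow 'disk' (illustrative only)."""

    def __init__(self, flush_threshold: int = 4):
        self.disk = {}         # block number -> data, stands in for spinning media
        self.dirty = {}        # writes acknowledged but not yet on disk
        self.flush_threshold = flush_threshold
        self.disk_writes = 0   # counts physical writes for comparison

    def write(self, block: int, data: bytes) -> str:
        # Repeated writes to the same block coalesce in memory.
        self.dirty[block] = data
        # In this toy a threshold flush runs inline; a real system would
        # destage in the background and acknowledge without waiting.
        if len(self.dirty) >= self.flush_threshold:
            self.flush()
        return "ack"

    def read(self, block: int):
        # Serve from cache first so readers see the latest acknowledged write.
        if block in self.dirty:
            return self.dirty[block]
        return self.disk.get(block)

    def flush(self) -> None:
        # Destage in block order, approximating one sequential pass over the disk.
        for block in sorted(self.dirty):
            self.disk[block] = self.dirty[block]
            self.disk_writes += 1
        self.dirty.clear()


cache = WriteBackCache(flush_threshold=4)
for block in (7, 3, 7, 7, 1):   # five writes, three distinct blocks
    cache.write(block, b"v")
cache.flush()
print(cache.disk_writes)        # 3 physical writes for 5 acknowledged writes
```

The saving comes from two effects: overwrites of the same block are absorbed in memory, and the remaining writes reach the disk in sorted order rather than in arrival order.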