Exploring Storage Architectures
I started contributing to the Thanos project a while back, and one of my first tasks was to learn about the different ways we could store files. I knew about file storage, S3, and a couple of other object storage implementations, but I had never really thought about how they relate to each other, or why you would use one over the other.
Learning about other architectures turned out to be a fun rabbit hole with a whole lot of new things to learn - depending on how deep you want to go, of course. Here are some of the things I learnt.
Almost all computer users are familiar with file storage. It stores files in folders, which are in turn stored in other folders, forming a hierarchy. File storage shines when you have data that can be easily organized. It especially makes sense when you have a mix of structured and unstructured data (e.g. on a web host that is both serving web pages and storing some amount of user-generated media). Data can also be shared easily (users/servers mostly just need access to the same hard drive or network-attached storage).
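To make the hierarchy concrete, here is a minimal Python sketch of the web host example. The directory layout and file names are made up for illustration, and everything lives in a temporary directory.

```python
import tempfile
from pathlib import Path


def build_tree(root: Path) -> None:
    """Create a small, hypothetical web-host layout under root."""
    (root / "site" / "pages").mkdir(parents=True)
    (root / "site" / "uploads" / "alice").mkdir(parents=True)
    # Structured content (a web page) and unstructured content (user media)
    # live side by side in the same hierarchy.
    (root / "site" / "pages" / "index.html").write_text("<h1>Hello</h1>")
    (root / "site" / "uploads" / "alice" / "cat.jpg").write_bytes(b"fake-jpeg-bytes")


def list_files(root: Path) -> list:
    """Return every file path relative to root, sorted."""
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_tree(root)
    files = list_files(root)

print(files)
```

Finding a file means walking from folder to folder along its path, which is exactly what gets harder as the tree grows.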
At a larger scale, file storage becomes less appealing. Navigating the hierarchy of directories and sub-directories gets harder as the number of files grows. Also, hard drives eventually need to be replaced with higher-capacity ones to keep I/O latency down.
Object storage stores data as “objects”, where an object is a combination of the data itself and accompanying metadata set by the developer/administrator. It is typically used to store large, unstructured data and static files that do not change frequently (think compacted logs, video files, images, etc.) in a scalable way.
Objects are identified by unique keys, which are either assigned by the client or computed from the data/metadata, depending on the system. The metadata here is more descriptive than that of file storage, as it can be customized to add more context beyond just filenames and creation dates. While object storage isn’t exactly as performant as the other architectures, it works fine, especially at scale where file and block storage begin to struggle.
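As a rough illustration of the "key + data + metadata" idea, here is a toy in-memory object store in Python. `ToyObjectStore`, its methods, and the keys and metadata fields are all hypothetical names for this sketch, not the API of S3 or any other real service.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    data: bytes
    metadata: dict = field(default_factory=dict)


class ToyObjectStore:
    """A flat namespace mapping key -> (data, metadata). No real directories."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, **metadata):
        # Custom metadata travels with the object itself.
        self._objects[key] = StoredObject(data, dict(metadata))

    def get(self, key):
        return self._objects[key]

    def list_keys(self, prefix=""):
        # Slashes in a key are just characters; filtering by prefix is what
        # gives the appearance of folders.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = ToyObjectStore()
store.put("logs/2020/01/compacted.gz", b"gzip-bytes",
          content_type="application/gzip", retention="90d")
store.put("media/intro.mp4", b"video-bytes",
          content_type="video/mp4", uploaded_by="alice")

obj = store.get("media/intro.mp4")
print(obj.metadata["uploaded_by"])      # alice
print(store.list_keys(prefix="logs/"))  # ['logs/2020/01/compacted.gz']
```

There is no tree to walk here: every lookup goes straight from key to object, which is part of why this model scales out more easily than a directory hierarchy.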
In distributed systems, you'd expect eventual consistency from object storage (as it would typically pick availability and partition tolerance over consistency in the CAP theorem), but it also depends on the provider. For instance, Google provides strong consistency for most Google Cloud Storage operations, with exceptions such as granting or revoking access to resources. Amazon’s S3, on the other hand, historically guaranteed strong consistency only when creating new objects (overwrites and deletes were eventually consistent); since December 2020 it provides strong read-after-write consistency for all object writes and deletes, while bucket configuration changes remain eventually consistent.
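The difference between the two guarantees is easier to see in a toy model. The sketch below assumes a single primary and a single replica that only catches up when `sync()` is called; `ToyReplicatedStore` and its method names are invented for this example and do not describe how any provider actually replicates data.

```python
class ToyReplicatedStore:
    """One primary and one replica; writes reach the replica only on sync()."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self._pending = []  # writes not yet applied to the replica

    def put(self, key, value):
        self.primary[key] = value
        self._pending.append((key, value))

    def read_strong(self, key):
        # Served by the primary, so it always reflects the latest write.
        return self.primary.get(key)

    def read_eventual(self, key):
        # Served by the replica, which may lag behind the primary.
        return self.replica.get(key)

    def sync(self):
        # Apply queued writes to the replica in the order they happened.
        for key, value in self._pending:
            self.replica[key] = value
        self._pending.clear()


store = ToyReplicatedStore()
store.put("report.csv", "v1")
store.sync()
store.put("report.csv", "v2")  # overwrite, not yet replicated

stale = store.read_eventual("report.csv")      # still "v1"
fresh = store.read_strong("report.csv")        # already "v2"
store.sync()
converged = store.read_eventual("report.csv")  # now "v2"
print(stale, fresh, converged)
```

An eventually consistent read can briefly return the old version after an overwrite, but once replication catches up every reader sees the same value.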
Examples of object storage implementations include Amazon’s S3 (its API is almost a standard at this point), OpenStack Swift, Google Cloud Storage, and Ceph (through the RADOS Gateway, which is compatible with both the S3 and OpenStack Swift APIs).
Key-value (KV) stores were kind of tricky, as I couldn’t really tell how they differ from regular object storage at first. They are similar in that both use keys/unique identifiers to pick out data in a sea of data. Also, both keys and data (or values) can be arbitrary strings of arbitrary sizes. Unlike object storage though, KV stores don't usually store extra metadata alongside their values. Also, by design, KV stores expect values to be smaller (relative to data in object stores), and it makes sense to expect strong consistency from them (Redis, for instance, tries to get closer to strong consistency with the “WAIT” command, while DynamoDB lets you pick between strongly and eventually consistent reads but defaults to the latter).
Block storage has the best performance and lowest latency of the bunch. This efficiency makes it perfect for workloads like databases and boot volumes. It stores data on storage area networks by breaking files into blocks, each with its own Logical Block Number (LBN), and storing the blocks separately. During reads, the system reassembles the file from its blocks based on their LBNs and presents it.
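The split-and-reassemble idea can be sketched in a few lines of Python. The 4-byte block size is deliberately tiny so the output is readable, and the helper names `split_into_blocks` and `assemble` are made up for this example; real block devices work at a much lower level than a Python dictionary.

```python
BLOCK_SIZE = 4  # bytes; unrealistically small, chosen only for readability


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> dict:
    """Map each logical block number (LBN) to its slice of the data."""
    return {
        lbn: data[offset:offset + block_size]
        for lbn, offset in enumerate(range(0, len(data), block_size))
    }


def assemble(blocks: dict) -> bytes:
    """Rebuild the original bytes by reading blocks in LBN order."""
    return b"".join(blocks[lbn] for lbn in sorted(blocks))


payload = b"hello block storage"
blocks = split_into_blocks(payload)

# Blocks can sit anywhere on the device; only the LBN matters for reads.
scattered = dict(reversed(list(blocks.items())))
restored = assemble(scattered)

# A single block can be rewritten without touching the others.
scattered[0] = b"HELL"
updated = assemble(scattered)

print(restored)  # b'hello block storage'
print(updated)   # b'HELLo block storage'
```

Being able to rewrite one small block in place, rather than the whole file or object, is one plausible reason this model suits databases so well.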
While this barely scratches the surface of each of these data storage architectures, it was pretty interesting having to think about what they are and where one might want to use them. I’m also hoping to learn some more about object storage in particular in the coming weeks, as my current feature for the Thanos project will use it quite heavily.