Data on disk

I’ve started reading Database Internals, a very interesting book about how a database systems actually work. Why? Apart from being interested in how a vital tool in Software Engineering is doing its job, a Backend software system is really a big Database system, that is, data comes in, data is processed, and data goes out. So learning, and seeing, advanced techniques for data management and associated tasks is always going to be helpful for what I do.

The first chapter has shown me two very interesting things about how databases store data on disk (hence the title :P).

Blocks

The first thing to know is that hard disks have some physical limitations, and those limitations permeate their way up through the disk controllers, file systems, and into operating systems. This is called a block (data storage).

A DBMS, regardless of it being a Document Database (like MongoDB, etc), NoSQL (Redis, Couchbase, ScyllaDB, etc) or Relational (Postgres, Mysql, Oracle, etc) stores data on disk and is constrained by those blocks.

Layout

With that now established we can think about the data the DBMS has to store. The data can be stored either Row oriented, or Column oriented. That is when the data is stored on disk, a file, which is made up of blocks, is filled with the data from one row, or the row is split, by field, so that each file is a column of data.

The two techniques offer different pros and cons. The row approach means that, in order for a query to get filed information for a number of rows, all the data from the rows will be read. This means that a query like

SELECT firstname FROM users

Will require every row from the users table files to be read and transferred to the CPU from disk.

The column approach will only require the data from the table files that contain the firstname data.

However, if the query is more row oriented, eg.

SELECT * FROM users WHERE id = 1

Then a row oriented layout will only need the file with that row in it.

The column oriented layout will need every file with column data for that row.

It’s clear that the nature of the queries should determine which layout should be used.

New Data

As data is added to the DBMS, a question on where to put it in the files needs to be answered.

There are three strategies, each with pros/cons to consider.

Index format

You can think of this as having similar characteristics to an array. The data is stored at the index, that is, the 0th (assume) row is placed in the 0th position of the file (think offsets), the 4th row is in the 4th position, and so on.

The index format is fast for search, with a O(1), because you know where in a file the data is stored, but compared to Heap format, can be an inefficient use of disk.

Hash format

You can think of this as having similar characteristics to a map, or associative array. The location of the data is determined by a hash calculation.

The result is, like index format a search is O(1), but the cost of the hash computation needs to be taken into account, and how collisions are handled. This format is the worst for disk efficiency because there is no guarantee that any of the calculated hashes will be next to any others in the same file, and there is no guarantee that all the possible hashes for a file will be ever used.

Heap format

There is no expected ordering of data in a file, however the most common implementation is to append each new piece.

The heap format requires an external index which means that a search for a given row can be requires two disk IO operations per read. As well there is the cost of maintaining that index during other operations.

However this is the most efficient use of disk, as every row will abut the previous one.

Summary

The database book is providing a wealth of understanding, with these two facts spelled out I feel a lot more confident reasoning about how the data is stored affecting the amount of work for reading/writing/updating data in a given DBMS, whether that be a RDBMS with tabular data, or an Object DBMS.

Shane Howearth

Home

About