
NoSql - How is the data stored?

  • Writer: Prashant Kumar
  • Jan 21, 2023
  • 10 min read

NoSql databases have gained a lot of popularity over the years and have become a viable choice for building highly scalable and performant systems.

There are many flavours of NoSql databases:

  1. Document databases - MongoDB, CouchDB, Amazon DocumentDB

  2. Key-value stores - Amazon DynamoDB, Aerospike, Riak, Redis

  3. Graph databases - Neo4j, GraphBase

  4. Column-oriented databases - Cassandra, HBase

As an application developer, why should you even bother knowing how data is stored in NoSql databases? The answer is simple: first, to make better, more informed decisions when choosing a database, and second, when application mayhem is caused by database issues, this knowledge can serve as a guiding light pointing you in the right direction.


Basics

Before we get into the elaborate designs of NoSql storage engines, which are customised for different types of databases and vendors, let's begin with a very basic example of a storage engine. Let's call it basicFileDb, as it uses basic file read and write operations. We will use the following bash functions for writes and reads.

Write Bash function:

set() {
    # append "<key>,<value>" as a new line at the end of file_db
    echo "$1,$2" >> file_db
}

The set function takes two parameters, $1 (a key) and $2 (a value), and appends them to the file_db file. Keys are normally a keyword, a string or a number, and values can be a string or a JSON document. Let's go with JSON for better data representation. For example:

set 10 '{"name": "Bill", "age": "40"}'
set 11 '{"name": "Steve", "age": "45"}'
set 11 '{"name": "Steve", "age": "46"}'

Read Bash function:

get() {
    # print the value of the most recent entry for the key passed as $1
    grep "^$1," file_db | sed -e "s/^$1,//" | tail -n 1
}

Bash commands being concise, multiple things are happening in this one statement of the get() method. get() takes only one parameter, $1, which is the key we're searching for. The first part, grep "^$1," file_db, fetches all the lines from the file that start with the key. The next command, sed, stands for stream editor; it is a powerful bash tool for modifying the output of other commands without actually updating the contents of the file. The expression "s/^$1,//" strips the requested key (and the comma) from the beginning of each matching line so that only values remain in the result set. Finally, tail -n 1 returns only the last, i.e. the most recent, entry in case there are duplicate entries for the same key. Keeping duplicate entries in the file is a conscious, and in fact important, design choice; we'll discuss it shortly.

Example:

get 10
{"name": "Bill", "age": "40"}
get 11
{"name": "Steve", "age": "46"}

Now, with two simple bash functions, we have a way to store and retrieve data in a file: an operational basicFileDb. For updates, the set() method simply adds another entry to the file with the same key but the new value. This makes writes and updates very fast, but the get operation becomes slow for the following reasons:

  1. A sequential search on a key is not the best way to look up a value, especially if the file grows big over time. The cost of such a search is O(n).

  2. Each update adds another entry with the same key, which has to be handled in order to get the latest record. Here this is handled using tail -n 1.

To optimise reads in basicFileDb, what if we kept a hashmap in which the map key is the key being searched for and the map value is the location of that key in the file? An offset value can be used to identify the location in the file. We then read the map first and jump directly to that offset in the file to fetch the value. This is the main idea behind indexes. Every time an update is made or a new entry is created in the file, we update the map as well. Here comes the trade-off: creating or updating entries in the hashmap adds to the insert and update time but makes reads faster. For this reason, the map is usually created only for selected keys, the ones most often used to fetch records from the database.
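
As a rough illustration, here is a minimal sketch of such a hash index layered on top of basicFileDb. The function and variable names (set_indexed, get_indexed, index) are made up for this example, and the byte offset of a new entry is simply taken from the file size before appending.

declare -A index    # in-memory hash index: key -> byte offset of its latest entry

set_indexed() {
    # record the current end of file_db as the offset of the new entry
    # (stat -c%s is the GNU form; on BSD/macOS use stat -f%z instead)
    index[$1]=$(stat -c%s file_db 2>/dev/null || echo 0)
    echo "$1,$2" >> file_db
}

get_indexed() {
    local offset=${index[$1]}
    [ -z "$offset" ] && { echo "Key not found" >&2; return 1; }
    # jump straight to the recorded offset instead of scanning the whole file
    tail -c +"$((offset + 1))" file_db | head -n 1 | sed -e "s/^$1,//"
}

Reads now cost a single seek instead of a full scan, at the price of one extra hashmap update per write.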


Notice that we only append new entries and never update files in place. This is an important design consideration used by multiple databases. All the data written to files is eventually stored on hardware such as magnetic disks or SSDs. For an in-place update, the system would first need to locate that specific entry on disk and then fit the new value into the right spot. Adding to the friction, the new value might need more space while the adjacent blocks are already filled. This makes in-place writes inefficient when you consider hundreds or thousands of writes per second. Both of the popular persistent storage devices, magnetic disks and SSDs, prefer sequential writes when performance is the consideration. Sequential writes also help with concurrency and conflict resolution: if multiple processes try to update the same entry at the same time, which request should win, and what happens when the system crashes halfway through an update, leaving part old and part new data? Hence, most databases today use a log file to sequentially append each entry first and then process the log to build indexes and update the database files.

Segments and Indexes

We discussed earlier that in our file-based database implementation we keep appending new entries using the set() method, both for new records and for updates. You might be wondering: once a new entry is added for a key, what happens to the older entries? We need a way to clean out useless data and contain the growth of basicFileDb. To handle this, we break the data into files of a certain size, called segments. Once a segment file reaches that size, it is closed and new entries go into a new segment file. A background process called compaction then removes duplicate entries. As per our design, segment files are immutable, so even during compaction no entry is removed in place, as that might affect ongoing operations. Instead, the compaction process traverses the existing segments and writes the most recent entry for each key into new segment files. Data from multiple segments is thus compacted into a smaller number of segments.
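
A minimal sketch of such a compaction pass for basicFileDb segments is shown below. The function name and segment file names are illustrative; the only assumption is the key,value line format used so far.

compact() {
    # usage: compact <new_segment> <oldest_segment> ... <newest_segment>
    # tac reverses the combined stream so the most recent entry of each key
    # comes first, and awk keeps only that first occurrence per key
    local out=$1; shift
    cat "$@" | tac | awk -F, '!seen[$1]++' > "$out"
}

# example: compact segment_new segment_001 segment_002

At this stage keys are not yet sorted, so the order of lines in the compacted segment does not matter.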

Reads continue to be served from the older segment files, since those segments never change. Once the new compacted segment file is ready, reads are routed to it and the older segments containing the duplicate data are deleted.


Indexes

For each segment there is one in-memory hash table. These hash tables serve as indexes for the data stored in the files: the keys are the keys from the data and the values are the byte offsets of those keys in the segment file. For any request, the hash table of the most recent segment is checked first. If the key is not found there, the next most recent hash table is searched, and so on, until the key is found. With the compaction process in place, there ideally shouldn't be many segments to look through.
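
A minimal sketch of this newest-first lookup is given below. For simplicity, the per-segment hash tables are emulated with a single associative array keyed by "<segment>:<key>"; all names here (seg_index, SEGMENTS, the segment file names) are made up for the example.

declare -A seg_index                 # "<segment file>:<key>" -> byte offset in that segment
SEGMENTS="segment_002 segment_001"   # segment files, newest first

indexed_get() {
    local key=$1 seg offset
    for seg in $SEGMENTS; do
        offset=${seg_index["$seg:$key"]}
        if [ -n "$offset" ]; then
            # found in this segment's index: jump straight to the stored offset
            tail -c +"$((offset + 1))" "$seg" | head -n 1 | sed -e "s/^$key,//"
            return 0
        fi
    done
    echo "Key not found" >&2
    return 1
}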

As the indexes live in memory, they are prone to loss on a system crash or reboot. One option is to recreate the whole hash map by going through all the segments and adding a <key> - <file offset> entry for the most recent value of each key, but that is time-consuming. To optimise this, a snapshot of the in-memory hash map is persisted to disk from time to time and loaded on reboot or whenever required. Bitcask uses this method to speed up index recovery.
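
In bash, such a snapshot of the seg_index array from the previous sketch can be written in a couple of lines. This is purely illustrative; real engines like Bitcask write compact binary hint files rather than shell snapshots.

# persist the in-memory index; declare -p writes a statement that recreates the array
declare -p seg_index > index_snapshot

# restore it on startup; run this at the top level of the script so that the
# restored array stays global rather than local to a function
[ -f index_snapshot ] && source index_snapshot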


SSTables and LSM-Trees

In our current implementation of basicFileDb, key-value pairs are written to files sequentially, so the keys appear in the order in which they were written. When the same key appears multiple times, the most recent entry takes precedence, and so far the ordering of keys has not mattered.

Let's change this implementation a bit, with the following requirements:

  1. All the key-value pairs should be in a sorted order by key.

  2. Each key should appear only once in a merged segment

The write and merge processes for segments should take care of the above two requirements. We will see the advantage of this change in a bit. This storage format is called a Sorted String Table, or SSTable.

In the image above, each segment has its key-value pairs sorted by key, and when the segments are merged, the result is again sorted by key. The merging process works like the merge step of mergesort (a sketch follows the list below):

  • Read the segments side by side

  • Compare the current keys from each segment and write the smallest key, taking the most recent value if the same key appears in several segments, to the new segment file

  • Advance the pointer of the segment from which the key was taken and repeat from the first step.
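
Below is a minimal sketch of this merge for our sorted segments, leaning on sort's merge mode instead of hand-rolling the pointer bookkeeping. The function and file names are illustrative, and it assumes GNU sort, where -m -s merges already-sorted files and, on equal keys, keeps the line from the file listed first (worth verifying on your platform).

merge_segments() {
    # usage: merge_segments <new_segment> <newest_segment> ... <oldest_segment>
    # sort -m merges the pre-sorted inputs; with -s, equal keys keep the line
    # from the newest segment, and awk drops every later occurrence of a key
    local out=$1; shift
    sort -m -s -t, -k1,1 "$@" | awk -F, '!seen[$1]++' > "$out"
}

# example: merge_segments sstable_merged sstable_3 sstable_2 sstable_1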

As the keys are now sorted, keeping an in-memory hash table entry for every key is no longer required. We can keep a mapping for only some of the keys, called a sparse index. Typically one key is indexed for every few kilobytes of data.

In the image above, say a request arrives for the key Dan. We chose to create sparse index entries for Arvind, Sarah, Tom and so on. As all the records are sorted by key, we know that Dan must lie somewhere between Arvind and Sarah. So we start scanning at Arvind's offset and keep going until we reach Sarah; if the key Dan is found, its value is returned, otherwise a Key Not Found error is raised. To optimise this further, a range of records can be compressed into a block before being written to disk, with each entry in the sparse index pointing to the start of a compressed block. This saves disk space as well as improves I/O bandwidth.
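
Here is a minimal sketch of such a sparse-index lookup over a sorted segment file. The names sparse_index, sparse_get and sorted_segment are made up for the example, and plain string comparison is used to order keys.

declare -A sparse_index    # only a few keys, e.g. sparse_index[Arvind]=0 sparse_index[Sarah]=4096

sparse_get() {
    local key=$1 start=0 k v
    # pick the largest offset among indexed keys that sort <= the requested key
    for k in "${!sparse_index[@]}"; do
        if [[ ! "$k" > "$key" ]] && [ "${sparse_index[$k]}" -ge "$start" ]; then
            start=${sparse_index[$k]}
        fi
    done
    # scan forward from that offset until the key is found or passed
    while IFS=, read -r k v; do
        [ "$k" = "$key" ] && { echo "$v"; return 0; }
        [[ "$k" > "$key" ]] && break      # keys are sorted, so we have gone past it
    done < <(tail -c +"$((start + 1))" sorted_segment)
    echo "Key not found" >&2
    return 1
}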


Creating SSTables

We have seen how SSTables work, how they improve the indexes and how they save disk space. But how do we build this structure on disk when writes and updates no longer arrive as a simple sequential append? Constructing such a structure is easier in memory, so let's start there. Tree data structures like AVL trees or Red-black trees support this very well: they restructure the data on every insert so that, irrespective of the order in which data was written, reading the tree yields the keys in sorted order. So whenever a write request comes, the entry is first written in memory to such a tree, also known as a memtable. Once the memtable fills up or reaches some size limit, its sorted contents are written to disk in the form of an SSTable. Writing the SSTable is simple because the data is already sorted and can be written as-is, again making the write sequential. This new SSTable file becomes the most recent segment of the database. Meanwhile, a new memtable instance is created to keep serving writes.
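
A minimal sketch of this flush step is shown below. Bash has no balanced tree, so a plain associative array stands in for the memtable and the keys are only sorted at flush time; mem_set, flush_memtable, the size limit and the sstable_N naming are all made up for the example.

declare -A memtable
MEMTABLE_LIMIT=4        # flush after this many keys (arbitrarily small for the sketch)
SEGMENT_ID=0

mem_set() {
    memtable[$1]=$2
    if [ "${#memtable[@]}" -ge "$MEMTABLE_LIMIT" ]; then
        flush_memtable
    fi
}

flush_memtable() {
    local k
    SEGMENT_ID=$((SEGMENT_ID + 1))
    # write the memtable out as a sorted run, i.e. a new SSTable segment
    for k in "${!memtable[@]}"; do
        echo "$k,${memtable[$k]}"
    done | sort -t, -k1,1 > "sstable_$SEGMENT_ID"
    memtable=()         # start a fresh memtable for subsequent writes
}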

For a read request, the key is first looked up in the memtable. Given the limited in-memory space, the key may well not be there; in that case the latest segment on disk is searched. If the key is still not found, the next most recent segment is searched, and the process continues until the oldest segment has been checked. Again, thanks to compaction the number of segments should stay small. For this reason, reads of recently written or recently accessed records are faster than reads of older records.
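
Putting the pieces together, the full read path can be sketched as below, reusing the memtable and the sstable_N segment files from the previous sketch; lsm_get is again an illustrative name.

lsm_get() {
    local key=$1 seg line
    # 1. the memtable holds the most recent writes, so check it first
    if [ -n "${memtable[$key]+set}" ]; then
        echo "${memtable[$key]}"
        return 0
    fi
    # 2. then check the on-disk SSTables, newest segment first
    # (a real engine would consult the sparse index here instead of grep)
    for seg in $(ls sstable_* 2>/dev/null | sort -t_ -k2,2nr); do
        line=$(grep "^$key," "$seg" | tail -n 1)
        if [ -n "$line" ]; then
            echo "${line#"$key,"}"
            return 0
        fi
    done
    echo "Key not found" >&2
    return 1
}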

The terms memtable and SSTable come from Google's Bigtable paper. The term Log-Structured Merge-Tree (LSM-Tree) was introduced by Patrick O'Neil, who described this type of indexing structure. Storage engines based on the merging and compaction mechanisms described here are often called LSM storage engines.

One problem with the above method: what happens if the system crashes or reboots? Since memory is volatile, would we lose the data written to the memtable but not yet flushed to disk? One solution is to keep a log on disk: everything written to the memtable is immediately appended to this log file as well. After a crash, the memtable is recreated from the stored log on reboot and the whole system continues functioning as usual. The log file does not need to grow without bound either; as soon as the memtable's data has been written to disk, the corresponding part of the log can be discarded.
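
A minimal sketch of such a write-ahead log for the memtable follows; wal_set, recover_memtable and memtable.log are illustrative names, and in a real engine the flush step would also truncate the log once the SSTable is safely on disk.

wal_set() {
    echo "$1,$2" >> memtable.log    # durable append first, so a crash cannot lose the write
    mem_set "$1" "$2"               # then the in-memory write (see the memtable sketch above)
}

recover_memtable() {
    [ -f memtable.log ] || return 0
    # replay the log in order; later entries simply overwrite earlier ones
    local k v
    while IFS=, read -r k v; do
        memtable[$k]=$v
    done < memtable.log
}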


Real World Examples

LevelDB and RocksDB are key-value storage engine libraries, used inside many different applications, that follow mostly the same algorithms and schemes described for our basicFileDb. Bitcask, the default storage engine in Riak, uses in-memory indexes similar to our basicFileDb to provide high-performance reads and writes. LevelDB can also be used as a backend in Riak. Google's Bigtable, and the databases inspired by the Bigtable paper, Cassandra and HBase, use similar storage engines.

Elasticsearch and Solr use Lucene as the indexing engine for full-text search. Lucene implements a term dictionary for fast searching: the key is the value being searched for, normally a word or term, and the value is the list of IDs of all documents containing that term, also called the postings list. In Lucene, this mapping of term to postings list is kept in SSTable-like sorted files, and similar merging methods run in the background.


Implementation Considerations

File Format - A binary file format is generally preferred, where each record is saved as a raw string prefixed with the encoded length of the string in bytes.
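
As a rough illustration of length-prefixed records, the sketch below uses an eight-digit printable length for readability; real engines use fixed-width binary integers, and the function names and the assumption of single-byte (ASCII) characters are mine.

append_record() {
    local record="$1,$2"
    # prefix the record with its length so the reader knows how many bytes to take
    printf '%08d%s' "${#record}" "$record" >> binary_db
}

read_record_at() {
    local offset=$1 len
    len=$(tail -c +"$((offset + 1))" binary_db | head -c 8)
    len=$((10#$len))    # force base 10 so the leading zeros are not read as octal
    tail -c +"$((offset + 9))" binary_db | head -c "$len"
}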

Deleting records - As the files are immutable and append-only, a delete is performed by appending a special entry (called a tombstone) for the key to be deleted. During compaction, all previous records for the deleted key are discarded.
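
In basicFileDb terms this can be sketched as follows; the tombstone marker string is arbitrary, and the read path simply treats it as a missing key:

TOMBSTONE="__tombstone__"

delete() {
    echo "$1,$TOMBSTONE" >> file_db    # a tombstone is just one more appended entry
}

get_with_tombstone() {
    local value
    value=$(grep "^$1," file_db | sed -e "s/^$1,//" | tail -n 1)
    if [ -z "$value" ] || [ "$value" = "$TOMBSTONE" ]; then
        echo "Key not found" >&2
        return 1
    fi
    echo "$value"
}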

Concurrency Control - A common choice is to have a single writer thread. Since writes are append-only, a single writer can still sustain high throughput. And because the segment files are immutable, concurrent threads can read the data without worrying about inconsistency.

Slow disk lookups - As we saw, when a key is not found in the memtable it is looked up on disk, which can be a big performance concern since the lookup may have to walk from the most recent segment all the way back to the oldest one. For this, storage engines implement Bloom filters. A Bloom filter is a probabilistic data structure that can tell, quickly and memory-efficiently, that a given value is definitely not present in a data set. This avoids many disk reads and improves read performance.
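
A minimal Bloom filter can even be sketched in bash; the filter size, the two cksum-based hash functions and all the names below are illustrative choices, not tuned values.

BLOOM_SIZE=1024
declare -a bloom

bloom_positions() {
    # derive two bit positions from cksum over the key and over a salted copy of it
    local key=$1 h1 h2
    h1=$(printf '%s' "$key" | cksum | cut -d' ' -f1)
    h2=$(printf '%s' "salt-$key" | cksum | cut -d' ' -f1)
    echo "$((h1 % BLOOM_SIZE)) $((h2 % BLOOM_SIZE))"
}

bloom_add() {
    local pos
    for pos in $(bloom_positions "$1"); do
        bloom[$pos]=1
    done
}

bloom_might_contain() {
    local pos
    for pos in $(bloom_positions "$1"); do
        [ "${bloom[$pos]}" != "1" ] && return 1    # definitely not present: skip the disk lookup
    done
    return 0                                        # possibly present: go on to the segments
}

Before touching any segment, the read path would call bloom_might_contain and return Key Not Found immediately on a negative answer, accepting the occasional false positive.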

Merging Methods - Among the multiple merging and compaction methods, size-tiered and leveled compaction are the most common. LevelDB and RocksDB use the leveled method, HBase uses size-tiered, and Cassandra supports both. In size-tiered compaction, smaller and newer SSTables are merged into older and larger SSTables. In leveled compaction, key ranges are split into smaller SSTables that are organised into levels, and it is ensured that the first and last token of one SSTable never overlap with other SSTables at the same level. In Cassandra, SSTables at each level can be ten times bigger than the SSTables of the previous level.


Summary

To summarise the SSTable and LSM-Tree based storage mechanism: segmented and sorted data is stored in files, which are referenced via hash maps kept in memory. This method is simple and very effective, and it has proved to scale very well with large datasets. Because the indexes are sparse, a large dataset can be indexed, and with the data sorted on disk, read operations become very efficient. And because writes are sequential, storage devices like SSDs and spinning hard drives can deliver very high write throughput. Although there are many other nuances and subtleties in the different storage engine implementations, this idea at its core has contributed a lot to the performance and popularity of the NoSql databases we know today.
