HASH or RANGE in distributed databases

Here is a general idea that works in any database: you want to colocate the data that will be retrieved together. In a NoSQL database, which is generally built for one use case, this is easy: you have one use case, one data store (called a collection in MongoDB, or improperly called a “table” in DynamoDB) and one partitioning scheme: hashing. In a relational database, this is different: data is stored to be accessed by multiple use cases, because different users may navigate, or aggregate, from different business points of view. You have multiple tables (normalization to avoid update anomalies) and multiple partitioning schemes for tables and indexes. Basically, there are two structures in IT when you want to colocate data: hashing and sorting (range).

For my examples, on YugabyteDB, I’ll load an AVENGERS table:

However, I’ll do my test on a geo-distributed cluster in order to see the latency when reading from multiple nodes:

I have 173 avengers in a YugabyteDB table and one index:

Let’s see what the optimized access paths are. Basically we will see either:

Without any WHERE clause, we of course need to read all table rows, and a Seq Scan is the right access path.

Let’s filter with a predicate on the key:

Here, the Index Scan knows which part of the table to access thanks to the “Index Cond”. This is an equality predicate on a HASH key: the value is hashed, which leads to the right partition and to the right place within this partition.
Let’s go further:

With many values the Index Scan is still possible: each value is accessed in its partition.
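
To make the routing idea concrete, here is a minimal Python sketch of a hash-partitioned lookup. It is only an illustration under assumptions: the tablet count, the `tablet_for` function and the row values are hypothetical, and this is not how YugabyteDB is implemented internally. It shows that an equality or multi-value lookup only touches the partitions that own the requested keys.

```python
import hashlib

NUM_TABLETS = 4  # hypothetical number of partitions


def tablet_for(key):
    """Hypothetical routing function: hash the key, map it to a tablet."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:2], "big") % NUM_TABLETS


# Each tablet stores its rows in a dict keyed by primary key.
tablets = [dict() for _ in range(NUM_TABLETS)]
for pk in range(1, 174):  # 173 rows, like the example table
    tablets[tablet_for(pk)][pk] = f"avenger-{pk}"


def point_lookup(keys):
    """Equality or IN-list lookup: visit only the tablet owning each key."""
    visited = set()
    rows = []
    for key in keys:
        t = tablet_for(key)
        visited.add(t)
        if key in tablets[t]:
            rows.append((key, tablets[t][key]))
    return rows, visited


one_row, one_visit = point_lookup([42])
many_rows, many_visits = point_lookup([1, 10, 100])
```

With a single key, exactly one tablet is visited. With a list of keys, at most one tablet per key is visited, never the whole table.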

With an inequality predicate we cannot use the HASH index because the hash function doesn’t keep the order. In this case a Seq Scan is the only solution, reading all rows from all tablets and filtering them afterwards. This is not an optimal access path and should be avoided for the critical use cases.
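
The reason can be shown with a small Python sketch, assuming a hypothetical 16-bit hash taken from MD5 (not necessarily the hash function the database uses). Once keys are laid out by hash value, the keys below 5 are no longer next to each other, so a range predicate has to examine every entry.

```python
import hashlib


def hash16(key):
    """Hypothetical 16-bit hash of a key (first two bytes of its MD5)."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:2], "big")


keys = list(range(1, 21))

# Physical order in a hash-sharded store follows the hash value, not the key.
by_hash = sorted(keys, key=hash16)


def less_than(stored_keys, upper):
    """Range predicate on hash-ordered keys: every entry must be examined."""
    scanned = 0
    matches = []
    for key in stored_keys:
        scanned += 1
        if key < upper:
            matches.append(key)
    return sorted(matches), scanned


matches, scanned = less_than(by_hash, 5)
```

Four keys match, but all twenty entries are read and filtered, which is what a Seq Scan with a filter does.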

When you define the primary key, you probably don’t know all use cases in advance. But you know your data. In this case, the “index” primary key is a number used only as a small identifier: the value has no business meaning and no arithmetic purpose like comparing or sorting. There is no reason to ever query it on a range, or sort on it, so it probably doesn’t need a RANGE key. A HASH key is more flexible when you know you don’t need a range scan because, with consistent hash sharding, you can achieve the best distribution across nodes, and keep it while rebalancing, adding, or removing nodes.
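
As a rough illustration of why hash sharding rebalances well, here is a Python sketch. It assumes a 16-bit hash space cut into contiguous ranges, one per tablet. The tablet count, the MD5-based hash and the `add_node` policy are all hypothetical simplifications, not the database's actual algorithm. The point is that a row never changes tablet: adding a node only moves a few whole tablets.

```python
import hashlib

HASH_SPACE = 0x10000  # assumed 16-bit hash space
NUM_TABLETS = 8       # hypothetical tablet count


def tablet_of(key):
    """Each tablet owns one contiguous slice of the hash space."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:2], "big") * NUM_TABLETS // HASH_SPACE


def add_node(placement, new_node):
    """Hypothetical rebalancing: move whole tablets to the new node."""
    placement = dict(placement)
    nodes = set(placement.values()) | {new_node}
    fair_share = len(placement) // len(nodes)
    for _ in range(fair_share):
        load = {}
        for tablet, node in placement.items():
            load.setdefault(node, []).append(tablet)
        donor = max(load, key=lambda n: len(load[n]))  # most loaded node
        placement[load[donor][-1]] = new_node
    return placement


rows_per_tablet = [0] * NUM_TABLETS
for pk in range(1, 174):  # 173 rows, like the example table
    rows_per_tablet[tablet_of(pk)] += 1

before = {t: t % 3 for t in range(NUM_TABLETS)}  # 8 tablets on 3 nodes
after = add_node(before, 3)                      # a fourth node joins
moved = [t for t in before if before[t] != after[t]]
```

Here only two of the eight tablets move to the new node, and the key-to-tablet mapping is untouched.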

There are some columns that have an order, and where you definitely want to query on a range. Let’s index the column that holds the number of appearances of each Avengers character in Marvel comic books:

Without any mention of ASC or DESC it defaults to HASH.

My HASH index was not used here when querying for the avengers with fewer than 5 appearances. Hashing can be used only when the condition has discrete values:

This is not very useful. For such a column, where a range or an order makes sense, we need to define a range index, either in ASCending or DESCending order:

Here the index has been scanned, even if the order of the scan (descending from 5 here) is opposite of the index order (ASC). No additional filtering on the output as we have read only the required rows. And no Sort operation as we know they come in the order we want in the ORDER BY.

The same works when reading in ascending order of course.
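
The behavior of a range index can be sketched in Python with a sorted list and the standard `bisect` module. The entries below are made-up values, not rows from the AVENGERS table, and `scan_less_than` is a hypothetical helper. The sketch shows that a "less than 5" predicate reads one contiguous slice, and that the same slice can be walked in either direction without a sort.

```python
import bisect

# Made-up (appearances, name) entries; a RANGE index keeps them sorted by key.
index = sorted([(2, "hero_a"), (4, "hero_b"), (4, "hero_c"),
                (7, "hero_d"), (15, "hero_e"), (1, "hero_f")])


def scan_less_than(entries, upper, descending=False):
    """Return the entries with key < upper by reading one contiguous slice."""
    # (upper,) sorts before any (upper, name), so this finds the first key >= upper.
    end = bisect.bisect_left(entries, (upper,))
    hits = entries[:end]
    return hits[::-1] if descending else hits


asc = scan_less_than(index, 5)
desc = scan_less_than(index, 5, descending=True)
```

Both results contain only the four matching entries. The descending result is the same slice read backwards.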

So, for the numeric datatypes where the order is meaningful, you want to define ASC or DESC. It is the same for a character string column if the order makes sense. Let’s index the avenger’s name, first with a HASH:

With the predicate on the full name, a HASH index is used. But let’s say you don’t know Spiderman’s middle name:

With the LIKE 'Peter% Parker' pattern we had to scan all the rows. Can we do better? Let's create a RANGE index:

Here, the query planner knows that the names starting with ‘Peter’ are all colocated and this is an index access to contiguous entries. This is where a relational database is more powerful than NoSQL: you have multiple access paths and a query planner to find the optimal one.
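
The same idea, sketched in Python under assumptions: the names are a small illustrative list rather than the table content, and `prefix_scan` is a hypothetical helper. On a sorted structure, all names starting with 'Peter' sit between two bounds, so the prefix of the LIKE pattern becomes a range scan and only the rest of the pattern is applied as a filter.

```python
import bisect

# Illustrative names only; a RANGE index on the name keeps them sorted.
names = sorted([
    "Thor Odinson",
    "Peter Quill",
    "Anthony Edward Stark",
    "Peter Benjamin Parker",
    "Steven Rogers",
])


def prefix_scan(sorted_names, prefix):
    """Return the contiguous slice of names starting with the prefix."""
    start = bisect.bisect_left(sorted_names, prefix)
    # Upper bound: the prefix followed by the highest Unicode code point.
    end = bisect.bisect_left(sorted_names, prefix + "\U0010ffff")
    return sorted_names[start:end]


# LIKE 'Peter% Parker': range scan on the prefix, then filter on the suffix.
candidates = prefix_scan(names, "Peter")
result = [n for n in candidates if n.endswith(" Parker")]
```

Only the two 'Peter' entries are read, and one of them passes the remaining filter.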

Note that I’ve created two indexes here, but the more indexes you have, the more expensive the inserts, deletes and updates of the indexed columns will be. Here, a HASH index makes no sense because the RANGE index can also serve equality predicates.

You can also index multiple columns, where the first one is hashed to partition it and the second one is sorted to scan by range and order it. You can also colocate tables together in tablegroups. We will see that in future posts.
