How to Analyze Billions of Records per Second on a Single Desktop PC

Introduction

This article gives an overview of LocustDB [1], a new and extremely fast open-source analytics database built in Rust. Part 1 gives some background on analytical query systems and my goals for LocustDB. Part 2 presents benchmark results produced on a data set of 1.46 billion taxi rides and demonstrates median speedups of 3.6x over ClickHouse and 2.1x over tuned kdb+. Part 3 is an architecture deep dive that follows two SQL statements through the guts of the query engine. Part 4 concludes with random learnings, thoughts on Rust, and other incoherent ramblings.

Online analytical processing

Simply put, online analytical processing (OLAP) involves interactively running queries on often large data sets to gain new insights. To give a concrete example, you might be logging all requests to your website and store fields like the country from which the page was accessed, the accessed url, http status code returned by the website, time it took to service the request, and any number of parameters included in the request. Then one day you get reports of performance issues on your search page and run a query like following to identify specific requests that lead to slow responses:

SELECT params.search_term, latency
FROM requests
WHERE url = '/search'
AND latency > 10000
AND day = $TODAY
ORDER BY latency DESC

Or you may want to do produce a report on the performance of your website, broken down by various attributes such as url and country. Or someone is trying to break into your systems, and you want to construct some filter that allows you to identify malicious requests and check whether they succeeded. Or one of your pages is returning errors and you want to filter down to and inspect the parameters of those requests.

Typical OLAP queries have a number of distinguishing features that impose unique design constraints:

  • Individual queries can be very expensive and may require aggregating trillions of records or more.
  • Data is often filtered on or grouped by multiple columns.
  • Queries are created ad-hoc and are often unique with respect to others queries previously seen by the system.

As a corollary, elaborate indexing datastructures are often ineffective. Since the number of possible ways to filter or group a dataset grows exponentially with the number of columns, we can only provide precomputed indices/aggregates for a tiny subset of possible queries. The approach taken by LocustDB is to forgo any attempts at complex indexing structures in favor of simple partitioning, performing brute-force scans over large parts of the data set, and then making those scans as fast as hardware allows. The main techniques used to achieve this are caching data in memory, massive parallelism, various optimizations around columnar data storage [2] and general techniques for high performance computing [3]. The trade-off is that simple queries will be relatively inefficient, which restricts throughput to maybe 1000s of queries per second. LocustDB is not the first system designed in this fashion. Notably Scuba [4], a proprietary system used at Facebook, and ClickHouse [5], an open source column-oriented DBMS developed at Yandex, have successfully taken the same approach. Scuba in particular adds two additional innovations which are not currently offered by any widely available systems.

Aggressive Compression

Compression is a key tool for speeding up queries reading from disk and for reducing cost. A 2015 paper by Facebook [6] claims compression rates of 10x and larger, achieved through automatically applying multiple compression algorithms tailored to different kinds of data. Add to this the complete absence of indexing overhead, and it results in a very cost effective solution. I suspect there exist many low QPS HBase/Cassandra/Elasticsearch/… clusters that could be replaced by a system like Scuba with minimal cost increase (or even significant savings) while providing higher query speeds, more expressive query capabilities and eliminating the need to manually design indices and table schemas.

Flexible Schemas

Most analytics systems have followed the SQL approach of requiring rigid table schemas that define the type and name of all columns upfront. While this makes a lot of sense for applications using a database for long-term storage of critical data, it is often less useful in an analytics setting where data may be short-lived and data integrity is not as crucial. I believe that foregoing rigid, predefined schemas unlocks a long tail of valuable use cases related to debugging, short lived experiments, prototyping and quick iteration for which setting up or migrating schemas is prohibitively cumbersome. A more subtle benefit is that supporting a flexible schema makes it much easier to support a multitude of compression schemes that may be difficult to retrofit if your code makes more rigid assumptions.

LocustDB

LocustDB originally started out during a Dropbox 2017 hackweek as a clone of Scuba. The prototype was open sourced and I have since continued working on it for fun and profit (minus the profit). So far most of my efforts have been focused on the query engine. It is still incomplete in terms of the breadth of supported query types and operators, but already extremely fast and fairly mature architecturally. Other components include a barely functional SQL(ish) parser, in-memory storage and simple compression schemes. There was support for basic persistence built on top of RocksDB at some point but most of this was killed during a refactor. Other functionality that you might consider important and that does not exist yet include the ability to replicate data and run queries across multiple machines, advanced compression algorithms, and not crashing randomly.

Part 2: Benchmarks

9 thoughts on “How to Analyze Billions of Records per Second on a Single Desktop PC

    1. clemenswinter Post author

      I did evaluate Druid for a specific analytics use case at work one time and found performance to be underwhelming (but it might work better for other workloads and I seem to remember that Druid also offers very strong consistency and durability guarantees).

      Inverted indices make for very efficient point lookups/sparse filtering on a single column but introduce indexing overhead and don’t benefit all queries. The idea behind systems like ClickHouse and LocustDB is to make brute force scans so fast that you don’t even need an index. This does sacrifice performance/efficiency on simpler queries. But if your queries are large and comparatively infrequent (as is the case for many analytics workloads) this approach is a very good fit and should generally give superior performance and cost efficiency.

      Reply
  1. Tim McNamara 🐍🦀 (@timClicks)

    Clemens – this is a really ambitious, exciting project. Well done for continuing your motivation after the initial hackfest!

    Do you have a gut feel for how many records that you would need before a system like this would be justified? In particular, how essential is the distributed component? It sounds like it scales down to a single laptop quite well.

    Reply
    1. clemenswinter Post author

      There are a number of design constraints to consider:
      1. How many queries are run per second
      2. How costly are queries (i.e. for how long do they run, which among other things depends on the number of records they touch)
      3. Latency requirements (i.e. you want queries to complete within n seconds)
      4. How many records are stored in the system

      If any of 1-3 are the bottleneck, you will need to add additional machines/CPUs to increase throughput and/or decrease latency. If a single machine can process all your queries sufficiently quickly, the limiting factor for the number of records is now the size of your RAM (or disk, but this is not currently supported by LocustDB and will lead to lower query speeds).

      So then the number of records you can store is (more or less) given by the formula memory_size * compression_ratio / size_per_record.

      Memory size is obvious.
      Compression ratio will depend on the actual data values (and also how good the compression algorithms are). Compression ratio is difficult to predict and depends very much on the values in the dataset, typical values might be between 3x and 20x.
      The size per record will depend on how many columns your table has and the size of each column.

      To give a concrete example, suppose you have 64GB of memory, you want to leave a spare 10GB of memory for running queries, you table has two dozen columns storing mostly integers and dictionary compressible strings with total record size of 200B uncompressed and 20B compressed. Then the number of records you could store would be roughly 54GB/200B = 2.7 billion.

      Of course the numbers for your use case may be very different, and you would have to actually load and benchmark (a portion) of your dataset to get accurate estimates for compression rates and query speeds.

      Reply
  2. david

    Reminds me a lot of the concepts from SanssouciDB described in detail by Hasso Plattner in his book “In-Memory Data Management”. I thought it might be interesting for you as well, since you haven’t mentioned SAP HANA as a database using the same concepts such as caching, compression, massive parallelism, etc.
    SanssouciDB was the research project which’s concepts eventually evolved into the dbms of the commercial HANA platform.

    Reply
    1. clemenswinter Post author

      Yes, SAP HANA seems to be quite successful in this space. I’d never heard of SanssouciDB or the book before, will have to check that out!

      Reply
  3. Pingback: How Read 100s of Millions of Records per Second from a Single Disk | Clemens' Blog

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.