Rust and other random learnings

Finally, I will conclude with various random learnings I acquired while working on LocustDB and give some unsolicited thoughts on Rust. This is not meant to be a comprehensive evaluation of Rust, but rather specific examples where I found Rust to work particularly well (or poorly) on this project. This section is arranged to start off with general learnings and devolve into arcane areas of Rust towards the end. Lest anyone get the wrong impression, let me pre-face this section by saying that I have found Rust to be an incredibly well designed, fun and productive language to work with. Massive kudos to the team at Mozilla and the many external contributors.

WSL

I currently use Windows almost exclusively (long story). This was a bit of a problem when it came to generating the data set, which involves running a bunch of bash scripts and applications installed via apt-get (notably PostgreSQL) for multiple days. Lacking other options, I figured I would try to get it to work on the Windows Subsystem for Linux (WSL). Not knowing what WSL does and how, I didn’t really expect this to work. Much to my surprise, it did! There were a few minor roadblocks where functionality differed somewhat from a standard Ubuntu install, but they all had reasonable and easy to find workarounds. I still don’t know what WSL does and how, but I now expect it to work (my favorite kind of technology).

Profiling

I love profiling, but for most of the initial parts of this project it was singularly useless. By design, LocustDB has very little overhead and spends the majority of its time running a few simple operations with a predictable performance. So pretty much all profiling would tell me was “more than 90% of runtime is spent in these two simple loops”. Well, OK then, I guess I didn’t do anything really stupid. I did get some mileage out of it on some of the more complex queries that are comprised of dozens of operations where the relative costs are less obvious.

VTune

One instance where profiling was invaluable was when I was debugging an incredibly strange performance regression that was caused by varying the partition size (which in theory should have minimal performance impact, unless the partition size is made very small). After some experimentation I found some pathological partition sizes where performance would degrade by more than 2(!!) orders of magnitude! On these pathological cases CPU utilization would plummet, which was really weird because the query engine contains only a small number of very shortly-held locks. Instrumentation showed occasional massive slowdowns in seemingly arbitrary sections of the code. Profiling with CodeXL did not show anything other than a larger fraction of time spent in the long tail.

Out of ideas, I decided to give Intel’s VTune Amplifier profiler a shot, which immediately flagged low parallelism as a problem and pointed me to a spinlock in the RtlFreeHeap function of ntdll.dll that was blocking threads and consuming most of the runtime. Thanks, Microsoft, for reminding me to not produce pathological memory allocation patterns I guess? Best I could tell, what triggers this is making multiple fairly large (KB-MB) allocations in several threads, and then deallocating them soon afterwards.

This problem was mostly sidestepped by reducing the size of temporary allocations performed by the query engine, which is something I wanted to do anyway, but some (less optimized) queries are still affected by this. Compiling under Linux, where Rust will link jemalloc as the memory allocator by default, gives massive speedups on those queries. Unfortunately there doesn’t seem to be a way to compile with jemalloc under Windows at the moment but I think there is some work in that direction.

Benchmarks

Rust includes an (unstable) benchmarking framework that is integrated with the testing framework and makes it super easy to define and run (micro) benchmarks. The benchcmp tool makes it very convenient to compare outputs of different benchmark runs. One limitation I have run into as I started to use larger data sets and longer running benchmarks is that currently the benchmarking framework will run each benchmark case at least 350 times, but by and large it has been very useful and convenient. I’ve also heard good things about criterion but the built-in benchmarking has worked well enough for me so far.

The cake is a lie

On one of my routine benchmark runs after a code change that I wasn’t really expecting to affect performance at all, I saw consistent 20-30% performance regressions across the board. After much puzzlement, I determined that the regression was related to using a more recent version of the Rust compiler. Further digging revealed that compiler changes introducing incremental/parallel compilation prevent various cross-module optimizations, and that in addition to passing --release it is now necessary to set the RUSTFLAGS="-Ccodegen-units=1" and CARGO_INCREMENTAL=0 environment variables to get optimal performance. Good thing I was running on an old version of Rust, otherwise I would never have known what I was missing out on. It would be great if there was a single flag that you could rely on to set all current (and future) options required for maximum performance. In related news, there is also RUSTFLAGS=-Ctarget-cpu=native which allows use of architecture specific instructions, but also hurts portability in a major way so it makes sense for it to be set explicitly and I didn’t actually see any speedups from it on LocustDB.

Tests

In a lot of programming languages, creating tests feels kind of tedious. You have to maintain a separate folder hierarchy, create a new file, add a bunch of imports and boilerplate (cough Java cough). Rust makes it very easy to define unit tests within the file they are testing which for me greatly reduces the friction of adding tests.

Macros

So far I’ve only written a few very simple macros myself, but made extensive use of the (excellent) nom parser combinator crate for implementing the query parser. Nom allows you to generate extremely efficient parsers at compile time, and relies heavily on macros. One unfortunate aspect of macros (at least currently in IntelliJ) is that much of the tooling ceases to work inside macro contexts (automatic reformatting, typeahead, jump to definition, …). This made me somewhat regret my choice of nom, especially since query parsing doesn’t actually have to be particularly fast.

Fearless concurrency

The parallelism in LocustDB is fairly coarse-grained so I didn’t do anything crazy here. The initial query engine was single threaded and converting it to run multithreaded required making sure that pretty much all the data structures used anywhere were thread safe. Amazingly, the Rust compiler is able to detect any thread unsafe structures automatically which made hunting down the various offenders trivial. In a different language, I could have achieved the same thing by meticulously checking all my data structures, and I probably would not have missed anything, but it is really nice I didn’t have to.

unsafe

The unsafe keyword in Rust provides an escape hatch which unlocks some additional capabilities at the cost of memory safety. All of LocustDB contains essentially just three different uses of unsafe. The first is calls to str::from_utf8_unchecked which I used to reconstruct strings that are packed into contiguous byte arrays. Strings in Rust are guaranteed to be valid UTF8 (and it can cause unsafe/undefined behavior if they are not), so the safe version of this function has to spend some cycles to check that the byte slice it is given actually constitutes valid UTF8. The other two uses of unsafe are places where I was not able to make the Rust borrow checker work out for me, and had to resort to transmute lifetimes to the unbounded 'static lifetime (more on that later). So far, I have not really felt the need for unsafe with respect to performance optimizations.

Documentation

I have found the Rust documentation to be very solid overall. One difficulty is that on many objects a lot of the methods are implemented for a trait (and maybe generically or as an extension method). This can make specific methods difficult to hunt down. One feature that could be very useful here is searching methods by type signature. For example, you might know/suspect that there is a method that will test whether some predicate is true for all elements in a vector but have no idea what it’s called. Clearly though, one of the inputs to the method has to be Vec or Iterator and it has to return a bool which narrows candidates down to a manageable shortlist. Apparently this kind of search was actually possible at one point but for some reason I’ve never gotten it to work. I occasionally Ctrl-F for -> ReturnType inside the browser but it’s a poor substitute.

One neat feature of the Rust docs is that all modules/methods/structs/… have a link to their source code which makes it very easy to satisfy your curiosity about how something is implemented. It may also lead you to discover gems such as this.

GADTs

If you’ve never heard of generalized algebraic data types (GADTs), don’t feel bad because they are a fairly obscure construct relegated to languages such as Haskell and Scala. Basically what they allow you to do is add a generic type parameter to an enum type that can be instantiated to a different type for each variant of the enum. This can be useful for modelling expression trees that may evaluate to different types (as used in LocustDB’s query engine). If Rust had support for GADTs (it does not), I would without a doubt have spent much time integrating them with my query engine. As it turns out everything works just fine without. I did run into a few bugs that they could have prevented, but they were fairly obvious and easy to fix. So maybe Golang has a point after all (a part of my soul died writing this sentence).

Zero-cost abstractions

Rust’s generics provide a powerful way of abstracting over types. Generic parameters can be constrained to implement specific traits which makes it easy to abstract over e.g. integer types or any other class of values that can be modeled by a trait. The compiler will determine all the concrete instantiations of the type parameters at compile time and generate specialized functions/structs for each of them that are as efficient as the equivalent code not using generics. LocustDB makes extensive use of generics to be able to deal with many different data types in a uniform way without duplicating code or introducing inefficiencies.

As an example, the sum operator, which performs grouped summation for in queries like select passenger_count, sum(total_amount), takes two input vectors: the first contains the data to be summed, and the second is an index that designates a group for each value. Both are allowed to be any integer type which makes VecSum work for, at time of writing, 16 different combinations of types (caring about binary sizes is so 1991). Without this, our choices would be wasting precious cycles and cache lines by converting to a single common type, incurring virtual function call and/or boxing overhead, or resort to some dreadful hacks or codegen that may interact poorly with the type system and introduce build system complexity. The only other language feature I’m aware of that provides similar capabilities is C++ templates (if I your favorite programming language also supports this, good on you and I meant no offense).

In addition to types, Rust’s generics also make it possible to abstract over operators by adding a (phantom) type parameter that is constrained to a trait with one or more static functions. This makes things like the (slightly horrifying) VecConstBoolOperator possible which implements multiple different comparison operations that compute a boolean from two values which may have different types. This could be taken even further and make it possible to add support for many different operators and functions with minimal effort.

Lifetime woes

Many of Rust’s innovations and strong safety guarantees are made possible by its unique lifetime system that statically assigns lifetimes to all values and ensures they cannot be modified concurrently or accessed after they are dropped. The lifetime rules are enforced by the infamous borrow checker. A mistake that I made when I was still somewhat new to Rust was to add unnecessary lifetimes to core structs. These lifetimes are propagated by any functions or structs that use them, quickly infect large parts of your program and lead to unmaintainably complex types. I now know better and make liberal use of #[derive(Copy)], clone(), Rc, and Arc unless I absolutely know that I will need every last bit of performance.

For the most part, I’ve been able to eliminate all (non-local) lifetimes from LocustDB. But one still remains, has caused me endless pain, and is the antagonist of the remaining sections. I have easily spent dozens of hours on the various type gymnastics required to appease the borrow checker in this case (plus an additional 10 hours writing this rant. Yes, I do question my decision making sometimes…). On several occasions, the addition of a seemingly trivial feature suddenly resulted in multiple hours of figuring out how to make the lifetimes work out. By now, things are set up quite nicely and mostly just work so at least there is little ongoing cost. And of course it should also be mentioned that I have spent zero hours debugging heap corruption, segfaults and use after free bugs (excepting that one time when my hubris got the better of me).

Naturally LocustDB supports a string datatype. The simple solution of just cloning all string values used in the query engine would introduce unacceptable performance overhead and can explode memory usage. So operators that work with strings take them by reference (&str) which means they require a lifetime parameter corresponding to the lifetime of the object containing data that the string references. In some Rust programs it is possible to eliminate this type of lifetime by replacing the &str with an index into the array storing the string values but that’s not workable in this case. First, we still need to access the actual string data. Second, there isn’t a single backing store for strings. They might come from a Vec packing multiple strings into a contiguous area of memory, a String that was originally part of the query string or created on-the-fly by the query engine, or a Vec used by a dictionary codec. It must be possible to uniformly merge and operate on all these different string values originating from many different columns and &str is the only common representation that does not sacrifice performance.

‘self

When a is query is executed all the worker threads processing the query hold a reference to a shared QueryTask structure which stores reference counted pointers (Arc) to the partitions of the table that the query is operating on. Each worker thread produces a partial result from evaluating the query on some subset of partitions, which is written back to the QueryTask to be later merged with the results from the other threads. These sub results contain references to string data in the original table and so require a corresponding lifetime. Now what exactly should this lifetime be? It has to be longer than the lifetime of the individual worker threads, because they may exit before the query is fully evaluated. We do know that the string references will be valid for as long as the QueryTask object lives (as long as the field referencing the table is not mutated). What we would like to do is to tie the lifetime of the partial results to the lifetime of the containing structure. Unfortunately this is not currently possible and the semantics of adding support for this to the type system are actually quite tricky.

Currently the only solution is to use the unsafe mem::transmute function to cast the lifetime to 'static which is guaranteed to live for the duration of the program. So we are lying to the compiler, but this is safe as long as we keep those references private to the QueryTask struct and prevent them from being accessed by any code that runs after the QueryTask has been dropped. After the query completes fully and before the QueryTask object is dropped, the final result set is converted into owned values and returned without any associated lifetimes. This pattern is actually common enough that there exist crates that aim to provide a safe abstraction for it.

Codec

All columns storing data consist of two components. One is the compressed data making up the elements in the column. The second is an instance of Codec, which is a trait with a decode function that turns the compressed data back into values that the query engine knows how to operate on. Instances of Codec may themselves own some additional data as required for decoding (e.g. a dictionary of strings used for dictionary encoding). In retrospect this was a terrible design decision, even though keeping the data separate from Codec would create its own set of challenges. By the time I realized my mistake though it was difficult to undo and easier to just power through (or so I thought).

Initially, the Codecs for the various columns operated on by the query planner/engine were passed around as &'a Codec using the same lifetime already required for references to the column data. This worked well enough. At some point, I needed the query planner to be able to create new codecs on the fly. This caused me quite the headache. These codecs generated on-the-fly are allocated on the stack so it is not possible to create any (long lived) references to them. They are simple value types so copying them would be possible, but then that is not an option for any Codec owning large amounts of data. You might try to put everything inside an Arc, but that completely eliminates the lifetime 'a which is still required as the return type for the decode function which may yield strings that reference non-static data.

What finally worked was to promote the lifetime parameter to the Codec<'a> trait, require it to be clonable, and then have implementors of Codec that own data implement Codec not on the structure itself, but for references to that structure. (E.g. impl<'a> Codec<'a> for &'a DictionaryEncoding). Making all of this work out required atrocities like this, which utilizes the poorly “documented” <'a> for <'a> type expression. Gazing upon this code fills me with a sort of pride and also abject horror. (Okay, I exaggerate, but cultivating a morbid sense of humor is actually a crucial tool for remaining sane in the presence of the Borrow Checker). In all of this, the borrow checker did save me from multiple subtle use-after-free bugs that would have been difficult to avoid in other unmanaged languages.

Variance

Of course, most implementations of Codec won’t actually contain any references. E.g. we might have a very simple codec that simply subtracts a constant offset from an integer:

pub struct OffsetCodec {
    offset: i64,
}

When we implement Codec for this struct, we still need to pass on some lifetime to Codec. Well, this is easy, right? Since OffsetCodec contains no references and is valid for the entire duration of the program, we can just impl Codec<'static> for OffsetCodec. Wrong! Trait lifetime parameters in Rust are always invariant and you will now get borrow checker errors when passing OffsetCodec to functions taking a Codec<'a> and some other value with the same lifetime 'a, which is now suddenly forced to be 'static as well (but can’t be). What we can do is write impl<'a> Codec<'a> for OffsetCodec which allows the borrow checker to instantiate the lifetimes to be whatever it needs. This solution is very simple. Coming up with it when all I had to go on were far removed borrow checker errors was not.

Streaming iterators

Early on in the project, I tried for a little while to make some kind of reusable Codec trait work that would allow you to (at compile time) compose multiple compression algorithms into a single compression scheme without any runtime overhead. E.g. you might want a Codec for string data that will apply dictionary encoding to the string values, variable-width integer encoding to the indices into the dictionary, and compress the dictionary using LZ4. Or you might want to apply two passes of delta encoding and one pass of variable-width encoding, and still allow the compiler to fuse all of these passes into a single loop. There are various other complexities to be accommodated, but a simplified version of this kind of Codec trait might look like this:

pub struct Codec {
    type In;
    type Out;
    fn decode(in: In) -> Out;
}

Of course, concrete instances of In and Out might have a lifetime parameter. E.g a decoding algorithm that restores strings packed into a single vector of bytes might have In=&'a [u8] and Out=Vec<&'a str>. Crucially, we must ensure that 'a refers to the same lifetime in &'a [u8] and Vec<&'a str>. This is currently not easily expressible in Rust, and equivalent to the problem faced by what is usually referred to as streaming iterators. There are some ~~hacks~~ techniques that can solve this, subject to various limitations, but I am insufficiently masochistic to give them a try. In any case, there is hope! My most anticipated RFC, the scary-sounding generic associated types, will allow us to write something like the following to express this kind of constraint:

pub struct Codec {
    type In<'a>;
    type Out<'a>;
    fn decode<'a>(in: In<'a>) -> Out<'a>;
}

The crown jewel

Do you remember how I casually mentioned that Rust made it easy for me to implement operators generically for many different types? Well, I lied. By now, it will come as no surprise to you that this capability was actually acquired at great personal cost and involved multiple blood sacrifices to the Borrow Checker.

Even simple queries are composed of multiple operators that are stitched together at run time by the query planner. This means we need a common type that allows us to pass the results from one operator as the input to another. Let’s make a trait for it and call it AnyVec<'a>. It essentially allows us to wrap any one of many different result types in a boxed trait that provides a lot of different methods that will return the original data. This looks something like this:

pub trait AnyVec<'a> {
    fn cast_ref_u8(&self) -> &[u8] { panic!("cast_ref_u8") }
    fn cast_ref_u16(&self) -> &[u16] { panic!("cast_ref_u16") }
    fn cast_ref_str<'b>(&'b self) -> &'b [&'a str] { panic!("cast_ref_str") }
    // ...
}

By default, these methods will simply panic (I know, I know, they should return a Result. Feel free to make a pull request). Each supported type implements the corresponding methods, e.g. for string data:

impl<'a> AnyVec<'a> for Vec<&'a str> {
    fn cast_ref_str<'b>(&'b self) -> &'b [&'a str] { self }
}

Ok, great! Now we can implement an overly complicated function like the following which computes an AND operation:

fn and(left: &AnyVec<'a>, right: &AnyVec<'a>) -> AnyVec<'a> {
    let left = left.cast_ref_u8();
    let right = right.cast_ref_u8();
    left.zip(right).map(|l, r| l & r).collect::<Vec<_>>()
}

But what if we want to implement a function that e.g. implements equality generically for all types? A first attempt might look like this:

fn equals(left: &AnyVec<'a>, right: &AnyVec<'a>) -> AnyVec<'a> {
    let left: &[T] = left.cast_ref_???();
    let right: &[T] = right.cast_ref_???();
    left.zip(right).map(|l, r| l == r).collect::<Vec<_>>()
}

This almost gets us there, but of course we can’t just call cast_ref_u8() to cast the inputs to byte values because we want this to work with any type. We need a way of calling the correct cast_ref_x method depending on what T is. With associated type constructors, we could define a GenericVec as follows:

pub trait GenericVec {
    type Item<'a>;
    fn unwrap<'a, 'b>(vec: &'b TypedVec<'a>) -> &'b [Item<'a>];
}

So now we can just add a trait bound equals<T>(left: &'a T, right: &'a T) where T: GenericVec + 'a, T::Item: Eq and then call T::unwrap(left) to cast the inputs into the correct types. Implementing GenericVec for specific types is once again trivial:

impl<'a> GenericVec for &'a str {
    type Item<'a> = &'a str;
    fn unwrap<'a, 'b>(vec: &'b AnyVec<'a>) -> &'b [Item<'a>] {
        vec.cast_ref_str()
    }
}

Alas, associated type constructors don’t exist yet. What do! Well, since you’re still reading you must be the kind of person that enjoys pain and doesn’t give up easily. So you will surely agree that the only sensible course of action is to bang our head against the type system until something gives. After some trial and error, you may stumble upon the following:

pub trait GenericVec {
    fn unwrap<'a, 'b>(vec: &'b AnyVec<'a>) -> &'b [T] where T: 'a;
}

Not even that horrible! We use a where clause to tie the lifetime in TypedVec<'a> to the elements T in the result. And this almost makes everything work! Implementing VecType for u8 proceeds without major issues:

impl GenericVec for u8 {
    fn unwrap<'a, 'b>(vec: &'b AnyVec<'a>) -> &'b [u8] where u8: 'a {
        vec.cast_ref_u8()
    }
}

The compiler demands that we add in a tautological where u8: 'a constraint to match the signature of the trait. That’s a bit irritating, for sure, but we can live with it. The real problem comes when we try to do the same for &str:

impl<'c> GenericVec<&'c str> for &'c str {
    fn unwrap<'a, 'b>(vec: &'b AnyVec<'a>) -> &'b [&'c str] where &'c str: 'a {
        vec.cast_ref_str()
    }
}

error[E0308]: mismatched types
  --> src/main.rs:14:81
   |
14 |     fn unwrap<'a, 'b>(vec: &'b AnyVec<'a>) -> &'b [&'c str] where &'c str: 'a { vec.cast_ref_str() }
   |                                                                                 ^^^^^^^^^^^^^^^^^^ lifetime mismatch
   |
   = note: expected type `&'b [&'c str]`
              found type `&'b [&'a str]`
note: the lifetime 'a as defined on the method body at 14:5...
  --> src/main.rs:14:5
   |
14 |     fn unwrap<'a, 'b>(vec: &'b AnyVec<'a>) -> &'b [&'c str] where &'c str: 'a { vec.cast_ref_str() }
   |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: ...does not necessarily outlive the lifetime 'c as defined on the impl at 13:1
  --> src/main.rs:13:1
   |
13 | impl<'c> GenericVec<&'c str> for &'c str {
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error: aborting due to previous error

For more information about this error, try `rustc --explain E0308`.

Whaa! Come again? (Playground link so you can personally experience the horror). Already tasting victory, we brace for the final struggle. But this time, the Borrow Checker is unyielding. Pragmatism prevails, and we put in an unsafe call to mem::transmute and a salty comment. We can finally get back to our lives, scarred yet stronger, and ready to face the inescapable day when we will once again meet the Borrow Checker in battle.

LocustDB needs YOU

If you managed to make it all this way, you indubitably have what it takes to take LocustDB to the next level! There is still a lot of work to do, some of it easy, some of it difficult, all of it interesting. Quite frankly, I have gone about as far as I am able/willing to on my own and any further progress will be a team effort. If you are excited by the prospect of running LocustDB in production, or always wanted to build the world’s fastest analytics database, your dream is now just one click away.

Acknowledgements

Thanks to Akhil Ravidas for introducing me to Scuba, encouraging my efforts and reviewing a draft of this article. Thanks to York Winter for reviewing a draft of this article. Thanks to David Fisher for working with me on LocustDB during two amazing hackweeks. Thanks to Dropbox for holding hackweeks in the first place and allowing me to open-source the original prototype. Thanks to everyone who contributed to Rust, without which LocustDB would not exist. And thanks to the many teachers, family members, colleagues and friends that helped me develop the skills and confidence to take on ambitious projects.

References

[1] Clemens Winter, “cswinter/LocustDB,” GitHub, 2018, https://github.com/cswinter/LocustDB.

[2] Daniel Abadi, “The Design and Implementation of Modern Column-Oriented Database Systems,” Foundations and Trends® in Databases 5, no. 3 (2012): 197–280, https://doi.org/10.1561/1900000024.

[3] Ulrich Drepper, “What Every Programmer Should Know About Memory,” What Every Programmer Should Know About Memory, 2007, https://people.freebsd.org/~lstewart/articles/cpumemory.pdf.

[4] Lior Abraham et al., “Scuba: Diving into Data at Facebook,” Proceedings of the VLDB Endowment 6 (2013): 1057–1067.

[5] “ClickHouse DBMS,” ClickHouse — open source distributed column-oriented DBMS, 2009, https://clickhouse.yandex/.

[6] Nathan Bronson, Thomas Lento, and Janet L. Wiener, “Open Data Challenges at Facebook,” in 2015 IEEE 31st International Conference on Data Engineering (2015 IEEE 31st International Conference on Data Engineering (ICDE), IEEE, 2015), https://doi.org/10.1109/icde.2015.7113415.

Part 3: Architecture deep dive

Pages: 1 2 3 4

13 thoughts on “How to Analyze Billions of Records per Second on a Single Desktop PC”

Jake July 9, 2018 at 3:18 pm

Have you tried middle-out compression to speed up search results? lol

Reply ↓
Paul Masurel July 9, 2018 at 3:28 pm

Have you had a look at solution based on inverted index like Druid or Imhotep ?

Reply ↓
1. clemenswinter Post authorJuly 9, 2018 at 8:46 pm
  
  I did evaluate Druid for a specific analytics use case at work one time and found performance to be underwhelming (but it might work better for other workloads and I seem to remember that Druid also offers very strong consistency and durability guarantees).
  
  Inverted indices make for very efficient point lookups/sparse filtering on a single column but introduce indexing overhead and don’t benefit all queries. The idea behind systems like ClickHouse and LocustDB is to make brute force scans so fast that you don’t even need an index. This does sacrifice performance/efficiency on simpler queries. But if your queries are large and comparatively infrequent (as is the case for many analytics workloads) this approach is a very good fit and should generally give superior performance and cost efficiency.
  
  Reply ↓
  1. fulmicoton July 10, 2018 at 1:34 am
    
    Thanks for your thorough reply!
Tim McNamara 🐍🦀 (@timClicks) July 9, 2018 at 6:36 pm

Clemens – this is a really ambitious, exciting project. Well done for continuing your motivation after the initial hackfest!

Do you have a gut feel for how many records that you would need before a system like this would be justified? In particular, how essential is the distributed component? It sounds like it scales down to a single laptop quite well.

Reply ↓
1. clemenswinter Post authorJuly 9, 2018 at 8:22 pm
  
  There are a number of design constraints to consider:
  1. How many queries are run per second
  2. How costly are queries (i.e. for how long do they run, which among other things depends on the number of records they touch)
  3. Latency requirements (i.e. you want queries to complete within n seconds)
  4. How many records are stored in the system
  
  If any of 1-3 are the bottleneck, you will need to add additional machines/CPUs to increase throughput and/or decrease latency. If a single machine can process all your queries sufficiently quickly, the limiting factor for the number of records is now the size of your RAM (or disk, but this is not currently supported by LocustDB and will lead to lower query speeds).
  
  So then the number of records you can store is (more or less) given by the formula memory_size * compression_ratio / size_per_record.
  
  Memory size is obvious.
  Compression ratio will depend on the actual data values (and also how good the compression algorithms are). Compression ratio is difficult to predict and depends very much on the values in the dataset, typical values might be between 3x and 20x.
  The size per record will depend on how many columns your table has and the size of each column.
  
  To give a concrete example, suppose you have 64GB of memory, you want to leave a spare 10GB of memory for running queries, you table has two dozen columns storing mostly integers and dictionary compressible strings with total record size of 200B uncompressed and 20B compressed. Then the number of records you could store would be roughly 54GB/200B = 2.7 billion.
  
  Of course the numbers for your use case may be very different, and you would have to actually load and benchmark (a portion) of your dataset to get accurate estimates for compression rates and query speeds.
  
  Reply ↓
david July 17, 2018 at 12:52 am

Reminds me a lot of the concepts from SanssouciDB described in detail by Hasso Plattner in his book “In-Memory Data Management”. I thought it might be interesting for you as well, since you haven’t mentioned SAP HANA as a database using the same concepts such as caching, compression, massive parallelism, etc.
SanssouciDB was the research project which’s concepts eventually evolved into the dbms of the commercial HANA platform.

Reply ↓
1. clemenswinter Post authorJuly 17, 2018 at 8:31 am
  
  Yes, SAP HANA seems to be quite successful in this space. I’d never heard of SanssouciDB or the book before, will have to check that out!
  
  Reply ↓
Pingback: How Read 100s of Millions of Records per Second from a Single Disk | Clemens' Blog
Jason January 10, 2019 at 7:52 pm

I’m aware this is an old post and feature sets / versions have changed but from what I can gather ClickHouse wasn’t run with LLVM enabled (compile=1; is disabled by default but is perfectly production ready). In my environment this generally improves performance by 15-50% depending on the query. Note that the first time a query runs there is a cost as it performs code-gen but that is not indicative of real world usage where the lib would pre-exist due to previous runs.

Additionally newer versions of ClickHouse support dictionary encoding (LowCardinality). Would be interesting to see this test repeated with these new features as I suspect the results would be quite different now.

Reply ↓
pindash91 July 30, 2019 at 9:27 am

For KDB, casting the column to symbol, (`$) would allow you to do the dictionary aggregation. KDB’s symbol type is what everyone else calls an enum.

Reply ↓
Phi October 14, 2020 at 2:50 am

Hej, Interesting stuff. Are you aware of the more obscure “APL” like systems like kdb+ or shakti (the newer and IMHO more interesting project by Arthur Whitney)?
Column based, memory mapped files, in memory + on disk data abse + CEP engine.

Reply ↓
1. clemenswinter Post authorOctober 14, 2020 at 7:27 am
  
  I actually have some comparisons to kdb+ in part 2 of this blogpost. This is the first time I’ve heard of shakti, hard to keep up with all the different database systems these days. There doesn’t seem to be much information about it yet but looks like like a neat project, will be interesting to see if it gains traction.
  
  Reply ↓