Tuesday, January 19, 2016

The advantages of an LSM vs a B-Tree

The log structured merge tree (LSM) is an interesting algorithm. It was designed for disks yet has been shown to be effective on SSD. Not all algorithms grow better with age. A long time ago I met one of the LSM co-inventors, Patrick O'Neil, at the first job I had after graduate school. He was advising my team on bitmap indexes. He did early and interesting work on both topics. I went on to maintain bitmap index code in the Oracle RDBMS for a few years. Patrick O'Neil made my career more interesting.

Performance evaluations are hard. It took me a long time to get expertise in InnoDB, then I repeated that for RocksDB. Along the way I made many mistakes. Advice on doing benchmarks for RocksDB is here and here.

tl;dr - the MyRocks advantage is better compression and less write-amplification

The MyRocks Advantage


There are many benefits of the MyRocks LSM relative to a B-Tree. If you want to try MyRocks the source is on github, there is a wiki with notes on buildingmy.cnf and more details on the MyRocks advantage. Track down Yoshinori at the MyRocks tutorial at Percona Live or his talk at FOSDEM to learn more. The advantages include:
  • better compression - on real and synthetic workloads we measure 2X better compression with MyRocks compared to compressed InnoDB. A B-Tree wastes space when pages fragment. An LSM doesn't fragment. While an LSM can waste space from old versions of rows, with leveled compaction the overhead is ~10% of the database size compared to between 33% and 50% for a fragmented B-Tree and I have confirmed such fragmentation in production. MyRocks also uses less space for per-row metadata than InnoDB. Finally, InnoDB disk pages have a fixed size and more space is lost from rounding up the compressed page output (maybe 5KB) to the fixed page size (maybe 8KB).
  • no page reads for secondary index maintenance - MyRocks does not read the old version of a secondary index entry during insert, update or delete maintenance for a non-unique secondary index. So this is a write-only operation compared to read-page, modify-page, write-page for a B-Tree. This greatly reduces the random IO overhead for some workloads (benchmark results to be shared soon).
  • less IO capacity used for persisting changes - a B-Tree does more & smaller writes to persist a database while the MyRocks LSM does fewer & larger writes. Not only is less random IO capacity used, but the MyRocks LSM writes less data to storage per transaction compared to InnoDB as seen in the Linkbench results (look for wKB). In that case the difference was ~10X.
  • less write amplification from flash GC - the write pattern from the MyRocks LSM is friendlier to flash GC compared to InnoDB. This leads to lower write-amplification so that InnoDB was writing up to 18X more data per transaction compared to MyRocks as seen in the Linkbench results.
  • simpler code - RocksDB is much simpler than InnoDB. I have spent years with both -- debugging and reading code. MyRocks still has mutex contention, but the problem is easier to fix because the engine is simpler. New features are easier to add because the engine is simpler. We are moving fast to make MyRocks better.

The MyRocks Disadvantage


There is no free lunch with database algorithms. From theory we can expect better efficiency on writes and worse efficiency from reads with an LSM compared to a B-Tree. But what does that mean in practice? This will take time to figure out and I try to include IO efficiency metrics in my benchmark results that include disk reads, disk KB read and disk KB written per transaction to understand the differences. But at this point the IO efficiency improvements I see from MyRocks are much greater than the IO efficiency losses.

MyRocks is not as feature complete as InnoDB. While the team is moving fast the MySQL storage engine API is hard and getting harder. Missing features include text, geo, native partitioning, online DDL and XA recovery on the master (keep binlog & RocksDB in sync on master). I am not sure that text & geo will ever get implemented. I hope to see XA recovery and partial support for online DDL this year.

MyRocks is not proven like InnoDB. This takes time, a variety of workloads and robust QA. It is getting done but I am not aware of anybody running MyRocks in production today.

An LSM might use more CPU per query because more data structures can be checked to satisfy a query. This can increase the number of CPU cache misses and key comparisons done per query. It is likely to be a bigger deal on CPU-bound workloads and my focus has been on IO-bound. One example of the difference is the tuning that was required to make the MyRocks load performance match InnoDB when using fast storage. Although in that case the problem might be more of an implementation artifact than something fundamental to the LSM algorithm.

An LSM like MyRocks can also suffer from the range read penalty. An LSM might have to check many files to get data for a point or range query. Most LSMs, including MyRocks, use bloom filters to reduce the number of files to be checked during a point query. Bloom filters can eliminate all disk reads for queries of keys that don't exist. We have had some workloads on InnoDB for which 30% of the disk reads were for keys that don't exist.

Bloom filters can't be used on pure range queries. The most frequent query in Linkbench, and the real workload it models, is a short range query. Fortunately this query includes an equality predicate on the leading two columns of a covering secondary index (see the id1_type index in the Linkbench schema) and RocksDB has a feature to create the bloom filter on a prefix of the key (innovation!).

Even when the prefix bloom filter can't be used the range read penalty can be overstated. It is important to distinguish between logical and physical reads. A logical read means that data is read from a file. A physical read means that a logical read had to use the storage device to get the data. Otherwise a logical read is satisfied from cache. A range query with an LSM can require more logical reads than a B-Tree. But the important question is whether it requires more physical reads.

7 comments:

  1. Hi Mark, nice post. Apart from the differences in access pattern, does MyRocks lack at any of storage independent Innodb features, for eg: transaction processing, MVCC, referential integrity etc ?

    ReplyDelete
  2. Many features are missing. The plan is to have enough features to be used in production later this year by a web-scale internet company. The engine API in MySQL is a huge barrier to innovation in MySQL land. In less than one year the MongoDB engine API was defined, implemented for RocksDB (called MongoRocks) and put into production. It is too bad that we can't move at the same speed in MySQL and this is lousy for the community.

    In the "Disadvantage" above section I describe some of the missing features. There are more I have yet to mention, for example it doesn't support tables without a PK, but that is to be fixed real soon. It also doesn't support foreign keys. They are supported by InnoDB, but seem to have a huge number of limitations. Hopefully Oracle will improve that.

    MyRocks does multi-statement transactions and MVCC (repeatable read and read committed). Repeatable read is Postgres-style, not InnoDB style, so there are no gap locks. Notes on cursor isolation in Postgres, InnoDB and Oracle are here:
    https://github.com/mdcallag/mytools/wiki/Cursor-Isolation

    ReplyDelete
  3. More limitations are described at https://github.com/facebook/mysql-5.6/wiki/MyRocks-limitations

    ReplyDelete
  4. Hi Marc, I am reading this later than I should. It mostly make sense to me but I have a comment about the MyRocks Advantage "no page reads for secondary index maintenance": isn't this already achieved by InnoDB change buffering ? Thanks.

    ReplyDelete
    Replies
    1. The InnoDB change buffer usually means fewer page reads not no page reads. Eventually the buffered changes must be applied and when the change buffer gets too full then background threads do the reads.

      This is easy to see in the tests I run where I also run iostat to measure disk IO.

      Delete
    2. Thanks for your reply Mark, but isn't is the same for MyRocks: compaction eventually also needs to happen.

      Delete
    3. There are reads for compaction but they if you believe my hand-wavy math then the work isn't comparable. Lets assume:
      * LSM has 4 levels, first 3 fit in RAM, max level is not in RAM. This means the database is 10X the size of RAM
      * LSM write-amp is 20 and write-amp per level is 5 given there are 4 levels. In real life, the write-amp per level isn't constant, but this is good enough for hand waving.
      * database is 2X larger with InnoDB and the database is 20X the size of RAM, so 5% of the database is cached
      * InnoDB change buffer removes 90% of disk reads during secondary index maintainance. I am being optimistic about how much it helps as it has been good but not this good for me in production.

      The link table in linkbench has a PK index and one secondary index. So X updates/second means ~X disk reads/second without the change buffer and ~X/5 disk reads/second with the change buffer. For a disk that can do 128 random IO/second then and update consumes 1 / 5 / 128 of the IO capacity and if IO is only used for change buffer reads then I can do 640 updates/second.

      For RocksDB I will assume that write-amp from compaction is 20, so read-amp from compaction is 20. Each byte written into the database is re-read and re-written 20 times thanks to the LSM. With 4 levels in the LSM, the read-amp per level is 4 but all of the levels in the LSM prior to the max level are in RAM. So the total read-amp from disk is 4 (the read-amp for the max level).

      If my secondary index entry is 64 bytes then the read-amp is 4 * 64 bytes per update or 256 bytes / update. That is 256 bytes of sequential IO. Assuming my disk can do 32 MB/sec of sequential IO this consumes 2^8 / 2^25 = 1 / 2^17 of the IO capacity of the disk. If the only use of IO were for those reads then I could do 2^17 update/second before saturating the disk.

      Delete