Trees I have known

Toby Lehman


I introduced the T-Tree at the first HPTS in 1985.  I was investigating binary trees and B-Trees in the context of main-memory database management systems (MMDBMS) for my thesis, and I came up with a mix of the two structures that had better overall characteristics as an index structure for an MMDBMS.  The T-Tree provided search speeds almost as fast as a binary tree's, with memory efficiency almost as good as a B+-Tree's.  For a time it got a fair amount of attention, and it was chosen as the preferred index structure in several MMDBMS products in the 1990s.
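To make the mix concrete, here is a minimal sketch of the idea in C.  The node layout, the capacity of 8 entries, and all names are illustrative assumptions for this post, not the layout from the original thesis: each node holds an ordered array of entries (B-Tree-like density) bracketed by binary-tree-style left/right children, so search descends like a binary tree but over a much shallower structure.

```c
#include <assert.h>
#include <stddef.h>

#define TNODE_CAP 8  /* entries per node; illustrative, real nodes held many more */

/* A T-Tree node: an ordered array of entries plus binary-tree-style
 * left/right child pointers.  In an MMDBMS the entries would be
 * references to memory-resident records; longs stand in here. */
typedef struct tnode {
    struct tnode *left, *right;
    int  nkeys;
    long keys[TNODE_CAP];
} tnode;

/* Binary search within one node's sorted array; returns 1 if found. */
static int tnode_contains(const tnode *n, long key) {
    int lo = 0, hi = n->nkeys - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (n->keys[mid] == key) return 1;
        if (n->keys[mid] < key)  lo = mid + 1;
        else                     hi = mid - 1;
    }
    return 0;
}

/* Descend like a binary tree: go left if the key is below the node's
 * minimum, right if above its maximum, otherwise this node bounds
 * the key and the answer is inside its array. */
static int ttree_search(const tnode *root, long key) {
    const tnode *n = root;
    while (n) {
        if (n->nkeys && key < n->keys[0])                n = n->left;
        else if (n->nkeys && key > n->keys[n->nkeys - 1]) n = n->right;
        else return tnode_contains(n, key);
    }
    return 0;
}
```

Because each visited node covers a whole range of keys, the tree holds far fewer nodes (and pointers) than a binary tree over the same data, which is where both the search speed and the memory efficiency come from.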

One characteristic of the index structures I chose for my MMDBMS work was that the indexes themselves carried no data – they held only references to the memory-resident records. That made them considerably more space efficient on the one hand, but it also made them a bit slower to search, since every data reference required one or more indirections to reach the record data.  A major advantage of this approach was that any predicate (i.e. a residual predicate) could be applied to the record once it was located via the index.

The data reference could be a simple pointer (i.e. a memory address), or it could be a data structure reference.  A memory address is faster to traverse, but it is more fragile and would not likely survive a machine crash or shutdown.  A data structure address is robust, but requires more steps to traverse.   Although I did not use it at the time, I later learned that a hybrid mechanism (a permanent data structure address with a transient cached physical memory address) could provide faster access, at the expense of extra memory.
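That hybrid mechanism can be sketched in a few lines of C.  Everything here is an assumption for illustration – the segment/slot naming, the toy directory, the field layout – but it shows the shape of the idea: a stable logical address that survives restart, plus a cached raw pointer that is cheap to follow and simply discarded on recovery.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hybrid record reference: a stable logical address
 * (segment + slot) paired with a transient cached pointer.  After a
 * crash or shutdown the cached pointer is invalid and reset to NULL. */
typedef struct {
    unsigned segment;   /* stable: which storage segment            */
    unsigned slot;      /* stable: which slot within the segment    */
    void    *cached;    /* transient: raw address, NULL after restart */
} rec_ref;

/* Toy segment directory standing in for a real storage manager. */
static char heap[4][16];   /* 4 segments x 16 one-byte slots */

/* Slow path: translate the logical address through the directory. */
static void *resolve_slow(const rec_ref *r) {
    return &heap[r->segment][r->slot];
}

/* Fast path: follow the cached pointer when present; otherwise do
 * the full translation once and remember the result. */
static void *resolve(rec_ref *r) {
    if (r->cached == NULL)
        r->cached = resolve_slow(r);
    return r->cached;
}
```

The cost is exactly as described above: one extra pointer of memory per reference, in exchange for single-indirection access on every lookup after the first.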

Since then, the database community and I have learned many lessons about data structures.  The first lesson is that memory efficiency still matters today.  When I started my work in the 1980s, megabyte memories were still fairly expensive, and it was somewhat of a fanciful dream to suppose that one day there would be GIGABYTE memories.  However, even though we may have machines with terabyte memories today, we also have databases with billions of records.  If each record is included in multiple indexes, then the index overhead in bytes per record is still an issue. Binary trees tend to be the least space-efficient of all the index structures, and yet there are companies today that still use basic, space-wasting binary trees.  For example, if each index entry requires, say, 64 bytes, then EACH index of a billion-record table requires 64 billion bytes.  For a large table of small records (common in NoSQL systems), the index memory cost can dwarf the table memory cost. Customers find this interesting when they learn that their 200 gigabytes of data actually requires 10 terabytes of memory.
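The arithmetic behind that space gap is worth spelling out.  The numbers below are back-of-envelope assumptions (8-byte pointers, a node fanout of 32), not measurements of any product: a binary tree pays two child pointers for every single entry, while a T-Tree-style node amortizes its two pointers across a whole array of entries.

```c
#include <assert.h>
#include <stddef.h>

enum {
    PTR    = 8,   /* assumed pointer size on a 64-bit machine */
    KEYREF = 8,   /* assumed size of one record reference     */
    FANOUT = 32   /* assumed entries per array-based node     */
};

/* Binary tree: every entry carries its own left and right pointers. */
static size_t binary_index_bytes(size_t entries) {
    return entries * (2 * PTR + KEYREF);
}

/* T-Tree-style: two pointers per node, shared by FANOUT entries. */
static size_t ttree_index_bytes(size_t entries) {
    size_t nodes = (entries + FANOUT - 1) / FANOUT;  /* round up */
    return nodes * 2 * PTR + entries * KEYREF;
}
```

Under these assumptions a million-entry binary index needs 24 MB where an array-based node layout needs about 8.5 MB – and that factor of roughly three is multiplied by every index on every billion-record table.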

The second lesson is that the underlying machine architecture has changed from one or two processors to 16, 24, 32, 64, 128 or 256, and yet many database products still use index data structures that were designed for a single processor.  Newer index structures, such as the KISS-Tree, the Adaptive Radix Tree and the Bw-Tree, to name a few, are designed for high concurrency for both reads and updates.

We are on the cusp of a new age.   With increased memory densities, especially those brought to market by the new 3D XPoint memory technology, fast, inexpensive, stable memories in terabyte sizes will be commonplace.  We must keep our collective eye on memory efficiency, but with 256-core processors and multi-hundred-terabyte memories, the emphasis will certainly be on the parallel efficiency of index structures.

Is it time for "Bi-SQL"?

Don Haderle, IBM Fellow and self-proclaimed "Father of DB2", made an excellent observation about the evolution of database systems.  He said that database architecture begins with an index over a data file (which is fast and efficient), and then increasingly adds function and complexity to meet growing customer demands.  IBM's IMS and DB2 both followed that trajectory, as did most other relational database products.  At the point where the ever-increasing complexity slows down the simple transactions, such as single-record insert/update, a breakaway fraction of the market declares a new approach (a clean slate, if you will) that is intended to revolutionize the simple transaction (CRUD: Create, Read, Update, Delete) operations.  The new design, an index over a file, is reborn.  Key-value database systems are a perfect example.

The NoSQL movement coupled the simplified architecture (index over file) with the power of parallelism -- horizontally distributed database nodes that could loosely support a single large database -- provided no insert/update/delete transaction affected more than a single record (and thus did not trigger a distributed, multi-node transaction commit).  However, just as the first database systems were hit with increasing customer demands for more function, the same story applies to the NoSQL products.  Customers like the fast performance with the horizontal scaling capability, but then they want more function.  "Where are the joins?" is a question that rings in the NoSQL development hallways, much to the annoyance of the designers who thought they could ride the simplicity wave a lot longer.

Looking at the two kinds of systems -- the large, complex, slow relational systems, and the simple, fast NoSQL systems -- I started to wonder ... Why can't we have both?  My own history is that I created an in-memory storage manager prototype in the IBM Research Starburst system -- the codebase that became the DB2 Version 2 database product.  That storage manager was not included in the product (the product employed only the disk-based storage manager), but I did learn a lot about what goes into a storage manager that is used by a relational database system.


 That really got me thinking.  Why can't we have both? 

Let me know if you have a comment or question.


The First Line

The creation of a database system comprising a million lines of code begins with a single line:

int main(int argc, char *argv[])

Yeah, real systems are written in C.  What can I say?  Sure, implementation in Java would go much faster, and coding/prototyping/testing/debugging would be so much more pleasant, but when it comes down to complete control of memory and code, for me, it's got to be C.  

From there, we can go anywhere.  Where do we start?  How much design up front?  How much design iteration and evolution? There are many opinions on these issues. The lessons learned from using classical engineering project management (i.e. "waterfall") for a software project tell us that "Big Design Up Front" (BDUF) is usually a huge mistake.  Although the benefits of discovering design bugs before implementation -- and especially before deployment -- are obviously attractive, the truth is that the initial product requirements that drove the design will not match the final (version 1) requirements by the time of the initial release.  Thus, much of the design work ends up being redone repeatedly as the requirements change.  The other end of the spectrum, where coding starts with minimal design and the system evolves organically (usually using agile programming methods), has a different set of pitfalls.  Oftentimes, agile-driven, organically evolved systems end up missing critical sub-components because those components were not designed in up front.  Then the engineering team learns (and re-learns) the pain of adding integral components to a fully operational system.

Many of the NoSQL products -- the newcomers in the database market -- appear to have learned that painful lesson.  I'm not naming products, but several NoSQL products have minimal or missing components that are normally associated with complete database systems: authorization, concurrency control, transaction management (with full recovery capability) and disaster recovery come to mind.  These are things that are not easily added to an existing system, or even worse, to a released product. I've experienced the pain of adding full transaction management and recovery after the fact.   I was lucky, as my system was still fairly young and not many ad hoc changes had been implemented -- specifically, changes that would have made the transaction and recovery implementation more difficult because of special handling of data types or unusual storage semantics.

So, what lessons have we learned?  Neither extreme (waterfall or agile) in its pure form is the right approach for something as complex as a database system.  There has to be a mix. There's a minimum number of components and capabilities that have to be included in the initial design, and then regular iteration of review and refactoring is needed as the system evolves.

My intention is to use this blog to track the evolution of the new Clean Slate Systems DB -- a database system that doesn't fit any of the existing database categories.

Stay tuned.