HC-tree


Links:
Hctree Test Results
Hctree Design Documentation
Hctree File Format

SQLite is sometimes used as the core of a client/server database system.
While it works reliably well in such cases, the database backend module that it
uses to store b-tree structures in its database file was not designed with this
case in mind and can be improved upon in several ways. The HC-tree (hctree)
project is an attempt to develop a new database backend that improves upon
regular SQLite as follows:

  • Improved concurrency: Stock SQLite is limited to a single
    concurrent writer.

    Using the begin-concurrent extension changes this so that multiple
    writers may run concurrently using optimistic page-level locking. This
    improves concurrency somewhat, but page-level locking can detect
    conflicts between logically independant transactions, and COMMIT
    operations must still be serialized.

    Hctree uses optimistic row-level locking and is designed to support
    dozens of concurrent writers running at full-speed.
    Test results obtained from the prototype
    show that this is possible.

  • Support for replication: Stock SQLite supports the
    sessions extension, which
    allows the contents of a committed transaction to be serialized for
    tranmission and application to a second database.

    Hctree builds this into the database backend, and adds support for
    application of such transactions to follower databases in leader-follower
    configurations. In this case, transactions received from a leader database
    can be applied more quickly and with greater concurrency than with
    which they were originally applied to the leader database, because
    no transaction validation is required.

  • Removal of database size limitations: Stock SQLite uses 32-bit
    page numbers. Using the default 4KiB page-size, this leads to a maximum
    database size of 2^44 bytes, or 16TiB.

    Hctree uses 48-bit page numbers, allowing 2^60 byte databases, or 1EiB.
    Roughly one million TiB.

An implicit goal is that hctree must be as fast or faster than stock SQLite
for all single-threaded cases. There is no point in running dozens of
concurrent writers if each of them is an order of magnitude slower than
a single writer writing to a legacy database.

Hctree clients (those that use a version of SQLite compiled from this
repository) may read hctree databases and stock SQLite databases.

This project contains a fork of the SQLite project that has been modified
to include a prototype of the hctree database backend. It may be built and
used in all the same ways as the stock SQLite library code.

The prototype is advanced enough to experiment with and to conduct
multi-threaded performance tests with, but
is incomplete. The current code has, at least, the following shortcomings:

  1. “BEGIN EXCLUSIVE” is not implemented
    for hctree databases. It is parsed, but behaves just as a regular
    “BEGIN” does.

  2. All hctree databases currently use 32-bit page numbers (maximum
    database size of 16TiB with the default 4KiB byte page size).

  3. The database is always accessed using mmap(). It is not possible
    to fallback to pread()/pwrite().

  4. Transactions that do not fit entirely in main memory (e.g. large
    CREATE INDEX statements) are not handled efficiently.

  5. No support for replication is implemented.

  6. The prototype library only works on POSIX, not win32, systems.

  7. All clients of an hctree database, read or read/write, must access
    the database from within the same OS process.

  8. There is no support for recovering from a power failure or operating
    system crash. Such an event may corrupt the database.

The prototype is on the “hctree” branch (not trunk!) of the fossil repository
at https://sqlite.org/hctree. To obtain
the latest development snapshot, download:

or check the code out using fossil:

    fossil clone https://www.sqlite.org/hctree hctree.fossil
    mkdir hctree
    cd hctree
    fossil open ../hctree.fossil
    fossil up hctree

The sources may be built and deployed in the same ways as stock SQLite.

Creating/Opening Databases

When opening an existing database, the library determines automatically
whether or not it is a stock SQLite database or an hctree database. By
default, when creating a new database, the library creates an ordinary
SQLite database. To create a new hctree database, the
SQLite URI parameter “hctree=1”
must be specified the first time the new database is opened. e.g.

        file://new_hctree_database.db?hctree=1

The easiest ways to check if a database really is an hctree database
are:

  • To check the first 6 bytes of the database file. For an hctree
    database they are “hctree”, instead of the usual “SQLite” found in the first
    6 bytes of stock SQLite databases.

  • Open the database and evaluate an hctree specific pragma command like
    “PRAGMA hct_ncasfail”. If it returns an integer, the database is an hctree
    database. If it returns no rows, a legacy SQLite database.

Each database consists of two files – the database file itself and a second
file name “-pagemap”.

Concurrency Model

Hctree offers similar MVCC based optimistic concurrency to SQLite with the
begin-concurrent extension, except that it validates transactions based on the
keys and logical ranges of keys accessed instead of the set of pages.

As in any MVCC database system, including stock SQLite, both readers and
writers access a consistent snapshot of the database, in which writes committed
after the transaction was opened are never visible. Any writes made within the
transaction are accumulated privately and made visible only to the transaction
itself until they are committed.

When a transaction is to be committed in an optimistic concurrency system,
it must first be validated. To validate a transaction is to check that, when
considered in concert with all other past and current transactions, the
transaction might have been committed in a system that serializes all
transactions without changing the outcome in terms of the final state of the
database or query results returned to any client. In practice, there are two
ways for an hctree transaction to be validated:

  1. If no other client has committed a transaction to the database since
    the writer’s snapshot was opened, the transaction must be valid.

  2. If no other client has modified any b-tree (table or index) entry
    or range that the transaction being committed accessed by a range or
    stabbing query, then the transaction is valid.

For example, considering the following database:

    CREATE TABLE t1(a INTEGER PRIMARY KEY, b TEXT, c TEXT);
    CREATE INDEX i1 ON t1(b);
    INSERT INTO t1 VALUES(1, 'a', 'A');
    INSERT INTO t1 VALUES(2, 'b', 'B');
    INSERT INTO t1 VALUES(3, 'x', 'X');
    INSERT INTO t1 VALUES(4, 'c', 'C');
    INSERT INTO t1 VALUES(5, 'd', 'D');

And the following transaction:

    BEGIN CONCURRENT;
      SELECT * FROM t1 WHERE (b BETWEEN 'a' AND 'c') AND c!='B';
      INSERT INTO t1 VALUES(6, 'e');
    COMMIT;

Then during transaction execution, hctree records that the following were
accessed:

  • The range of keys between ‘a’ and ‘c’, inclusive, within the index
    b-tree.
  • Keys 1, 2, 4 and 6 of the table b-tree.

If, when the attempt to commit is made, the hctree client finds that
either the contents of the index b-tree range or one of the table b-tree
keys has been modified since the writer’s snapshot was opened, validation
does not succeed and the transaction is not committed.

Note that another client modifying the (2, ‘b’, ‘B’) tuple causes
validation to fail, even though the SELECT query excludes that row. This
is because validation is based on the entries and ranges of entries read
from the b-tree layer, not the subset of those actually used by the database
engine. If there were no index in the database schema, then the example SELECT
query would be implemented using a full-table scan of the table b-tree. In
this case any modification to any row of the table b-tree would cause
transaction validation to fail.

All database transactions implicitly read the database schema. So any
transaction that modifies the database schema may only be commited if no
other client has written to the database since the writer’s snapshot
was opened (case 1 above).

Transaction Types

Hctree allows clients to open three different transaction types, as
follows:

  1. BEGIN; Transactions opened with “BEGIN” do not collect
    validation data. This means that if they write to the database, they
    may only be committed if no other client has written the database
    since the writer’s snapshot was opened
    (validation case 1 above).

  2. BEGIN CONCURRENT; Transactions opened with “BEGIN CONCURRENT”
    do collect validation data as they run. So that if they write to the
    database, upon commit they may, subject to validation, be committed
    even if other transactions have been committed since the snapshot
    was opened.

  3. BEGIN EXCLUSIVE; A “BEGIN EXCLUSIVE” command not only opens
    a transaction, but also prevents any new transaction-commit operations
    from starting, and spin-locks until all ongoing commits have finished.
    Any new commit operations started while the BEGIN EXCLUSIVE transaction
    is active also spin-lock until it is finished. Thus a transaction started
    using “BEGIN EXCLUSIVE” always passes validation, since it is always
    committed against the same snapshot against which it is prepared.

Implicit transactions are handled in the same way as those opened using
“BEGIN”.

In general, “BEGIN” should be used for read-only transactions, as it does
not incur the overhead of accumulating data required for validation. “BEGIN
CONCURRENT” should be used for most read/write transactions and “BEGIN
EXCLUSIVE” as a fallback for transactions that are likely to conflict otherwise
(perhaps because they have already been tried and found to conflict several
times already, or perhaps because they contain schema modifications).

In all cases above, “spin-lock” actually means invoke the SQLite xBusy
callback if one is registered, or to literally spin-lock otherwise.

Read More