Links:
…Hctree Test Results
…Hctree Design Documentation
…Hctree File Format…
SQLite is sometimes used as the core of a client/server database system.
While it works reliably well in such cases, the database backend module that it
uses to store b-tree structures in its database file was not designed with this
case in mind and can be improved upon in several ways. The HC-tree (hctree)
project is an attempt to develop a new database backend that improves upon
regular SQLite as follows:
-
Improved concurrency: Stock SQLite is limited to a single
concurrent writer.Using the begin-concurrent extension changes this so that multiple
writers may run concurrently using optimistic page-level locking. This
improves concurrency somewhat, but page-level locking can detect
conflicts between logically independant transactions, and COMMIT
operations must still be serialized.Hctree uses optimistic row-level locking and is designed to support
dozens of concurrent writers running at full-speed.
Test results obtained from the prototype
show that this is possible. -
Support for replication: Stock SQLite supports the
sessions extension, which
allows the contents of a committed transaction to be serialized for
tranmission and application to a second database.Hctree builds this into the database backend, and adds support for
application of such transactions to follower databases in leader-follower
configurations. In this case, transactions received from a leader database
can be applied more quickly and with greater concurrency than with
which they were originally applied to the leader database, because
no transaction validation is required. -
Removal of database size limitations: Stock SQLite uses 32-bit
page numbers. Using the default 4KiB page-size, this leads to a maximum
database size of 2^44 bytes, or 16TiB.Hctree uses 48-bit page numbers, allowing 2^60 byte databases, or 1EiB.
Roughly one million TiB.
An implicit goal is that hctree must be as fast or faster than stock SQLite
for all single-threaded cases. There is no point in running dozens of
concurrent writers if each of them is an order of magnitude slower than
a single writer writing to a legacy database.
Hctree clients (those that use a version of SQLite compiled from this
repository) may read hctree databases and stock SQLite databases.
This project contains a fork of the SQLite project that has been modified
to include a prototype of the hctree database backend. It may be built and
used in all the same ways as the stock SQLite library code.
The prototype is advanced enough to experiment with and to conduct
multi-threaded performance tests with, but
is incomplete. The current code has, at least, the following shortcomings:
-
“BEGIN EXCLUSIVE” is not implemented
for hctree databases. It is parsed, but behaves just as a regular
“BEGIN” does. -
All hctree databases currently use 32-bit page numbers (maximum
database size of 16TiB with the default 4KiB byte page size). -
The database is always accessed using mmap(). It is not possible
to fallback to pread()/pwrite(). -
Transactions that do not fit entirely in main memory (e.g. large
CREATE INDEX statements) are not handled efficiently. -
No support for replication is implemented.
-
The prototype library only works on POSIX, not win32, systems.
-
All clients of an hctree database, read or read/write, must access
the database from within the same OS process. -
There is no support for recovering from a power failure or operating
system crash. Such an event may corrupt the database.
The prototype is on the “hctree” branch (not trunk!) of the fossil repository
at https://sqlite.org/hctree. To obtain
the latest development snapshot, download:
or check the code out using fossil:
fossil clone https://www.sqlite.org/hctree hctree.fossil mkdir hctree cd hctree fossil open ../hctree.fossil fossil up hctree
The sources may be built and deployed in the same ways as stock SQLite.
Creating/Opening Databases
When opening an existing database, the library determines automatically
whether or not it is a stock SQLite database or an hctree database. By
default, when creating a new database, the library creates an ordinary
SQLite database. To create a new hctree database, the
SQLite URI parameter “hctree=1”
must be specified the first time the new database is opened. e.g.
file://new_hctree_database.db?hctree=1
The easiest ways to check if a database really is an hctree database
are:
-
To check the first 6 bytes of the database file. For an hctree
database they are “hctree”, instead of the usual “SQLite” found in the first
6 bytes of stock SQLite databases. -
Open the database and evaluate an hctree specific pragma command like
“PRAGMA hct_ncasfail”. If it returns an integer, the database is an hctree
database. If it returns no rows, a legacy SQLite database.
Each database consists of two files – the database file itself and a second
file name “
Concurrency Model
Hctree offers similar MVCC based optimistic concurrency to SQLite with the
begin-concurrent extension, except that it validates transactions based on the
keys and logical ranges of keys accessed instead of the set of pages.
As in any MVCC database system, including stock SQLite, both readers and
writers access a consistent snapshot of the database, in which writes committed
after the transaction was opened are never visible. Any writes made within the
transaction are accumulated privately and made visible only to the transaction
itself until they are committed.
When a transaction is to be committed in an optimistic concurrency system,
it must first be validated. To validate a transaction is to check that, when
considered in concert with all other past and current transactions, the
transaction might have been committed in a system that serializes all
transactions without changing the outcome in terms of the final state of the
database or query results returned to any client. In practice, there are two
ways for an hctree transaction to be validated:
-
If no other client has committed a transaction to the database since
the writer’s snapshot was opened, the transaction must be valid. -
If no other client has modified any b-tree (table or index) entry
or range that the transaction being committed accessed by a range or
stabbing query, then the transaction is valid.
For example, considering the following database:
CREATE TABLE t1(a INTEGER PRIMARY KEY, b TEXT, c TEXT); CREATE INDEX i1 ON t1(b); INSERT INTO t1 VALUES(1, 'a', 'A'); INSERT INTO t1 VALUES(2, 'b', 'B'); INSERT INTO t1 VALUES(3, 'x', 'X'); INSERT INTO t1 VALUES(4, 'c', 'C'); INSERT INTO t1 VALUES(5, 'd', 'D');
And the following transaction:
BEGIN CONCURRENT; SELECT * FROM t1 WHERE (b BETWEEN 'a' AND 'c') AND c!='B'; INSERT INTO t1 VALUES(6, 'e'); COMMIT;
Then during transaction execution, hctree records that the following were
accessed:
- The range of keys between ‘a’ and ‘c’, inclusive, within the index
b-tree. - Keys 1, 2, 4 and 6 of the table b-tree.
If, when the attempt to commit is made, the hctree client finds that
either the contents of the index b-tree range or one of the table b-tree
keys has been modified since the writer’s snapshot was opened, validation
does not succeed and the transaction is not committed.
Note that another client modifying the (2, ‘b’, ‘B’) tuple causes
validation to fail, even though the SELECT query excludes that row. This
is because validation is based on the entries and ranges of entries read
from the b-tree layer, not the subset of those actually used by the database
engine. If there were no index in the database schema, then the example SELECT
query would be implemented using a full-table scan of the table b-tree. In
this case any modification to any row of the table b-tree would cause
transaction validation to fail.
All database transactions implicitly read the database schema. So any
transaction that modifies the database schema may only be commited if no
other client has written to the database since the writer’s snapshot
was opened (case 1 above).
Transaction Types
Hctree allows clients to open three different transaction types, as
follows:
-
BEGIN; Transactions opened with “BEGIN” do not collect
validation data. This means that if they write to the database, they
may only be committed if no other client has written the database
since the writer’s snapshot was opened
(validation case 1 above). -
BEGIN CONCURRENT; Transactions opened with “BEGIN CONCURRENT”
do collect validation data as they run. So that if they write to the
database, upon commit they may, subject to validation, be committed
even if other transactions have been committed since the snapshot
was opened. -
BEGIN EXCLUSIVE; A “BEGIN EXCLUSIVE” command not only opens
a transaction, but also prevents any new transaction-commit operations
from starting, and spin-locks until all ongoing commits have finished.
Any new commit operations started while the BEGIN EXCLUSIVE transaction
is active also spin-lock until it is finished. Thus a transaction started
using “BEGIN EXCLUSIVE” always passes validation, since it is always
committed against the same snapshot against which it is prepared.
Implicit transactions are handled in the same way as those opened using
“BEGIN”.
In general, “BEGIN” should be used for read-only transactions, as it does
not incur the overhead of accumulating data required for validation. “BEGIN
CONCURRENT” should be used for most read/write transactions and “BEGIN
EXCLUSIVE” as a fallback for transactions that are likely to conflict otherwise
(perhaps because they have already been tried and found to conflict several
times already, or perhaps because they contain schema modifications).
In all cases above, “spin-lock” actually means invoke the SQLite xBusy
callback if one is registered, or to literally spin-lock otherwise.
Leave A Comment