July 17, 2022

Extending SQLite with Rust to support Excel files as virtual tables

This article explains how SQLite can be extended with Rust. In particular, it will outline SQLite’s mechanism called virtual tables and showcase how we can use it from Rust programming language.

Extending SQLite with Rust

In the end, we will have a working extension that can be dynamically loaded and used from SQLite. This article does not claim to be an exhaustive guide about extending SQLite with Rust, but I hope the reader will have a basic understanding of how extensions work after reading it.

We will be dealing with some amount of unsafe Rust code. There are frameworks that can be used to abstract away some unsafe code, but we’ll not use them because rust-bindgen will take care of generating the bindings for us, and we will only use unsafe in the relatively thin layer of code that connects SQLite and Rust.

Processing Excel files with Rust won’t be covered. This task is delegated to excellent and easy to use Calamine library. I recommend reading their documentation.

Also, I won’t post and explain every piece of code for obvious reasons, but, of course, everyone is welcome at GitHub.

Objectives

Let’s say we have some useful data in Excel, and we would like to process it using SQL. One way of doing that would be importing data from a spreadsheet to some table inside SQLite. That is certainly a valid and useful approach: it’s possible to convert XLSX and XLS files to CSV and then import the CSV files for further processing.

There is another way to approach this problem - virtual tables. The sqlite.org site states: “The virtual table is an object that looks like any other table or view. But behind the scenes, queries and updates on a virtual table invoke callback methods of the virtual table object instead of reading and writing on the database file”. In other words, a virtual table is something that abstracts away some data source which could be used as a table.

If it looks like a table, swims like a table, and quacks like a table, then it probably is a table

There is already an implementation for CSV virtual tables. Let’s see what the typical usage of the CSV virtual table looks like:

CREATE VIRTUAL TABLE csv_data USING csv(
    FILENAME='/path/to/file.csv'
);
SELECT * FROM csv_data;

Really simple and self-explanatory, isn’t it? Load CSV data, then treat it like a regular table (even though it isn’t).

What if we could do the same trick but with spreadsheets? If it’s possible for CSV then it should be possible for XLSX and XLS files. We will write a SQLite extension for exactly that purpose.

Project structure

SQLite has the ability to load extensions as dynamic libraries. The library must declare an entry point which is called when the extension is loaded. Let’s create a Rust project named xlite and specify crate type to be dynamic library:

[package]
name = "xlite"
version = "0.1.0"
edition = "2021"

[lib]
crate-type = ["cdylib"]

Running cargo build on this project will produce a dynamic library with one of the following filenames depending on your operation system: libxlite.so, libxlite.dylib, xlite.dll. The filenames are important because SQLite will use them to figure out the name of the entry point function (we will discuss that function next).

Different languages have different libraries and APIs to load SQLite extensions. For simplicity and portability we will use Command Line Shell For SQLite now and further to test our extension. Now let’s load the extension (assuming you have libxlite.so in your current directory):

sqlite3
> .load 'libxlite'

This command will tell SQLite to look for a dynamic library called libxlite.so. In that file SQLite will try to find a function named sqlite3_xlite_init. It’s fairly easy to see the pattern here: sqlite3_{EXTENSION_NAME}_init is the name of the function.

Naturally, the above command will fail with error: we have not yet defined the entry point. We will work on it shortly but before we do let’s mention one important detail: writing bindings for C libraries manually is a long and error-prone task, that is why we are going to delegate it to an automatic toolkit called rust-bindgen. If you are interested in how this toolkit works I recommend checking out their documentation. From now, I will assume that we generated Rust FFI code for sqlite.h - it will be widely used in our project.

You will often see the no_mangle and repr attributes in the code, this is also due to FFI and the need to keep memory layout the same as in C.

Entry point and modules

We already discussed that SQLite will look for a function with a specific name in the library. Now we’ll talk about how this function looks and what we could do with it.

#[no_mangle]
pub unsafe extern "C" fn sqlite3_xlite_init(
   db: *mut sqlite3,
   pz_err_msg: *mut *mut c_char,
   p_api: *mut sqlite3_api_routines,
) -> c_int

The function takes a pointer to a database instance, a pointer to an error message and a pointer to so called api routines. The latter parameter is the most interesting to us: it contains routines that can be used to extend SQLite with new modules, functions and other arguably less common features such as collations and virtual file systems.

Before we can create and register our module we need to write some tricky code. Let me explain: there is a macro called SQLITE_EXTENSION_INIT2, and SQLite demands that we call it before initializing the extension. Doing this trivial in C but we can’t call the macro from Rust, so instead we are going to replicate its behavior.

First of all we will define a global variable to hold an sqlite3_api_routines instance with a specific name.

#[no_mangle]
static mut sqlite3_api: *mut sqlite3_api_routines = std::ptr::null_mut();

Then, before doing any other work this global variable must be assigned:

// inside sqlite3_xlite_init
sqlite3_api = p_api;

After this, we can use the global variable to access routines. It is a somewhat unusual piece of code for Rust but it has to be written in order for our extension to work properly.

Now, let’s actually create our module.

Remember the CSV extension and its syntax for defining virtual tables? We will adapt it for our extension and highlight the important bits.

CREATE VIRTUAL TABLE xlsx_data USING xlite(
    FILENAME='/path/to/file.xlsx'
);

The CREATE VIRTUAL TABLE syntax tells SQLite to use the module defined in the USING part of this statement. As we can see we named our module xlite. The next important piece of data is the FILENAME parameter, we’ll need it later. For now it is important to understand that virtual tables are created and managed by modules.

So let’s create and register our module. One way to do it would be defining a structure to encapsulate required data. Strictly speaking, doing this is optional for simple modules that do not carry additional data (such as ours), but in this case we would like to store the name of our module along with the structure for convenience.

#[repr(C)]
pub struct Module {
   // must be at the beginning
   base: sqlite3_module,
   name: &'static [u8],
}

It is important to keep the base sqlite3_module structure as the first field in our newly-defined structure: in that way SQLite won’t notice the difference when we pass the pointer to our struct instead of the base sqlite3_module struct. Also layout in the memory will be exactly the same except for the additional name field at the end. The same technique will be applied for sqlite3_vtab and sqlite3_vtab_cursor structs as we’ll see later.

Now we’ll need an instance of our module. Let’s create it. The simplest way to achieve this is to define a constant which is going to live as long as our extension is loaded to the memory.

pub const XLITE_MODULE: Module = Module {
   base: sqlite3_module {
       iVersion: 0,
       xCreate: Some(x_create),
       xConnect: Some(x_connect),
       xBestIndex: Some(x_best_index),
       xDisconnect: Some(x_disconnect),
       xDestroy: Some(x_destroy),
       xOpen: Some(x_open),
       xClose: Some(x_close),
       xFilter: Some(x_filter),
       xNext: Some(x_next),
       xEof: Some(x_eof),
       xColumn: Some(x_column),
       xRowid: Some(x_rowid),
       xUpdate: None,
       xBegin: None,
       xSync: None,
       xCommit: None,
       xRollback: None,
       xFindFunction: None,
       xRename: None,
       xSavepoint: None,
       xRelease: None,
       xRollbackTo: None,
       xShadowName: None,
   },
   name: b"xlite\0",
};

As you can see the module is an entity that encapsulates a set of function pointers. Let’s describe them:

Name Description
x_create called when SQLite asks our extension to create a virtual table. The function must execute routines to define a virtual table (described later in this article)
x_connect for our purposes is exactly the same as x_create (our extension will be readonly and won’t have a backing storage)
x_destroy called when a table is being dropped. This is the place where resources allocated during x_create should be deallocated
x_disconnect for our purposes is the same as x_destroy
x_open creates a cursor for reading data for a given virtual table
x_close closes a cursor previously opened by x_open
x_next advances a cursor forward to a next row
x_eof checks if a cursor advanced beyond the last row and cannot produce more rows
x_column called when SQLite asks for the data inside current row and column N
x_rowid provides a stable identifier for current row

There are also x_best_index and x_filter but our implementation will just provide empty stubs for these functions.

Now we have the module and it should be registered with the create_module routine.

let name = XLITE_MODULE.name;
((*p_api).create_module.unwrap())(
   db,
   name.as_ptr() as *const c_char,
   &XLITE_MODULE as *const Module as *const sqlite3_module,
   std::ptr::null_mut(),
);

In this code, we cast the module reference to a pointer and then cast it to sqlite3_module pointer. This is a valid code thanks to the “base struct is the first field in the derived struct” rule.

Virtual table

To connect together everything we learned so far and understand the lifecycle of a virtual table we will write a sequence of SQL statements and corresponding functions (using pseudocode). Let’s examine the lifecycle of a virtual table with 2 columns and 2 rows:

CREATE VIRTUAL TABLE ...
-- x_create() -> VirtualTable
SELECT * FROM ...
-- x_open(VirtualTable) -> VirtualCursor

-- x_eof(VirtualCursor) -> false
-- x_column(VirtualCursor, 1) -> [0,0]
-- x_column(VirtualCursor, 2) -> [0,1]
-- x_next(VirtualCursor)
-- x_eof(VirtualCursor) -> false
-- x_column(VirtualCursor, 1) -> [1,0]
-- x_column(VirtualCursor, 2) -> [1,1]
-- x_next(VirtualCursor)
-- x_eof(VirtualCursor) -> true

-- x_close(VirtualCursor)
DROP TABLE ...
-- x_destroy(VirtualTable)

This can be summarized as: a module creates a virtual table using arguments provided to the CREATE VIRTUAL TABLE statement, SELECT query tells the module to create a cursor for the table, this cursor is then used to read all data row by row requesting selected columns.

When it comes to the management of virtual tables these two functions must be implemented: x_create and x_destroy. Actually, there are also x_connect and x_disconnect but we won’t spend time implementing them, these functions are only used in advanced scenarios (if we had some kind of backing storage for our data that could be initialized only once and then reused in some way.)

We will define a struct to hold the data for our virtual table:

#[repr(C)]
pub struct VirtualTable {
   // must be at the beginning
   base: sqlite3_vtab,
   manager: Arc<Mutex<DataManager>>,
}

This should look familiar. The struct holds SQLite’s base struct called sqlite3_vtab as the first field, and then adds arbitrary data at the end (in this case the data is another struct from our domain model called DataManager).

Now we will look at the actual x_create method implementation.

#[no_mangle]
unsafe extern "C" fn x_create(
   db: *mut sqlite3,
   _p_aux: *mut c_void,
   argc: c_int,
   argv: *const *const c_char,
   pp_vtab: *mut *mut sqlite3_vtab,
   pz_err: *mut *mut c_char,
) -> c_int

Let’s see what important parameters are passed into this function.

A virtual table can accept arguments when it is created (such is the FILENAME argument). So argc is the number of such arguments and argv is an array of actual argument strings. This info will be used to parse and extract data from the passed options.

The next important argument is pp_vtab. We should assign the instance of our virtual table to this pointer, but before we do that SQLite demands that we declare a table using the global routines object we defined earlier:

let sql = String::new("CREATE TABLE sheet(A,B,C)");
let cstr = CString::new(sql).unwrap();
((*api).declare_vtab.unwrap())(db, cstr.as_ptr() as _)

This is a simplified version of what is actually done, the real implementation dynamically constructs a SQL statement depending on which columns are provided by an XLS file or by RANGE option. What is important here: we allocate a string with a CREATE TABLE statement that defines the structure of the table, then the string is turned into a pointer and passed to the declare_vtab routine.

Now we will instantiate the virtual table, and assign the resulting pointer:

let p_new: Box<VirtualTable> = Box::new(VirtualTable {
   base: sqlite3_vtab {
       pModule: std::ptr::null_mut(),
       nRef: 0,
       zErrMsg: std::ptr::null_mut(),
   },
   manager: Arc::new(Mutex::new(manager)),
});

*pp_vtab = Box::into_raw(p_new) as *mut sqlite3_vtab;

The code above allocates memory for the VirtualTable struct on the heap then leaks it using the into_raw function preventing automatic deallocation. Finally the result is casted to the sqlite3_vtab pointer.

This is it, we implemented everything that should be implemented inside the x_create function. Now it’s time to write the x_destroy function:

#[no_mangle]
unsafe extern "C" fn x_destroy(p_vtab: *mut sqlite3_vtab) -> c_int

As we can see the function takes a single pointer to sqlite3_vab. Looks simple, and the implementation is simple too:

let table = Box::from_raw(p_vtab as *mut VirtualTable);
drop(table);

We cast the pointer to a VirtualTable pointer, then call the from_raw function to construct a Box from the raw pointer. In a sense this is a step to reverse leaking of the allocated memory. Calling the drop function is not strictly necessary, and Rust would do this for us as a part of deallocating the box.

Cursor

The last important entity we will describe is the cursor. This entity is very similar to the concept of iterators. At least it is okay to think about them in this way when things are simple: there are no writes (only reads), no locks, no transactions and so on.

The cursor is defined in the following way:

#[repr(C)]
struct VirtualCursor {
   // must be at the beginning
   base: sqlite3_vtab_cursor,
   reader: Arc<Mutex<DataReader>>,
}

We will omit the definition of DataReader structure. For now, we can think about it as a wrapper around an XLS file that can do the dirty work of actually reading data from Excel cells.

As always, there must be a place where a struct is created. For the cursor that place is x_open:

#[no_mangle]
unsafe extern "C" fn x_open(
   p_vtab: *mut sqlite3_vtab,
   pp_cursor: *mut *mut sqlite3_vtab_cursor,
) -> c_int

From this definition we can see that this function must assign the pp_cursor pointer. At this point we already know that we can instantiate and cast our struct to a pointer of a base struct type and we do exactly that inside the body of this function.

Now let’s briefly examine other functions related to cursors.

#[no_mangle]
unsafe extern "C" fn x_close(p_cursor: *mut sqlite3_vtab_cursor) -> c_int

Earlier we did something similar to the x_close function. This function downcasts the pointer to the derived struct bringing data from unsafe space to the memory managed by Rust. Then the cursor is dropped.

#[no_mangle]
unsafe extern "C" fn x_next(p_cursor: *mut sqlite3_vtab_cursor) -> c_int

The next function is x_next (the pun is intended). It takes a cursor as an argument and mutates it by advancing the internal pointer to the next row.

#[no_mangle]
unsafe extern "C" fn x_eof(p_cursor: *mut sqlite3_vtab_cursor) -> c_int

This function is very simple. The x_eof checks if the cursor was moved beyond the last row in a data set and cannot be advanced further.

#[no_mangle]
unsafe extern "C" fn x_column(
    p_cursor: *mut sqlite3_vtab_cursor,
    p_context: *mut sqlite3_context,
    column: c_int,
) -> c_int

The x_column function is the most interesting function that deals with a cursor. Its job is to return data for a requested column at the current row. There are many routines that can be used to return data: result_text, result_int, result_double, result_null and result_blob. You probably already guessed that they are used depending on the results data type.

Some of these routines are more interesting than others: there are occasions when data has to be allocated at the heap. In such cases we also pass a destructor function that will be called when SQLite decides it does not need the value anymore. Let’s look at an example with result_text:

let cstr = CString::new(s.as_bytes()).unwrap();
let len = cstr.as_bytes().len();
let raw = cstr.into_raw();

unsafe extern "C" fn destructor(raw: *mut c_void) {
   drop(CString::from_raw(raw as *mut c_char));
}

((*api).result_text.unwrap())(
   p_context,
   raw,
   len as c_int,
   Some(destructor),
);

In this example a CString is used and it allocates memory on the heap. In order to deallocate it we also provide a destructor function.

Conclusion

In this article I tried to describe the most important steps for creating a SQLite extension in Rust.

Needless to say, this article is not a complete guide on extending SQLite. I tried to cover parts that appeared essential and interesting. It’s always a trade-off when you write technical articles: you want to cover more and in more detail but you have to set some boundaries otherwise the article will be infinite in size and complexity.

Anyway, the code from this article is taken from the open-source project, and I hope the reader can always go to the repository for more detail if there is need for it.

And, of course, bug-reports and other forms of feedback are always welcome.