Python: the old bug in SQLite module strikes back

Introduction:⌗

NOTE: The problem is related to Python 2 but I’ll be referring to Python 3 documentation unless it’s not applicable.

I was recently writing some Python code that was using the SQLite module . Unfortunately it was still running on Python 2.7.9 and as of January 1st, 2020, Python 2 has officially reached the end of life but let’s focus on the actual problem.

The code was pulling data from different sources in chunks, saved the records into the local database and then data was sent to an API endpoint at each iteration. The logic was something like this:

with DB() as db:
    db.init()

    it = iter(data)
    chunk = tuple(islice(it, 0, 1000))
    while chunk:
        db.cur.add(chunk)
        # Make the HTTP request or raise an exception on error
        db.conn.commit() # <- THIS COMMIT SEEMS TO BE THE CULPRIT
        chunk = tuple(islice(it, 0, 1000))

Variables:⌗

CentOS 7.8.2003
Python 2.7.9
SQLite 3.7.17
pyenv 1.2.14
GDB 7.6.1

Problem:⌗

This is a simplified test case to reproduce the issue:

import sqlite3


def main():
    conn = sqlite3.connect(':memory:')
    cur = conn.cursor()
    
    sql = '''
        CREATE TABLE IF NOT EXISTS t1 (
          char TEXT NOT NULL
        )
    '''
    
    cur.executescript(sql)
    cur.execute('INSERT INTO t1 VALUES (?)', ('a',))
    cur.execute('SELECT char FROM t1 WHERE char = ?', ('a',))
    conn.commit()
    print(cur.fetchone())
    cur.execute('SELECT char FROM t1 WHERE char = ?', ('b',))


if __name__ == '__main__':
    main()

The code was working fine when I removed the commit statement otherwise it was throwing the following exception:

$ python test.py
(u'a',)
Traceback (most recent call last):
  File "test.py", line 25, in <module>
    main()
  File "test.py", line 21, in main
    cur.execute('SELECT char FROM t1 WHERE char = ?', ('b',))
sqlite3.InterfaceError: Error binding parameter 0 - probably unsupported type.
[22472 refs]

Installing the required tools:⌗

NOTE: I’m using pyenv to manage multiple versions of Python so all path locations are referring to my pyenv installation.

We’ll install Python 2.7.9 using pyenv:

$ pyenv install -f -k -v -g -p 2.7.9

The SQLite module is written in C so we’ll use gdb (GNU Debugger) for debugging. We’ll also need to install debugging symbols for Python and SQLite:

$ sudo yum install python-debuginfo http://debuginfo.centos.org/7/x86_64/glibc-debuginfo-2.17-307.el7.1.x86_64.rpm http://debuginfo.centos.org/7/x86_64/sqlite-debuginfo-3.7.17-8.el7_7.1.x86_64.rpm

I had to pull some pkgs directly from the debuginfo repo because my system was using glibc-2.17-307.el7.1 and sqlite-3.7.17-8.el7_7.1 but the equivalent debuginfo pkgs were not available:

$ sudo yum -q list --showduplicates --enablerepo=base-debuginfo glibc-debuginfo sqlite-debuginfo
Installed Packages
glibc-debuginfo.x86_64                                                              2.17-307.el7.1                                                                @/glibc-debuginfo-2.17-307.el7.1.x86_64   
sqlite-debuginfo.x86_64                                                             3.7.17-8.el7_7.1                                                              @/sqlite-debuginfo-3.7.17-8.el7_7.1.x86_64
Available Packages
...                            
glibc-debuginfo.x86_64                                                              2.17-292.el7                                                                  base-debuginfo                            
...           
sqlite-debuginfo.x86_64                                                             3.7.17-8.el7                                                                  base-debuginfo

(A quick look into the repodata directory suggests that metadata wasn’t rebuilt for a while)

Add the following line into ~/.gdbinit file to add some Python debugging tools for gdb (py-bt, py-bt etc.) :

add-auto-load-safe-path ~/.pyenv/versions/2.7.9-debug/bin/python2.7-gdb.py

It’s time for some debugging, the -k flag instructs pyenv to keep the source tree, let’s find the relevant error message from the exception:

$ grep -R "Error binding parameter" ~/.pyenv/sources/2.7.9-debug/Python-2.7.9/Modules/_sqlite/
/home/test/.pyenv/sources/2.7.9-debug/Python-2.7.9/Modules/_sqlite/statement.c:                    PyErr_Format(pysqlite_InterfaceError, "Error binding parameter %d - probably unsupported type.", i);
/home/test/.pyenv/sources/2.7.9-debug/Python-2.7.9/Modules/_sqlite/statement.c:                    PyErr_Format(pysqlite_InterfaceError, "Error binding parameter :%s - probably unsupported type.", binding_name);

So after we checked statement.c file, it looks like both errors are located in pysqlite_statement_bind_parameters function that has the following signature:

void pysqlite_statement_bind_parameters(pysqlite_Statement* self, PyObject* parameters, int allow_8bit_chars)

Let’s start gdb session and create a breakpoint to pause the execution at this location. The exception happens when we try to do select after the commit statement (SELECT char FROM t1 WHERE char = 'b') so we’ll add a conditional breakpoint.

We also know that “everything is an object in Python” so we need to use some C functions from the Python/C API :

break pysqlite_statement_bind_parameters if PyTuple_Size(parameters) == 1 && $_streq(PyString_AsString(PyTuple_GetItem(parameters, 0)), "b")

The condition says “pause” if parameters size is 1 and if that’s the case, we use $_streq convenience function to check if the first parameter is “b”.

This is equivalent to the following Python code:

>>> parameters = ('b',)
>>> len(parameters) and parameters[0] == 'b'
True

GDB session:⌗

$ gdb -q ~/.pyenv/versions/2.7.9-debug/bin/python
Reading symbols from /home/test/.pyenv/versions/2.7.9-debug/bin/python2.7...done.
(gdb) break pysqlite_statement_bind_parameters if PyTuple_Size(parameters) == 1 && $_streq(PyString_AsString(PyTuple_GetItem(parameters, 0)), "b")
Function "pysqlite_statement_bind_parameters" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y

Breakpoint 1 (pysqlite_statement_bind_parameters if PyTuple_Size(parameters) == 1 && $_streq(PyString_AsString(PyTuple_GetItem(parameters, 0)), "b")) pending.

We’ve set the breakpoint so it’s time to start our script:

(gdb) run test.py
Starting program: /home/test/.pyenv/versions/2.7.9-debug/bin/python test.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
(u'a',)

Breakpoint 1, pysqlite_statement_bind_parameters (self=0x7fffef7be2b0, parameters=('b',), allow_8bit_chars=0) at /home/test/.pyenv/sources/2.7.9-debug/Python-2.7.9/Modules/_sqlite/statement.c:226
226	    Py_BEGIN_ALLOW_THREADS

We’ve hit our breakpoint so we can take a look around:

(gdb) whatis self
type = pysqlite_Statement *
(gdb) ptype self
type = struct {
    struct _object *_ob_next;
    struct _object *_ob_prev;
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
    sqlite3 *db;
    sqlite3_stmt *st;
    PyObject *sql;
    int in_use;
    PyObject *in_weakreflist;
} *
(gdb) p self->sql
$1 = 'SELECT char FROM t1 WHERE char = ?'

The self argument is a pysqlite_Statement * structure and the sql member is our select statement.

Let’s hit next (a few times) to proceed to the next line. We use a tuple as a parameter so we go into this branch in the if statement and finally we see some interesting variable:

(gdb) n
227	    num_params_needed = sqlite3_bind_parameter_count(self->st);
(gdb) 
228	    Py_END_ALLOW_THREADS
(gdb) 
230	    if (PyTuple_CheckExact(parameters) || PyList_CheckExact(parameters) || (!PyDict_Check(parameters) && PySequence_Check(parameters))) {
(gdb) 
232	        if (PyTuple_CheckExact(parameters)) {
(gdb) 
233	            num_params = PyTuple_GET_SIZE(parameters);
...
270	            rc = pysqlite_statement_bind_parameter(self, i + 1, adapted, allow_8bit_chars);
(gdb) 
271	            Py_DECREF(adapted);
(gdb) p rc
$2 = 21

The pysqlite_statement_bind_parameter return code was 21 and according to SQLite docs this doesn’t look good:

(21) SQLITE_MISUSE

The SQLITE_MISUSE return code might be returned if the application uses any SQLite interface in a way that is undefined or unsupported. For example, using a prepared statement after that prepared statement has been finalized might result in an SQLITE_MISUSE error. …

We remember that the exception actually happens around the commit so let’s change our focus on that. The pysqlite_connection_commit subroutine is located in connection.c:

PyObject* pysqlite_connection_commit(pysqlite_Connection* self, PyObject* args)

We disable our first breakpoint, set a new one and re-start our script:

(gdb) disable 1
(gdb) break pysqlite_connection_commit
Breakpoint 2 at 0x7fffefabc76a: file /home/test/.pyenv/sources/2.7.9-debug/Python-2.7.9/Modules/_sqlite/connection.c, line 457.
(gdb) run test.py
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /home/test/.pyenv/versions/2.7.9-debug/bin/python test.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Breakpoint 2, pysqlite_connection_commit (self=0x7ffff7e5bd60, args=0x0) at /home/test/.pyenv/sources/2.7.9-debug/Python-2.7.9/Modules/_sqlite/connection.c:457
457	    if (!pysqlite_check_thread(self) || !pysqlite_check_connection(self)) {
(gdb) n
461	    if (self->inTransaction) {
(gdb) p self->inTransaction
$4 = 0
(gdb) p self->statements
$5 = []

We’re not in transaction and there are no statements so that must be our call to sqlite3.Cursor.executescript as it will run COMMIT first:

This is a nonstandard convenience method for executing multiple SQL statements at once. It issues a COMMIT statement first, then executes the SQL script it gets as a parameter.

Let’s continue until we hit our breakpoint again. This time we’re in transaction and we can see our statements. We also use again some Python Python/C API functions to get the data we want:

(gdb) cont
Continuing.

Breakpoint 4, pysqlite_connection_commit (self=0x7ffff7e5bd60, args=0x0) at /home/test/.pyenv/sources/2.7.9-debug/Python-2.7.9/Modules/_sqlite/connection.c:457
457	    if (!pysqlite_check_thread(self) || !pysqlite_check_connection(self)) {
(gdb) n
461	    if (self->inTransaction) {
(gdb) p self->inTransaction 
$6 = 1
(gdb) whatis self 
type = pysqlite_Connection *
(gdb) ptype self
type = struct {
    ...
    PyObject *statements;
    PyObject *cursors;
    int created_statements;
    int created_cursors;
    ...
} *
(gdb) p self->statements 
$7 = [<weakref at remote 0x7ffff095c528>, <weakref at remote 0x7ffff095c5b0>]
(gdb) p ((pysqlite_Statement *) PyWeakref_GetObject(PyList_GetItem(self->statements, 0)))->sql
$8 = 'INSERT INTO t1 VALUES (?)'
(gdb) p ((pysqlite_Statement *) PyWeakref_GetObject(PyList_GetItem(self->statements, 1)))->sql
$9 = 'SELECT char FROM t1 WHERE char = ?'

After that we see another function call pysqlite_do_all_statements that apparently is trying to “reset” our statements, let’s step into that function to take a look:

(gdb) n
462	        pysqlite_do_all_statements(self, ACTION_RESET, 0);
(gdb) s
pysqlite_do_all_statements (self=0x7ffff7e5bd60, action=2, reset_cursors=0) at /home/test/.pyenv/sources/2.7.9-debug/Python-2.7.9/Modules/_sqlite/connection.c:245
245	    for (i = 0; i < PyList_Size(self->statements); i++) {
(gdb) p action
$24 = 2
(gdb) n
246	        weakref = PyList_GetItem(self->statements, i);
(gdb) 
247	        statement = PyWeakref_GetObject(weakref);
(gdb) 
248	        if (statement != Py_None) {
(gdb) 
249	            if (action == ACTION_RESET) {
(gdb) 
250	                (void)pysqlite_statement_reset((pysqlite_Statement*)statement);
(gdb) p ((pysqlite_Statement *) statement)->sql 
$25 = 'INSERT INTO t1 VALUES (?)'

So it looks like it loops over our statements and use the same Python/C API functions like we did to read them. The first one is our insert so we can skip this iteration of the loop with until:

(gdb) u
245	    for (i = 0; i < PyList_Size(self->statements); i++) {
(gdb) n
246	        weakref = PyList_GetItem(self->statements, i);
(gdb) 
247	        statement = PyWeakref_GetObject(weakref);
(gdb) 
248	        if (statement != Py_None) {
(gdb) 
249	            if (action == ACTION_RESET) {
(gdb) 
250	                (void)pysqlite_statement_reset((pysqlite_Statement*)statement);
(gdb) p ((pysqlite_Statement *) statement)->sql 
$26 = 'SELECT char FROM t1 WHERE char = ?'

Here is our select query, this time we’ll step into the pysqlite_statement_reset function call:

(gdb) s
pysqlite_statement_reset (self=0x7fffef7be2b0) at /home/test/.pyenv/sources/2.7.9-debug/Python-2.7.9/Modules/_sqlite/statement.c:392
392	    rc = SQLITE_OK;
(gdb) n
394	    if (self->in_use && self->st) {
(gdb) p self->in_use 
$27 = 1
(gdb) n
395	        Py_BEGIN_ALLOW_THREADS
(gdb) 
396	        rc = sqlite3_reset(self->st);
(gdb) 
397	        Py_END_ALLOW_THREADS
gdb) 
399	        if (rc == SQLITE_OK) {
(gdb) 
400	            self->in_use = 0;
(gdb) p self->in_use 
$28 = 1
(gdb) n
404	    return rc;
(gdb) p self->in_use 
$29 = 0

The code invokes sqlite3_reset function to reset our statement and if that was successful it changes self->in_use from 1 to 0.

We can see that the life-cycle of a prepared statement object usually goes like this:

Create the prepared statement object using sqlite3_prepare_v2() .

Bind values to parameters using the sqlite3_bind_*() interfaces.

Run the SQL by calling sqlite3_step() one or more times.

Reset the prepared statement using sqlite3_reset() then go back to step 2. Do this zero or more times.

Destroy the object using sqlite3_finalize() .

Let’s test quickly what will happen if we skip this step:

(gdb) del
Delete all breakpoints? (y or n) y
(gdb) break pysqlite_connection_commit
Breakpoint 5 at 0x7fffefabc76a: file /home/test/.pyenv/sources/2.7.9-debug/Python-2.7.9/Modules/_sqlite/connection.c, line 457.
(gdb) run test.py
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /home/test/.pyenv/versions/2.7.9-debug/bin/python test.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Breakpoint 5, pysqlite_connection_commit (self=0x7ffff7e5bd60, args=0x0) at /home/test/.pyenv/sources/2.7.9-debug/Python-2.7.9/Modules/_sqlite/connection.c:457
457	    if (!pysqlite_check_thread(self) || !pysqlite_check_connection(self)) {
(gdb) cont
Continuing.

Breakpoint 5, pysqlite_connection_commit (self=0x7ffff7e5bd60, args=0x0) at /home/test/.pyenv/sources/2.7.9-debug/Python-2.7.9/Modules/_sqlite/connection.c:457
457	    if (!pysqlite_check_thread(self) || !pysqlite_check_connection(self)) {
(gdb) n
461	    if (self->inTransaction) {
(gdb) p self->inTransaction
$47 = 1
(gdb) set self->inTransaction = 0
(gdb) p self->inTransaction
$48 = 0
(gdb) cont
Continuing.
(u'a',)
[22472 refs]
[Inferior 1 (process 29338) exited normally]

We’ve set the self->inTransaction flag to 0 so the pysqlite_do_all_statements function wasn’t invoked and the program completed successfully.

This sounds like we shouldn’t reset the statements in commits, especially change their states.

At this point I’ve decided to check for any changes around that part of the code and bingo! The issue was reported in #10513 and the fix was committed in #dc60c75 (it seems to be fixed in Python 2.7.13).

The patch below can be easily applied with pyenv:

$ cat ~/issue10513.patch 
--- Modules/_sqlite/connection.c	2014-12-10 15:59:53.000000000 +0000
+++ connection.c	2020-05-15 14:23:36.232892608 +0100
@@ -459,7 +459,6 @@
     }
 
     if (self->inTransaction) {
-        pysqlite_do_all_statements(self, ACTION_RESET, 0);
 
         Py_BEGIN_ALLOW_THREADS
         rc = sqlite3_prepare(self->db, "COMMIT", -1, &statement, &tail);

$ pyenv install -f -k -v -g -p 2.7.9 < ~/issue10513.patch
~/.pyenv/sources/2.7.9-debug ~
~/.pyenv/sources/2.7.9-debug/Python-2.7.9 ~/.pyenv/sources/2.7.9-debug ~
Installing Python-2.7.9...
patching file Modules/_sqlite/connection.c
Hunk #1 succeeded at 458 (offset -1 lines).
...

$ pyenv local 2.7.9-debug
$ python --version
Python 2.7.9
$ python test.py
(u'a',)
[25767 refs]