Python list object representation
Introduction:
In the previous blog post we looked at the internal representation of Python string objects in memory.
In this article we'll delve into the C-level details and read the internals of another Python data type: the list.
NOTE: All examples in this post are specific to the CPython (Python 3.9) implementation, so there is no guarantee that these structures won't change in future releases. The results may also vary across platforms and their data models (LP64 or LLP64).
The PyListObject is defined in the cpython/Include/cpython/listobject.h file and has the following structure:
typedef struct {
    PyObject_VAR_HEAD
    /* Vector of pointers to list elements.  list[0] is ob_item[0], etc. */
    PyObject **ob_item;

    /* ob_item contains space for 'allocated' elements.  The number
     * currently in use is ob_size.
     * Invariants:
     *     0 <= ob_size <= allocated
     *     len(list) == ob_size
     *     ob_item == NULL implies ob_size == allocated == 0
     * list.sort() temporarily sets allocated to -1 to detect mutations.
     *
     * Items must normally not be NULL, except during construction when
     * the list is not yet visible outside the function that builds it.
     */
    Py_ssize_t allocated;
} PyListObject;
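The invariants listed in that comment can be observed from Python. The following is a minimal sketch, assuming a default (non free-threaded) CPython build where PyObject_VAR_HEAD expands to a reference count, a type pointer and ob_size, each pointer-sized. The names ListView and list_sizes are hypothetical helpers introduced only for this illustration.

```python
import ctypes


class ListView(ctypes.Structure):
    # Assumed layout: PyObject_VAR_HEAD followed by ob_item and allocated.
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),
        ("ob_type", ctypes.c_void_p),
        ("ob_size", ctypes.c_ssize_t),
        ("ob_item", ctypes.POINTER(ctypes.c_void_p)),
        ("allocated", ctypes.c_ssize_t),
    ]


def list_sizes(lst):
    """Return (ob_size, allocated) read from the list's memory."""
    view = ListView.from_address(id(lst))
    return view.ob_size, view.allocated


items = []
history = []
for i in range(10):
    items.append(i)
    # Record the element count and the reserved capacity after each append.
    history.append(list_sizes(items))

for size, allocated in history:
    print(f"ob_size={size} allocated={allocated}")
```

On a typical run allocated grows in steps rather than on every append, because CPython reserves spare slots when it resizes ob_item. The exact step sizes are an implementation detail and differ between versions.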
The ob_item structure member holds pointers to the elements of the list, and we'll use the ctypes module to re-create the structs in Python:
test.py
#!/usr/bin/env python3
import ctypes
import sys

lst = ["red", "blue", "green"]


class PyListObject(ctypes.Structure):
    # Py_ssize_t maps to c_ssize_t on both LP64 and LLP64 platforms,
    # whereas c_long is only 32 bits wide on LLP64.
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),
        ("ob_type", ctypes.c_void_p),
        ("ob_size", ctypes.c_ssize_t),
        ("ob_item", ctypes.POINTER(ctypes.c_void_p)),
        ("allocated", ctypes.c_ssize_t),
    ]


class PyUnicodeObject(ctypes.Structure):
    # Leading fields of a compact ASCII string in CPython 3.9.
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),
        ("ob_type", ctypes.c_void_p),
        ("length", ctypes.c_ssize_t),
        ("hash", ctypes.c_ssize_t),
        ("interned", ctypes.c_uint, 2),
        ("kind", ctypes.c_uint, 3),
        ("compact", ctypes.c_uint, 1),
        ("ascii", ctypes.c_uint, 1),
        ("ready", ctypes.c_uint, 1),
    ]


def main():
    pylist_obj = PyListObject.from_address(id(lst))
    for idx, s in enumerate(lst):
        # Each ob_item slot holds the address of the element object.
        addr = pylist_obj.ob_item[idx]
        pyunicode_obj = PyUnicodeObject.from_address(addr)
        s_mem = ctypes.string_at(addr, sys.getsizeof(s))
        print(
            # For ASCII strings the characters and a trailing NUL byte
            # sit at the very end of the object.
            f"{s_mem[-len(s) - 1 : -1].decode()}",
            f"length: {pyunicode_obj.length}",
            f"hash: {pyunicode_obj.hash} ({hash(s)})",
        )


if __name__ == "__main__":
    main()
So, in the example above we read the string data directly from the underlying PyObject structures:
$ python test.py
red length: 3 hash: -2831683529608114332 (-2831683529608114332)
blue length: 4 hash: -1701992703184244183 (-1701992703184244183)
green length: 5 hash: -1362686309841740019 (-1362686309841740019)
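Since ob_item stores pointers rather than copies, each slot should hold exactly the value that id() reports for the corresponding element. The sketch below checks this under the same assumptions as the rest of the post (a default CPython build with the struct layout shown earlier). ListView and element_addresses are hypothetical names used only here.

```python
import ctypes


class ListView(ctypes.Structure):
    # Assumed layout of PyListObject on a default CPython build.
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),
        ("ob_type", ctypes.c_void_p),
        ("ob_size", ctypes.c_ssize_t),
        ("ob_item", ctypes.POINTER(ctypes.c_void_p)),
        ("allocated", ctypes.c_ssize_t),
    ]


def element_addresses(lst):
    """Return the raw addresses stored in the list's ob_item vector."""
    view = ListView.from_address(id(lst))
    return [view.ob_item[i] for i in range(view.ob_size)]


colors = ["red", "blue", "green"]
addresses = element_addresses(colors)

# Turn each address back into a Python object reference.
recovered = [ctypes.cast(addr, ctypes.py_object).value for addr in addresses]

for addr, obj in zip(addresses, colors):
    print(hex(addr), hex(id(obj)), obj)
```

In CPython id() returns the object's memory address, so the two hexadecimal columns are expected to match, and the recovered objects are the very same objects the list refers to.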