.. py:currentmodule:: oxidized_importer
.. _python_packed_resources:
This project has defined a custom data format for storing resources useful to the execution of a Python interpreter. We call this data format Python packed resources.
A producer collects the resources required by a Python interpreter. These resources include Python module source and bytecode, non-module resource/data files, extension modules, and shared libraries. Metadata about these resources, and sometimes the raw resource data itself, is serialized to a binary data structure.
At Python interpreter run time, an instance of the :py:class:`OxidizedFinder`
meta path finder parses this data structure and uses it to power Python
module importing.
This functionality is similar to using a .zip file for holding
Python modules. However, the Python packed resources data structure
is far more advanced.
The canonical implementation of the writer and parser of this data
structure lives in the python-packed-resources Rust crate. The
canonical home of this crate is
https://github.com/indygreg/PyOxidizer/tree/main/python-packed-resources.
This crate is published to crates.io at https://crates.io/crates/python-packed-resources.
The oxidized_importer Rust crate / Python extension defines the
:py:class:`OxidizedFinder` Python class for using this data structure
to power importing. That extension also exposes APIs to interact with
instances of the data structure.
The data structure is logically an iterable of resources.
A resource is a sparse collection of attributes or fields.
Each attribute describes behavior of the resource or defines data for that resource. For example, there are attributes that denote the type of a resource. A Python module resource might have an attribute holding its Python source code or bytecode.
In Rust speak, a resource is a struct and attributes are fields
in that struct. Many fields are Option<T> because they are
optional and not always defined.
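The serialization format consists of a magic header, a global header describing the blob and resources indices, the blob index, the resources index, and the blob data, in that order. Each is described below.
All integers are little-endian.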
The first 8 bytes of the data structure are a magic header identifying
the content as our data structure and the version of it. The first
7 bytes are pyembed and the following 1 byte denotes a version.
Semantics of each version are denoted in sections below.
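As a minimal sketch of the magic header check described above, the following Python splits the 8 bytes into the 7-byte identifier and the version byte. The function name ``parse_magic`` and the error messages are hypothetical and are not part of any published API.

```python
MAGIC_PREFIX = b"pyembed"  # the 7 identifying bytes described above


def parse_magic(data: bytes) -> int:
    """Return the format version stored in the 8-byte magic header.

    Hypothetical helper for illustration only.
    """
    if len(data) < 8:
        raise ValueError("data is too short to hold the magic header")
    if data[:7] != MAGIC_PREFIX:
        raise ValueError("data does not start with the pyembed magic")
    # Indexing bytes yields an int, so this is the version number.
    return data[7]
```

A reader would typically call this first and use the returned version to decide which field types it is prepared to see.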
The first 13 bytes after the magic header describe the blob and resource indices as follows:

* A u8 denoting the number of blob sections, blob_sections_count.
* A u32 denoting the length of the blob index, blob_index_length.
* A u32 denoting the total number of resources in this data, resources_count.
* A u32 denoting the length of the resources index, resources_index_length.

Following the global header is the blob index, which describes the blob sections present later in the data structure.
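The 13-byte global header (one u8 plus three u32 values, all little-endian) can be unpacked with Python's ``struct`` module. This is only a sketch: the function name ``parse_global_header`` and the default offset of 8 (directly after the magic header) are assumptions made for illustration.

```python
import struct

# "<" selects little-endian with no alignment padding:
# u8 + u32 + u32 + u32 = 13 bytes.
GLOBAL_HEADER = struct.Struct("<BIII")


def parse_global_header(data: bytes, offset: int = 8) -> dict:
    """Unpack the global header that follows the 8-byte magic header.

    Hypothetical helper; the key names mirror the field names in the text.
    """
    (
        blob_sections_count,
        blob_index_length,
        resources_count,
        resources_index_length,
    ) = GLOBAL_HEADER.unpack_from(data, offset)
    return {
        "blob_sections_count": blob_sections_count,
        "blob_index_length": blob_index_length,
        "resources_count": resources_count,
        "resources_index_length": resources_index_length,
    }
```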
Each entry in the blob index logically consists of a set of fields defining
metadata about each blob section. This is encoded by a start of entry
u8 marker followed by N u8 field type values and their corresponding
metadata, followed by an end of entry u8 marker.
The blob index is terminated by an end of index u8 marker.
The total number of bytes in the blob index including the end of index
marker should be blob_index_length.
The blob index allows attributing a sparse set of metadata with every blob
section entry. The type of metadata being conveyed is defined by a u8.
Some field types have additional metadata following that field.
The various field types and their semantics follow.
0x00
End of index. This field indicates that there are no more blob
index entries and we've reached the end of the blob index.
0x01
Start of blob section entry. Encountering this value signals the
beginning of a new blob section. From a specification standpoint, this isn't
strictly required, but it helps a parser verify its state.
0xff
End of blob section entry. Encountering this value signals the end
of the current blob section definition. The next encountered u8 in the
index should be 0x01 to denote a new entry or 0x00 to denote end of
index.
0x02
Resource field type. This field defines which resource field this
blob section is holding data for. A u8 following this one will contain
the resource field type value (see section below).
0x03
Raw payload length. This field defines the raw length in bytes of
the blob section in the payload. The u64 containing that length will
immediately follow this u8.
0x04
Interior padding mechanism. This field defines interior padding
between elements in the blob section. Following this u8 is another u8
denoting the padding mechanism.
0x01 indicates no padding.
0x02 indicates NULL padding (a 0x00 between elements).
If not present, no padding is assumed. If the payload data logically consists of discrete resources (e.g. Python package resource files), then padding applies to these sub-elements as well.
For example, a blob index byte sequence of
0x01 0x02 0x03 0x03 0x0000000000000042 0x04 0x01 0xff 0x00 would be decoded as:

* 0x01 - Start of blob section entry.
* 0x02 0x03 - Resource field type definition (0x02) for field 0x03.
* 0x03 0x0000000000000042 - Blob section length (0x03) of 0x42 bytes long.
* 0x04 0x01 - Interior padding in blob section (0x04) is defined as no padding (0x01).
* 0xff - End of blob section entry.
* 0x00 - End of index.

Following the blob index is the resources index.
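To make the blob index decoding above concrete, here is a minimal Python sketch of a parser for the blob index field types 0x00 to 0x04 and 0xff. The function name ``parse_blob_index`` and the dictionary keys are hypothetical. The sketch assumes the u64 length is stored little-endian, as stated for all integers in this format.

```python
import struct


def parse_blob_index(index: bytes) -> list:
    """Decode a blob index into one dict per blob section entry.

    Hypothetical helper for illustration; not the crate's API.
    """
    sections = []
    current = None  # dict for the entry being parsed, None between entries
    pos = 0
    while pos < len(index):
        field = index[pos]
        pos += 1
        if field == 0x00:  # end of index
            if current is not None:
                raise ValueError("end of index marker inside an entry")
            return sections
        if field == 0x01:  # start of blob section entry
            if current is not None:
                raise ValueError("nested start of entry marker")
            current = {"padding": 0x01}  # no padding is assumed when absent
        elif current is None:
            raise ValueError("field found outside of an entry")
        elif field == 0xFF:  # end of blob section entry
            sections.append(current)
            current = None
        elif field == 0x02:  # resource field type, followed by a u8
            current["resource_field"] = index[pos]
            pos += 1
        elif field == 0x03:  # raw payload length, followed by a u64
            current["raw_length"] = struct.unpack_from("<Q", index, pos)[0]
            pos += 8
        elif field == 0x04:  # interior padding mechanism, followed by a u8
            current["padding"] = index[pos]
            pos += 1
        else:
            raise ValueError(f"unknown blob index field type {field:#04x}")
    raise ValueError("blob index is missing its end of index marker")
```

Feeding it the example sequence (with the length encoded as 8 little-endian bytes) yields a single entry for resource field 0x03 with a raw length of 0x42 and no padding.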
Each entry in this index defines a sparse set of metadata describing a single resource.
Entries are composed of a series of u8 identifying pieces of metadata,
followed by field-specific supplementary descriptions.
The u8 field types and their behavior/payloads are as follows:
0x00
End of index. Special type to denote the end of an index.
0x01
Start of resource entry. Signals the beginning of a new resource. From
a specification standpoint this isn't strictly required, but it helps a
parser verify its state.
0x02
Previously held the resource flavor. This field is deprecated in version 2
in favor of the individual fields expressing presence of a resource type.
(See fields starting at 0x16.)
0xff
End of resource entry. The next encountered u8 in the index should
be an end of index or start of resource marker.
0x03
Resource name. A u16 denoting the length in bytes of the resource name
immediately follows this byte. The resource name must be valid UTF-8.
0x04
Package flag. If encountered, the resource is identified as a Python
package.
0x05
Namespace package flag. If encountered, the resource is identified as
a Python namespace package.
0x06
In-memory Python module source code. A u32 denoting the length in
bytes of the module's source code immediately follows this byte.
0x07
In-memory Python module bytecode. A u32 denoting the length in bytes
of the module's bytecode immediately follows this byte.
0x08
In-memory Python module optimized level 1 bytecode. A u32 denoting the
length in bytes of the module's optimization level 1 bytecode immediately
follows this byte.
0x09
In-memory Python module optimized level 2 bytecode. Same as previous,
except for bytecode optimization level 2.
0x0a
In-memory Python extension module shared library. A u32 denoting the
length in bytes of the extension module's machine code immediately follows
this byte.
0x0b
In-memory Python resources data. If encountered, the module/package
contains non-module resource files. The number of resources is contained in
a u32 that immediately follows. Following this u32 is an array of
(u16, u64) denoting the resource name and payload size for each resource
in this package.
0x0c
In-memory Python distribution resource. Defines resources accessed from
importlib.metadata APIs. If encountered, the module/package contains
distribution metadata describing the package. The number of files being
described is contained in a u32 that immediately follows this byte.
Following this u32 is an array of (u16, u64) denoting the
distribution file name and payload size for each virtual file in this
distribution.
0x0d
In-memory shared library. If set, this resource is a shared
library and not a Python module. The resource name field is the name of
this shared library, with file extension (as it would appear in a dynamic
binary's loader metadata to indicate a library dependency). A u64
denoting the length in bytes of the shared library data follows. This
shared library should be loaded from memory.
0x0e
Shared library dependency names. This field indicates the names
of shared libraries that this entity depends on. The number of library names
is contained in a u16 that immediately follows this byte. Following this
u16 is an array of u16 denoting the length of the library name for
each shared library dependency. Each described shared library dependency
may or may not be described by other entries in this data structure.
0x0f
Relative filesystem path to Python module source code. A u32 holding
the length in bytes of a filesystem path encoded in the platform-native file
path encoding follows. The source code for a Python module will be read from
a file at this path.
0x10
Relative filesystem path to Python module bytecode. Similar to the
previous except the filesystem path holds Python module bytecode.
0x11
Relative filesystem path to Python module bytecode at optimization
level 1. Similar to the previous except for what is being pointed to.
0x12
Relative filesystem path to Python module bytecode at optimization
level 2. Similar to the previous except for what is being pointed to.
0x13
Relative filesystem path to Python extension module shared library.
Similar to the previous except the file holds a Python extension module
loadable as a shared library.
0x14
Relative filesystem path to Python package resources. The number of
resources is contained in a u32 that immediately follows. Following
this u32 is an array of (u16, u32) denoting the resource name and
filesystem path to each resource in this package.
0x15
Relative filesystem path to Python distribution resources.
Defines resources accessed from importlib.metadata APIs. If encountered,
the module/package contains distribution metadata describing the package.
The number of files being described is contained in a u32 that
immediately follows this byte. Following this u32 is an array of
(u16, u32) denoting the distribution file name and filesystem path to
that distribution file.
0x16
Is Python module flag. If set, this resource contains data for
an importable Python module or package. Resource data is associated with
Python packages and is covered by this type.
0x17
Is builtin extension module flag. This type represents a Python
extension module that is built in (compiled into) the interpreter itself
or is otherwise made available to the interpreter via PyImport_Inittab
such that it should be imported with the builtin importer.
0x18
Is frozen Python module flag. This type represents a Python module
whose bytecode is frozen and made available to the Python interpreter
via the PyImport_FrozenModules array and should be imported with the
frozen importer.
0x19
Is Python extension flag. This type represents a compiled Python
extension. Extensions have specific requirements around how they are to be
loaded and are differentiated from regular Python modules.
0x1a
Is shared library flag. This type represents a shared library
that can be loaded into a process.
0x1b
Is UTF-8 filename data flag. This type represents an arbitrary filename.
The resource name is a UTF-8 encoded filename of the file this resource
represents. The file's data is either embedded in memory or referred to
via a relative path reference.
0x1c
File data is executable flag.
If set, the arbitrary file this resource tracks should be marked as executable.
0x1d
Embedded file data.
If present, the resource should be a file resource and this field holds its raw file data in memory.
A u64 containing the length of the embedded data follows this field.
0x1e
UTF-8 relative path file data.
If present, the resource should be a file resource and this field defines the relative path containing that file's data. The relative path filename is UTF-8 encoded.
A u32 denoting the length of the UTF-8 relative path (in bytes) follows.
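The field types above can be grouped into flags (no payload in the index) and fields followed by a single length integer. The sketch below parses a resources index using only those two groups. All names are hypothetical. The deprecated field 0x02 and the array-carrying fields (0x0b, 0x0c, 0x0e, 0x14, 0x15) are deliberately left unhandled, so this is an illustration of the layout rather than a complete parser.

```python
import struct

# Flag fields: presence alone conveys the information.
FLAG_FIELDS = {
    0x04: "is_package", 0x05: "is_namespace_package",
    0x16: "is_module", 0x17: "is_builtin_extension",
    0x18: "is_frozen_module", 0x19: "is_extension",
    0x1A: "is_shared_library", 0x1B: "is_utf8_filename",
    0x1C: "file_executable",
}

# Fields followed by one little-endian length: (key name, struct format).
LENGTH_FIELDS = {
    0x03: ("name", "<H"),
    0x06: ("in_memory_source", "<I"),
    0x07: ("in_memory_bytecode", "<I"),
    0x08: ("in_memory_bytecode_opt1", "<I"),
    0x09: ("in_memory_bytecode_opt2", "<I"),
    0x0A: ("in_memory_extension_module", "<I"),
    0x0D: ("in_memory_shared_library", "<Q"),
    0x0F: ("path_source", "<I"),
    0x10: ("path_bytecode", "<I"),
    0x11: ("path_bytecode_opt1", "<I"),
    0x12: ("path_bytecode_opt2", "<I"),
    0x13: ("path_extension_module", "<I"),
    0x1D: ("file_data_embedded", "<Q"),
    0x1E: ("file_data_utf8_path", "<I"),
}


def parse_resources_index(index: bytes) -> list:
    """Decode flags and per-field lengths for each resource entry."""
    resources, current, pos = [], None, 0
    while pos < len(index):
        field = index[pos]
        pos += 1
        if field == 0x00:  # end of index
            if current is not None:
                raise ValueError("end of index marker inside an entry")
            return resources
        if field == 0x01:  # start of resource entry
            if current is not None:
                raise ValueError("nested start of entry marker")
            current = {"flags": set(), "lengths": {}}
        elif current is None:
            raise ValueError("field found outside of an entry")
        elif field == 0xFF:  # end of resource entry
            resources.append(current)
            current = None
        elif field in FLAG_FIELDS:
            current["flags"].add(FLAG_FIELDS[field])
        elif field in LENGTH_FIELDS:
            name, fmt = LENGTH_FIELDS[field]
            current["lengths"][name] = struct.unpack_from(fmt, index, pos)[0]
            pos += struct.calcsize(fmt)
        else:
            raise NotImplementedError(f"field type {field:#04x} not handled in this sketch")
    raise ValueError("resources index is missing its end of index marker")
```

Note that the index records only lengths; the named bytes themselves live in the blob data that follows the index.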
Following the resources index is blob data.
Blob data is logically composed of different sections holding data for different fields for different resources. But there is no internal structure or separators: all the individual blobs are just laid out next to each other. The resources index for a given field will describe where in a blob section a particular value occurs.
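Because values are laid out back to back, a reader can recover each value's position from the lengths alone. The sketch below assumes that values within a blob section appear in the same order as their lengths appear in the resources index, and that NULL padding adds exactly one 0x00 byte between consecutive values. Both are assumptions made for illustration, and ``value_ranges`` is a hypothetical helper.

```python
def value_ranges(lengths, section_start=0, null_padding=False):
    """Return (start, end) byte ranges for consecutive values in one section.

    Assumes values are stored in index order; with null_padding, one 0x00
    byte is assumed to separate consecutive values.
    """
    pad = 1 if null_padding else 0
    ranges = []
    offset = section_start
    for length in lengths:
        ranges.append((offset, offset + length))
        offset += length + pad
    return ranges


# Usage: three 3-byte values packed next to each other.
blob = b"foobarbaz"
values = [blob[start:end] for start, end in value_ranges([3, 3, 3])]
```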
pyembed\x01 Format

The initially released/formalized packed resources data format.
Supports resource field types up to and including 0x15.
pyembed\x02 Format

Version 2 of the packed resources data format.
This version introduces field type values 0x16 to 0x1a. The
resource flavor field type (0x02) is deprecated and the individual
field types denoting resource types should be used instead.
(PyOxidizer removed run-time code looking at field type 0x02 when
this format was introduced.)
pyembed\x03 Format

Version 3 of the packed resources data format.
This version introduces field type values 0x1b to 0x1e.
These fields provide the ability for a resource to identify itself as an arbitrary filename and for the arbitrary file data to be embedded within the data structure or referenced via a relative path.
Unlike previous fields that use OS-native encoding of filesystem
paths ([u8] on POSIX and [u16] on Windows), the paths for
these new fields use UTF-8. This can't represent all valid paths on
all platforms. But it is portable and works for most paths encountered
in the wild.
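The version sections above imply an upper bound on the resource field types a reader should expect for each format version. A minimal sketch of that check follows. The names are hypothetical, and treating the 0xff end of entry marker as valid in every version is an assumption.

```python
# Highest resource field type introduced by each format version,
# taken from the version descriptions above.
MAX_FIELD_TYPE = {1: 0x15, 2: 0x1A, 3: 0x1E}


def field_supported(version: int, field_type: int) -> bool:
    """Report whether a resources index field type exists in a version."""
    if version not in MAX_FIELD_TYPE:
        raise ValueError(f"unknown format version {version}")
    # 0xff (end of entry) is assumed to be valid in every version.
    return field_type == 0xFF or 0x00 <= field_type <= MAX_FIELD_TYPE[version]
```

A parser could use such a check to reject, for example, a 0x1b field in data that declares itself as version 2.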
The design of the packed resources data format was influenced by a handful of considerations.
Performance is a significant consideration. We want everything to be as fast as possible. Possible dimensions influencing performance include parse time, payload size, and I/O access patterns.
The payload is designed such that the index data is at the beginning, so a reader only has to read a contiguous slice of data to fully understand the data within. This avoids jumping around the entire data structure to extract metadata. It also means that we only need to page in a fraction of the total backing data structure in order to initialize our custom importer. In addition, the index data is read sequentially, and sequential I/O is generally faster than random access I/O.
x86 is little endian, so we use little endian integers so we don't need to waste cycles on endian transformation.
We store all data for the same field next to each other in the data structure, as opposed to, say, packing all of resource A's data, then resource B's, and so on. We do this to help maximize locality for similar data. This can help with performance because the same field for multiple resources is often accessed together. For example, an importer will access a bunch of module bytecode entries at the same time. This locality helps minimize the number of pages that must be read. Locality can also help yield higher compression ratios.
Everything is designed to facilitate a reader leveraging 0-copy. If a
reader has the data structure in memory, we don't want to require it
to copy memory in order to reference entries. In Rust speak, we should
be able to hold &[u8] references everywhere.
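The same idea can be illustrated in Python, where a ``memoryview`` slice plays a role loosely analogous to a borrowed byte slice: it references the backing buffer instead of copying it. This is only an analogy sketch, and ``slice_view`` is a hypothetical helper.

```python
def slice_view(buffer, start, length):
    """Return a view of buffer[start:start + length] without copying bytes."""
    return memoryview(buffer)[start:start + length]


# Usage: the view tracks the backing buffer because no copy was made.
backing = bytearray(b"pyembed\x03module")
name = slice_view(backing, 8, 6)
backing[8] = ord("M")  # same-size mutation is visible through the view
```

Because the view and the buffer share memory, holding many such views costs no additional copies of the resource data.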
There is no checksumming of the data because we don't want to incur I/O overhead to read the entire blob. It could be added as an optional feature.
This data structure is robust enough to be used by PyOxidizer to power importing of every Python module used by a Python interpreter. However, there are various aspects that could be improved.
A potential area for optimization is use of general compression. Various fields should compress well - either in streaming mode or by utilizing compression dictionaries. Compression would undermine 0-copy, of course. But in environments where we want to optimize for size, it could be desirable.