pyteleport - teleporting python runtime
I often find myself in a situation where different parts of a script are most optimal to run in different environments. Like, for example fetching the configuration can be done locally, dataset manipulation is best done in a many-CPU environment, while machine learning requires a GPU-capable machine. Managing this is a nightmare: you split jobs into multiple steps, save intermediate results, deposit and transmit data, schedule job chains, and, most annoying, ensure the integrity of the whole run. There are many infrastructure solutions for this problem which come at a price of having to deploy and to maintain them as well.
Instead, I was recently inspired by the idea
of forking a process to another machine right in the middle of its execution.
Indeed, it is rather tempting to not worry about intermediate variables
which you have to save and to load, often from cumbersome locations, and to
just swap the environment without interrupting the logic. I created a tiny
proof of concept which does this in
the context of python scripts. Also available on pip: pip install pyteleport
.
This post is about technical details of pyteleport
and transmitting runtime
states in general.
Contents
- Worthy mentions
- What pyteleport does?
- Overview
- How to make a snapshot?
- Capturing the value stack
- How to determine the top of the value stack in cPython?
- Morph: a bytecode to resume the execution
- The code
Worthy mentions
Both projects are for Unix and may require additional kernel modules. They are also low-level and capable of transmitting the entire Unix process, regardless of how they were compiled or run. While this approach does look more universal and capable than what I am going to present here, there are also disadvantages: namely, it is very hard to control the details of which parts of the runtime data are transferred and how exactly the (de-)serialization is performed.
In any case, here I am focusing on cPython processes and python runtime only: I will demonstrate how the state of python execution can be saved, transmitted and restored at the cPython bytecode and object level.
What pyteleport does?
From the project README
:
pyteleport
is capable of making snapshots of python runtime from an almost arbitrary state, including locals, globals and stack. It then transforms snapshots into a specially designed bytecode that resumes the execution from the snapshot state. The bytecode is portable and can be run locally or remotely: this way the “teleportation” is achieved.
Overview
Image: basics of teleporting
Pyteleport
aims at two basic tasks: (1) making the snapshot of python runtime
as complete as it is possible and (2) transforming the snapshot into a code
that will develop the runtime into the snapshot state and will resume afterwards.
Consider the following code.
x = sum(range(3))
tp_shell(...)
print(x)
It first computes the sum, then teleports and prints the result. At the point
of invoking the teleport function tp_shell
the result of the sum x=6
is
already known so pyteleport
will transform it into a state where x
is
explicitly set to x=6
.
x = 6 # sum(range(3))
tp_shell(...)
print(x)
Incomplete expressions are resumed from where the snapshot was taken. The following code
def teleport_and_return_10():
tp_shell(...)
return 10
print(1 + 2 + 3 + teleport_and_return_10() + 4)
will be transformed into the following.
def teleport_and_return_10():
tp_shell(...)
return 10
print(6 + teleport_and_return_10() + 4)
This example demonstrate the logic of teleporting only:
pyteleport
does not actually analyze your program nor it builds
any source code. pyteleport
simply picks up the stack where the
intermediate result 1+2+3=6
was already present at the point of calling
tp_shell
.
How to make a snapshot?
cPython runtime is a stack machine. This means that individual instructions
of cPython bytecode operate objects that are residing on the stack. Each
individual function call is wrapped into a runtime frame which includes
references to everything needed to run this frame. In particular, frames
include references to locals and globals which pyteleport
saves.
Schematics of cPython stack machine
Thread runtime
|- Frame 1 (paused)
|- Frame 2 (paused)
|- Frame 3 (paused)
| |- bytecode
| |- current instruction
| |- locals
| |- globals
| |- value stack
| |- block stack
|- Frame 4 (currently executed)
New frames are created through function calls. When returning or raising
unhandled exceptions frames are removed. The active (topmost) frame can be
accessed by inspect.currentframe()
. It returns a low-level immutable
object with a few fields: among them, f_locals
contains all local
variables visible by the current frame. There is also f_back
which
provides a reference to the previous frame. Other accessible fields are
f_globals
and f_builtins
, more details can be found in
inspect module docs.
Collecting all this data is relatively straightforward and the following
example demonstrates it.
import inspect
def get_all_data():
frame = inspect.currentframe().f_back
result = []
while frame is not None:
result.append((frame.f_locals, frame.f_globals, frame.f_builtins))
frame = frame.f_back
return result
print(get_all_data()) # produces a large output
Capturing the value stack
f_locals
and f_globals
are not enough to represent the runtime state of a
python frame. Consider the following example.
def f(x):
get_all_data()
return x + 1
def g(y):
return 2 * y
print(g(2) + f(3))
At the point of calling f(x)
the value of g(2)
has already been computed
but not assigned to anything. It resides inside the value stack of the frame
that called f(3)
. To properly represent the state of the runtime at the point
of calling get_all_data
, we need to know what g(2)
previously returned.
Unfortunately, the value stack information is not available from the frame data using python API. But arbitrary memory access fixes this overlooked feature. Because the value stack is a contiguous block of memory, knowing where it starts and where it ends is enough to know all its elements. Both memory addresses are available in the corresponding C struct.
struct _frame {
PyObject_VAR_HEAD
struct _frame *f_back;
PyCodeObject *f_code;
PyObject *f_builtins;
PyObject *f_globals;
PyObject *f_locals;
PyObject **f_valuestack; /* points after the last local (stack starts) */
/* Next free slot in f_valuestack. Frame creation sets to f_valuestack.
Frame evaluation usually NULLs it, but a frame that yields sets it
to the current stack top. */
PyObject **f_stacktop; /* stack ends */
...
};
Note the above struct depends on the cPython version.
With default compilation settings, _frame.f_valuestack
is 0x40
bytes after
the structure head (cPython 3.8 and 3.9). The value written in f_valuestack
is the address of the stack bottom (i.e. the least recent object).
Similarly, _frame.f_valuestack
points to the stack top (the most recent object).
By reading both, it should be possible to recover all objects in the value
stack. This is how it is done in pyteleport
.
def get_value_stack(frame):
"""
Collects frame stack for generator objects.
Parameters
----------
frame : FrameObject
Frame to process.
Returns
-------
stack : list
Stack contents.
"""
addr = id(frame) # get frame address
stack_bot = read_ptr(addr + 0x40)
stack_top = read_ptr(addr + 0x48)
data = Mem(stack_bot, stack_top - stack_bot)[:] # stack as bytes
result = []
for i in range(0, len(data), 8):
obj_ref = int.from_bytes(data[i:i + 8], "little")
result.append(ctypes.cast(obj_ref, ctypes.py_object).value)
return result
As it is evident from the docstring, this only works with stack frames of generator
objects that are not currently active.
There is an issue with stack_top
which is not set by default (it is optimized out
because cPython does not need this field for currently executed stack).
However, cPython still needs to save it in-between next(...)
calls.
Is it still possible to figure out the value stack top in all other cases?
How to determine the top of the value stack in cPython?
The idea that pyteleport
employs is simple: figure out the object on the top of the
stack and find it in the memory.
- Inject a beacon object into the frame by editing the corresponding bytecode.
- Find the beacon in the value stack, collect all values in-between, and, optionally, restore the bytecode to its original state.
Beacon injection - simple variant
Consider the following line of code.
beacon, get_value_stack_from_beacon(beacon)
It creates a 2-tuple of a beacon
and the function return value that is not
stored anywhere. Importantly, before calling get_value_stack_from_beacon
a beacon
object is put on top of the stack. Assuming the object is not used
elsewhere, get_value_stack_from_beacon
can scan for its argument in the
memory. pyteleport
implementation looks like the following.
def get_value_stack_from_beacon(frame, beacon, null=NULL()):
"""
Collects frame stack using beacon as an indicator of stack top.
Parameters
----------
frame : FrameObject
Frame to process.
beacon : int
Value on top of the stack.
null
The NULL object replacement.
Returns
-------
stack : list
Stack contents.
"""
beacon = id(beacon) # get object address
addr = id(frame) # get frame address
stack_bot = read_ptr(addr + 0x40) # get value stack bottom
stack_view = Mem(stack_bot, frame.f_code.co_stacksize * 8)
stack_view = stack_view[:] # memory as bytes
result = []
for i in range(0, len(stack_view), 8):
obj_ref = int.from_bytes(stack_view[i:i + 8], "little")
if obj_ref == beacon:
return result
if obj_ref == 0:
result.append(null)
else:
result.append(ctypes.cast(obj_ref, ctypes.py_object).value)
raise RuntimeError("Failed to find the beacon")
Provided get_value_stack_from_beacon
is called in this very special
way, it is possible to recover the value stack of the outermost frame.
It will not work for nested calls because the beacon object is not
injected when calling a()
or b()
in the following example.
def a():
def b():
beacon, get_value_stack_from_beacon(beacon)
b()
a()
get_value_stack_from_beacon
won’t work for a() and the root frame
To make it work, one has to anticipate the call to get_value_stack_from_beacon
throughout the entire function stack.
def a():
def b():
beacon, get_value_stack_from_beacon(beacon)
beacon, b()
beacon, a()
beacon
has to be inserted multiple times throughout the function stack
This is, of course, not always possible as the function stack may include library functions which are difficult to modify.
Beacon injection - hacking the bytecode
The idea pyteleport
currently employs is to inject the beacon object
by modifying the frame bytecode through memory editing.
Afterwards, the execution is briefly returned to the executing frame to
process the patch and then the memory is scanned for the beacon.
The following bytecode patch is the example of how a beacon is injected.
...
-2 LOAD_... get_value_stack_from_beacon2
0 CALL_FUNCTION 0 <-- Calls get_value_stack_from_beacon2 which returns
a two-tuple with the beacon and the callback
function which processes the value stack
2 UNPACK_SEQUENCE 2 <-- PATCH: unpacks a two-tuple (callback, beacon)
returned by get_all_data
4 CALL_FUNCTION 0 <-- PATCH: calls the callback while beacon remains
on top of the stack
The following code snippet gives a general idea behind the patching function which re-uses the previous implementation as a callback.
def get_value_stack_from_beacon2():
frame = inspect.currentframe().f_back # upper frame
beacon = object() # unique beacon object
stack_bottom = read_ptr(id(frame) + 0x40)
# patch the frame
mem = Mem.(frame.f_code.co_code)
mem[frame.f_lasti + 2:frame.f_lasti + 6] =\
[UNPACK_SEQUENCE, 2, CALL_FUNCTION, 0]
# the callback function with pre-set arguments
callback = partial(get_value_stack_from_beacon, frame, beacon)
# return the two-tuple to be processed by the patch
return callback, beacon
This implementation can be called as an ordinary function without worrying
about the beacon
: get_value_stack_from_beacon2()
. Note that some
post-processing still has to be done: namely,
- revert the bytecode and current instruction to their original values
- pop the
beacon
from the stack while maintaining the return value of thecallback
.
Hacking the rest of the stack
The only missing piece is the strategy to process outer stack calls.
pyteleport
does this by patching frames one-by-one and exiting (returning) them
once the value stacks for each specific frame has been collected.
Morph: a bytecode to resume the execution
The final outcome of pyteleport
is the so-called morph bytecode stored in
pyc
format which you can
send elsewhere to resume the execution from the snapshot. It can be self-contained
or pull the snapshot data from outside: it depends on how python object serialization
is implemented. By default, it uses dill to store
objects as bytes embedded in pyc
. The latter is run as easy as
python morph.pyc
The structure of the morph bytecode is simple: it starts with a specially designed header that develops locals and value stack and then jumps into the original bytecode that follows right after. The simplest version of such bytecode is the following.
# develop locals
LOAD_CONST saved_locals # load a tuple of all locals
UNPACK_SEQUENCE len(saved_locals) # unpack the tuple into a stack
STORE_FAST local_var_name_1 # assign the first local variable
STORE_FAST local_var_name_2 # assign the second one
...
STORE_FAST local_var_name_n # assign the last local variable
# develop the value stack
LOAD_CONST saved_value_stack
UNPACK_SEQUENCE len(saved_value_stack) # unpack the value stack
# jump to the saved bytecode position
JUMP_ABSOLUTE saved_position # jump forward to where to resume from
# the original bytecode follows from here
LOAD_CONST ...
...
...
...
RETURN_VALUE
The code
Feel free to check the code on my github.