cut_image

I often find myself in a situation where different parts of a script are most optimal to run in different environments. Like, for example fetching the configuration can be done locally, dataset manipulation is best done in a many-CPU environment, while machine learning requires a GPU-capable machine. Managing this is a nightmare: you split jobs into multiple steps, save intermediate results, deposit and transmit data, schedule job chains, and, most annoying, ensure the integrity of the whole run. There are many infrastructure solutions for this problem which come at a price of having to deploy and to maintain them as well.

Instead, I was recently inspired by the idea of forking a process to another machine right in the middle of its execution. Indeed, it is rather tempting to not worry about intermediate variables which you have to save and to load, often from cumbersome locations, and to just swap the environment without interrupting the logic. I created a tiny proof of concept which does this in the context of python scripts. Also available on pip: pip install pyteleport. This post is about technical details of pyteleport and transmitting runtime states in general.

Contents

  1. Worthy mentions
  2. What pyteleport does?
  3. Overview
  4. How to make a snapshot?
  5. Capturing the value stack
  6. How to determine the top of the value stack in cPython?
    1. Beacon injection - simple variant
    2. Beacon injection - hacking the bytecode
  7. Morph: a bytecode to resume the execution
  8. The code

Worthy mentions

Both projects are for Unix and may require additional kernel modules. They are also low-level and capable of transmitting the entire Unix process, regardless of how they were compiled or run. While this approach does look more universal and capable than what I am going to present here, there are also disadvantages: namely, it is very hard to control the details of which parts of the runtime data are transferred and how exactly the (de-)serialization is performed.

In any case, here I am focusing on cPython processes and python runtime only: I will demonstrate how the state of python execution can be saved, transmitted and restored at the cPython bytecode and object level.

What pyteleport does?

From the project README:

pyteleport is capable of making snapshots of python runtime from an almost arbitrary state, including locals, globals and stack. It then transforms snapshots into a specially designed bytecode that resumes the execution from the snapshot state. The bytecode is portable and can be run locally or remotely: this way the “teleportation” is achieved.

Overview

pyteleport structure

Image: basics of teleporting

Pyteleport aims at two basic tasks: (1) making the snapshot of python runtime as complete as it is possible and (2) transforming the snapshot into a code that will develop the runtime into the snapshot state and will resume afterwards.

Consider the following code.

x = sum(range(3))
tp_shell(...)
print(x)

It first computes the sum, then teleports and prints the result. At the point of invoking the teleport function tp_shell the result of the sum x=6 is already known so pyteleport will transform it into a state where x is explicitly set to x=6.

x = 6  # sum(range(3))
tp_shell(...)
print(x)

Incomplete expressions are resumed from where the snapshot was taken. The following code

def teleport_and_return_10():
    tp_shell(...)
    return 10

print(1 + 2 + 3 + teleport_and_return_10() + 4)

will be transformed into the following.

def teleport_and_return_10():
    tp_shell(...)
    return 10

print(6 + teleport_and_return_10() + 4)

This example demonstrate the logic of teleporting only: pyteleport does not actually analyze your program nor it builds any source code. pyteleport simply picks up the stack where the intermediate result 1+2+3=6 was already present at the point of calling tp_shell.

How to make a snapshot?

cPython runtime is a stack machine. This means that individual instructions of cPython bytecode operate objects that are residing on the stack. Each individual function call is wrapped into a runtime frame which includes references to everything needed to run this frame. In particular, frames include references to locals and globals which pyteleport saves.

Schematics of cPython stack machine

Thread runtime
|- Frame 1 (paused)
|- Frame 2 (paused)
|- Frame 3 (paused)
|  |- bytecode
|  |- current instruction
|  |- locals
|  |- globals
|  |- value stack
|  |- block stack
|- Frame 4 (currently executed)

New frames are created through function calls. When returning or raising unhandled exceptions frames are removed. The active (topmost) frame can be accessed by inspect.currentframe(). It returns a low-level immutable object with a few fields: among them, f_locals contains all local variables visible by the current frame. There is also f_back which provides a reference to the previous frame. Other accessible fields are f_globals and f_builtins, more details can be found in inspect module docs. Collecting all this data is relatively straightforward and the following example demonstrates it.

import inspect

def get_all_data():
    frame = inspect.currentframe().f_back
    result = []
    while frame is not None:
        result.append((frame.f_locals, frame.f_globals, frame.f_builtins))
        frame = frame.f_back
    return result

print(get_all_data())  # produces a large output

Capturing the value stack

f_locals and f_globals are not enough to represent the runtime state of a python frame. Consider the following example.

def f(x):
    get_all_data()
    return x + 1

def g(y):
    return 2 * y

print(g(2) + f(3))

At the point of calling f(x) the value of g(2) has already been computed but not assigned to anything. It resides inside the value stack of the frame that called f(3). To properly represent the state of the runtime at the point of calling get_all_data, we need to know what g(2) previously returned.

Unfortunately, the value stack information is not available from the frame data using python API. But arbitrary memory access fixes this overlooked feature. Because the value stack is a contiguous block of memory, knowing where it starts and where it ends is enough to know all its elements. Both memory addresses are available in the corresponding C struct.

struct _frame {
    PyObject_VAR_HEAD
    struct _frame *f_back;
    PyCodeObject *f_code;
    PyObject *f_builtins;
    PyObject *f_globals;
    PyObject *f_locals;
    PyObject **f_valuestack;    /* points after the last local (stack starts) */
    /* Next free slot in f_valuestack.  Frame creation sets to f_valuestack.
       Frame evaluation usually NULLs it, but a frame that yields sets it
       to the current stack top. */
    PyObject **f_stacktop;   /* stack ends */
    ...
};

Note the above struct depends on the cPython version.

With default compilation settings, _frame.f_valuestack is 0x40 bytes after the structure head (cPython 3.8 and 3.9). The value written in f_valuestack is the address of the stack bottom (i.e. the least recent object). Similarly, _frame.f_valuestack points to the stack top (the most recent object). By reading both, it should be possible to recover all objects in the value stack. This is how it is done in pyteleport.

def get_value_stack(frame):
    """
    Collects frame stack for generator objects.

    Parameters
    ----------
    frame : FrameObject
        Frame to process.

    Returns
    -------
    stack : list
        Stack contents.
    """
    addr = id(frame)  # get frame address
    stack_bot = read_ptr(addr + 0x40)
    stack_top = read_ptr(addr + 0x48)
    data = Mem(stack_bot, stack_top - stack_bot)[:]  # stack as bytes
    result = []
    for i in range(0, len(data), 8):
        obj_ref = int.from_bytes(data[i:i + 8], "little")
        result.append(ctypes.cast(obj_ref, ctypes.py_object).value)
    return result

As it is evident from the docstring, this only works with stack frames of generator objects that are not currently active. There is an issue with stack_top which is not set by default (it is optimized out because cPython does not need this field for currently executed stack). However, cPython still needs to save it in-between next(...) calls. Is it still possible to figure out the value stack top in all other cases?

How to determine the top of the value stack in cPython?

The idea that pyteleport employs is simple: figure out the object on the top of the stack and find it in the memory.

  1. Inject a beacon object into the frame by editing the corresponding bytecode.
  2. Find the beacon in the value stack, collect all values in-between, and, optionally, restore the bytecode to its original state.

Beacon injection - simple variant

Consider the following line of code.

beacon, get_value_stack_from_beacon(beacon)

It creates a 2-tuple of a beacon and the function return value that is not stored anywhere. Importantly, before calling get_value_stack_from_beacon a beacon object is put on top of the stack. Assuming the object is not used elsewhere, get_value_stack_from_beacon can scan for its argument in the memory. pyteleport implementation looks like the following.

def get_value_stack_from_beacon(frame, beacon, null=NULL()):
    """
    Collects frame stack using beacon as an indicator of stack top.

    Parameters
    ----------
    frame : FrameObject
        Frame to process.
    beacon : int
        Value on top of the stack.
    null
        The NULL object replacement.

    Returns
    -------
    stack : list
        Stack contents.
    """
    beacon = id(beacon)  # get object address
    addr = id(frame)  # get frame address
    stack_bot = read_ptr(addr + 0x40)  # get value stack bottom
    stack_view = Mem(stack_bot, frame.f_code.co_stacksize * 8)
    stack_view = stack_view[:]  # memory as bytes
    result = []
    for i in range(0, len(stack_view), 8):
        obj_ref = int.from_bytes(stack_view[i:i + 8], "little")
        if obj_ref == beacon:
            return result
        if obj_ref == 0:
            result.append(null)
        else:
            result.append(ctypes.cast(obj_ref, ctypes.py_object).value)
    raise RuntimeError("Failed to find the beacon")

Provided get_value_stack_from_beacon is called in this very special way, it is possible to recover the value stack of the outermost frame. It will not work for nested calls because the beacon object is not injected when calling a() or b() in the following example.

def a():
    def b():
        beacon, get_value_stack_from_beacon(beacon)
    b()
a()

get_value_stack_from_beacon won’t work for a() and the root frame

To make it work, one has to anticipate the call to get_value_stack_from_beacon throughout the entire function stack.

def a():
    def b():
        beacon, get_value_stack_from_beacon(beacon)
    beacon, b()
beacon, a()

beacon has to be inserted multiple times throughout the function stack

This is, of course, not always possible as the function stack may include library functions which are difficult to modify.

Beacon injection - hacking the bytecode

The idea pyteleport currently employs is to inject the beacon object by modifying the frame bytecode through memory editing. Afterwards, the execution is briefly returned to the executing frame to process the patch and then the memory is scanned for the beacon. The following bytecode patch is the example of how a beacon is injected.

...
-2 LOAD_...        get_value_stack_from_beacon2
 0 CALL_FUNCTION   0   <-- Calls get_value_stack_from_beacon2 which returns
                           a two-tuple with the beacon and the callback
                           function which processes the value stack
 2 UNPACK_SEQUENCE 2   <-- PATCH: unpacks a two-tuple (callback, beacon)
                                  returned by get_all_data
 4 CALL_FUNCTION   0   <-- PATCH: calls the callback while beacon remains
                                  on top of the stack

The following code snippet gives a general idea behind the patching function which re-uses the previous implementation as a callback.

def get_value_stack_from_beacon2():
    frame = inspect.currentframe().f_back  # upper frame
    beacon = object()  # unique beacon object
    stack_bottom = read_ptr(id(frame) + 0x40)

    # patch the frame
    mem = Mem.(frame.f_code.co_code)
    mem[frame.f_lasti + 2:frame.f_lasti + 6] =\
        [UNPACK_SEQUENCE, 2, CALL_FUNCTION, 0]

    # the callback function with pre-set arguments
    callback = partial(get_value_stack_from_beacon, frame, beacon)
    # return the two-tuple to be processed by the patch
    return callback, beacon

This implementation can be called as an ordinary function without worrying about the beacon: get_value_stack_from_beacon2(). Note that some post-processing still has to be done: namely,

  1. revert the bytecode and current instruction to their original values
  2. pop the beacon from the stack while maintaining the return value of the callback.

Hacking the rest of the stack

The only missing piece is the strategy to process outer stack calls. pyteleport does this by patching frames one-by-one and exiting (returning) them once the value stacks for each specific frame has been collected.

Morph: a bytecode to resume the execution

The final outcome of pyteleport is the so-called morph bytecode stored in pyc format which you can send elsewhere to resume the execution from the snapshot. It can be self-contained or pull the snapshot data from outside: it depends on how python object serialization is implemented. By default, it uses dill to store objects as bytes embedded in pyc. The latter is run as easy as

python morph.pyc

The structure of the morph bytecode is simple: it starts with a specially designed header that develops locals and value stack and then jumps into the original bytecode that follows right after. The simplest version of such bytecode is the following.

# develop locals
LOAD_CONST      saved_locals           # load a tuple of all locals
UNPACK_SEQUENCE len(saved_locals)      # unpack the tuple into a stack
STORE_FAST      local_var_name_1       # assign the first local variable
STORE_FAST      local_var_name_2       # assign the second one
...
STORE_FAST      local_var_name_n       # assign the last local variable

# develop the value stack
LOAD_CONST      saved_value_stack
UNPACK_SEQUENCE len(saved_value_stack) # unpack the value stack

# jump to the saved bytecode position
JUMP_ABSOLUTE   saved_position         # jump forward to where to resume from
# the original bytecode follows from here
LOAD_CONST ...
...
...
...
RETURN_VALUE

The code

Feel free to check the code on my github.