Persistable#

class jangada.serialization.Persistable(*args, **kwargs)#

Bases: Serializable

Base class for objects that can be persisted to HDF5 files.

Persistable extends Serializable to provide HDF5 file I/O, supporting both full object loading and lazy access via context managers. The HDF5 backend enables efficient storage of large arrays, hierarchical data structures, and metadata.

Parameters:
*args : tuple

Variable positional arguments:
  • No args: Normal Serializable construction
  • Single Path/str: Load from file
  • Single Path/str + kwargs: Prepare for context manager

**kwargs : dict

For normal construction: property values. For context manager use: must include the ‘mode’ parameter.

Raises:
ValueError

If kwargs are provided with filepath but ‘mode’ is missing, or if unknown kwargs are provided.

FileNotFoundError

If loading from a non-existent file.

See also

Serializable

Parent class for serialization

ProxyDataset

Lazy loading wrapper for HDF5 datasets

SerializableProperty

Property descriptor

Notes

HDF5 File Structure#

Files are organized as:

file.hdf5
└── root (group)
    ├── __class__ (attribute)
    ├── property1 (attribute or dataset)
    ├── property2 (group for lists/dicts)
    │   ├── __container_type__ (attribute)
    │   └── ... (items)
    └── property3 (dataset for arrays)
        ├── __dataset_type__ (attribute)
        └── ... (metadata attributes)

Type Mapping#

  • None: Stored as string ‘NoneType:None’ in attributes

  • str, Number: Stored directly in attributes

  • Path: Stored as ‘Path:/absolute/path’ in attributes

  • list, dict: Stored as groups with __container_type__ attribute

  • numpy arrays, pandas timestamps: Stored as datasets with __dataset_type__

  • Nested Serializable: Stored recursively as groups

ProxyDataset#

In context manager mode, array properties become ProxyDataset instances that load data on-demand. This enables efficient access to large files without loading everything into memory.

Performance#

  • Use context manager mode for large arrays

  • Append operations are efficient (no full rewrite)

  • First dimension of datasets is resizable

  • Consider HDF5 chunking for specific access patterns

Examples

Define a Persistable class:

class Experiment(Persistable):
    name = SerializableProperty(default="")
    temperature = SerializableProperty(default=293.15)
    data = SerializableProperty(default=None)

Normal construction and saving:

exp = Experiment(name="Test1", temperature=373.15)
exp.save('experiment.hdf5')

Load from file:

exp = Experiment.load('experiment.hdf5')
print(exp.name)  # 'Test1'

Or use constructor:

exp = Experiment('experiment.hdf5')
print(exp.name)  # 'Test1'

Context manager for lazy loading:

with Experiment('experiment.hdf5', mode='r') as exp:
    # data is ProxyDataset - not loaded until accessed
    chunk = exp.data[100:200]
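
Passing extra keyword arguments together with a file path requires a 'mode' keyword; without it the constructor raises the ValueError described under Raises (a minimal sketch):

Experiment('experiment.hdf5', name="oops")  # ValueError: 'mode' is missing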

Append data efficiently:

with Experiment('experiment.hdf5', mode='r+') as exp:
    exp.data.append(new_values)  # new_values: a NumPy array of additional samples

Nested objects work automatically:

class Trial(Persistable):
    trial_num = SerializableProperty(default=0)
    experiment = SerializableProperty(default=None)

trial = Trial(trial_num=1, experiment=exp)
trial.save('trial.hdf5')

Initialization#

Persistable.__init__(*args, **kwargs)

Initialize a Persistable object.

Context Manager Protocol#

Methods for interactive file access with lazy loading.

Persistable.__enter__()

Enter context manager mode.

Persistable.__exit__(exc_type, exc_val, exc_tb)

Exit context manager mode.
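
A with statement simply drives these two methods; the equivalent manual form looks roughly like this (a sketch, assuming __enter__ returns the object itself, as the with-statement examples on this page imply):

exp = Experiment('experiment.hdf5', mode='r')
obj = exp.__enter__()                # open the file; array properties become ProxyDataset instances
try:
    print(obj.name)
finally:
    exp.__exit__(None, None, None)   # close the file even if an error occurred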

File I/O Methods#

High-level methods for saving and loading objects.

Persistable.save(path[, overwrite, ...])

Save this object to an HDF5 file.

Persistable.load(path)

Load an object from an HDF5 file.
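
A short sketch of both (only the overwrite keyword is spelled out in the signature above; its exact semantics are assumed here):

exp = Experiment(name="Test1")
exp.save('experiment.hdf5')
exp.save('experiment.hdf5', overwrite=True)  # assumption: overwrite=True replaces an existing file
loaded = Experiment.load('experiment.hdf5')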

Low-Level Serialization#

Low-level methods for HDF5 data conversion.

Persistable.save_serialized_data(path, data)

Save serialized data dictionary to HDF5 file.

Persistable.load_serialized_data(path)

Load serialized data dictionary from HDF5 file.
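
A hedged sketch of how the two methods pair up (the layout of the serialized dictionary, and whether the methods are class- or instance-level, are assumptions here):

raw = Persistable.load_serialized_data('experiment.hdf5')  # nested dict mirroring the HDF5 structure
Persistable.save_serialized_data('copy.hdf5', raw)         # write the same structure to a new file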

Nested Classes#

Persistable.ProxyDataset(dataset)

Lazy-loading wrapper for HDF5 datasets.

Module-Level Functions#

load(path)

Module-level convenience function for loading Persistable objects.

Overview#

Persistable extends Serializable with HDF5 file I/O capabilities, enabling efficient storage and retrieval of scientific data. It provides both full object loading and lazy access via context managers.

Key Features#

Three Access Modes

Load entire objects, access data lazily, or use context managers for interactive file manipulation.

ProxyDataset

Lazy-loading wrapper for large arrays - access slices without loading entire datasets into memory.

Automatic Type Mapping

Handles primitives, arrays, collections, and nested objects automatically.

Resizable Datasets

Append to arrays without rewriting entire files.

Metadata Preservation

Stores class information and dataset metadata for accurate reconstruction.

Usage Patterns#

Basic Save and Load#

Define a Persistable class:

class Experiment(Persistable):
    name = SerializableProperty(default="")
    temperature = SerializableProperty(default=293.15)
    data = SerializableProperty(default=None)

Save to file:

exp = Experiment(
    name="Test1",
    temperature=373.15,
    data=np.array([1.2, 3.4, 5.6])
)
exp.save('experiment.hdf5')

Load from file:

# Method 1: Class method
exp = Experiment.load('experiment.hdf5')

# Method 2: Constructor
exp = Experiment('experiment.hdf5')

# Method 3: Module-level function
from jangada.serialization import load
exp = load('experiment.hdf5')

Context Manager for Lazy Loading#

For large files, use context manager mode to avoid loading everything:

with Experiment('large_data.hdf5', mode='r') as exp:
    # exp.data is a ProxyDataset - not loaded yet
    print(exp.data.shape)  # (1000000,)

    # Load only what you need
    chunk = exp.data[100:200]  # Loads only 100 elements

    # Access metadata without loading data
    print(exp.name)
    print(exp.temperature)

File access modes:
  • 'r': Read-only

  • 'r+': Read and write

  • 'w': Write (create new, truncate if exists)

  • 'a': Read/write, create if doesn’t exist
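
The mode controls whether in-place edits are allowed (a minimal sketch, reusing the file saved above):

with Experiment('experiment.hdf5', mode='r') as exp:
    first = exp.data[0]        # reading works in read-only mode

with Experiment('experiment.hdf5', mode='r+') as exp:
    exp.data[0] = first + 1.0  # writing requires 'r+', 'w', or 'a'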

Modifying Data Efficiently#

Update existing files without full reload:

with Experiment('data.hdf5', mode='r+') as exp:
    # Modify single values
    exp.data[50] = 99.9

    # Modify slices
    exp.data[10:20] = np.zeros(10)

    # Append new data (efficient - no full rewrite)
    exp.data.append(np.array([7, 8, 9]))

Incremental Data Collection#

Build datasets over time:

# Initial creation
exp = Experiment(
    name="Time Series",
    data=np.array([1.0, 2.0, 3.0])
)
exp.save('timeseries.hdf5')

# Append data later
for i in range(10):
    with Experiment('timeseries.hdf5', mode='r+') as exp:
        new_data = collect_measurements()
        exp.data.append(new_data)

Nested Objects#

Hierarchical structures work automatically:

class Measurement(Persistable):
    timestamp = SerializableProperty(default=None)
    value = SerializableProperty(default=0.0)

class Experiment(Persistable):
    name = SerializableProperty(default="")
    measurements = SerializableProperty(default=None)

exp = Experiment(
    name="Multi-point",
    measurements=[
        Measurement(timestamp=pd.Timestamp('2024-01-01'), value=1.2),
        Measurement(timestamp=pd.Timestamp('2024-01-02'), value=3.4),
        Measurement(timestamp=pd.Timestamp('2024-01-03'), value=5.6)
    ]
)

exp.save('experiment.hdf5')

# Load preserves structure
loaded = Experiment.load('experiment.hdf5')
assert len(loaded.measurements) == 3
assert loaded.measurements[0].value == 1.2

ProxyDataset#

Overview#

ProxyDataset is a lazy-loading wrapper for HDF5 datasets that provides array-like access without loading data into memory until accessed.

When to Use#

Use ProxyDataset (context manager mode) when:
  • Working with large arrays (GB-scale)

  • Only need to access parts of the data

  • Want to append data incrementally

  • Memory is limited

Use full loading (load() method) when:
  • Arrays are small (MB-scale)

  • Need full numpy array operations

  • Will access most/all of the data

  • Memory is not a concern
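
A simple way to apply this guidance is to check the on-disk size before deciding how much to load (a sketch using only documented ProxyDataset properties):

with Experiment('data.hdf5', mode='r') as exp:
    if exp.data.nbytes < 100 * 1024**2:   # under ~100 MB: small enough to materialize
        data = exp.data[:]                # load the full array
    else:
        data = exp.data[:10_000]          # otherwise work on slices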

Supported Operations#

ProxyDataset supports:

# Indexing and slicing
value = proxy[10]
chunk = proxy[100:200]
section = proxy[10:20, 5:15]  # Multidimensional

# Assignment
proxy[10] = 99
proxy[10:20] = new_values

# Appending
proxy.append(new_data)

# Properties
proxy.shape
proxy.dtype
proxy.ndim
proxy.size
proxy.nbytes
proxy.attrs  # Metadata

Not supported (load data first for these):

# Arithmetic operations
result = proxy * 2  # ✗ Not supported

# Instead:
data = proxy[:]  # Load all
result = data * 2  # ✓ Now works

# Iteration
for item in proxy:  # ✗ Not supported
    pass

# Universal functions
np.mean(proxy)  # ✗ Not supported
np.mean(proxy[:])  # ✓ Load first
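
When a full-array reduction is needed but memory is tight, reduce chunk by chunk instead (a minimal sketch built only on the supported slicing operations):

total = 0.0
count = 0
for start in range(0, proxy.shape[0], 10_000):
    block = proxy[start:start + 10_000]   # only this slice is read from disk
    total += block.sum()
    count += block.size
mean = total / count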

Examples#

Efficient slicing of large data:

with Experiment('big_data.hdf5', mode='r') as exp:
    # Dataset is 10GB, only load what you need
    morning_data = exp.measurements[0:1000]
    afternoon_data = exp.measurements[5000:6000]

Appending to time series:

# Start with initial data
ts = TimeSeries(data=np.array([1, 2, 3]))
ts.save('timeseries.hdf5')

# Append over time
for day in range(30):
    with TimeSeries('timeseries.hdf5', mode='r+') as ts:
        daily_data = collect_daily_measurements()
        ts.data.append(daily_data)

Auto-resizing datasets:

with Experiment('data.hdf5', mode='r+') as exp:
    # Dataset currently has 100 elements
    print(exp.data.shape)  # (100,)

    # Setting beyond bounds auto-resizes
    exp.data[200] = 99.9
    print(exp.data.shape)  # (201,)

HDF5 File Structure#

File Organization#

Files are organized hierarchically:

experiment.hdf5
└── root (group)
    ├── __class__ = "mymodule.Experiment" (attribute)
    ├── name = "Test1" (attribute)
    ├── temperature = 373.15 (attribute)
    └── data (dataset)
        ├── __dataset_type__ = "numpy.ndarray" (attribute)
        └── [array data]
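
Because the files are plain HDF5, the layout can also be inspected with h5py directly (a sketch; the group name 'root' follows the tree above):

import h5py

with h5py.File('experiment.hdf5', 'r') as f:
    f.visit(print)                           # print every group/dataset path
    for key, value in f['root'].attrs.items():
        print(key, '=', value)               # __class__, name, temperature, ...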

Type Mapping#

Python types map to HDF5 structures:

Stored as Attributes
  • None → 'NoneType:None'

  • str → Direct storage

  • int, float, complex → Direct storage

  • Path → 'Path:/absolute/path'

Stored as Groups
  • list → Group with __container_type__ = 'list'

  • dict → Group with __container_type__ = 'dict'

  • Nested Serializable → Group with __class__ attribute

Stored as Datasets
  • numpy.ndarray → HDF5 dataset

  • pandas.Timestamp → Int64 dataset + timezone attribute

  • pandas.DatetimeIndex → Int64 dataset + timezone attribute

  • Custom dataset types → Via register_dataset_type

Example Structure#

For a complex object:

class Experiment(Persistable):
    name = SerializableProperty(default="")
    metadata = SerializableProperty(default=None)
    measurements = SerializableProperty(default=None)

exp = Experiment(
    name="Test",
    metadata={"pi": 3.14, "items": [1, 2, 3]},
    measurements=np.array([1.1, 2.2, 3.3])
)

Produces:

file.hdf5
└── root/
    ├── @__class__ = "mymodule.Experiment"
    ├── @name = "Test"
    ├── metadata/  (group)
    │   ├── @__container_type__ = "dict"
    │   ├── @pi = 3.14
    │   └── items/  (group)
    │       ├── @__container_type__ = "list"
    │       ├── @0 = 1
    │       ├── @1 = 2
    │       └── @2 = 3
    └── measurements  (dataset)
        ├── @__dataset_type__ = "numpy.ndarray"
        └── [1.1, 2.2, 3.3]

(@ denotes attributes, / denotes groups/datasets)

Advanced Usage#

Custom File Extensions#

Override the default extension:

class MyData(Persistable):
    extension = '.h5'
    data = SerializableProperty(default=None)

obj = MyData(data=np.array([1, 2, 3]))
obj.save('output')  # Saves as 'output.h5'

Multiple Files#

Save different objects to different files:

exp1 = Experiment(name="Morning")
exp1.save('morning.hdf5')

exp2 = Experiment(name="Afternoon")
exp2.save('afternoon.hdf5')

# Load them back
experiments = [
    Experiment.load('morning.hdf5'),
    Experiment.load('afternoon.hdf5')
]

Batch Processing#

Process large datasets in chunks:

with LargeDataset('data.hdf5', mode='r') as dataset:
    chunk_size = 1000
    total = dataset.data.shape[0]

    results = []
    for start in range(0, total, chunk_size):
        chunk = dataset.data[start:start + chunk_size]  # final chunk may be shorter
        results.append(process(chunk))

Compression#

HDF5 supports compression, but enabling it currently requires modifying the internal _save_data_in_group method. Future versions may support:

class CompressedData(Persistable):
    compression = 'gzip'
    compression_opts = 4

(Not currently implemented - see Future Features in test file)

Performance Tips#

Memory Efficiency#

Use context manager for large files:

# Bad: Loads entire 10GB array into memory
exp = Experiment.load('huge_data.hdf5')
chunk = exp.data[100:200]

# Good: Loads only requested chunk
with Experiment('huge_data.hdf5', mode='r') as exp:
    chunk = exp.data[100:200]

Append instead of rewriting:

# Bad: Loads, modifies, saves entire file
exp = Experiment.load('data.hdf5')
exp.data = np.concatenate([exp.data, new_values])
exp.save('data.hdf5')

# Good: Appends only new data
with Experiment('data.hdf5', mode='r+') as exp:
    exp.data.append(new_values)

I/O Performance#

Batch small operations:

# Bad: Many small appends
for value in values:
    with Experiment('data.hdf5', mode='r+') as exp:
        exp.data.append(np.array([value]))

# Good: Single append with all data
with Experiment('data.hdf5', mode='r+') as exp:
    exp.data.append(np.array(values))

Access contiguous slices:

# Good: Contiguous access
chunk = proxy[100:200]

# Slower: Non-contiguous access
elements = [proxy[i] for i in [10, 50, 100, 500]]

Storage Efficiency#

Use appropriate dtypes:

# Wasteful: float64 for integer data
data = np.array([1, 2, 3], dtype=np.float64)  # 24 bytes

# Efficient: int8 for small integers
data = np.array([1, 2, 3], dtype=np.int8)  # 3 bytes

Mark non-essential data as non-copiable:

class Analysis(Persistable):
    raw_data = SerializableProperty(default=None, copiable=True)
    # Cached result - don't save
    _cached = SerializableProperty(default=None, copiable=False)

Limitations#

Current Limitations#

  1. Circular References: circular references cause infinite recursion during save.

  2. Scalar Datasets: 0-dimensional datasets cannot be resized or appended to.

  3. First Dimension Only: datasets can only be resized along axis 0.

  4. No Compression Control: HDF5 compression cannot currently be configured from Python.

  5. No Partial Updates: you must save the entire object or use a context manager.

  6. Limited ProxyDataset API: not a full numpy array replacement; only the indexing, slicing, and append operations listed above are supported.

  7. No File Locking: multiple processes accessing the same file may cause issues.

  8. No Version Migration: changing the class structure between saves requires manual handling.

See the test_persistable.py header for the complete list of limitations and future enhancement plans.
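
As an illustration of the first limitation, an object graph containing a cycle must be avoided (a hypothetical anti-example; saving it would recurse indefinitely):

a = Experiment(name="A")
b = Experiment(name="B")
a.data = b
b.data = a      # cycle: a.save('a.hdf5') would never terminate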

Troubleshooting#

File Not Found#

Problem: FileNotFoundError when loading.

Solution: Check path is correct:

from pathlib import Path
path = Path('data.hdf5')
assert path.exists()

Permission Errors#

Problem: Cannot write to file.

Solution: Check file permissions and that file isn’t open elsewhere:

# Close any open context managers first
with exp:
    pass  # File closes here

# Now can reopen
exp.save('data.hdf5')

Type Errors#

Problem: TypeError: instances of X cannot be saved

Solution: Register the type or make it Serializable:

# Option 1: Register as primitive
Serializable.register_primitive_type(MyType)

# Option 2: Make it Serializable
class MyType(Serializable):
    ...

Corrupted Files#

Problem: File won’t load after crash.

Solution: Check the file's integrity:

# Check file integrity (external tool)
$ h5debug data.hdf5

# Or in Python
import h5py
try:
    with h5py.File('data.hdf5', 'r') as f:
        pass  # If this works, file is OK
except Exception as e:
    print(f"File corrupted: {e}")

See Also#