coco_pipe.io.structures

Standardized containers for passing data between Datasets, Preprocessing, and main modules.

This module provides the DataContainer, an N-dimensional tensor wrapper that manages metadata, coordinates, and labels alongside the raw data matrix. It serves as the common currency for the entire pipeline.

Examples

>>> import numpy as np
>>> from coco_pipe.io import DataContainer

>>> # 1. Creating a container for EEG Epochs (N_epochs, N_channels, N_time)
>>> X = np.random.randn(10, 64, 500)
>>> container = DataContainer(
...     X=X,
...     dims=('obs', 'channel', 'time'),
...     coords={
...         'channel': ['Fz', 'Cz', 'Pz'],  # ... etc
...         'time': np.linspace(0, 1.0, 500)
...     },
...     y=np.random.randint(0, 2, 10),
...     ids=[f'sub-01_trial-{i}' for i in range(10)]
... )

>>> # 2. Creating a container for simple Tabular Features (N_subjects, N_features)
>>> X_tab = np.random.randn(20, 5)
>>> container_tab = DataContainer(
...     X=X_tab,
...     dims=('obs', 'feature'),
...     coords={'feature': ['age', 'IQ', 'response_time', 'power_alpha', 'power_beta']}
... )

Attributes

logger

Classes

DataContainer

Generic container for N-dimensional neurophysiological data.

Module Contents

coco_pipe.io.structures.logger

class coco_pipe.io.structures.DataContainer

Generic container for N-dimensional neurophysiological data.

Acts as a lightweight labelled array (like xarray but simpler), managing dimensions, coordinates, and associated target labels (y) and IDs.

X

The primary data tensor. Its number of axes must match the length of dims.

Type: np.ndarray

dims

Labels for each dimension of X. Examples: ('obs', 'feature'), ('obs', 'channel', 'time'). Note: The 'obs' dimension is special and typically represents independent samples.

Type: Tuple[str, ...]

coords

Coordinates/labels for dimensions. Keys must be in dims. Values must match the length of the corresponding dimension in X.

Type: Dict[str, Union[List, np.ndarray]]

y

Target labels corresponding to the 'obs' dimension. Used for supervised learning or coloring plots.

Type: Optional[np.ndarray], optional

ids

Identifiers for observations (e.g., subject IDs, trial names). Should correspond to the 'obs' coordinate in coords, if one is provided. Kept separate from coords for convenient tracking.

Type: Optional[np.ndarray], optional

meta

Arbitrary metadata (sfreq, units, source path, etc.).

Type: Dict[str, Any]

Examples

Accessing data:

>>> container.X.shape
(10, 64, 500)

Accessing coordinates:

>>> container.coords['channel'][:3]
['Fz', 'Cz', 'Pz']
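The constraints stated in the attribute descriptions (one name in dims per axis of X, coords keys drawn from dims, coords lengths matching the axis size) can be checked with a few lines of NumPy. The sketch below is illustrative only: find_problems is a hypothetical helper, not part of coco_pipe, and it does not claim to reproduce what __post_init__ actually does.

```python
import numpy as np


def find_problems(X, dims, coords):
    """Return a list of violated constraints (hypothetical helper)."""
    problems = []
    if X.ndim != len(dims):
        problems.append(f"X has {X.ndim} axes but dims names {len(dims)}")
        return problems
    for name, values in coords.items():
        if name not in dims:
            problems.append(f"coord '{name}' is not one of dims {dims}")
            continue
        expected = X.shape[dims.index(name)]
        if len(values) != expected:
            problems.append(
                f"coord '{name}' has length {len(values)}, expected {expected}"
            )
    return problems


X = np.zeros((10, 3, 500))
dims = ("obs", "channel", "time")
good = {"channel": ["Fz", "Cz", "Pz"], "time": np.linspace(0, 1.0, 500)}
bad = {"channel": ["Fz"]}  # too short for a 3-channel axis

ok_report = find_problems(X, dims, good)
bad_report = find_problems(X, dims, bad)
```

Collecting messages instead of raising on the first one is a design choice of this sketch; it makes it easier to report every mismatch at once.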

X: numpy.ndarray

dims: Tuple[str, ...]

coords: Dict[str, List | numpy.ndarray | Sequence]

y: numpy.ndarray | None = None

ids: numpy.ndarray | None = None

meta: Dict[str, Any]

__post_init__()

property shape: Tuple[int, ...]

save(path: str | Any) → None

Save the DataContainer to disk using joblib.

Parameters: path (str or Path) – Destination file path.

classmethod load(path: str | Any) → DataContainer

Load a DataContainer from disk.

Parameters: path (str or Path) – Source file path.
Return type: DataContainer

__repr__() → str

obs_table(include_ids: bool = False, id_col: str = 'obs_id', include_y: bool = False, y_col: str = 'y', include_obs_coord: bool = False) → pandas.DataFrame

Return one-dimensional coordinates aligned to the observation axis.

This helper is useful when exporting a row-wise table from a container. It only materializes metadata that can map cleanly to one row per observation, skipping coordinates that belong to other axes such as channel, time, feature, or stat.

Parameters:

include_ids (bool, default=False) – If True, include self.ids as the first column.
id_col (str, default="obs_id") – Column name used when exporting self.ids.
include_y (bool, default=False) – If True, include self.y as a column when present.
y_col (str, default="y") – Column name used when exporting self.y.
include_obs_coord (bool, default=False) – If True, include coords["obs"] when present.

Returns:

DataFrame containing only one-dimensional observation-aligned metadata columns.

Return type:

pandas.DataFrame

Raises:

ValueError – If the container has no obs dimension, or if include_ids is requested when self.ids is missing.
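To make the column filtering concrete, here is a rough sketch of which entries could end up in the table. It assumes that coords may hold one-dimensional, observation-aligned entries whose keys are not dimension names (as the covariates of balance() suggest). obs_columns is a hypothetical helper that returns a plain dict of columns; the real method returns a pandas.DataFrame and may apply different rules.

```python
import numpy as np


def obs_columns(dims, shape, coords, ids=None, y=None,
                include_ids=False, include_y=False):
    """Collect 1-D columns aligned to the obs axis (hypothetical helper)."""
    if "obs" not in dims:
        raise ValueError("container has no 'obs' dimension")
    n_obs = shape[dims.index("obs")]
    columns = {}
    if include_ids:
        if ids is None:
            raise ValueError("include_ids=True but ids is missing")
        columns["obs_id"] = list(ids)  # ids come first
    for name, values in coords.items():
        if name in dims:
            continue  # labels an axis (channel, time, ...), not a row
        values = np.asarray(values)
        if values.ndim == 1 and len(values) == n_obs:
            columns[name] = values.tolist()
    if include_y and y is not None:
        columns["y"] = list(y)
    return columns


cols = obs_columns(
    dims=("obs", "channel"),
    shape=(3, 2),
    coords={"channel": ["Fz", "Cz"], "sex": ["F", "M", "F"]},
    ids=["s1", "s2", "s3"],
    include_ids=True,
)
```

Here the channel coordinate is skipped because it labels another axis, while sex maps to one value per observation and is kept.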

isel(**indexers) → DataContainer

Select data by integer indices on specified dimensions.

This method is the integer-index counterpart of select. It operates directly on the dimensions of the data tensor X and slices the associated metadata alongside it, so coordinates stay aligned with the data.

Parameters:

**indexers (dict) –

Key: Dimension name (e.g., 'obs', 'channel', 'time').

Value: Integer indices to select. Can be:

List or numpy array of integers: [0, 1, 5]

Slice object: slice(0, 10)

Single integer: 0

Note: If you provide a list of indices with repeats (e.g., [0, 0, 1]), the output will be oversampled accordingly.

Returns:

A new DataContainer instance with the sliced data and coordinates.

Return type:

DataContainer

Examples

>>> # Select first 10 observations
>>> subset = container.isel(obs=slice(0, 10))

>>> # Select specific channels by index
>>> subset = container.isel(channel=[0, 5, 12])

>>> # Select time range by index
>>> subset = container.isel(time=slice(100, 200))

>>> # Bootstrap/Resample (Select index 0 five times)
>>> bootstrap = container.isel(obs=[0, 0, 0, 0, 0])
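The core idea, indexing X and the matching coordinate with the same indices, can be sketched in plain NumPy. take_along_dim is a hypothetical stand-in for illustration, not the library's implementation; in particular, it keeps the axis when given a single integer, which may differ from what isel does.

```python
import numpy as np


def take_along_dim(X, dims, coords, dim, indices):
    """Index X and the coordinate of `dim` together (illustrative only)."""
    axis = dims.index(dim)
    if isinstance(indices, slice):
        indices = np.arange(X.shape[axis])[indices]
    indices = np.atleast_1d(indices)
    new_X = np.take(X, indices, axis=axis)
    new_coords = dict(coords)
    if dim in coords:
        new_coords[dim] = np.asarray(coords[dim])[indices]
    return new_X, new_coords


X = np.arange(24).reshape(4, 3, 2)
dims = ("obs", "channel", "time")
coords = {"channel": ["Fz", "Cz", "Pz"]}

# Pick channels 0 and 2; the channel labels follow the data.
sub_X, sub_coords = take_along_dim(X, dims, coords, "channel", [0, 2])

# Repeated indices oversample: observation 0 appears twice.
boot_X, _ = take_along_dim(X, dims, coords, "obs", [0, 0, 1])
```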

balance(target: str = 'y', strategy: str = 'undersample', covariates: List[str] | None = None, random_state: int = 42, **kwargs) → DataContainer

Balance the dataset classes using undersampling or oversampling.

This method adjusts the number of observations (rows) in the container so that class counts in target are equalized. It supports simple random sampling and stratified sampling based on covariates.

Parameters:

target (str, default='y') – Name of the target variable.
- 'y': Uses self.y.
- Any other string: Looks for the variable in self.coords.
strategy ({'undersample', 'oversample', 'auto'}, default='undersample') –
- 'undersample': Downsample majority classes to match the minority class count.
- 'oversample': Upsample minority classes (with replacement) to match the majority class.
- 'auto': Heuristic choice. Uses undersampling if the undersampled dataset would retain more than 50% of the original observations, else oversampling.
covariates (list of str, optional) – Names of covariates in self.coords whose distribution should be preserved. If provided, the balancing is performed within strata defined by these covariates.
random_state (int, default=42) – Seed for the random number generator. Change this value to produce different random subsets (e.g., for bagging).
**kwargs (dict) –
Additional arguments passed to internal logic:
- n_bins (int): Number of bins for continuous covariates (default 5).
- binning (str): 'quantile' (default) or 'uniform' binning.
- prefer_clean_rows (bool): If True, weights sampling to prefer rows with fewer NaNs/artifacts.

Returns:

A new DataContainer instance with balanced classes.

Return type:

DataContainer

Examples

>>> # 1. Simple Undersampling of 'y'
>>> balanced = container.balance(strategy='undersample')

>>> # 2. Balance based on a metadata column 'condition'
>>> balanced = container.balance(target='condition')

>>> # 3. Stratified Balancing (Balance 'y' while preserving 'sex' and 'age'
>>> #    ratios)
>>> balanced = container.balance(target='y', covariates=['sex', 'age'])

>>> # 4. Iterative Bootstrapping (Different seeds)
>>> for seed in [1, 2, 3]:
...     subset = container.balance(strategy='undersample', random_state=seed)
...     # process subset...
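As a rough illustration of the 'undersample' strategy without covariates, the sketch below draws the same number of observation indices from every class, without replacement. undersample_indices is a hypothetical helper written for this example; the library's own sampling logic (stratification, binning, clean-row weighting) is not reproduced here.

```python
import numpy as np


def undersample_indices(labels, random_state=42):
    """Pick an equal number of indices per class (illustrative sketch)."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(labels, return_counts=True)
    n_keep = counts.min()  # size of the minority class
    picked = [
        rng.choice(np.flatnonzero(labels == c), size=n_keep, replace=False)
        for c in classes
    ]
    return np.sort(np.concatenate(picked))


y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])
idx = undersample_indices(y, random_state=1)
balanced_counts = np.bincount(y[idx])
```

Indices produced this way could then be passed to an integer-based selection such as isel(obs=idx).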

select(ignore_case: bool = False, fuzzy: bool = False, **selections) → DataContainer

Select data subsets based on coordinates, ids, or y.

This method supports exact matching, wildcard matching, operator-based filtering, and custom callable filters.

Parameters:

ignore_case (bool, default=False) – If True, string matching is case-insensitive (e.g., 'fz' matches 'Fz').
fuzzy (bool, default=False) – If True, uses difflib to find closest matches for string queries (e.g., 'Alpha' matches 'alpha'). Useful for handling typos.
**selections (dict) –
Key is the dimension name (or special keys 'y', 'ids'). Value is the query. Supported query types:
1. List/Array (Exact or Wildcard): Matches values present in the list. Strings can use shell-style wildcards ('*', '?').
2. Dictionary (Operator Queries): Filters numerical or string values using operators. Keys: '>', '<', '>=', '<=', '==', '!=', 'in'.
3. Callable: A function taking the coordinate array and returning a boolean mask.

Returns:

A new DataContainer instance containing the selected subset.

Return type:

DataContainer

Examples

>>> # 1. Exact Selection (Sensors)
>>> sub = container.select(channel=['Fz', 'Cz'])

>>> # 2. Wildcard Selection (All Alpha features)
>>> sub = container.select(feature='*alpha*')

>>> # 3. Range Selection (Time)
>>> sub = container.select(time={'>=': 0.1, '<': 0.5})

>>> # 4. Case-Insensitive Matching
>>> sub = container.select(channel=['fz'], ignore_case=True)

>>> # 5. Filter by Target (y)
>>> sub = container.select(y=['Patient'])

>>> # 6. Complex Logic (Subjects 1-5 via Operator)
>>> sub = container.select(subject_id={'>=': 1, '<=': 5})

>>> # 7. Stratified Selection (First 2 epochs per subject via Callable)
>>> def first_n(ids, n=2):
...     # ... logic ...
...     return mask
>>> sub = container.select(ids=first_n)
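The three query types can each be reduced to a boolean mask over a coordinate array. The sketch below shows one way to do that with fnmatch and operator from the standard library. build_mask is a hypothetical helper, fuzzy matching is left out, and the body of first_n is only one possible implementation that assumes ids shaped like 'sub-01_trial-0'.

```python
import fnmatch
import operator

import numpy as np

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge,
       "<=": operator.le, "==": operator.eq, "!=": operator.ne}


def build_mask(values, query, ignore_case=False):
    """Turn a list, operator dict, or callable query into a boolean mask."""
    values = np.asarray(values)
    if callable(query):
        return np.asarray(query(values), dtype=bool)
    if isinstance(query, dict):
        mask = np.ones(len(values), dtype=bool)
        for op, operand in query.items():  # all conditions must hold
            if op == "in":
                mask &= np.isin(values, operand)
            else:
                mask &= OPS[op](values, operand)
        return mask
    patterns = [query] if isinstance(query, str) else list(query)
    mask = np.zeros(len(values), dtype=bool)
    for i, value in enumerate(values):
        text = str(value).lower() if ignore_case else str(value)
        for pattern in patterns:
            pat = str(pattern).lower() if ignore_case else str(pattern)
            if fnmatch.fnmatchcase(text, pat):  # shell-style '*' and '?'
                mask[i] = True
    return mask


def first_n(ids, n=2):
    """Keep the first n rows per subject (subject = text before '_')."""
    seen = {}
    mask = np.zeros(len(ids), dtype=bool)
    for i, obs_id in enumerate(ids):
        subject = str(obs_id).split("_")[0]
        seen[subject] = seen.get(subject, 0) + 1
        mask[i] = seen[subject] <= n
    return mask


features = ["power_alpha", "power_beta", "age"]
times = np.array([0.0, 0.1, 0.3, 0.5])
ids = ["sub-01_trial-0", "sub-01_trial-1", "sub-01_trial-2", "sub-02_trial-0"]

case_mask = build_mask(["Fz", "Cz", "Pz"], ["fz"], ignore_case=True)
wild_mask = build_mask(features, "*alpha*")
range_mask = build_mask(times, {">=": 0.1, "<": 0.5})
call_mask = build_mask(ids, first_n)
```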

flatten(preserve: str | List[str] = 'obs') → DataContainer

Flatten dimensions NOT in preserve into a single ‘feature’ dimension.

This is useful for preparing N-dimensional data for standard 2D machine learning algorithms (scikit-learn). It automatically generates composite feature names (e.g., 'Fz_0.1') for tracking.

Parameters:

preserve (str or List[str], default='obs') –

Dimensions to keep. All other dimensions will be collapsed into a single 'feature' dimension.
- 'obs': Result shape (N_obs, N_features). The standard choice.
- ['obs', 'time']: Result shape (N_obs, N_time, N_features). Useful for time-resolved decoding.

Returns:

A new DataContainer with reshaped X and generated ‘feature’ coordinates.

Return type:

DataContainer

Examples

>>> # Flatten (10, 64, 500) -> (10, 32000)
>>> flat = container.flatten(preserve='obs')
>>> flat.shape
(10, 32000)
>>> flat.coords['feature'][0]
'Fz_0.0'

>>> # Flatten spatial only, keep time (10, 64, 500) -> (10, 500, 64)
>>> time_resolved = container.flatten(preserve=['obs', 'time'])
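In NumPy terms, flattening with preserve='obs' is a reshape, and the composite names follow the same row-major order as the collapsed axes. The sketch below assumes names are built as '<channel>_<time>', which is inferred from the 'Fz_0.0' example above; the exact formatting used by the library may differ.

```python
import itertools

import numpy as np

channels = ["Fz", "Cz", "Pz"]
times = [0.0, 0.1, 0.2, 0.3]
rng = np.random.default_rng(0)
X = rng.normal(size=(5, len(channels), len(times)))  # (obs, channel, time)

# preserve='obs': collapse channel and time into one feature axis.
flat = X.reshape(X.shape[0], -1)

# Row-major reshape means channel varies slowly and time varies fast,
# which is the order itertools.product yields.
names = [f"{ch}_{t}" for ch, t in itertools.product(channels, times)]

# preserve=['obs', 'time']: keep time as the second axis, so only the
# channel axis is left to act as features.
time_resolved = np.moveaxis(X, 2, 1)
```

Feature 5 in the flattened array corresponds to channel 'Cz' at time 0.1, that is, X[:, 1, 1].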

stack(dims: Sequence[str], new_dim: str = 'obs') → DataContainer

Stack multiple dimensions into a single new dimension.

This reshapes N-dimensional data into (N-K) dimensions by combining specified dimensions. It is useful for transforming spatiotemporal data (Trials, Channels, Time) -> (Trials*Time, Channels) for trajectory analysis.

Parameters:

dims (sequence of str) – Dimensions to stack. The order determines the nesting (slowest to fastest). For example, ('obs', 'time') means 'obs' changes slowly while 'time' cycles fast.
new_dim (str, default='obs') – Name of the resulting stacked dimension.

Returns:

New container with stacked dimension. Metadata (coords/ids) are expanded/tiled to match the new shape.

Return type:

DataContainer

Examples

>>> # Stack time into observations:
>>> # (10 obs, 64 ch, 500 time) -> (5000 obs, 64 ch)
>>> stacked = container.stack(dims=('obs', 'time'), new_dim='obs')
>>> stacked.shape
(5000, 64)
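The reshape behind this example, and the way per-observation metadata has to be expanded to the new length, can be sketched with NumPy alone. This is an illustration under the assumption that the stacked dimension becomes the first axis, as in the shape shown above; it is not the library's code.

```python
import numpy as np

n_obs, n_ch, n_time = 4, 3, 5
X = np.arange(n_obs * n_ch * n_time, dtype=float).reshape(n_obs, n_ch, n_time)
ids = np.array([f"trial-{i}" for i in range(n_obs)])
times = np.linspace(0.0, 0.4, n_time)

# Bring the stacked dims together in (slow, fast) order, then merge them.
stacked = np.transpose(X, (0, 2, 1)).reshape(n_obs * n_time, n_ch)

# Expand metadata to the new length: slow labels repeat, fast labels tile.
stacked_ids = np.repeat(ids, n_time)
stacked_times = np.tile(times, n_obs)

# Undoing the reshape recovers the original tensor, which is why an
# unstack needs the original sizes and dimension order to be remembered.
restored = stacked.reshape(n_obs, n_time, n_ch).transpose(0, 2, 1)
```

Row i * n_time + t of the stacked array holds all channels of observation i at time index t.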

unstack(dim: str) → DataContainer

Unstack a dimension into multiple dimensions.

Inverse operation of stack. Reshapes the data tensor by splitting one dimension into multiple using metadata stored during the stack operation.

Parameters: dim (str) – Dimension to unstack (e.g. 'obs').
Returns: New container with unstacked dimensions.
Return type: DataContainer
Raises: ValueError – If the container was not previously stacked (missing metadata).

Examples

>>> # Stack 'trials' and 'time' -> 'obs'
>>> stacked = container.stack(('trials', 'time'), new_dim='obs')
>>> # Unstack 'obs' -> ('trials', 'time') (automatically inferred)
>>> unstacked = stacked.unstack('obs')

center(dim: str = 'time', inplace: bool = False) → DataContainer

Remove mean along a specified dimension (Centering/Baseline Correction).

This operation computes the mean along dim (ignoring NaNs) and subtracts it. Commonly used in EEG for baseline correction (subtracting mean of pre-stimulus interval) or centering features before covariance calculation.

Parameters:

dim (str, default='time') – Dimension name to center over (e.g., ‘time’, ‘channel’, ‘obs’).
inplace (bool, default=False) – If True, modifies X in-place to save memory. Returns self.

Returns:

Container with centered data.

Return type:

DataContainer

Examples

>>> # Baseline correction over time
>>> container.center(dim='time')

zscore(dim: str = 'time', eps: float = 1e-08, inplace: bool = False) → DataContainer

Standardize (Z-score) along a specified dimension.

Computes (X - mean) / std along the given dimension. Robust to NaNs. Useful for normalizing features or standardizing temporal dynamics.

Parameters:

dim (str) – Dimension to standardize.
eps (float) – Stability epsilon to avoid division by zero.
inplace (bool) – If True, modifies X in-place and returns self, as described for center().

Return type:

DataContainer

Examples

>>> # Standardize each channel's timecourse
>>> container.zscore(dim='time')
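The computation can be written directly with NaN-aware NumPy reductions. The sketch below is a minimal illustration: zscore_along is a hypothetical function, and adding eps to the standard deviation is an assumption about how the stability term is applied.

```python
import numpy as np


def zscore_along(X, axis, eps=1e-8):
    """(X - mean) / (std + eps) along one axis, ignoring NaNs."""
    mean = np.nanmean(X, axis=axis, keepdims=True)
    std = np.nanstd(X, axis=axis, keepdims=True)
    return (X - mean) / (std + eps)


rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(4, 2, 200))  # (obs, channel, time)
X[0, 0, 10] = np.nan  # a missing sample stays NaN but does not spoil the rest

Z = zscore_along(X, axis=2)
```

With keepdims=True the mean and standard deviation broadcast back against X, so the same function works for any axis.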

rms_scale(dim: str = 'time', eps: float = 1e-08, inplace: bool = False) → DataContainer

Scale by Root Mean Square (RMS) amplitude along a dimension.

Divides data by sqrt(mean(X**2)) along the dimension. Preserves relative shape but normalizes energy.

Parameters:

dim (str) – Dimension to scale.
eps (float) – Stability epsilon.
inplace (bool) – If True, modifies X in-place and returns self, as described for center().

Return type:

DataContainer
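A short NumPy sketch shows the effect: two signals with the same shape but different amplitudes become nearly identical after RMS scaling. rms_scale_along is a hypothetical function, and adding eps to the denominator is an assumption about how the stability term is used.

```python
import numpy as np


def rms_scale_along(X, axis, eps=1e-8):
    """Divide by sqrt(mean(X**2)) along one axis."""
    rms = np.sqrt(np.mean(X ** 2, axis=axis, keepdims=True))
    return X / (rms + eps)


t = np.linspace(0, 1, 500, endpoint=False)
wave = np.sin(2 * np.pi * 5 * t)
X = np.stack([1.0 * wave, 10.0 * wave])  # same shape, 10x amplitude gap

scaled = rms_scale_along(X, axis=1)
rms_after = np.sqrt(np.mean(scaled ** 2, axis=1))
```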

baseline_correction(dim: str = 'time', inplace: bool = False) → DataContainer

Alias for center(). Common in EEG.

aggregate(by: str | numpy.ndarray | List[Any], stats: str | Sequence[str] = 'mean', min_count: int = 1, on_insufficient: str = 'raise') → DataContainer

Aggregate observations into grouped summaries along the obs axis.

Parameters:

by (str or array-like) –
Group definition for the observation axis.
- If str: resolve the key from self.coords, or from self.y when by == "y".
- If array-like: explicit group labels aligned with obs.
stats (str or sequence of str, default="mean") – Aggregation statistic or ordered list of statistics. Supported tokens are "mean", "median", "std", "var", "sem", "mad", "iqr", "min", "max", "count", and "first". Legacy "obs-*" aliases are accepted and normalized.
min_count (int, default=1) – Minimum number of valid observations required per group. A valid observation is one with at least one finite value across the non-observation axes.
on_insufficient ({"raise", "warn", "collect"}, default="raise") – Policy applied when a group has fewer than min_count valid observations.

Returns:

Aggregated container with grouped observations on the obs axis. When multiple stats are requested, a stat dimension is inserted immediately after obs.

Return type:

DataContainer

Raises:

ValueError – If the container has no obs dimension, grouping is invalid, requested stats are unsupported, or min_count / on_insufficient are invalid.
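The resulting layout, one row per group with an extra stat axis right after obs when several stats are requested, can be illustrated with NumPy. aggregate_obs below is a hypothetical sketch covering only three stats; it sorts groups with np.unique and uses NumPy's default population standard deviation, neither of which is confirmed for the library, and it ignores min_count and on_insufficient.

```python
import numpy as np


def aggregate_obs(X, labels, stats=("mean",)):
    """Group rows of X by label and reduce each group over axis 0."""
    funcs = {"mean": np.nanmean, "std": np.nanstd, "median": np.nanmedian}
    labels = np.asarray(labels)
    groups = np.unique(labels)
    out = np.stack([
        np.stack([funcs[s](X[labels == g], axis=0) for s in stats])
        for g in groups
    ])  # shape: (n_groups, n_stats, ...)
    if len(stats) == 1:
        out = out[:, 0]  # no stat axis for a single statistic
    return groups, out


X = np.arange(24, dtype=float).reshape(6, 4)  # 6 observations, 4 features
subject = np.array(["s1", "s1", "s1", "s2", "s2", "s2"])

groups, means = aggregate_obs(X, subject)
_, both = aggregate_obs(X, subject, stats=("mean", "std"))
```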

aggregate_groups(by: str | numpy.ndarray | List[Any], groups: Sequence[Dict[str, Any]], min_count: int = 1, on_insufficient: str = 'raise', skip_empty: bool = True) → DataContainer

Aggregate selected feature groups with different statistics.

This is a thin wrapper around aggregate() for tabular feature containers. Each group spec selects a subset of feature columns and applies one or more stats to that subset. The outputs are concatenated along the feature dimension, and each resulting feature name is prefixed with its stat (for example "mean_band_log_abs_alpha").

Parameters:

by (str or array-like) – Group definition for the observation axis. Passed through to aggregate().
groups (sequence of dict) –
Ordered group specifications. Each group must provide "stats" and may optionally provide include/exclude selectors:
- names / exclude_names
- prefixes / exclude_prefixes
- suffixes / exclude_suffixes
- contains / exclude_contains
- regex / exclude_regex
If a group provides no include selectors, it starts from all features and then applies exclusions.
min_count (int, default=1) – Minimum number of valid observations required per group. Passed through to aggregate().
on_insufficient ({"raise", "warn", "collect"}, default="raise") – Policy applied when a group has fewer than min_count valid observations. Passed through to aggregate().
skip_empty (bool, default=True) – If True, silently skip group specs that match no features. If False, raise a ValueError when a group matches nothing.

Returns:

Aggregated container with dims ("obs", "feature") and stat-prefixed feature names.

Return type:

DataContainer

Raises:

ValueError – If the container lacks a feature dimension or coord, no groups are provided, a group spec is invalid, multiple groups would emit the same output feature name, or no non-empty grouped outputs are produced.
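To show how group specs might translate into output feature names, the sketch below resolves include/exclude selectors against a list of feature names and then prefixes each match with its stat. It rests on several assumptions: selector values are lists of strings (regex being a single pattern), include selectors are combined with OR, and outputs are ordered by group, then stat, then feature. resolve_features is a hypothetical helper, not the library's code.

```python
import re


def resolve_features(names, spec):
    """Apply include selectors, then exclusions (hypothetical sketch)."""
    include_keys = ("names", "prefixes", "suffixes", "contains", "regex")
    if any(k in spec for k in include_keys):
        selected = [
            n for n in names
            if n in spec.get("names", ())
            or any(n.startswith(p) for p in spec.get("prefixes", ()))
            or any(n.endswith(s) for s in spec.get("suffixes", ()))
            or any(c in n for c in spec.get("contains", ()))
            or ("regex" in spec and re.search(spec["regex"], n))
        ]
    else:
        selected = list(names)  # no include selectors: start from everything
    return [
        n for n in selected
        if n not in spec.get("exclude_names", ())
        and not any(n.startswith(p) for p in spec.get("exclude_prefixes", ()))
        and not any(n.endswith(s) for s in spec.get("exclude_suffixes", ()))
        and not any(c in n for c in spec.get("exclude_contains", ()))
        and not ("exclude_regex" in spec
                 and re.search(spec["exclude_regex"], n))
    ]


features = ["band_log_abs_alpha", "band_log_abs_beta", "age", "response_time"]
groups = [
    {"prefixes": ["band_"], "stats": ["mean", "std"]},
    {"exclude_prefixes": ["band_"], "stats": ["first"]},
]

output_names = [
    f"{stat}_{name}"
    for spec in groups
    for stat in spec["stats"]
    for name in resolve_features(features, spec)
]
```

With these specs the band features are summarized by mean and std, while the remaining columns keep their first value per group.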