coco_pipe.io.structures ======================= .. py:module:: coco_pipe.io.structures .. autoapi-nested-parse:: Data Structures =============== Standardized containers for passing data between Datasets, Preprocessing, and main modules. This module provides the `DataContainer`, an N-dimensional tensor wrapper that manages metadata, coordinates, and labels alongside the raw data matrix. It serves as the common currency for the entire pipeline. .. rubric:: Examples >>> import numpy as np >>> from coco_pipe.io import DataContainer # 1. Creating a container for EEG Epochs (N_epochs, N_channels, N_time) >>> X = np.random.randn(10, 64, 500) >>> container = DataContainer( ... X=X, ... dims=('obs', 'channel', 'time'), ... coords={ ... 'channel': ['Fz', 'Cz', 'Pz'], # ... etc ... 'time': np.linspace(0, 1.0, 500) ... }, ... y=np.random.randint(0, 2, 10), ... ids=[f'sub-01_trial-{i}' for i in range(10)] ... ) # 2. Creating a container for simple Tabular Features (N_subjects, N_features) >>> X_tab = np.random.randn(20, 5) >>> container_tab = DataContainer( ... X=X_tab, ... dims=('obs', 'feature'), ... coords={'feature': ['age', 'IQ', 'response_time', 'power_alpha', 'power_beta']} ... ) Attributes ---------- .. autoapisummary:: coco_pipe.io.structures.logger Classes ------- .. autoapisummary:: coco_pipe.io.structures.DataContainer Module Contents --------------- .. py:data:: logger .. py:class:: DataContainer Generic container for N-dimensional neurophysiological data. Acts as a lightweight labelled array (like xarray but simpler), managing dimensions, coordinates, and associated target labels (y) and IDs. .. attribute:: X The primary data tensor. Shape must match `dims`. :type: np.ndarray .. attribute:: dims Labels for each dimension of X. Examples: ('obs', 'feature'), ('obs', 'channel', 'time'). Note: The 'obs' dimension is special and typically represents independent samples. :type: Tuple[str, ...] .. attribute:: coords Coordinates/Labels for dimensions. 
Keys must be in `dims`. Values must match the length of the corresponding dimension in X. :type: Dict[str, Union[List, np.ndarray]] .. attribute:: y Target labels corresponding to the 'obs' dimension. Used for supervised learning or coloring plots. :type: Optional[np.ndarray], optional .. attribute:: ids Identifiers for observations (e.g., subject IDs, trial names). Should correspond to 'obs' dim in coords if provided. Kept separate from coords for convenient tracking. :type: Optional[np.ndarray], optional .. attribute:: meta Arbitrary metadata (sfreq, units, source path, etc). :type: Dict[str, Any] .. rubric:: Examples Accessing data: >>> container.X.shape (10, 64, 500) Accessing coordinates: >>> container.coords['channel'][:3] ['Fz', 'Cz', 'Pz'] .. py:attribute:: X :type: numpy.ndarray .. py:attribute:: dims :type: Tuple[str, Ellipsis] .. py:attribute:: coords :type: Dict[str, Union[List, numpy.ndarray, Sequence]] .. py:attribute:: y :type: Optional[numpy.ndarray] :value: None .. py:attribute:: ids :type: Optional[numpy.ndarray] :value: None .. py:attribute:: meta :type: Dict[str, Any] .. py:method:: __post_init__() .. py:property:: shape :type: Tuple[int, Ellipsis] .. py:method:: save(path: Union[str, Any]) -> None Save the DataContainer to disk using joblib. :param path: Destination file path. :type path: str or Path .. py:method:: load(path: Union[str, Any]) -> DataContainer :classmethod: Load a DataContainer from disk. :param path: Source file path. :type path: str or Path :rtype: DataContainer .. py:method:: __repr__() -> str .. py:method:: obs_table(include_ids: bool = False, id_col: str = 'obs_id', include_y: bool = False, y_col: str = 'y', include_obs_coord: bool = False) -> pandas.DataFrame Return one-dimensional coordinates aligned to the observation axis. This helper is useful when exporting a row-wise table from a container. 
It only materializes metadata that can map cleanly to one row per observation, skipping coordinates that belong to other axes such as ``channel``, ``time``, ``feature``, or ``stat``. :param include_ids: If True, include ``self.ids`` as the first column. :type include_ids: bool, default=False :param id_col: Column name used when exporting ``self.ids``. :type id_col: str, default="obs_id" :param include_y: If True, include ``self.y`` as a column when present. :type include_y: bool, default=False :param y_col: Column name used when exporting ``self.y``. :type y_col: str, default="y" :param include_obs_coord: If True, include ``coords["obs"]`` when present. :type include_obs_coord: bool, default=False :returns: DataFrame containing only one-dimensional observation-aligned metadata columns. :rtype: pandas.DataFrame :raises ValueError: If the container has no ``obs`` dimension, or if ``include_ids`` is requested when ``self.ids`` is missing. .. py:method:: isel(**indexers) -> DataContainer Select data by integer indices on specified dimensions. This method is the integer-index equivalent of `select`. It operates directly on the dimensions of the data tensor `X`. It is robust and handles metadata splitting/alignment automatically. :param \*\*indexers: Key: Dimension name (e.g., 'obs', 'channel', 'time'). Value: Integer indices to select. Can be: - List or numpy array of integers: [0, 1, 5] - Slice object: slice(0, 10) - Single integer: 0 Note: If you provide a list of indices with repeats (e.g., [0, 0, 1]), the output will be oversampled accordingly. :type \*\*indexers: dict :returns: A new DataContainer instance with the sliced data and coordinates. :rtype: DataContainer .. 
rubric:: Examples >>> # Select first 10 observations >>> subset = container.isel(obs=slice(0, 10)) >>> # Select specific channels by index >>> subset = container.isel(channel=[0, 5, 12]) >>> # Select time range by index >>> subset = container.isel(time=slice(100, 200)) >>> # Bootstrap/Resample (Select index 0 five times) >>> bootstrap = container.isel(obs=[0, 0, 0, 0, 0]) .. py:method:: balance(target: str = 'y', strategy: str = 'undersample', covariates: Optional[List[str]] = None, random_state: int = 42, **kwargs) -> DataContainer Balance the dataset classes using undersampling or oversampling. This method adjusts the number of observations (rows) in the container so that class counts in `target` are equalized. It supports simple random sampling and stratified sampling based on covariates. :param target: Name of the target variable. - 'y': Uses `self.y`. - Any other string: Looks for the variable in `self.coords`. :type target: str, default='y' :param strategy: - 'undersample': Downsample majority classes to match the minority class count. - 'oversample': Upsample minority classes (with replacement) to match the majority class. - 'auto': Heuristic choice. Uses undersampling if total size remains > 50% of original, else oversampling. :type strategy: {'undersample', 'oversample', 'auto'}, default='undersample' :param covariates: List of covariate names in `self.coords` to preserve distribution of. If provided, the balancing is performed *within* strata defined by these covariates. :type covariates: list of str, optional :param random_state: Seed for the random number generator. Change this value to produce different random subsets (e.g., for bagging). :type random_state: int, default=42 :param \*\*kwargs: Additional arguments passed to internal logic: - n_bins (int): Number of bins for continuous covariates (default 5). - binning (str): 'quantile' (default) or 'uniform' binning. 
- prefer_clean_rows (bool): If True, weights sampling to prefer rows with fewer NaNs/artifacts. :type \*\*kwargs: dict :returns: A new DataContainer instance with balanced classes. :rtype: DataContainer .. rubric:: Examples >>> # 1. Simple Undersampling of 'y' >>> balanced = container.balance(strategy='undersample') >>> # 2. Balance based on a metadata column 'condition' >>> balanced = container.balance(target='condition') >>> # 3. Stratified Balancing (Balance 'y' while preserving 'sex' and 'age' >>> # ratios) >>> balanced = container.balance(target='y', covariates=['sex', 'age']) >>> # 4. Iterative Bootstrapping (Different seeds) >>> for seed in [1, 2, 3]: ... subset = container.balance(strategy='undersample', random_state=seed) ... # process subset... .. py:method:: select(ignore_case: bool = False, fuzzy: bool = False, **selections) -> DataContainer Select data subsets based on coordinates, ids, or y. This method supports exact matching, wildcard matching, operator-based filtering, and custom callable filters. :param ignore_case: If True, string matching is case-insensitive (e.g., 'fz' matches 'Fz'). :type ignore_case: bool, default=False :param fuzzy: If True, uses `difflib` to find closest matches for string queries (e.g., 'Alpha' matches 'alpha'). Useful for handling typos. :type fuzzy: bool, default=False :param \*\*selections: Key is the dimension name (or special keys 'y', 'ids'). Value is the query. Supported query types: 1. **List/Array (Exact or Wildcard)**: Matches values present in the list. Strings can use shell-style wildcards ('*', '?'). 2. **Dictionary (Operator Queries)**: Filters numerical or string values using operators. Keys: '>', '<', '>=', '<=', '==', '!=', 'in'. 3. **Callable**: A function taking the coordinate array and returning a boolean mask. :type \*\*selections: dict :returns: A new DataContainer instance containing the selected subset. :rtype: DataContainer .. rubric:: Examples >>> # 1. 
Exact Selection (Sensors) >>> sub = container.select(channel=['Fz', 'Cz']) >>> # 2. Wildcard Selection (All Alpha features) >>> sub = container.select(feature='*alpha*') >>> # 3. Range Selection (Time) >>> sub = container.select(time={'>=': 0.1, '<': 0.5}) >>> # 4. Case-Insensitive Matching >>> sub = container.select(channel=['fz'], ignore_case=True) >>> # 5. Filter by Target (y) >>> sub = container.select(y=['Patient']) >>> # 6. Complex Logic (Subjects 1-5 via Operator) >>> sub = container.select(subject_id={'>=': 1, '<=': 5}) >>> # 7. Stratified Selection (First 2 epochs per subject via Callable) >>> def first_n(ids, n=2): ... # ... logic ... ... return mask >>> sub = container.select(ids=first_n) .. py:method:: flatten(preserve: Union[str, List[str]] = 'obs') -> DataContainer Flatten dimensions NOT in `preserve` into a single 'feature' dimension. This is useful for preparing N-dimensional data for standard 2D machine learning algorithms (scikit-learn). It automatically generates composite feature names (e.g., 'Fz_0.1s') for tracking. :param preserve: Dimensions to keep. All other dimensions will be collapsed into a single 'feature' dimension. - 'obs': Result shape (N_obs, N_features). Standard specification. - ['obs', 'time']: Result shape (N_obs, N_time, N_features). Useful for time-resolved decoding distributions. :type preserve: str or List[str], default='obs' :returns: A new DataContainer with reshaped X and generated 'feature' coordinates. :rtype: DataContainer .. rubric:: Examples >>> # Flatten (10, 64, 500) -> (10, 32000) >>> flat = container.flatten(preserve='obs') >>> flat.shape (10, 32000) >>> flat.coords['feature'][0] 'Fz_0.0' >>> # Flatten spatial only, keep time (10, 64, 500) -> (10, 500, 64) >>> time_resolved = container.flatten(preserve=['obs', 'time']) .. py:method:: stack(dims: Sequence[str], new_dim: str = 'obs') -> DataContainer Stack multiple dimensions into a single new dimension. 
This reshapes N-dimensional data into (N-K) dimensions by combining specified dimensions. It is useful for transforming spatiotemporal data (Trials, Channels, Time) -> (Trials*Time, Channels) for trajectory analysis. :param dims: Dimensions to stack. The order determines the nesting (slowest to fastest). e.g., ('obs', 'time') means 'obs' changes slowly, 'time' cycles fast. :type dims: sequence of str :param new_dim: Name of the resulting stacked dimension. :type new_dim: str, default='obs' :returns: New container with stacked dimension. Metadata (coords/ids) are expanded/tiled to match the new shape. :rtype: DataContainer .. rubric:: Examples >>> # Stack time into observations: >>> # (10 obs, 64 ch, 500 time) -> (5000 obs, 64 ch) >>> stacked = container.stack(dims=('obs', 'time'), new_dim='obs') >>> stacked.shape (5000, 64) .. py:method:: unstack(dim: str) -> DataContainer Unstack a dimension into multiple dimensions. Inverse operation of `stack`. Reshapes the data tensor by splitting one dimension into multiple using metadata stored during the `stack` operation. :param dim: Dimension to unstack (e.g. 'obs'). :type dim: str :returns: New container with unstacked dimensions. :rtype: DataContainer :raises ValueError: If the container was not previously stacked (missing metadata). .. rubric:: Examples >>> # Stack 'trials' and 'time' -> 'obs' >>> stacked = container.stack(('trials', 'time'), new_dim='obs') >>> # Unstack 'obs' -> ('trials', 'time') (automatically inferred) >>> unstacked = stacked.unstack('obs') .. py:method:: center(dim: str = 'time', inplace: bool = False) -> DataContainer Remove mean along a specified dimension (Centering/Baseline Correction). This operation computes the mean along `dim` (ignoring NaNs) and subtracts it. Commonly used in EEG for baseline correction (subtracting mean of pre-stimulus interval) or centering features before covariance calculation. :param dim: Dimension name to center over (e.g., 'time', 'channel', 'obs'). 
:type dim: str, default='time' :param inplace: If True, modifies X in-place to save memory. Returns self. :type inplace: bool, default=False :returns: Container with centered data. :rtype: DataContainer .. rubric:: Examples >>> # Baseline correction over time >>> container.center(dim='time') .. py:method:: zscore(dim: str = 'time', eps: float = 1e-08, inplace: bool = False) -> DataContainer Standardize (Z-score) along a specified dimension. Computes `(X - mean) / std` along the given dimension. Robust to NaNs. Useful for normalizing features or standardizing temporal dynamics. :param dim: Dimension to standardize. :type dim: str, default='time' :param eps: Stability epsilon to avoid division by zero. :type eps: float, default=1e-08 :param inplace: If True, modifies X in-place instead of working on a copy. :type inplace: bool, default=False :rtype: DataContainer .. rubric:: Examples >>> # Standardize each channel's timecourse >>> container.zscore(dim='time') .. py:method:: rms_scale(dim: str = 'time', eps: float = 1e-08, inplace: bool = False) -> DataContainer Scale by Root Mean Square (RMS) amplitude along a dimension. Divides data by `sqrt(mean(X**2))` along the dimension. Preserves relative shape but normalizes energy. :param dim: Dimension to scale. :type dim: str, default='time' :param eps: Stability epsilon to avoid division by zero. :type eps: float, default=1e-08 :param inplace: If True, modifies X in-place instead of working on a copy. :type inplace: bool, default=False :rtype: DataContainer .. py:method:: baseline_correction(dim: str = 'time', inplace: bool = False) -> DataContainer Alias for :meth:`center`, under the name commonly used in EEG. .. py:method:: aggregate(by: Union[str, numpy.ndarray, List[Any]], stats: Union[str, Sequence[str]] = 'mean', min_count: int = 1, on_insufficient: str = 'raise') -> DataContainer Aggregate observations into grouped summaries along the ``obs`` axis. :param by: Group definition for the observation axis. - If str: resolve the key from ``self.coords`` or from ``self.y`` when ``by == "y"``. - If array-like: explicit group labels aligned with ``obs``. :type by: str or array-like :param stats: Aggregation statistic or ordered list of statistics. 
Supported tokens are ``"mean"``, ``"median"``, ``"std"``, ``"var"``, ``"sem"``, ``"mad"``, ``"iqr"``, ``"min"``, ``"max"``, ``"count"``, and ``"first"``. Legacy ``"obs-*"`` aliases are accepted and normalized. :type stats: str or sequence of str, default="mean" :param min_count: Minimum number of valid observations required per group. A valid observation is one with at least one finite value across the non-observation axes. :type min_count: int, default=1 :param on_insufficient: Policy applied when a group has fewer than ``min_count`` valid observations. :type on_insufficient: {"raise", "warn", "collect"}, default="raise" :returns: Aggregated container with grouped observations on the ``obs`` axis. When multiple stats are requested, a ``stat`` dimension is inserted immediately after ``obs``. :rtype: DataContainer :raises ValueError: If the container has no ``obs`` dimension, grouping is invalid, requested stats are unsupported, or ``min_count`` / ``on_insufficient`` are invalid. .. py:method:: aggregate_groups(by: Union[str, numpy.ndarray, List[Any]], groups: Sequence[Dict[str, Any]], min_count: int = 1, on_insufficient: str = 'raise', skip_empty: bool = True) -> DataContainer Aggregate selected feature groups with different statistics. This is a thin wrapper around :meth:`aggregate` for tabular feature containers. Each group spec selects a subset of feature columns and applies one or more stats to that subset. The outputs are concatenated along the ``feature`` dimension, and each resulting feature name is prefixed with its stat (for example ``"mean_band_log_abs_alpha"``). :param by: Group definition for the observation axis. Passed through to :meth:`aggregate`. :type by: str or array-like :param groups: Ordered group specifications. 
Each group must provide ``"stats"`` and may optionally provide include/exclude selectors: - ``names`` / ``exclude_names`` - ``prefixes`` / ``exclude_prefixes`` - ``suffixes`` / ``exclude_suffixes`` - ``contains`` / ``exclude_contains`` - ``regex`` / ``exclude_regex`` If a group provides no include selectors, it starts from all features and then applies exclusions. :type groups: sequence of dict :param min_count: Minimum number of valid observations required per group. Passed through to :meth:`aggregate`. :type min_count: int, default=1 :param on_insufficient: Policy applied when a group has fewer than ``min_count`` valid observations. Passed through to :meth:`aggregate`. :type on_insufficient: {"raise", "warn", "collect"}, default="raise" :param skip_empty: If True, silently skip group specs that match no features. If False, raise a ``ValueError`` when a group matches nothing. :type skip_empty: bool, default=True :returns: Aggregated container with dims ``("obs", "feature")`` and stat-prefixed feature names. :rtype: DataContainer :raises ValueError: If the container lacks a ``feature`` dimension or coord, no groups are provided, a group spec is invalid, multiple groups would emit the same output feature name, or no non-empty grouped outputs are produced.
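.. rubric:: Illustrative sketch

The grouped, stat-prefixed aggregation described for ``aggregate_groups`` can be pictured with plain pandas. The block below is a minimal sketch of the idea only; it is not how ``coco_pipe`` implements the method and it does not call the library. The data, the ``subject`` labels, and the helper ``aggregate_by_prefix`` are hypothetical, and only the ``prefixes`` selector and the ``stats`` key from the group spec are mimicked.

```python
import numpy as np
import pandas as pd

# Hypothetical tabular data: 6 observations (rows), 3 features (columns).
features = pd.DataFrame(
    {
        "band_alpha": [1.0, 3.0, 5.0, 2.0, 4.0, 6.0],
        "band_beta": [2.0, 2.0, 8.0, 1.0, 1.0, 4.0],
        "age": [30.0, 30.0, 30.0, 41.0, 41.0, 41.0],
    }
)
# Hypothetical group labels aligned with the observation axis (the `by` argument).
subject = np.array(["s1", "s1", "s1", "s2", "s2", "s2"])

# Each spec selects feature columns by prefix and lists the stats to apply.
group_specs = [
    {"prefixes": ["band_"], "stats": ["mean", "median"]},
    {"prefixes": ["age"], "stats": ["first"]},
]


def aggregate_by_prefix(frame, by, specs):
    """Group rows by `by`, then apply each spec's stats to its matched columns.

    Hypothetical helper for illustration; output columns are named
    "<stat>_<feature>", following the naming described in the docs above.
    """
    grouped = frame.groupby(by, sort=True)
    pieces = []
    for spec in specs:
        prefixes = tuple(spec["prefixes"])
        cols = [c for c in frame.columns if c.startswith(prefixes)]
        if not cols:
            continue  # comparable to the documented skip_empty=True behaviour
        for stat in spec["stats"]:
            out = grouped[cols].agg(stat)
            out.columns = [f"{stat}_{c}" for c in cols]
            pieces.append(out)
    # Concatenate all per-stat blocks along the feature (column) axis.
    return pd.concat(pieces, axis=1)


table = aggregate_by_prefix(features, subject, group_specs)
print(table)
```

The result has one row per group and one column per (stat, feature) pair, which corresponds to the ``("obs", "feature")`` layout named in the return description. Duplicate output names, ``min_count`` handling, and the other selectors (``suffixes``, ``contains``, ``regex``, and the ``exclude_*`` variants) are deliberately left out of this sketch.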