coco_pipe.io.structures
=======================

.. py:module:: coco_pipe.io.structures

.. autoapi-nested-parse::

   Data Structures
   ===============

   Standardized containers for passing data between Datasets, Preprocessing, and main
   modules.

   This module provides the `DataContainer`, an N-dimensional tensor wrapper that manages
   metadata, coordinates, and labels alongside the raw data matrix. It serves as the
   common currency for the entire pipeline.

   .. rubric:: Examples

   >>> import numpy as np
   >>> from coco_pipe.io import DataContainer

   >>> # 1. Creating a container for EEG Epochs (N_epochs, N_channels, N_time)
   >>> X = np.random.randn(10, 64, 500)
   >>> container = DataContainer(
   ...     X=X,
   ...     dims=('obs', 'channel', 'time'),
   ...     coords={
   ...         'channel': ['Fz', 'Cz', 'Pz'] + [f'ch{i}' for i in range(3, 64)],  # placeholder names
   ...         'time': np.linspace(0, 1.0, 500)
   ...     },
   ...     y=np.random.randint(0, 2, 10),
   ...     ids=[f'sub-01_trial-{i}' for i in range(10)]
   ... )

   >>> # 2. Creating a container for simple Tabular Features (N_subjects, N_features)
   >>> X_tab = np.random.randn(20, 5)
   >>> container_tab = DataContainer(
   ...     X=X_tab,
   ...     dims=('obs', 'feature'),
   ...     coords={'feature': ['age', 'IQ', 'response_time', 'power_alpha', 'power_beta']}
   ... )


Attributes
----------

.. autoapisummary::

   coco_pipe.io.structures.logger


Classes
-------

.. autoapisummary::

   coco_pipe.io.structures.DataContainer


Module Contents
---------------

.. py:data:: logger

.. py:class:: DataContainer

   Generic container for N-dimensional neurophysiological data.

   Acts as a lightweight labelled array (like xarray but simpler), managing
   dimensions, coordinates, and associated target labels (y) and IDs.

   .. attribute:: X

      The primary data tensor. Shape must match `dims`.

      :type: np.ndarray

   .. attribute:: dims

      Labels for each dimension of X.
      Examples: ('obs', 'feature'), ('obs', 'channel', 'time').
      Note: The 'obs' dimension is special and typically represents independent
      samples.

      :type: Tuple[str, ...]

   .. attribute:: coords

      Coordinates/Labels for dimensions. Keys must be in `dims`.
      Values must match the length of the corresponding dimension in X.

      :type: Dict[str, Union[List, np.ndarray]]

   .. attribute:: y

      Target labels corresponding to the 'obs' dimension.
      Used for supervised learning or coloring plots.

      :type: Optional[np.ndarray], optional

   .. attribute:: ids

      Identifiers for observations (e.g., subject IDs, trial names).
      Should correspond to 'obs' dim in coords if provided.
      Kept separate from coords for convenient tracking.

      :type: Optional[np.ndarray], optional

   .. attribute:: meta

      Arbitrary metadata (sfreq, units, source path, etc).

      :type: Dict[str, Any]
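
   The shape rules above can be illustrated with a small NumPy sketch. The
   helper ``check_layout`` is hypothetical and is not part of ``coco_pipe``;
   it only checks coordinates whose key names a dimension, and the actual
   validation performed in ``__post_init__`` may differ.

   ```python
   import numpy as np

   def check_layout(X, dims, coords):
       """Hypothetical helper: verify that dims and coords describe X."""
       if X.ndim != len(dims):
           raise ValueError(f"X has {X.ndim} axes but dims names {len(dims)}")
       for name, values in coords.items():
           if name not in dims:
               continue  # assumption: non-dimension keys are not checked here
           expected = X.shape[dims.index(name)]
           if len(values) != expected:
               raise ValueError(
                   f"coord {name!r} has {len(values)} entries, expected {expected}"
               )
       return True

   X = np.zeros((4, 3, 5))
   dims = ("obs", "channel", "time")
   coords = {"channel": ["Fz", "Cz", "Pz"], "time": np.linspace(0.0, 1.0, 5)}
   layout_ok = check_layout(X, dims, coords)
   ```

   A coordinate list that is shorter than its axis, such as three channel
   names for 64 channels, would fail this check.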

   .. rubric:: Examples

   Accessing data:

   >>> container.X.shape
   (10, 64, 500)

   Accessing coordinates:

   >>> container.coords['channel'][:3]
   ['Fz', 'Cz', 'Pz']


   .. py:attribute:: X
      :type:  numpy.ndarray


   .. py:attribute:: dims
      :type:  Tuple[str, Ellipsis]


   .. py:attribute:: coords
      :type:  Dict[str, Union[List, numpy.ndarray, Sequence]]


   .. py:attribute:: y
      :type:  Optional[numpy.ndarray]
      :value: None


   .. py:attribute:: ids
      :type:  Optional[numpy.ndarray]
      :value: None


   .. py:attribute:: meta
      :type:  Dict[str, Any]


   .. py:method:: __post_init__()


   .. py:property:: shape
      :type: Tuple[int, Ellipsis]


   .. py:method:: save(path: Union[str, Any]) -> None

      Save the DataContainer to disk using joblib.

      :param path: Destination file path.
      :type path: str or Path


   .. py:method:: load(path: Union[str, Any]) -> DataContainer
      :classmethod:


      Load a DataContainer from disk.

      :param path: Source file path.
      :type path: str or Path

      :rtype: DataContainer


   .. py:method:: __repr__() -> str


   .. py:method:: obs_table(include_ids: bool = False, id_col: str = 'obs_id', include_y: bool = False, y_col: str = 'y', include_obs_coord: bool = False) -> pandas.DataFrame

      Return one-dimensional coordinates aligned to the observation axis.

      This helper is useful when exporting a row-wise table from a container.
      It only materializes metadata that can map cleanly to one row per
      observation, skipping coordinates that belong to other axes such as
      ``channel``, ``time``, ``feature``, or ``stat``.

      :param include_ids: If True, include ``self.ids`` as the first column.
      :type include_ids: bool, default=False
      :param id_col: Column name used when exporting ``self.ids``.
      :type id_col: str, default="obs_id"
      :param include_y: If True, include ``self.y`` as a column when present.
      :type include_y: bool, default=False
      :param y_col: Column name used when exporting ``self.y``.
      :type y_col: str, default="y"
      :param include_obs_coord: If True, include ``coords["obs"]`` when present.
      :type include_obs_coord: bool, default=False

      :returns: DataFrame containing only one-dimensional observation-aligned
                metadata columns.
      :rtype: pandas.DataFrame

      :raises ValueError: If the container has no ``obs`` dimension, or if ``include_ids`` is
          requested when ``self.ids`` is missing.
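
      The filtering described above might look roughly like the following
      pandas sketch. The rule used here (keep one-dimensional coordinates that
      do not name another axis and have one value per observation) is an
      assumption made for illustration, and the ``condition`` coordinate is
      invented.

      ```python
      import numpy as np
      import pandas as pd

      n_obs = 3
      dims = ("obs", "channel")
      coords = {
          "channel": ["Fz", "Cz"],                          # belongs to another axis: skipped
          "condition": np.array(["rest", "task", "rest"]),  # one value per observation: kept
      }
      ids = np.array(["s1", "s2", "s3"])

      columns = {"obs_id": ids}  # mirrors include_ids=True with the default id_col
      for name, values in coords.items():
          values = np.asarray(values)
          if name not in dims and values.ndim == 1 and len(values) == n_obs:
              columns[name] = values
      table = pd.DataFrame(columns)
      ```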


   .. py:method:: isel(**indexers) -> DataContainer

      Select data by integer indices on specified dimensions.

      This method is the integer-index equivalent of `select`. It operates
      directly on the dimensions of the data tensor `X` and slices the associated
      metadata so that it stays aligned with the selected data.

      :param \*\*indexers: Key: Dimension name (e.g., 'obs', 'channel', 'time').
                           Value: Integer indices to select. Can be:
                               - List or numpy array of integers: [0, 1, 5]
                               - Slice object: slice(0, 10)
                               - Single integer: 0

                           Note: If you provide a list of indices with repeats (e.g., [0, 0, 1]),
                           the output will be oversampled accordingly.
      :type \*\*indexers: dict

      :returns: A new DataContainer instance with the sliced data and coordinates.
      :rtype: DataContainer

      .. rubric:: Examples

      >>> # Select first 10 observations
      >>> subset = container.isel(obs=slice(0, 10))

      >>> # Select specific channels by index
      >>> subset = container.isel(channel=[0, 5, 12])

      >>> # Select time range by index
      >>> subset = container.isel(time=slice(100, 200))

      >>> # Bootstrap/Resample (Select index 0 five times)
      >>> bootstrap = container.isel(obs=[0, 0, 0, 0, 0])
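
      As a rough mental model, indexing by dimension name can be sketched with
      plain NumPy. ``isel_sketch`` is a hypothetical stand-in, not the library
      implementation; in particular, keeping the axis when a single integer is
      given is an assumption.

      ```python
      import numpy as np

      def isel_sketch(X, dims, coords, **indexers):
          """Hypothetical stand-in for isel: index X and its coords by dimension name."""
          new_coords = {k: np.asarray(v) for k, v in coords.items()}
          for name, idx in indexers.items():
              axis = dims.index(name)
              if isinstance(idx, slice):
                  idx = np.arange(X.shape[axis])[idx]
              idx = np.atleast_1d(idx)  # assumption: a single integer keeps its axis
              X = np.take(X, idx, axis=axis)
              if name in new_coords:
                  new_coords[name] = new_coords[name][idx]
          return X, new_coords

      X = np.arange(24).reshape(4, 3, 2)
      dims = ("obs", "channel", "time")
      coords = {"channel": ["Fz", "Cz", "Pz"]}
      sub, sub_coords = isel_sketch(X, dims, coords, obs=[0, 0, 1], channel=slice(0, 2))
      ```

      Repeating index 0 duplicates that observation in the output, which is
      the oversampling behaviour noted for repeated indices.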


   .. py:method:: balance(target: str = 'y', strategy: str = 'undersample', covariates: Optional[List[str]] = None, random_state: int = 42, **kwargs) -> DataContainer

      Balance the dataset classes using undersampling or oversampling.

      This method adjusts the number of observations (rows) in the container
      so that class counts in `target` are equalized. It supports simple
      random sampling and stratified sampling based on covariates.

      :param target: Name of the target variable.
                     - 'y': Uses `self.y`.
                     - Any other string: Looks for the variable in `self.coords`.
      :type target: str, default='y'
      :param strategy:
                       - 'undersample': Downsample majority classes to match the minority
                         class count.
                       - 'oversample': Upsample minority classes (with replacement) to match
                         the majority class.
                       - 'auto': Heuristic choice. Uses undersampling if total size remains >
                         50% of original, else oversampling.
      :type strategy: {'undersample', 'oversample', 'auto'}, default='undersample'
      :param covariates: List of covariate names in `self.coords` to preserve distribution of.
                         If provided, the balancing is performed *within* strata defined by these
                         covariates.
      :type covariates: list of str, optional
      :param random_state: Seed for the random number generator.
                           Change this value to produce different random subsets (e.g., for bagging).
      :type random_state: int, default=42
      :param \*\*kwargs: Additional arguments passed to internal logic:
                         - n_bins (int): Number of bins for continuous covariates (default 5).
                         - binning (str): 'quantile' (default) or 'uniform' binning.
                         - prefer_clean_rows (bool): If True, weighs sampling to prefer rows
                           with fewer NaNs/artifacts.
      :type \*\*kwargs: dict

      :returns: A new DataContainer instance with balanced classes.
      :rtype: DataContainer

      .. rubric:: Examples

      >>> # 1. Simple Undersampling of 'y'
      >>> balanced = container.balance(strategy='undersample')

      >>> # 2. Balance based on a metadata column 'condition'
      >>> balanced = container.balance(target='condition')

      >>> # 3. Stratified Balancing (Balance 'y' while preserving 'sex' and 'age'
      >>> #    ratios)
      >>> balanced = container.balance(target='y', covariates=['sex', 'age'])

      >>> # 4. Iterative Bootstrapping (Different seeds)
      >>> for seed in [1, 2, 3]:
      ...     subset = container.balance(strategy='undersample', random_state=seed)
      ...     # process subset...
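
      The core idea of random undersampling can be sketched in NumPy as
      follows. ``undersample_indices`` is a hypothetical helper that ignores
      covariates and the other keyword options; it is not the library's
      internal logic.

      ```python
      import numpy as np

      def undersample_indices(y, random_state=42):
          """Hypothetical helper: pick equal-sized random subsets of each class."""
          y = np.asarray(y)
          rng = np.random.default_rng(random_state)
          classes, counts = np.unique(y, return_counts=True)
          n_keep = counts.min()  # size of the smallest class
          keep = [
              rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
              for c in classes
          ]
          return np.sort(np.concatenate(keep))

      y = np.array([0, 0, 0, 0, 0, 1, 1])
      idx = undersample_indices(y, random_state=0)
      ```

      Indices of this kind could then be passed to ``isel(obs=idx)`` to build
      the balanced subset.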


   .. py:method:: select(ignore_case: bool = False, fuzzy: bool = False, **selections) -> DataContainer

      Select data subsets based on coordinates, ids, or y.

      This method supports exact matching, wildcard matching, operator-based
      filtering, and custom callable filters.

      :param ignore_case: If True, string matching is case-insensitive (e.g., 'fz' matches 'Fz').
      :type ignore_case: bool, default=False
      :param fuzzy: If True, uses `difflib` to find closest matches for string queries
                    (e.g., 'Alpha' matches 'alpha'). Useful for handling typos.
      :type fuzzy: bool, default=False
      :param \*\*selections: Key is the dimension name (or special keys 'y', 'ids').
                             Value is the query. Supported query types:

                             1. **List/Array (Exact or Wildcard)**:
                                Matches values present in the list. Strings can use shell-style
                                wildcards ('*', '?').

                             2. **Dictionary (Operator Queries)**:
                                Filters numerical or string values using operators.
                                Keys: '>', '<', '>=', '<=', '==', '!=', 'in'.

                             3. **Callable**:
                                A function taking the coordinate array and returning a boolean mask.
      :type \*\*selections: dict

      :returns: A new DataContainer instance containing the selected subset.
      :rtype: DataContainer

      .. rubric:: Examples

      >>> # 1. Exact Selection (Sensors)
      >>> sub = container.select(channel=['Fz', 'Cz'])

      >>> # 2. Wildcard Selection (All Alpha features)
      >>> sub = container.select(feature='*alpha*')

      >>> # 3. Range Selection (Time)
      >>> sub = container.select(time={'>=': 0.1, '<': 0.5})

      >>> # 4. Case-Insensitive Matching
      >>> sub = container.select(channel=['fz'], ignore_case=True)

      >>> # 5. Filter by Target (y)
      >>> sub = container.select(y=['Patient'])

      >>> # 6. Complex Logic (Subjects 1-5 via Operator)
      >>> sub = container.select(subject_id={'>=': 1, '<=': 5})

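      >>> # 7. Stratified Selection (First 2 epochs per subject via Callable)
      >>> def first_n(ids, n=2):
      ...     # Assumes IDs such as 'sub-01_trial-3': the subject is the text before '_'.
      ...     subjects = np.array([str(i).split('_')[0] for i in ids])
      ...     mask = np.zeros(len(ids), dtype=bool)
      ...     for s in np.unique(subjects):
      ...         mask[np.flatnonzero(subjects == s)[:n]] = True
      ...     return mask
      >>> sub = container.select(ids=first_n)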
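
      The query types listed above can be pictured as producing a boolean mask
      per coordinate. The sketch below uses only the standard library;
      ``match_mask`` is a hypothetical helper, fuzzy matching is left out, and
      the real method may resolve queries differently.

      ```python
      import fnmatch
      import operator

      OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge,
             "<=": operator.le, "==": operator.eq, "!=": operator.ne}

      def match_mask(values, query, ignore_case=False):
          """Hypothetical helper: boolean mask for one query against one coordinate."""
          if callable(query):
              return list(query(values))
          if isinstance(query, dict):
              # Operator queries: every condition must hold.
              mask = [True] * len(values)
              for op, ref in query.items():
                  test = (lambda v: v in ref) if op == "in" else (lambda v: OPS[op](v, ref))
                  mask = [m and test(v) for m, v in zip(mask, values)]
              return mask
          # Exact or shell-style wildcard patterns: any pattern may match.
          patterns = [query] if isinstance(query, str) else list(query)

          def norm(s):
              return str(s).lower() if ignore_case else str(s)

          return [any(fnmatch.fnmatchcase(norm(v), norm(p)) for p in patterns)
                  for v in values]

      channels = ["Fz", "Cz", "Pz", "Oz"]
      times = [0.0, 0.1, 0.2, 0.5]
      by_name = match_mask(channels, ["fz", "c*"], ignore_case=True)
      by_time = match_mask(times, {">=": 0.1, "<": 0.5})
      ```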


   .. py:method:: flatten(preserve: Union[str, List[str]] = 'obs') -> DataContainer

      Flatten dimensions NOT in `preserve` into a single 'feature' dimension.

      This is useful for preparing N-dimensional data for standard 2D machine
      learning algorithms (scikit-learn). It automatically generates composite
      feature names (e.g., 'Fz_0.1s') for tracking.

      :param preserve: Dimensions to keep. All other dimensions will be collapsed into a
                       single 'feature' dimension.
                       - 'obs': Result shape (N_obs, N_features). The standard case.
                       - ['obs', 'time']: Result shape (N_obs, N_time, N_features).
                         Useful for time-resolved decoding distributions.
      :type preserve: str or List[str], default='obs'

      :returns: A new DataContainer with reshaped X and generated 'feature' coordinates.
      :rtype: DataContainer

      .. rubric:: Examples

      >>> # Flatten (10, 64, 500) -> (10, 32000)
      >>> flat = container.flatten(preserve='obs')
      >>> flat.shape
      (10, 32000)
      >>> flat.coords['feature'][0]
      'Fz_0.0'

      >>> # Flatten spatial only, keep time (10, 64, 500) -> (10, 500, 64)
      >>> time_resolved = container.flatten(preserve=['obs', 'time'])
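
      The reshaping can be sketched with NumPy and ``itertools``.
      ``flatten_sketch`` is a hypothetical stand-in; the separator and number
      formatting used for the composite feature names are assumptions.

      ```python
      import itertools
      import numpy as np

      def flatten_sketch(X, dims, coords, preserve=("obs",)):
          """Hypothetical stand-in for flatten: merge non-preserved axes into 'feature'."""
          keep = [d for d in dims if d in preserve]
          merge = [d for d in dims if d not in preserve]
          order = [dims.index(d) for d in keep + merge]
          Xt = np.transpose(X, order)  # preserved axes first
          flat = Xt.reshape(Xt.shape[:len(keep)] + (-1,))
          labels = [coords.get(d, range(X.shape[dims.index(d)])) for d in merge]
          # product() varies the last axis fastest, matching the C-order reshape.
          names = ["_".join(str(v) for v in combo) for combo in itertools.product(*labels)]
          return flat, tuple(keep) + ("feature",), names

      X = np.arange(2 * 3 * 4).reshape(2, 3, 4)
      dims = ("obs", "channel", "time")
      coords = {"channel": ["Fz", "Cz", "Pz"], "time": [0.0, 0.1, 0.2, 0.3]}

      flat, flat_dims, names = flatten_sketch(X, dims, coords)
      per_time, per_time_dims, per_time_names = flatten_sketch(
          X, dims, coords, preserve=("obs", "time")
      )
      ```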


   .. py:method:: stack(dims: Sequence[str], new_dim: str = 'obs') -> DataContainer

      Stack multiple dimensions into a single new dimension.

      This reshapes N-dimensional data into (N-K+1) dimensions by combining K
      specified dimensions into one. It is useful for transforming spatiotemporal data
      (Trials, Channels, Time) -> (Trials*Time, Channels) for trajectory analysis.

      :param dims: Dimensions to stack. The order determines the nesting (slowest to fastest).
                   e.g., ('obs', 'time') means 'obs' changes slowly, 'time' cycles fast.
      :type dims: sequence of str
      :param new_dim: Name of the resulting stacked dimension.
      :type new_dim: str, default='obs'

      :returns: New container with stacked dimension. Metadata (coords/ids) are
                expanded/tiled to match the new shape.
      :rtype: DataContainer

      .. rubric:: Examples

      >>> # Stack time into observations:
      >>> # (10 obs, 64 ch, 500 time) -> (5000 obs, 64 ch)
      >>> stacked = container.stack(dims=('obs', 'time'), new_dim='obs')
      >>> stacked.shape
      (5000, 64)
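
      In NumPy terms, the stacking and the metadata expansion might look like
      the sketch below. It uses a small array and invented IDs, and it is an
      illustration of the slow/fast ordering rather than the library code.

      ```python
      import numpy as np

      X = np.arange(2 * 3 * 4).reshape(2, 3, 4)  # (obs, channel, time)
      ids = np.array(["trial-0", "trial-1"])
      times = np.array([0.0, 0.1, 0.2, 0.3])

      # Bring the stacked axes to the front in (slow, fast) order, then merge them.
      Xt = np.transpose(X, (0, 2, 1))            # (obs, time, channel)
      stacked = Xt.reshape(2 * 4, 3)             # (obs*time, channel)

      # Expand metadata to one entry per stacked row.
      stacked_ids = np.repeat(ids, 4)            # slow axis: each id repeated per time point
      stacked_times = np.tile(times, 2)          # fast axis: times cycle within each id
      ```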


   .. py:method:: unstack(dim: str) -> DataContainer

      Unstack a dimension into multiple dimensions.

      Inverse operation of `stack`. Reshapes the data tensor by splitting one
      dimension into multiple using metadata stored during the `stack` operation.

      :param dim: Dimension to unstack (e.g. 'obs').
      :type dim: str

      :returns: New container with unstacked dimensions.
      :rtype: DataContainer

      :raises ValueError: If the container was not previously stacked (missing metadata).

      .. rubric:: Examples

      >>> # Stack 'trials' and 'time' -> 'obs'
      >>> stacked = container.stack(('trials', 'time'), new_dim='obs')
      >>> # Unstack 'obs' -> ('trials', 'time') (automatically inferred)
      >>> unstacked = stacked.unstack('obs')
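
      The inverse reshape can be sketched in NumPy, assuming the sizes of the
      original dimensions were recorded when stacking. The ``original_shape``
      dictionary is a hypothetical stand-in for that stored metadata.

      ```python
      import numpy as np

      stacked = np.arange(8 * 3).reshape(8, 3)   # (obs, channel); obs = 2 trials x 4 time points
      original_shape = {"trials": 2, "time": 4}  # assumed to be recorded at stack time

      # Split the merged axis back into its components; trailing axes are untouched.
      unstacked = stacked.reshape(tuple(original_shape.values()) + stacked.shape[1:])
      ```

      The result has dims ``('trials', 'time', 'channel')`` in this sketch;
      whether the library also restores the original axis order is not shown
      here.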


   .. py:method:: center(dim: str = 'time', inplace: bool = False) -> DataContainer

      Remove mean along a specified dimension (Centering/Baseline Correction).

      This operation computes the mean along `dim` (ignoring NaNs) and subtracts it.
      Commonly used in EEG for baseline correction (subtracting mean of
      pre-stimulus interval) or centering features before covariance calculation.

      :param dim: Dimension name to center over (e.g., 'time', 'channel', 'obs').
      :type dim: str, default='time'
      :param inplace: If True, modifies X in-place to save memory.
                      Returns self.
      :type inplace: bool, default=False

      :returns: Container with centered data.
      :rtype: DataContainer

      .. rubric:: Examples

      >>> # Baseline correction over time
      >>> container.center(dim='time')
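
      A minimal NumPy sketch of NaN-aware centering, using a small invented
      array with the time axis last:

      ```python
      import numpy as np

      X = np.array([[1.0, 2.0, np.nan, 3.0],
                    [10.0, 10.0, 10.0, 10.0]])        # (channel, time)
      axis = 1                                        # position of 'time' in dims

      mean = np.nanmean(X, axis=axis, keepdims=True)  # NaNs are skipped
      centered = X - mean                             # NaN entries stay NaN
      ```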


   .. py:method:: zscore(dim: str = 'time', eps: float = 1e-08, inplace: bool = False) -> DataContainer

      Standardize (Z-score) along a specified dimension.

      Computes `(X - mean) / std` along the given dimension. Robust to NaNs.
      Useful for normalizing features or standardizing temporal dynamics.

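      :param dim: Dimension to standardize along.
      :type dim: str, default='time'
      :param eps: Small constant that keeps the division stable when the standard
                  deviation is zero.
      :type eps: float, default=1e-08
      :param inplace: If True, modifies X in-place, as in `center`.
      :type inplace: bool, default=False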

      :rtype: DataContainer

      .. rubric:: Examples

      >>> # Standardize each channel's timecourse
      >>> container.zscore(dim='time')
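
      The formula can be sketched in NumPy as below. Adding ``eps`` to the
      standard deviation is an assumption about how the stabilisation is
      applied; the library may clamp or combine it differently.

      ```python
      import numpy as np

      X = np.array([[1.0, 2.0, 3.0],
                    [5.0, 5.0, 5.0]])            # second row is constant
      eps = 1e-8

      mean = np.nanmean(X, axis=1, keepdims=True)
      std = np.nanstd(X, axis=1, keepdims=True)
      z = (X - mean) / (std + eps)               # assumption: eps is added to std
      ```

      The constant row has zero standard deviation, so without ``eps`` it
      would produce NaNs; here it maps to zeros.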


   .. py:method:: rms_scale(dim: str = 'time', eps: float = 1e-08, inplace: bool = False) -> DataContainer

      Scale by Root Mean Square (RMS) amplitude along a dimension.

      Divides data by `sqrt(mean(X**2))` along the dimension.
      Preserves relative shape but normalizes energy.

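      :param dim: Dimension to scale along.
      :type dim: str, default='time'
      :param eps: Small constant that keeps the division stable when the RMS is zero.
      :type eps: float, default=1e-08
      :param inplace: If True, modifies X in-place, as in `center`.
      :type inplace: bool, default=False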

      :rtype: DataContainer
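
      A NumPy sketch of the scaling on a small invented array. As with the
      z-score, adding ``eps`` to the denominator is an assumption.

      ```python
      import numpy as np

      X = np.array([[3.0, -3.0, 3.0, -3.0],
                    [0.5, 0.5, 0.5, 0.5]])
      eps = 1e-8

      rms = np.sqrt(np.nanmean(X ** 2, axis=1, keepdims=True))
      scaled = X / (rms + eps)                   # assumption: eps is added to the RMS
      ```

      Unlike the z-score, the mean is not removed: the second row keeps its
      constant offset and is only rescaled to unit RMS.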


   .. py:method:: baseline_correction(dim: str = 'time', inplace: bool = False) -> DataContainer

      Alias for :meth:`center`, provided under the name commonly used in EEG.


   .. py:method:: aggregate(by: Union[str, numpy.ndarray, List[Any]], stats: Union[str, Sequence[str]] = 'mean', min_count: int = 1, on_insufficient: str = 'raise') -> DataContainer

      Aggregate observations into grouped summaries along the ``obs`` axis.

      :param by: Group definition for the observation axis.
                 - If str: resolve the key from ``self.coords`` or from ``self.y``
                   when ``by == "y"``.
                 - If array-like: explicit group labels aligned with ``obs``.
      :type by: str or array-like
      :param stats: Aggregation statistic or ordered list of statistics. Supported
                    tokens are ``"mean"``, ``"median"``, ``"std"``, ``"var"``,
                    ``"sem"``, ``"mad"``, ``"iqr"``, ``"min"``, ``"max"``,
                    ``"count"``, and ``"first"``. Legacy ``"obs-*"`` aliases are
                    accepted and normalized.
      :type stats: str or sequence of str, default="mean"
      :param min_count: Minimum number of valid observations required per group. A valid
                        observation is one with at least one finite value across the
                        non-observation axes.
      :type min_count: int, default=1
      :param on_insufficient: Policy applied when a group has fewer than ``min_count`` valid
                              observations.
      :type on_insufficient: {"raise", "warn", "collect"}, default="raise"

      :returns: Aggregated container with grouped observations on the ``obs`` axis.
                When multiple stats are requested, a ``stat`` dimension is inserted
                immediately after ``obs``.
      :rtype: DataContainer

      :raises ValueError: If the container has no ``obs`` dimension, grouping is invalid,
          requested stats are unsupported, or ``min_count`` /
          ``on_insufficient`` are invalid.
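
      The grouping and the extra ``stat`` axis can be pictured with the NumPy
      sketch below. The data and labels are invented, the ``min_count`` policy
      is omitted, and sorting the groups with ``np.unique`` is an assumption
      about output order.

      ```python
      import numpy as np

      X = np.array([[1.0, 2.0],
                    [3.0, 4.0],
                    [10.0, 20.0]])               # (obs, feature)
      by = np.array(["a", "a", "b"])             # group label per observation

      groups = np.unique(by)                     # sorted group labels
      stat_funcs = {"mean": np.nanmean, "max": np.nanmax}

      # One row per group, with a 'stat' axis inserted right after 'obs'.
      out = np.stack(
          [np.stack([f(X[by == g], axis=0) for f in stat_funcs.values()]) for g in groups]
      )
      ```

      Here ``out`` has dims ``('obs', 'stat', 'feature')`` with one ``obs``
      entry per group.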


   .. py:method:: aggregate_groups(by: Union[str, numpy.ndarray, List[Any]], groups: Sequence[Dict[str, Any]], min_count: int = 1, on_insufficient: str = 'raise', skip_empty: bool = True) -> DataContainer

      Aggregate selected feature groups with different statistics.

      This is a thin wrapper around :meth:`aggregate` for tabular feature
      containers. Each group spec selects a subset of feature columns and
      applies one or more stats to that subset. The outputs are concatenated
      along the ``feature`` dimension, and each resulting feature name is
      prefixed with its stat (for example ``"mean_band_log_abs_alpha"``).

      :param by: Group definition for the observation axis. Passed through to
                 :meth:`aggregate`.
      :type by: str or array-like
      :param groups: Ordered group specifications. Each group must provide ``"stats"``
                     and may optionally provide include/exclude selectors:

                     - ``names`` / ``exclude_names``
                     - ``prefixes`` / ``exclude_prefixes``
                     - ``suffixes`` / ``exclude_suffixes``
                     - ``contains`` / ``exclude_contains``
                     - ``regex`` / ``exclude_regex``

                     If a group provides no include selectors, it starts from all
                     features and then applies exclusions.
      :type groups: sequence of dict
      :param min_count: Minimum number of valid observations required per group. Passed
                        through to :meth:`aggregate`.
      :type min_count: int, default=1
      :param on_insufficient: Policy applied when a group has fewer than ``min_count`` valid
                              observations. Passed through to :meth:`aggregate`.
      :type on_insufficient: {"raise", "warn", "collect"}, default="raise"
      :param skip_empty: If True, silently skip group specs that match no features. If
                         False, raise a ``ValueError`` when a group matches nothing.
      :type skip_empty: bool, default=True

      :returns: Aggregated container with dims ``("obs", "feature")`` and
                stat-prefixed feature names.
      :rtype: DataContainer

      :raises ValueError: If the container lacks a ``feature`` dimension or coord, no groups
          are provided, a group spec is invalid, multiple groups would emit
          the same output feature name, or no non-empty grouped outputs are
          produced.
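
      The selector logic and the stat-prefixed naming can be sketched with the
      standard library. ``matches`` and ``pick_features`` are hypothetical
      helpers; treating every selector value as a list of strings, combining
      include selectors as a union, and ordering outputs by stat first are all
      assumptions.

      ```python
      import re

      KINDS = ("names", "prefixes", "suffixes", "contains", "regex")

      def matches(name, kind, patterns):
          """Test one feature name against one selector kind."""
          if kind == "names":
              return name in patterns
          if kind == "prefixes":
              return name.startswith(tuple(patterns))
          if kind == "suffixes":
              return name.endswith(tuple(patterns))
          if kind == "contains":
              return any(p in name for p in patterns)
          return any(re.search(p, name) for p in patterns)  # kind == "regex"

      def pick_features(features, spec):
          """Hypothetical helper: apply include selectors first, then exclusions."""
          includes = [k for k in KINDS if k in spec]
          picked = [f for f in features
                    if not includes or any(matches(f, k, spec[k]) for k in includes)]
          excludes = [k for k in KINDS if "exclude_" + k in spec]
          return [f for f in picked
                  if not any(matches(f, k, spec["exclude_" + k]) for k in excludes)]

      features = ["band_abs_alpha", "band_abs_beta", "band_rel_alpha", "age"]
      spec = {"prefixes": ["band_"], "exclude_contains": ["rel"], "stats": ["mean", "std"]}

      picked = pick_features(features, spec)
      out_names = [f"{stat}_{name}" for stat in spec["stats"] for name in picked]
      ```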