coco_pipe.io.dataset
====================

.. py:module:: coco_pipe.io.dataset

.. autoapi-nested-parse::

   coco_pipe/io/dataset.py
   -----------------------
   Specialized Dataset classes that produce standardized DataContainer objects.


Attributes
----------

.. autoapisummary::

   coco_pipe.io.dataset.logger


Classes
-------

.. autoapisummary::

   coco_pipe.io.dataset.BaseDataset
   coco_pipe.io.dataset.TabularDataset
   coco_pipe.io.dataset.EmbeddingDataset
   coco_pipe.io.dataset.BIDSDataset


Module Contents
---------------

.. py:data:: logger

.. py:class:: BaseDataset

   Bases: :py:obj:`abc.ABC`


   Abstract base class for the datasets in this module. Concrete subclasses
   implement `load`, which returns a `DataContainer`.


   .. py:method:: load() -> coco_pipe.io.structures.DataContainer
      :abstractmethod:


.. py:class:: TabularDataset(path: Union[str, pathlib.Path], target_col: Optional[str] = None, index_col: Optional[Union[str, int]] = None, sep: str = '\t', header: Optional[Union[int, List[int]]] = 0, sheet_name: Optional[Union[str, int]] = 0, columns_to_dims: Optional[List[str]] = None, col_sep: str = '_', meta_columns: Optional[List[str]] = None, clean: bool = False, clean_kwargs: Optional[Dict[str, Any]] = None, select_kwargs: Optional[Dict[str, Any]] = None)

   Bases: :py:obj:`BaseDataset`


   Dataset for loading tabular feature data (CSV, TSV, Excel).

   This class handles loading, optional cleaning, and reshaping of 2D tabular data
   into multi-dimensional DataContainers.

   :param path: Path to the tabular file (csv, tsv, txt, xls, xlsx).
   :type path: str or Path
   :param target_col: Name of the column to extract as the target `y`. This column is removed from the features `X`.
   :type target_col: str, optional
   :param index_col: Column to use as index (observation IDs).
   :type index_col: str or int, optional
   :param sep: Separator for text files.
   :type sep: str, default='\t'
   :param header: Row number(s) to use as column names.
   :type header: int or list of int, default=0
   :param sheet_name: Sheet name or index for Excel files.
   :type sheet_name: str or int, default=0
   :param columns_to_dims: If provided, attempts to reshape the 2D feature columns into N-D dimensions.
                           Columns must follow the naming convention: `dim1_dim2_..._feature`.
   :type columns_to_dims: list of str, optional
   :param col_sep: Separator used in column names for reshaping.
   :type col_sep: str, default='_'
   :param meta_columns: List of columns to extract as metadata coordinates instead of features.
   :type meta_columns: list of str, optional
   :param clean: Whether to perform automated cleaning (drop NaNs/Infs).
   :type clean: bool, default=False
   :param clean_kwargs: Arguments passed to `TabularDataset.clean`.
   :type clean_kwargs: dict, optional
   :param select_kwargs: Arguments for feature selection (not yet implemented in load directly).
   :type select_kwargs: dict, optional

   .. rubric:: Examples

   >>> # Load a simple CSV
   >>> ds = TabularDataset("data.csv", target_col="label")
   >>> container = ds.load()

   >>> # Load and reshape wide data (e.g. time series in columns)
   >>> # Columns: T0_F1, T0_F2, T1_F1... -> dims=('time', 'freq')
   >>> ds = TabularDataset("wide.csv", columns_to_dims=['time', 'freq'], col_sep='_')
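
   The reshaping described for `columns_to_dims` can be pictured with plain
   pandas and NumPy. This is a minimal sketch, not the class's actual
   implementation: the table contents, the sorted label ordering, and all
   variable names are assumptions made for illustration.

   .. code-block:: python

      import numpy as np
      import pandas as pd

      # Hypothetical wide table: 2 observations, columns named "<time>_<freq>".
      wide = pd.DataFrame(
          [[1, 2, 3, 4], [5, 6, 7, 8]],
          columns=["T0_F1", "T0_F2", "T1_F1", "T1_F2"],
      )

      # Split each column name on the separator to recover per-dimension labels.
      parts = [col.split("_") for col in wide.columns]
      time_labels = sorted({p[0] for p in parts})
      freq_labels = sorted({p[1] for p in parts})

      # Place each column at its (time, freq) position in an
      # (obs, time, freq) array.
      X = np.empty((len(wide), len(time_labels), len(freq_labels)))
      for col, (t, f) in zip(wide.columns, parts):
          X[:, time_labels.index(t), freq_labels.index(f)] = wide[col].to_numpy()

      print(X.shape)  # (2, 2, 2)

   Each column lands in exactly one cell of the trailing dimensions, which is
   why the column names need to follow the separator convention consistently.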


   .. py:attribute:: path


   .. py:attribute:: target_col
      :value: None


   .. py:attribute:: index_col
      :value: None


   .. py:attribute:: sep
      :value: '\t'


   .. py:attribute:: header
      :value: 0


   .. py:attribute:: sheet_name
      :value: 0


   .. py:attribute:: columns_to_dims
      :value: None


   .. py:attribute:: col_sep
      :value: '_'


   .. py:attribute:: meta_columns
      :value: []


   .. py:attribute:: do_clean
      :value: False


   .. py:attribute:: clean_kwargs


   .. py:attribute:: select_kwargs


   .. py:attribute:: strict_reshaping
      :value: True


   .. py:method:: load() -> coco_pipe.io.structures.DataContainer


   .. py:method:: clean(X: pandas.DataFrame, mode: str = 'any', sep: str = '_', reverse: bool = False, verbose: bool = False, min_abs_value: Optional[float] = None, min_abs_fraction: float = 0.0) -> Tuple[pandas.DataFrame, Dict[str, List[str]]]
      :staticmethod:


      Remove invalid feature columns containing NaN, ±Inf, and optionally very
      small values.
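
      The core idea, dropping any column that contains a non-finite value, can
      be sketched as follows. This is an illustration under stated assumptions
      and does not reproduce the method's `mode`, `sep`, `reverse`, or
      `min_abs_*` options. The example frame and variable names are
      hypothetical.

      .. code-block:: python

         import numpy as np
         import pandas as pd

         # Hypothetical feature table with one NaN column and one Inf column.
         X = pd.DataFrame({
             "a": [1.0, 2.0, 3.0],
             "b": [1.0, np.nan, 3.0],
             "c": [1.0, np.inf, 3.0],
         })

         # True for columns in which every entry is finite (no NaN, no +/-Inf).
         finite = np.isfinite(X.to_numpy()).all(axis=0)

         dropped = X.columns[~finite].tolist()  # names of removed columns
         X_clean = X.loc[:, finite]             # keep only fully finite columns

         print(dropped)  # ['b', 'c']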


.. py:class:: EmbeddingDataset(path: Union[str, pathlib.Path], pattern: str = '*.pkl', dims: Tuple[str, Ellipsis] = ('obs', 'feature'), coords: Optional[Dict[str, Union[List, numpy.ndarray]]] = None, reader: Optional[Any] = None, id_fn: Optional[Any] = None, task: Optional[str] = None, run: Optional[str] = None, processing: Optional[str] = None, subjects: Optional[Union[int, List[int]]] = None)

   Bases: :py:obj:`BaseDataset`


   Generic Dataset for loading embedding files (Pickle, NPY, JSON, H5).

   This class decouples file discovery (via patterns and IDs) from content reading.
   It supports structured formats (e.g., Layers x Features) and user-supplied
   metadata coordinates.

   :param path: Root directory containing the embedding files.
   :type path: str or Path
   :param pattern: Glob pattern to match files (e.g., "*.npy", "sub-*_emb.pkl").
   :type pattern: str, default='*.pkl'
   :param dims: Dimension labels for the data arrays (excluding the observation dimension if
                implicit). Typically ('feature',) or ('layer', 'feature').
   :type dims: tuple of str, default=('obs', 'feature')
   :param coords: Dictionary of coordinates for dimensions. E.g., {'layer': ['L1', 'L2']}.
   :type coords: dict, optional
   :param reader: Custom function to read a Path and return a numpy array or dict.
                  If None, uses `smart_reader` based on file extension.
   :type reader: callable, optional
   :param id_fn: Custom function to extract subject ID from a Path.
                 If None, uses `default_id_extractor`.
   :type id_fn: callable, optional
   :param task: (Legacy BIDS) Task name to construct search pattern.
   :type task: str, optional
   :param run: (Legacy BIDS) Run name to construct search pattern.
   :type run: str, optional
   :param processing: (Legacy BIDS) Processing label.
   :type processing: str, optional
   :param subjects: If an int, loads the first N subjects. If a list, loads only the listed subjects
                    (matched by `id_fn`).
   :type subjects: int or list, optional

   .. rubric:: Examples

   >>> # Load loose numpy files
   >>> ds = EmbeddingDataset("./embeddings", pattern="*.npy", dims=('feature',))
   >>> container = ds.load()
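
   Custom `reader` and `id_fn` callables might look like the sketch below.
   The function names, the file name, and the `sub-<ID>` naming scheme are
   assumptions for illustration. Returning the ID as a string is also an
   assumption, not a documented requirement.

   .. code-block:: python

      import re
      import tempfile
      from pathlib import Path

      import numpy as np


      def npy_reader(path: Path) -> np.ndarray:
          """Hypothetical reader: load one embedding array from a .npy file."""
          return np.load(path)


      def subject_id(path: Path) -> str:
          """Hypothetical ID extractor: take the token after 'sub-'."""
          match = re.search(r"sub-([A-Za-z0-9]+)", path.name)
          return match.group(1) if match else path.stem


      # Exercise both callables on a temporary file.
      with tempfile.TemporaryDirectory() as tmp:
          f = Path(tmp) / "sub-01_emb.npy"
          np.save(f, np.arange(4.0))
          emb = npy_reader(f)
          sid = subject_id(f)

      print(sid, emb.shape)  # 01 (4,)

      # They would then be passed through the documented keyword arguments:
      # ds = EmbeddingDataset("./embeddings", pattern="*.npy",
      #                       reader=npy_reader, id_fn=subject_id)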


   .. py:attribute:: path


   .. py:attribute:: subjects
      :value: None


   .. py:attribute:: dims
      :value: ('obs', 'feature')


   .. py:attribute:: coords_in


   .. py:attribute:: reader


   .. py:attribute:: id_fn


   .. py:method:: load() -> coco_pipe.io.structures.DataContainer


.. py:class:: BIDSDataset(root: Union[str, pathlib.Path], task: Optional[str] = None, session: Optional[Union[str, List[str]]] = None, datatype: str = 'eeg', suffix: Optional[str] = None, mode: str = 'epochs', target_col: Optional[str] = None, window_length: Optional[float] = None, stride: Optional[float] = None, subjects: Optional[Union[str, List[str]]] = None, runs: Optional[Union[str, List[str]]] = None, event_id: Optional[Union[Dict[str, int], str, List[str]]] = None, subject_metadata_df: Optional[pandas.DataFrame] = None, subject_key: Optional[str] = None, tmin: float = -0.2, tmax: float = 0.5, baseline: Optional[Tuple[Optional[float], Optional[float]]] = None)

   Bases: :py:obj:`BaseDataset`


   Dataset for loading M/EEG data formatted according to the BIDS standard.

   This class supports loading valid BIDS structures, handling multiple subjects,
   sessions, and data types (Raw, Epoched, Evoked). It automatically extracts
   metadata from `participants.tsv` and aligns it with the loaded data.

   :param root: The root directory of the BIDS dataset.
   :type root: str or Path
   :param task: The task name (e.g., 'rest', 'audiovisual').
   :type task: str, optional
   :param session: The session ID(s) to load. If None, detects all available sessions.
   :type session: str or List[str], optional
   :param datatype: The data type to load (e.g., 'eeg', 'meg').
   :type datatype: str, default='eeg'
   :param suffix: The suffix of the files to load.

                  - If None, defaults to `datatype`.
                  - Use 'epo' to load pre-computed epochs.
                  - Use 'ave' to load evoked data.
   :type suffix: str, optional
   :param mode: The loading mode:

                - 'epochs': Slices raw continuous data into fixed-length windows.
                - 'continuous': Loads raw data as single continuous segments (1 epoch per run).
                - 'load_existing': Loads pre-computed epochs (requires `suffix='epo'`).
   :type mode: str, default='epochs'
   :param window_length: Length of window in seconds for 'epochs' mode.
   :type window_length: float, optional
   :param stride: Stride between windows in seconds. If None, defaults to `window_length`
                  (no overlap).
   :type stride: float, optional
   :param subjects: Specific subject IDs to load (without 'sub-' prefix). If None, detects all
                    subjects.
   :type subjects: str or List[str], optional

   .. rubric:: Examples

   >>> # Load resting state EEG for all subjects, sliced into 1s windows
   >>> ds = BIDSDataset(root="/data/bids", task="rest", window_length=1.0)
   >>> container = ds.load()
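
   How `window_length` and `stride` interact can be sketched with NumPy alone.
   This is an illustration and not the class's implementation: the sampling
   rate, the helper name `window_starts`, and the choice to drop an incomplete
   trailing window are all assumptions.

   .. code-block:: python

      import numpy as np


      def window_starts(n_samples, sfreq, window_length, stride=None):
          """Hypothetical helper: start indices of fixed-length windows."""
          # A missing stride falls back to the window length (no overlap).
          stride = window_length if stride is None else stride
          win = int(round(window_length * sfreq))   # window size in samples
          step = int(round(stride * sfreq))         # hop size in samples
          # Assumption: a trailing partial window is discarded.
          return np.arange(0, n_samples - win + 1, step), win


      sfreq = 100.0  # assumed sampling rate in Hz
      data = np.random.default_rng(0).standard_normal((3, 250))  # channels x samples

      # Non-overlapping 1 s windows, then 1 s windows with 50% overlap.
      starts, win = window_starts(data.shape[1], sfreq, window_length=1.0)
      starts_ov, _ = window_starts(data.shape[1], sfreq, 1.0, stride=0.5)

      # Stack the windows into (n_windows, n_channels, n_times).
      windows = np.stack([data[:, s:s + win] for s in starts_ov])

      print(len(starts), windows.shape)  # 2 (4, 3, 100)

   The stacked array follows the same ('obs', 'channel', 'time') layout that
   `load` documents for its output.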


   .. py:attribute:: root


   .. py:attribute:: task
      :value: None


   .. py:attribute:: session
      :value: None


   .. py:attribute:: datatype
      :value: 'eeg'


   .. py:attribute:: suffix
      :value: None


   .. py:attribute:: mode
      :value: 'epochs'


   .. py:attribute:: target_col
      :value: None


   .. py:attribute:: window_length
      :value: None


   .. py:attribute:: stride
      :value: None


   .. py:attribute:: subjects
      :value: None


   .. py:attribute:: runs
      :value: None


   .. py:attribute:: event_id
      :value: None


   .. py:attribute:: subject_metadata_df
      :value: None


   .. py:attribute:: subject_key
      :value: None


   .. py:attribute:: tmin
      :value: -0.2


   .. py:attribute:: tmax
      :value: 0.5


   .. py:attribute:: baseline
      :value: None


   .. py:method:: load() -> coco_pipe.io.structures.DataContainer

      Load the BIDS dataset into a DataContainer.

      :returns: A container with:

                - X: Data array of shape (N_obs, N_channels, N_time).
                - ids: Unique identifiers for each observation.
                - coords: Dictionary containing 'channel', 'time', 'obs', and metadata.
                - dims: ('obs', 'channel', 'time').
      :rtype: DataContainer