coco_pipe.io.dataset
====================

.. py:module:: coco_pipe.io.dataset

.. autoapi-nested-parse::

   coco_pipe/io/dataset.py
   -----------------------

   Specialized Dataset classes that produce standardized DataContainer objects.


Attributes
----------

.. autoapisummary::

   coco_pipe.io.dataset.logger


Classes
-------

.. autoapisummary::

   coco_pipe.io.dataset.BaseDataset
   coco_pipe.io.dataset.TabularDataset
   coco_pipe.io.dataset.EmbeddingDataset
   coco_pipe.io.dataset.BIDSDataset


Module Contents
---------------

.. py:data:: logger

.. py:class:: BaseDataset

   Bases: :py:obj:`abc.ABC`

   Helper class that provides a standard way to create an ABC using inheritance.

   .. py:method:: load() -> coco_pipe.io.structures.DataContainer
      :abstractmethod:


.. py:class:: TabularDataset(path: Union[str, pathlib.Path], target_col: Optional[str] = None, index_col: Optional[Union[str, int]] = None, sep: str = '\t', header: Optional[Union[int, List[int]]] = 0, sheet_name: Optional[Union[str, int]] = 0, columns_to_dims: Optional[List[str]] = None, col_sep: str = '_', meta_columns: Optional[List[str]] = None, clean: bool = False, clean_kwargs: Optional[Dict[str, Any]] = None, select_kwargs: Optional[Dict[str, Any]] = None)

   Bases: :py:obj:`BaseDataset`

   Dataset for loading tabular feature data (CSV, TSV, Excel).

   This class handles loading, optional cleaning, and reshaping of 2D tabular
   data into multi-dimensional DataContainers.

   :param path: Path to the tabular file (csv, tsv, txt, xls, xlsx).
   :type path: str or Path
   :param target_col: Name of the column to extract as target `y`. Removed from features `X`.
   :type target_col: str, optional
   :param index_col: Column to use as index (observation IDs).
   :type index_col: str or int, optional
   :param sep: Separator for text files.
   :type sep: str, default='\t'
   :param header: Row number(s) to use as column names.
   :type header: int or list of int, default=0
   :param sheet_name: Sheet name or index for Excel files.
   :type sheet_name: str or int, default=0
   :param columns_to_dims: If provided, attempts to reshape the 2D feature columns into N-D dimensions.
                           Columns must follow the naming convention: `dim1_dim2_..._feature`.
   :type columns_to_dims: list of str, optional
   :param col_sep: Separator used in column names for reshaping.
   :type col_sep: str, default='_'
   :param meta_columns: List of columns to extract as metadata coordinates instead of features.
   :type meta_columns: list of str, optional
   :param clean: Whether to perform automated cleaning (drop NaNs/Infs).
   :type clean: bool, default=False
   :param clean_kwargs: Arguments passed to `TabularDataset.clean`.
   :type clean_kwargs: dict, optional
   :param select_kwargs: Arguments for feature selection (not yet implemented in load directly).
   :type select_kwargs: dict, optional

   .. rubric:: Examples

   >>> # Load a simple CSV
   >>> ds = TabularDataset("data.csv", target_col="label")
   >>> container = ds.load()

   >>> # Load and reshape wide data (e.g. time series in columns)
   >>> # Columns: T0_F1, T0_F2, T1_F1... -> dims=('time', 'freq')
   >>> ds = TabularDataset("wide.csv", columns_to_dims=['time', 'freq'], col_sep='_')

   .. py:attribute:: path

   .. py:attribute:: target_col
      :value: None

   .. py:attribute:: index_col
      :value: None

   .. py:attribute:: sep
      :value: '\t'

   .. py:attribute:: header
      :value: 0

   .. py:attribute:: sheet_name
      :value: 0

   .. py:attribute:: columns_to_dims
      :value: None

   .. py:attribute:: col_sep
      :value: '_'

   .. py:attribute:: meta_columns
      :value: []

   .. py:attribute:: do_clean
      :value: False

   .. py:attribute:: clean_kwargs

   .. py:attribute:: select_kwargs

   .. py:attribute:: strict_reshaping
      :value: True

   .. py:method:: load() -> coco_pipe.io.structures.DataContainer

   .. py:method:: clean(X: pandas.DataFrame, mode: str = 'any', sep: str = '_', reverse: bool = False, verbose: bool = False, min_abs_value: Optional[float] = None, min_abs_fraction: float = 0.0) -> Tuple[pandas.DataFrame, Dict[str, List[str]]]
      :staticmethod:

      Remove invalid feature columns containing NaN, ±Inf, and optionally very small values.


.. py:class:: EmbeddingDataset(path: Union[str, pathlib.Path], pattern: str = '*.pkl', dims: Tuple[str, Ellipsis] = ('obs', 'feature'), coords: Optional[Dict[str, Union[List, numpy.ndarray]]] = None, reader: Optional[Any] = None, id_fn: Optional[Any] = None, task: Optional[str] = None, run: Optional[str] = None, processing: Optional[str] = None, subjects: Optional[Union[int, List[int]]] = None)

   Bases: :py:obj:`BaseDataset`

   Generic Dataset for loading embedding files (Pickle, NPY, JSON, H5).

   This class decouples file discovery (via patterns and IDs) from content reading.
   It supports structured formats (e.g., Layers x Features) and user-supplied
   metadata coordinates.

   :param path: Root directory containing the embedding files.
   :type path: str or Path
   :param pattern: Glob pattern to match files (e.g., "\*.npy", "sub-\*_emb.pkl").
   :type pattern: str, default='\*.pkl'
   :param dims: Dimension labels for the data arrays (excluding the observation dimension if implicit).
                Typically ('feature',) or ('layer', 'feature').
   :type dims: tuple of str, default=('obs', 'feature')
   :param coords: Dictionary of coordinates for dimensions. E.g., {'layer': ['L1', 'L2']}.
   :type coords: dict, optional
   :param reader: Custom function to read a Path and return a numpy array or dict.
                  If None, uses `smart_reader` based on file extension.
   :type reader: callable, optional
   :param id_fn: Custom function to extract subject ID from a Path.
                 If None, uses `default_id_extractor`.
   :type id_fn: callable, optional
   :param task: (Legacy BIDS) Task name to construct search pattern.
   :type task: str, optional
   :param run: (Legacy BIDS) Run name to construct search pattern.
   :type run: str, optional
   :param processing: (Legacy BIDS) Processing label.
   :type processing: str, optional
   :param subjects: If int, loads the first N subjects.
                    If list, loads specific subjects (matched by `id_fn`).
   :type subjects: int or list, optional

   .. rubric:: Examples

   >>> # Load loose numpy files
   >>> ds = EmbeddingDataset("./embeddings", pattern="*.npy", dims=('feature',))
   >>> container = ds.load()

   .. py:attribute:: path

   .. py:attribute:: subjects
      :value: None

   .. py:attribute:: dims
      :value: ('obs', 'feature')

   .. py:attribute:: coords_in

   .. py:attribute:: reader

   .. py:attribute:: id_fn

   .. py:method:: load() -> coco_pipe.io.structures.DataContainer


.. py:class:: BIDSDataset(root: Union[str, pathlib.Path], task: Optional[str] = None, session: Optional[Union[str, List[str]]] = None, datatype: str = 'eeg', suffix: Optional[str] = None, mode: str = 'epochs', target_col: Optional[str] = None, window_length: Optional[float] = None, stride: Optional[float] = None, subjects: Optional[Union[str, List[str]]] = None, runs: Optional[Union[str, List[str]]] = None, event_id: Optional[Union[Dict[str, int], str, List[str]]] = None, subject_metadata_df: Optional[pandas.DataFrame] = None, subject_key: Optional[str] = None, tmin: float = -0.2, tmax: float = 0.5, baseline: Optional[Tuple[Optional[float], Optional[float]]] = None)

   Bases: :py:obj:`BaseDataset`

   Dataset for loading M/EEG data formatted according to the BIDS standard.

   This class supports loading valid BIDS structures, handling multiple subjects,
   sessions, and data types (Raw, Epoched, Evoked). It automatically extracts
   metadata from `participants.tsv` and aligns it with the loaded data.

   :param root: The root directory of the BIDS dataset.
   :type root: str or Path
   :param task: The task name (e.g., 'rest', 'audiovisual').
   :type task: str, optional
   :param session: The session ID(s) to load. If None, detects all available sessions.
   :type session: str or List[str], optional
   :param datatype: The data type to load (e.g., 'eeg', 'meg').
   :type datatype: str, default='eeg'
   :param suffix: The suffix of the files to load.

       - If None, defaults to `datatype`.
       - Use 'epo' to load pre-computed epochs.
       - Use 'ave' to load evoked data.
   :type suffix: str, optional
   :param mode: The loading mode:

       - 'epochs': Slices raw continuous data into fixed-length windows.
       - 'continuous': Loads raw data as single continuous segments (1 epoch per run).
       - 'load_existing': Treats the files as pre-computed epochs (requires `suffix='epo'`).
   :type mode: str, default='epochs'
   :param window_length: Length of each window in seconds for 'epochs' mode.
   :type window_length: float, optional
   :param stride: Stride between windows in seconds. If None, defaults to `window_length` (no overlap).
   :type stride: float, optional
   :param subjects: Specific subject IDs to load (without 'sub-' prefix). If None, detects all subjects.
   :type subjects: str or List[str], optional

   .. rubric:: Examples

   >>> # Load resting state EEG for all subjects, sliced into 1s windows
   >>> ds = BIDSDataset(root="/data/bids", task="rest", window_length=1.0)
   >>> container = ds.load()

   .. py:attribute:: root

   .. py:attribute:: task
      :value: None

   .. py:attribute:: session
      :value: None

   .. py:attribute:: datatype
      :value: 'eeg'

   .. py:attribute:: suffix
      :value: None

   .. py:attribute:: mode
      :value: 'epochs'

   .. py:attribute:: target_col
      :value: None

   .. py:attribute:: window_length
      :value: None

   .. py:attribute:: stride
      :value: None

   .. py:attribute:: subjects
      :value: None

   .. py:attribute:: runs
      :value: None

   .. py:attribute:: event_id
      :value: None

   .. py:attribute:: subject_metadata_df
      :value: None

   .. py:attribute:: subject_key
      :value: None

   .. py:attribute:: tmin
      :value: -0.2

   .. py:attribute:: tmax
      :value: 0.5

   .. py:attribute:: baseline
      :value: None

   .. py:method:: load() -> coco_pipe.io.structures.DataContainer

      Load the BIDS dataset into a DataContainer.

      :returns: A container with:

          - X: Data array of shape (N_obs, N_channels, N_time).
          - ids: Unique identifiers for each observation.
          - coords: Dictionary containing 'channel', 'time', 'obs', and metadata.
          - dims: ('obs', 'channel', 'time').
      :rtype: DataContainer
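
The 'epochs' mode of ``BIDSDataset`` is documented as slicing continuous data into fixed-length windows controlled by ``window_length`` and ``stride``, producing an array of shape (N_obs, N_channels, N_time). The following is a minimal NumPy sketch of that idea only; it is not the ``coco_pipe`` implementation. The helper name ``sliding_windows``, the ``sfreq`` argument, and the choice to drop trailing samples that do not fill a whole window are assumptions made for illustration.

```python
import numpy as np


def sliding_windows(data, sfreq, window_length, stride=None):
    """Cut a (n_channels, n_samples) array into fixed-length windows.

    Hypothetical helper for illustration; not part of coco_pipe.
    Returns an array of shape (n_windows, n_channels, n_window_samples).
    """
    if stride is None:
        stride = window_length  # no overlap, mirroring the documented default
    win = int(round(window_length * sfreq))  # window size in samples
    step = int(round(stride * sfreq))        # hop size in samples
    if win <= 0 or step <= 0:
        raise ValueError("window_length and stride must be positive")
    n_samples = data.shape[-1]
    if n_samples < win:
        # Recording shorter than one window: return an empty stack
        return np.empty((0, data.shape[0], win), dtype=data.dtype)
    # Assumption: trailing samples that do not fill a full window are dropped
    starts = np.arange(0, n_samples - win + 1, step)
    return np.stack([data[:, s:s + win] for s in starts])


# 4 channels, 10 s of synthetic signal sampled at 100 Hz
rng = np.random.default_rng(0)
continuous = rng.standard_normal((4, 1000))

X = sliding_windows(continuous, sfreq=100.0, window_length=1.0)
print(X.shape)  # (10, 4, 100)

X_overlap = sliding_windows(continuous, sfreq=100.0, window_length=1.0, stride=0.5)
print(X_overlap.shape)  # (19, 4, 100)
```

With ``stride`` left as None the windows tile the recording without overlap; a stride of half the window length roughly doubles the number of observations, at the cost of correlated neighbouring windows.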