coco_pipe.io.dataset¶
Specialized Dataset classes that produce standardized DataContainer objects.
Attributes¶
Classes¶
BaseDataset | Helper class that provides a standard way to create an ABC using inheritance.
TabularDataset | Dataset for loading tabular feature data (CSV, TSV, Excel).
EmbeddingDataset | Generic Dataset for loading embedding files (Pickle, NPY, JSON, H5).
BIDSDataset | Dataset for loading M/EEG data formatted according to the BIDS standard.
Module Contents¶
- coco_pipe.io.dataset.logger¶
- class coco_pipe.io.dataset.BaseDataset[source]¶
Bases: abc.ABC
Helper class that provides a standard way to create an ABC using inheritance.
- abstract load() → coco_pipe.io.structures.DataContainer[source]¶
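The single abstract `load()` method is the whole contract a subclass has to satisfy. The sketch below illustrates that pattern with stand-ins only: the `DataContainer` dataclass and the `InMemoryDataset` subclass are hypothetical and are not part of coco_pipe; the real container's fields may differ.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class DataContainer:
    """Hypothetical stand-in for coco_pipe.io.structures.DataContainer."""
    X: List[List[float]]
    dims: Tuple[str, ...]
    coords: Dict[str, Any] = field(default_factory=dict)


class BaseDataset(ABC):
    """Stand-in mirroring the documented interface: one abstract load()."""

    @abstractmethod
    def load(self) -> DataContainer:
        ...


class InMemoryDataset(BaseDataset):
    """Hypothetical subclass that wraps rows already held in memory."""

    def __init__(self, rows: List[List[float]]):
        self.rows = rows

    def load(self) -> DataContainer:
        # Give every row an observation ID so downstream code can align on it.
        obs_ids = [f"obs-{i}" for i in range(len(self.rows))]
        return DataContainer(
            X=self.rows, dims=("obs", "feature"), coords={"obs": obs_ids}
        )


container = InMemoryDataset([[1.0, 2.0], [3.0, 4.0]]).load()
print(container.dims)  # ('obs', 'feature')
```

Because `load()` is abstract, instantiating `BaseDataset` directly raises `TypeError`; only subclasses that implement it can be created.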
- class coco_pipe.io.dataset.TabularDataset(path: str | pathlib.Path, target_col: str | None = None, index_col: str | int | None = None, sep: str = '\t', header: int | List[int] | None = 0, sheet_name: str | int | None = 0, columns_to_dims: List[str] | None = None, col_sep: str = '_', meta_columns: List[str] | None = None, clean: bool = False, clean_kwargs: Dict[str, Any] | None = None, select_kwargs: Dict[str, Any] | None = None)[source]¶
Bases: BaseDataset
Dataset for loading tabular feature data (CSV, TSV, Excel).
This class handles loading, optional cleaning, and reshaping of 2D tabular data into multi-dimensional DataContainers.
- Parameters:
path (str or Path) – Path to the tabular file (csv, tsv, txt, xls, xlsx).
target_col (str, optional) – Name of the column to extract as target y. Removed from features X.
index_col (str or int, optional) – Column to use as index (observation IDs).
sep (str, default='\t') – Separator for text files.
header (int or list of int, default=0) – Row number(s) to use as column names.
sheet_name (str or int, default=0) – Sheet name or index for Excel files.
columns_to_dims (list of str, optional) – If provided, attempts to reshape the 2D feature columns into N-D dimensions. Columns must follow the naming convention: dim1_dim2_…_feature.
col_sep (str, default='_') – Separator used in column names for reshaping.
meta_columns (list of str, optional) – List of columns to extract as metadata coordinates instead of features.
clean (bool, default=False) – Whether to perform automated cleaning (drop NaNs/Infs).
clean_kwargs (dict, optional) – Arguments passed to TabularDataset.clean.
select_kwargs (dict, optional) – Arguments for feature selection (not yet implemented in load directly).
Examples
>>> # Load a simple CSV
>>> ds = TabularDataset("data.csv", target_col="label")
>>> container = ds.load()

>>> # Load and reshape wide data (e.g. time series in columns)
>>> # Columns: T0_F1, T0_F2, T1_F1... -> dims=('time', 'freq')
>>> ds = TabularDataset("wide.csv", columns_to_dims=['time', 'freq'], col_sep='_')
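To make the `columns_to_dims` naming convention concrete, here is a minimal NumPy sketch of how wide columns such as `T0_F1` could be unfolded into an `(obs, time, freq)` array. It is an illustration under assumptions, not the library's implementation: the column names and data are invented, and the real reshaping logic (including what `strict_reshaping` enforces) may behave differently.

```python
import numpy as np

# Hypothetical wide table: 2 observations, columns named <time>_<freq>.
columns = ["T0_F1", "T0_F2", "T1_F1", "T1_F2", "T2_F1", "T2_F2"]
X_wide = np.arange(12, dtype=float).reshape(2, 6)

col_sep = "_"
dim_names = ("time", "freq")

# Split each column name into one label per target dimension.
parts = [name.split(col_sep) for name in columns]

# Collect the unique labels of each dimension, keeping first-seen order.
coords = {
    dim: list(dict.fromkeys(p[i] for p in parts))
    for i, dim in enumerate(dim_names)
}

# Allocate the N-D array and copy each column to its (time, freq) position.
shape = (X_wide.shape[0],) + tuple(len(coords[d]) for d in dim_names)
X_nd = np.full(shape, np.nan)
for col_idx, labels in enumerate(parts):
    pos = tuple(coords[d].index(lab) for d, lab in zip(dim_names, labels))
    X_nd[(slice(None),) + pos] = X_wide[:, col_idx]

print(X_nd.shape)  # (2, 3, 2)
```

Pre-filling with NaN means that a missing column combination stays visible as NaN in this sketch instead of silently shifting the remaining values.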
- path¶
- target_col = None¶
- index_col = None¶
- sep = '\t'¶
- header = 0¶
- sheet_name = 0¶
- columns_to_dims = None¶
- col_sep = '_'¶
- meta_columns = []¶
- do_clean = False¶
- clean_kwargs¶
- select_kwargs¶
- strict_reshaping = True¶
- static clean(X: pandas.DataFrame, mode: str = 'any', sep: str = '_', reverse: bool = False, verbose: bool = False, min_abs_value: float | None = None, min_abs_fraction: float = 0.0) → Tuple[pandas.DataFrame, Dict[str, List[str]]][source]¶
Remove invalid feature columns containing NaN, ±Inf, and optionally very small values.
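The core idea of dropping columns that contain NaN or ±Inf can be sketched with NumPy as follows. This is only an illustration: the helper `drop_invalid_columns` is hypothetical, and reading `mode='any'` as "drop a column if any value is invalid" is an assumption, since the page does not define the modes. The `sep`, `reverse`, `min_abs_value`, and `min_abs_fraction` options are not modeled here.

```python
import numpy as np


def drop_invalid_columns(X, names, mode="any"):
    """Hypothetical helper: drop columns holding non-finite values.

    mode="any" drops a column if at least one value is NaN or +/-Inf;
    mode="all" drops it only if every value is non-finite.
    Returns (kept_array, kept_names, dropped_names).
    """
    invalid = ~np.isfinite(X)
    if mode == "any":
        drop = invalid.any(axis=0)
    elif mode == "all":
        drop = invalid.all(axis=0)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    kept = [n for n, d in zip(names, drop) if not d]
    dropped = [n for n, d in zip(names, drop) if d]
    return X[:, ~drop], kept, dropped


X = np.array([[1.0, np.nan, np.inf],
              [2.0, np.nan, 3.0]])
names = ["a", "b", "c"]

X_any, kept_any, dropped_any = drop_invalid_columns(X, names, mode="any")
X_all, kept_all, dropped_all = drop_invalid_columns(X, names, mode="all")
print(kept_any, dropped_any)  # ['a'] ['b', 'c']
print(kept_all, dropped_all)  # ['a', 'c'] ['b']
```

Returning the dropped names alongside the cleaned array mirrors the documented return type, which pairs the cleaned DataFrame with a dictionary of column-name lists.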
- class coco_pipe.io.dataset.EmbeddingDataset(path: str | pathlib.Path, pattern: str = '*.pkl', dims: Tuple[str, Ellipsis] = ('obs', 'feature'), coords: Dict[str, List | numpy.ndarray] | None = None, reader: Any | None = None, id_fn: Any | None = None, task: str | None = None, run: str | None = None, processing: str | None = None, subjects: int | List[int] | None = None)[source]¶
Bases: BaseDataset
Generic Dataset for loading embedding files (Pickle, NPY, JSON, H5).
This class decouples file discovery (via patterns and IDs) from content reading. It supports structured formats (e.g., Layers x Features) and user-supplied metadata coordinates.
- Parameters:
path (str or Path) – Root directory containing the embedding files.
pattern (str, default='*.pkl') – Glob pattern to match files (e.g., "*.npy", "sub-*_emb.pkl").
dims (tuple of str, default=('obs', 'feature')) – Dimension labels for the data arrays (excluding the observation dimension if implicit). Typically ('feature',) or ('layer', 'feature').
coords (dict, optional) – Dictionary of coordinates for dimensions. E.g., {'layer': ['L1', 'L2']}.
reader (callable, optional) – Custom function to read a Path and return a numpy array or dict. If None, uses smart_reader based on file extension.
id_fn (callable, optional) – Custom function to extract subject ID from a Path. If None, uses default_id_extractor.
task (str, optional) – (Legacy BIDS) Task name to construct search pattern.
run (str, optional) – (Legacy BIDS) Run name to construct search pattern.
processing (str, optional) – (Legacy BIDS) Processing label.
subjects (int or list, optional) – If int, loads first N subjects. If list, loads specific subjects (matched by id_fn).
Examples
>>> # Load loose numpy files
>>> ds = EmbeddingDataset("./embeddings", pattern="*.npy", dims=('feature',))
>>> container = ds.load()
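The separation between file discovery and content reading can be illustrated with the standard library alone. In this sketch, `extract_subject_id` and `discover` are hypothetical helpers: they are not the library's `default_id_extractor` or its internal discovery code, and the `sub-<label>` file naming is an assumption made for the example.

```python
import re
import tempfile
from pathlib import Path


def extract_subject_id(path: Path) -> str:
    """Hypothetical ID extractor: use 'sub-<label>' if present, else the stem."""
    match = re.search(r"sub-([A-Za-z0-9]+)", path.name)
    return match.group(1) if match else path.stem


def discover(root, pattern, subjects=None):
    """Map subject ID -> file path, optionally restricted by `subjects`.

    An int keeps the first N subjects (in sorted file order); a list keeps
    only the subjects whose extracted ID appears in it.
    """
    files = sorted(Path(root).glob(pattern))
    by_id = {extract_subject_id(f): f for f in files}
    if isinstance(subjects, int):
        keep = list(by_id)[:subjects]
    elif subjects is not None:
        wanted = {str(s) for s in subjects}
        keep = [s for s in by_id if s in wanted]
    else:
        keep = list(by_id)
    return {s: by_id[s] for s in keep}


with tempfile.TemporaryDirectory() as tmp:
    # Create empty placeholder files; only the names matter for discovery.
    for name in ("sub-01_emb.npy", "sub-02_emb.npy", "sub-03_emb.npy", "notes.txt"):
        (Path(tmp) / name).touch()
    first_two = {k: v.name for k, v in discover(tmp, "*.npy", subjects=2).items()}
    picked = {k: v.name for k, v in discover(tmp, "*.npy", subjects=["03"]).items()}

print(first_two)  # {'01': 'sub-01_emb.npy', '02': 'sub-02_emb.npy'}
print(picked)     # {'03': 'sub-03_emb.npy'}
```

Keeping the ID function separate from the reader means a different naming scheme only requires swapping one callable, which is the role the `id_fn` parameter plays above.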
- path¶
- subjects = None¶
- dims = ('obs', 'feature')¶
- coords_in¶
- reader¶
- id_fn¶
- class coco_pipe.io.dataset.BIDSDataset(root: str | pathlib.Path, task: str | None = None, session: str | List[str] | None = None, datatype: str = 'eeg', suffix: str | None = None, mode: str = 'epochs', target_col: str | None = None, window_length: float | None = None, stride: float | None = None, subjects: str | List[str] | None = None, runs: str | List[str] | None = None, event_id: Dict[str, int] | str | List[str] | None = None, subject_metadata_df: pandas.DataFrame | None = None, subject_key: str | None = None, tmin: float = -0.2, tmax: float = 0.5, baseline: Tuple[float | None, float | None] | None = None)[source]¶
Bases: BaseDataset
Dataset for loading M/EEG data formatted according to the BIDS standard.
This class supports loading valid BIDS structures, handling multiple subjects, sessions, and data types (Raw, Epoched, Evoked). It automatically extracts metadata from participants.tsv and aligns it with the loaded data.
- Parameters:
root (str or Path) – The root directory of the BIDS dataset.
task (str, optional) – The task name (e.g., 'rest', 'audiovisual').
session (str or List[str], optional) – The session ID(s) to load. If None, detects all available sessions.
datatype (str, default='eeg') – The data type to load (e.g., 'eeg', 'meg').
suffix (str, optional) – The suffix of the files to load.
  - If None, defaults to datatype.
  - Use 'epo' to load pre-computed epochs.
  - Use 'ave' to load evoked data.
mode (str, default='epochs') – The loading mode:
  - 'epochs': Slices raw continuous data into fixed-length windows.
  - 'continuous': Loads raw data as single continuous segments (1 epoch per run).
  - 'load_existing': Treats the files as pre-computed epochs (requires suffix='epo').
window_length (float, optional) – Length of window in seconds for ‘epochs’ mode.
stride (float, optional) – Stride between windows in seconds. If None, defaults to window_length (no overlap).
subjects (str or List[str], optional) – Specific subject IDs to load (without 'sub-' prefix). If None, detects all subjects.
Examples
>>> # Load resting state EEG for all subjects, sliced into 1s windows
>>> ds = BIDSDataset(root="/data/bids", task="rest", window_length=1.0)
>>> container = ds.load()
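The relationship between `window_length`, `stride`, and the resulting `(N_obs, N_channels, N_time)` shape can be sketched with plain NumPy. The function `sliding_windows`, the sampling rate, and the random data are all assumptions made for illustration; the library's actual epoching may handle edge cases differently.

```python
import numpy as np


def sliding_windows(data, sfreq, window_length, stride=None):
    """Cut (n_channels, n_samples) data into fixed-length windows.

    Returns an array of shape (n_windows, n_channels, samples_per_window).
    When stride is None it defaults to window_length, i.e. no overlap.
    """
    if stride is None:
        stride = window_length
    win = int(round(window_length * sfreq))
    step = int(round(stride * sfreq))
    n_samples = data.shape[1]
    if win <= 0 or step <= 0 or n_samples < win:
        raise ValueError("window or stride does not fit the recording")
    # Only keep windows that fit entirely inside the recording.
    starts = range(0, n_samples - win + 1, step)
    return np.stack([data[:, s:s + win] for s in starts])


# Hypothetical recording: 4 channels, 10 s at 100 Hz.
raw = np.random.default_rng(0).standard_normal((4, 1000))

no_overlap = sliding_windows(raw, sfreq=100.0, window_length=1.0)
half_overlap = sliding_windows(raw, sfreq=100.0, window_length=1.0, stride=0.5)
print(no_overlap.shape)    # (10, 4, 100)
print(half_overlap.shape)  # (19, 4, 100)
```

Halving the stride roughly doubles the number of observations while each window keeps the same length, so overlapping windows trade independence between observations for more samples.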
- root¶
- task = None¶
- session = None¶
- datatype = 'eeg'¶
- suffix = None¶
- mode = 'epochs'¶
- target_col = None¶
- window_length = None¶
- stride = None¶
- subjects = None¶
- runs = None¶
- event_id = None¶
- subject_metadata_df = None¶
- subject_key = None¶
- tmin = -0.2¶
- tmax = 0.5¶
- baseline = None¶
- load() → coco_pipe.io.structures.DataContainer[source]¶
Load the BIDS dataset into a DataContainer.
- Returns:
A container with:
  - X: Data array of shape (N_obs, N_channels, N_time).
  - ids: Unique identifiers for each observation.
  - coords: Dictionary containing 'channel', 'time', 'obs', and metadata.
  - dims: ('obs', 'channel', 'time').
- Return type:
coco_pipe.io.structures.DataContainer