# Dim Reduction Workflows

## Current State

The dim-reduction stack is centered on the existing `DimReduction` manager and reducer contracts.

- Optional dependencies are now lazy at import time.
- `DimReduction` caches normalized state on:
  - `metrics_`
  - `metric_records_`
  - `quality_metadata_`
  - `diagnostics_`
  - `interpretation_`
  - `interpretation_records_`
- Plotting is done through `coco_pipe.viz.dim_reduction`, not through manager methods on `DimReduction`.
- `DimReduction` does not cache embeddings. Embeddings are returned explicitly from `transform()` and `fit_transform()` and must be passed explicitly to `score()`, plotting, or report-building paths that need them.
- `coco_pipe.dim_reduction.evaluation.core` is the evaluation authority used by `DimReduction.score()`.
- `MethodSelector` is now a post-hoc comparison layer over already-scored `DimReduction` objects and exposes tidy metric observations via `to_frame()`.

## Core Interfaces

Use `DimReduction` directly for most workflows:

```python
from coco_pipe.dim_reduction import DimReduction

reducer = DimReduction("PCA", n_components=2, random_state=42)
embedding = reducer.fit_transform(X, y=labels)
scores = reducer.score(embedding, X=X, labels=labels, times=timepoints)
interpretation = reducer.interpret(
    X,
    X_emb=embedding,
    analyses=["correlation"],
    feature_names=feature_names,
)
summary = reducer.get_summary()
```

`DimReduction.get_summary()` returns cached scalar metrics, reducer metadata, diagnostics, tidy metric records, cached interpretation payloads, cached interpretation records, and capability flags. It does not carry an embedding payload.
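The tidy metric records mentioned above are easier to picture with a small example. The sketch below is not the library's implementation: `to_tidy_records` is a hypothetical helper, and both the nested input dict and the `"k"` / `"global"` scope values are assumptions made for illustration. It only shows how per-`k` scores could flatten into rows with the `method`, `metric`, `value`, `scope`, and `scope_value` columns described under Supported Metric Shapes.

```python
def to_tidy_records(method, scores):
    """Flatten {metric: scalar or {k: value}} into tidy rows.

    Hypothetical helper for illustration; not part of coco_pipe.
    """
    records = []
    for metric, value in scores.items():
        if isinstance(value, dict):
            # One row per neighborhood size k (assumed scope name: "k").
            for k, v in sorted(value.items()):
                records.append(
                    {
                        "method": method,
                        "metric": metric,
                        "value": float(v),
                        "scope": "k",
                        "scope_value": k,
                    }
                )
        else:
            # A single global scalar (assumed scope name: "global").
            records.append(
                {
                    "method": method,
                    "metric": metric,
                    "value": float(value),
                    "scope": "global",
                    "scope_value": None,
                }
            )
    return records


# Made-up scores, used only to show the record shape.
records = to_tidy_records(
    "PCA",
    {"trustworthiness": {5: 0.91, 10: 0.88}, "continuity": 0.93},
)
```

Keeping one observation per row is what would let the same records feed a table, a grouped bar chart, or a line plot over `scope_value` without reshaping.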
Evaluation can be narrowed to specific metric families:

```python
from coco_pipe.dim_reduction.config import EvaluationConfig

config = EvaluationConfig(
    metrics=["trustworthiness", "continuity"],
    selection_metric="trustworthiness",
    selection_k=10,
    tie_breakers=["continuity"],
    separation_method="centroid",
)
```

Each reducer is scored directly:

```python
for reducer in reducers:
    embedding = reducer.fit_transform(X, y=labels)
    reducer.score(
        embedding,
        X=X,
        metrics=config.metrics,
        k_values=config.k_range,
        separation_method=config.separation_method,
    )
    reducer.interpret(
        X,
        X_emb=embedding,
        analyses=["correlation"],
        feature_names=feature_names,
    )
```

Then compare the scored reducers:

```python
from coco_pipe.dim_reduction.evaluation import MethodSelector

selector = MethodSelector(reducers).collect()
ranked = selector.rank_methods(
    selection_metric=config.selection_metric,
    selection_k=config.selection_k,
    tie_breakers=config.tie_breakers,
)
best_name = ranked.iloc[0]["method"]
best = selector.reducers[best_name]
```

When trajectory labels are available, `separation_method` is passed through during `score()` to `trajectory_separation(..., method=...)` for evaluator-level separation summaries.

Feature interpretation is separate from preservation scoring:

- `score()` evaluates whether the embedding preserves structure
- `interpret()` evaluates which input features appear to drive the embedding

`interpret()` delegates to the pure backend `coco_pipe.dim_reduction.analysis.interpret_features(...)` and currently supports:

- `correlation`
- `perturbation`
- `gradient`

## Custom Reducers

`BaseReducer` is a supported extension point and is re-exported from `coco_pipe.dim_reduction`.
```python
from sklearn.decomposition import PCA

from coco_pipe.dim_reduction import BaseReducer


class CustomPCAReducer(BaseReducer):
    @property
    def capabilities(self):
        caps = super().capabilities
        caps.update({"is_linear": True, "has_components": True})
        return caps

    def fit(self, X, y=None):
        self.model = PCA(n_components=self.n_components, **self.params)
        self.model.fit(X)
        return self

    def transform(self, X):
        return self.model.transform(X)
```

For reducers with nonstandard inputs, override `capabilities` to declare the expected `input_ndim` and `input_layout`.

If a reducer depends on heavy optional libraries, keep those imports inside `fit()` / `transform()` paths. The helper `coco_pipe.utils.import_optional_dependency(...)` exists for built-in reducers and custom advanced integrations, but it is not the main public entry point.

## Supported Metric Shapes

Metric plotting and reporting now work from tidy observations with these columns:

- `method`
- `metric`
- `value`
- `scope`
- `scope_value`

Optional columns such as `group`, `condition`, `pair`, `subject`, `session`, `seed`, and `fold` are preserved when present.

## Metric Plot Types

Use `plot_metrics(..., plot_type=...)` or the report comparison helpers. Embedding visualizations are also external to `DimReduction`; pass the explicit embedding array to the plotting function you need.

- `grouped_bar`: one scalar per method/metric
- `box` / `boxen`: repeated observations
- `violin`: dense repeated observations
- `raincloud`: violin + box + points
- `strip` / `swarm`: small repeated samples
- `heatmap`: method x metric or method x scope
- `line`: metric sweeps over `k`, time, or windows
- `dumbbell`: direct two-method deltas

Default behavior:

- global scalars -> grouped bars
- repeated observations -> raincloud
- varying `scope_value` -> line
- explicit matrix summaries -> heatmap

## Visualization Entry Points

The dim-reduction viz surface is data-first and explicit.
Plotting helpers do not read manager-owned embedding or context state.

- `plot_embedding(embedding, labels=..., metadata=...)`
- `plot_metrics(metric_records, metric=..., scope=..., method=...)`
- `plot_shepard_diagram(X, embedding, distances=...)`
- `plot_trajectory(trajectories, times=..., labels=..., values=...)`
- `plot_trajectory_metric_series(series, times=..., labels=...)`
- `plot_feature_importance(scores_or_records, analysis=..., method=..., dimension=...)`
- `plot_feature_correlation_heatmap(correlation_payload, method=...)`
- `plot_interpretation(interpretation_payload, analysis=..., method=..., dimension=...)`

`plot_trajectory(...)` and `plot_trajectory_metric_series(...)` require native trajectory tensors or explicit time-series arrays. They do not reshape flat 2D embeddings or infer grouping metadata.

## Generic Trajectories

Trajectory scoring is not EEG-specific. Any grouped or ordered embedding can use trajectory-native metrics when:

- the embedding is already a 3D tensor `(trajectory, time, dim)`

Trajectory reshaping or unstacking must happen upstream. The evaluation module does not reconstruct 3D trajectories from flat 2D embeddings.

Trajectory outputs include:

- `trajectory_speed_mean`
- `trajectory_speed_peak`
- `trajectory_acceleration_mean`
- `trajectory_acceleration_peak`
- `trajectory_curvature_mean`
- `trajectory_curvature_peak`
- `trajectory_turning_angle_mean`
- `trajectory_turning_angle_peak`
- `trajectory_dispersion_mean`
- `trajectory_dispersion_peak`
- `trajectory_path_length_final`
- `trajectory_displacement_final`
- `trajectory_tortuosity_final`
- pairwise separation AUC / peak summaries when labels exist per trajectory

Detailed timecourses are cached under `diagnostics_`.

`trajectory_dispersion` in the evaluation pipeline is currently the global, unlabeled dispersion over all trajectories. This is narrower than the lower-level `geometry.py` primitive, which can also compute label-conditioned dispersion.
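To make a few of the trajectory outputs listed above concrete, here is a minimal standard-library sketch over a nested-list tensor shaped `(trajectory, time, dim)`. The definitions are assumptions, not the evaluator's confirmed formulas: speed is taken as the Euclidean step length with a unit time step, and tortuosity as path length divided by net displacement. `trajectory_summary` is a hypothetical helper name.

```python
import math


def trajectory_summary(trajectory):
    """Summarize one trajectory given as a list of points (time, dim).

    Assumed definitions (not confirmed against coco_pipe):
    - speed: Euclidean distance between consecutive points (unit time step)
    - tortuosity: total path length / start-to-end displacement
    """
    steps = [math.dist(a, b) for a, b in zip(trajectory, trajectory[1:])]
    path_length = sum(steps)
    displacement = math.dist(trajectory[0], trajectory[-1])
    # A closed loop has zero displacement; report infinite tortuosity.
    tortuosity = path_length / displacement if displacement > 0 else math.inf
    return {
        "speed_mean": path_length / len(steps),
        "speed_peak": max(steps),
        "path_length_final": path_length,
        "displacement_final": displacement,
        "tortuosity_final": tortuosity,
    }


# Two toy trajectories with three time points each in a 2D embedding.
embedding = [
    [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]],  # straight line
    [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]],  # right-angle turn
]
summaries = [trajectory_summary(t) for t in embedding]
```

Note that the input is already three-dimensional: the sketch never infers trajectory boundaries from a flat array, which mirrors the requirement that reshaping happen upstream.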
Trajectory labels are only used automatically for `trajectory_separation`.

Trajectory metrics are descriptive outputs for plotting and reporting. They are not used as automatic method-selection metrics by default.

## Reports

`Report.add_reduction()` consumes `get_summary()` when available and accepts an explicit embedding payload when the section should render an embedding or trajectory plot. It can render:

- interactive embeddings when `X_emb` is provided explicitly
- trajectory plots for 3D embeddings
- scalar metric tables and charts
- loss and scree diagnostics
- co-ranking heatmaps
- trajectory metric timecourses
- interpretation plots from `interpretation` / `interpretation_records`

`Report.add_comparison()` accepts tidy metric frames or `MethodSelector` instances directly.

`from_reductions(...)` follows the same rule: pass `embeddings=[...]` explicitly when the report should include embedding or trajectory plots.

Shepard plots and comparison/report views reuse cached diagnostics such as `shepard_distances_` and `coranking_matrix_` when those artifacts already exist.

## End-to-End Execution

Batch execution should use `coco_pipe.io.load_data` plus `DimReduction` directly. The old `DimReductionPipeline` compatibility wrapper has been removed.
## IO Workflow

The recommended IO-to-dim-reduction flow is explicit:

```python
from coco_pipe.dim_reduction import DimReduction
from coco_pipe.io import load_data

container = load_data("data.csv", mode="tabular", target_col="label", sep=",")
X = container.X
labels = container.y

reducer = DimReduction("UMAP", n_components=2, random_state=42)
embedding = reducer.fit_transform(X, y=labels)
scores = reducer.score(embedding, X=X, labels=labels)
```

Use `DataContainer` to inspect and reshape data upstream:

- `container.dims`
- `container.coords`
- `container.flatten(...)`
- `container.stack(...)`
- `container.unstack(...)`

For example:

```python
container = load_data("embeddings/", mode="embedding")
container_2d = container.flatten(preserve="obs")
X = container_2d.X
embedding = reducer.fit_transform(X)
```

Trajectory scoring requires embeddings that are already shaped as `(n_trajectories, n_times, n_dims)`. The evaluation module does not reconstruct 3D trajectories from flat 2D arrays. Any reshaping or unstacking must happen upstream through IO or explicit user code before calling `score()`.

`coco_pipe.io.utils` remains available for lower-level and advanced workflows, but it is not the recommended starting point for dim reduction. Prefer `load_data(...)`, `DataContainer`, and explicit reshaping operations first.

## Dependency Notes

Heavy optional libraries such as `torch`, `umap`, `meegkit`, and `pydmd` are loaded inside reducer methods rather than at package import time.
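The deferred-import pattern can be sketched with the standard library alone. This is an illustration of the general technique, not the code behind `coco_pipe.utils.import_optional_dependency(...)`, whose signature is not shown on this page; `require_optional` and `LazyReducer` are hypothetical names, and the error message format is an assumption.

```python
import importlib


def require_optional(module_name, extra):
    """Import a module on demand, pointing at the matching extra on failure.

    Hypothetical helper for illustration; not part of coco_pipe.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this reducer. "
            f"Install it with: pip install coco-pipe[{extra}]"
        ) from exc


class LazyReducer:
    """Toy reducer that resolves its backend only when fit() is called."""

    def __init__(self, module_name, extra):
        # Constructing the reducer never touches the heavy dependency.
        self.module_name = module_name
        self.extra = extra
        self.backend = None

    def fit(self, X):
        # The import cost (and any ImportError) is paid here, not at
        # package import time.
        self.backend = require_optional(self.module_name, self.extra)
        return self
```

With this shape, importing the package and constructing a reducer stay cheap, and a missing dependency surfaces as an actionable error at the first `fit()` call.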
For a complete dim-reduction install, use the umbrella extra:

```bash
pip install coco-pipe[dim-red]
```

Selective extras remain available when you only need part of the reducer stack:

```bash
pip install coco-pipe[dask]
pip install coco-pipe[neighbor]
pip install coco-pipe[parametric-umap]
pip install coco-pipe[ivis]
pip install coco-pipe[topology]
pip install coco-pipe[spatiotemporal]
pip install coco-pipe[eeg]
```

The `neighbor` and `dim-red` extras include `faiss-cpu`, so Pacmap can use `nn_backend="faiss"` by default on supported platforms.

Base imports that should remain lightweight:

- `import coco_pipe.io`
- `import coco_pipe.report`
- `import coco_pipe.dim_reduction`

## Migration Notes

Legacy local note files under `coco_pipe/dim_reduction/` were consolidated into this page and can be removed once no longer referenced.