Time-Series Clustering¶

Functions in pyflow_acdc.Time_series_clustering. Representative-period clustering reduces long time-series inputs to a weighted set of scenarios for multi_scenario_TEP(), multi_period_MS_TEP(), and related TEP drivers.

Workflow guides: Transmission Expansion Planning (TEP and MS TEP), Multi-period Transmission Expansion Planning (MP TEP and MP+MS TEP).

`clustering_options`¶

TEP functions accept a clustering_options dict, normally processed by cluster_analysis(). To reload saved clusters without re-running the algorithms, pass precomputed_clusters_path (see Precomputed clusters and the MS TEP example in pyflow_tests/doc_examples/tep/02_multi_scenario_tep.py).

When loading precomputed clusters for NS_MTDC_2025, build the case with years_data="23,24" so that the time-series length matches the JSON payload.

Key	Role
`n_clusters`	Number of representative periods (scenarios) passed to the clustering algorithm.
`time_series`	TS types to include (e.g. `["price", "Load", "WPP"]`). Series on the grid whose `type` is not listed are dropped before clustering.
`central_market`	List of price-zone names whose attached series are kept for market-linked types (`price`, `Load`, `PGL_min`, `PGL_max`, `a_CG`, `b_CG`, `c_CG`). An empty list (`[]`) keeps all zones in the grid — this is the usual choice on multi-zone cases such as `NS_MTDC_2025`. To cluster only around selected hubs, pass e.g. `["NL", "DE"]`; series tied to other zones are then excluded.
`thresholds`	Two-element list `[cv_threshold, correlation_threshold]` used in `filter_data()` and `identify_correlations()`. With `[0, 0.8]` (typical): no CV pre-filtering (`0` disables it), and pairs with `|correlation| > 0.8` form correlated groups. When `cv_threshold > 0`, series whose coefficient of variation exceeds that value are removed before clustering.
`correlation_decisions`	Three-element list `[clean, method, scale_groups]` for `identify_correlations()`. `[True, 3, True]` means: reduce redundant series in each correlated group (`clean`), use method `3` (PCA representative: keep the member most aligned with the group’s first principal component), and scale the kept series by `sqrt(group size)` so merged information is not under-weighted. Set `clean` to `False` to skip correlation reduction. Method `1` keeps the highest-variance member of each group; method `2` replaces the group with a single PC1 column (see `identify_correlations()`).
`cluster_algorithm`	e.g. `kmedoids`, `kmeans_medoids`, `Kmeans`
`precomputed_clusters_path`	JSON path; skips re-clustering when set (see `load_precomputed_clusters_to_grid()`)
`print_details`	When `True`, print filtering statistics (mean, std, CV per series), excluded columns, correlated groups, deduplication choices, and clustering diagnostics to stdout. Use `False` in batch tests to keep output quiet.

The four keys highlighted in the doc example work together as a preprocessing pipeline before the chosen cluster_algorithm runs: restrict which series enter the feature matrix (time_series, central_market), optionally drop high-CV or highly correlated columns (thresholds, correlation_decisions), then form n_clusters representative periods. Set print_details=True while tuning a case; set it False once options are fixed.
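The group reduction described for `correlation_decisions` can be illustrated with a small standalone sketch. It is not the pyflow_acdc implementation: the helper name `reduce_group_highest_variance`, the dict-of-lists data layout, and the series names are hypothetical, and it only mimics the documented behaviour of method `1` (keep the highest-variance member) followed by scaling with `sqrt(group size)`.

```python
import math
import statistics


def reduce_group_highest_variance(series_by_name, group, scale_group=True):
    """Keep the highest-variance member of one correlated group.

    Assumption: `series_by_name` maps a column name to a list of floats and
    `group` lists the names already flagged as mutually correlated.
    """
    # Method 1 in the options table: the member with the largest variance stays.
    keep = max(group, key=lambda name: statistics.pvariance(series_by_name[name]))
    # scale_groups=True: multiply by sqrt(group size) so the single kept
    # column carries a weight comparable to the columns it replaces.
    factor = math.sqrt(len(group)) if scale_group else 1.0
    return keep, [factor * value for value in series_by_name[keep]]


# Hypothetical, strongly correlated load profiles.
series = {
    "load_A": [1.0, 2.0, 3.0, 4.0],
    "load_B": [2.0, 4.0, 6.0, 8.0],
    "load_C": [1.5, 2.5, 3.5, 4.5],
    "load_D": [1.0, 2.1, 2.9, 4.0],
}
kept, scaled = reduce_group_highest_variance(series, list(series))
print(kept, scaled)
```

With four members the scale factor is 2, so the kept column is doubled. Whether the library applies this scaling before or after standardising the features is not stated here, so treat the order as an open detail.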

Examples¶

Runnable scripts live in pyflow_tests/doc_examples/clustering/ and are executed by test_docs_clustering.py.

Precomputed clusters¶

"""Docs: api/clustering.rst — Precomputed clusters"""
import pyflow_acdc as pyf
from pyflow_tests.test_constants import north_sea_ms_clustering_options

grid, _ = pyf.cases["NS_MTDC_2025"](years_data="23,24", expandable=False, online=False)
n_clusters, clustered = pyf.cluster_analysis(grid, north_sea_ms_clustering_options())

assert clustered is True
assert n_clusters == 4
assert 4 in grid.Clusters

Live clustering¶

"""Docs: api/clustering.rst — Live clustering"""
import pyflow_acdc as pyf

grid, _ = pyf.cases["NS_MTDC_2025"](years_data="24", expandable=False, online=False)
clustering_options = {
    "n_clusters": 2,
    "time_series": ["price", "Load"],
    "central_market": [],
    "thresholds": [0, 0.8],
    "correlation_decisions": [False, "1", False],
    "cluster_algorithm": "Kmeans",
    "print_details": False,
}
n_clusters, clustered = pyf.cluster_analysis(grid, clustering_options)

assert clustered is True
assert n_clusters == 2
assert 2 in grid.Clusters

Exploratory sweep¶

"""Docs: api/clustering.rst — Exploratory clustering sweep"""
import tempfile

import pyflow_acdc as pyf
from pyflow_acdc.Time_series_clustering import run_clustering_analysis

grid, _ = pyf.cases["NS_MTDC_2025"](years_data="24", expandable=False, online=False)

with tempfile.TemporaryDirectory() as save_path:
    results = run_clustering_analysis(
        grid,
        save_path=save_path,
        algorithms=["kmeans", "kmeans_medoids", "kmedoids"],
        n_clusters_list=[2, 4],
        time_series=["price", "Load"],
        print_details=False,
        ts_options=[None, 0, 0.8],
        correlation_decisions=[False, "1", False],
        plotting=False,
        identifier="doc_example",
    )

assert len(results) >= 3
assert set(results["algorithm"]) >= {"kmeans", "kmeans_medoids", "kmedoids"}

Cluster analysis¶

cluster_analysis(grid, clustering_options)[source]¶: Main entry point, used inside the TEP drivers when clustering_options is passed. As the examples above show, it returns the pair (n_clusters, clustered).

cluster_TS(grid, n_clusters, time_series=None, central_market=None, algorithm='kmeans', cv_threshold=0, correlation_threshold=0.8, print_details=False, correlation_decisions=None, critical_idx=None, base_critical_ratio=0.5, scaler_type='robust', forced_centers=None, **kwargs)[source]¶

Cluster time-series profiles into representative operating states.

Runs correlation-based reduction (identify_correlations()) and then the selected clustering algorithm, optionally weighting a set of “critical” rows more heavily.

Parameters:

grid (Grid) – Grid whose time series are clustered.
n_clusters (int) – Number of representative states (clusters) to produce.
time_series (list, optional) – TS types to include in the clustering (e.g. ["price", "Load"]).
central_market (list, optional) – Price-zone names whose market-linked series are kept; an empty list keeps all zones.
algorithm (str, optional) – One of 'kmeans', 'kmedoids', 'ward', 'pam_hierarchical' (default 'kmeans').
cv_threshold (float, optional) – Coefficient-of-variation threshold used to filter series before clustering (default 0, which disables the filter).
correlation_threshold (float, optional) – Absolute-correlation threshold above which series are grouped as correlated (default 0.8).
critical_idx (list, optional) – Indices treated as critical (clustered separately).
base_critical_ratio (float or int, optional) – Fraction (or count) of clusters reserved for critical rows.
scaler_type (str, optional) – Scaler used before clustering (default 'robust').
**kwargs – Extra algorithm-specific options, e.g. random_state, n_init, max_iter (kmeans) or method, init, metric (kmedoids).
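To make the idea of "representative operating states" concrete, the following is a minimal pure-Python k-means sketch that turns rows of a feature matrix into cluster labels and scenario weights. It is an illustration only and does not call cluster_TS(): the function `kmeans`, the deterministic initialisation from the first rows, the toy (price, load) rows, and the weight definition (cluster share of all time steps) are assumptions made for this example.

```python
import math


def kmeans(rows, n_clusters, max_iter=50):
    """Plain Lloyd's k-means; centres start at the first n_clusters rows."""
    centres = [list(row) for row in rows[:n_clusters]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each row goes to its nearest centre.
        new_labels = [
            min(range(n_clusters), key=lambda c: math.dist(row, centres[c]))
            for row in rows
        ]
        # Update step: each centre moves to the mean of its members.
        for c in range(n_clusters):
            members = [row for row, lab in zip(rows, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centre
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centres


# Hypothetical (price, load) rows, one per time step, already scaled.
rows = [
    (1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (0.8, 1.2),
    (9.6, 8.4), (1.0, 1.1), (9.0, 9.3),
]
n_clusters = 2
labels, centres = kmeans(rows, n_clusters)

# Scenario weight = share of time steps represented by each cluster.
weights = [labels.count(c) / len(labels) for c in range(n_clusters)]
print(labels, centres, weights)
```

A real run would use a randomised or k-means++ initialisation with several restarts (compare the n_init and random_state options listed above); the fixed start here only keeps the example reproducible.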

identify_correlations(grid, time_series=None, correlation_threshold=0, cv_threshold=0, central_market=None, print_details=False, correlation_decisions=None)[source]¶

Identify highly correlated time series variables.

Parameters:

grid – Grid object containing time series
correlation_threshold – Correlation coefficient threshold (default: 0.8)
cv_threshold – Minimum variance threshold (default: 0)

Returns:

Dictionary containing:

correlation_matrix: Full correlation matrix
high_correlations: List of tuples (var1, var2, corr_value) for highly correlated pairs
groups: List of groups of correlated variables

Return type:

dict
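The pair-and-group logic behind this return value can be sketched without the library. The snippet below is a simplified stand-in, not the actual identify_correlations() code: `pearson`, `correlated_groups`, and the series names are hypothetical, the full correlation matrix is omitted, and groups are formed by merging every pair whose absolute Pearson correlation exceeds the threshold.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length lists (neither may be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlated_groups(series_by_name, correlation_threshold=0.8):
    names = list(series_by_name)
    parent = {name: name for name in names}  # union-find forest

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    high_correlations = []
    for a, b in combinations(names, 2):
        r = pearson(series_by_name[a], series_by_name[b])
        if abs(r) > correlation_threshold:  # sign is ignored, as with |correlation|
            high_correlations.append((a, b, r))
            parent[find(a)] = find(b)  # merge the two groups

    members = {}
    for name in names:
        members.setdefault(find(name), []).append(name)
    groups = [group for group in members.values() if len(group) > 1]
    return {"high_correlations": high_correlations, "groups": groups}


# Hypothetical series: two similar prices, one unrelated wind profile,
# and one series that mirrors the first price (negative correlation).
series = {
    "price_NL": [10.0, 20.0, 30.0, 40.0, 50.0],
    "price_DE": [12.0, 19.0, 33.0, 41.0, 48.0],
    "wpp_1": [5.0, 1.0, 4.0, 2.0, 3.0],
    "net_import": [50.0, 40.0, 30.0, 20.0, 10.0],
}
result = correlated_groups(series, correlation_threshold=0.8)
print(result["groups"])
```

Because the test uses the absolute value, the mirrored series joins the same group as the two prices, while the wind profile stays on its own and is therefore not listed as a group.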

Precomputed clusters¶

load_precomputed_clusters_to_grid(grid, precomputed=None, precomputed_path=None, fallback_n_clusters=0)[source]¶

Exploratory analysis¶

See Exploratory sweep for a minimal sweep with run_clustering_analysis().

run_clustering_analysis(grid, save_path='clustering_results', algorithms=None, n_clusters_list=None, time_series=None, print_details=False, ts_options=None, correlation_decisions=None, plotting=False, plotting_options=None, identifier=None)[source]¶

Sweep clustering algorithms and cluster counts for exploratory analysis.

Runs cluster_TS() for each (algorithm, n_clusters) pair, collects quality metrics, optionally saves representative-period plots, and writes clustering_summary_<identifier>.csv under save_path.

Parameters:

grid (Grid) – Grid with attached time series.
save_path (str) – Output directory for CSV summaries and optional plots.
algorithms (list of str, optional) – Clustering methods passed to cluster_TS() (default includes kmeans, kmedoids, ward, pam_hierarchical).
n_clusters_list (list of int, optional) – Cluster counts to test (defaults to DEFAULT_CLUSTER_NUMBERS).
time_series (list, optional) – TS types to include (empty list keeps grid defaults).
print_details (bool) – Verbose clustering diagnostics from cluster_TS().
ts_options (list, optional) – [central_market, cv_threshold, correlation_threshold] for filtering.
correlation_decisions (list, optional) – Passed through to identify_correlations().
plotting (bool) – When True, save time-series plots per sweep step.
plotting_options (list, optional) – [variable_name_or_None, file_extension] for plots.
identifier (str, optional) – Suffix for output filenames.

Returns:

One row per successful (algorithm, n_clusters) run with timing and quality metrics.

Return type:

pandas.DataFrame

The recorded quality metrics include coefficient of variation, inertia, and the Davies–Bouldin index. Set plotting=True to save representative-period plots while sweeping. Use the sweep to tune clustering_options before calling TEP; production solves normally use cluster_analysis() inside the TEP drivers.
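For readers unfamiliar with the Davies–Bouldin index mentioned among the quality metrics, here is a small standalone sketch of the standard definition (mean, over clusters, of the worst ratio of summed within-cluster scatter to centre separation; lower is better). The function `davies_bouldin` and the toy data are written for this example and are not taken from pyflow_acdc, which may compute the metric through another library.

```python
import math


def davies_bouldin(rows, labels):
    """Davies-Bouldin index for labelled rows; needs at least two clusters."""
    clusters = sorted(set(labels))
    centres, scatter = {}, {}
    for c in clusters:
        members = [row for row, lab in zip(rows, labels) if lab == c]
        centre = [sum(col) / len(members) for col in zip(*members)]
        centres[c] = centre
        # Scatter: mean Euclidean distance of the members to their centre.
        scatter[c] = sum(math.dist(row, centre) for row in members) / len(members)

    worst_ratios = []
    for i in clusters:
        # For cluster i, take the most similar (worst separated) other cluster.
        worst_ratios.append(max(
            (scatter[i] + scatter[j]) / math.dist(centres[i], centres[j])
            for j in clusters if j != i
        ))
    return sum(worst_ratios) / len(clusters)


# Toy one-dimensional data with an obvious two-cluster structure.
rows = [(0.0,), (2.0,), (10.0,), (12.0,)]
db_good = davies_bouldin(rows, [0, 0, 1, 1])  # compact, well-separated clusters
db_bad = davies_bouldin(rows, [0, 1, 0, 1])   # interleaved clusters
print(db_good, db_bad)
```

The well-separated labelling scores 0.2 and the interleaved one 5.0, which is the direction to look for when comparing rows of the sweep summary.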

run_clustering_analysis_and_plot(grid, algorithms=None, n_clusters_list=None, path='clustering_results', time_series=None, print_details=False, ts_options=None, correlation_decisions=None, plotting_options=None, identifier=None)[source]¶
