Time-Series Clustering

Functions in pyflow_acdc.Time_series_clustering. Representative-period clustering reduces long time-series inputs to a weighted set of scenarios for multi_scenario_TEP(), multi_period_MS_TEP(), and related TEP drivers.

Workflow guides: Transmission Expansion Planning (TEP and MS TEP) and Multi-period Transmission Expansion Planning (MP TEP and MP+MS TEP).

clustering_options

TEP functions accept a clustering_options dict, normally processed by cluster_analysis(). To reload saved clusters without re-running the algorithms, pass precomputed_clusters_path (see Precomputed clusters and the MS TEP example in pyflow_tests/doc_examples/tep/02_multi_scenario_tep.py).

When loading precomputed clusters for NS_MTDC_2025, build the case with years_data="23,24" so that the time-series length matches the JSON payload.

Keys and their roles:

  • n_clusters – Number of representative periods (scenarios) passed to the clustering algorithm.

  • time_series – TS types to include (e.g. ["price", "Load", "WPP"]). Series on the grid whose type is not listed are dropped before clustering.

  • central_market – List of price-zone names whose attached series are kept for market-linked types (price, Load, PGL_min, PGL_max, a_CG, b_CG, c_CG). An empty list ([]) keeps all zones in the grid; this is the usual choice on multi-zone cases such as NS_MTDC_2025. To cluster only around selected hubs, pass e.g. ["NL", "DE"]; series tied to other zones are then excluded.

  • thresholds – Two-element list [cv_threshold, correlation_threshold] used in filter_data() and identify_correlations(). With [0, 0.8] (typical), there is no CV pre-filtering (0 disables it), and pairs with |correlation| > 0.8 form correlated groups. When cv_threshold > 0, series whose coefficient of variation exceeds that value are removed before clustering.

  • correlation_decisions – Three-element list [clean, method, scale_groups] for identify_correlations(). [True, 3, True] means: reduce redundant series in each correlated group (clean), use method 3 (PCA representative: keep the member most aligned with the group’s first principal component), and scale the kept series by sqrt(group size) so that merged information is not under-weighted. Set clean to False to skip correlation reduction. Method 1 keeps the highest-variance member; method 2 replaces the group with a single PC1 column (see identify_correlations()).

  • cluster_algorithm – Name of the clustering algorithm, e.g. kmedoids, kmeans_medoids, Kmeans.

  • precomputed_clusters_path – Path to a JSON file of saved clusters; when set, re-clustering is skipped (see load_precomputed_clusters_to_grid()).

  • print_details – When True, print filtering statistics (mean, std, CV per series), excluded columns, correlated groups, deduplication choices, and clustering diagnostics to stdout. Use False in batch tests to keep output quiet.

Four of these keys form a preprocessing pipeline that runs before the chosen cluster_algorithm: time_series and central_market restrict which series enter the feature matrix, thresholds and correlation_decisions optionally drop high-CV or highly correlated columns, and the algorithm then forms n_clusters representative periods. Set print_details=True while tuning a case and False once the options are fixed.
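To make the thresholds entry more concrete, here is a minimal pure-Python sketch of the two statistics involved: the coefficient of variation of each series, and the pairwise Pearson correlation compared against correlation_threshold. It is an illustration only and not the code inside filter_data() or identify_correlations(); the helper names and the toy series are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the absolute mean."""
    mu = mean(values)
    return pstdev(values) / abs(mu) if mu != 0 else float("inf")


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return cov / norm if norm else 0.0


def correlated_pairs(series, correlation_threshold):
    """Return (name_a, name_b, corr) for pairs with |corr| above the threshold."""
    pairs = []
    for a, b in combinations(sorted(series), 2):
        corr = pearson(series[a], series[b])
        if abs(corr) > correlation_threshold:
            pairs.append((a, b, corr))
    return pairs


# Toy data (hypothetical): two price series that move together, one load series.
series = {
    "price_NL": [40.0, 55.0, 70.0, 50.0, 45.0],
    "price_DE": [42.0, 56.0, 73.0, 51.0, 44.0],
    "Load_NL": [10.0, 8.0, 9.0, 12.0, 11.0],
}
# Same layout as the "thresholds" key; cv_threshold=0 means no CV filtering here.
cv_threshold, correlation_threshold = [0, 0.8]

cv = {name: coefficient_of_variation(vals) for name, vals in series.items()}
pairs = correlated_pairs(series, correlation_threshold)
```

In this toy case only the two price series exceed the 0.8 threshold, so they would be candidates for a single correlated group, while the load series stays on its own.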

Examples

Runnable scripts live in pyflow_tests/doc_examples/clustering/ and are executed by test_docs_clustering.py.

Precomputed clusters

"""Docs: api/clustering.rst — Precomputed clusters"""
import pyflow_acdc as pyf
from pyflow_tests.test_constants import north_sea_ms_clustering_options

grid, _ = pyf.cases["NS_MTDC_2025"](years_data="23,24", expandable=False, online=False)
n_clusters, clustered = pyf.cluster_analysis(grid, north_sea_ms_clustering_options())

assert clustered is True
assert n_clusters == 4
assert 4 in grid.Clusters

Live clustering

"""Docs: api/clustering.rst — Live clustering"""
import pyflow_acdc as pyf

grid, _ = pyf.cases["NS_MTDC_2025"](years_data="24", expandable=False, online=False)
clustering_options = {
    "n_clusters": 2,
    "time_series": ["price", "Load"],
    "central_market": [],
    "thresholds": [0, 0.8],
    "correlation_decisions": [False, "1", False],
    "cluster_algorithm": "Kmeans",
    "print_details": False,
}
n_clusters, clustered = pyf.cluster_analysis(grid, clustering_options)

assert clustered is True
assert n_clusters == 2
assert 2 in grid.Clusters

Exploratory sweep

"""Docs: api/clustering.rst — Exploratory clustering sweep"""
import tempfile

import pyflow_acdc as pyf
from pyflow_acdc.Time_series_clustering import run_clustering_analysis

grid, _ = pyf.cases["NS_MTDC_2025"](years_data="24", expandable=False, online=False)

with tempfile.TemporaryDirectory() as save_path:
    results = run_clustering_analysis(
        grid,
        save_path=save_path,
        algorithms=["kmeans", "kmeans_medoids", "kmedoids"],
        n_clusters_list=[2, 4],
        time_series=["price", "Load"],
        print_details=False,
        ts_options=[None, 0, 0.8],
        correlation_decisions=[False, "1", False],
        plotting=False,
        identifier="doc_example",
    )

assert len(results) >= 3
assert set(results["algorithm"]) >= {"kmeans", "kmeans_medoids", "kmedoids"}

Cluster analysis

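cluster_analysis(grid, clustering_options)

Main entry point, called inside the TEP drivers when clustering_options is passed. The examples above unpack its return value as (n_clusters, clustered) and then find the result under grid.Clusters.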

cluster_TS(grid, n_clusters, time_series=None, central_market=None, algorithm='kmeans', cv_threshold=0, correlation_threshold=0.8, print_details=False, correlation_decisions=None, critical_idx=None, base_critical_ratio=0.5, scaler_type='robust', forced_centers=None, **kwargs)

Cluster time-series profiles into representative operating states.

Runs correlation-based reduction (identify_correlations()) and then the selected clustering algorithm, optionally weighting a set of “critical” rows more heavily.

Parameters:
  • grid (Grid) – Grid whose time series are clustered.

  • n_clusters (int) – Number of representative states (clusters) to produce.

  • time_series (list, optional) – TS types to include in the clustering (see clustering_options).

  • central_market (list, optional) – Price-zone names whose market-linked series are kept (see clustering_options).

  • algorithm (str, optional) – One of 'kmeans', 'kmedoids', 'ward', 'pam_hierarchical' (default 'kmeans').

  • cv_threshold (float, optional) – Coefficient-of-variation threshold used for filtering before reduction (default 0, which disables CV filtering).

  • correlation_threshold (float, optional) – Absolute correlation above which pairs of series are grouped as correlated (default 0.8).

  • critical_idx (list, optional) – Indices treated as critical (clustered separately).

  • base_critical_ratio (float or int, optional) – Fraction (or count) of clusters reserved for critical rows.

  • scaler_type (str, optional) – Scaler used before clustering (default 'robust').

  • **kwargs – Extra algorithm-specific options, e.g. random_state, n_init, max_iter (kmeans) or method, init, metric (kmedoids).
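The clustering_options keys line up with the cluster_TS() parameters listed above, with thresholds split into its two components and cluster_algorithm renamed to algorithm. The sketch below spells out that correspondence with a hypothetical helper, cluster_ts_kwargs; how cluster_analysis() performs the translation internally is not documented here, so treat this as an assumption-laden illustration. Fallback values are taken from the cluster_TS() signature.

```python
def cluster_ts_kwargs(clustering_options):
    """Translate a clustering_options dict into cluster_TS() keyword arguments.

    Hypothetical helper for illustration; cluster_analysis() may perform this
    mapping differently.
    """
    # "thresholds" is [cv_threshold, correlation_threshold].
    cv_threshold, correlation_threshold = clustering_options.get("thresholds", [0, 0.8])
    return {
        "n_clusters": clustering_options["n_clusters"],
        "time_series": clustering_options.get("time_series"),
        "central_market": clustering_options.get("central_market"),
        "algorithm": clustering_options.get("cluster_algorithm", "kmeans"),
        "cv_threshold": cv_threshold,
        "correlation_threshold": correlation_threshold,
        "correlation_decisions": clustering_options.get("correlation_decisions"),
        "print_details": clustering_options.get("print_details", False),
    }


clustering_options = {
    "n_clusters": 2,
    "time_series": ["price", "Load"],
    "central_market": [],
    "thresholds": [0, 0.8],
    "correlation_decisions": [False, "1", False],
    "cluster_algorithm": "kmeans",
    "print_details": False,
}
kwargs = cluster_ts_kwargs(clustering_options)
# With pyflow_acdc installed, these could then be passed as cluster_TS(grid, **kwargs).
```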

identify_correlations(grid, time_series=None, correlation_threshold=0, cv_threshold=0, central_market=None, print_details=False, correlation_decisions=None)

Identify highly correlated time series variables.

Parameters:
  • grid – Grid object containing time series

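  • correlation_threshold – Absolute correlation coefficient above which two series count as highly correlated (the signature default is 0; cluster_TS() passes 0.8 by default)

  • cv_threshold – Coefficient-of-variation threshold (default: 0, which disables CV filtering)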

Returns:

Dictionary containing:
  • correlation_matrix: Full correlation matrix

  • high_correlations: List of tuples (var1, var2, corr_value) for highly correlated pairs

  • groups: List of groups of correlated variables

Return type:

dict
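One way to picture the groups entry is as the connected components of a graph whose edges are the high_correlations pairs. The sketch below builds such groups with a small union-find and then applies the "method 1" rule described under correlation_decisions (keep the highest-variance member) together with the sqrt(group size) scaling. Whether identify_correlations() merges pairs transitively like this, and at which stage it applies the scaling, are assumptions; the function names and toy numbers are hypothetical.

```python
from math import sqrt
from statistics import pvariance


def group_correlated(high_correlations):
    """Merge (var1, var2, corr) pairs into groups via union-find."""
    parent = {}

    def find(name):
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for var1, var2, _corr in high_correlations:
        parent[find(var1)] = find(var2)

    groups = {}
    for name in list(parent):
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


def highest_variance_representative(group, series):
    """Keep the highest-variance member, scaled by sqrt(len(group))."""
    keep = max(group, key=lambda name: pvariance(series[name]))
    scale = sqrt(len(group))
    return keep, [scale * value for value in series[keep]]


# Toy input (hypothetical) in the documented (var1, var2, corr_value) layout.
high_correlations = [("price_NL", "price_DE", 0.97), ("price_DE", "price_BE", 0.91)]
series = {
    "price_NL": [40.0, 55.0, 70.0, 50.0],
    "price_DE": [42.0, 56.0, 73.0, 51.0],
    "price_BE": [41.0, 50.0, 60.0, 49.0],
}

groups = group_correlated(high_correlations)
keep, scaled = highest_variance_representative(groups[0], series)
```

Here the two pairs share price_DE, so all three series end up in one group, and the kept column is multiplied by sqrt(3).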

Precomputed clusters

load_precomputed_clusters_to_grid(grid, precomputed=None, precomputed_path=None, fallback_n_clusters=0)

Exploratory analysis

See Exploratory sweep for a minimal sweep with run_clustering_analysis().

run_clustering_analysis(grid, save_path='clustering_results', algorithms=None, n_clusters_list=None, time_series=None, print_details=False, ts_options=None, correlation_decisions=None, plotting=False, plotting_options=None, identifier=None)

Sweep clustering algorithms and cluster counts for exploratory analysis.

Runs cluster_TS() for each (algorithm, n_clusters) pair, collects quality metrics, optionally saves representative-period plots, and writes clustering_summary_<identifier>.csv under save_path.

Parameters:
  • grid (Grid) – Grid with attached time series.

  • save_path (str) – Output directory for CSV summaries and optional plots.

  • algorithms (list of str, optional) – Clustering methods passed to cluster_TS() (default includes kmeans, kmedoids, ward, pam_hierarchical).

  • n_clusters_list (list of int, optional) – Cluster counts to test (defaults to DEFAULT_CLUSTER_NUMBERS).

  • time_series (list, optional) – TS types to include (empty list keeps grid defaults).

  • print_details (bool) – Verbose clustering diagnostics from cluster_TS().

  • ts_options (list, optional) – [central_market, cv_threshold, correlation_threshold] for filtering.

  • correlation_decisions (list, optional) – Passed through to identify_correlations().

  • plotting (bool) – When True, save time-series plots per sweep step.

  • plotting_options (list, optional) – [variable_name_or_None, file_extension] for plots.

  • identifier (str, optional) – Suffix for output filenames.

Returns:

One row per successful (algorithm, n_clusters) run with timing and quality metrics.

Return type:

pandas.DataFrame

The recorded quality metrics include the coefficient of variation, inertia, and the Davies–Bouldin index. Use the sweep to tune clustering_options before calling TEP; production solves normally use cluster_analysis() inside the TEP drivers.
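Because each (algorithm, n_clusters) pair is one run, the size of a sweep is the product of the two lists. The sketch below enumerates the pairs for the settings used in the Exploratory sweep example and packs a clustering_options-style dict into the ts_options layout [central_market, cv_threshold, correlation_threshold]. The helper name sweep_arguments is hypothetical and nothing here calls pyflow_acdc; note also that the doc example passes None rather than an empty list as the central-market entry, and whether the two are treated identically is not stated in this page.

```python
from itertools import product


def sweep_arguments(clustering_options, algorithms, n_clusters_list):
    """Build run_clustering_analysis() keyword arguments (hypothetical helper)."""
    cv_threshold, correlation_threshold = clustering_options["thresholds"]
    return {
        "algorithms": list(algorithms),
        "n_clusters_list": list(n_clusters_list),
        "time_series": clustering_options["time_series"],
        # ts_options layout: [central_market, cv_threshold, correlation_threshold]
        "ts_options": [
            clustering_options.get("central_market"),
            cv_threshold,
            correlation_threshold,
        ],
        "correlation_decisions": clustering_options["correlation_decisions"],
    }


options = {
    "time_series": ["price", "Load"],
    "central_market": [],
    "thresholds": [0, 0.8],
    "correlation_decisions": [False, "1", False],
}
kwargs = sweep_arguments(options, ["kmeans", "kmeans_medoids", "kmedoids"], [2, 4])

# Every (algorithm, n_clusters) combination the sweep would attempt.
runs = list(product(kwargs["algorithms"], kwargs["n_clusters_list"]))
```

With three algorithms and two cluster counts the sweep attempts six runs; the returned DataFrame has one row for each run that succeeds.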

run_clustering_analysis_and_plot(grid, algorithms=None, n_clusters_list=None, path='clustering_results', time_series=None, print_details=False, ts_options=None, correlation_decisions=None, plotting_options=None, identifier=None)