bluemath_tk.datamining package

Submodules

bluemath_tk.datamining.kma module

class bluemath_tk.datamining.kma.KMA(num_clusters: int, seed: int = None)[source]

Bases: BaseClustering

K-Means Algorithm (KMA) class.

This class performs the K-Means algorithm on a given dataframe.

num_clusters

The number of clusters to use in the K-Means algorithm.

Type:

int

seed

The random seed used to select the initial data point.

Type:

int

data_variables

A list with all data variables.

Type:

List[str]

directional_variables

A list with directional variables.

Type:

List[str]

fitting_variables

A list with fitting variables.

Type:

List[str]

custom_scale_factor

A dictionary of custom scale factors.

Type:

dict

scale_factor

A dictionary of scale factors (after normalizing the data).

Type:

dict

centroids

The selected centroids.

Type:

pd.DataFrame

normalized_centroids

The selected normalized centroids.

Type:

pd.DataFrame

centroid_real_indices

The real indices of the selected centroids.

Type:

np.array

Notes

  • The K-Means algorithm is used to cluster data points into k clusters.

  • The K-Means algorithm is sensitive to the initial centroids.

  • The K-Means algorithm can become computationally expensive for large datasets.

Examples

import numpy as np
import pandas as pd
from bluemath_tk.datamining.kma import KMA

data = pd.DataFrame(
    {
        'Hs': np.random.rand(1000) * 7,
        'Tp': np.random.rand(1000) * 20,
        'Dir': np.random.rand(1000) * 360
    }
)
kma = KMA(num_clusters=5)
nearest_centroids_idxs, nearest_centroids_df = kma.fit_predict(
    data=data,
    directional_variables=['Dir'],
)

kma.plot_selected_centroids(plot_text=True)
(<Figure size 640x480 with 10 Axes>,
 array([[<Axes: xlabel='Tp', ylabel='Hs'>, <Axes: >, <Axes: >, <Axes: >],
        [<Axes: >, <Axes: xlabel='Dir', ylabel='Tp'>, <Axes: >, <Axes: >],
        [<Axes: >, <Axes: >, <Axes: xlabel='Dir_u', ylabel='Dir'>,
         <Axes: >],
        [<Axes: >, <Axes: >, <Axes: >,
         <Axes: xlabel='Dir_v', ylabel='Dir_u'>]], dtype=object))
[Image: scatter plot matrix of data and selected centroids]
property data: DataFrame

Returns the original data used for clustering.

property data_to_fit: DataFrame

Returns the data used for fitting the K-Means algorithm.

fit(data: DataFrame, directional_variables: List[str] = [], custom_scale_factor: dict = {}, min_number_of_points: int = None, max_number_of_iterations: int = 10, normalize_data: bool = True) None[source]

Fit the K-Means algorithm to the provided data.

This method initializes centroids for the K-Means algorithm using the provided dataframe and custom scale factor. It normalizes the data, and returns the calculated centroids.

TODO: Implement regression-guided KMA (clustering guided by a target variable).

Parameters:
  • data (pd.DataFrame) – The input data to be used for the KMA algorithm.

  • directional_variables (List[str], optional) – A list of directional variables (will be transformed to u and v). Default is [].

  • custom_scale_factor (dict, optional) – A dictionary specifying custom scale factors for normalization. Default is {}.

  • min_number_of_points (int, optional) – The minimum number of points to consider a cluster. Default is None.

  • max_number_of_iterations (int, optional) – The maximum number of iterations for the K-Means algorithm. This is used when min_number_of_points is not None. Default is 10.

  • normalize_data (bool, optional) – A flag to normalize the data. Default is True.
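
A minimal fit sketch on the synthetic dataframe from the class example. The [min, max] form of custom_scale_factor and the parameter values are illustrative assumptions, not prescribed defaults.

import numpy as np
import pandas as pd
from bluemath_tk.datamining.kma import KMA

data = pd.DataFrame(
    {
        'Hs': np.random.rand(1000) * 7,
        'Tp': np.random.rand(1000) * 20,
        'Dir': np.random.rand(1000) * 360
    }
)
kma = KMA(num_clusters=5, seed=42)
# Assumed convention: custom_scale_factor maps variable -> [min, max].
# min_number_of_points activates the iterative re-clustering capped by
# max_number_of_iterations.
kma.fit(
    data=data,
    directional_variables=['Dir'],
    custom_scale_factor={'Hs': [0, 7]},
    min_number_of_points=10,
)
print(kma.centroids)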

fit_predict(data: DataFrame, directional_variables: List[str] = [], custom_scale_factor: dict = {}, min_number_of_points: int = None, max_number_of_iterations: int = 10, normalize_data: bool = True) Tuple[DataFrame, DataFrame][source]

Fit the K-Means algorithm to the provided data and predict the nearest centroid for each data point.

Parameters:
  • data (pd.DataFrame) – The input data to be used for the KMA algorithm.

  • directional_variables (List[str], optional) – A list of directional variables (will be transformed to u and v). Default is [].

  • custom_scale_factor (dict, optional) – A dictionary specifying custom scale factors for normalization. Default is {}.

  • min_number_of_points (int, optional) – The minimum number of points to consider a cluster. Default is None.

  • max_number_of_iterations (int, optional) – The maximum number of iterations for the K-Means algorithm. This is used when min_number_of_points is not None. Default is 10.

  • normalize_data (bool, optional) – A flag to normalize the data. Default is True.

Returns:

A tuple containing the nearest centroid index for each data point, and the nearest centroids.

Return type:

Tuple[pd.DataFrame, pd.DataFrame]

property kma: KMeans
property normalized_data: DataFrame

Returns the normalized data used for clustering.

predict(data: DataFrame) Tuple[DataFrame, DataFrame][source]

Predict the nearest centroid for the provided data.

Parameters:

data (pd.DataFrame) – The input data to be used for the prediction.

Returns:

A tuple containing the nearest centroid index for each data point, and the nearest centroids.

Return type:

Tuple[pd.DataFrame, pd.DataFrame]
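
A self-contained sketch of predict on held-out rows; the train/test split is illustrative.

import numpy as np
import pandas as pd
from bluemath_tk.datamining.kma import KMA

train = pd.DataFrame(
    {
        'Hs': np.random.rand(1000) * 7,
        'Tp': np.random.rand(1000) * 20,
        'Dir': np.random.rand(1000) * 360
    }
)
kma = KMA(num_clusters=5)
kma.fit(data=train, directional_variables=['Dir'])
# Predict the nearest centroid for previously unseen data.
new_data = train.sample(100, random_state=0)
nearest_idxs, nearest_df = kma.predict(data=new_data)
print(nearest_df.head())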

exception bluemath_tk.datamining.kma.KMAError(message: str = 'KMA error occurred.')[source]

Bases: Exception

Custom exception for KMA class.

bluemath_tk.datamining.lhs module

class bluemath_tk.datamining.lhs.LHS(num_dimensions: int, seed: int = 1)[source]

Bases: BaseSampling

Latin Hypercube Sampling (LHS) class.

This class performs the LHS algorithm to generate space-filling samples.

num_dimensions

The number of dimensions to use in the LHS algorithm.

Type:

int

seed

The random seed to use.

Type:

int

lhs

The Latin Hypercube object.

Type:

qmc.LatinHypercube

data

The LHS samples dataframe.

Type:

pd.DataFrame

generate(dimensions_names, lower_bounds, upper_bounds, num_samples)[source]

Generate LHS samples.

Notes

  • This class is designed to perform the LHS algorithm.

Examples

>>> from bluemath_tk.datamining.lhs import LHS
>>> dimensions_names = ['CM', 'SS', 'Qb']
>>> lower_bounds = [0.5, -0.2, 1]
>>> upper_bounds = [5.3, 1.5, 200]
>>> lhs = LHS(num_dimensions=3, seed=0)
>>> lhs_sampled_df = lhs.generate(
...     dimensions_names=dimensions_names,
...     lower_bounds=lower_bounds,
...     upper_bounds=upper_bounds,
...     num_samples=100,
... )
property data: DataFrame
generate(dimensions_names: List[str], lower_bounds: List[float], upper_bounds: List[float], num_samples: int) DataFrame[source]

Generate LHS samples.

Parameters:
  • dimensions_names (List[str]) – The names of the dimensions.

  • lower_bounds (List[float]) – The lower bounds of the dimensions.

  • upper_bounds (List[float]) – The upper bounds of the dimensions.

  • num_samples (int) – The number of samples to generate. Must be greater than 0.

Returns:

self.data – The LHS samples.

Return type:

pd.DataFrame
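
A short sketch pairing generate with the inherited plot_generated_data (documented under BaseSampling below); bounds and dimension names follow the doctest above.

from bluemath_tk.datamining.lhs import LHS

lhs = LHS(num_dimensions=3, seed=0)
samples = lhs.generate(
    dimensions_names=['CM', 'SS', 'Qb'],
    lower_bounds=[0.5, -0.2, 1],
    upper_bounds=[5.3, 1.5, 200],
    num_samples=100,
)
# Each column should lie within its [lower, upper] bounds.
print(samples.describe())
# Scatter plot matrix of the generated samples.
fig, axes = lhs.plot_generated_data()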

property lhs: LatinHypercube
exception bluemath_tk.datamining.lhs.LHSError(message: str = 'LHS error occurred.')[source]

Bases: Exception

Custom exception for LHS class.

bluemath_tk.datamining.mda module

class bluemath_tk.datamining.mda.MDA(num_centers: int)[source]

Bases: BaseClustering

Maximum Dissimilarity Algorithm (MDA) class.

This class performs the MDA algorithm on a given dataframe.

num_centers

The number of centers to use in the MDA algorithm.

Type:

int

data

The input data.

Type:

pd.DataFrame

normalized_data

The normalized input data.

Type:

pd.DataFrame

data_to_fit

The data to fit the MDA algorithm.

Type:

pd.DataFrame

data_variables

A list with all data variables.

Type:

List[str]

directional_variables

A list with directional variables.

Type:

List[str]

fitting_variables

A list with fitting variables.

Type:

List[str]

custom_scale_factor

A dictionary of custom scale factors.

Type:

dict

scale_factor

A dictionary of scale factors (after normalizing the data).

Type:

dict

centroids

The selected centroids.

Type:

pd.DataFrame

normalized_centroids

The selected normalized centroids.

Type:

pd.DataFrame

centroid_iterative_indices

A list of iterative indices of the centroids.

Type:

List[int]

centroid_real_indices

The real indices of the selected centroids.

Type:

List[int]

fit(data, directional_variables, custom_scale_factor, first_centroid_seed)[source]

Fit the MDA algorithm to the provided data.

predict(data)[source]

Predict the nearest centroid for the provided data.

fit_predict(data, directional_variables, custom_scale_factor, first_centroid_seed)[source]

Fits the MDA model to the data and predicts the nearest centroids.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from bluemath_tk.datamining.mda import MDA
>>> data = pd.DataFrame(
...     {
...         'Hs': np.random.rand(1000) * 7,
...         'Tp': np.random.rand(1000) * 20,
...         'Dir': np.random.rand(1000) * 360
...     }
... )
>>> mda = MDA(num_centers=10)
>>> nearest_centroids_idxs, nearest_centroids_df = mda.fit_predict(
...     data=data,
...     directional_variables=['Dir'],
... )
property data: DataFrame
property data_to_fit: DataFrame
fit(data: DataFrame, directional_variables: List[str] = [], custom_scale_factor: dict = {}, first_centroid_seed: int = None) None[source]

Fit the Maximum Dissimilarity Algorithm (MDA) to the provided data.

This method initializes centroids for the MDA algorithm using the provided dataframe, directional variables, and custom scale factor. It normalizes the data, iteratively selects centroids based on maximum dissimilarity, and denormalizes the centroids before returning them.

Parameters:
  • data (pd.DataFrame) – The input data to be used for the MDA algorithm.

  • directional_variables (List[str], optional) – A list of names of the directional variables within the data. Default is [].

  • custom_scale_factor (dict, optional) – A dictionary specifying custom scale factors for normalization. Default is {}.

  • first_centroid_seed (int, optional) – The index of the first centroid to use in the MDA algorithm. Default is None.

Notes

  • The function assumes that the data is validated by the validate_data_mda decorator before execution.

  • When first_centroid_seed is not provided, the data point with the maximum value is used as the first centroid.
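
A minimal fit sketch on the synthetic dataframe from the class example; first_centroid_seed=0 is an illustrative choice.

import numpy as np
import pandas as pd
from bluemath_tk.datamining.mda import MDA

data = pd.DataFrame(
    {
        'Hs': np.random.rand(1000) * 7,
        'Tp': np.random.rand(1000) * 20,
        'Dir': np.random.rand(1000) * 360
    }
)
mda = MDA(num_centers=10)
# Start the maximum-dissimilarity search from row 0 instead of the
# default maximum-value data point.
mda.fit(
    data=data,
    directional_variables=['Dir'],
    first_centroid_seed=0,
)
print(mda.centroids)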

fit_predict(data: DataFrame, directional_variables: List[str] = [], custom_scale_factor: dict = {}, first_centroid_seed: int = None) Tuple[ndarray, DataFrame][source]

Fits the MDA model to the data and predicts the nearest centroids.

Parameters:
  • data (pd.DataFrame) – The input data to be used for the MDA algorithm.

  • directional_variables (List[str], optional) – A list of names of the directional variables within the data. Default is [].

  • custom_scale_factor (dict, optional) – A dictionary specifying custom scale factors for normalization. Default is {}.

  • first_centroid_seed (int, optional) – The index of the first centroid to use in the MDA algorithm. Default is None.

Returns:

A tuple containing the nearest centroid index for each data point and the nearest centroids.

Return type:

Tuple[np.ndarray, pd.DataFrame]

property normalized_data: DataFrame
predict(data: DataFrame) Tuple[ndarray, DataFrame][source]

Predict the nearest centroid for the provided data.

Parameters:

data (pd.DataFrame) – The input data to be used for the prediction.

Returns:

A tuple containing the nearest centroid index for each data point and the nearest centroids.

Return type:

Tuple[np.ndarray, pd.DataFrame]
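
A self-contained sketch pairing predict with the inherited plot_data_as_clusters (documented under BaseClustering below).

import numpy as np
import pandas as pd
from bluemath_tk.datamining.mda import MDA

data = pd.DataFrame(
    {
        'Hs': np.random.rand(1000) * 7,
        'Tp': np.random.rand(1000) * 20,
        'Dir': np.random.rand(1000) * 360
    }
)
mda = MDA(num_centers=10)
mda.fit(data=data, directional_variables=['Dir'])
nearest_idxs, nearest_df = mda.predict(data=data)
# Color each data point by its nearest centroid.
fig, ax = mda.plot_data_as_clusters(data=data, nearest_centroids=nearest_idxs)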

exception bluemath_tk.datamining.mda.MDAError(message: str = 'MDA error occurred.')[source]

Bases: Exception

Custom exception for MDA class.

bluemath_tk.datamining.pca module

class bluemath_tk.datamining.pca.PCA(n_components: int | float = 0.98, is_incremental: bool = False, debug: bool = False)[source]

Bases: BaseReduction

Principal Component Analysis (PCA) class.

n_components

The number of components or the explained variance ratio.

Type:

Union[int, float]

is_incremental

Indicates whether Incremental PCA is used.

Type:

bool

is_fitted

Indicates whether the PCA model has been fitted.

Type:

bool

scaler

The scaler used for standardizing the data, if the data is standardized.

Type:

StandardScaler

vars_to_stack

The list of variables to stack.

Type:

List[str]

window_stacked_vars

The list of variables with windows.

Type:

List[str]

coords_to_stack

The list of coordinates to stack.

Type:

List[str]

coords_values

The values of the data coordinates used in fitting.

Type:

dict

pca_dim_for_rows

The dimension for rows in PCA.

Type:

str

windows_in_pca_dim_for_rows

The windows in PCA dimension for rows.

Type:

dict

value_to_replace_nans

The values to replace NaNs in the dataset.

Type:

dict

nan_threshold_to_drop

The threshold percentage to drop NaNs for each variable.

Type:

dict

num_cols_for_vars

The number of columns for variables.

Type:

int

pcs

The Principal Components (PCs).

Type:

xr.Dataset

Examples

from bluemath_tk.core.data.sample_data import get_2d_dataset
from bluemath_tk.datamining.pca import PCA

ds = get_2d_dataset()

pca = PCA(
    n_components=5,
    is_incremental=False,
    debug=True,
)
pca.fit(
    data=ds,
    vars_to_stack=["X", "Y"],
    coords_to_stack=["coord1", "coord2"],
    pca_dim_for_rows="coord3",
    windows_in_pca_dim_for_rows={"X": [1, 2, 3]},
    value_to_replace_nans={"X": 0.0},
    nan_threshold_to_drop={"X": 0.95},
)
pcs = pca.transform(
    data=ds,
)
reconstructed_ds = pca.inverse_transform(PCs=pcs)
eofs = pca.eofs
explained_variance = pca.explained_variance
explained_variance_ratio = pca.explained_variance_ratio
cumulative_explained_variance_ratio = pca.cumulative_explained_variance_ratio

# Save the full class in a pickle file
pca.save_model("pca_model.pkl")

# Plot the calculated EOFs
pca.plot_eofs(vars_to_plot=["X", "Y"], num_eofs=3)

        -------------------------------------------------------------------
        | Initializing PCA reduction model with the following parameters:
        |    - n_components: 5
        |    - is_incremental: False
        | For more information, please refer to the documentation.
        -------------------------------------------------------------------
        
[Images: EOF plots for variables X and Y]

References

[1] https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

[2] https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.IncrementalPCA.html

[3] https://www.sciencedirect.com/science/article/abs/pii/S0378383911000676

property cumulative_explained_variance_ratio: ndarray

Return the cumulative explained variance ratio of the PCA model.

property data: Dataset

Returns the raw data used for PCA.

property eofs: Dataset

Return the Empirical Orthogonal Functions (EOFs).

property explained_variance: ndarray

Return the explained variance of the PCA model.

property explained_variance_ratio: ndarray

Return the explained variance ratio of the PCA model.

fit(data: Dataset, vars_to_stack: List[str], coords_to_stack: List[str], pca_dim_for_rows: str, windows_in_pca_dim_for_rows: dict = {}, value_to_replace_nans: dict = {}, nan_threshold_to_drop: dict = {}, scale_data: bool = True) None[source]

Fit PCA model to data.

Parameters:
  • data (xr.Dataset) – The data to fit the PCA model.

  • vars_to_stack (list of str) – The variables to stack.

  • coords_to_stack (list of str) – The coordinates to stack.

  • pca_dim_for_rows (str) – The PCA dimension to maintain in rows (usually the time).

  • windows_in_pca_dim_for_rows (dict, optional) – The window steps to roll the pca_dim_for_rows for each variable. Default is {}.

  • value_to_replace_nans (dict, optional) – The value to replace NaNs for each variable. Default is {}.

  • nan_threshold_to_drop (dict, optional) – The threshold percentage to drop NaNs for each variable. By default, variables with less than 90% of valid values are dropped. Default is {}.

  • scale_data (bool, optional) – If True, scale the data. Default is True.

Notes

For both value_to_replace_nans and nan_threshold_to_drop, the keys are the variables, and the suffixes for the windows are considered. Example: if you have variable “X”, and apply windows [1, 2, 3], you can use “X_1”, “X_2”, “X_3”. Nevertheless, you can also use the original variable name “X” to apply the same value for all windows.
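
A sketch of the window-suffix convention described in the Notes, reusing the sample dataset from the class example; the per-window key "X_1" is illustrative.

from bluemath_tk.core.data.sample_data import get_2d_dataset
from bluemath_tk.datamining.pca import PCA

ds = get_2d_dataset()
pca = PCA(n_components=5)
pca.fit(
    data=ds,
    vars_to_stack=["X", "Y"],
    coords_to_stack=["coord1", "coord2"],
    pca_dim_for_rows="coord3",
    # Rolling windows [1, 2, 3] create the columns X_1, X_2 and X_3.
    windows_in_pca_dim_for_rows={"X": [1, 2, 3]},
    # Target a single window column ("X_1"), or use the base name "X"
    # to apply the same value to all windows.
    value_to_replace_nans={"X_1": 0.0},
)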

fit_transform(data: Dataset, vars_to_stack: List[str], coords_to_stack: List[str], pca_dim_for_rows: str, windows_in_pca_dim_for_rows: dict = {}, value_to_replace_nans: dict = {}, nan_threshold_to_drop: dict = {}, scale_data: bool = True) Dataset[source]

Fit and transform data using PCA model.

Parameters:
  • data (xr.Dataset) – The data to fit the PCA model.

  • vars_to_stack (list of str) – The variables to stack.

  • coords_to_stack (list of str) – The coordinates to stack.

  • pca_dim_for_rows (str) – The PCA dimension to maintain in rows (usually the time).

  • windows_in_pca_dim_for_rows (dict, optional) – The window steps to roll the pca_dim_for_rows for each variable. Default is {}.

  • value_to_replace_nans (dict, optional) – The value to replace NaNs for each variable. Default is {}.

  • nan_threshold_to_drop (dict, optional) – The threshold percentage to drop NaNs for each variable. By default, variables with more than 10% of NaNs are dropped. Default is {}.

  • scale_data (bool, optional) – If True, scale the data. Default is True.

Returns:

The transformed data representing the Principal Components (PCs).

Return type:

xr.Dataset

Notes

For both value_to_replace_nans and nan_threshold_to_drop, the keys are the variables, and the suffixes for the windows are considered. Example: if you have variable “X”, and apply windows [1, 2, 3], you can use “X_1”, “X_2”, “X_3”. Nevertheless, you can also use the original variable name “X” to apply the same value for all windows.

inverse_transform(PCs: DataArray | Dataset) Dataset[source]

Inverse transform data using the fitted PCA model.

Parameters:

PCs (Union[xr.DataArray, xr.Dataset]) – The data to inverse transform. It should be the Principal Components (PCs).

Returns:

The inverse transformed data.

Return type:

xr.Dataset
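
A round-trip sketch based on the class example; with n_components below the full rank, only the variance captured by the retained components is recovered, so the reconstruction is approximate.

from bluemath_tk.core.data.sample_data import get_2d_dataset
from bluemath_tk.datamining.pca import PCA

ds = get_2d_dataset()
pca = PCA(n_components=5)
pca.fit(
    data=ds,
    vars_to_stack=["X", "Y"],
    coords_to_stack=["coord1", "coord2"],
    pca_dim_for_rows="coord3",
)
pcs = pca.transform(data=ds)
# Reconstruct the dataset from its Principal Components.
reconstructed_ds = pca.inverse_transform(PCs=pcs)
print(reconstructed_ds)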

property pca: PCA | IncrementalPCA

Returns the PCA or IncrementalPCA instance used for dimensionality reduction.

property pcs_df: DataFrame

Returns the principal components as a DataFrame.

plot_eofs(vars_to_plot: List[str], num_eofs: int, destandarize: bool = False, map_center: tuple = None) None[source]

Plot the Empirical Orthogonal Functions (EOFs).

Parameters:
  • vars_to_plot (List[str]) – The variables to plot.

  • num_eofs (int) – The number of EOFs to plot.

  • destandarize (bool, optional) – If True, destandardize the EOFs. Default is False.

  • map_center (tuple, optional) – The center of the map. Default is None. First value is the longitude (-180, 180), and the second value is the latitude (-90, 90).

plot_pcs(num_pcs: int, pcs: Dataset = None) None[source]

Plot the Principal Components (PCs).

Parameters:
  • num_pcs (int) – The number of Principal Components (PCs) to plot.

  • pcs (xr.Dataset, optional) – The Principal Components (PCs) to plot.

property stacked_data_matrix: ndarray

Return the stacked data matrix.

property standarized_stacked_data_matrix: ndarray

Return the standardized stacked data matrix.

transform(data: Dataset, after_fitting: bool = False) Dataset[source]

Transform data using the fitted PCA model.

Parameters:
  • data (xr.Dataset) – The data to transform.

  • after_fitting (bool, optional) – If True, use the already processed data. Default is False. This is just used in the fit_transform method!

Returns:

The transformed data.

Return type:

xr.Dataset

property window_processed_data: Dataset

Return the window processed data used for PCA.

exception bluemath_tk.datamining.pca.PCAError(message: str = 'PCA error occurred.')[source]

Bases: Exception

Custom exception for PCA class.

bluemath_tk.datamining.som module

class bluemath_tk.datamining.som.SOM(som_shape: Tuple[int, int], num_dimensions: int, sigma: float = 1, learning_rate: float = 0.5, decay_function: str = 'asymptotic_decay', neighborhood_function: str = 'gaussian', topology: str = 'rectangular', activation_distance: str = 'euclidean', random_seed: int = None, sigma_decay_function: str = 'asymptotic_decay')[source]

Bases: BaseClustering

Self-Organizing Map (SOM) class.

This class performs the Self-Organizing Map algorithm on a given dataframe.

som_shape

The shape of the SOM.

Type:

Tuple[int, int]

num_dimensions

The number of dimensions of the input data.

Type:

int

data

The input data.

Type:

pd.DataFrame

standarized_data

The standardized input data.

Type:

pd.DataFrame

data_to_fit

The data to fit the SOM algorithm.

Type:

pd.DataFrame

data_variables

A list with all data variables.

Type:

List[str]

directional_variables

A list with directional variables.

Type:

List[str]

fitting_variables

A list with fitting variables.

Type:

List[str]

scaler

The StandardScaler object.

Type:

StandardScaler

centroids

The selected centroids.

Type:

pd.DataFrame

is_fitted

A flag to check if the SOM model is fitted.

Type:

bool

activation_response(data)[source]

Returns the activation response of the given data.

get_centroids_probs_for_labels(data, labels)[source]

Returns the labels map of the given data.

plot_centroids_probs_for_labels(probs_data)[source]

Plots the labels map of the given data.

fit(data, directional_variables, num_iteration)[source]

Fits the SOM model to the provided data.

predict(data)[source]

Predicts the nearest centroid for the provided data.

fit_predict(data, directional_variables, num_iteration)[source]

Fit the SOM algorithm to the provided data and predict the nearest centroid for each data point.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from bluemath_tk.datamining.som import SOM
>>> data = pd.DataFrame(
...     {
...         'Hs': np.random.rand(1000) * 7,
...         'Tp': np.random.rand(1000) * 20,
...         'Dir': np.random.rand(1000) * 360
...     }
... )
>>> som = SOM(som_shape=(3, 3), num_dimensions=4)
>>> nearest_centroids_idxs, nearest_centroids_df = som.fit_predict(
...     data=data,
...     directional_variables=['Dir'],
... )
activation_response(data: DataFrame = None) ndarray[source]

Returns the activation response of the given data.

property data: DataFrame
property data_to_fit: DataFrame
property distance_map: ndarray

Returns the distance map of the SOM.

fit(data: DataFrame, directional_variables: List[str] = [], num_iteration: int = 1000) None[source]

Fits the SOM model to the provided data.

Parameters:
  • data (pd.DataFrame) – The input data to be used for the fitting.

  • directional_variables (List[str], optional) – A list with the directional variables (will be transformed to u and v). Default is [].

  • num_iteration (int, optional) – The number of iterations for the SOM fitting. Default is 1000.

Notes

  • The function assumes that the data is validated by the validate_data_som decorator before execution.

fit_predict(data: DataFrame, directional_variables: List[str] = [], num_iteration: int = 1000) Tuple[ndarray, DataFrame][source]

Fit the SOM algorithm to the provided data and predict the nearest centroid for each data point.

Parameters:
  • data (pd.DataFrame) – The input data to be used for the SOM algorithm.

  • directional_variables (List[str], optional) – A list of directional variables (will be transformed to u and v). Default is [].

  • num_iteration (int, optional) – The number of iterations for the SOM fitting. Default is 1000.

Returns:

A tuple containing the winner neurons for each data point and the nearest centroids.

Return type:

Tuple[np.ndarray, pd.DataFrame]

get_centroids_probs_for_labels(data: DataFrame, labels: List[str]) DataFrame[source]

Returns the labels map of the given data.

plot_centroids_probs_for_labels(probs_data: DataFrame) Tuple[figure, axes][source]

Plots the labels map of the given data.

predict(data: DataFrame) Tuple[ndarray, DataFrame][source]

Predicts the nearest centroid for the provided data.

Parameters:

data (pd.DataFrame) – The input data to be used for the prediction.

Returns:

A tuple with the winner neurons and the centroids of the given data.

Return type:

Tuple[np.ndarray, pd.DataFrame]
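
A short sketch of the SOM inspection helpers, fitted on the synthetic dataframe from the class doctest (num_dimensions=4 because Dir is split into u and v components).

import numpy as np
import pandas as pd
from bluemath_tk.datamining.som import SOM

data = pd.DataFrame(
    {
        'Hs': np.random.rand(1000) * 7,
        'Tp': np.random.rand(1000) * 20,
        'Dir': np.random.rand(1000) * 360
    }
)
som = SOM(som_shape=(3, 3), num_dimensions=4)
som.fit(data=data, directional_variables=['Dir'], num_iteration=1000)
# Mean inter-neuron distance for each neuron on the map.
print(som.distance_map)
# How often each neuron is the winner for the given data.
print(som.activation_response(data=data))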

property som: MiniSom
property standarized_data: DataFrame
exception bluemath_tk.datamining.som.SOMError(message: str = 'SOM error occurred.')[source]

Bases: Exception

Custom exception for SOM class.

Module contents

Project: BlueMath_tk
Sub-Module: datamining
Author: GeoOcean Research Group, Universidad de Cantabria
Creation Date: 19 January 2024
Repository: https://github.com/GeoOcean/BlueMath_tk.git
Status: Under development (Working)

class bluemath_tk.datamining.BaseClustering[source]

Bases: BlueMathModel

Base class for all clustering BlueMath models. This class provides the basic structure for all clustering models.

fit : None

Fits the model to the data.

predict : pd.DataFrame

Predicts the clusters for the provided data.

fit_predict : pd.DataFrame

Fits the model to the data and predicts the clusters.

plot_selected_centroids : Tuple[plt.figure, plt.axes]

Plots data and selected centroids on a scatter plot matrix.

plot_data_as_clusters : Tuple[plt.figure, plt.axes]

Plots data as nearest clusters.

abstractmethod fit(*args, **kwargs) None[source]

Fits the model to the data.

Parameters:
  • *args (list) – Positional arguments.

  • **kwargs (dict) – Keyword arguments.

abstractmethod fit_predict(*args, **kwargs) DataFrame[source]

Fits the model to the data and predicts the clusters.

Parameters:
  • *args (list) – Positional arguments.

  • **kwargs (dict) – Keyword arguments.

Returns:

The predicted clusters.

Return type:

pd.DataFrame

plot_data_as_clusters(data: DataFrame, nearest_centroids: ndarray, **kwargs) Tuple[figure, axes][source]

Plots data as nearest clusters.

Parameters:
  • data (pd.DataFrame) – The data to plot.

  • nearest_centroids (np.ndarray) – The nearest centroids.

  • **kwargs (dict, optional) – Additional keyword arguments to be passed to the scatter plot function.

Returns:

  • plt.figure – The figure object containing the plot.

  • plt.axes – The axes object for the plot.

plot_selected_centroids(data_color: str = 'blue', centroids_color: str = 'red', plot_text: bool = False, **kwargs) Tuple[figure, axes][source]

Plots data and selected centroids on a scatter plot matrix.

Parameters:
  • data_color (str, optional) – Color for the data points. Default is “blue”.

  • centroids_color (str, optional) – Color for the centroid points. Default is “red”.

  • plot_text (bool, optional) – Whether to display text labels for centroids. Default is False.

  • **kwargs (dict, optional) – Additional keyword arguments to be passed to the scatter plot function.

Returns:

  • plt.figure – The figure object containing the plot.

  • plt.axes – Array of axes objects for the subplots.

Raises:

ValueError – If the data and centroids do not have the same number of columns or if the columns are empty.

abstractmethod predict(*args, **kwargs) DataFrame[source]

Predicts the clusters for the provided data.

Parameters:
  • *args (list) – Positional arguments.

  • **kwargs (dict) – Keyword arguments.

Returns:

The predicted clusters.

Return type:

pd.DataFrame

class bluemath_tk.datamining.BaseReduction[source]

Bases: BlueMathModel

Base class for all dimensionality reduction BlueMath models. This class provides the basic structure for all dimensionality reduction models.

abstractmethod fit(*args, **kwargs) None[source]

Fits the model to the data.

Parameters:
  • *args (list) – Positional arguments.

  • **kwargs (dict) – Keyword arguments.

abstractmethod fit_transform(*args, **kwargs) Dataset[source]

Fits the model to the data and transforms it.

Parameters:
  • *args (list) – Positional arguments.

  • **kwargs (dict) – Keyword arguments.

Returns:

The transformed data.

Return type:

xr.Dataset

abstractmethod inverse_transform(*args, **kwargs) Dataset[source]

Inversely transforms the data using the fitted model.

Parameters:
  • *args (list) – Positional arguments.

  • **kwargs (dict) – Keyword arguments.

Returns:

The inversely transformed data.

Return type:

xr.Dataset

abstractmethod transform(*args, **kwargs) Dataset[source]

Transforms the data using the fitted model.

Parameters:
  • *args (list) – Positional arguments.

  • **kwargs (dict) – Keyword arguments.

Returns:

The transformed data.

Return type:

xr.Dataset

class bluemath_tk.datamining.BaseSampling[source]

Bases: BlueMathModel

Base class for all sampling BlueMath models. This class provides the basic structure for all sampling models.

generate : pd.DataFrame

Generates samples.

plot_generated_data : Tuple[plt.figure, plt.axes]

Plots the generated data on a scatter plot matrix.

abstractmethod generate(*args, **kwargs) DataFrame[source]

Generates samples.

Parameters:
  • *args (list) – Positional arguments.

  • **kwargs (dict) – Keyword arguments.

Returns:

The generated samples.

Return type:

pd.DataFrame

plot_generated_data(data_color: str = 'blue', **kwargs) Tuple[figure, axes][source]

Plots the generated data on a scatter plot matrix.

Parameters:
  • data_color (str, optional) – Color for the data points. Default is “blue”.

  • **kwargs (dict, optional) – Additional keyword arguments to be passed to the scatter plot function.

Returns:

  • plt.figure – The figure object containing the plot.

  • plt.axes – Array of axes objects for the subplots.

Raises:

ValueError – If the data is empty.

class bluemath_tk.datamining.ClusteringComparator(list_of_models: List[BaseClustering])[source]

Bases: object

Class for comparing clustering models.

fit(data: DataFrame, directional_variables: List[str] = [], custom_scale_factor: dict = {}) None[source]

Fits the clustering models.

plot_data_as_clusters(data: DataFrame) None[source]

Plots the data as clusters for the clustering models.

plot_selected_centroids() None[source]

Plots the selected centroids for the clustering models.
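
A minimal comparison sketch; the cluster counts and SOM shape are illustrative (num_dimensions=4 accounts for Dir being split into u and v).

import numpy as np
import pandas as pd
from bluemath_tk.datamining import KMA, MDA, SOM, ClusteringComparator

data = pd.DataFrame(
    {
        'Hs': np.random.rand(1000) * 7,
        'Tp': np.random.rand(1000) * 20,
        'Dir': np.random.rand(1000) * 360
    }
)
comparator = ClusteringComparator(
    list_of_models=[
        KMA(num_clusters=9),
        MDA(num_centers=9),
        SOM(som_shape=(3, 3), num_dimensions=4),
    ]
)
comparator.fit(data=data, directional_variables=['Dir'])
comparator.plot_selected_centroids()
comparator.plot_data_as_clusters(data=data)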

class bluemath_tk.datamining.KMA(num_clusters: int, seed: int = None)[source]

Bases: BaseClustering

Alias of bluemath_tk.datamining.kma.KMA, re-exported at package level; see the full documentation in the bluemath_tk.datamining.kma module above.

class bluemath_tk.datamining.LHS(num_dimensions: int, seed: int = 1)[source]

Bases: BaseSampling

Alias of bluemath_tk.datamining.lhs.LHS, re-exported at package level; see the full documentation in the bluemath_tk.datamining.lhs module above.

class bluemath_tk.datamining.MDA(num_centers: int)[source]

Bases: BaseClustering

Alias of bluemath_tk.datamining.mda.MDA, re-exported at package level; see the full documentation in the bluemath_tk.datamining.mda module above.

class bluemath_tk.datamining.PCA(n_components: int | float = 0.98, is_incremental: bool = False, debug: bool = False)[source]

Bases: BaseReduction

Alias of bluemath_tk.datamining.pca.PCA, re-exported at package level; see the full documentation in the bluemath_tk.datamining.pca module above.

class bluemath_tk.datamining.SOM(som_shape: Tuple[int, int], num_dimensions: int, sigma: float = 1, learning_rate: float = 0.5, decay_function: str = 'asymptotic_decay', neighborhood_function: str = 'gaussian', topology: str = 'rectangular', activation_distance: str = 'euclidean', random_seed: int = None, sigma_decay_function: str = 'asymptotic_decay')[source]

Bases: BaseClustering

Alias of bluemath_tk.datamining.som.SOM, re-exported at package level; see the full documentation in the bluemath_tk.datamining.som module above.