MDA

Bases: BaseClustering

Maximum Dissimilarity Algorithm (MDA) class.

This class performs the MDA algorithm on a given dataframe.

Attributes:
  • num_centers (int) –

    The number of centers to use in the MDA algorithm.

  • data (DataFrame) –

    The input data.

  • normalized_data (DataFrame) –

    The normalized input data.

  • data_to_fit (DataFrame) –

    The data to fit the MDA algorithm.

  • data_variables (List[str]) –

    A list with all data variables.

  • directional_variables (List[str]) –

    A list with directional variables.

  • fitting_variables (List[str]) –

    A list with fitting variables.

  • custom_scale_factor (dict) –

    A dictionary of custom scale factors.

  • scale_factor (dict) –

    A dictionary of scale factors (after normalizing the data).

  • centroids (DataFrame) –

    The selected centroids.

  • normalized_centroids (DataFrame) –

    The selected normalized centroids.

  • centroid_iterative_indices (List[int]) –

    A list of iterative indices of the centroids.

  • centroid_real_indices (List[int]) –

    The real indices of the selected centroids.

Methods:

Name         Description
fit          Fit the MDA algorithm to the provided data.
predict      Predict the nearest centroid for the provided data.
fit_predict  Fits the MDA model to the data and predicts the nearest centroids.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> from bluemath_tk.datamining.mda import MDA
>>> data = pd.DataFrame(
...     {
...         'Hs': np.random.rand(1000) * 7,
...         'Tp': np.random.rand(1000) * 20,
...         'Dir': np.random.rand(1000) * 360
...     }
... )
>>> mda = MDA(num_centers=10)
>>> nearest_centroids_idxs, nearest_centroids_df = mda.fit_predict(
...     data=data,
...     directional_variables=['Dir'],
... )

__init__(num_centers)

Initializes the MDA class.

Parameters:
  • num_centers (int) –

    The number of centers to use in the MDA algorithm. Must be greater than 0.

Raises:
  • ValueError

    If num_centers is not greater than 0.

fit(data, directional_variables=[], custom_scale_factor={}, first_centroid_seed=None)

Fit the Maximum Dissimilarity Algorithm (MDA) to the provided data.

This method initializes centroids for the MDA algorithm using the provided dataframe, directional variables, and custom scale factor. It normalizes the data, iteratively selects centroids based on maximum dissimilarity, and denormalizes the centroids before returning them.

Parameters:
  • data (DataFrame) –

    The input data to be used for the MDA algorithm.

  • directional_variables (List[str], default: [] ) –

    A list of names of the directional variables within the data. Default is [].

  • custom_scale_factor (dict, default: {} ) –

    A dictionary specifying custom scale factors for normalization. Default is {}.

  • first_centroid_seed (int, default: None ) –

    The index of the first centroid to use in the MDA algorithm. Default is None.

Notes
  • The function assumes that the data is validated by the validate_data_mda decorator before execution.
  • When first_centroid_seed is not provided, the maximum-value data point is used as the first centroid.
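
The iterative selection described above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the helper name max_dissimilarity_select, the plain Euclidean distance, and the explicit first_index argument are assumptions made for illustration, and the points are assumed to be already normalized.

```python
import math


def max_dissimilarity_select(points, num_centers, first_index):
    """Greedily pick indices so that each new point is the one farthest
    from the already selected subset (hypothetical helper, not the
    bluemath_tk implementation)."""
    selected = [first_index]
    # Distance from every point to its nearest selected point so far.
    min_dist = [math.dist(p, points[first_index]) for p in points]
    while len(selected) < num_centers:
        # The most dissimilar point is the one farthest from the subset.
        next_index = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(next_index)
        # Refresh the nearest-selected distance with the new centroid.
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[next_index]))
    return selected


# Assumed toy data: four corners of the unit square plus its center.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
selected = max_dissimilarity_select(points, num_centers=3, first_index=0)
```

Starting from the first corner, the sketch picks the opposite corner next and leaves the central point for last, which is the behavior expected from a maximum dissimilarity criterion.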

fit_predict(data, directional_variables=[], custom_scale_factor={}, first_centroid_seed=None)

Fits the MDA model to the data and predicts the nearest centroids.

Parameters:
  • data (DataFrame) –

    The input data to be used for the MDA algorithm.

  • directional_variables (List[str], default: [] ) –

    A list of names of the directional variables within the data. Default is [].

  • custom_scale_factor (dict, default: {} ) –

    A dictionary specifying custom scale factors for normalization. Default is {}.

  • first_centroid_seed (int, default: None ) –

    The index of the first centroid to use in the MDA algorithm. Default is None.

Returns:
  • Tuple[ndarray, DataFrame]

    A tuple containing the nearest centroid index for each data point and the nearest centroids.

predict(data)

Predict the nearest centroid for the provided data.

Parameters:
  • data (DataFrame) –

    The input data to be used for the prediction.

Returns:
  • Tuple[ndarray, DataFrame]

    A tuple containing the nearest centroid index for each data point and the nearest centroids.
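
As a rough illustration of what a nearest-centroid prediction computes, the sketch below assigns each point to its closest centroid. The function name nearest_centroid_indices and the Euclidean metric are assumptions; the library may treat directional variables and normalization differently.

```python
import math


def nearest_centroid_indices(points, centroids):
    """Return, for each point, the index of the closest centroid
    (hypothetical helper using Euclidean distance)."""
    indices = []
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        indices.append(distances.index(min(distances)))
    return indices


# Assumed toy data, already on a common scale.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 2.0), (9.0, 8.0), (4.0, 4.0)]
labels = nearest_centroid_indices(points, centroids)
# The second element of the returned tuple corresponds to these rows.
nearest = [centroids[i] for i in labels]
```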

MDAError

Bases: Exception

Custom exception for MDA class.

KMA

Bases: BaseClustering

K-Means (KMA) class.

This class performs the K-Means algorithm on a given dataframe.

Attributes:
  • num_clusters (int) –

    The number of clusters to use in the K-Means algorithm.

  • seed (int) –

    The random seed to use as initial datapoint.

  • data (DataFrame) –

    The input data.

  • normalized_data (DataFrame) –

    The normalized input data.

  • data_to_fit (DataFrame) –

    The data to fit the K-Means algorithm.

  • data_variables (List[str]) –

    A list with all data variables.

  • directional_variables (List[str]) –

    A list with directional variables.

  • fitting_variables (List[str]) –

    A list with fitting variables.

  • custom_scale_factor (dict) –

    A dictionary of custom scale factors.

  • scale_factor (dict) –

    A dictionary of scale factors (after normalizing the data).

  • centroids (DataFrame) –

    The selected centroids.

  • normalized_centroids (DataFrame) –

    The selected normalized centroids.

  • centroid_real_indices (array) –

    The real indices of the selected centroids.

Methods:

Name         Description
fit          Fit the K-Means algorithm to the provided data.
predict      Predict the nearest centroid for the provided data.
fit_predict  Fit the K-Means algorithm to the provided data and predict the nearest centroid for each data point.

Notes
  • The K-Means algorithm is used to cluster data points into k clusters.
  • The K-Means algorithm is sensitive to the initial centroids.
  • The K-Means algorithm is not suitable for large datasets.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> from bluemath_tk.datamining.kma import KMA
>>> data = pd.DataFrame(
...     {
...         'Hs': np.random.rand(1000) * 7,
...         'Tp': np.random.rand(1000) * 20,
...         'Dir': np.random.rand(1000) * 360
...     }
... )
>>> kma = KMA(num_clusters=5)
>>> nearest_centroids_idxs, nearest_centroids_df = kma.fit_predict(
...     data=data,
...     directional_variables=['Dir'],
... )

TODO
  • Add customization for the K-Means algorithm.

__init__(num_clusters, seed=None, init='k-means++', n_init='auto', algorithm='lloyd')

Initializes the KMA class.

Parameters:
  • num_clusters (int) –

    The number of clusters to use in the K-Means algorithm. Must be greater than 0.

  • seed (int, default: None ) –

    The random seed to use as initial datapoint. Must be greater than or equal to 0 and less than the number of datapoints. Default is None.

  • init (str, default: 'k-means++' ) –

    The centroid initialization method. Default is "k-means++".

  • n_init (str, default: 'auto' ) –

    The number of initializations to perform. Default is "auto".

  • algorithm (str, default: 'lloyd' ) –

    The algorithm to use. Default is "lloyd".

Raises:
  • ValueError

    If num_clusters is not greater than 0, or if seed is not greater than or equal to 0.

fit(data, directional_variables=[], custom_scale_factor={})

Fit the K-Means algorithm to the provided data.

This method initializes centroids for the K-Means algorithm using the provided dataframe and custom scale factor. It normalizes the data, and returns the calculated centroids.

Parameters:
  • data (DataFrame) –

    The input data to be used for the KMA algorithm.

  • directional_variables (List[str], default: [] ) –

    A list of directional variables (will be transformed to u and v). Default is [].

  • custom_scale_factor (dict, default: {} ) –

    A dictionary specifying custom scale factors for normalization. Default is {}.

Notes
  • The function assumes that the data is validated by the validate_data_kma decorator before execution.
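
The parameter description says directional variables are transformed to u and v. A common way to do this, shown here as an assumption rather than the library's exact convention, is to project the angle onto the unit circle so that 359° and 1° end up close together instead of 358 units apart. The helper names direction_to_uv and uv_to_direction are hypothetical.

```python
import math


def direction_to_uv(direction_deg):
    """Map a direction in degrees to unit-circle components (u, v).
    Assumed convention: u = cos(theta), v = sin(theta)."""
    theta = math.radians(direction_deg)
    return math.cos(theta), math.sin(theta)


def uv_to_direction(u, v):
    """Recover the direction in degrees, in the range [0, 360)."""
    return math.degrees(math.atan2(v, u)) % 360.0


# Two directions on either side of north are close in (u, v) space.
u1, v1 = direction_to_uv(359.0)
u2, v2 = direction_to_uv(1.0)
gap = math.dist((u1, v1), (u2, v2))
```

Clustering on (u, v) instead of the raw angle avoids the artificial discontinuity at 0°/360°, at the cost of adding one extra fitting dimension per directional variable.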

fit_predict(data, directional_variables=[], custom_scale_factor={})

Fit the K-Means algorithm to the provided data and predict the nearest centroid for each data point.

Parameters:
  • data (DataFrame) –

    The input data to be used for the KMA algorithm.

  • directional_variables (List[str], default: [] ) –

    A list of directional variables (will be transformed to u and v). Default is [].

  • custom_scale_factor (dict, default: {} ) –

    A dictionary specifying custom scale factors for normalization. Default is {}.

Returns:
  • Tuple[ndarray, DataFrame]

    A tuple containing the nearest centroid index for each data point and the nearest centroids.

predict(data)

Predict the nearest centroid for the provided data.

Parameters:
  • data (DataFrame) –

    The input data to be used for the prediction.

Returns:
  • Tuple[ndarray, DataFrame]

    A tuple containing the nearest centroid index for each data point and the nearest centroids.

KMAError

Bases: Exception

Custom exception for KMA class.

SOM

Bases: BaseClustering

Self-Organizing Map (SOM) class.

This class performs the Self-Organizing Map algorithm on a given dataframe.

Attributes:
  • som_shape (Tuple[int, int]) –

    The shape of the SOM.

  • num_dimensions (int) –

    The number of dimensions of the input data.

  • data (DataFrame) –

    The input data.

  • standarized_data (DataFrame) –

    The standardized input data.

  • data_to_fit (DataFrame) –

    The data to fit the SOM algorithm.

  • data_variables (List[str]) –

    A list with all data variables.

  • directional_variables (List[str]) –

    A list with directional variables.

  • fitting_variables (List[str]) –

    A list with fitting variables.

  • scaler (StandardScaler) –

    The StandardScaler object.

  • centroids (DataFrame) –

    The selected centroids.

  • is_fitted (bool) –

    A flag to check if the SOM model is fitted.

Methods:

Name                             Description
activation_response              Returns the activation response of the given data.
get_centroids_probs_for_labels   Returns the labels map of the given data.
plot_centroids_probs_for_labels  Plots the labels map of the given data.
fit                              Fits the SOM model to the provided data.
predict                          Predicts the nearest centroid for the provided data.
fit_predict                      Fit the SOM algorithm to the provided data and predict the nearest centroid for each data point.

Notes
  • Check MiniSom documentation for more information: https://github.com/JustGlowing/minisom

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> from bluemath_tk.datamining.som import SOM
>>> data = pd.DataFrame(
...     {
...         'Hs': np.random.rand(1000) * 7,
...         'Tp': np.random.rand(1000) * 20,
...         'Dir': np.random.rand(1000) * 360
...     }
... )
>>> som = SOM(som_shape=(3, 3), num_dimensions=4)
>>> nearest_centroids_idxs, nearest_centroids_df = som.fit_predict(
...     data=data,
...     directional_variables=['Dir'],
... )

TODO
  • Add option to normalize data?

distance_map property

Returns the distance map of the SOM.

__init__(som_shape, num_dimensions, sigma=1, learning_rate=0.5, decay_function='asymptotic_decay', neighborhood_function='gaussian', topology='rectangular', activation_distance='euclidean', random_seed=None, sigma_decay_function='asymptotic_decay')

Initializes a Self-Organizing Map.

A rule of thumb for sizing the grid in a dimensionality reduction task is that it should contain about 5*sqrt(N) neurons, where N is the number of samples in the dataset to analyze.

E.g. if your dataset has 150 samples, 5*sqrt(150) ≈ 61.24, hence an 8-by-8 map (64 neurons) should perform well.
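
The rule of thumb above can be turned into a small helper for choosing som_shape. The function name suggested_som_side is hypothetical, and rounding up to a square grid is one possible choice, not a requirement of the class.

```python
import math


def suggested_som_side(num_samples):
    """Side length of a square grid with about 5 * sqrt(N) neurons."""
    target_neurons = 5 * math.sqrt(num_samples)
    # Round up so the grid holds at least the target number of neurons.
    return math.ceil(math.sqrt(target_neurons))


side = suggested_som_side(150)
som_shape = (side, side)
```

For 150 samples this yields the 8-by-8 map mentioned above.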

Parameters:
  • som_shape (tuple) –

    Shape of the SOM. This should be a tuple with two integers.

  • num_dimensions (int) –

    Number of the elements of the vectors in input.

  • For the remaining parameters, see the MiniSom source code:

    https://github.com/JustGlowing/minisom/blob/master/minisom.py

Raises:
  • ValueError

    If the SOM shape is not a tuple with two integers. Or if the number of dimensions is not an integer.

activation_response(data=None)

Returns the activation response of the given data.

fit(data, directional_variables=[], num_iteration=1000)

Fits the SOM model to the provided data.

Parameters:
  • data (DataFrame) –

    The input data to be used for the fitting.

  • directional_variables (List[str], default: [] ) –

    A list with the directional variables (will be transformed to u and v). Default is [].

  • num_iteration (int, default: 1000 ) –

    The number of iterations for the SOM fitting. Default is 1000.

Notes
  • The function assumes that the data is validated by the validate_data_som decorator before execution.

fit_predict(data, directional_variables=[], num_iteration=1000)

Fit the SOM algorithm to the provided data and predict the nearest centroid for each data point.

Parameters:
  • data (DataFrame) –

    The input data to be used for the SOM algorithm.

  • directional_variables (List[str], default: [] ) –

    A list of directional variables (will be transformed to u and v). Default is [].

  • num_iteration (int, default: 1000 ) –

    The number of iterations for the SOM fitting. Default is 1000.

Returns:
  • Tuple[ndarray, DataFrame]

    A tuple containing the winner neurons for each data point and the nearest centroids.

get_centroids_probs_for_labels(data, labels)

Returns the labels map of the given data.

plot_centroids_probs_for_labels(probs_data)

Plots the labels map of the given data.

predict(data)

Predicts the nearest centroid for the provided data.

Parameters:
  • data (DataFrame) –

    The input data to be used for the prediction.

Returns:
  • Tuple[ndarray, DataFrame]

    A tuple with the winner neurons and the centroids of the given data.
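
To make "winner neuron" concrete, the sketch below finds the grid position whose weight vector is closest to a sample. It is a simplified illustration assuming a Euclidean activation distance and a nested-list weight grid; the function name winner_neuron is hypothetical and this is not the library's code.

```python
import math


def winner_neuron(sample, weights):
    """Return the (row, col) of the neuron whose weights are closest
    to `sample` (illustrative best-matching-unit search)."""
    best, best_dist = None, math.inf
    for row, row_weights in enumerate(weights):
        for col, w in enumerate(row_weights):
            d = math.dist(sample, w)
            if d < best_dist:
                best, best_dist = (row, col), d
    return best


# Assumed 2-by-2 grid of 2-dimensional weight vectors.
weights = [
    [(0.0, 0.0), (0.0, 1.0)],
    [(1.0, 0.0), (1.0, 1.0)],
]
winner = winner_neuron((0.9, 0.2), weights)
```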

SOMError

Bases: Exception

Custom exception for SOM class.