Clustering

`MDA`

Bases: BaseClustering

Maximum Dissimilarity Algorithm (MDA) class.

This class performs the MDA algorithm on a given dataframe.

Attributes:

num_centers (int) –

The number of centers to use in the MDA algorithm.
data (DataFrame) –

The input data.
normalized_data (DataFrame) –

The normalized input data.
data_to_fit (DataFrame) –

The data to fit the MDA algorithm.
data_variables (List[str]) –

A list with all data variables.
directional_variables (List[str]) –

A list with directional variables.
fitting_variables (List[str]) –

A list with fitting variables.
custom_scale_factor (dict) –

A dictionary of custom scale factors.
scale_factor (dict) –

A dictionary of scale factors (after normalizing the data).
centroids (DataFrame) –

The selected centroids.
normalized_centroids (DataFrame) –

The selected normalized centroids.
centroid_iterative_indices (List[int]) –

A list of iterative indices of the centroids.
centroid_real_indices (List[int]) –

The real indices of the selected centroids.

Methods:

Name	Description
`fit`	Fit the MDA algorithm to the provided data.
`predict`	Predict the nearest centroid for the provided data.
`fit_predict`	Fits the MDA model to the data and predicts the nearest centroids.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> from bluemath_tk.datamining.mda import MDA
>>> data = pd.DataFrame(
...     {
...         'Hs': np.random.rand(1000) * 7,
...         'Tp': np.random.rand(1000) * 20,
...         'Dir': np.random.rand(1000) * 360
...     }
... )
>>> mda = MDA(num_centers=10)
>>> nearest_centroids_idxs, nearest_centroids_df = mda.fit_predict(
...     data=data,
...     directional_variables=['Dir'],
... )

`__init__(num_centers)`

Initializes the MDA class.

Parameters:	`num_centers` (`int`) – The number of centers to use in the MDA algorithm. Must be greater than 0.

Raises:	`ValueError` – If num_centers is not greater than 0.

`fit(data, directional_variables=[], custom_scale_factor={}, first_centroid_seed=None)`

Fit the Maximum Dissimilarity Algorithm (MDA) to the provided data.

This method initializes centroids for the MDA algorithm using the provided dataframe, directional variables, and custom scale factor. It normalizes the data, iteratively selects centroids based on maximum dissimilarity, and denormalizes the centroids before returning them.

Parameters:

data (DataFrame) –

The input data to be used for the MDA algorithm.
directional_variables (List[str], default: [] ) –

A list of names of the directional variables within the data. Default is [].
custom_scale_factor (dict, default: {} ) –

A dictionary specifying custom scale factors for normalization. Default is {}.
first_centroid_seed (int, default: None ) –

The index of the first centroid to use in the MDA algorithm. Default is None.

Notes

The function assumes that the data is validated by the validate_data_mda decorator before execution.
When first_centroid_seed is not provided, the maximum-value data point is used as the first centroid.
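
The selection loop described above can be sketched in plain Python. This is a minimal illustration of the general maximum-dissimilarity idea, not the code used by the library: the helper names `min_max_scale` and `max_dissimilarity_select`, the min-max scaling, the Euclidean distance, and the rule for picking the first point when no seed is given (largest scaled sum) are all assumptions made for this example.

```python
import math


def min_max_scale(rows):
    """Scale each column of `rows` (a list of equal-length tuples) to [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # avoid dividing by zero
    return [tuple((v - lo) / sp for v, lo, sp in zip(r, lows, spans)) for r in rows]


def max_dissimilarity_select(rows, num_centers, first_index=None):
    """Return indices of rows chosen greedily by maximum dissimilarity."""
    if num_centers <= 0:
        raise ValueError("num_centers must be greater than 0")
    scaled = min_max_scale(rows)
    if first_index is None:
        # Assumption for this sketch: start from the row with the largest scaled sum.
        first_index = max(range(len(scaled)), key=lambda i: sum(scaled[i]))
    selected = [first_index]
    # Distance from every row to its closest already-selected centroid.
    nearest = [math.dist(p, scaled[first_index]) for p in scaled]
    while len(selected) < min(num_centers, len(scaled)):
        # The next centroid is the row farthest from everything selected so far.
        candidate = max(range(len(scaled)), key=lambda i: nearest[i])
        selected.append(candidate)
        nearest = [min(d, math.dist(p, scaled[candidate])) for d, p in zip(nearest, scaled)]
    return selected


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
chosen = max_dissimilarity_select(points, num_centers=3, first_index=0)
```

Starting from the corner at index 0, the sketch next picks the opposite corner and then one of the two remaining corners, leaving the central point for last because it is the least dissimilar to the points already chosen.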

`fit_predict(data, directional_variables=[], custom_scale_factor={}, first_centroid_seed=None)`

Fits the MDA model to the data and predicts the nearest centroids.

Parameters:

data (DataFrame) –

The input data to be used for the MDA algorithm.
directional_variables (List[str], default: [] ) –

A list of names of the directional variables within the data. Default is [].
custom_scale_factor (dict, default: {} ) –

A dictionary specifying custom scale factors for normalization. Default is {}.
first_centroid_seed (int, default: None ) –

The index of the first centroid to use in the MDA algorithm. Default is None.

Returns:	`Tuple[ndarray, DataFrame]` – A tuple containing the nearest centroid index for each data point and the nearest centroids.

`predict(data)`

Predict the nearest centroid for the provided data.

Parameters:	`data` (`DataFrame`) – The input data to be used for the prediction.

Returns:	`Tuple[ndarray, DataFrame]` – A tuple containing the nearest centroid index for each data point and the nearest centroids.
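
The returned pair can be read as one centroid index per input row, plus the centroid that each index points to. A minimal sketch of that lookup, assuming plain Euclidean distance on values that are already on a common scale; `nearest_centroid_indices` is a hypothetical helper and not part of the library.

```python
import math


def nearest_centroid_indices(rows, centroids):
    """For each row, return the position of the closest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(row, centroids[k]))
        for row in rows
    ]


centroids = [(0.0, 0.0), (1.0, 1.0)]
rows = [(0.1, 0.2), (0.9, 0.8), (0.4, 0.3)]
idxs = nearest_centroid_indices(rows, centroids)
# One centroid per input row, in the same order as `rows`.
nearest = [centroids[k] for k in idxs]
```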

`MDAError`

Bases: Exception

Custom exception for MDA class.

`KMA`

Bases: BaseClustering

K-Means (KMA) class.

This class performs the K-Means algorithm on a given dataframe.

Attributes:

num_clusters (int) –

The number of clusters to use in the K-Means algorithm.
seed (int) –

The random seed to use as initial datapoint.
data (DataFrame) –

The input data.
normalized_data (DataFrame) –

The normalized input data.
data_to_fit (DataFrame) –

The data to fit the K-Means algorithm.
data_variables (List[str]) –

A list with all data variables.
directional_variables (List[str]) –

A list with directional variables.
fitting_variables (List[str]) –

A list with fitting variables.
custom_scale_factor (dict) –

A dictionary of custom scale factors.
scale_factor (dict) –

A dictionary of scale factors (after normalizing the data).
centroids (DataFrame) –

The selected centroids.
normalized_centroids (DataFrame) –

The selected normalized centroids.
centroid_real_indices (array) –

The real indices of the selected centroids.

Methods:

Name	Description
`fit`	Fit the K-Means algorithm to the provided data.
`predict`	Predict the nearest centroid for the provided data.
`fit_predict`	Fit the K-Means algorithm to the provided data and predict the nearest centroid for each data point.

Notes

The K-Means algorithm is used to cluster data points into k clusters.
The K-Means algorithm is sensitive to the initial centroids.
The K-Means algorithm is not suitable for large datasets.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> from bluemath_tk.datamining.kma import KMA
>>> data = pd.DataFrame(
...     {
...         'Hs': np.random.rand(1000) * 7,
...         'Tp': np.random.rand(1000) * 20,
...         'Dir': np.random.rand(1000) * 360
...     }
... )
>>> kma = KMA(num_clusters=5)
>>> nearest_centroids_idxs, nearest_centroids_df = kma.fit_predict(
...     data=data,
...     directional_variables=['Dir'],
... )

TODO

Add customization for the K-Means algorithm.

`__init__(num_clusters, seed=None, init='k-means++', n_init='auto', algorithm='lloyd')`

Initializes the KMA class.

Parameters:

num_clusters (int) –

The number of clusters to use in the K-Means algorithm. Must be greater than 0.
seed (int, default: None ) –

The random seed to use as initial datapoint. Must be greater than or equal to 0 and less than the number of datapoints. Default is None.
init (str, default: 'k-means++' ) –

The centroid initialization method. Default is "k-means++".
n_init (str, default: 'auto' ) –

The number of initializations to perform. Default is "auto".
algorithm (str, default: 'lloyd' ) –

The algorithm to use. Default is "lloyd".

Raises:	`ValueError` – If num_clusters is not greater than 0, or if seed is not greater than or equal to 0.

`fit(data, directional_variables=[], custom_scale_factor={})`

Fit the K-Means algorithm to the provided data.

This method initializes centroids for the K-Means algorithm using the provided dataframe and custom scale factor. It normalizes the data, and returns the calculated centroids.

Parameters:

data (DataFrame) –

The input data to be used for the KMA algorithm.
directional_variables (List[str], default: [] ) –

A list of directional variables (will be transformed to u and v). Default is [].
custom_scale_factor (dict, default: {} ) –

A dictionary specifying custom scale factors for normalization. Default is {}.

Notes

The function assumes that the data is validated by the validate_data_kma decorator before execution.
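
Directional variables such as a wave direction in degrees are circular: 359 and 1 are close on the compass, yet their numeric difference is large. Splitting the angle into two components removes that discontinuity before distances are computed. The sketch below assumes u = cos(θ) and v = sin(θ) with θ given in degrees; the exact convention used by the library and the helper names `direction_to_uv` and `uv_to_direction` are assumptions for illustration only.

```python
import math


def direction_to_uv(degrees):
    """Map an angle in degrees to a point (u, v) on the unit circle."""
    theta = math.radians(degrees)
    return math.cos(theta), math.sin(theta)


def uv_to_direction(u, v):
    """Recover the angle in degrees, in the range [0, 360)."""
    return math.degrees(math.atan2(v, u)) % 360.0


# 359 and 1 degrees differ by 358 numerically but are close on the circle.
gap = math.dist(direction_to_uv(359.0), direction_to_uv(1.0))
```

With this transform a single directional column becomes two fitting columns, which is why a dataset with two scalar variables and one direction ends up with four fitting variables.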

`fit_predict(data, directional_variables=[], custom_scale_factor={})`

Fit the K-Means algorithm to the provided data and predict the nearest centroid for each data point.

Parameters:

data (DataFrame) –

The input data to be used for the KMA algorithm.
directional_variables (List[str], default: [] ) –

A list of directional variables (will be transformed to u and v). Default is [].
custom_scale_factor (dict, default: {} ) –

A dictionary specifying custom scale factors for normalization. Default is {}.

Returns:	`Tuple[ndarray, DataFrame]` – A tuple containing the nearest centroid index for each data point and the nearest centroids.

`predict(data)`

Predict the nearest centroid for the provided data.

Parameters:	`data` (`DataFrame`) – The input data to be used for the prediction.

Returns:	`Tuple[ndarray, DataFrame]` – A tuple containing the nearest centroid index for each data point and the nearest centroids.

`KMAError`

Bases: Exception

Custom exception for KMA class.

`SOM`

Bases: BaseClustering

Self-Organizing Map (SOM) class.

This class performs the Self-Organizing Map algorithm on a given dataframe.

Attributes:

som_shape (Tuple[int, int]) –

The shape of the SOM.
num_dimensions (int) –

The number of dimensions of the input data.
data (DataFrame) –

The input data.
standarized_data (DataFrame) –

The standardized input data.
data_to_fit (DataFrame) –

The data to fit the SOM algorithm.
data_variables (List[str]) –

A list with all data variables.
directional_variables (List[str]) –

A list with directional variables.
fitting_variables (List[str]) –

A list with fitting variables.
scaler (StandardScaler) –

The StandardScaler object.
centroids (DataFrame) –

The selected centroids.
is_fitted (bool) –

A flag to check if the SOM model is fitted.

Methods:

Name	Description
`activation_response`	Returns the activation response of the given data.
`get_centroids_probs_for_labels`	Returns the labels map of the given data.
`plot_centroids_probs_for_labels`	Plots the labels map of the given data.
`fit`	Fits the SOM model to the provided data.
`predict`	Predicts the nearest centroid for the provided data.
`fit_predict`	Fit the SOM algorithm to the provided data and predict the nearest centroid for each data point.

Notes

Check MiniSom documentation for more information: https://github.com/JustGlowing/minisom

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> from bluemath_tk.datamining.som import SOM
>>> data = pd.DataFrame(
...     {
...         'Hs': np.random.rand(1000) * 7,
...         'Tp': np.random.rand(1000) * 20,
...         'Dir': np.random.rand(1000) * 360
...     }
... )
>>> som = SOM(som_shape=(3, 3), num_dimensions=4)
>>> nearest_centroids_idxs, nearest_centroids_df = som.fit_predict(
...     data=data,
...     directional_variables=['Dir'],
... )

TODO

Add option to normalize data?

`distance_map` `property`

Returns the distance map of the SOM.

`__init__(som_shape, num_dimensions, sigma=1, learning_rate=0.5, decay_function='asymptotic_decay', neighborhood_function='gaussian', topology='rectangular', activation_distance='euclidean', random_seed=None, sigma_decay_function='asymptotic_decay')`

Initializes a Self-Organizing Map.

A rule of thumb to set the size of the grid for a dimensionality reduction task is that it should contain 5*sqrt(N) neurons where N is the number of samples in the dataset to analyze.

E.g. if your dataset has 150 samples, 5*sqrt(150) = 61.24, so an 8-by-8 map (64 neurons) should perform well.
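
The rule of thumb above can be turned into a small helper. This is only a sketch: `suggested_som_shape` is a hypothetical function, and rounding up to a square grid is an assumption, since the rule fixes the neuron count but not the grid proportions.

```python
import math


def suggested_som_shape(num_samples):
    """Return a square grid holding at least 5 * sqrt(num_samples) neurons."""
    target_neurons = 5 * math.sqrt(num_samples)
    side = math.ceil(math.sqrt(target_neurons))
    return side, side


shape = suggested_som_shape(150)  # 5 * sqrt(150) is roughly 61.2 neurons
```

The resulting tuple has the same form as the `som_shape` argument.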

Parameters:

som_shape (tuple) –

Shape of the SOM. This should be a tuple with two integers.
num_dimensions (int) –

Number of elements in each input vector.

For the remaining parameters, see the MiniSom source: https://github.com/JustGlowing/minisom/blob/master/minisom.py

Raises:	`ValueError` – If the SOM shape is not a tuple with two integers. Or if the number of dimensions is not an integer.

`activation_response(data=None)`

Returns the activation response of the given data.
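
In MiniSom, the activation response is a grid of counts in which cell (i, j) holds the number of samples whose winning neuron is (i, j). Assuming this wrapper keeps that meaning, the counting step can be sketched as follows; `count_wins` and the sample winner coordinates are hypothetical.

```python
from collections import Counter


def count_wins(winners, som_shape):
    """Build a rows x cols grid counting how often each neuron won."""
    rows, cols = som_shape
    counts = Counter(winners)
    return [[counts[(i, j)] for j in range(cols)] for i in range(rows)]


# Hypothetical winner coordinates for five samples on a 2 x 2 map.
winners = [(0, 0), (1, 1), (0, 0), (1, 0), (0, 0)]
grid = count_wins(winners, som_shape=(2, 2))
```

Neurons that never win keep a count of zero, which makes empty regions of the map easy to spot.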

`fit(data, directional_variables=[], num_iteration=1000)`

Fits the SOM model to the provided data.

Parameters:

data (DataFrame) –

The input data to be used for the fitting.
directional_variables (List[str], default: [] ) –

A list with the directional variables (will be transformed to u and v). Default is [].
num_iteration (int, default: 1000 ) –

The number of iterations for the SOM fitting. Default is 1000.

Notes

The function assumes that the data is validated by the validate_data_som decorator before execution.

`fit_predict(data, directional_variables=[], num_iteration=1000)`

Fit the SOM algorithm to the provided data and predict the nearest centroid for each data point.

Parameters:

data (DataFrame) –

The input data to be used for the SOM algorithm.
directional_variables (List[str], default: [] ) –

A list of directional variables (will be transformed to u and v). Default is [].
num_iteration (int, default: 1000 ) –

The number of iterations for the SOM fitting. Default is 1000.

Returns:	`Tuple[ndarray, DataFrame]` – A tuple containing the winner neurons for each data point and the nearest centroids.

`get_centroids_probs_for_labels(data, labels)`

Returns the labels map of the given data.

`plot_centroids_probs_for_labels(probs_data)`

Plots the labels map of the given data.

`predict(data)`

Predicts the nearest centroid for the provided data.

Parameters:	`data` (`DataFrame`) – The input data to be used for the prediction.

Returns:	`Tuple[ndarray, DataFrame]` – A tuple with the winner neurons and the centroids of the given data.

`SOMError`

Bases: Exception

Custom exception for SOM class.
