Correlation

class pyspark.ml.stat.Correlation

Compute the correlation matrix for the input dataset of Vectors using the specified method. Methods currently supported: pearson (default), spearman.

New in version 2.2.0.

Notes

Spearman is a rank correlation, so computing it requires creating an RDD[Double] for each column, sorting it to obtain the ranks, and then joining the columns back into an RDD[Vector], which is fairly costly. Cache the input Dataset before calling corr with method = ‘spearman’ to avoid recomputing the common lineage.
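The rank step described above can be illustrated in plain Python. This is a minimal sketch, not Spark's implementation: the helper names `average_ranks`, `pearson`, and `spearman` are hypothetical, and giving tied values the mean of their positions is an assumption made here. The three lists are the first, second, and fourth vector components from the Examples section.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions (assumed tie rule)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


first = [1.0, 4.0, 6.0, 9.0]
second = [0.0, 5.0, 7.0, 0.0]
fourth = [-2.0, 3.0, 8.0, 1.0]

print(round(spearman(first, fourth), 4))  # 0.4
print(round(spearman(first, second), 4))  # 0.1054
```

Every column has to be sorted before any coefficient can be computed, which is the source of the extra cost relative to Pearson.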

Methods

corr(dataset, column[, method])

Compute the correlation matrix with specified method using dataset.

Methods Documentation

static corr(dataset: pyspark.sql.dataframe.DataFrame, column: str, method: str = 'pearson') → pyspark.sql.dataframe.DataFrame

Compute the correlation matrix with specified method using dataset.

New in version 2.2.0.

Parameters
dataset : pyspark.sql.DataFrame

A DataFrame.

column : str

The name of the column of vectors for which the correlation coefficient needs to be computed. This must be a column of the dataset, and it must contain Vector objects.

method : str, optional

String specifying the method to use for computing correlation. Supported: pearson (default), spearman.

Returns
A DataFrame that contains the correlation matrix of the column of vectors. This
DataFrame contains a single row and a single column named METHODNAME(COLUMN), for example pearson(features).

Examples

>>> from pyspark.ml.linalg import DenseMatrix, Vectors
>>> from pyspark.ml.stat import Correlation
>>> dataset = [[Vectors.dense([1, 0, 0, -2])],
...            [Vectors.dense([4, 5, 0, 3])],
...            [Vectors.dense([6, 7, 0, 8])],
...            [Vectors.dense([9, 0, 0, 1])]]
>>> dataset = spark.createDataFrame(dataset, ['features'])
>>> pearsonCorr = Correlation.corr(dataset, 'features', 'pearson').collect()[0][0]
>>> print(str(pearsonCorr).replace('nan', 'NaN'))
DenseMatrix([[ 1.        ,  0.0556...,         NaN,  0.4004...],
             [ 0.0556...,  1.        ,         NaN,  0.9135...],
             [        NaN,         NaN,  1.        ,         NaN],
             [ 0.4004...,  0.9135...,         NaN,  1.        ]])
>>> spearmanCorr = Correlation.corr(dataset, 'features', method='spearman').collect()[0][0]
>>> print(str(spearmanCorr).replace('nan', 'NaN'))
DenseMatrix([[ 1.        ,  0.1054...,         NaN,  0.4       ],
             [ 0.1054...,  1.        ,         NaN,  0.9486...],
             [        NaN,         NaN,  1.        ,         NaN],
             [ 0.4       ,  0.9486...,         NaN,  1.        ]])
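The off-diagonal NaN entries in both matrices line up with the third vector component, which is 0 in every row. A constant column has zero variance, so the denominator of the correlation coefficient is zero and the value is undefined. The plain-Python check below is only an illustration of that point, not Spark's implementation; `sample_variance` and the other names are hypothetical.

```python
# The four example vectors, one per row.
rows = [
    [1.0, 0.0, 0.0, -2.0],
    [4.0, 5.0, 0.0, 3.0],
    [6.0, 7.0, 0.0, 8.0],
    [9.0, 0.0, 0.0, 1.0],
]


def sample_variance(values):
    """Unbiased sample variance of a sequence with at least two values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


# Transpose the rows so that each tuple holds one vector component across all rows.
columns = list(zip(*rows))
variances = [sample_variance(column) for column in columns]

# Indices of components whose correlation with any other component is undefined.
constant_columns = [index for index, variance in enumerate(variances) if variance == 0.0]
print(constant_columns)  # [2]
```

Dropping constant components before calling corr is one way to avoid NaN entries in the result.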