dcor (c46b2)

Distance Correlation (DCOR)

Distance correlation coefficient is a very useful and elegant alternative to the standard measures
of correlation and is based on several deep and non-trivial theoretical calculations developed by
Székely, Rizzo and Bakirov [1]. The main result is that a single, simple statistic DCOR(X,Y) can
be used to assess whether two random vectors X and Y, of possibly different respective dimensions,
are dependent, linearly or non-linearly, based on an independent and identically distributed
(i.i.d.) sample.

* Commands | Invoking DCOR
* Background | An introduction to distance correlation
* References | Relevant citations
* Example | Outline of an example input script

DCOR Commands

DCOR can be invoked from CORREL using the keyword DCOR to calculate dependence between two time
series, which can be of different respective dimensions. Time series can contain any variable. The
only requirement is both the time series should be of equal length.

setup trajectory file

correl maxt ... maxa ... maxs …
setup first time series
setup second time series
traj trajectory sepcifications
dcor time-series1 time-series2

DCOR can also be invoked from CORMAN as a part of “COORdinate COVAarianceâ€, using the keyword DCOR
(DCOV), to calculate distance correlation (covariance) between positional fluctuation of two
selection of atoms.

COORdinates COVAriance traj-spec 2x(atom_selection) [UNIT_for_output int] -
[RESIdue_average_nsets integer] [MATRix] -
[ENTRopy [TEMP <real>] [DIAG] [RESI] [SCHL] ]-

If DCOR or DCOV has been requested with ENTRopy, ENTRopy command will be ignored. If both DCOR and
DCOV have been requested, then DCOR will be ignored.

When using CORREL parallelization in calculating distance correlations can be achieved by submitting
multiple serial jobs. Using COORdinate COVAriance distance correlations between positional
fluctuations of multiple atoms can be calculated using parallel version of CHARMM.

Benchmark for 352x352 atoms, 500 steps dynamics, using 32 core AMD box:

CPUs DCOR (charmm c40)
time (sec) speedup efficiency

1 1443.0 1.00 100%
2 750.0 1.92 96%
4 366.6 3.94 98%
8 189.6 7.61 95%
16 93.9 15.37 96%

For better parallelization of DCOR using COORdinate COVAriance, select the larger number of atoms using
the second selection. Time requirement for DCOR calculation between two time series increases as N*N
where N is the length of the time series.

Introduction to Distance Correlation

Among Pearson’s correlation coefficient (PCC), a generalized correlation coefficient (GCC) 16 and
distance correlation coefficient (DCOR), DCOR is the most appropriate parameter to find association
in atomic motions because it is least sensitive to angular dependence while reflecting variability in
covariance. [2] Calculation of DCOR between two vector series is straightforward to implement. Let {A}
and {B} be two vector series with m entries each and the ith entry in {A} is denoted by A_i . To
calculate distance covariance between {A} and {B} the following five steps are needed.

1) Calculate the m x m matrix, a, from {A}, where a_ij is the Euclidean distance between the ith
and jth entries of {A}: a_ij = a_ji =| A i - A j |
2) Average the rows of a: a_i. = 1/m sum_j(a_ij)
3) Average the columns of a: a_.j = 1/m sum_i(a_ij)
4) Average all elements of a: a_.. = 1/(m*m) sum_ij(a_ij)
5) Build the m x m matrix alpha from a where alpha_ij = a_ij - a_i. - a_. j + a_..

Then the distance covariance is
DCOV(A, B) ≡ sqrt(1/(m*m) sum_ij(alpha_ij * beta_ij))
where beta_ij is defined similarly from B.

The distance correlation coefficient, DCOR, is defined as
DCOR = DCOV(A,B)/sqrt(DCOV(A,A) * DCOV (B,B))

DCOR was found to capture both linear and non-linear correlation between positional vectors [2] and
was able to reveal long-distance concerted motions in a protein that was not revealed by PCC or GCC. [2]


[1] Measuring and testing dependence by correlation of distances
GJ Székely, ML Rizzo, NK Bakirov
The Annals of Statistics 35.6 (2007): 2769-2794.

[2] Detection of Long-Range Concerted Motions in Protein by a Distance Covariance
Amitava Roy and Carol Beth Post
J. Chem. Theory Comput., 2012, 8 (9), pp 3009–3014

Input File

1) CORREL input for DCOR between positional fluctuation between 2 CA atoms:

open read file unit 31 name traj1_file
open read file unit 32 name traj2_file

correl maxt 5000 maxa 4 maxs 6
enter a atom XYZ sele ires 1 .and. type CA end
enter b atom XYZ sele ires 20 .and. type CA end
traj firstu 31 nunit 2
dcor a b

CORREL> dcor a b
DCOR> VAR1 = 19.868 VAR2 = 18.500 COVAR = 14.777 CORR = 0.771


VAR1,VAR2 – distance variances
COVAR – distance covariance between time series “a†and “bâ€
CORR – distance correlation between time series “a†and “bâ€

2) COORdinate COVAriance input for DCOR between positional fluctuations between all CA atoms:

open read file unit 31 name traj1_file
open read file unit 32 name traj2_file

open write card unit 41 name matrix_file
coor cova firstu 31 nunit 2 sele type CA end sele type CA end unit 41 dcor

Output options are identical with COOR COVA output options.