galgor (c46b2)

Galgor: Commands which deal with Genetic Algorithm and Monte Carlo.


# Michal Vieth,H. Daigler, C.L. Brooks III -Dec-15-1997 Initial release.


The commands described in this node are associated with genetic
algorithm module for conformational searches and docking of small ligands to
rigid proteins. The full description of the GA features is presented
in the paper "Rational approach to docking. Optimizing the search algorithm"


* Implementation | A brief description of the anatomy of GA
* Syntax | Syntax of the replication commands
* Description | Description of key words and commands usage
* Restrictions | Restrictions on usage
* Examples | Supplementary examples of the use of GA


Top
Genetic Algorithm and Monte Carlo: Description and Discussion


Name Keyword Module
GA setup GALGOR SETUP genetic.src
Genetic algorithm GALGOR EVOLVE genetic.src, genetic2.src
Monte Carlo GALGOR EVOLVE MCARLO genetic.src, genetic2.src

This code was created by Michal Vieth, Heidi Daigler and Charles Brooks III
at The Scripps Research Institute during the summer/fall of 1997
based on the code provided by Charles Brooks and Heidi Daigler, Department
of Chemistry, Carnegie Mellon University developed during the summer of 1994.
Its purpose is to enable monte carlo and genetic algorithm based conformational
searches to be performed on peptides/proteins, small organic molecules and
docking of (small) ligands to their receptors.
It builds upon the replica
ideas of Leo Caves to make multiple copies of the system, i.e., the
chromosomes. These chromosomes make up a population of molecular
conformations which are crossed and mutated according to the specific
genetic algorithm used. It is recomended to use a generational update
version of genetic algorithm with elitism 1 or 2, and and island/nicheing
ideas to evolve subpopulations independently.
Each chromosome is represented by the set of
internal coordinates (and rigid body degrees of freedom if desired) deemed to
"evolve", i.e., the genes. The genes are represented as real numbers. The
chromosome are crossed at a randomly chosen gene or random genes are
mutated with a given rate of mutation. In addition migration of individuals
between subpopulations is used if the island model is employed.
From each population, the
members suitable for crossing, the "parents", are chosen based upon
their ranking, as measured by a priori chosen preference to select
higher ranked individuals. Evolutionary presure of 1.2 based of scaled fitness
following Jones, JMB 245, 43-53 is used for parents selection.
The potential energy function used can be all of CHARMM's energy
terms, including constraints, or any subset of them. The user energy
term can be utilized to include any special functional form which is
particularly amenable to the GA search. The current implementation
uses the standard CHARMM energy functions.
For docking the essential part of the energy function is the use
of soft core VDW potential that permits ligands to diffuse into the
protein. The protein in docking is rigid, and at this time it is
not possible to include partial flexibility.
Monte Carlo scheme uses the same engine, the only difference is that
the members of the entire population become completly separate entities
which do not exchange any information. Their evolution by simulated
annealing is similar to MCSS molecular dynamics.
Entire approach to docking and comparison of performance of GA/MD/MC
is described in detail in two back to back papers submitted to J.Comp.Chem.
"Rational approach to docking I and II."

In the initial set-up stage, the GALGorithm keyword is parsed and the
main subroutine for the genetic algorithm is called from CHARMM
(GENETIC). The keyword SETUp causes a number of things to occur.
First the number of chromosomes (replicas) and the atoms whose
internal coordinates are to be sampled and replicated are parsed.
Following this, the specific IC variables which are to be "evolved"
are selected by the VARIable_ICs sub-parser. This is carried out in a
separate routine which is modeled after the routine used for
processing the IC edit command. Finally, the replica subroutine from
Leo Caves is called to replicate the system and the original subsystem
atoms are deleted by a call equivalent to delete atoms select segid
orig end. Each new segment generated by the replica command is given
the prefix C (for chromosome) followed by the number of that replica
(chromosome). The apropriate pointer structures for
inclusion/exclusion of dihedral, angle and bond ICs are then created
and control is passed back to the CHARMM level, where specific
manipulations can be performed on the individual chromosomes, e.g., to
vary initial conformations around the initial progenerator of all
chromosomes.

Evolution of the population of chromosomes is controlled by calling
GALGorithm with the EVOLve keyword from CHARMM. This uses a parser
which sets-up the control variables for the genetic algorithm
evolution and then runs the genetic evolution. The specific functions
performed by this portion of the code are:

1. Parse parameters controlling the GA evolution.

GA follows the scheme:

2. Evolve the population of chromosomes. This involves the following
steps:

a) Evaluate the energy of the population at step 0.
b) Rank the population members in accordance with the viability.
c) Choose parents to develop the next generation based on roulette
wheel selection according to scaled fitness.
d) Mate parents by crossover or random mutations in genes,
e) Reconstruct (via IC build-like procedure) the cartesian positions of
the new generation from modified ICs and if applicable from
the modified values of the rigid degrees of freedom.
f) Evaluate energies of the new generation
g) if elitist
model is being used transfer fittest parent into the new generation.
If steady state/generational update is used replace least fit
parents by children. if elitism is used retain some most fit parents.
IF evolutionary strategy is pick the best members to form a new
population
h) Begin again at step b), cycle through to convergence or
MAXGenerations or NEVALuations

Monte Carlo proceeds via the following scheme:

2. Evolve the population of chromosomes. This involves the following
steps:

a) Evaluate the energies of each replica
b) Generate new replica by mutations
c) Reconstruct (via IC build-like procedure) the cartesian positions of
the new generation from modified ICs and if applicable from
the modified values of the rigid degrees of freedom.
d) Evaluate energies of the new replicas
e) Accept new replicas with Boltzmann probability
h) Begin again at step b), cycle through
MAXGenerations or NEVALuations


Top
Syntax for the Genetic ALGORithm command

GALGOR

setup: {[SETUP] [CHROmosomes int] [atom selection] -
{SEED 3x(<resnum> <atom>)}
{[VARIable_IC] -
{[DIHEdral] [IMPRO] [INCLude] [4x<atom selection]} {[DEPE] -
[4x<atom selection] [OFFSet int]} -
{{[BOND] [3x atom selection]} [ALL] [NONE]} -
{{[ANGLE] [2x atom selection]} [ALL] [NONE]} -
{TRAN} {ROTA} -
[END]}

evolution: {[EVOLVE] -

GA parameters:
[PARENTS int] [CHILDREN int] -
{[STEADY] [ELITE int]} [EPRES real] -
[CROSsover_rate real] [MUTAtion_rate real] [GSIZE int] -
[NICHes int] [INTEraction_frequency int] -

MC parameters :
{[MCARlo] [TBEG real] [TEND real] [TFRQ int]
[TJWAlking real] [IJWAlking real]} -

Evolution parameters:
[MAXGenerations int] [NEVAluations int] [IPRINT int] -
{ [TOLE real] [TOLC real]} [PRIN_frequency int] -

Initialization parameters:
{[RANDomization] [ISEED int] [RDIHE real] [RROTA real] -
{[RTRA real] [RXTR real] [RYTR real] [RZTR real] [OFTR real]}} -

Step sizes:
[TRANslation_step real] -
[ROTAtion_step real] [BOSTep real] [ANSTep real] -
[IMPRoper_step real] [PINTernal real] [PALL real] [PCALL real]
[PTALL real] -
{[IBIG_step_frq int] [BTRAN real] [BROTA real] [BDIST real] -

Nonbonded parameters:
{[RMIN real]} [QNOE] [NBFRQ real] -

Exit parameters:
[DELETE] [CLEAR] [LEAVE int]}}


Top
There are two basic parts of running genetic algorithm or Monte Carlo
in CHARMM:


Description of the basic key words of the genetic algorithm :

The following is the description of the setup commands for setting up the
system

Keyword/Syntax Default Purpose

SETUP setting up the data structure

CHRO 50 Number of chromosomes in a population

atom sele The atoms to be used as chromosomes

SEED 3x(<resnum> <atom>) Specify seed atoms for ic builds
Default to use first three atoms in PSF

VARI Definition of the active variables - genes.
This keyword acts in similar way to IC EDIT. It has
to be followed by definition of internal
coordinates and end with END statement.

DIHE INCL <sele> ALL Within VARI specifies the selection of dihedral
angles to be used as active variables. May be
followed by DIHE DEPE.

DIHE DEPE <sele> OFFS value Within VARI specifies the dihedral that can be
computed from the value of the dihedral defined
immediately before by adding the OFFSet value,
This specifies dihedral dependency if two or more
quartets of atoms describe rotation about the same

bond
DIHE IMPR <sele> ALL Improper dihedrals to be used as active variables.

BOND <sele> ALL Bonds to be used as active variable

ANGLE <sele> ALL Angles to be used as active variables

Example:
GALGOR SETUP -
CHRO 50 SELE segid maa END -
SEED 1 c 2 n 2 ca -
VARIable_ICs
DIHE INCL 1 C 2 N 2 CA 2 C end
DIHE DEPE 2 CA 2 C 3 N 3 CA end OFFSET -2.0
DIHE IMPR 2 N 1 CL 1 C 1 O end
BOND ALL
ANGLE ALL
TRAN ROTA
END

The following is the description of the setup commands for evolution of
the system


Keyword/Syntax Default Purpose

EVOLVE The beginning of the evolution setup

STEA off The type of evolutionary algorithm to be used.
IF STEAdy key word is absent evolutionary strategy
is used, in which the new population consists of
CHRO most fit individuals that are chosen from
2*CHRO parents and children. If STEA key word is
present generational (or steady state) update
is used in which children replace least fit
parents in the population. STEA is highly
recomended for problems with two or more variables
to avoid premature convergence of the population.

PARENTS CHRO The number of parents to be used in mating,
typically all chromosomes are permitted to be
parents

ELITE 2 So called elitism of the population -
The number of parents to be retained in the next
generation. If STEA key word is used with elitism
equal to CHRO-2 a typical steady state update
is carried, otherwise the update is called
generational. With no STEA key word in evolutionary
strategy elitism is not changing the
evolution process.

CHILDREN CHRO-ELIT The number of children created in each mating

EPRES 1.2 Evolutionary presure, this is defined as the
relative probability to select the most fit
individual with respect to average individual.
The scaled fitness is linear and based on Jones,
JMB 245, 43-5

NICHes 1 The number of subpopulations (niches) to be evolved
independently, This is used as yet another mean
to avoid premature convergence.

INTE 100 Migration frequency between niches. This is
indicates how frequently individuals migrate from
one subpopulation to another. THe essence of
migration is top replace NICHE-1 worst members
in each subpopulation by the best members of
other subpopulations

CROSsover 0.6 The probability of mating by crossover [0,1],
higher values tend to cause faster convergence

MUTA 0.4 The probability of mating by mutation, higher
values prevent fast convergence. In general the
sum of MUTA and CROS would be 1.0, unless we
want to create children as clones of some parents.
The clones will make sense in generational update,
less sense in evolutionary strategy and no sense
in the steady state update.

GSIZE 1 The number of variables coding for a gene. Since
we use floating point numbers to represent a gene
1 is strongly recomended. Higher values would
affect the meaning of crossover changing it to
mutation if it occurred within a gene.

MCARLO off Turns on Monte Carlo procedure, instead of GA,
CHROM act now as separate replicas not interacting
with each other and each evolving by means of
mutations independently

TBEG 300.0 Initial temperature of the system in K

TEND 300.0 Final temperature in K

TFRQ 10 Temperature change frequency, the temperature
change is calculated in the following way :
tchange=(TEND-TBEG)/(MAXG/TFRQ)

IJWALK MAXG+1 Frequency of sampling at higher temperature, that
allows sporadically for higher acceptance ratio.
The use of this jwalker is not recommended unless
the system has critically low acceptance rate.
The acceptance rate is printed with energy in place
of GRMS printout

TJWALK 400.0 Temperature allowing for better acceptance ratio
turned on every IJWALK generations. This is turned
on only for replicas whose acceptance ratio is
less than 1%

MAXGeneration 10 The total number of generations (or MC steps) for
which the system is to evolve

NEVAL 150000 The maximum number of energy evaluations performed
on the system

IPRINT 5 The number of chromosomes for which the energies
are to be printed

PRINT_freq 100 Printing frequency of chromosome energies. The
chromosome are ranked based on energy, and
depending on the print level various terms are
included. By default all energy terms are reported.

TOLCO 0.0 The value of the RMS deviation between the average
chromosomes and all other chromosomes (measured
by values of all genes) for which the evolution
is terminated due to convergence in variable space.
IF the value is zero the convergence is not checked

TOLER 0.0 The value of the RMS deviation between the energy
of the average chromosome and all other chromosomes
for which the evolution due to convergence in
energy. If the value is zero the convergence is
not checked.


RAND off IF this keyword is used the variables in all
chromosomes are randomized around their original
values with the spread equal to their step sizes

ISEED 31415 Seed number for the ranom generator

RDIHE 0.0 The spread for generating the initial value of
dihedrals if RAND key word is used. With a default
value dihedrals will be randomized about their
initial values (with RAND present)

RROTA 0.0 The spread in initial values of euler angles
generated randomly if RAND key word is used. The
values are in radians. IF the value is zero
euler angles will be randomized about their
initial values

RTRA 0.0 The maximum distance of the center of mass
of molecules from the point specified by RXTR
RYTR RZTR for the random initial placement of
molecules.

RXTR 0.0 The location of the center for generation
RYTR 0.0 of initial centers of masses
RZTR 0.0

OFTRA 0.2 The minimum distance of the center of mass
of molecules from the point specified by RXTR
RYTR RZTR for the random initial placement of
molecules.

TRAN 0.6 The magnitude of the maximum value for the
step size of translations, the actual value is
calculated : (Random number-0.5)*TRAN

ROTA 0.5 The magnitude of the maximum value for the
step size of rotations, the actual value is
calculated : (Random number-0.5)*ROTA. The
values are in radians.

BOSTep 0.002 The magnitude of the maximum value for the
step size of bond distance change

ANSTEP 2.0 The magnitude of the maximum value for the
step size of bond angle change in degrees

IMPROPER 2.0 The magnitude of the maximum value for the
step size of improper dihedral angle change
in degrees

PINT 0.5 Probability to mutate internal degrees of
freedom, the probability to mutate rigid body degrees of freedom is 1-pint. This applies
only for the system with active translations or
rotations, otherwise the value of PINT is 1.

PALL 0.0 Relative probability to mutate all 3
translational or rotational degrees of
freedom, default 0 with respect to other
rigid body mutations

PCALl 0.0 Relative probability to mutate all 6 rigid
body degrees of freedom

PTALl 0.0 Probability to mutate all internal degrees
of freedom simultaneously

IBIGstep MAXG+10 The frequency of mutations with big steps

BTRAN 0.6 The step for translations every IBIG steps

BROTA 0.5 The step for rotations every IBIG steps

BDIS 30.0 The step for dihedrals change every IBIG steps

QNOE off The flag turning on the NOE potential

NBFRQ 1 Nonbonded interaction list update frequency

RMIN 0.0 The soft core switching distance, the force
for all nonbonded interactions at RMIN*SIGMA
is the same as for the distances lower that
RMIN*SIGMA. The two above flags turn on the soft
core VDW potential. The strongly recomended
value for RMIN in the initial stages of docking
is 0.885.

LEAVE CHROM The number of chromosomes to be left at the end
of evolution

DELETE off Delete all chromosomes except the first LEAVE

CLEAR off CLearing all GA routines


Example for evolution of bonds, angles, improper dihedrals and dihedrals.

GALGorithm EVOLve -
STEA elite 2 EPRE 1.2 -
MAXG 1000 -
NICHE 1 50 GSIZE 1 MUTA 0.4 CROS 0.6 -
RAND ISEED 1423 -
PINT 1.0 TOLC 0.01 -
PRIN 10 DIST 20.0 IPRO 3.0 ANST 1.5 BOST 0.002 -
NBFR 10 DELETE CLEAR


Top
1) Since the setup of GA module is based on the replica module (with deletion
of the parent replica) the requirements for running GA code are identical
to the requirements for replicas.
2) Energy has to be called before the evolution step.
3) For any interaction of chromosomes with environment PSF (i.e. protein
to which chromosomes are docked) it is assumed that chromosomes are
in the PSF before the environment
4) Due to the existing bag in the nonobonded list generation after deletion
of the parent replica PSF in order to avoid any errors in generation of
nonbonded list with environment it is necessary to define PSF for the
environment twice and then delete the first environment PSF:


replica nrep 10 sele segid ligand end
dele atom sele segid ligand end

or

GALGOR SETUP chrom 10 sele segid ligand end


open unit 1 read form name "6Atim.pdb"
read sequ pdb unit 1
close unit 1

generate 3pti setup nodihe

open unit 1 read form name "6Atim.pdb"
read sequ pdb unit 1
close unit 1

generate 3ptb setup nodihe

dele atom sele segid 3pti end

5) In the process of generation Cartesian coordinates from internal
coordinates it is ASSUMED THAT a dihedral angle IS DEFINED in
the IC table for the FIRST THREE atom in the PSF. If the first
three atoms from PSF cannot be used as seeds RTF has to be modified
so the first three atoms define a dihedral angle in IC table . In a
later stage the GA_place routine should be modified to allow for
different atom seeding.
6) In order to run consecutive evolutions of GA with different parameters
or consecutive runs of MC no DELETE nor CLEAR key words can be used.
Otherwise segmentation fault will be reported.
7) IC FILL command has to be used when cartesian coordinated of chromosomes
are read from external file if one wants to use initial values of
internal coordinates corresponding to the cartesian coordinates.
Otherwise internal coordinates will be taken from the parent chromosome.
GA_
8) The supported energy terms are: USER, ANGLE, BOND, UREYB, DIHE,
IMDIHE, VDW, ELEC, NOE, CHARM, CDIHE.


Top
Supplementary examples.

_________________________________________________________
Alanine dipeptide rigid bond/angle/improper setup:
_________________________________________________________
Read sequ card
* maa


AMN ALA CBX

Generate maa setup
ic parameters
ic seed 1 cl 1 c 1 o !provide the three atoms to "seed" building
ic build !build the structure based upon the internal coordinates (ics)

set npar 20
GAlgorithm SETUp -
CHROmosomes 20 select segid maa end -
VARIable_ICs
DIHEdral INCLude 1 C 2 N 2 CA 2 C end
DIHEdral INCLude 2 N 2 CA 2 C 3 N end
END

Nbonds cdie eps 1.0 cutnb 99.0 ctofnb 90.0 wrnmxd 99.0 swit vswit
energy

ic fill
_________________________________________________________
Alanine dipeptide global minimum by generational update GA:
_________________________________________________________

GALGorithm EVOLve -
RAND iseed @num rdihe 180.0 -
steady epres 1.2 -
MAXGenerations 1000 -
niches 2 inte 10 -
gsize 1 muta 0.2 pinte 1.0 cone -
print 10 Dist 30.0 anst 1.0 toler 0.01 delete leave 10 clear

Alanine dipeptide global minimum by evolutionary strategy:
_________________________________________________________

GALGorithm EVOLve -
RAND iseed @num rdihe 180.0 -
MAXGenerations 1000 -
niches 2 inte 10 -
gsize 1 muta 0.2 pinte 1.0 cone -
print 10 Dist 30.0 anst 1.0 toler 0.01 delete leave 10 clear

__________________________________________________________
Alanine dipeptide global minimum by simulated annealing MC:
_________________________________________________________

GALGorithm EVOLve -
mCarlo Tbeg 500.0 Tend 250.0 tfrq 10 -
RAND iseed 3213 rdihe 360.0 -
MAXGenerations 5000 -
niches 2 inte 10 -
gsize 1 muta 0.2 pinte 1.0 cone -
ibig 50 bdis 100.0 -
print 10 Dist 20.0 anst 1.0 toler 0.01 delete clear

_________________________________________________________
GA docking of g3p to tim:
_________________________________________________________

setup:

set nchrom 150
set nliga @nchrom
set npar @nchrom
GAlgorithm SETUp -
CHROmosomes @nchrom select segid g3p end -
VARIable_IC
dihe incl 1 O1 1 C2 1 C4 1 O7 end
dihe depe 1 O1 1 C2 1 C4 1 C8 end offset -120.0
dihe incl 1 C2 1 C4 1 C8 1 O10 end
dihe depe 1 O7 1 C4 1 C8 1 O10 end offset -120.0
dihe incl 1 C4 1 C8 1 O10 1 P14 end
dihe incl 1 H1 1 O1 1 C2 1 C4 end
dihe incl 1 C2 1 C4 1 O7 1 H2 end
dihe depe 1 C8 1 C4 1 O7 1 H2 end offset 120.0
dihe incl 1 C8 1 O10 1 P14 1 O15 end
dihe depe 1 C8 1 O10 1 P14 1 O16 end offset 120.0
dihe depe 1 C8 1 O10 1 P14 1 O17 end offset -120.0
rota tran
END

mult nliga by 2
mult nchrom by 2

open unit 1 read form name "6Atim.pdb"
read sequ pdb unit 1
close unit 1

generate 3pti setup nodihe

open unit 1 read form name "6Atim.pdb"
read sequ pdb unit 1
close unit 1

generate 3ptb setup nodihe

dele atom sele segid 3pti end

open unit 1 read form name 6tim_m.crd
read coor ignore unit 1 append
close unit 1

cons fix sele segid 3pt* end
fast on
wrnlev -1
energy rdie eps 3.0 cutnb 13.5 ctofnb 8.0 ctonnb 6.0 swit vswit inbfr 5 -
qrmin rmin 0.885

GALGorithm EVOLve -
RAND iseed @num rtran 11. rxtr -0.7 rytr -4.2 rztr -9. oftra 0.3 -
rrota 6.28 rdihe 360.0 -
steady -
elite 2 epres 1.2 -
MAXGenerations 400 -
niches 5 inte 100 nbfrq 40 pcall .3 pall 0.3 -
ibigstep 40 btran 1.8 brota 1.8 bdist 40.0 -
qrmin rmin 0.885 qnoe -
gsize 1 muta 0.7 cross .3 tran 1. rota 1. pint 0.3 -
print 50 Dist 30.0

energy rdie eps 3.0 cutnb 12.5 ctofnb 8.0 ctonnb 6.0 swit vswit inbfr 5
!!!!!!!!!!!!!!!!!
!!!!!!! EVOLUTION
!!!!!!!!!!!!!!!!!
GALGorithm EVOLve -
steady -
nevalu 5000000 -
elite 2 epres 1.1 -
MAXGenerations 70 -
niches 5 inte 25 nbfrq 50 pcall .2 pall 0.2 -
ibigstep 50 btran 1. brota 1. bdist 60.0 qrmin rmin 0.65 qnoe -
gsize 1 muta 0.3 cross .7 tran 0.5 rota .4 pint 0.6 -
print 50 Dist 30.0

GALGorithm EVOLve -
steady -
nevalu 5000000 -
elite 2 epres 1.1 -
MAXGenerations 30 -
niches 5 inte 5 nbfrq 30 pcall .2 pall 0.2 -
ibigstep 30 btran 1. brota 1. bdist 40.0 qnoe qrmin rmin 0.5 -
gsize 1 muta 0.3 cross .7 tran 0.5 rota .4 pint 0.6 -
print 50 Dist 30.0 delete clear leave 10