PLUMED Masterclass 21.6: Dimensionality reduction

- Date
- 12th April 2021

The primary aim of this Masterclass is to show you how you might use PLUMED in your work. We will see how to call PLUMED from a python notebook and discuss some strategies for selecting collective variables.

Once this Masterclass is completed, users will be able to:

- Calculate CVs and import them into a python notebook.
- Generate visualizations of data using chemiscope.
- Use the projection of a vector as a CV.
- Use path collective variables.
- Run dimensionality reduction algorithms with PLUMED.
- Use Gibbs dividing surfaces to determine the extent of a phase.

Throughout this exercise, we use the atomic simulation environment (ase):

https://wiki.fysik.dtu.dk/ase/

and chemiscope:

https://chemiscope.org

Please look at the websites for these codes that are given above for more information.

For this Masterclass, you will need to set up PLUMED and GROMACS as you did for PLUMED Masterclass 21.5: Simulations with multiple replicas. You can thus install PLUMED and GROMACS using:

conda install --strict-channel-priority -c plumed/label/masterclass-mpi -c conda-forge plumed

conda install --strict-channel-priority -c plumed/label/masterclass-mpi -c conda-forge gromacs

You can install the other software you need using:

conda install -c conda-forge py-plumed numpy pandas matplotlib notebook mdtraj mdanalysis git ase

Notice that, in addition to the packages you have been using in previous classes, you need a package called ase (the atomic simulation environment) for these exercises. You also need to install chemiscope, which you can do by using the following command:

pip install chemiscope

The data needed to complete this Masterclass can be found on GitHub. You can clone this repository locally on your machine using the following command:

git clone https://github.com/plumed/masterclass-21-6.git

I recommend that you run each exercise in a separate sub-directory (i.e. Exercise-1, Exercise-2, ...), which you can create inside the root directory `masterclass-21-6`. Organizing your data this way will help you to keep things clean.

- Note
- All the exercises have been tested with PLUMED version 2.7.0.

I like working with Python notebooks. Using notebooks allows me to keep the code that I am working on close to the explanation of how it works. I can also make all my figures within a notebook. By using notebooks, I can thus have a single file that contains:

- Discussion of the work done.
- The analysis code.
- The final figures that were generated.

In previous masterclasses we have run plumed driver through bash from within a notebook by using commands like the one shown below:

```python
!cd ../Exercises/Exercise_1 && plumed driver --noatoms > /dev/null

# Read in colvar file produced
data = np.loadtxt("../Exercises/Exercise_1/colvar")
```

We have then read in the colvar file produced and analyzed it from within the notebook. We can, however, avoid calling plumed driver altogether and instead use PLUMED directly from Python through its Python interface. The code below, for instance, generates the plot underneath the code. The code reads in the trajectory from the file traj.pdb that you obtained from the GitHub repository. The plot then shows how the \(\phi\) angle on the second residue of the protein changes with time.

```python
import matplotlib.pyplot as plt
import numpy as np
import plumed
import ase
import ase.io

# Read in trajectory using ase
traj = ase.io.read('../data/traj.pdb',':')

# Setup plumed object to do calculation
p = plumed.Plumed()
p.cmd("setMDEngine","python")
# Read PDB so need to multiply by 0.1 to convert to nm
p.cmd("setMDLengthUnits", 0.1)
p.cmd("setTimestep", 1.)
p.cmd("setKbT", 1.)
natoms = len(traj[0].positions)
p.cmd("setNatoms",natoms)
p.cmd("setLogFile","test.log")
p.cmd("init")

# Read plumed input
p.cmd("readInputLine","MOLINFO STRUCTURE=../data/bhp.pdb")
# If you are doing many variables I would recommend putting these
# next three PLUMED commands into a function
p.cmd("readInputLine", "t1: TORSION ATOMS=@phi-2" )
# Now setup some memory to hold the variable that is shared
# between plumed and the underlying code
shape = np.zeros( 1, dtype=np.int_ )
p.cmd("getDataRank t1 ", shape )
t1 = np.zeros((1))
p.cmd("setMemoryForData t1", t1)

# Loop over trajectory and get data from plumed
nfram, tt, v1, box = 0, [], [], np.array([[100.,0,0],[0,100.,0],[0,0,100]])
charges, forces, virial = np.zeros(natoms,dtype=np.float64), np.zeros([natoms,3]), np.zeros((3,3),dtype=np.float64)
for ts in traj :
    # Set all the input variables to PLUMED
    p.cmd("setStep",nfram)
    p.cmd("setBox",box )
    p.cmd("setMasses", ts.get_masses() )
    p.cmd("setCharges", charges )
    pos = np.array(ts.get_positions(), dtype=np.float64 )
    p.cmd("setPositions", pos )
    p.cmd("setForces", forces )
    p.cmd("setVirial", virial )
    # Run the plumed calculation
    p.cmd("calc")
    tt.append(nfram)
    # We can now extract the value of the torsion by accessing the shared memory we set up earlier
    v1.append(t1[0])
    nfram = nfram + 1

# Plot the graph of the torsional angle as a function of time
plt.plot( tt, v1, 'ko')
plt.xlabel("Simulation step")
plt.ylabel("Torsion angle / radians")
plt.show()
```

**Your task in this first exercise is to modify the code above and to produce a figure similar to the one shown below.** This figure shows all the values of the \(\phi\) and \(\psi\) angles in the second residue of the protein during the simulation.
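If it helps to check your results, the dihedral angle that the TORSION action computes can also be evaluated directly from four atomic positions. The function below is an illustrative numpy implementation of the standard atan2 dihedral formula; it is not PLUMED's code, and sign conventions can differ between implementations.

```python
import numpy as np

def torsion(p0, p1, p2, p3):
    """Dihedral angle in radians defined by four points, via the standard
    atan2 formula (sign convention may differ from other codes)."""
    b1, b2, b3 = p1 - p0, p2 - p1, p3 - p2
    n1, n2 = np.cross(b1, b2), np.cross(b2, b3)   # normals to the two planes
    m1 = np.cross(n1, b2 / np.linalg.norm(b2))
    return np.arctan2(np.dot(m1, n2), np.dot(n1, n2))
```

For the exercise itself you would register a second TORSION action with PLUMED (e.g. `t2: TORSION ATOMS=@psi-2`) and set up a second shared-memory array for it, just as was done for t1 in the script above.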

Plots showing the trajectory in CV space similar to those you generated at the end of the previous exercise are helpful. What would be more useful, however, is some way of understanding the relationship between the positions in CV space and the structures of the various atoms. In other words, what we would like is something like this:

You can see the frame in the trajectory that each point in the plot corresponds to from the above figure. The snapshots on the right correspond to the structures the system had at the points highlighted in red, yellow, green and blue respectively in the plot on the left.

The figure above was generated using chemiscope.

This server allows you to generate and interact with plots like the one shown above. **Your task in this exercise is to generate your own chemiscope representation of the data in traj.pdb.** To create a chemiscope representation of the \(\phi\) angles that we generated using the first python script from the previous exercise, you would add the following python code:

```python
from ase.data import atomic_masses
from chemiscope import write_input

# This ensures that the atomic masses are used in place of the symbols
# when constructing the atomic configurations' chemiscope representations.
# Using the symbols will not work because ase is written by chemists and not
# biologists. For a chemist, HG1 is mercury as opposed to the first hydrogen
# on a guanine residue.
for frame in traj:
    frame.numbers = np.array(
        [
            np.argmin(np.subtract(atomic_masses, float(am)) ** 2)
            for am in frame.arrays["occupancy"]
        ]
    )

# This constructs the dictionary of properties for chemiscope
properties = {
    "time": {
        "target": "structure",
        "values": tt,
        "description": "Simulation step number",
    },
    "t2": {
        "target": "structure",
        "values": v1,
        "description": "Phi angle of second residue",
    },
}

# This generates our chemiscope output
write_input("torsion_chemiscope.json.gz", frames=traj, properties=properties )
```

You would then upload the torsion_chemiscope.json.gz file that is generated by this script at https://chemiscope.org.

**See if you can generate your own chemiscope representation of the data in traj.pdb.** I would recommend calculating and uploading a chemiscope representation of all the protein's torsional angles.

At the very least, you should compute two backbone torsional angles. However, if you do more than two torsions, you can generate a plot like the one shown below.

Chemiscope comes into its own when you are working with a machine learning algorithm. These algorithms can (in theory) learn the collective variables you need to use from the trajectory data. To make sense of the coordinates that have been learned, you have to carefully visualize where structures are projected in the low dimensional space. You can use chemiscope to complete this process of visualizing the representation the computer has found for you. In this next set of exercises, we will apply various dimensionality reduction algorithms to the data contained in the file traj.pdb. If you visualize the contents of that file using VMD, you will see that it contains a short protein trajectory. You are perhaps unsure what CVs to use to analyze this data and thus want to see if you can shed any light on the contents of the trajectory by using machine learning.

Typically, PLUMED analyses one set of atomic coordinates at a time. To run a machine learning algorithm, however, you need to gather information on multiple configurations. Therefore, the first thing you need to learn in order to use these algorithms is how to store configurations for later analysis. The following input illustrates how to complete this task using PLUMED.


```
# This reads in the template pdb file and thus allows us to use the @nonhydrogens
# special group later in the input
MOLINFO STRUCTURE=__FILL__ MOLTYPE=protein
# This stores the positions of all the nonhydrogen atoms for later analysis
cc: COLLECT_FRAMES __FILL__=@nonhydrogens
# This should output the atomic positions for the frames that were collected to a pdb file called traj.pdb
OUTPUT_ANALYSIS_DATA_TO_PDB USE_OUTPUT_DATA_FROM=__FILL__ FILE=traj.pdb
```

**Copy the input above into a plumed file and fill in the blanks.** Then, once all the blanks are filled in, run the calculation using:

plumed driver --mf_pdb traj.pdb

You can also store the values of collective variables for later analysis with these algorithms. **Modify the input above so that all thirty backbone dihedral angles in the protein are stored and output using OUTPUT_ANALYSIS_DATA_TO_COLVAR, and rerun the calculation.**

For more information on the dimensionality reduction algorithms that we are using in this section see:

https://arxiv.org/abs/1907.04170

Having learned how to store data for later analysis with a dimensionality reduction algorithm, let's now apply principal component analysis (PCA) to our stored data. In principal component analysis, low dimensional projections for our trajectory are constructed by:

- Computing a covariance matrix from the trajectory data.
- Diagonalizing the covariance matrix.
- Calculating the projection of each trajectory frame on a subset of the eigenvectors of the covariance matrix.
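The three steps above can be sketched in a few lines of numpy. This is an illustration of the algorithm on synthetic data, independent of PLUMED's PCA action (which, with METRIC=OPTIMAL, additionally deals with alignment of the structures, something this sketch ignores):

```python
import numpy as np

# Toy "trajectory": 100 frames of a 6-component descriptor vector
rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 6)) @ np.diag([3.0, 2.0, 1.0, 0.1, 0.1, 0.1])

# 1. Compute the covariance matrix of the mean-centred data
centred = frames - frames.mean(axis=0)
cov = centred.T @ centred / (len(frames) - 1)

# 2. Diagonalize the covariance matrix (eigh, since cov is symmetric)
evals, evecs = np.linalg.eigh(cov)
order = np.argsort(evals)[::-1]      # sort eigenvectors, largest variance first
evecs = evecs[:, order]

# 3. Project each frame onto the two leading eigenvectors
projection = centred @ evecs[:, :2]
```

The variance of the first projected coordinate is, by construction, at least as large as that of the second; the leading eigenvectors capture the largest-amplitude motions in the data.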

To perform PCA using PLUMED, we are going to use the following input with the blanks filled in:


```
# This reads in the template pdb file and thus allows us to use the @nonhydrogens
# special group later in the input
MOLINFO STRUCTURE=__FILL__ MOLTYPE=protein
# This stores the positions of all the nonhydrogen atoms for later analysis
cc: COLLECT_FRAMES __FILL__=@nonhydrogens
# This diagonalizes the covariance matrix
pca: PCA USE_OUTPUT_DATA_FROM=__FILL__ METRIC=OPTIMAL NLOW_DIM=2
# This projects each of the trajectory frames onto the low dimensional space that was
# identified by the PCA command
dat: PROJECT_ALL_ANALYSIS_DATA USE_OUTPUT_DATA_FROM=__FILL__ PROJECTION=__FILL__
# This should output the PCA projections of all the coordinates
OUTPUT_ANALYSIS_DATA_TO_COLVAR USE_OUTPUT_DATA_FROM=dat ARG=dat.* FILE=pca_data
# These next three commands calculate the secondary structure variables. These
# variables measure how much of the structure resembles an alpha helix, an antiparallel beta-sheet
# and a parallel beta-sheet. Configurations that have different secondary structures should be projected
# in different parts of the low dimensional space.
alpha: ALPHARMSD RESIDUES=all
abeta: ANTIBETARMSD RESIDUES=all STRANDS_CUTOFF=1.0
pbeta: PARABETARMSD RESIDUES=all STRANDS_CUTOFF=1.0
# These commands collect and output the secondary structure variables so that we can use this
# information to determine how good our projection of the trajectory data is.
cc2: COLLECT_FRAMES ARG=alpha,abeta,pbeta
OUTPUT_ANALYSIS_DATA_TO_COLVAR USE_OUTPUT_DATA_FROM=cc2 ARG=cc2.* FILE=secondary_structure_data
```

To generate the projection, you run the command:

plumed driver --mf_pdb traj.pdb

You can generate a projection of the data above using chemiscope by using the following script:

```python
import ase.io
import numpy as np
from ase.data import atomic_masses
from chemiscope import write_input

# This ase command should read in the traj.pdb file that was analyzed. Notice that the analysis
# actions below ignore the first frame in this trajectory, so we need to take that into account
# when we generate the chemiscope input
traj = ase.io.read('../data/traj.pdb',':')

# This reads in the PCA projections that were output by the OUTPUT_ANALYSIS_DATA_TO_COLVAR
# command above
projection = np.loadtxt("pca_data")

# We also read in the secondary structure data. By colouring the points according to the
# secondary structure, we can get a sense of how good our projection is.
structure = np.loadtxt("secondary_structure_data")

# This ensures that the atomic masses are used in place of the symbols
# when constructing the atomic configurations' chemiscope representations.
# Using the symbols will not work because ase is written by chemists and not
# biologists. For a chemist, HG1 is mercury as opposed to the first hydrogen
# on a guanine residue.
for frame in traj:
    frame.numbers = np.array(
        [
            np.argmin(np.subtract(atomic_masses, float(am)) ** 2)
            for am in frame.arrays["occupancy"]
        ]
    )

# This constructs the dictionary of properties for chemiscope
properties = {
    "pca1": {
        "target": "structure",
        "values": projection[:,0],
        "description": "First principal component",
    },
    "pca2": {
        "target": "structure",
        "values": projection[:,1],
        "description": "Second principal component",
    },
    "alpha": {
        "target": "structure",
        "values": structure[:,0],
        "description": "Alpha helical content",
    },
    "antibeta": {
        "target": "structure",
        "values": structure[:,1],
        "description": "Anti parallel beta sheet content",
    },
    "parabeta": {
        "target": "structure",
        "values": structure[:,2],
        "description": "Parallel beta sheet content",
    },
}

# This generates our chemiscope output
write_input("pca_chemiscope.json.gz", frames=traj[1:], properties=properties )
```

When the output from this set of commands is loaded into chemiscope, we can construct figures like the one shown below. On the axes here, we have plotted the PCA coordinates. The points are then coloured according to the alpha-helical content.

**See if you can use PLUMED and chemiscope to generate a figure similar to the one above.** Try to experiment with the way the points are coloured. Look at the beta-sheet content as well.

In the previous section, we performed PCA on the atomic positions directly. In the section before last, however, we also saw how we could store high-dimensional vectors of collective variables and then use them as input to a dimensionality reduction algorithm. Therefore, we might legitimately ask if we can do PCA using these high-dimensional vectors as input rather than atomic positions. The answer to this question is yes, as long as the CVs are not periodic. If any of our CVs are periodic, we cannot analyze them using the PCA action. We can, however, formulate the PCA algorithm differently. In this alternative formulation, which is known as classical multidimensional scaling (MDS), we do the following:


- We calculate the matrix of distances between configurations.
- We perform an operation known as centring the matrix.
- We diagonalize the centred matrix. The eigenvectors multiplied by the square root of the corresponding eigenvalue can then be used as a set of projections for our input points.
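These steps can be sketched in numpy as follows. This is an illustrative implementation of classical MDS, not PLUMED's CLASSICAL_MDS action:

```python
import numpy as np

def classical_mds(D, ndim=2):
    """Classical multidimensional scaling from a matrix of pairwise distances D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centred squared distances
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1][:ndim]       # keep the largest eigenvalues
    # Projections: eigenvectors scaled by the square roots of the eigenvalues
    return evecs[:, order] * np.sqrt(np.maximum(evals[order], 0.0))
```

If the distance matrix comes from points that genuinely lie in a space with `ndim` dimensions, this procedure reproduces their pairwise distances exactly (up to rotation and reflection); for higher-dimensional data it gives the same projection as PCA.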

This method is used less often than PCA as the matrix that we have to diagonalize here in the third step can be considerably larger than the matrix that we have to diagonalize when we perform PCA.

To avoid this expensive diagonalization step, we often select a subset of so-called landmark points on which to run the algorithm directly. Projections for the remaining points are then found by using a so-called out-of-sample procedure. This is what has been done in the following input:


```
# This command reads in the template pdb file and thus allows us to use the @nonhydrogens
# group later in the input
MOLINFO STRUCTURE=beta-hairpin.pdb MOLTYPE=protein
# This stores the positions of all the nonhydrogen atoms for later analysis
cc: COLLECT_FRAMES ATOMS=@nonhydrogens
# The following commands compute all the Ramachandran angles of the protein for you
r2-phi: TORSION ATOMS=@phi-2
r2-psi: TORSION ATOMS=@psi-2
r3-phi: TORSION ATOMS=@phi-3
r3-psi: TORSION ATOMS=@psi-3
r4-phi: TORSION __FILL__
r4-psi: TORSION __FILL__
r5-phi: TORSION __FILL__
r5-psi: TORSION __FILL__
r6-phi: TORSION __FILL__
r6-psi: TORSION __FILL__
r7-phi: TORSION __FILL__
r7-psi: TORSION __FILL__
r8-phi: TORSION __FILL__
r8-psi: TORSION __FILL__
r9-phi: TORSION __FILL__
r9-psi: TORSION __FILL__
r10-phi: TORSION __FILL__
r10-psi: TORSION __FILL__
r11-phi: TORSION __FILL__
r11-psi: TORSION __FILL__
r12-phi: TORSION __FILL__
r12-psi: TORSION __FILL__
r13-phi: TORSION __FILL__
r13-psi: TORSION __FILL__
r14-phi: TORSION __FILL__
r14-psi: TORSION __FILL__
r15-phi: TORSION __FILL__
r15-psi: TORSION __FILL__
r16-phi: TORSION __FILL__
r16-psi: TORSION __FILL__
# This command stores all the Ramachandran angles that were computed
angles: COLLECT_FRAMES __FILL__=r2-phi,r2-psi,r3-phi,r3-psi,r4-phi,r4-psi,r5-phi,r5-psi,r6-phi,r6-psi,r7-phi,r7-psi,r8-phi,r8-psi,r9-phi,r9-psi,r10-phi,r10-psi,r11-phi,r11-psi,r12-phi,r12-psi,r13-phi,r13-psi,r14-phi,r14-psi,r15-phi,r15-psi,r16-phi,r16-psi
# Lets now compute the matrix of distances between the frames in the space of the Ramachandran angles
distmat: EUCLIDEAN_DISSIMILARITIES USE_OUTPUT_DATA_FROM=angles METRIC=EUCLIDEAN
# Now select 500 landmark points to analyze
fps: LANDMARK_SELECT_FPS USE_OUTPUT_DATA_FROM=distmat NLANDMARKS=500
# Run MDS on the landmarks
mds: CLASSICAL_MDS __FILL__=fps NLOW_DIM=2
# Project the remaining trajectory data
osample: PROJECT_ALL_ANALYSIS_DATA USE_OUTPUT_DATA_FROM=__FILL__ PROJECTION=__FILL__
# This command outputs all the projections of all the points in the low dimensional space
OUTPUT_ANALYSIS_DATA_TO_COLVAR USE_OUTPUT_DATA_FROM=osample ARG=osample.* FILE=mds_data
# These next three commands calculate the secondary structure variables. These
# variables measure how much of the structure resembles an alpha helix, an antiparallel beta-sheet
# and a parallel beta-sheet. Configurations that have different secondary structures should be projected
# in different parts of the low dimensional space.
alpha: ALPHARMSD RESIDUES=all
abeta: ANTIBETARMSD RESIDUES=all STRANDS_CUTOFF=1.0
pbeta: PARABETARMSD RESIDUES=all STRANDS_CUTOFF=1.0
cc2: COLLECT_FRAMES ARG=alpha,abeta,pbeta
OUTPUT_ANALYSIS_DATA_TO_COLVAR USE_OUTPUT_DATA_FROM=cc2 ARG=cc2.* FILE=secondary_structure_data
```

This input collects all the torsional angles for the configurations in the trajectory. Then, at the end of the calculation, the matrix of distances between these points is computed, and a set of landmark points is selected using a method known as farthest point sampling. A matrix that contains only those distances between the landmarks is then constructed and diagonalized by the CLASSICAL_MDS action so that projections of the landmarks can be built. The final step is then to project the remainder of the trajectory using the PROJECT_ALL_ANALYSIS_DATA action. Try to fill in the blanks in the input above and run this calculation now using the command:

plumed driver --mf_pdb traj.pdb
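While the LANDMARK_SELECT_FPS action does the landmark selection for you, the greedy farthest point sampling idea behind it can be sketched in numpy. This is an illustrative reimplementation, not PLUMED's code:

```python
import numpy as np

def farthest_point_sampling(X, nlandmarks, seed=0):
    """Greedy farthest point sampling: after a random first pick, each new
    landmark is the point farthest from all landmarks chosen so far."""
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(X)))]
    # Distance from every point to its nearest landmark so far
    dmin = np.linalg.norm(X - X[idx[0]], axis=1)
    for _ in range(nlandmarks - 1):
        nxt = int(np.argmax(dmin))               # farthest remaining point
        idx.append(nxt)
        dmin = np.minimum(dmin, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(idx)
```

Because each landmark is as far as possible from the others, this selection spreads the landmarks evenly over the sampled region of configuration space.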

Once the calculation has completed, you can, once again, visualize the data using chemiscope by using a suitably modified version of the script from the previous exercise. The image below shows the MDS coordinates coloured according to the alpha-helical content.

**Try to generate an image that looks like this one yourself by completing the input above and then using what you learned about generating chemiscope representations in the previous exercise.**

The two algorithms (PCA and MDS) that we have looked at thus far are linear dimensionality reduction algorithms. In addition to these, there is a whole class of non-linear dimensionality reduction algorithms, which work by transforming the matrix of dissimilarities between configurations, by calculating geodesic rather than Euclidean distances between configurations, or by changing the form of the loss function that is optimized. In this final exercise, we will use an algorithm that uses the last of these three strategies to construct a non-linear projection. The algorithm is known as sketch-map, and input for sketch-map is provided below:


```
# This reads in the template pdb file and thus allows us to use the @nonhydrogens
# special group later in the input
MOLINFO STRUCTURE=__FILL__ MOLTYPE=protein
# This stores the positions of all the nonhydrogen atoms for later analysis
cc: COLLECT_FRAMES __FILL__=@nonhydrogens
# This should output the atomic positions for the frames that were collected and analyzed using MDS
OUTPUT_ANALYSIS_DATA_TO_PDB USE_OUTPUT_DATA_FROM=cc FILE=traj.pdb
# The following commands compute all the Ramachandran angles of the protein for you
r2-phi: TORSION ATOMS=@phi-2
r2-psi: TORSION ATOMS=@psi-2
r3-phi: TORSION ATOMS=@phi-3
r3-psi: TORSION ATOMS=@psi-3
r4-phi: TORSION __FILL__
r4-psi: TORSION __FILL__
r5-phi: TORSION __FILL__
r5-psi: TORSION __FILL__
r6-phi: TORSION __FILL__
r6-psi: TORSION __FILL__
r7-phi: TORSION __FILL__
r7-psi: TORSION __FILL__
r8-phi: TORSION __FILL__
r8-psi: TORSION __FILL__
r9-phi: TORSION __FILL__
r9-psi: TORSION __FILL__
r10-phi: TORSION __FILL__
r10-psi: TORSION __FILL__
r11-phi: TORSION __FILL__
r11-psi: TORSION __FILL__
r12-phi: TORSION __FILL__
r12-psi: TORSION __FILL__
r13-phi: TORSION __FILL__
r13-psi: TORSION __FILL__
r14-phi: TORSION __FILL__
r14-psi: TORSION __FILL__
r15-phi: TORSION __FILL__
r15-psi: TORSION __FILL__
r16-phi: TORSION __FILL__
r16-psi: TORSION __FILL__
# This command stores all the Ramachandran angles that were computed
angles: COLLECT_FRAMES __FILL__=r2-phi,r2-psi,r3-phi,r3-psi,r4-phi,r4-psi,r5-phi,r5-psi,r6-phi,r6-psi,r7-phi,r7-psi,r8-phi,r8-psi,r9-phi,r9-psi,r10-phi,r10-psi,r11-phi,r11-psi,r12-phi,r12-psi,r13-phi,r13-psi,r14-phi,r14-psi,r15-phi,r15-psi,r16-phi,r16-psi
# Lets now compute the matrix of distances between the frames in the space of the Ramachandran angles
distmat: EUCLIDEAN_DISSIMILARITIES USE_OUTPUT_DATA_FROM=angles METRIC=EUCLIDEAN
# Now select 500 landmark points to analyze
fps: LANDMARK_SELECT_FPS USE_OUTPUT_DATA_FROM=distmat NLANDMARKS=500
# Run sketch-map on the landmarks
smap: SKETCH_MAP __FILL__=fps NLOW_DIM=2 HIGH_DIM_FUNCTION={SMAP R_0=6 A=8 B=2} LOW_DIM_FUNCTION={SMAP R_0=6 A=2 B=2} CGTOL=1E-3 CGRID_SIZE=20 FGRID_SIZE=200 ANNEAL_STEPS=0
# Project the remaining trajectory data
osample: PROJECT_ALL_ANALYSIS_DATA USE_OUTPUT_DATA_FROM=__FILL__ PROJECTION=__FILL__
# This command outputs all the projections of all the points in the low dimensional space
OUTPUT_ANALYSIS_DATA_TO_COLVAR USE_OUTPUT_DATA_FROM=osample ARG=osample.* FILE=smap_data
# These next three commands calculate the secondary structure variables. These
# variables measure how much of the structure resembles an alpha helix, an antiparallel beta-sheet
# and a parallel beta-sheet. Configurations that have different secondary structures should be projected
# in different parts of the low dimensional space.
alpha: ALPHARMSD RESIDUES=all
abeta: ANTIBETARMSD RESIDUES=all STRANDS_CUTOFF=1.0
pbeta: PARABETARMSD RESIDUES=all STRANDS_CUTOFF=1.0
cc2: COLLECT_FRAMES ARG=alpha,abeta,pbeta
OUTPUT_ANALYSIS_DATA_TO_COLVAR USE_OUTPUT_DATA_FROM=cc2 ARG=cc2.* FILE=secondary_structure_data
```

This input collects all the torsional angles for the configurations in the trajectory. Then, at the end of the calculation, the matrix of distances between these points is computed, and a set of landmark points is selected using a method known as farthest point sampling. A matrix that contains only those distances between the landmarks is then constructed and diagonalized by the CLASSICAL_MDS action, and this set of projections is used as the initial configuration for the various minimization algorithms that are then used to optimize the sketch-map stress function. As in the previous exercise, once the projections of the landmarks are found, the projections for the remainder of the points in the trajectory are found by using the PROJECT_ALL_ANALYSIS_DATA action. Try to fill in the blanks in the input above and run this calculation now using the command:

plumed driver --mf_pdb traj.pdb
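The HIGH_DIM_FUNCTION and LOW_DIM_FUNCTION keywords in the input above specify sigmoid transformations that are applied to the high- and low-dimensional distances before they are compared in the stress function. The sketch below follows the sigmoid functional form described in the sketch-map literature, with parameter names mirroring the input; it is an illustrative sketch of the stress that is minimized, not PLUMED's implementation:

```python
import numpy as np

def smap(r, r0, a, b):
    """Sketch-map sigmoid: 0 at r=0, 0.5 at r=r0, tending to 1 for large r."""
    return 1.0 - (1.0 + (2.0 ** (a / b) - 1.0) * (r / r0) ** a) ** (-b / a)

def sketchmap_stress(X_low, D_high, r0=6.0, a_hi=8.0, b_hi=2.0, a_lo=2.0, b_lo=2.0):
    """Sum of squared mismatches between transformed high-dimensional
    distances and transformed low-dimensional distances."""
    n, s = len(X_low), 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d_low = np.linalg.norm(X_low[i] - X_low[j])
            s += (smap(D_high[i, j], r0, a_hi, b_hi) - smap(d_low, r0, a_lo, b_lo)) ** 2
    return s
```

Because both transformations saturate at large distances, the optimization concentrates on reproducing the distances between nearby configurations, which is what gives sketch-map its non-linear character.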

Once the calculation has completed, you can, once again, visualize the data generated using chemiscope by using the scripts from earlier exercises. My projection is shown in the figure below. Points are, once again, coloured following the alpha-helical content of the corresponding structure.

**Try to see if you can reproduce an image that looks like the one above.**

This exercise has shown you that running dimensionality reduction algorithms using PLUMED involves the following stages:

- Data is collected from the trajectory using COLLECT_FRAMES.
- The distances between the trajectory frames are computed using EUCLIDEAN_DISSIMILARITIES.
- Landmark points are selected using a landmark selection algorithm.
- A loss function is optimized to generate projections of the landmarks.
- Projections of the non-landmark points are found using PROJECT_ALL_ANALYSIS_DATA.

There are multiple choices to be made in each of the various stages described above. For example, you can change the particular sort of data that is collected from the trajectory. There are multiple different ways to select landmarks. You can use the distances directly, or you can transform them. You can use various loss functions and optimize the loss function using a variety of different algorithms. When you tackle your own research problems using these methods, you can experiment with the various choices that can be made.

**Generating a low-dimensional representation using these algorithms is not enough to publish a paper.** We want to get some physical understanding from our simulation. Gaining this understanding is the hard part.

In many papers in this area, you will hear people talk about the distinction between collective variables and the reaction coordinate. The distinction these authors are trying to emphasize when they use this language is between a descriptor that takes different values for the various important basins in the energy landscape (the collective variable) and the actual pathway the reaction proceeds along (the reaction coordinate). Furthermore, the authors of such papers will often argue that the reaction coordinate is simply some linear/non-linear combination of collective variables. In this exercise, we will study alanine dipeptide with path collective variables to show you one way of thinking about this distinction between collective variables and reaction coordinates.

By studying a system that is this simple, you will also hopefully come to understand how we can interpret the coordinates that we extract using the dimensionality reduction algorithms that were discussed in the previous exercise.

To remind you, the free energy surface as a function of the \(\phi\) and \(\psi\) angles for alanine dipeptide is shown below:

In other masterclasses, we have discussed how there are two critical states in the above energy landscape. These states are labelled C7eq and C7ax above.

The two Ramachandran angles plotted on the x and y axes of the free energy surface above are examples of what we have called collective variables. Both of these angles can be used to distinguish between C7eq and C7ax configurations. The reaction coordinate is the path that the system actually takes as it moves from the C7ax to the C7eq configuration. Based on the shape of the free energy surface, we might suppose that this reaction coordinate looks something like the black line shown below:

The file called alanine-transformation.pdb that you can find in the data directory of the GitHub repository contains a series of configurations that lie close to the transition path that is illustrated in black in the figure above. Below are plots that show how \(\phi\) and \(\psi\) change as the system moves along this path. **Try to see if you can use what you have learned in previous masterclasses to reproduce the figure above before continuing.**

We know what structures correspond to the various stable states of our system. We might, therefore, be tempted to ask ourselves whether we can simply use the RMSD distance from a structure as the reaction coordinate. This approach will work if there is a single important configuration in the energy landscape: we could use the RMSD distance from this lowest energy structure as a CV to extract this configuration's free energy relative to everything else. How well does this work if we have two distinct states, though, as we have for alanine dipeptide? To investigate this question, fill in the PLUMED input below, which calculates the RMSD distances from the C7ax and C7eq configurations:


```
c7ax: RMSD __FILL__=../data/C7ax.pdb TYPE=__FILL__
c7eq: RMSD __FILL__=../data/C7eq.pdb TYPE=__FILL__
PRINT ARG=c7ax,c7eq FILE=colvar STRIDE=100
```

You can run an MD simulation to monitor these distances using the python script below.

```python
import shutil
import numpy as np
from ase import io

def generate_gromacs_input( directory, plumed_input, nsteps, temp ) :
    # Setup the initial configuration by picking a frame at random from the startpoints file
    allc = io.read('../data/startpoints.pdb',':')
    nframes = len(allc)
    io.write( directory + '/conf.pdb', allc[int(np.random.randint(0,nframes))] )
    # Copy the topology to the appropriate directory
    shutil.copyfile("../data/topol.top", directory + "/topol.top")
    # Setup the mdp file
    mdp = open("../data/md.mdp","r")
    contents = mdp.read()
    mdp.close()
    new_content = contents.replace("SEED", str(np.random.randint(0,1000000))).replace("NSTEPS",str(nsteps)).replace("TEMP",str(temp))
    mdpout = open(directory + "/md.mdp", "w")
    mdpout.write(new_content)
    mdpout.close()
    # And write the plumed input
    pout = open(directory + "/plumed.dat", "w")
    pout.write( plumed_input )
    pout.close()
    return

plm = '''
INSERT PLUMED INPUT HERE
'''

# Create a directory to run the calculation in
!rm -rf ../Unbiased_MD && mkdir ../Unbiased_MD
# Generate gromacs input for MD simulation at 1000 K
generate_gromacs_input( '../Unbiased_MD', plm, 500000, 1000 )
# Now run gromacs
!cd ../Unbiased_MD/ && gmx_mpi grompp -f md.mdp -c conf.pdb -p topol.top -maxwarn 2 &> /dev/null
!cd ../Unbiased_MD/ && gmx_mpi mdrun --plumed plumed.dat &> /dev/null
# And read in the colvar files
unbiased_data = np.loadtxt('../Unbiased_MD/colvar')
```

I ran simulations at 300K and 1000K using the above script. Once the simulations had completed, I was able to construct the figures below:

The black points above are the RMSD values for the trajectories. I have also shown the RMSD values for the frames in alanine-transformation in red. Notice that all the configurations explored in the MD are very distant from the C7ax and C7eq states. **Try now to reproduce the above figure yourself.**

You can see that at 300K, the simulation we ran did not visit the C7eq state. Furthermore, you can also clearly see this from the projections of the trajectories on the RMSD distances from these two configurations. Notice in these figures, however, that the distances from these two reference configurations were often considerably larger than the distance between these reference configurations and the configurations on the reference path in alanine-transformation.pdb.

Instead of calculating two distances, we might ask ourselves if the linear combination of \(\phi\) and \(\psi\) that is illustrated in the figure below:

gives a better description for the transition. We can define this CV as follows:

\[ s = \frac{(\phi_2 - \phi_1)(\phi_3 - \phi_1) + (\psi_2 - \psi_1)(\psi_3 - \psi_1)}{ \sqrt{(\phi_2 - \phi_1)^2 + (\psi_2 - \psi_1)^2} } \]

In this expression \((\phi_1,\psi_1)\) are the Ramachandran angles in the \(C_7eq\) configuration and \((\phi_2,\psi_2)\) are the Ramachandran angles in the \(C_7ax\). \((\phi_3,\psi_3)\) is the instantaneous configuration. You should thus be able to see that we have arrived at the above expression by using our knowledge of the dot product between two vectors.
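To make the geometry concrete, here is a small python sketch of this projection formula. The reference angles below are invented purely for illustration, and the periodicity of the torsions is ignored, as in the rest of this exercise.

```python
import numpy as np

def projection_cv(ref1, ref2, instantaneous):
    """Project the displacement from ref1 onto the unit vector from ref1 to ref2.

    ref1 = (phi1, psi1), ref2 = (phi2, psi2), instantaneous = (phi3, psi3).
    Periodicity of the angles is ignored."""
    v12 = np.asarray(ref2) - np.asarray(ref1)
    v13 = np.asarray(instantaneous) - np.asarray(ref1)
    # Dot product of the two displacement vectors over the length of the reference vector
    return np.dot(v12, v13) / np.linalg.norm(v12)

# Illustrative (made-up) Ramachandran angles in radians
c7eq, c7ax = (-1.5, 1.0), (1.0, -0.5)
print(projection_cv(c7eq, c7ax, c7eq))   # 0.0 at the first reference point
print(projection_cv(c7eq, c7ax, c7ax))   # the full distance between the references
```

At the first reference the CV is zero and at the second it equals the distance between the two references, so the CV measures progress along the line connecting the two states.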

**See if you can write an input to re-analyse the data in alanine-transformation.pdb and the MD simulations from the previous section using this CV.**

You should be able to get plots of the value of this CV as a function of step number that looks like the ones shown below:

I implemented this CV using a combination of TORSION and COMBINE. I also ignored the fact that the torsions are periodic variables when calculating the linear combination.

You can't use the sort of linear algebra above with periodic variables, but it is OK for these illustrative purposes.

Notice that the first frame in alanine-transformation.pdb has the molecule in the \(C_7ax\) configuration. The last frame has the molecule in the \(C_7eq\) state.

Figure masterclass-21-6-pca-figure showed how we can construct a new collective variable by taking a linear combination of two other variables. This idea can, however, be extended to higher dimensions. As long as we can find the vector that connects the \(C_7eq\) and \(C_7ax\) states, we can project our current coordinates on that particular vector. We can even define this vector in the space of the coordinates of the atoms. In other words, if the \(3N\)-dimensional vector of atomic positions is \(\mathbf{x}^{(1)}\) for the \(C_7eq\) configuration and \(\mathbf{x}^{(2)}\) for the \(C_7ax\) configuration, and if the instantaneous configuration of the atoms is \(\mathbf{x}^{(3)}\), we can use the following as a CV:

\[ s = \frac{\sum_{i=1}^{3N} (x^{(2)}_i - x^{(1)}_i ) (x^{(3)}_i - x^{(1)}_i )}{ \sqrt{\sum_{i=1}^{3N} (x^{(2)}_i - x^{(1)}_i )^2} } \]

as long as rotational and translational movement is removed before calculating the two displacement vectors above.

We can also get some sense of how far the system has moved from this vector by computing:

\[ z = \sqrt{ \sum_{i=1}^{3N} (x^{(3)}_i - x^{(1)}_i)^2 - s^2 } \]

which follows by applying Pythagoras' theorem.
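Assuming the rotational and translational fitting has already been done, the two expressions above can be sketched as follows. This is a hypothetical illustration with made-up coordinates, not the PLUMED implementation:

```python
import numpy as np

def pca_cvs(x1, x2, x3):
    """Return (s, z) for flattened 3N coordinate vectors.

    x1, x2 are the two reference configurations and x3 is the instantaneous one.
    Assumes rotation and translation have already been removed."""
    v12 = x2 - x1
    v13 = x3 - x1
    s = np.dot(v12, v13) / np.linalg.norm(v12)          # projection on the reference vector
    z = np.sqrt(np.dot(v13, v13) - s * s)               # perpendicular distance from it
    return s, z

# Tiny illustration with invented 6-dimensional (N=2 atoms) coordinates
x1 = np.zeros(6)
x2 = np.array([1.0, 0, 0, 0, 0, 0])
x3 = np.array([0.5, 0.5, 0, 0, 0, 0])
s, z = pca_cvs(x1, x2, x3)
print(s, z)   # s = 0.5, z = 0.5
```

Here the instantaneous configuration sits halfway along the reference vector and 0.5 units away from it, which is exactly what the pair \((s, z)\) reports.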

You can calculate this collective variable using PLUMED by using the input below:

```
p: PCAVARS __FILL__=pca-reference.pdb __FILL__=OPTIMAL
PRINT ARG=p.* FILE=colvar
```

**Use this input to re-analyse the data in alanine-transformation.pdb and your MD simulations before continuing.** The figure below shows the results that I obtained by analyzing the data using the PCAVARS command above.

The figure above shows that this coordinate does a reasonably good job. Notice, however, that we move quite far from the initial vector as we approach the C7ax state.

Instead of using a single PCA variable and calculating the projection on the vector connecting these two points, we can instead use the projection of the instantaneous coordinates in the plane that contains three reference configurations. You can calculate these collective variables using PLUMED by using the input below:

```
p: PCAVARS __FILL__=pca2-reference.pdb __FILL__=OPTIMAL
PRINT ARG=p.* FILE=colvar
```

Notice that there are now three structures within pca2-reference.pdb, and thus two vectors are defined. **Use this input to re-analyse the data in alanine-transformation.pdb and your MD simulations before continuing.**

The figure below shows the results that I obtained by analyzing the data using the PCAVARS command above.

Instead of using the projection on more than one vector, we can extend the ideas in the previous section and use curvilinear paths rather than linear paths. This trick is what is done with the PATH CVs that are illustrated in the figure below:

As you can see, there are two path collective variables. \(S(X)\) measures your position along the path and is calculated using:

\[ S(X)=\frac{\sum_{i=1}^{N} i\, e^{-\lambda \vert X-X_i \vert }}{ \sum_{i=1}^{N} e^{-\lambda \vert X-X_i \vert } } \]

\(Z(x)\), meanwhile, measures the distance from the path using:

\[ Z(X) = - \frac{1}{\lambda} \ln\left[ \sum_{i=1}^{N} e^{-\lambda \vert X-X_i \vert } \right] \]
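Assuming we already have the distances \(\vert X - X_i \vert\) from the instantaneous configuration to each of the \(N\) frames that define the path, the two formulas can be sketched directly. The distances and \(\lambda\) value below are invented for illustration:

```python
import numpy as np

def path_cvs(dists, lam):
    """Compute (S, Z) from an array of distances to the N path frames.

    dists[i] = |X - X_{i+1}|; lam is the lambda smoothing parameter."""
    i = np.arange(1, len(dists) + 1)
    w = np.exp(-lam * dists)
    S = np.sum(i * w) / np.sum(w)        # weighted average frame index: position on the path
    Z = -np.log(np.sum(w)) / lam         # soft minimum distance: deviation from the path
    return S, Z

# A configuration sitting exactly on frame 3 of a 5-frame path
d = np.array([2.0, 1.0, 0.0, 1.0, 2.0])
S, Z = path_cvs(d, lam=15.0)
print(S, Z)   # S is close to 3 and Z is close to 0
```

Because the exponential weights decay quickly, \(S\) is dominated by the index of the nearest path frame and \(Z\) behaves like the distance to that frame.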

We can calculate these CVs using a filled-in version of the input that is shown below:

```
path: PATH __FILL__=../data/alanine-path.pdb TYPE=__FILL__ LAMBDA=15100.
PRINT ARG=* FILE=colvar
```

The figure below shows the \(\phi\) and \(\psi\) angles for the configurations that define the path as red dots. Meanwhile, the black dots are the \(\phi\) and \(\psi\) angles that are visited during a high temperature, unbiased MD trajectory. The green dots are then the configurations in alanine-transformation.pdb. You can clearly see that the red dots lie along the transition path that connects the C7eq and C7ax configurations and that the green dots follow this pathway:

When we calculate the path collective variables for the black and red data points in the above figure, we get the result shown on the right. The panel on the left shows the output from a 300 K simulation during which the C7eq configuration was not visited:

You can see that the majority of configurations have very small z values. It would thus seem that we have identified our reaction coordinate. We can use the S-path coordinate to tell us where we are on the transition pathway between the two states. Furthermore, in defining this coordinate, we have used the positions of all the heavy atoms in the protein.

Notice that when we talk about the reaction coordinate, we are not talking about a single set of configurations that connect the two states. The previous sections have shown that there are always multiple pathways that connect the two states. The task of identifying a single reaction coordinate thus appears impossible. How, after all, can there be a single reaction path if we know that there are multiple pathways that connect the two states?

One way of answering this question is to look at how far the transitions you observe deviate from the reference transition path using the Z-coordinate. An alternative answer can be found by considering so-called isocommittor surfaces. When using this method, you suppose there is a saddle point between the two states of interest (the \(C_7ax\) and \(C_7eq\) configurations in our alanine dipeptide example).

Let's suppose that we now start a simulation with the system balanced precariously on this saddle point. The system will very rapidly fall off the maximum and move towards either the left or the right basin. If we start exactly at the saddle point, 50% of the trajectories will fall to the right, and 50% will fall to the left.

We can use the idea of the isocommittor to identify whether any collective variable is giving a good description of the transition state ensemble's location. The idea in such simulations is to define regions of space that correspond to the various stable states in the free energy landscape. We then launch lots of simulations from configurations that are not within these basins and identify the basin that these calculations visit first. We can then make graphs like those shown below. These graphs show the fraction of times a particular CV value was visited in a trajectory that visited the c7eq basin before it visited the c7ax basin.

The PLUMED input that I used when making the figure above is a filled in version of the following:


```
phi: TORSION __FILL__=5,7,9,15
psi: TORSION __FILL__=7,9,15,17
pca: PCAVARS __FILL__=../data/pca-reference.pdb TYPE=__FILL__
PRINT __FILL__=phi,psi,pca.eig-1,path.spath FILE=colvar STRIDE=1
__FILL__ ARG=phi STRIDE=1 BASIN_LL1=-3 BASIN_UL1=-1 BASIN_LL2=1 BASIN_UL2=2 FILE=basin
```

I deployed the input above within the following python script that launches a large number of gromacs simulations:

```python
def gen_trajectories( ntraj, plm ) :
    nc7eq, c7eq_colv, all_data = 0, np.zeros([0,5]), np.zeros([0,5])
    for i in range(ntraj) :
        !rm -rf ../Test && mkdir ../Test
        generate_gromacs_input( '../Test', plm, 10000000, 300 )
        !cd ../Test/ && gmx_mpi grompp -f md.mdp -c conf.pdb -p topol.top -maxwarn 2 &> /dev/null
        !cd ../Test/ && gmx_mpi mdrun --plumed plumed.dat &> /dev/null
        # Determine which basin this trajectory committed to
        with open( '../Test/basin', "r" ) as myfile :
            for line in myfile :
                if line.startswith("#! SET COMMITTED TO BASIN") : basin = line.split()[5]
        colv_data = np.loadtxt("../Test/colvar")
        if len(colv_data.shape)==1 : colv_data = np.reshape( colv_data, (1,5) )
        all_data = np.concatenate( (all_data, colv_data), axis=0 )
        if basin=="1" :
            c7eq_colv = np.concatenate( (c7eq_colv, colv_data), axis=0 )
            nc7eq = nc7eq + 1
    print( "NUMBER c7ax", ntraj-nc7eq, "c7eq", nc7eq )
    return c7eq_colv, all_data

p='''
YOUR PLUMED INPUT GOES HERE
'''

bas_data, all_data = gen_trajectories( ntraj, p )
```

Notice how the above script stores the CV values from the trajectories that finished in the C7eq basin in the NumPy array called bas_data. We can use this data to construct the (unnormalized) histogram of CV values from trajectories that finished in C7eq, as well as the histogram for all trajectories. The script I used to construct these histograms is shown below. This script also shows how we arrive at the final conditional probability distributions in masterclass21-6-rmsd-committor: we divide the number of configurations that moved to C7eq first by the total number of configurations that visited each point.

```python
def histo( data, nbins, xmin, xmax ) :
    delr = ( xmax - xmin ) / nbins
    hist = np.zeros(nbins)
    for d in data :
        xbin = int( np.floor( (d-xmin) / delr ) )
        hist[xbin] = hist[xbin] + 1
    return hist / delr

def get_isocommitor( bas_data, full_data, nbins, xmin, xmax ) :
    bas_histo = histo( bas_data, nbins, xmin, xmax )
    full_histo = histo( full_data, nbins, xmin, xmax )
    for i in range(nbins) :
        if np.abs(full_histo[i])<1E-4 : bas_histo[i] = 0
        else : bas_histo[i] = bas_histo[i] / full_histo[i]
    return bas_histo

# This makes the isocommittor as a function of phi:
commit = get_isocommitor( bas_data[:,1], all_data[:,1], 30, -np.pi, np.pi )
```

Notice that the script above does not compute the error bars I included in masterclass21-6-rmsd-committor. **Your task in this exercise is to reproduce this figure with the error bars.** I generated the error bars in masterclass21-6-rmsd-committor by running 10 sets of 500 simulations. I constructed separate histograms from each of these batches of simulations and calculated the mean and variance from these ten sets of samples.
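One way to organize the batch averaging described above is sketched below. It assumes you have already computed one isocommittor histogram per batch (for example with the helpers from the previous script) and stacked them into a single array; the numbers used here are made up purely to exercise the function.

```python
import numpy as np

def batch_statistics(batch_estimates):
    """Given a (nbatches, nbins) array of per-batch isocommittor estimates,
    return the mean and the standard error on the mean for each bin."""
    batch_estimates = np.asarray(batch_estimates)
    nbatches = batch_estimates.shape[0]
    mean = batch_estimates.mean(axis=0)
    # Sample variance across the batches, then the standard error on the mean
    var = batch_estimates.var(axis=0, ddof=1)
    stderr = np.sqrt(var / nbatches)
    return mean, stderr

# Illustration with invented numbers: 10 batches, 4 histogram bins
rng = np.random.default_rng(0)
fake = 0.5 + 0.05 * rng.standard_normal((10, 4))
mean, err = batch_statistics(fake)
print(mean, err)
```

The mean gives the height of each point in the plot and the standard error gives the error bar, which is the standard recipe for block averaging over independent batches of simulations.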

The exercises in this section aimed to discuss the difference between collective variables and the reaction coordinate. I have argued that collective variables are developed by researchers thinking about coordinates that can distinguish between the important states. Reaction coordinates, meanwhile, are the actual pathways that reactions proceed along. These reaction coordinates are found by probing the data we get from simulations.

I am not convinced that the distinction between reaction coordinates and collective variables is very important. Every coordinate we have tried in the previous sections has allowed us to determine whether we are in the C7eq or the C7ax basin. My view is that when we do science, we impose structure on our results by asking questions. Our question has been to determine the relative free energies of the C7ax and C7eq basins in this exercise. To do this, we have to define what configurations are in the C7ax basin and what configurations are in the C7eq basin. The distinction between these two structures is an arbitrary figment of our imagination. If we use different variables, we might be defining these states differently, and we might get slightly different answers. We should not be surprised by the fact that we get different answers if we ask the question differently. The important thing is to develop an interesting question and to tell a good story about whatever you find out.

Many researchers have used these rare event methods to study nucleation and crystal growth. Studying such problems introduces an additional challenge when designing collective variables as all the atoms are indistinguishable. You thus know before even running the simulation that there are multiple paths that connect the reactant (liquid) state and the product (solid) state. The nucleation, after all, can start with any atom in the simulation box. The first step in developing CVs for studying nucleation processes is thus to identify some order parameter that allows us to distinguish atoms that are in the solid state from atoms that are in the liquid state. The following input, once it is filled in, will calculate some suitable order parameters for all the atoms in the system:


```
UNITS NATURAL
coord: COORDINATIONNUMBER __FILL__=1-5184 __FILL__={CUBIC D_0=1.2 D_MAX=1.5}
cub: FCCUBIC __FILL__=1-5184 __FILL__={CUBIC D_0=1.2 D_MAX=1.5} ALPHA=27
fcub: MTRANSFORM_MORE DATA=__FILL__ __FILL__={SMAP R_0=0.45 D_0=0.0 A=8 B=8}
DUMPMULTICOLVAR __FILL__=coord STRIDE=1000 __FILL__=coord.xyz
DUMPMULTICOLVAR __FILL__=cub STRIDE=1000 __FILL__=cub.xyz
DUMPMULTICOLVAR __FILL__=fcub STRIDE=1000 __FILL__=fcub.xyz
```

You can run a short MD simulation on Lennard-Jonesium to compute these order parameters by using the following input (in) file for simplemd:

```
inputfile interface.xyz
outputfile output.xyz
temperature 0.6
tstep 0.005
friction 1
forcecutoff 2.5
listcutoff 3.0
nstep 10000
nconfig 100 trajectory.xyz
nstat 100 energies.dat
```

You can run this from a python notebook and read in the values of the order parameters by using the following script:

```python
import numpy as np

p = '''
INSERT YOUR PLUMED INPUT HERE
'''

inp = '''
INSERT YOUR SIMPLEMD INPUT HERE
'''

# Make a directory to run in
!rm -rf ../LJ-trajectories/First-run && mkdir ../LJ-trajectories/First-run
# Copy the initial configuration
!cp ../data/interface.xyz ../LJ-trajectories/First-run
# Output the plumed file
f = open("../LJ-trajectories/First-run/plumed.dat", 'w')
f.write(p)
f.close()
# Output the input to simplemd
f = open("../LJ-trajectories/First-run/in", 'w')
f.write(inp)
f.close()

# Now run PLUMED
!cd ../LJ-trajectories/First-run/ && plumed simplemd < in &> /dev/null

# Read in the various order parameters (N.B. you can almost certainly write better python here)
!grep X ../LJ-trajectories/First-run/coord.xyz | awk '{print $5}' > ../LJ-trajectories/First-run/coord.dat
coord = np.loadtxt("../LJ-trajectories/First-run/coord.dat")
!grep X ../LJ-trajectories/First-run/cub.xyz | awk '{print $5}' > ../LJ-trajectories/First-run/cub.dat
cub = np.loadtxt("../LJ-trajectories/First-run/cub.dat")
!grep X ../LJ-trajectories/First-run/fcub.xyz | awk '{print $6}' > ../LJ-trajectories/First-run/fcub.dat
fcub = np.loadtxt("../LJ-trajectories/First-run/fcub.dat")
```

A visualization of the order parameters for all the atoms can then be produced using chemiscope, which allows you to see the value of all the individual atom order parameters.

```python
import ase
import ase.io
from chemiscope import write_input

# Read in the trajectory
traj = ase.io.read('../LJ-trajectories/First-run/trajectory.xyz',':')

# This constructs the dictionary of properties for chemiscope
properties = {
    "coord": {
        "target": "atom",
        "values": coord,
        "description": "Coordination number of atom",
    },
    "cub": {
        "target": "atom",
        "values": cub,
        "description": "FCCCUBIC order parameter",
    },
    "fcub": {
        "target": "atom",
        "values": fcub,
        "description": "Transformed FCCUBIC order parameter",
    },
}

# This generates our chemiscope output
write_input("fccubic_chemiscope.json.gz", cutoff=1.5, frames=traj, properties=properties )
```

**Put all the above together and see if you can produce a chemiscope representation like the one below.**

Chemiscope is invaluable for determining whether our order parameter is good at distinguishing the atoms within the solid from those within the liquid. If you visualize the FCCCUBIC or the transformed values of these parameters in the example above, you see that roughly half of the atoms are solid-like and roughly half are liquid-like. Notice also that we can use the dimensionality reduction algorithms that were discussed in Exercise 4: Path collective variables and clustering algorithms to develop these atomic order parameters. Generating the data to analyze with these algorithms is easier because we can get multiple sets of coordinates to analyze from each trajectory frame. The atoms are, after all, indistinguishable.

Once we have identified an order parameter we can analyse the distribution of values that it takes by using an input like the one shown below:


```
UNITS NATURAL
coord: COORDINATIONNUMBER __FILL__=1-5184 __FILL__={CUBIC D_0=1.2 D_MAX=1.5}
cub: FCCUBIC __FILL__=1-5184 __FILL__={CUBIC D_0=1.2 D_MAX=1.5} ALPHA=27
fcub: MTRANSFORM_MORE DATA=__FILL__ __FILL__={SMAP R_0=0.45 D_0=0.0 A=8 B=8}
coord_histo: HISTOGRAM __FILL__=coord STRIDE=10 __FILL__=2 GRID_MAX=15 __FILL__=100 __FILL__=DISCRETE
cub_histo: HISTOGRAM __FILL__=cub STRIDE=10 GRID_MIN=-1 __FILL__=1 __FILL__=100 __FILL__=DISCRETE
fcub_histo: HISTOGRAM __FILL__=fcub STRIDE=10 __FILL__=0 GRID_MAX=1 __FILL__=100 __FILL__=DISCRETE
__FILL__ __FILL__=coord_histo FILE=coord_histo.dat
__FILL__ __FILL__=cub_histo FILE=cub_histo.dat
__FILL__ __FILL__=fcub_histo FILE=fcub_histo.dat
```

The resulting distributions of order parameter values should look like this:

**Try to use the information above to reproduce this plot.** Notice that the distribution for the FCC cubic order parameter is bimodal. One peak corresponds to the value that this quantity takes in the liquid part of the box; the second peak corresponds to the value it takes in the solid part. By contrast, the same cannot be said for the coordination number. Notice, last of all, that by transforming the values as above we obtain an order parameter that is one for atoms that are within the solid and zero otherwise.

It is tempting to argue that the number of solid particles is equal to the number of atoms that have an order parameter that is greater than some threshold. A better way to calculate the number of solid particles, however, is to proceed as follows:

- Run simulations of the solid and liquid under the same thermodynamic conditions.
- Calculate the average value per atom of the order parameter in these simulations for the solid, \(\phi_s\), and for the liquid, \(\phi_l\).
- Calculate the number of solid atoms \(n_s\) and liquid atoms \(n_l\) by solving the following pair of simultaneous equations: \(N=n_s + n_l\) and \(\Phi = n_s \phi_s + n_l \phi_l\), where \(N\) is the total number of atoms and \(\Phi\) is given by:

\[ \Phi = \sum_{i=1}^N \phi_i \]

The sum here runs over all the atoms, and \(\phi_i\) is the order parameter for the \(i\)th atom in the system.

This procedure works because \(\Phi\) is an extensive quantity and is thus additive. If we have a system that contains a solid-liquid interface, we can express the value of any extensive quantity, \(\Psi\), using:

\[ \Psi = n_s \psi_s + n_l \psi_l + \psi_i \]

where \(\psi_s\) and \(\psi_l\) are the average values per atom of the quantity \(\psi\) in the solid and liquid phases. \(\psi_i\), meanwhile, is the so-called surface excess term, which measures the contribution that the presence of the interface makes to the value of \(\Psi\). When we find \(n_s\) by solving the simultaneous equations above, we assume that this excess term is zero.
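Solving the pair of simultaneous equations above gives \(n_s = (\Phi - N\phi_l)/(\phi_s - \phi_l)\). A minimal sketch of this lever-rule calculation is shown below, with invented numbers standing in for the averages you would extract from your solid and liquid runs:

```python
import numpy as np

def count_solid_atoms(phi_per_atom, phi_s, phi_l):
    """Solve N = n_s + n_l and Phi = n_s*phi_s + n_l*phi_l for n_s.

    phi_per_atom: order parameter of every atom in the interface system.
    phi_s, phi_l: average per-atom values from the pure solid/liquid runs."""
    N = len(phi_per_atom)
    Phi = np.sum(phi_per_atom)
    return (Phi - N * phi_l) / (phi_s - phi_l)

# Made-up example: 6 atoms, solid average 1.0, liquid average 0.0
phi = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
print(count_solid_atoms(phi, phi_s=1.0, phi_l=0.0))   # 3.0
```

Applying this function to every frame of the interface trajectory gives the number of solid atoms as a function of time that the exercise below asks for.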

**To complete this exercise, you should run simulations of the solid, liquid and interface. You should use the ensemble average of the order parameter from your solid and liquid simulations to make a graph like the one below that shows the number of solid atoms as a function of time for the system that contains the interface.** The number of solid atoms has been determined by solving the simultaneous equations using the theory above. Notice that there are input configurations for the solid, liquid and interface systems in the data directory of the GitHub repository. The solid configuration is in solid.xyz. The liquid is in liquid.xyz, and the interface is in interface.xyz. The final result you get will look something like this:

Notice that the green line is the least noisy of these curves. The reason the line is less noisy is connected to the fact that you have the clearest delineation between solid and liquid atoms when you use this order parameter.

NB. We are running at constant volume because we are keeping things simple and using simplemd. If you are studying nucleation using these techniques, you should run at constant pressure.

The exercises in this tutorial have introduced you to the following ideas:

- Using chemiscope to visualize trajectories and collective variables.
- Using dimensionality reduction algorithms
- Using Path collective variables
- Clustering trajectories
- Dealing with indistinguishable atoms

What I would like to do in the follow-up session is to think about how you can use these ideas as you do your research. The first step in using any of these methods is asking a research question that is amenable to being answered using one of them. In this final exercise, I would like you to think about using these methods in your research. The first step in doing so is asking yourself the question **what am I trying to find out?** The clearer you make this question, the easier you will find the subsequent research project. With this in mind, here are some examples of terrible research questions:

- Understanding protein-ligand interactions using machine learning algorithms.
- Understanding the mechanism via which limescale forms in pipes
- Understanding protein folding using machine learning.
- Studying transport through membrane proteins.
- Understanding the catalytic mechanism of iron in ammonia formation.

These research questions are terrible because the final result that you will obtain has not been specified. It is unclear when the project will be finished, or what success in this project looks like. With this in mind, consider the following alternative questions, which each have a clear goal. Each of the goals below maps on to one of the terrible research questions above:

- Calculate the free energy change when ligand A binds to protein B.
- Calculate the free energy change when single calcium and carbonate ions bind to copper surfaces.
- Calculate the free energy change when protein A transitions to its folded state.
- Calculate the free energy change as a sodium ion moves through a sodium channel protein.
- Calculate the free energy change when nitrogen, hydrogen and ammonia gas bind to iron surfaces.

For each of the questions above, it is clear what success will look like – you will have a converged estimate of the free energy difference. We have shown you how such estimates can be extracted in previous masterclasses and what it looks like when such calculations go wrong. Answering these research questions is thus simply a matter of applying what you have learned. Notice, also, how for each of the questions above, it is clear what you can do in the first step:

- Run a parallel tempering metadynamics simulation with the distance between the centres of mass of the protein and the ligand as the CV.
- Run a metadynamics simulation of a slab of copper in contact with an aqueous solution containing a single calcium ion. Use the z component of the vector connecting the calcium ion and the centre of mass of the copper slab as the CV.
- Run a metadynamics simulation using the distance from the protein's folded state as the CV.
- Run a metadynamics simulation of the sodium ion in the channel protein. Use the z position of the sodium relative to the centre of the channel protein as the CV.
- Run a metadynamics simulation of a slab of iron in contact with a simulation box that is empty except for one nitrogen molecule. Use the z component of the vector connecting the centre of mass of the nitrogen and the centre of mass of the iron slab as the CV.
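Several of the suggestions above use the z component of a vector connecting an adsorbate to the centre of mass of a slab as the CV. As a minimal sketch of how such a setup might look in a PLUMED input file, consider the following; the atom indices, metadynamics parameters and file names here are placeholders that you would need to adapt to your own system:

```
# Hypothetical example: bias the z component of the vector connecting
# an adsorbate (atom 101) to the centre of mass of a slab (atoms 1-100).
slab: COM ATOMS=1-100
d: DISTANCE ATOMS=slab,101 COMPONENTS

# Well-tempered metadynamics on the z component only.
# HEIGHT, SIGMA, PACE and BIASFACTOR are illustrative values, not recommendations.
metad: METAD ARG=d.z PACE=500 HEIGHT=1.2 SIGMA=0.05 BIASFACTOR=10 TEMP=300

# Monitor the CV and the bias during the simulation.
PRINT ARG=d.z,metad.bias FILE=colvar STRIDE=100
```

The `COMPONENTS` keyword of `DISTANCE` makes the x, y and z components of the connecting vector available as `d.x`, `d.y` and `d.z`, so only the component perpendicular to the slab is biased.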

These simulations may or may not give you the desired free energy difference. You will at least get some trajectory data from them, however. You can then analyze the trajectories using a machine learning algorithm or similar to refine your approach. This refining stage is where the experience of your supervisor should be invaluable.

With all this in mind, think about your research project. Try to develop a research question to work on and an approach that you can use to tackle it. Your question should not be some grand challenge; it should be something simple that you know how to do. You do not need to run these simulations. I simply want you to formulate a research question so that we can discuss it at the follow-up session. Do think seriously about how you can use these methods in your research: if you are not going to apply them, attending these masterclasses is pointless.