In this Masterclass, we will discuss how to run variationally enhanced sampling simulations with PLUMED. We will also understand how to analyze the results.
Once you have completed this Masterclass you will be able to:
Use PLUMED to run and analyze
For this masterclass you will need versions of PLUMED (with the VES module enabled) and GROMACS that are compiled using the MPI library. All the exercises were tested with PLUMED version 2.8.0 and GROMACS 2020.6. In order to obtain the correct versions, please follow the instructions at this link.
The data needed to execute the exercises of this Masterclass can be found on GitHub. You can clone this repository locally on your machine using the following command:
git clone https://github.com/valsson-group/masterclass-22-11.git
Here, we will briefly summarize the theory behind variationally enhanced sampling (VES). For an full overview of VES, you should read the original paper, a recent review book chapter on VES, or a recent enhanced sampling review that includes discussion about VES.
VES is based on the the following functional of the bias potential:
\[ \Omega [V] = \frac{1}{\beta} \log \frac {\int d\mathbf{s} \, e^{-\beta \left[ F(\mathbf{s}) + V(\mathbf{s})\right]}} {\int d\mathbf{s} \, e^{-\beta F(\mathbf{s})}} + \int d\mathbf{s} \, p_{\mathrm{tg}}(\mathbf{s}) V(\mathbf{s}), \]
where \(\mathbf{s}\) are the CVs that we are biasing, \( p_{\mathrm{tg}}(\mathbf{s}) \) is a predefined probability distribution that we will refer to as the target distribution, and \( F(\mathbf{s}) \) is the free energy surface. This functional can be shown to be convex and to have a minimum at:
\[ V(\mathbf{s}) = -F(\mathbf{s})-{\frac {1}{\beta}} \log {p_{\mathrm{tg}}(\mathbf{s})}. \]
The last equation states that when we minimize the functional \( \Omega [V] \), we can obtain the free energy surface from the bias potential (and the target distribution) We can choose the target distribution \( p_{\mathrm{tg}}(\mathbf{s}) \) at will and it is the CV distribution that we obtain when minimizing \( \Omega [V] \).
We put the variational principle to practice by expanding \( V(\mathbf{s}) \) in some basis set:
\[ V(\mathbf{s}) = \sum\limits_{i} \alpha_i \, f_i(\mathbf{s}), \]
where \( f_i(\mathbf{s}) \) are the basis functions and the \(\boldsymbol\alpha \) are the coefficients in the expansion. We then need to find the coefficients \(\boldsymbol\alpha \) that minimize \( \Omega [V] \). In principle one could use any optimization algorithm. In practice the algorithm that has become the default choice for VES is the so-called averaged stochastic gradient descent algorithm [9]. In this algorithm the \(\boldsymbol\alpha \) are evolved iteratively according to:
\[ \boldsymbol\alpha^{(n+1)} = \boldsymbol\alpha^{(n)}-\mu \left[ \nabla\Omega(\bar{\boldsymbol\alpha}^{(n)})+ H(\bar{\boldsymbol\alpha}^{(n)})[\boldsymbol\alpha^{(n)}-\bar{\boldsymbol\alpha}^{(n)}] \right] \]
where \(\mu\) is the step size, \(\bar{\boldsymbol\alpha}^{(n)} \) is the running average of \(\boldsymbol\alpha^{(n)} \) at iteration \( n \), and \(\nabla\Omega(\bar{\boldsymbol\alpha}^{(n)}) \) and \(H(\bar{\boldsymbol\alpha}^{(n)}) \) are the gradient and Hessian of \( \Omega[V] \) evaluated at the running average at iteration \( n \), respectively. The behavior of the coefficients will become clear in the examples below.
As said above, we can choose the target distribution \( p_{\mathrm{tg}}(\mathbf{s}) \) at will. The most simple choice would be a uniform target distribution. However, it has found more optimal to employ the so-called well-tempered distribution [138] :
\[ p_{\mathrm{tg}}(\mathbf{s}) = \frac{[ P(\mathbf{s}) ]^{1/\gamma}} {\int d\mathbf{s}\, [ P(\mathbf{s}) ]^{1/\gamma}} \]
where \( \gamma \) is the so-called bias factor and \( P(\mathbf{s}) \) is the unbiased CV distribution. Therefore the well-tempered distribution is the unbiased distribution with enhanced fluctuations and lowered barriers. This is the same distribution as sampled in well-tempered metadynamics. The advantages of this distribution are that the features of the FES (metastable states) are preserved and that the system is not forced to sample regions of high free energy (that can represent un-physical configurations) as it would if we had chosen the uniform target distribution.
There is a caveat though, the well-tempered \( p_{\mathrm{tg}}(\mathbf{s}) \) depends on \( F(\mathbf{s})\) that is the function that we are trying to calculate. One way to approach this problem is to calculate \( p_{\mathrm{tg}}(\mathbf{s}) \) self-consistently [138], for instance at iteration \( k \):
\[ p^{(k+1)}(\mathbf{s})=\frac{e^{-(\beta/\gamma) F^{(k+1)}(\mathbf{s})}}{\int d\mathbf{s} \, e^{-(\beta/\gamma) F^{(k+1)}(\mathbf{s})}} \]
where:
\[ F^{(k+1)}(\mathbf{s})=-V^{(k)}(\mathbf{s}) - \frac{1}{\beta} \log p^{(k)}(\mathbf{s}) \]
Normally \( p^{(0)}(\mathbf{s}) \) is taken to be uniform. Therefore the target distribution evolves in time until it becomes stationary when the simulation has converged. It has been shown that in some cases the convergence is faster using the well-tempered target distribution than using the uniform \( p(\mathbf{s}) \) [138].
In this tutorial, we will consider the association/dissociation of NaCl in aqueous solution. The system consists of 1 Na atom, 1 Cl atom, and 107 water molecules for a total of 323 atoms. In an effort to speed up the simulations, we employ a rather small water box, and thus need to employ smaller cutoffs than usually used. Therefore, this simulation setup should not be used in production runs. Typically, the run should take around 15-20 minutes to run on a laptop using two MPI processes. By running the simulations on a cluster, you reduce the simulation time.
The most relevant CV for this system is the distance between the Na and Cl atoms that is defined in PLUMED as
dist: DISTANCE ATOMSthe pair of atom that we are calculating the distance between. =322,323
Furthermore, the NaCl association/dissociation is coupled to the collective motion of the solvent. To measure that, we will use a CV that measures the solvation of the Na atom. For this, we employ the coordination number of the Na atom with respect to the oxygens of the water molecules that we define in PLUMED as
coord: COORDINATION ... GROUPAFirst list of atoms. =322 GROUPBSecond list of atoms (if empty, N*(N-1)/2 pairs in GROUPA are counted). =1-321:3 SWITCHThis keyword is used if you want to employ an alternative to the continuous switching function defined above. ={RATIONAL R_0=0.315 D_MAX=0.5 NN=12 MM=24} NLIST( default=off ) Use a neighbor list to speed up the calculation NL_CUTOFFThe cutoff for the neighbor list =0.55 NL_STRIDEThe frequency with which we are updating the atoms in the neighbor list =10 ...
We will also limit CV space exploration by employing an upper wall on the distance between Na and Cl atoms that is defined in PLUMED as
uwall: UPPER_WALLS ... ARGthe input for this action is the scalar output from one or more other actions. =dist ATcompulsory keyword the positions of the wall. =0.6 KAPPAcompulsory keyword the force constant for the wall. =4000.0 ...
We will start by performing a simulation where we bias the Na-Cl distance.
Every VES simulation has three key ingredients
For the basis set, we will use the recently introduced wavelet-based basis set that are localized basis functions that have been shown to perform better than the previously used Chebyshev or Legendre polynomials. We will employ the least asymmetric variant of these wavelets or so-called symlets (as indicated by the TYPE=SYMLETS keyword). We will use an order 10 of the symlets or Sym10 (as indicated by the ORDER=10 keyword). Furthermore information about the wavelets can be found in the reference above.
We need to select the range on which the bias potential is expanded. Here we will use the range from 0.2 nm to 0.7 nm (as indicated by the MINIMUM and MAXIMUM keywords). We also need to select the number of basis functions, and here we will use 26 basis functions (as indicated by the NUM_BF keyword). The PLUMED action corresponding to this setup is given by
# Basisset for Na-Cl distance bf1: BF_WAVELETS ... TYPEcompulsory keyword Specify the wavelet type. =SYMLETS ORDERcompulsory keyword The order of the basis function expansion. =10 NUM_BFThe number of basis functions that should be used. =26 MINIMUMcompulsory keyword The minimum of the interval on which the basis functions are defined. =0.2 MAXIMUMcompulsory keyword The maximum of the interval on which the basis functions are defined. =0.7 TAILS_THRESHOLDThe threshold for cutting off tail wavelets as a fraction of the maximum value. =0.01 ...
For the target distribution, we employ a well-tempered distribution with a bias factor of 10.
# Target distribution td: TD_WELLTEMPERED BIASFACTORcompulsory keyword The bias factor used for the well-tempered distribution. =10
Then we define the VES bias potential using the VES_LINEAR_EXPANSION action
ves: VES_LINEAR_EXPANSION ... ARGthe input for this action is the scalar output from one or more other actions. =dist BASIS_FUNCTIONScompulsory keyword the label of the one dimensional basis functions that should be used. =bf1 TEMPthe system temperature - this is needed if the MD code does not pass the temperature to PLUMED. =300.0 GRID_BINSthe number of bins used for the grid. =300 OPTIMIZATION_THRESHOLDThreshold value to turn off optimization of localized basis functions. =0.000001 TARGET_DISTRIBUTIONthe label of the target distribution to be used. =td ...
Finally, we need to define the optimization algorithm. The standard is the averaged stochastic gradient descent (OPT_AVERAGED_SGD). We need to define two parameters: the stride and the step size. The stride (given by the keyword STRIDE) is the number of MD steps in which samples are collected to calculate the gradient and hessian of \( \Omega [V] \). The step size (given by the keyword STEPSIZE) is the step by which the coefficients are evolved at every optimization steps, given by \(\mu\) in the equation above. It has become traditional to choose a stride of around 500-2000 MD steps. It must be noted that we are not looking for an accurate estimation of the gradient, since for this we would need to sample all the CV space. The step size in the optimization has a strong connection with the height of typical barriers in the system. The larger the barriers, the larger the step size needed such that the bias can grow fast enough to overcome them. For this example we have chosen a stride of 500 steps (i.e., 1 ps) and a step size of 5.0 kJ/mol. We also need to choose how often we update the target distribution (given by the keyword TARGETDIST_STRIDE) and we do this every 100 bias potential updates (i.e., every 100 ps in the current case).
opt: OPT_AVERAGED_SGD ... BIAScompulsory keyword the label of the VES bias to be optimized =ves STRIDEcompulsory keyword the frequency of updating the coefficients given in the number of MD steps. =500 STEPSIZEcompulsory keyword the step size used for the optimization =5.0 FES_OUTPUThow often the FES(s) should be written out to file. =100 BIAS_OUTPUThow often the bias(es) should be written out to file. =500 COEFFS_OUTPUTcompulsory keyword ( default=100 ) how often the coefficients should be written to file. =10 TARGETDIST_STRIDEstride for updating a target distribution that is iteratively updated during the optimization. =100 TARGETDIST_OUTPUThow often the dynamic target distribution(s) should be written out to file. =500 ...
The other parameters are related to the outputting frequency of various output files.
Finally, we need to define the PRINT action to output all the variables
PRINT ARGthe input for this action is the scalar output from one or more other actions. =dist,coord,ves.*,uwall.* FILEthe name of the file on which to output these quantities =colvar.data STRIDEcompulsory keyword ( default=1 ) the frequency with which the quantities of interest should be output =125
The full PLUMED input file can be found in the Exercise-1 folder. There you also find all the relevant GROMACS input file. First, you need to run the generate-tpr-file.sh script that generates the GROMACS TPR file using the parameters defined in MD-NPT.mdp and the initial geometry defined using the StartingGeometry variable. You can then run the simulation using the run.sh script
./generate-tpr-file.sh ./run.sh
The run might take around 15-20 minutes to run using two MPI processes. You can adjust the number of MIP processes used for the simulation using the NumProcs variable in the run.sh script.
At the end of simulation, you will get several files:
To assess the simulation and its convergence, you should first look at the time evolution of the biased CV and check that it is diffusive in CV space. Second, you should look at how the free energy surfaces behave as a function of time by looking at the fes.<bias-name>.iter-<iteration-number>.data files at different number of iterations (the minimum of the FES is always align to zero to facilitate comparison). To do this, you need to use your favorite way to plot datafiles (e.g., Matplotlib or Gnuplot).
You can also visualize the trajectory by opening it with VMD by using the command
vmd NaCl_StartingStructure-1.gro NaCl_VES_NPT-300K.pbc-whole.xtc
The NaCl_VES_NPT-300K.pbc-whole.xtc is the trajectory file with the periodic boundary conditions made whole. This is done with the run.sh script.
The coeffs.data file includes the values of coefficients \(\boldsymbol\alpha \) and \(\bar{\boldsymbol\alpha} \) at different iterations. To extract the time evolution of a given coefficient, you can use the ExtractCoeff.sh script, for example to extract coefficient 3:
./ExtractCoeff.sh 3 > coeff.3.data
This will create a file with the first column the iteration number, the second column the averaged coefficient \(\bar{\boldsymbol\alpha} \), and the third column the instantaneous coefficients \(\boldsymbol\alpha \). You should create files for different coefficient and visualize both the second and third column to understand how the coefficients converge.
Apart from estimating the FES directly from the bias potential, you can also estimate the FES through reweighting by histogramming where each configuration is weighted by the bias acting on it, \( e^{\beta V(\mathbf{s})} \). The VES bias acting at each time step is given by the ves.bias variable in the colvar.dat file.
When doing performing reweighting, it is better to ignore the initial part of the simulation where the bias potential is changing more rapidly. You can use the trim-colvar-file.py python script in the Exercise-2 folder to do this
./trim-colvar-file.py --colvar-file ../Exercise-1/colvar.data --output-file colvar_reweight.data --time-min 400
where here we ignore the first 400 ps of the colvar.data file from the Exercise-1 and create a new file called colvar_reweight.data.
We can then perform the reweighting for the distance using the following PLUMED input (plumed_reweight.dat in the Exercise-2 folder)
dist: READ FILEcompulsory keyword the name of the file from which to read these quantities =colvar_reweight.data IGNORE_TIME( default=off ) ignore the time in the colvar file. VALUEScompulsory keyword the values to read from the file =dist coord: READ FILEcompulsory keyword the name of the file from which to read these quantities =colvar_reweight.data IGNORE_TIME( default=off ) ignore the time in the colvar file. VALUEScompulsory keyword the values to read from the file =coord ves: READ FILEcompulsory keyword the name of the file from which to read these quantities =colvar_reweight.data IGNORE_TIME( default=off ) ignore the time in the colvar file. VALUEScompulsory keyword the values to read from the file =ves.bias weights: REWEIGHT_BIAS TEMPthe system temperature. =300 ARGcompulsory keyword ( default=*.bias ) the biases that must be taken into account when reweighting =ves.bias hg_dist: HISTOGRAM ... ARGthe input for this action is the scalar output from one or more other actions. =dist GRID_MINcompulsory keyword the lower bounds for the grid =0.2 GRID_MAXcompulsory keyword the upper bounds for the grid =0.7 GRID_BINthe number of bins for the grid =60 KERNELcompulsory keyword ( default=gaussian ) the kernel function you are using. =DISCRETE LOGWEIGHTSlist of actions that calculates log weights that should be used to weight configurations when calculating averages =weights ... fes_dist: CONVERT_TO_FES GRIDcompulsory keyword the action that creates the input grid you would like to use =hg_dist TEMPthe temperature at which you are operating =300 MINTOZERO( default=off ) set the minimum in the free energy to be equal to zero DUMPGRID GRIDcompulsory keyword the action that creates the grid you would like to output =fes_dist FILEcompulsory keyword ( default=density ) the file on which to write the grid. =fes-reweight.dist.data FMTthe format that should be used to output real numbers =%24.16e
You can run this input by using the PLUMED driver
plumed driver --plumed plumed_reweight.dat --noatoms
You should compare the resulting FES (the fes-reweight.dist.data file) to the results obtained directly from the bias potential in Exercise 1.
We can also obtained the FES for CVs that are not biased in the VES simulation. For example, we can obtain the two-dimensional FES for the distance and the solvation CV given by the coordination number CV. For this, you will need to add the following to the plumed_reweight.dat file and repeat the PLUEMD driver run
hg_dist_coord: HISTOGRAM ... ARGthe input for this action is the scalar output from one or more other actions. =dist,coord GRID_MINcompulsory keyword the lower bounds for the grid =0.2,2.5 GRID_MAXcompulsory keyword the upper bounds for the grid =0.7,7.5 GRID_BINthe number of bins for the grid =200,200 BANDWIDTHcompulsory keyword the bandwidths for kernel density estimation =0.004,0.04 LOGWEIGHTSlist of actions that calculates log weights that should be used to weight configurations when calculating averages =weights ... fes_dist_coord: CONVERT_TO_FES GRIDcompulsory keyword the action that creates the input grid you would like to use =hg_dist_coord TEMPthe temperature at which you are operating =300 MINTOZERO( default=off ) set the minimum in the free energy to be equal to zero DUMPGRID GRIDcompulsory keyword the action that creates the grid you would like to output =fes_dist_coord FILEcompulsory keyword ( default=density ) the file on which to write the grid. =fes-reweight.dist-coord.data FMTthe format that should be used to output real numbers =%24.16e
Note that here we use kernel density estimation with Gaussian kernels to obtain smoother results.
You can also try to obtain the one-dimensional FES for the solvation CV by adjusting the input above.
This will generate a two-dimensional FES that you can visualize.
To check the results, it is a good practice to run another independent simulation using different initial conditions. You can achieve this here by changing the initial geometry in the generate-tpr-file.sh script
StartingGeometry=NaCl_StartingStructure-2.gro
and regenerating the GROMACS tpr file. Do this and rerun the simulation, check the convergence, and perform reweighting as before. Make sure that you do this in a new clean folder that is separate from the run in Exercise 1.
We will now bias also the solvation CV. To achieve this, we need first to setup a separate basis set for the solvation CV, where again we use the symlets but with a different range from 2.0 to 8.0
# Basisset for coordination number bf2: BF_WAVELETS ... TYPEcompulsory keyword Specify the wavelet type. =SYMLETS ORDERcompulsory keyword The order of the basis function expansion. =10 NUM_BFThe number of basis functions that should be used. =22 MINIMUMcompulsory keyword The minimum of the interval on which the basis functions are defined. =2.5 MAXIMUMcompulsory keyword The maximum of the interval on which the basis functions are defined. =7.5 TAILS_THRESHOLDThe threshold for cutting off tail wavelets as a fraction of the maximum value. =0.01 ...
We also need to change the relevant keywords in the VES_LINEAR_EXPANSION action, namely the ARG, BASIS_FUNCTIONS, and GRID_BINS keywords
ves: VES_LINEAR_EXPANSION ... ARGthe input for this action is the scalar output from one or more other actions. =dist,coord BASIS_FUNCTIONScompulsory keyword the label of the one dimensional basis functions that should be used. =bf1,bf2 TEMPthe system temperature - this is needed if the MD code does not pass the temperature to PLUMED. =300.0 GRID_BINSthe number of bins used for the grid. =300,300 OPTIMIZATION_THRESHOLDThreshold value to turn off optimization of localized basis functions. =0.000001 TARGET_DISTRIBUTIONthe label of the target distribution to be used. =td PROJ_ARG1arguments for doing projections of the FES or the target distribution.. =dist PROJ_ARG2arguments for doing projections of the FES or the target distribution.. =coord ...
Additionally, we have turned on the calculation of the one-dimensional projection of the FES on the two CVs (we also need to set the keyword FES_PROJ_OUTPUT=100 in OPT_AVERAGED_SGD). This is useful to assess the convergence as this is easier in one-dimension. We can also compare the projection to the results from Exercise 1. These changes should be sufficient to do the simulations using two CVs. You can see full input file in the Exercise-4 folder.
Once you have performed this simulation, you should also try to reweight from this simulations. Furthermore, if you have time, you should also try to perform another independent simulation.
The following three exercises are optional, but they will show you how different parameters effect the results. You should base these exercises on the files from the Exercise-1 folder and make the necessary changes. Make sure that you run these simulations in separate folders and start from clean files from the Exercise-1 folder.
The main parameter in the optimization algorithm is the step size and it is not always easy to choose this parameter. Luckily, the algorithm is quite robust and will work for different step sizes.
Run different simulations using step sizes 0.5 and 50.0 and try to rationalize the behavior. Normally, when the step size is too large, the system gets stuck in CV space and coefficients oscillate wildly. When the step size is too small, the algorithm runs out of "steam" too fast and the simulation converges slowly. These two extreme cases should be avoided.
Perform a simulation using an uniform target distribution and see how this changes the results.
In this case, you need to change the target distribution to
# Target distribution td: TD_UNIFORM
and remove the TARGETDIST_STRIDE and TARGETDIST_OUTPUT keywords from the OPT_AVERAGED_SGD action.
Perform a simulation using Legendre polynomials (BF_LEGENDRE) basis functions instead of the wavelets and see how this will affect the result. As the Legendre polynomials are delocalized basis functions, this should lead to more fluctuations in the bias potential as has been observed in the paper introducing the wavelets.
In this case, you need to change the basis set action in the PLUMED input to
# Basisset bf1: BF_LEGENDRE ... ORDERcompulsory keyword The order of the basis function expansion. =20 MINIMUMcompulsory keyword The minimum of the interval on which the basis functions are defined. =0.2 MAXIMUMcompulsory keyword The maximum of the interval on which the basis functions are defined. =0.7 ...
This should cover the basics of running VES simulations in PLUMED, but the following comments might be of interest to some.
VES simulations can be easily restarted. The code will automatically output all the file needed at the end of the simulation. To restart, you just need to reset the simulation with your MD code in the traditional way and add a RESET keyword to the top of the PLUMED input (for some codes like GROMACS, a restart is automatically detected by PLUMED and thus this keyword is not needed).
VES simulations supports the usage of multiple walkers where different copies of the system share the same bias potential (i.e. coefficients) and cooperatively sample the averages needed for the gradient and Hessian. This can significantly help with convergence in difficult cases. It is of course best to start the different copies from different positions in CV space. To activate this option you just need to add the MULTIPLE_WALKERS keyword to the OPT_AVERAGED_SGD action. Note that this is only supported if the MD code support running multiple replicas connected via MPI (e.g., GROMACS or LAMMPS).