Protein-ligand-water systems with Interchange

In this example, we’ll take a docked protein-ligand system from an OpenFF benchmark data set, parameterize and solvate it, and export the parameterized system to a variety of simulation engines.

import urllib

import nglview
import numpy as np
from openff.toolkit import ForceField, Molecule, Topology
from openff.units import unit

from openff.interchange import Interchange
from openff.interchange.components._packmol import UNIT_CUBE, pack_box
from openff.interchange.drivers import get_openmm_energies
from openff.interchange.drivers.all import get_summary_data

Collecting structures

In this example we’ll use starting coordinates data from MCL1, which is part of the Protein Ligand Benchmark data set curated by the Open Force Field Initiative. Conveniently for the purposes of this example, the ligand is already docked and the protein is relatively small (~2000 atoms), so we can focus on using Interchange without too much prep.

Start by downloading the protein and ligand structure files from GitHub:

url = (
    "https://raw.githubusercontent.com/openforcefield/protein-ligand-benchmark/"
    "8c94c0dcc892dfd77992567294b1ff31c62e8695/plbenchmark/sample_data/2020-08-26_mcl1_sample/"
)

urllib.request.urlretrieve(url + "01_protein/crd/protein.pdb", "protein.pdb")
urllib.request.urlretrieve(url + "02_ligands/lig_23/crd/lig_23.sdf", "lig_23.sdf")

# `protein.pdb` and `lig_23.sdf` should be in the local path now
!ls -lhrt
total 232K
-rw-r--r-- 1 runner runner  23K Feb  5 00:37 protein_ligand.ipynb
-rw-r--r-- 1 runner runner  211 Feb  5 00:37 README.md
-rw-r--r-- 1 runner runner 194K Feb  5 00:45 protein.pdb
-rw-r--r-- 1 runner runner 4.1K Feb  5 00:45 lig_23.sdf

The OpenFF Toolkit recently (version 0.13) added support for loading multi-component PDBs. There are some limitations but for our system - a well-structured PDB file including a protein and some crystal waters - it should work fine.

protein_with_crystal_water = Topology.from_pdb("protein.pdb")
protein_with_crystal_water.n_molecules
29

protein_with_crystal_water is a Topology, not a Molecule, containing the protein and a few crystal waters. We can splice out the protein as a Molecule object and visualize it:

protein = protein_with_crystal_water.molecule(0)
protein

Preparing components

This system has three components: Protein, ligand, and solvent (water). For each component, we need positions and parameters. Our protein and ligand positions come from PDBs, and we’ll generate solvent coordinates ourselves. For parameters, the Sage force field will be perfect for the ligand and water, but doesn’t support proteins - they’re coming in Rosemary. In the meantime, we’ll use a SMIRNOFF port of ff14SB, a popular force field in the Amber force field family which has a compatible functionaln form.

Unfortunately, this means we need to treat each component seperately. Interchange provides an means for combining these systems, which we’ll see in a bit.

Protein component

Let’s start with the protein component. We The Molecule.from_polymer_pdb() method constructs a Molecule from a PDB file encoding a protein. A Molecule object stores a molecule’s chemical identity, bond graph, and co-ordinates. The OpenFF Toolkit doesn’t accept PDB files for small molecules because they don’t have enough chemical information, but it makes an exception for biopolymers like proteins via a chemical substructure dictionary containing information about canonical amino aicds. This saves us from needing to do things like write up a SMILES string for an entire protein.

Our PDB file doesn’t only contain one molecule, though, it contains a protein and crystal waters. The OpenFF Toolkit recently (version 0.13) added support for loading multi-component PDBs. There are some limitations but for our system it should work fine.

protein_with_crystal_water = Topology.from_pdb("protein.pdb")
protein_with_crystal_water.n_molecules
29

protein_with_crystal_water is a Topology, not a Molecule, containing the protein and a few crystal waters. We can splice out the protein as a Molecule object and visualize it to make sure it looks reasonable:

protein = protein_with_crystal_water.molecule(0)
protein.visualize(backend="nglview")

OpenFF maintains a port of Amber ff14sb, which we’ll use for the protein parameters. We’re using the impropers variant because Interchange doesn’t support Amber’s improper torsion function.

ff14sb = ForceField("ff14sb_off_impropers_0.0.3.offxml")

We can use the Interchange.from_smirnoff constructor method to combine the protein molecule (with coordinates) and the ff14sb force field into an Interchange.

protein_intrcg = Interchange.from_smirnoff(
    force_field=ff14sb,
    topology=protein.to_topology(),
)

Ligand component

SDF files encode all the chemical information the Toolkit needs to construct a Molecule, so we can use the general-purpose small molecule from_file method for the ligand:

ligand = Molecule.from_file("lig_23.sdf")
ligand.name = "LIG"
ligand.visualize(backend="nglview")

We’ll use the OpenFF 2.0.0 “Sage” force field for the ligand, which is a production-ready small molecule SMIRNOFF force field. Its coordinates are taken from the lig_23.sdf file we downloaded earlier. We just want to do some point energy calculations as a proof of concept, so we’ll use the unconstrained variant of Sage (see the OpenFF Toolkit FAQ for details).

ligand_intrcg = Interchange.from_smirnoff(
    force_field=ForceField("openff_unconstrained-2.0.0.offxml"),
    topology=[ligand],
)

Now that we have two interchanges, we can combine them with the Interchange.combine method! We’ll need a combined system to solvate too, so this’ll be useful in a second.

docked_intrcg = protein_intrcg.combine(ligand_intrcg)
/home/runner/micromamba/envs/openff-docs-examples/lib/python3.11/site-packages/openff/interchange/operations/_combine.py:104: InterchangeCombinationWarning: Interchange object combination is complex and may produce strange results outside of use cases it has been tested in. Use with caution and thoroughly validate results!
  warnings.warn(
/home/runner/micromamba/envs/openff-docs-examples/lib/python3.11/site-packages/openff/interchange/operations/_combine.py:84: InterchangeCombinationWarning: Found electrostatics 1-4 scaling factors of 5/6 with slightly different rounding (0.833333 and 0.8333333333). This likely stems from OpenFF using more digits in rounding 1/1.2. The value of 0.8333333333 will be used, which may or may not introduce small errors. 
  warnings.warn(

In addition to making it easy to parameterize systems for all sorts of engines, Interchange makes it easy to visualize systems. We can use the visualize() method to view our docked system in NGLView and make sure the co-ordinates make sense:

w = docked_intrcg.visualize()
w.clear_representations()
w.add_representation(
    "licorice",
    radius=0.1,
    selection=[*range(protein_intrcg.topology.n_atoms)],
)
w.add_representation(
    "spacefill",
    selection=[*range(protein_intrcg.topology.n_atoms, docked_intrcg.topology.n_atoms)],
)
w

Before we move on to the solvent, we should check the partial charge of our protein-ligand complex. Since we want to run simulations with net neutral charge, we’ll add counter-ions to balance the total charge of the protein-ligand complex. To be safe, we can also double-check that the total charge in the docked_intrcg object matches the sum of the formal charges of the Molecule representations which more closely describe our intended chemistry.

total_charge = round(sum(docked_intrcg["Electrostatics"].charges.values()), 3)

assert total_charge == protein.total_charge + ligand.total_charge, (
    f"Total charge of the system is {total_charge}, not {protein.total_charge + ligand.total_charge}"
)
total_charge
3.0 elementary_charge

Solvent component

We’ll reuse the Sage force field from earlier here, as it includes parameters for TIP3P water, but first we need coordinates for our solvated system. This is a portion of the OpenFF ecosystem that will be streamlined in the future, but we can use a PACKMOL wrapper to get the job done. We’re adding a fixed amount of water for this quick example so the density will be wrong, but imagine it’s right.

We’ll also add three chloride ions to balance the net charge of the protein-ligand complex. For a production run, you’ll probably want a more realistic salt concentration - the goal here is just to get something running with a net neutral charge.

water = Molecule.from_smiles("O")
water.generate_conformers(n_conformers=1)

ion = Molecule.from_smiles("[Cl-]")
ion.generate_conformers(n_conformers=1)

# Come up with a box size based on the size of the protein plus a 2 nm buffer
xyz = protein.conformers[0]
centroid = xyz.sum(axis=0) / xyz.shape[0]
protein_radius = np.sqrt(((xyz - centroid) ** 2).sum(axis=-1).max())
box_vectors = UNIT_CUBE * (protein_radius * 2 + 2 * unit.nanometer)

# Pack the box with an arbitrary number of water
n_water = 1000

packed_topology = pack_box(
    molecules=[water, ion],
    number_of_copies=[n_water, int(total_charge.m)],
    solute=docked_intrcg.topology,
    box_vectors=box_vectors,
)
# Visualize
packed_topology.to_file("packed.pdb")
nglview.show_structure_file("packed.pdb")

And now we can create the interchange! The Topology we got from pack_box includes the positions we’ll later apply to the solvated complex. For now, we need an Interchange that represents the water component. We can pass it Sage, wihch contains TIP3P parameters, and a topology of n_water water molecules without worrying about the positions, since we’ll just set those later.

water_intrcg = Interchange.from_smirnoff(
    force_field=ForceField("openff_unconstrained-2.0.0.offxml"),
    topology=[water] * n_water + [ion] * int(total_charge.m),
)

Putting the system together

Now that we’ve got all the pieces, we can combine the docked protein-ligand system with the solvent, and add in the positions and box vectors we just worked out

system_intrcg = docked_intrcg.combine(water_intrcg)
system_intrcg.positions = packed_topology.get_positions()
system_intrcg.box = packed_topology.box_vectors
w = system_intrcg.visualize()
w.clear_representations()
# Protein rep
w.add_representation(
    "licorice",
    radius=0.2,
    selection=[*range(protein_intrcg.topology.n_atoms)],
)
# Ligand rep
w.add_representation(
    "spacefill",
    selection=[*range(protein_intrcg.topology.n_atoms, docked_intrcg.topology.n_atoms)],
)
# Water rep
w.add_representation(
    "licorice",
    radius=0.1,
    selection=[*range(docked_intrcg.topology.n_atoms, system_intrcg.topology.n_atoms)],
)
w

Exporting to simulation engines

Finally, we can export the final Interchange object to models understood by various simulation engines. Some of these exports are not yet optimized for large files.

OpenMM

openmm_system = system_intrcg.to_openmm()
openmm_topology = system_intrcg.topology.to_openmm(ensure_unique_atom_names=False)
print(type(openmm_system), type(openmm_topology))
<class 'openmm.openmm.System'> <class 'openmm.app.topology.Topology'>

Amber

system_intrcg.to_inpcrd("out.inpcrd")
system_intrcg.to_prmtop("out.prmtop")

LAMMPS

# The LAMMPS exporter has not yet been optimized for large molecules or systems
if False:
    system_intrcg.to_lammps("out.lmp")

GROMACS

system_intrcg.to_gromacs(prefix="out", monolithic=False)
/home/runner/micromamba/envs/openff-docs-examples/lib/python3.11/site-packages/openff/interchange/components/mdconfig.py:503: UserWarning: Ambiguous failure while processing constraints. Constraining h-bonds as a stopgap.
  warnings.warn(

Energy tests

In order to verify the accuracy of each export, we can use functions in the openff.interchange.drivers module to call out to each engine to evaluate single-point energies. Under the hood, each function uses the export functions just as we did in the above cells. The GROMACS and Amber exports are a little slower than the OpenMM export, so some of these cells might take a minute to execute.

To get a quick look at how a single engine reports energies, use get_openmm_energies (or get_gromacs_energies, get_amber_energies, or get_lammps_energies).

get_openmm_energies(system_intrcg)
EnergyReport(energies={'Bond': <Quantity(11255.7296, 'kilojoule / mole')>, 'Angle': <Quantity(1980.01657, 'kilojoule / mole')>, 'Torsion': <Quantity(7759.41267, 'kilojoule / mole')>, 'Nonbonded': <Quantity(5066671.6, 'kilojoule / mole')>})

For convenience there is a function get_summary_data that runs through all available engines and summarizes the results in a Pandas DataFrame. (This cell might take a minute to execute). We’re setting the argument _engines to a non-defualt value so that the LAMMPS driver is skipped even if it’s available; normally this argument is unnecessary if you don’t have LAMMPS installed on your system.

summary = get_summary_data(system_intrcg, _engines=["OpenMM", "GROMACS", "Amber"])
summary
/home/runner/micromamba/envs/openff-docs-examples/lib/python3.11/site-packages/openff/interchange/components/mdconfig.py:503: UserWarning: Ambiguous failure while processing constraints. Constraining h-bonds as a stopgap.
  warnings.warn(
/home/runner/micromamba/envs/openff-docs-examples/lib/python3.11/site-packages/openff/interchange/components/mdconfig.py:431: SwitchingFunctionNotImplementedWarning: A switching distance 8.0 angstrom was specified by the force field, but Amber does not implement a switching function. Using a hard cut-off instead. Non-bonded interactions will be affected.
  warnings.warn(
Bond Angle Torsion vdW Electrostatics RBTorsion
OpenMM 11255.729633 1980.016575 7759.412675 5.087932e+06 -21260.085305 NaN
Amber 538.964470 2180.267338 7759.412514 5.087879e+06 -21259.672601 NaN
GROMACS 538.962646 1980.016235 7759.417587 5.087855e+06 -21271.959961 0.0

We can see from these large energy differences that something is wrong - this stems from the experimental Interchange combination operation producing incorrect results.

summary.describe()
Bond Angle Torsion vdW Electrostatics RBTorsion
count 3.000000 3.000000 3.000000 3.000000e+00 3.000000 1.0
mean 4111.218917 2046.766716 7759.414259 5.087889e+06 -21263.905956 0.0
std 6187.327778 115.614930 0.002884 3.908917e+01 6.978025 NaN
min 538.962646 1980.016235 7759.412514 5.087855e+06 -21271.959961 0.0
25% 538.963558 1980.016405 7759.412595 5.087867e+06 -21266.022633 0.0
50% 538.964470 1980.016575 7759.412675 5.087879e+06 -21260.085305 0.0
75% 5897.347052 2080.141956 7759.415131 5.087905e+06 -21259.878953 0.0
max 11255.729633 2180.267338 7759.417587 5.087932e+06 -21259.672601 0.0

In the future this should work more smoothly with identical energies reported by each engine. In lieu of that, we can evaluate the energy of each component that we previously added together. This requires setting box vectors for each component and also setting the water positions, which we skipped earlier since we were able to use PACKMOL results directly.

for subset in [ligand_intrcg, protein_intrcg, water_intrcg]:
    subset.box = system_intrcg.box

water_intrcg.positions = system_intrcg.positions[-3003:,]
get_summary_data(ligand_intrcg, _engines=["OpenMM", "GROMACS", "Amber"])
/home/runner/micromamba/envs/openff-docs-examples/lib/python3.11/site-packages/openff/interchange/components/mdconfig.py:431: SwitchingFunctionNotImplementedWarning: A switching distance 8.0 angstrom was specified by the force field, but Amber does not implement a switching function. Using a hard cut-off instead. Non-bonded interactions will be affected.
  warnings.warn(
Bond Angle Torsion vdW Electrostatics RBTorsion
OpenMM 25.837656 356.516457 27.319830 105.023359 -31.915307 NaN
Amber 25.837455 356.516548 27.319846 104.965682 -31.912623 NaN
GROMACS 25.838177 356.516541 27.319841 105.025609 -31.987274 0.0
get_summary_data(protein_intrcg, _engines=["OpenMM", "GROMACS", "Amber"])
/home/runner/micromamba/envs/openff-docs-examples/lib/python3.11/site-packages/openff/interchange/components/mdconfig.py:431: SwitchingFunctionNotImplementedWarning: A switching distance 8.0 angstrom was specified by the force field, but Amber does not implement a switching function. Using a hard cut-off instead. Non-bonded interactions will be affected.
  warnings.warn(
Bond Angle Torsion vdW Electrostatics RBTorsion
OpenMM 11229.891977 1623.500118 7732.092845 5.086854e+06 -20644.012231 NaN
Amber 11229.891982 1623.499947 7732.092668 5.086796e+06 -20643.369819 NaN
GROMACS 11229.884766 1623.498413 7732.097672 5.086778e+06 -20647.463379 0.0
get_summary_data(water_intrcg, _engines=["OpenMM", "GROMACS", "Amber"])
/home/runner/micromamba/envs/openff-docs-examples/lib/python3.11/site-packages/openff/interchange/components/mdconfig.py:431: SwitchingFunctionNotImplementedWarning: A switching distance 8.0 angstrom was specified by the force field, but Amber does not implement a switching function. Using a hard cut-off instead. Non-bonded interactions will be affected.
  warnings.warn(
Bond Angle Torsion vdW Electrostatics RBTorsion
OpenMM 0.0 0.000000 0.0 860.237819 -161.044844 NaN
Amber 0.0 200.250842 0.0 859.633762 -161.283158 NaN
GROMACS 0.0 0.000000 0.0 860.232862 -169.509521 0.0

We can see from these results that each engine reports nearly identical energies for each individual component.

Verifying charge neutrality

Earlier, when adding waters to the system, we added three chloride ions to balance the net +3 charge of the protein-ligand complex. Did we actually end up with a charge-neutral system?

import openmm

nbforce = openmm_system.getForces()[0]
charge = 0 * openmm.unit.elementary_charge
for i in range(nbforce.getNumParticles()):
    charge += nbforce.getParticleParameters(i)[0]
charge
-8.012210317431823e-08 e