Data Loading

class PDBData(filename=None, topology=None, fix_terminal=False, atoms=None, standardise=True)[source]

Create object enabling the manipulation of multi-PDB files into a dataset suitable for training.

Parameters:
  • filename (None | str | list[str]) – if not None, import_pdb is called on each filename provided.

  • topology (None | str) – if not None, import_pdb is called with the topology file.

  • fix_terminal (bool) – if True, calls fix_terminal after import, and before atomselect

  • atoms (list[str]) – if not None, calls atomselect

  • standardise (bool) – if True, standardise the dataset by removing the mean and dividing by the standard deviation.

analysis_bundle() dict[source]

Convenience accessor used by analysis utilities.

atomselect(atoms: str | list[str])[source]

Select atoms used for training.

Parameters:

atoms (str | list[str]) – if a str, it is interpreted using the MDAnalysis atom selection syntax (https://userguide.mdanalysis.org/1.1.1/selections.html); otherwise a list of atom names such as ["CA", …, "O"]

fix_terminal()[source]

If the terminal oxygens are named OT1 and OT2, rename the OT1 N-terminal oxygen to O. Without this renaming, no oxygen will be selected for that residue during an atomselect using atoms = ['CA', 'C', 'N', 'O', 'CB'], and no template will be found for the terminal residue in openmm_loss.
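The effect of the renaming on a name-based selection can be illustrated with a minimal sketch in plain Python. The helpers `rename_terminal_oxygen` and `select_by_name` are hypothetical names introduced for illustration only; they are not part of molearn and do not show how fix_terminal is implemented internally.

```python
def rename_terminal_oxygen(atom_names):
    """Return a copy of atom_names with 'OT1' renamed to 'O'.

    The renaming is applied only when both 'OT1' and 'OT2' are present.
    """
    if "OT1" in atom_names and "OT2" in atom_names:
        return ["O" if name == "OT1" else name for name in atom_names]
    return list(atom_names)


def select_by_name(atom_names, wanted):
    """Keep only the atom names listed in wanted, preserving their order."""
    return [name for name in atom_names if name in wanted]


backbone = ["CA", "C", "N", "O", "CB"]
terminal_residue = ["N", "CA", "CB", "C", "OT1", "OT2"]

# Without renaming, the terminal residue contributes no oxygen.
before = select_by_name(terminal_residue, backbone)
# After renaming, the oxygen is picked up like in any other residue.
after = select_by_name(rename_terminal_oxygen(terminal_residue), backbone)
```

In this toy residue the selection gains an "O" entry only after the renaming, which is why fix_terminal has to run before atomselect.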

frame()[source]

Return biobox.Molecule object with loaded data

get_atominfo()[source]

Generate list of all atoms in dataset, where every line contains [atom name, residue name, resid]

get_dataloader(batch_size, validation_split=0.1, pin_memory=True, manual_seed=None, save_indices=False, indices_dir='.') tuple[DataLoader, DataLoader][source]
Parameters:
  • batch_size (int) – size of the training batches

  • validation_split (float) – fraction of the data randomly assigned to the validation set

  • pin_memory (bool) – if True, pin memory for the dataloader

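  • manual_seed (int | None) – seed to initialise the random number generator used for splitting the dataset. Useful to replicate a specific split.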

  • save_indices (bool) – if True, save train and valid indices to “train_indices.txt” and “valid_indices.txt”

Returns:

two torch.utils.data.DataLoader, one for the training set and one for the validation set
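The interplay of validation_split and manual_seed can be sketched with the standard library alone. This is only an illustration of a seeded random split under the assumption that the validation set size is the rounded-down fraction of the frames; `split_indices` is a hypothetical helper, and molearn itself may split the data differently (for example with PyTorch utilities).

```python
import random


def split_indices(n_frames, validation_split=0.1, manual_seed=None):
    """Shuffle frame indices and split them into train and validation lists."""
    rng = random.Random(manual_seed)  # a fixed seed reproduces the same split
    indices = list(range(n_frames))
    rng.shuffle(indices)
    n_valid = int(n_frames * validation_split)
    # The first n_valid shuffled indices form the validation set.
    return indices[n_valid:], indices[:n_valid]


train_idx, valid_idx = split_indices(100, validation_split=0.1, manual_seed=42)
train_again, valid_again = split_indices(100, validation_split=0.1, manual_seed=42)
```

Calling the helper twice with the same seed returns identical index lists, which is the property that makes a saved or seeded split reproducible.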

get_datasets(validation_split=0.1, valid_size=None, train_size=None, manual_seed=None, save_indices=False)[source]

Create a training and validation set from the imported data. This is deprecated. Use get_dataloader instead.

Parameters:
  • validation_split – fraction of the data randomly assigned to the validation set

  • valid_size – if not None, specify the number of validation structures to be returned

  • train_size – if not None, specify the number of training structures to be returned

  • manual_seed – seed to initialise the random number generator used for splitting the dataset. Useful to replicate a specific split.

Returns:

two torch.Tensor, for training and validation structures.

import_pdb(filename: str | list[str], topology: str | None = None) None[source]

Load one or multiple trajectory files as MDAnalysis Universe.

Parameters:
  • filename (str | list[str]) – the path to the trajectory as a str, or a list of file paths to multiple trajectories

  • topology (str | None) – the path to the topology file for the trajectory or trajectories

metadata() dict[source]

Return a metadata dictionary for trainers and analysis tools.

prepare_dataset(std=None, mean=None) Tensor[source]

Prepare dataset from the loaded trajectory data to create a standardised/unstandardised tensor.
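What standardising and unstandardising mean numerically can be shown with a small sketch. It assumes a single scalar mean and standard deviation computed over all values, and uses the population standard deviation; both are assumptions made for illustration, and the helper names `standardise` and `unstandardise` are hypothetical rather than molearn functions.

```python
import statistics


def standardise(values, mean=None, std=None):
    """Return (scaled, mean, std) where scaled = (x - mean) / std.

    If mean or std is not supplied, it is computed from the values.
    """
    if mean is None:
        mean = statistics.fmean(values)
    if std is None:
        std = statistics.pstdev(values, mu=mean)
    scaled = [(x - mean) / std for x in values]
    return scaled, mean, std


def unstandardise(scaled, mean, std):
    """Invert standardise to recover the original values."""
    return [x * std + mean for x in scaled]


coords = [1.0, 2.0, 3.0, 4.0, 5.0]
scaled, mean, std = standardise(coords)
restored = unstandardise(scaled, mean, std)
```

Passing a previously computed mean and std, as the std and mean arguments allow, lets a second dataset be scaled consistently with the first.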

write_statistics(filename: str)[source]

Write mean and standard deviation to a JSON file.

Parameters:

filename (str) – path to the output JSON file
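A minimal sketch of storing and reloading the two statistics as JSON follows. The key names "mean" and "std" and the helpers `save_statistics` and `load_statistics` are assumptions for illustration; the actual layout of the file written by write_statistics is not specified here.

```python
import json
import os
import tempfile


def save_statistics(filename, mean, std):
    """Write mean and standard deviation to a JSON file (assumed key names)."""
    with open(filename, "w") as handle:
        json.dump({"mean": mean, "std": std}, handle)


def load_statistics(filename):
    """Read the two values back as a (mean, std) tuple."""
    with open(filename) as handle:
        stats = json.load(handle)
    return stats["mean"], stats["std"]


# Round trip through a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "statistics.json")
    save_statistics(path, mean=1.5, std=0.25)
    loaded = load_statistics(path)
```

Keeping these values next to a trained model makes it possible to unstandardise generated structures later without reloading the training data.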

class DataAssembler(traj_path: str, topo_path: str | None = None, test_size: float = 0.15, n_cluster: int = 1500, image_mol: bool = False, outpath: str = '', verbose: bool = False, dist_mat: bool = True)[source]

Create clustered trajectories, strided trajectories and randomly sampled test frames, and either change the respective trajectory frame indices or create a new trajectory. Also concatenates multiple trajectories into one and centers the protein in the water box. The topology of the newly created trajectory is saved in self.outpath/trajectory_name_NEW_TOPO.pdb

Parameters:
  • traj_path (str) – path to the trajectory

  • topo_path (str | None) – path to the topology

  • test_size (float) – size of the test dataset (0.0 if no test set should be created)

  • n_cluster (int) – number of clusters to be created (representative frames)

  • image_mol (bool) – True to image the molecule (center it in the box)

  • outpath (str) – directory path where the new trajectory or trajectories should be stored

  • verbose (bool) – True to print information about which steps are currently being performed

  • dist_mat (bool) – True to calculate the n_frames x n_frames distance matrix

calc_rmsd_f() tuple[ndarray[tuple[int], dtype[int64]], ndarray[tuple[int], dtype[int64]], ndarray[tuple[int], dtype[int64]]][source]

Calculate the RMSD and RMSF over the course of the trajectory.

Returns:

tuple[np.ndarray[tuple[int], np.dtype[np.int_]], np.ndarray[tuple[int], np.dtype[np.int_]], np.ndarray[tuple[int], np.dtype[np.int_]]]
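For readers unfamiliar with the two quantities, the standard definitions can be sketched in plain Python. The sketch assumes the frames are already superposed and weights all atoms equally; `rmsd` and `rmsf` are hypothetical helpers and say nothing about how calc_rmsd_f computes or orders its three return values.

```python
import math


def rmsd(frame, reference):
    """RMSD between two frames given as lists of (x, y, z) tuples."""
    squared = [
        sum((a - b) ** 2 for a, b in zip(atom, ref_atom))
        for atom, ref_atom in zip(frame, reference)
    ]
    return math.sqrt(sum(squared) / len(squared))


def rmsf(frames):
    """Per-atom RMSF: fluctuation of each atom around its mean position."""
    n_frames = len(frames)
    n_atoms = len(frames[0])
    result = []
    for i in range(n_atoms):
        # Mean position of atom i over all frames.
        mean_pos = [sum(f[i][k] for f in frames) / n_frames for k in range(3)]
        # Mean squared displacement of atom i from that mean position.
        fluct = sum(
            sum((f[i][k] - mean_pos[k]) ** 2 for k in range(3)) for f in frames
        ) / n_frames
        result.append(math.sqrt(fluct))
    return result


# Two frames of a two-atom toy system; only the second atom moves.
trajectory = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
rmsd_to_first = [rmsd(frame, trajectory[0]) for frame in trajectory]
fluctuations = rmsf(trajectory)
```

RMSD yields one value per frame relative to a reference, whereas RMSF yields one value per atom across all frames.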

create_dendrogram(distance_threshold=50, output_path='dendrogram.png') None[source]

Cluster the trajectory with hierarchical clustering (linkage) based on the RMSD between the frames and plot a dendrogram. Group frames that have pairwise distances less than “distance_threshold” in one cluster (default is 50).

create_trajectories() None[source]

Save the clustered or strided indices for the present training trajectory and optionally save the corresponding frames as a new trajectory.

create_trajectories_by_dendrogram(test_cluster: int) None[source]

Create test trajectories based on a specific cluster and create the train trajectories from all the frames excluding the specific cluster.

Parameters:

test_cluster (int) – Cluster to use as the test set.
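The split described above amounts to partitioning frame indices by cluster label. The sketch below assumes one integer label per frame; `split_by_cluster` is a hypothetical helper, not the molearn implementation.

```python
def split_by_cluster(cluster_labels, test_cluster):
    """Return (train_indices, test_indices) from per-frame cluster labels."""
    test = [i for i, label in enumerate(cluster_labels) if label == test_cluster]
    train = [i for i, label in enumerate(cluster_labels) if label != test_cluster]
    return train, test


# One cluster label per frame of a six-frame toy trajectory.
labels = [1, 1, 2, 3, 2, 1]
train_frames, test_frames = split_by_cluster(labels, test_cluster=2)
```

Holding out a whole cluster, rather than random frames, tests a model on conformations that are structurally distinct from the training data.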

distance_cluster() None[source]

Cluster the trajectory with AgglomerativeClustering based on the RMSD between the frames.

own_idx(file_path: str | ndarray[tuple[int], dtype[int64]])[source]

Provide indices for frames to create a new trajectory. Useful if trajectory should be sub sampled by some external metric.

Parameters:

file_path (str | np.ndarray[tuple[int], np.dtype[np.int64]]) – path to the file storing the indices, with each index on a separate line. Alternatively, a numpy array of indices.
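The expected file layout, one frame index per line, can be illustrated with a short sketch. `load_indices` is a hypothetical helper that accepts either a path or an already loaded sequence of indices; it only mirrors the two input forms described above and is not molearn code.

```python
import os
import tempfile


def load_indices(source):
    """Return frame indices from a text file (one per line) or a sequence."""
    if isinstance(source, str):
        with open(source) as handle:
            # Skip blank lines; int() tolerates the trailing newline.
            return [int(line) for line in handle if line.strip()]
    return [int(i) for i in source]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "indices.txt")
    with open(path, "w") as handle:
        handle.write("0\n5\n10\n")
    from_file = load_indices(path)

from_sequence = load_indices([3, 4])
```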

pca_cluster(n: int = 3) None[source]

Cluster the trajectory with KMeans based on the first n principal components of a PCA of the trajectory.

Parameters:

n (int) – number of principal components (per frame) to use for the clustering

read_traj(atom_indices=None, ref_atom_indices=None) None[source]

Read in one or multiple trajectories, remove everything but protein atoms, image the molecule to center it in the water box, and create a training/validation and test split.

Parameters:
  • atom_indices (array_like | None) – the indices of the atoms to superpose. If not supplied, all atoms will be used.

  • ref_atom_indices (array_like | None) – use these atoms on the reference structure. If not supplied, the same atom indices will be used for this trajectory and the reference one.

stride() None[source]

Reduce the training dataset size to n samples using stride as the sampling method.
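Strided sampling can be sketched as picking every k-th frame so that roughly n frames remain. The way the step is derived below (integer division of the frame count by the sample count) is an assumption for illustration, and `strided_indices` is a hypothetical helper rather than the method's actual implementation.

```python
def strided_indices(n_frames, n_samples):
    """Pick up to n_samples evenly spaced frame indices out of n_frames."""
    step = max(1, n_frames // n_samples)
    # Truncate in case the stride yields slightly more than n_samples frames.
    return list(range(0, n_frames, step))[:n_samples]


picked = strided_indices(n_frames=1000, n_samples=100)
small = strided_indices(n_frames=10, n_samples=3)
```

Unlike clustering, striding needs no distance matrix, so it stays cheap for long trajectories, at the cost of ignoring structural similarity between frames.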