Data Loading¶

class PDBData(filename=None, topology=None, fix_terminal=False, atoms=None, standardise=True)[source]¶

Create object enabling the manipulation of multi-PDB files into a dataset suitable for training.

Parameters:

filename (None | str | list[str]) – if not None, import_pdb is called on each filename provided.
topology (None | str) – if not None, import_pdb is called with the topology file.
fix_terminal (bool) – if True, calls fix_terminal after import, and before atomselect
atoms (list[str]) – if not None, calls atomselect
standardise (bool) – if True, standardise the dataset by removing the mean and dividing by the standard deviation.
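
As a rough illustration of what standardise=True implies, the sketch below removes the mean, divides by the standard deviation, and then reverses the transform. It is plain Python rather than PDBData's own code: the helper names are hypothetical, and using a single scalar mean and population standard deviation over all values is an assumption.

```python
import statistics

def standardise(values):
    """Return (scaled values, mean, std). Hypothetical helper, not part of PDBData."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    return [(v - mean) / std for v in values], mean, std

def unstandardise(scaled, mean, std):
    """Invert standardise() so that values are back in their original units."""
    return [s * std + mean for s in scaled]

coords = [1.0, 2.0, 3.0, 4.0, 5.0]
scaled, mean, std = standardise(coords)
restored = unstandardise(scaled, mean, std)
```

Keeping the mean and standard deviation alongside the scaled data is what makes it possible to map generated structures back to real coordinates later.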

analysis_bundle() → dict[source]¶: Convenience accessor used by analysis utilities.

atomselect(atoms: str | list[str])[source]¶

Select atoms used for training.

Parameters:: atoms (str | list[str]) – either a str written in the MDAnalysis atom selection syntax (https://userguide.mdanalysis.org/1.1.1/selections.html) or a list of atom names such as ["CA", …, "O"]

fix_terminal()[source]¶: Rename the C-terminal oxygen OT1 to O if the terminal oxygens are named OT1 and OT2. Without this renaming, no terminal oxygen is selected by an atomselect using atoms = ['CA', 'C', 'N', 'O', 'CB'], and no template is found for the terminal residue in openmm_loss.
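
The effect of the renaming on a name-based selection can be sketched in a few lines. This is only an illustration on plain lists of atom names; rename_terminal_oxygen is a hypothetical helper, not the library's implementation, and it leaves OT2 untouched because the description above only mentions OT1.

```python
def rename_terminal_oxygen(atom_names):
    """Rename OT1 to O, but only when both OT1 and OT2 are present."""
    if "OT1" in atom_names and "OT2" in atom_names:
        return ["O" if name == "OT1" else name for name in atom_names]
    return list(atom_names)

selection = ["CA", "C", "N", "O", "CB"]
terminal_residue = ["N", "CA", "C", "OT1", "OT2", "CB"]

# Without the fix, the selection finds no oxygen in the terminal residue.
before = [a for a in terminal_residue if a in selection]
# After the fix, the renamed oxygen is picked up.
after = [a for a in rename_terminal_oxygen(terminal_residue) if a in selection]
```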

frame()[source]¶: Return biobox.Molecule object with loaded data

get_atominfo()[source]¶: Generate list of all atoms in dataset, where every line contains [atom name, residue name, resid]

get_dataloader(batch_size, validation_split=0.1, pin_memory=True, manual_seed=None, save_indices=False, indices_dir='.') → tuple[DataLoader, DataLoader][source]¶

Parameters:

batch_size (int) – size of the training batches
validation_split (float) – fraction of the data randomly assigned to the validation set
pin_memory (bool) – if True, pin memory for the dataloader
manual_seed (int | None)
save_indices (bool) – if True, save train and valid indices to “train_indices.txt” and “valid_indices.txt”

Returns:

tuple of two torch.utils.data.DataLoader objects: the first for the training set, the second for the validation set
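
The roles of validation_split, manual_seed and batch_size can be illustrated with a small stand-alone sketch. It uses only the standard library and hypothetical helper names; how get_dataloader itself rounds the split or shuffles batches is not stated here, so those details are assumptions.

```python
import random

def split_indices(n_frames, validation_split=0.1, manual_seed=None):
    """Randomly split frame indices into (train, valid) lists."""
    rng = random.Random(manual_seed)  # a fixed seed reproduces the same split
    indices = list(range(n_frames))
    rng.shuffle(indices)
    n_valid = int(n_frames * validation_split)  # rounding down is an assumption
    return indices[n_valid:], indices[:n_valid]

def batches(indices, batch_size):
    """Yield consecutive chunks of at most batch_size indices."""
    for start in range(0, len(indices), batch_size):
        yield indices[start:start + batch_size]

train_idx, valid_idx = split_indices(200, validation_split=0.25, manual_seed=42)
train_batches = list(batches(train_idx, batch_size=32))
```

Passing the same manual_seed again yields the same split, which is the property that makes a saved set of indices reproducible.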

get_datasets(validation_split=0.1, valid_size=None, train_size=None, manual_seed=None, save_indices=False)[source]¶

Create a training and validation set from the imported data. This is deprecated. Use get_dataloader instead.

Parameters:

validation_split – fraction of the data randomly assigned to the validation set
valid_size – if not None, specify the number of validation structures to be returned
train_size – if not None, specify the number of training structures to be returned
manual_seed – seed to initialise the random number generator used for splitting the dataset. Useful to replicate a specific split.

Returns:

two torch.Tensor, for training and validation structures.

import_pdb(filename: str | list[str], topology: str | None = None) → None[source]¶

Load one or multiple trajectory files as an MDAnalysis Universe.

Parameters:

filename (str | list[str]) – the path to the trajectory as a str, or a list of file paths to multiple trajectories
topology (str | None) – the path to the topology file for the trajectory or trajectories

metadata() → dict[source]¶: Return a metadata dictionary for trainers and analysis tools.

prepare_dataset(std=None, mean=None) → Tensor[source]¶: Prepare dataset from the loaded trajectory data to create a standardised/unstandardised tensor.

write_statistics(filename: str)[source]¶

Write mean and standard deviation to a JSON file.

Parameters:: filename (str) – path to the output JSON file
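
A minimal sketch of writing such a statistics file and reading it back is shown below. The function name is a hypothetical stand-in, and the key names "mean" and "std" are assumptions; the layout actually produced by write_statistics may differ.

```python
import json
import os
import tempfile

def write_statistics_sketch(filename, mean, std):
    """Hypothetical stand-in; the key names 'mean' and 'std' are assumptions."""
    with open(filename, "w") as handle:
        json.dump({"mean": mean, "std": std}, handle, indent=2)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "statistics.json")
    write_statistics_sketch(path, mean=12.5, std=3.25)
    # Read the file back to confirm that the values survive the round trip.
    with open(path) as handle:
        loaded = json.load(handle)
```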

class DataAssembler(traj_path: str, topo_path: str | None = None, test_size: float = 0.15, n_cluster: int = 1500, image_mol: bool = False, outpath: str = '', verbose: bool = False, dist_mat: bool = True)[source]¶

Create clustered trajectories, strided trajectories and randomly sampled test frames, and either store the corresponding trajectory frame indices or create a new trajectory. Will also concatenate multiple trajectories into one and center the protein in the water box. The topology of the newly created trajectory will be saved in self.outpath/trajectory_name_NEW_TOPO.pdb

Parameters:

traj_path (str) – path to the trajectory
topo_path (str) – path to the topology
test_size (float) – size of the test dataset (0.0 if no test set should be created)
n_cluster (int) – number of clusters to be created (representative frames)
image_mol (bool) – True to image the molecule (center it in the box)
outpath (str) – directory path where the new trajector(y)ies should be stored
verbose (bool) – True to get info which steps are currently performed
dist_mat (bool) – True to calculate the n_frames x n_frames distance matrix

calc_rmsd_f() → tuple[ndarray[tuple[int], dtype[int64]], ndarray[tuple[int], dtype[int64]], ndarray[tuple[int], dtype[int64]]][source]¶

Calculate the RMSD and RMSF over the course of the trajectory.

Returns:

tuple[np.ndarray[tuple[int], np.dtype[np.int_]], np.ndarray[tuple[int], np.dtype[np.int_]], np.ndarray[tuple[int], np.dtype[np.int_]]]
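
For readers unfamiliar with the two quantities, the sketch below computes a per-frame RMSD against a reference frame and a per-atom RMSF around the mean position. It works on plain lists of (x, y, z) tuples, performs no superposition, and is not the implementation of calc_rmsd_f; the function names are hypothetical.

```python
import math

def rmsd(frame, reference):
    """RMSD between two frames given as lists of (x, y, z) tuples."""
    squared = [math.dist(a, b) ** 2 for a, b in zip(frame, reference)]
    return math.sqrt(sum(squared) / len(squared))

def rmsf(frames):
    """Per-atom RMSF around each atom's mean position over all frames."""
    n_frames = len(frames)
    result = []
    for atom_positions in zip(*frames):  # positions of one atom across frames
        mean_pos = [sum(axis) / n_frames for axis in zip(*atom_positions)]
        msd = sum(math.dist(p, mean_pos) ** 2 for p in atom_positions) / n_frames
        result.append(math.sqrt(msd))
    return result

frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
rmsd_to_first = [rmsd(frame, frames[0]) for frame in frames]
per_atom_rmsf = rmsf(frames)
```

RMSD yields one value per frame, whereas RMSF yields one value per atom.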

create_dendrogram(distance_threshold=50, output_path='dendrogram.png') → None[source]¶: Cluster the trajectory with hierarchical clustering (linkage) based on the RMSD between the frames and plot a dendrogram. Group frames that have pairwise distances less than “distance_threshold” in one cluster (default is 50).
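
The grouping rule can be sketched without any clustering library. The example below assumes single linkage, which the description does not specify, and merges frames whose pairwise distance is below the threshold using a small union-find; threshold_clusters is a hypothetical helper and no dendrogram is plotted.

```python
def threshold_clusters(dist, distance_threshold):
    """Label frames so that any pair closer than the threshold shares a cluster."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] < distance_threshold:
                parent[find(i)] = find(j)  # merge the two groups

    roots = {}
    # Number the clusters in order of first appearance.
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]

# Pairwise distances between four frames (made-up values).
dist = [
    [0, 10, 80, 90],
    [10, 0, 70, 85],
    [80, 70, 0, 20],
    [90, 85, 20, 0],
]
labels = threshold_clusters(dist, distance_threshold=50)
```

Raising the threshold merges more frames into the same cluster, which corresponds to cutting the dendrogram higher up.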

create_trajectories() → None[source]¶: Save the clustered or strided indices for the present training trajectory and optionally save the corresponding frames as a new trajectory

create_trajectories_by_dendrogram(test_cluster: int) → None[source]¶

Create test trajectories based on a specific cluster and create the train trajectories from all the frames excluding the specific cluster.

Parameters:: test_cluster (int) – Cluster to use as the test set.
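
Given one cluster label per frame, the split described above amounts to the following. This is a sketch on plain lists with a hypothetical helper name; it returns frame indices only and writes no trajectories.

```python
def split_by_cluster(labels, test_cluster):
    """Return (train_frames, test_frames) as lists of frame indices."""
    test_frames = [i for i, label in enumerate(labels) if label == test_cluster]
    train_frames = [i for i, label in enumerate(labels) if label != test_cluster]
    return train_frames, test_frames

labels = [0, 0, 1, 2, 1, 0]  # made-up cluster label for each frame
train_frames, test_frames = split_by_cluster(labels, test_cluster=1)
```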

distance_cluster() → None[source]¶: Cluster the trajectory with AgglomerativeClustering based on the RMSD between the frames

own_idx(file_path: str | ndarray[tuple[int], dtype[int64]])[source]¶

Provide indices of the frames used to create a new trajectory. Useful if the trajectory should be subsampled by some external metric.

Parameters:: file_path (str | np.ndarray[tuple[int], np.dtype[np.int64]]) – path to a file storing the indices, with each index on a separate line, or a numpy array of indices.
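
The expected file layout, one index per line, can be produced and parsed as in the sketch below. load_frame_indices is a hypothetical helper that accepts either a path or an in-memory sequence; it is not the parsing code used by own_idx.

```python
import os
import tempfile

def load_frame_indices(source):
    """Accept a path to a text file (one index per line) or a sequence of ints."""
    if isinstance(source, (str, os.PathLike)):
        with open(source) as handle:
            return [int(line) for line in handle if line.strip()]
    return [int(i) for i in source]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "frame_indices.txt")
    with open(path, "w") as handle:
        handle.write("0\n5\n10\n")
    from_file = load_frame_indices(path)

from_sequence = load_frame_indices([0, 5, 10])
```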

pca_cluster(n: int = 3) → None[source]¶

Cluster the trajectory with KMeans based on the first n principal components of a PCA of the trajectory.

Parameters:: n (int) – number of principal components (per frame) to use for the clustering

read_traj(atom_indices=None, ref_atom_indices=None) → None[source]¶

Read in one or multiple trajectories, remove everything but protein atoms, image the molecule to center it in the water box, and create a training/validation and test split.

Parameters:

atom_indices (array_like | None) – the indices of the atoms to superpose. If not supplied, all atoms will be used.
ref_atom_indices (array_like | None) – use these atoms on the reference structure. If not supplied, the same atom indices will be used for this trajectory and the reference one.

stride() → None[source]¶: Reduce the training dataset size to n samples using striding as the sampling method
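
Striding picks evenly spaced frames. A minimal sketch is given below, assuming the step is the integer ratio of available frames to requested samples; how DataAssembler actually derives its step is not stated here, and stride_indices is a hypothetical helper.

```python
def stride_indices(n_frames, n_samples):
    """Return up to n_samples evenly spaced frame indices."""
    step = max(1, n_frames // n_samples)  # integer step (assumption)
    return list(range(0, n_frames, step))[:n_samples]

indices = stride_indices(n_frames=1000, n_samples=4)
```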
