Data Loading
- class PDBData(filename=None, topology=None, fix_terminal=False, atoms=None, standardise=True)
Create object enabling the manipulation of multi-PDB files into a dataset suitable for training.
- Parameters:
filename (None | str | list[str]) – if not None, import_pdb is called on each filename provided.
topology (None | str) – if not None, import_pdb is called with the topology file.
fix_terminal (bool) – if True, calls fix_terminal after import and before atomselect.
atoms (list[str]) – if not None, calls atomselect.
standardise (bool) – if True, standardise the dataset by removing the mean and dividing by the standard deviation.
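The effect of standardise can be illustrated with a minimal sketch. It assumes a single mean and a single standard deviation computed over all coordinate values, and it uses the population standard deviation; this page confirms neither detail. The helper names standardise_values and destandardise_values are hypothetical and not part of PDBData.

```python
from statistics import fmean, pstdev


def standardise_values(values):
    """Return (standardised values, mean, std) for a flat list of numbers."""
    mean = fmean(values)
    std = pstdev(values)  # assumption: population standard deviation
    if std == 0:
        raise ValueError("cannot standardise constant data")
    return [(v - mean) / std for v in values], mean, std


def destandardise_values(values, mean, std):
    """Invert the transformation to recover the original scale."""
    return [v * std + mean for v in values]


# Toy coordinates standing in for a flattened dataset.
coords = [1.0, 2.0, 3.0, 4.0, 5.0]
scaled, mean, std = standardise_values(coords)
restored = destandardise_values(scaled, mean, std)
```

Keeping the mean and standard deviation around is what allows generated structures to be mapped back to the original coordinate scale.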
- atomselect(atoms: str | list[str])
Select atoms used for training.
- Parameters:
atoms (str | list[str]) – either a str in the MDAnalysis atom selection syntax (https://userguide.mdanalysis.org/1.1.1/selections.html) or a list of atom names like ["CA", …, "O"]
- fix_terminal()
Rename the terminal oxygen OT1 to O if the terminal oxygens are named OT1 and OT2. Otherwise, no oxygen will be selected for the terminal residue during an atomselect using atoms = ['CA', 'C', 'N', 'O', 'CB']. No template will be found for the terminal residue in openmm_loss.
- get_atominfo()
Generate a list of all atoms in the dataset, where each entry contains [atom name, residue name, resid].
- get_dataloader(batch_size, validation_split=0.1, pin_memory=True, manual_seed=None, save_indices=False, indices_dir='.') -> tuple[DataLoader, DataLoader]
- Parameters:
batch_size (int) – size of the training batches
validation_split (float) – fraction of the data randomly assigned to the validation set
pin_memory (bool) – if True, pin memory for the dataloader
manual_seed (int | None) – seed to initialise the random number generator used for splitting the dataset
save_indices (bool) – if True, save train and valid indices to “train_indices.txt” and “valid_indices.txt”
- Returns:
two torch.utils.data.DataLoader objects, the first for the training set and the second for the validation set
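The role of validation_split and manual_seed can be sketched without any deep learning dependency. The snippet below is an illustration only, not the implementation behind get_dataloader: the function name split_indices is hypothetical, and using Python's random module (rather than a torch generator) is an assumption made to keep the example self-contained.

```python
import random


def split_indices(n_samples, validation_split=0.1, manual_seed=None):
    """Shuffle sample indices and split them into (train, valid) lists."""
    rng = random.Random(manual_seed)  # a fixed seed gives a reproducible split
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_valid = int(n_samples * validation_split)
    return indices[n_valid:], indices[:n_valid]


# The same seed reproduces the same split on every call.
train_a, valid_a = split_indices(100, validation_split=0.1, manual_seed=42)
train_b, valid_b = split_indices(100, validation_split=0.1, manual_seed=42)
```

Fixing the seed is what makes a split repeatable across runs; saving the resulting indices to files serves the same purpose when the seed is not known.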
- get_datasets(validation_split=0.1, valid_size=None, train_size=None, manual_seed=None, save_indices=False)
Create a training and validation set from the imported data. This is deprecated. Use get_dataloader instead.
- Parameters:
validation_split – fraction of the data randomly assigned to the validation set
valid_size – if not None, specify the number of validation structures to be returned
train_size – if not None, specify the number of training structures to be returned
manual_seed – seed to initialise the random number generator used for splitting the dataset. Useful to replicate a specific split.
- Returns:
two torch.Tensor objects, containing the training and validation structures.
- import_pdb(filename: str | list[str], topology: str | None = None) -> None
Load one or multiple trajectory files as MDAnalysis Universe.
- Parameters:
filename (str | list[str]) – the path to the trajectory as a str, or a list of file paths to multiple trajectories
topology (str | None) – the path to the topology file for the trajectory or trajectories
- class DataAssembler(traj_path: str, topo_path: str | None = None, test_size: float = 0.15, n_cluster: int = 1500, image_mol: bool = False, outpath: str = '', verbose: bool = False, dist_mat: bool = True)
Create clustered trajectories, strided trajectories, and randomly sampled test frames, either by changing the respective trajectory frame indices or by creating a new trajectory. Also concatenates multiple trajectories into one and centers the protein in the water box. The topology of the newly created trajectory is saved in self.outpath/trajectory_name_NEW_TOPO.pdb
- Parameters:
traj_path (str) – path to the trajectory
topo_path (str | None) – path to the topology
test_size (float) – size of the test dataset (0.0 if no test set should be created)
n_cluster (int) – number of clusters to be created (representative frames)
image_mol (bool) – True to image the molecule (center it in the box)
outpath (str) – directory path where the new trajectory or trajectories should be stored
verbose (bool) – True to report which steps are currently being performed
dist_mat (bool) – True to calculate the n_frames x n_frames distance matrix
- calc_rmsd_f() -> tuple[ndarray[tuple[int], dtype[int64]], ndarray[tuple[int], dtype[int64]], ndarray[tuple[int], dtype[int64]]]
Calculate the RMSD and RMSF over the course of the trajectory.
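For readers unfamiliar with the two quantities, the sketch below computes them from their standard definitions on a toy trajectory. It assumes the frames are already superposed and says nothing about what calc_rmsd_f actually returns; the function names rmsd and rmsf are hypothetical helpers, not part of DataAssembler.

```python
import math


def rmsd(frame, reference):
    """Root-mean-square deviation between two frames of (x, y, z) tuples."""
    squared = [
        sum((a - b) ** 2 for a, b in zip(atom, ref_atom))
        for atom, ref_atom in zip(frame, reference)
    ]
    return math.sqrt(sum(squared) / len(squared))


def rmsf(frames):
    """Per-atom root-mean-square fluctuation around the mean position."""
    n_frames = len(frames)
    n_atoms = len(frames[0])
    result = []
    for i in range(n_atoms):
        mean_pos = [sum(f[i][k] for f in frames) / n_frames for k in range(3)]
        mean_sq = sum(
            sum((f[i][k] - mean_pos[k]) ** 2 for k in range(3)) for f in frames
        ) / n_frames
        result.append(math.sqrt(mean_sq))
    return result


# Two frames, two atoms: the second atom moves by 2 units along x.
frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
rmsd_to_first = rmsd(frames[1], frames[0])
fluctuations = rmsf(frames)
```

RMSD yields one value per frame relative to a reference, whereas RMSF yields one value per atom across all frames.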
- create_dendrogram(distance_threshold=50, output_path='dendrogram.png') -> None
Cluster the trajectory with hierarchical clustering (linkage) based on the RMSD between the frames and plot a dendrogram. Group frames that have pairwise distances less than “distance_threshold” in one cluster (default is 50).
- create_trajectories() -> None
Save the clustered or strided indices for the present training trajectory and optionally save the corresponding frames as a new trajectory.
- create_trajectories_by_dendrogram(test_cluster: int) -> None
Create test trajectories based on a specific cluster and create the train trajectories from all the frames excluding the specific cluster.
- Parameters:
test_cluster (int) – Cluster to use as the test set.
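The split described above can be pictured as a simple partition of frame indices by cluster label. This is a sketch under assumptions: the list of per-frame labels and the function name split_by_cluster are hypothetical, and the real method additionally writes trajectories to disk.

```python
def split_by_cluster(labels, test_cluster):
    """Partition frame indices into (train, test) using per-frame cluster labels."""
    test_idx = [i for i, label in enumerate(labels) if label == test_cluster]
    train_idx = [i for i, label in enumerate(labels) if label != test_cluster]
    return train_idx, test_idx


# Hypothetical cluster label for each of six frames.
labels = [1, 1, 2, 3, 2, 1]
train_idx, test_idx = split_by_cluster(labels, test_cluster=2)
```

Holding out a whole cluster means the test frames are structurally distinct from the training frames, which is a stricter check than a random split.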
- distance_cluster() -> None
Cluster the trajectory with AgglomerativeClustering based on the RMSD between the frames.
- own_idx(file_path: str | ndarray[tuple[int], dtype[int64]])
Provide indices for frames to create a new trajectory. Useful if trajectory should be sub sampled by some external metric.
- Parameters:
file_path (str | np.ndarray[tuple[int], np.dtype[np.int64]]) – either the path to a file storing the indices, with each index on a separate line, or a numpy array of indices.
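An index file in the format described above (one index per line) can be produced with a few lines of plain Python. The file name frame_indices.txt and both helper functions are hypothetical; the snippet only demonstrates the file layout and does not call own_idx.

```python
import os
import tempfile


def write_index_file(indices, path):
    """Write one integer frame index per line."""
    with open(path, "w") as handle:
        for idx in indices:
            handle.write(f"{int(idx)}\n")


def read_index_file(path):
    """Read the indices back, skipping blank lines."""
    with open(path) as handle:
        return [int(line) for line in handle if line.strip()]


# Keep every 10th frame of a hypothetical 50-frame trajectory.
selected = list(range(0, 50, 10))
with tempfile.TemporaryDirectory() as tmp_dir:
    index_path = os.path.join(tmp_dir, "frame_indices.txt")
    write_index_file(selected, index_path)
    loaded = read_index_file(index_path)
```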
- pca_cluster(n: int = 3) -> None
Cluster the trajectory with KMeans based on the first n principal components of a PCA of the trajectory.
- Parameters:
n (int) – number of principal components (per frame) to use for the clustering
- read_traj(atom_indices=None, ref_atom_indices=None) -> None
Read in one or multiple trajectories, remove everything but protein atoms, image the molecule to center it in the water box, and create a training/validation and test split.
- Parameters:
atom_indices (array_like | None) – the indices of the atoms to superpose. If not supplied, all atoms will be used.
ref_atom_indices (array_like | None) – use these atoms on the reference structure. If not supplied, the same atom indices will be used for this trajectory and the reference one.