Data Loading
- class PDBData(filename=None, fix_terminal=False, atoms=None)
Create an object that turns multi-PDB files into a dataset suitable for training.
- Parameters:
filename – None, str or list of strings. If not None, import_pdb is called on each filename provided.
fix_terminal – if True, calls fix_terminal after import and before atomselect.
atoms – if not None, calls atomselect.
- atomselect(atoms, ignore_atoms=[])
From all imported PDBs, extract only the atoms of interest. import_pdb must have been called at least once, either at class instantiation or as a separate call.
- Parameters:
atoms – list of atom names, or "no_hydrogen".
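The selection logic described above can be pictured with a small standalone sketch. It is not the library's implementation: the helper name select_atoms is hypothetical, the rule that a hydrogen is any atom whose name starts with "H" is an assumption, and so is the reading of ignore_atoms as a list of names to exclude, which the reference above does not document.

```python
def select_atoms(atom_names, atoms, ignore_atoms=()):
    """Return the indices of the atom names kept by the selection.

    Hypothetical helper, written only to illustrate the idea.
    """
    ignored = set(ignore_atoms)
    if atoms == "no_hydrogen":
        # Assumption: hydrogens are recognised by a leading "H" in the name.
        def keep(name):
            return not name.startswith("H")
    else:
        wanted = set(atoms)

        def keep(name):
            return name in wanted
    # Assumption: ignore_atoms removes names from whatever was selected.
    return [i for i, name in enumerate(atom_names)
            if keep(name) and name not in ignored]


names = ["N", "H", "CA", "HA", "CB", "C", "O"]
backbone = select_atoms(names, ["CA", "C", "N", "O", "CB"])
heavy = select_atoms(names, "no_hydrogen", ignore_atoms=["CB"])
```

Returning indices rather than names makes it straightforward to apply the same selection to every frame of a coordinate array.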
- fix_terminal()
Rename the terminal oxygen OT1 to O when the terminal oxygens are named OT1 and OT2. Otherwise, no oxygen is selected for the terminal residue during an atomselect using atoms = ['CA', 'C', 'N', 'O', 'CB'], and no template will be found for the terminal residue in openmm_loss. An alternative is to use atoms = ['CA', 'C', 'N', 'O', 'CB', 'OT1'] instead.
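To see why the renaming matters, the following minimal sketch applies a name-based selection before and after the fix. The function rename_terminal_oxygen is a hypothetical stand-in operating on a plain list of atom names; it is not the method's actual code.

```python
def rename_terminal_oxygen(atom_names):
    """Rename OT1 to O, but only when both OT1 and OT2 are present."""
    if "OT1" in atom_names and "OT2" in atom_names:
        return ["O" if name == "OT1" else name for name in atom_names]
    return list(atom_names)


# Atom names of a hypothetical terminal residue.
terminal = ["N", "CA", "C", "OT1", "OT2", "CB"]
selection = ["CA", "C", "N", "O", "CB"]

# Without the fix, the selection finds no oxygen in this residue.
before = [name for name in terminal if name in selection]
# After the fix, the renamed oxygen is picked up.
after = [name for name in rename_terminal_oxygen(terminal) if name in selection]
```

Here before contains no oxygen at all, while after includes the renamed O.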
- get_atominfo()
Generate a list of all atoms in the dataset, where every entry contains [atom name, residue name, resid].
- get_dataloader(batch_size, validation_split=0.1, pin_memory=True, dataset_sample_size=-1, manual_seed=None, shuffle=True, sampler=None)
- Parameters:
batch_size –
validation_split –
pin_memory –
dataset_sample_size –
manual_seed –
shuffle –
sampler –
- Returns:
torch.utils.data.DataLoader for training set
- Returns:
torch.utils.data.DataLoader for validation set
- get_datasets(validation_split=0.1, valid_size=None, train_size=None, manual_seed=None)
Create a training and a validation set from the imported data.
- Parameters:
validation_split – fraction of the data randomly assigned to the validation set
valid_size – if not None, specify the number of validation structures to be returned
train_size – if not None, specify the number of training structures to be returned
manual_seed – seed to initialise the random number generator used for splitting the dataset. Useful to replicate a specific split.
- Returns:
two torch.Tensor, for training and validation structures.
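The role of manual_seed can be illustrated with a standalone sketch of a seeded random split. It uses only the Python standard library and a hypothetical helper, split_indices; the rounding rule for the validation size is an assumption, and the library itself may split differently.

```python
import random


def split_indices(n_structures, validation_split=0.1, manual_seed=None):
    """Shuffle structure indices and split them into (train, valid) lists."""
    rng = random.Random(manual_seed)  # private generator, global state untouched
    indices = list(range(n_structures))
    rng.shuffle(indices)
    # Assumption: the validation size is the rounded fraction of the data.
    n_valid = round(n_structures * validation_split)
    return indices[n_valid:], indices[:n_valid]


# The same seed reproduces the same split.
train_a, valid_a = split_indices(100, validation_split=0.1, manual_seed=25)
train_b, valid_b = split_indices(100, validation_split=0.1, manual_seed=25)
```

Calling the helper twice with the same seed yields identical index lists, which is what makes a specific split replicable.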
- import_pdb(filename)
Load a multiPDB file. This method can be called multiple times to load several datasets, provided they all contain the same number of atoms.
- Parameters:
filename – path to multiPDB file.
- prepare_dataset()
Once all datasets have been loaded, normalise the data and convert it into a torch.Tensor (ready for training).
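The reference above does not say how the data are normalised. One common choice, shown here purely as an assumption, is to subtract the mean of all coordinates and divide by their standard deviation. The sketch uses nested Python lists and a hypothetical normalise helper, and it leaves out the conversion to torch.Tensor so that it stays self-contained.

```python
from statistics import fmean, pstdev


def normalise(coords):
    """Standardise coordinates given as structures -> atoms -> [x, y, z].

    Assumption: a single mean and standard deviation are computed over
    every coordinate in the dataset.
    """
    flat = [x for structure in coords for atom in structure for x in atom]
    mean, std = fmean(flat), pstdev(flat)
    scaled = [[[(x - mean) / std for x in atom] for atom in structure]
              for structure in coords]
    return scaled, mean, std


# Two toy structures with two atoms each.
coords = [
    [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]],
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
]
scaled, mean, std = normalise(coords)
flat_scaled = [x for structure in scaled for atom in structure for x in atom]
```

Keeping mean and std alongside the scaled data allows generated structures to be mapped back to the original coordinate scale.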
- split(*args, **kwargs)
Split PDBData into two other PDBData objects corresponding to the train and validation sets.
- Parameters:
manual_seed – manual seed used to split the dataset
validation_split – fraction of the data randomly assigned to the validation set
train_size – if not None, specify the number of train structures to be returned
valid_size – if not None, specify the number of valid structures to be returned
- Returns:
PDBData object corresponding to the train set
- Returns:
PDBData object corresponding to the validation set