Data Loading

class PDBData(filename=None, fix_terminal=False, atoms=None)

Create an object for converting multi-PDB files into a dataset suitable for training.

Parameters:

filename – None, a str, or a list of str. If not None, import_pdb is called on each filename provided.
fix_terminal – if True, call fix_terminal after import and before atomselect.
atoms – if not None, call atomselect with this value.

atomselect(atoms, ignore_atoms=[])

From all imported PDBs, extract only the atoms of interest. import_pdb must have been called at least once, either at class instantiation or as a separate call.

Parameters:

atoms – list of atom names, or "no_hydrogen".
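As a rough illustration of what selecting atoms by name involves, the sketch below filters a list of atom records in plain Python. It does not call molearn: `select_atoms` and the record layout are hypothetical, and both the reading of "no_hydrogen" (keep every atom whose name does not start with "H") and the reading of ignore_atoms (names dropped after selection) are assumptions, not documented library behaviour.

```python
def select_atoms(records, atoms, ignore_atoms=()):
    """Return the records whose atom name is selected.

    records: list of (atom_name, residue_name, resid) tuples (hypothetical layout).
    atoms: list of atom names, or the string "no_hydrogen".
    ignore_atoms: atom names dropped even if otherwise selected (assumed meaning).
    """
    ignored = set(ignore_atoms)
    wanted = None if atoms == "no_hydrogen" else set(atoms)

    def is_selected(name):
        if wanted is None:
            # Assumption: hydrogens are recognised by a leading "H" in the name.
            return not name.startswith("H")
        return name in wanted

    return [r for r in records if is_selected(r[0]) and r[0] not in ignored]


records = [
    ("N", "ALA", 1), ("H", "ALA", 1), ("CA", "ALA", 1), ("HA", "ALA", 1),
    ("CB", "ALA", 1), ("C", "ALA", 1), ("O", "ALA", 1),
]
backbone = select_atoms(records, ["CA", "C", "N", "O", "CB"])
heavy = select_atoms(records, "no_hydrogen", ignore_atoms=["CB"])
```

The selection keeps the atoms in their original file order, whatever order the names are listed in.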

fix_terminal()

Rename the terminal oxygen OT1 to O when the terminal oxygens are named OT1 and OT2. Otherwise, no oxygen is selected for the terminal residue by an atomselect using atoms = ['CA', 'C', 'N', 'O', 'CB']. No template will be found for the terminal residue in openmm_loss. An alternative is to use atoms = ['CA', 'C', 'N', 'O', 'CB', 'OT1'] instead.
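The effect of the renaming on a later selection can be seen in a small standalone sketch. It works on a plain list of atom names and does not call molearn; `rename_terminal_oxygen` is a hypothetical helper written for this illustration.

```python
def rename_terminal_oxygen(atom_names):
    """Rename OT1 to O, but only when both OT1 and OT2 are present."""
    if "OT1" in atom_names and "OT2" in atom_names:
        return ["O" if name == "OT1" else name for name in atom_names]
    return list(atom_names)


selection = ["CA", "C", "N", "O", "CB"]
terminal_residue = ["N", "CA", "CB", "C", "OT1", "OT2"]

# Without the fix, the terminal residue contributes no oxygen to the selection.
before = [a for a in terminal_residue if a in selection]
# After renaming, OT1 is picked up as O.
after = [a for a in rename_terminal_oxygen(terminal_residue) if a in selection]
```

Here `before` lacks an oxygen while `after` contains one, so every residue contributes the same set of atoms.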

frame()

Return a biobox.Molecule object with the loaded data.

get_atominfo()

Generate a list of all atoms in the dataset, where every entry contains [atom name, residue name, resid].

get_dataloader(batch_size, validation_split=0.1, pin_memory=True, dataset_sample_size=-1, manual_seed=None, shuffle=True, sampler=None)

Parameters:

batch_size –
validation_split –
pin_memory –
dataset_sample_size –
manual_seed –
shuffle –
sampler –

Returns:

torch.utils.data.DataLoader for the training set, and torch.utils.data.DataLoader for the validation set.

get_datasets(validation_split=0.1, valid_size=None, train_size=None, manual_seed=None)

Create a training set and a validation set from the imported data.

Parameters:

validation_split – fraction of the data randomly assigned to the validation set.
valid_size – if not None, specify the number of validation structures to be returned.
train_size – if not None, specify the number of training structures to be returned.
manual_seed – seed used to initialise the random number generator that splits the dataset. Useful for reproducing a specific split.

Returns:

Two torch.Tensor objects, holding the training and validation structures.
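The role of validation_split and manual_seed can be sketched with the standard library alone. This is not molearn's implementation: `split_indices` is a hypothetical helper that returns index lists, not tensors, and it only shows how a fixed seed makes a random split reproducible.

```python
import random


def split_indices(n_structures, validation_split=0.1, manual_seed=None):
    """Shuffle structure indices and split them into train and validation lists."""
    rng = random.Random(manual_seed)  # a fixed seed reproduces the same shuffle
    indices = list(range(n_structures))
    rng.shuffle(indices)
    n_valid = int(n_structures * validation_split)
    return indices[n_valid:], indices[:n_valid]


train_a, valid_a = split_indices(100, validation_split=0.1, manual_seed=25)
train_b, valid_b = split_indices(100, validation_split=0.1, manual_seed=25)
```

Both calls produce identical index lists because they share a seed; with manual_seed=None the split would differ from run to run.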

import_pdb(filename)

Load a multi-PDB file. This method can be called multiple times to load several datasets, provided they all contain the same number of atoms.

Parameters:

filename – path to a multi-PDB file.
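For readers unfamiliar with the format, a multi-PDB file stores several conformations as consecutive MODEL ... ENDMDL blocks. The sketch below reads such text using the fixed-column layout of PDB ATOM records and checks that every model has the same number of atoms. It is a simplified stand-in, not how molearn loads files; `read_multipdb` and `atom_line` are hypothetical helpers, and only ATOM records are handled.

```python
def read_multipdb(text):
    """Parse multi-model PDB text into a list of frames.

    Each frame is a list of (atom_name, residue_name, resid, (x, y, z)) tuples.
    """
    frames, current = [], []
    for line in text.splitlines():
        if line.startswith("ATOM"):
            # Fixed-width columns of the PDB ATOM record (0-based slices).
            name = line[12:16].strip()
            resname = line[17:20].strip()
            resid = int(line[22:26])
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            current.append((name, resname, resid, xyz))
        elif line.startswith("ENDMDL"):
            frames.append(current)
            current = []
    if current:  # a single model without an ENDMDL record
        frames.append(current)
    if len({len(frame) for frame in frames}) > 1:
        raise ValueError("all models must contain the same number of atoms")
    return frames


def atom_line(serial, name, resname, resid, x, y, z):
    """Format one column-aligned ATOM record for the demo below."""
    return (f"ATOM  {serial:5d} {name:<4s} {resname:>3s} A{resid:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


model = [
    atom_line(1, "N", "ALA", 1, 0.0, 0.0, 0.0),
    atom_line(2, "CA", "ALA", 1, 1.458, 0.0, 0.0),
]
text = "\n".join(["MODEL        1", *model, "ENDMDL",
                  "MODEL        2", *model, "ENDMDL"])
frames = read_multipdb(text)
```

The atom-count check mirrors the constraint stated above: structures can only be stacked into one dataset if every frame has the same number of atoms.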

prepare_dataset()

Once all datasets have been loaded, normalise the data and convert it into a torch.Tensor (ready for training).
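The text above does not say which normalisation is applied. One common choice, assumed here purely for illustration, is to subtract the mean of all coordinates and divide by their standard deviation. The sketch uses nested lists in place of tensors, and `normalise` and `denormalise` are hypothetical helpers, not molearn functions.

```python
from statistics import fmean, pstdev


def normalise(frames):
    """Shift and scale all coordinates to zero mean and unit standard deviation."""
    values = [c for frame in frames for atom in frame for c in atom]
    mean, std = fmean(values), pstdev(values)
    scaled = [[[(c - mean) / std for c in atom] for atom in frame]
              for frame in frames]
    return scaled, mean, std


def denormalise(scaled, mean, std):
    """Invert normalise, recovering coordinates in the original units."""
    return [[[c * std + mean for c in atom] for atom in frame]
            for frame in scaled]


# Two frames of two atoms each, coordinates given as [x, y, z].
frames = [
    [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]],
    [[0.1, 0.2, 0.0], [1.4, 0.3, 0.1]],
]
scaled, mean, std = normalise(frames)
restored = denormalise(scaled, mean, std)
```

Keeping `mean` and `std` alongside the scaled data is what lets generated structures be mapped back to real coordinates later.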

split(*args, **kwargs)

Split PDBData into two other PDBData objects corresponding to the training and validation sets.

Parameters:

manual_seed – seed used to split the dataset.
validation_split – fraction of the data randomly assigned to the validation set.
train_size – if not None, specify the number of training structures to be returned.
valid_size – if not None, specify the number of validation structures to be returned.

Returns:

PDBData object corresponding to the training set, and PDBData object corresponding to the validation set.
