Data Loading

class PDBData(filename=None, fix_terminal=False, atoms=None)[source]

Create an object that turns multi-PDB files into a dataset suitable for training.

Parameters:
  • filename – None, str or list of strings. If not None, import_pdb is called on each filename provided.

  • fix_terminal – if True, calls fix_terminal after import, and before atomselect

  • atoms – if not None, calls atomselect

atomselect(atoms, ignore_atoms=[])[source]

From all imported PDBs, extract only atoms of interest. import_pdb must have been called at least once, either at class instantiation or as a separate call.

Parameters:

atoms – list of atom names, or “no_hydrogen”.

fix_terminal()[source]

Rename the C-terminal oxygen OT1 to O if the terminal oxygens are named OT1 and OT2. Without this rename, no oxygen is selected for the terminal residue during an atomselect using atoms = ['CA', 'C', 'N', 'O', 'CB'], and no template is found for the terminal residue in openmm_loss. An alternative solution is to use atoms = ['CA', 'C', 'N', 'O', 'CB', 'OT1'] instead.
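The effect of the rename can be illustrated with a minimal stand-alone sketch. It works on plain lists of atom names rather than on a PDBData object, and the helper names rename_terminal_oxygen and select are hypothetical; they are not part of the library and do not reproduce its implementation.

```python
def rename_terminal_oxygen(names):
    """Return a copy of names with 'OT1' renamed to 'O' (hypothetical helper)."""
    return ["O" if name == "OT1" else name for name in names]


def select(names, atoms):
    """Keep only the atom names listed in atoms, preserving their order."""
    wanted = set(atoms)
    return [name for name in names if name in wanted]


# Atom names of a terminal residue that uses the OT1/OT2 convention.
terminal_residue = ["N", "CA", "C", "OT1", "OT2", "CB"]
backbone = ["CA", "C", "N", "O", "CB"]

# Without the rename, the selection contains no oxygen for this residue.
before = select(terminal_residue, backbone)

# After the rename, 'O' is found and selected.
after = select(rename_terminal_oxygen(terminal_residue), backbone)

# Alternative: leave the names untouched and add 'OT1' to the selection.
with_ot1 = select(terminal_residue, backbone + ["OT1"])
```

In this sketch, before holds no oxygen at all, while after and with_ot1 each hold exactly one.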

frame()[source]

Return a biobox.Molecule object containing the loaded data.

get_atominfo()[source]

Generate a list of all atoms in the dataset, where every entry contains [atom name, residue name, resid].

get_dataloader(batch_size, validation_split=0.1, pin_memory=True, dataset_sample_size=-1, manual_seed=None, shuffle=True, sampler=None)[source]
Parameters:
  • batch_size

  • validation_split

  • pin_memory

  • dataset_sample_size

  • manual_seed

  • shuffle

  • sampler

Returns:

two torch.utils.data.DataLoader objects, the first for the training set and the second for the validation set.

get_datasets(validation_split=0.1, valid_size=None, train_size=None, manual_seed=None)[source]

Create a training and validation set from the imported data

Parameters:
  • validation_split – fraction of the data randomly assigned to the validation set

  • valid_size – if not None, specify the number of validation structures to be returned

  • train_size – if not None, specify the number of training structures to be returned

  • manual_seed – seed to initialise the random number generator used for splitting the dataset. Useful to replicate a specific split.

Returns:

two torch.Tensor, for training and validation structures.
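A seeded random split of this kind can be sketched as follows. This is a minimal illustration using only the standard library, not the library's own implementation: the function name split_indices is hypothetical, and rounding the validation count with round() is an assumption.

```python
import random


def split_indices(n_structures, validation_split=0.1, manual_seed=None):
    """Shuffle the indices 0..n-1 and split them into train and validation lists."""
    indices = list(range(n_structures))
    rng = random.Random(manual_seed)  # the same seed reproduces the same shuffle
    rng.shuffle(indices)
    # Assumption: the validation count is the rounded fraction of the data.
    n_valid = round(n_structures * validation_split)
    return indices[n_valid:], indices[:n_valid]


# Two calls with the same seed yield the same split.
train_idx, valid_idx = split_indices(100, validation_split=0.1, manual_seed=42)
repeat_train, repeat_valid = split_indices(100, validation_split=0.1, manual_seed=42)
```

Fixing manual_seed is what makes a specific split reproducible across runs; with manual_seed=None the shuffle differs from call to call.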

import_pdb(filename)[source]

Load a multiPDB file. This method can be called multiple times to load several datasets, provided they all contain the same number of atoms.

Parameters:

filename – path to multiPDB file.

prepare_dataset()[source]

Once all datasets have been loaded, normalise the data and convert it into a torch.Tensor (ready for training).
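The order of the loading steps described in this section can be summarised in a short sketch. The helper build_dataset is hypothetical and takes any object exposing the methods documented here (for example a PDBData instance); it only calls import_pdb, fix_terminal, atomselect and prepare_dataset with the signatures shown above.

```python
def build_dataset(data, filenames, atoms=None, fix_terminal=False):
    """Run the loading steps in the documented order on a PDBData-like object."""
    for filename in filenames:
        # Every file must contain the same number of atoms.
        data.import_pdb(filename)
    if fix_terminal:
        # Rename terminal oxygens after import and before atom selection.
        data.fix_terminal()
    if atoms is not None:
        data.atomselect(atoms=atoms)
    # Normalise and convert to torch.Tensor once everything is loaded.
    data.prepare_dataset()
    return data
```

With the library installed, data would be an empty PDBData() instance and filenames a list of paths to multiPDB files. Passing filename, fix_terminal and atoms directly to the constructor covers the first three steps in the same order.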

split(*args, **kwargs)[source]

Split PDBData into two other PDBData objects corresponding to train and valid sets.

Parameters:
  • manual_seed – manual seed used to split dataset

  • validation_split – fraction of the data randomly assigned to the validation set

  • train_size – if not None, specify the number of training structures to be returned

  • valid_size – if not None, specify the number of validation structures to be returned

Returns:

two PDBData objects, the first corresponding to the training set and the second to the validation set.