Data Loading

class PDBData(filename=None, fix_terminal=False, atoms=None)

Create object enabling the manipulation of multi-PDB files into a dataset suitable for training.

  • filename – None, str or list of strings. If not None, import_pdb is called on each filename provided.

  • fix_terminal – if True, calls fix_terminal after import, and before atomselect

  • atoms – if not None, calls atomselect
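The relationship between the constructor arguments and the separate method calls can be sketched as below. This is only an illustration under stated assumptions: the import path `molearn.data` is not given on this page, `fix_terminal` is assumed to take no arguments, and `path` stands for a multi-PDB file of your own. The two hypothetical helper functions are expected to build equivalent objects.

```python
ATOMS = ["CA", "C", "N", "CB", "O"]


def load_in_one_step(path):
    """Let the constructor import, fix terminal names and select atoms."""
    # Assumption: import path is not stated in this section.
    from molearn.data import PDBData

    return PDBData(filename=path, fix_terminal=True, atoms=ATOMS)


def load_step_by_step(path):
    """Same order of operations, spelled out as explicit calls."""
    from molearn.data import PDBData

    data = PDBData()
    data.import_pdb(path)         # can be repeated for further files
    data.fix_terminal()           # after import, before atomselect
    data.atomselect(atoms=ATOMS)  # keep only the atoms of interest
    return data
```

The imports sit inside the functions so that the module can be loaded even where the package is not installed.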

atomselect(atoms, ignore_atoms=[])

From all imported PDBs, extract only atoms of interest. import_pdb must have been called at least once, either at class instantiation or as a separate call.


atoms – list of atom names, or “no_hydrogen”.


Rename the terminal oxygen OT1 to O if the terminal oxygens are named OT1 and OT2. Otherwise, no oxygen will be selected during an atomselect using atoms = ['CA', 'C', 'N', 'O', 'CB']. No template will be found for the terminal residue in openmm_loss. An alternative solution is to use atoms = ['CA', 'C', 'N', 'O', 'CB', 'OT1'] instead.
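The effect of the renaming on a name-based selection can be illustrated with plain Python. This is a minimal sketch, not the library's implementation: the helper functions `rename_ot1` and `select` are hypothetical, and atoms are represented as (atom name, residue name, resid) tuples purely for the example.

```python
def rename_ot1(atoms):
    """Return a copy of the atom list with OT1 renamed to O."""
    return [
        ("O" if name == "OT1" else name, resname, resid)
        for name, resname, resid in atoms
    ]


def select(atoms, wanted):
    """Keep only atoms whose name appears in `wanted`."""
    return [atom for atom in atoms if atom[0] in wanted]


# A terminal residue whose oxygens are named OT1 and OT2.
terminal_residue = [
    ("N", "GLY", 10),
    ("CA", "GLY", 10),
    ("C", "GLY", 10),
    ("OT1", "GLY", 10),
    ("OT2", "GLY", 10),
]
wanted = ["CA", "C", "N", "O", "CB"]

before = [a[0] for a in select(terminal_residue, wanted)]
after = [a[0] for a in select(rename_ot1(terminal_residue), wanted)]

print(before)  # ['N', 'CA', 'C']
print(after)   # ['N', 'CA', 'C', 'O']
```

Without the renaming, the terminal residue loses its oxygen in the selection; with it, one oxygen is retained under the name O.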


Returns: biobox.Molecule object with loaded data


Generate a list of all atoms in the dataset, where every entry contains [atom name, residue name, resid].

get_dataloader(batch_size, validation_split=0.1, pin_memory=True, dataset_sample_size=-1, manual_seed=None, shuffle=True, sampler=None)
  • batch_size

  • validation_split

  • pin_memory

  • dataset_sample_size

  • manual_seed

  • shuffle

  • sampler

Returns: dataloader for the training set

Returns: dataloader for the validation set

get_datasets(validation_split=0.1, valid_size=None, train_size=None, manual_seed=None)

Create a training and validation set from the imported data

  • validation_split – fraction of the data to be randomly assigned to the validation set

  • valid_size – if not None, specify the number of validation structures to be returned

  • train_size – if not None, specify the number of training structures to be returned

  • manual_seed – seed to initialise the random number generator used for splitting the dataset. Useful to replicate a specific split.


Returns: two torch.Tensor objects, for training and validation structures.
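A seeded random split of this kind can be sketched with the standard library alone. The function `split_indices` below is hypothetical and is not the library's code; in particular, how explicit sizes interact with validation_split is an assumption made for the example (explicit sizes take precedence when given).

```python
import random


def split_indices(n, validation_split=0.1, valid_size=None,
                  train_size=None, manual_seed=None):
    """Return (train, valid) index lists for a dataset of n structures."""
    rng = random.Random(manual_seed)  # same seed gives the same shuffle
    indices = list(range(n))
    rng.shuffle(indices)

    # Assumption: explicit sizes override the ratio when provided.
    if valid_size is None:
        valid_size = int(n * validation_split)
    if train_size is None:
        train_size = n - valid_size

    valid = indices[:valid_size]
    train = indices[valid_size:valid_size + train_size]
    return train, valid


train, valid = split_indices(100, validation_split=0.1, manual_seed=25)
repeat = split_indices(100, validation_split=0.1, manual_seed=25)

print(len(train), len(valid))    # 90 10
print((train, valid) == repeat)  # True
```

Reusing the same manual_seed reproduces the same partition, which is what makes a specific split replicable.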


Load a multiPDB file. This command can be called multiple times to load several datasets, provided they all feature the same number of atoms.


filename – path to multiPDB file.


Once all datasets have been loaded, normalise the data and convert it into a torch.Tensor (ready for training).
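This page does not say how the data are normalised. As one plausible reading, offered as an assumption only, the sketch below standardises all coordinates with a single global mean and standard deviation. The function `normalise` is hypothetical and uses nested lists in place of a tensor so that it runs with the standard library alone.

```python
import statistics


def normalise(coords):
    """Standardise coordinates given as frames x atoms x 3 nested lists.

    Returns the scaled data plus the mean and standard deviation,
    which are needed to map outputs back to real coordinates.
    """
    flat = [x for frame in coords for atom in frame for x in atom]
    mean = statistics.fmean(flat)
    std = statistics.pstdev(flat)  # population standard deviation
    if std == 0:
        raise ValueError("all coordinates are identical")
    scaled = [
        [[(x - mean) / std for x in atom] for atom in frame]
        for frame in coords
    ]
    return scaled, mean, std


# Two frames of two atoms each (made-up numbers).
coords = [
    [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]],
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
]
scaled, mean, std = normalise(coords)
print(mean)  # 3.0
```

Keeping the mean and standard deviation alongside the scaled data allows the transformation to be inverted later.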

split(*args, **kwargs)

Split PDBData into two other PDBData objects corresponding to train and valid sets.

  • manual_seed – manual seed used to split dataset

  • validation_split – fraction of the data to be randomly assigned to the validation set

  • train_size – if not None, specify the number of training structures to be returned

  • valid_size – if not None, specify the number of validation structures to be returned


Returns: PDBData object corresponding to the training set


Returns: PDBData object corresponding to the validation set