Analysis

class MolearnAnalysis(batch_size=1, processes=1)[source]

This class provides methods dedicated to the quality analysis and structure generation with a trained model.

generate(crd: ndarray[tuple[int, int], dtype[float64]], pdb_path: str | None = None, relax: bool = False) → ndarray[tuple[int, int, int], dtype[float64]][source]

Generate a collection of protein conformations, given coordinates in the latent space.

Parameters:
  • crd (numpy.array) – coordinates in the latent space, as an (Nx2) array

  • pdb_path (str) – path where the PDB files should be stored; files are named s_i.pdb, where i is the index in the crd array

  • relax (bool) – if True, relax generated structures with energy minimisation; relaxed structures are written to s_i_relaxed.pdb files

Returns:

collection of protein conformations in the Cartesian space (NxMx3, where M is the number of atoms in the protein)

get_all_dope_score(tensor, refine=True)[source]

Calculate DOPE score of an ensemble of atom coordinates.

Parameters:
  • tensor – torch.Tensor or numpy.ndarray with shape [B, N, 3] containing Cartesian coordinates of atoms.

  • refine (bool) – if True, return DOPE score of input and output structure after refinement

get_all_ramachandran_score(tensor)[source]

Calculate Ramachandran score of an ensemble of atomic coordinates.

Parameters:

tensor – torch.Tensor or numpy.ndarray with shape [B, N, 3] containing Cartesian coordinates of atoms.

Returns:

dictionary with keys ‘favored’, ‘allowed’, ‘outliers’, and ‘total’ containing arrays of Ramachandran scores

get_bondlengths(key)[source]

Get backbone bond lengths of a dataset and its decoded counterpart.

get_dataset(key, scale=False)[source]
Parameters:
  • key (str) – key pointing to a dataset previously loaded with set_dataset

  • scale (bool) – if True, return the dataset scaled (i.e. with mean and std applied)

Returns:

torch.Tensor for dataset with the key

get_decoded(key, update=False, scale=False)[source]
Parameters:
  • key (str) – key pointing to a dataset previously loaded with set_dataset

  • update (bool) – if True, re-decode and overwrite the existing data

  • scale (bool) – if True, return the dataset scaled (i.e. with mean and std applied)

Returns:

torch.Tensor for decoded dataset with the key

get_dope(key, refine=True, **kwargs)[source]
Parameters:
  • key (str) – key pointing to a dataset previously loaded with set_dataset

  • refine (bool) – if True, refine structures before calculating DOPE score

Returns:

dictionary containing DOPE score of dataset, and its decoded counterpart

get_encoded(key, update=False)[source]
Parameters:
  • key (str) – key pointing to a dataset previously loaded with set_dataset

  • update (bool) – if True, re-encode and overwrite the existing data

Returns:

array containing the encoding in latent space of dataset associated with key

get_error(key, align=True)[source]

Calculate the reconstruction error of a dataset encoded and decoded by a trained neural network.

Parameters:
  • key (str) – key pointing to a dataset previously loaded with set_dataset

  • align (bool) – if True, the RMSD will be calculated by finding the optimal alignment between structures

Returns:

1D array containing the RMSD between input structures and their encoded-decoded counterparts
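To make the effect of align concrete, the following is a minimal NumPy sketch of an RMSD computed with and without optimal superposition (Kabsch algorithm). It is not molearn's implementation; the function name kabsch_rmsd and the toy coordinates are assumptions made for illustration.

```python
import numpy as np

def kabsch_rmsd(a, b, align=True):
    """RMSD between two (N, 3) coordinate sets, optionally after superposition."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if align:
        # Remove translation by centring both structures
        a = a - a.mean(axis=0)
        b = b - b.mean(axis=0)
        # Optimal rotation from the SVD of the covariance matrix (Kabsch)
        u, _, vt = np.linalg.svd(a.T @ b)
        d = np.sign(np.linalg.det(u @ vt))
        # Flip the last singular direction if needed to avoid a reflection
        rot = u @ np.diag([1.0, 1.0, d]) @ vt
        a = a @ rot
    return float(np.sqrt(((a - b) ** 2).sum(axis=1).mean()))

# Toy check: b is a rotated and translated copy of a
rng = np.random.default_rng(0)
a = rng.normal(size=(10, 3))
theta = 0.7
rz = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
b = a @ rz + np.array([1.0, -2.0, 0.5])

aligned = kabsch_rmsd(a, b, align=True)      # close to zero
unaligned = kabsch_rmsd(a, b, align=False)   # dominated by the rigid-body motion
```

With align=False in this sketch, any rigid-body displacement between input and reconstruction is counted as error, which is why the aligned value is usually the more meaningful measure of reconstruction quality.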

get_inversions(key)[source]

Get the chirality of Cα atoms in a dataset and its decoded counterpart.

get_ramachandran(key)[source]
Parameters:

key (str) – key pointing to a dataset previously loaded with set_dataset

num_trainable_params()[source]
Returns:

number of trainable parameters in the neural network previously loaded with set_network

reference_dope_score(frame)[source]
Parameters:

frame (numpy.array) – array with shape [1, N, 3] with Cartesian coordinates of atoms

Returns:

DOPE score

scan_bondlength()[source]

Derive bond-length statistics across the latent grid.

Each grid structure is analysed for backbone bond lengths, and the mean and standard deviation for every bond type are cached in surfaces using descriptive keys (e.g. "N-CA" and "N-CA_std").

Raises:

ValueError – If the latent grid has not been initialised.
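The key naming described above can be illustrated with a short NumPy sketch that computes the mean and standard deviation of one bond type for a single structure. The helper bond_length_stats and the toy coordinates are assumptions; molearn's own atom selection and bookkeeping are not reproduced here.

```python
import numpy as np

def bond_length_stats(first, second, name):
    """Mean and std of distances between paired atoms given as (R, 3) arrays."""
    lengths = np.linalg.norm(np.asarray(second) - np.asarray(first), axis=1)
    # Mirror the "<bond>" / "<bond>_std" naming used for the cached surfaces
    return {name: float(lengths.mean()), f"{name}_std": float(lengths.std())}

# Two toy residues with N-CA distances of 1.4 and 1.5
n_atoms = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
ca_atoms = np.array([[1.4, 0.0, 0.0], [5.0, 1.5, 0.0]])
stats = bond_length_stats(n_atoms, ca_atoms, "N-CA")
```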

scan_ca_chirality()[source]

Populate a surface describing Cα chirality inversions.

Decodes the latent grid and counts the number of chirality inversions per structure (i.e. negative triple products for Cα neighbourhoods). The resulting surface is cached under the "Chirality" key within surfaces.

Raises:

ValueError – If the latent grid has not been initialised prior to invocation.
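As an illustration of the triple-product idea, the sketch below computes a signed volume around each Cα and counts the negative ones. The choice of neighbouring atoms (N, C, CB), the vector ordering, and therefore which sign denotes an inversion are all assumptions here, as are the function names; the library may use a different convention.

```python
import numpy as np

def ca_triple_products(n, ca, c, cb):
    """Signed triple product (N-CA) . ((C-CA) x (CB-CA)) per residue; inputs are (R, 3)."""
    return np.einsum("ij,ij->i", n - ca, np.cross(c - ca, cb - ca))

def count_negative(n, ca, c, cb):
    """Number of residues whose triple product is negative."""
    return int((ca_triple_products(n, ca, c, cb) < 0).sum())

# Toy data: random "residues" and their mirror image through the yz-plane
rng = np.random.default_rng(1)
n, ca, c, cb = (rng.normal(size=(8, 3)) for _ in range(4))
mirror = np.array([-1.0, 1.0, 1.0])

original = ca_triple_products(n, ca, c, cb)
mirrored = ca_triple_products(n * mirror, ca * mirror, c * mirror, cb * mirror)
```

Mirroring a structure flips the sign of every triple product, which is what makes the sign usable as a chirality indicator.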

scan_custom(fct, params, key)[source]

Evaluate a user-defined metric over the latent grid.

Decodes the latent grid, applies fct to each structure, and stores the resulting surface under key within surfaces.

Parameters:
  • fct – Callable accepting coordinates shaped (1, N, 3) (after rescaling) and returning a scalar.

  • params – Iterable of additional parameters forwarded to fct.

  • key – Cache key for the resulting surface.

Returns:

Tuple (surface, xvals, yvals) with surface shaped (samples, samples).

Raises:

ValueError – If the latent grid has not been initialised.
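A callable of the kind fct describes might look like the sketch below, which uses the radius of gyration as an example metric. The random array stands in for decoded grid structures, and the row-major reshape onto the grid is an assumption; how params is forwarded to fct is not shown because the source does not specify it.

```python
import numpy as np

def radius_of_gyration(coords):
    """Example metric: radius of gyration of one structure shaped (1, N, 3)."""
    xyz = np.asarray(coords, dtype=float)[0]
    centred = xyz - xyz.mean(axis=0)
    return float(np.sqrt((centred ** 2).sum(axis=1).mean()))

# Stand-in for decoded grid structures: samples * samples structures of n_atoms atoms
samples, n_atoms = 4, 6
rng = np.random.default_rng(0)
decoded = rng.normal(size=(samples * samples, n_atoms, 3))

# Apply the metric one structure at a time, then fold the values back onto the grid
surface = np.array([radius_of_gyration(s[None]) for s in decoded])
surface = surface.reshape(samples, samples)
```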

scan_dope(key=None, refine=True, **kwargs)[source]

Score decoded structures with DOPE over the latent grid.

Each point on the latent grid is decoded, optionally refined, and assessed using the DOPE potential. The resulting surface is cached under key for reuse. When refine == "both" the returned array contains two channels: the raw DOPE score and the score after refinement.

Parameters:
  • key – Optional cache key. When omitted a descriptive key based on the refine option is generated.

  • refine – If True (default) the decoded coordinates are minimised prior to scoring. When set to "both" raw and refined scores are computed.

  • kwargs – Extra keyword arguments forwarded to get_all_dope_score() (e.g. Modeller configuration).

Returns:

Tuple (surface, xvals, yvals) where surface is of shape (samples, samples) or (samples, samples, 2) when refine is "both".

Raises:

ValueError – If the latent grid has not been initialised.

scan_error(s_key='Network_RMSD', z_key='Network_z_drift')[source]

Evaluate autoencoder consistency over the latent grid.

The method decodes the latent grid, re-encodes the resulting structures and performs a second decode to estimate reconstruction drift. Two surfaces are produced: RMSD between first- and second-pass decodes, and the latent space drift between original and re-encoded grid coordinates. Results are cached in surfaces using the provided keys.

Parameters:
  • s_key – Cache key for the RMSD surface (overwritten internally to the default value in order to maintain backwards compatibility).

  • z_key – Cache key for the latent drift surface (also overwritten to the default value).

Returns:

Tuple (rmsd_surface, z_surface, xvals, yvals) where each surface is shaped (samples, samples).

Raises:

ValueError – If a latent grid has not been initialised via setup_grid().
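The decode, re-encode, decode cycle described above can be sketched with a toy linear autoencoder. Everything in this example is an assumption made for illustration: the weight matrices W and P, the decode and encode helpers, the unaligned RMSD, and the Euclidean definition of latent drift. It does not reproduce the behaviour of a trained molearn network.

```python
import numpy as np

rng = np.random.default_rng(0)
n_atoms = 4
W = rng.normal(size=(2, n_atoms * 3))   # toy linear "decoder" weights
P = 0.9 * np.linalg.pinv(W)             # toy "encoder": a slightly contracting inverse

def decode(z):
    """Map (K, 2) latent points to (K, n_atoms, 3) structures."""
    return (z @ W).reshape(len(z), n_atoms, 3)

def encode(x):
    """Map (K, n_atoms, 3) structures back to (K, 2) latent points."""
    return x.reshape(len(x), -1) @ P

# Latent grid, flattened to (samples * samples, 2)
samples = 5
xvals = np.linspace(-1.0, 1.0, samples)
yvals = np.linspace(-1.0, 1.0, samples)
z = np.stack(np.meshgrid(xvals, yvals), axis=-1).reshape(-1, 2)

first = decode(z)          # first-pass decode
z_again = encode(first)    # re-encode the decoded structures
second = decode(z_again)   # second-pass decode

# RMSD between first- and second-pass decodes, one value per grid point
rmsd_surface = np.sqrt(((first - second) ** 2).sum(axis=2).mean(axis=1))
# Distance between original and re-encoded latent coordinates
z_surface = np.linalg.norm(z_again - z, axis=1)

rmsd_surface = rmsd_surface.reshape(samples, samples)
z_surface = z_surface.reshape(samples, samples)
```

In this toy setup the encoder shrinks every latent point by 10%, so the drift grows linearly with distance from the origin; for a real network, regions of large drift indicate parts of the latent space where encoder and decoder disagree.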

scan_error_from_target(key, index=None, align=True)[source]

Compute an RMSD surface against a specific target structure.

A dataset must already be registered under key and a latent space grid generated via setup_grid(). The method decodes the grid, compares each structure against the selected target, and caches the resulting surface in surfaces under an autogenerated key.

Parameters:
  • key – Dataset label previously registered with set_dataset().

  • index – Optional index selecting a single structure from a dataset containing multiple conformations. If omitted, the dataset is expected to contain exactly one frame.

  • align – Whether to superimpose decoded structures onto the target prior to RMSD evaluation.

Returns:

Tuple (surface, xvals, yvals) where surface is a (samples, samples) NumPy array containing RMSD values and xvals/yvals are the latent grid axes.

Raises:

ValueError – If the latent grid has not been initialised or the target dataset does not contain exactly one conformation.

scan_ramachandran()[source]

Evaluate Ramachandran statistics for each decoded grid structure.

The method decodes the latent grid, executes Ramachandran scoring for every structure, and stores four resulting surfaces (favoured/allowed/outliers/total) in surfaces with keys prefixed by "Ramachandran_".

Returns:

Tuple (favoured_surface, xvals, yvals); only the favoured surface is returned, for convenience. The remaining surfaces can be retrieved from surfaces using their respective keys.

Raises:

ValueError – If the latent grid has not been initialised.

set_dataset(key, data, atomselect='protein')[source]
Parameters:
  • key (str) – label to be associated with data

  • data – PDBData object containing atomic coordinates

  • atomselect (list/str) – list of atom names to load, or ‘protein’ to indicate that all atoms are loaded.

set_decoded(key, structures)[source]
Parameters:
  • key (str) – key pointing to a dataset previously loaded with set_dataset

  • structures – torch.Tensor containing the decoded structures to be associated with the key.

set_encoded(key, coords)[source]
Parameters:
  • key (str) – key pointing to a dataset previously loaded with set_dataset

  • coords – coordinates in latent space to be associated with the key.

set_network(network)[source]
Parameters:

network – a trained neural network defined in molearn.models

setup_grid(samples=64, bounds_from=None, bounds=None, padding=0.1)[source]

Define an NxN point grid regularly sampling the latent space, where N is the value of samples.

Parameters:
  • samples (int) – grid size (build a samples x samples grid)

  • bounds_from (str/list) – Name(s) of datasets to use as reference, either as single string, a list of strings, or ‘all’

  • bounds (tuple/list) – tuple (xmin, xmax, ymin, ymax) or None

  • padding (float) – define size of extra spacing around boundary conditions (as ratio of axis dimensions)
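As a rough illustration of how samples, bounds and padding interact, the sketch below builds grid axes with NumPy. It is not molearn's implementation: the helper name make_grid_axes and the exact way padding is applied (a fraction of each axis range added on both sides) are assumptions.

```python
import numpy as np

def make_grid_axes(bounds, samples=64, padding=0.1):
    """Hypothetical helper: return (xvals, yvals) for a samples x samples grid.

    bounds is (xmin, xmax, ymin, ymax); padding is assumed to be a fraction
    of each axis range, added on both sides.
    """
    xmin, xmax, ymin, ymax = bounds
    xpad = (xmax - xmin) * padding
    ypad = (ymax - ymin) * padding
    xvals = np.linspace(xmin - xpad, xmax + xpad, samples)
    yvals = np.linspace(ymin - ypad, ymax + ypad, samples)
    return xvals, yvals

xvals, yvals = make_grid_axes((0.0, 1.0, -2.0, 2.0), samples=5, padding=0.1)
# Every grid point as an (x, y) pair, shape (samples * samples, 2)
grid = np.stack(np.meshgrid(xvals, yvals), axis=-1).reshape(-1, 2)
```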

class MolearnGUI(MA=None)[source]

This class produces an interactive visualisation for data stored in a MolearnAnalysis object, viewable within a Jupyter notebook.

Parameters:

MA – Either MolearnAnalysis instance, or None (default). If None an empty GUI will be produced.

get_path(idx_start, idx_end, landscape, xvals, yvals, smooth=3)[source]

Find shortest path between two points on a weighted grid

Parameters:
  • idx_start (int) – index on a 2D grid, as start point for a path

  • idx_end (int) – index on a 2D grid, as end point for a path

  • landscape (numpy.array) – 2D grid

  • xvals (numpy.array) – x-axis values, to yield actual coordinates

  • yvals (numpy.array) – y-axis values, to yield actual coordinates

  • smooth (int) – size of kernel for running average (must be >=1, default 3)

Returns:

array of 2D coordinates each with an associated value on landscape
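For readers unfamiliar with shortest paths on a weighted grid, here is a minimal Dijkstra sketch. It is an illustration under stated assumptions, not the method used by the library: the function name grid_shortest_path is hypothetical, grid points are addressed as (row, column) tuples rather than a single integer index, moves are restricted to the four direct neighbours, the cost of a step is the non-negative value of the cell being entered, and the smoothing step is omitted.

```python
import heapq
import numpy as np

def grid_shortest_path(landscape, start, end):
    """Dijkstra on a 4-connected grid; entering a cell costs its (non-negative) value."""
    landscape = np.asarray(landscape, dtype=float)
    rows, cols = landscape.shape
    dist = np.full((rows, cols), np.inf)
    dist[start] = 0.0
    previous = {}
    queue = [(0.0, start)]
    while queue:
        d, node = heapq.heappop(queue)
        if node == end:
            break
        if d > dist[node]:
            continue  # stale queue entry, a shorter route was already found
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + landscape[nr, nc]
                if nd < dist[nr, nc]:
                    dist[nr, nc] = nd
                    previous[(nr, nc)] = node
                    heapq.heappush(queue, (nd, (nr, nc)))
    # Walk back from the end point to recover the path
    path = [end]
    while path[-1] != start:
        path.append(previous[path[-1]])
    return path[::-1]

# A high-valued wall in the middle column forces the path around it
landscape = np.array([
    [1.0, 9.0, 1.0],
    [1.0, 9.0, 1.0],
    [1.0, 1.0, 1.0],
])
path = grid_shortest_path(landscape, (0, 0), (0, 2))
```

On an error or energy surface, a path found this way tends to follow low-valued regions, so the transition it traces avoids poorly reconstructed or high-energy areas of the latent space.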

get_path_aggregate(crd, landscape, xvals, yvals, input_is_index=False)[source]

Create a chain of shortest paths via given waypoints.

Parameters:
  • crd (numpy.array) – waypoints coordinates (Nx2 array)

  • landscape (numpy.array) – 2D grid

  • xvals (numpy.array) – x-axis values, to yield actual coordinates

  • yvals (numpy.array) – y-axis values, to yield actual coordinates

  • input_is_index (bool) – if False, assume crd contains actual coordinates, graph indexing otherwise

Returns:

array of 2D coordinates each with an associated value on landscape

get_point_index(crd, xvals, yvals)[source]

Extract index (of 2D surface) closest to a given real value coordinate

Parameters:
  • crd (numpy.array/list) – coordinate

  • xvals (numpy.array) – x-axis of surface

  • yvals (numpy.array) – y-axis of surface

Returns:

1D array with x,y coordinates
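The lookup can be pictured as a nearest-neighbour search along each axis. The sketch below assumes regularly sampled axes and uses a hypothetical function name; the ordering of the returned pair (x index first, then y index) is also an assumption.

```python
import numpy as np

def nearest_grid_index(crd, xvals, yvals):
    """Return the [ix, iy] index of the grid node closest to coordinate crd."""
    ix = int(np.argmin(np.abs(np.asarray(xvals) - crd[0])))
    iy = int(np.argmin(np.abs(np.asarray(yvals) - crd[1])))
    return np.array([ix, iy])

xvals = np.linspace(0.0, 1.0, 5)    # 0, 0.25, 0.5, 0.75, 1
yvals = np.linspace(-1.0, 1.0, 5)   # -1, -0.5, 0, 0.5, 1
idx = nearest_grid_index([0.3, 0.9], xvals, yvals)
```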

oversample(crd, pts=10)[source]

Add extra equally spaced points between a list of points.

Parameters:
  • crd (numpy.array) – Nx2 numpy array with latent space coordinates

  • pts (int) – number of extra points to add in each interval

Returns:

Mx2 numpy array, with M>=N.
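One way to picture the operation is linear interpolation within each interval, as in the sketch below. The function name oversample_sketch is hypothetical, and the resulting point count, (N - 1) * (pts + 1) + 1, follows from this sketch rather than from the library.

```python
import numpy as np

def oversample_sketch(crd, pts=10):
    """Insert pts equally spaced points in each interval of an (N, 2) path."""
    crd = np.asarray(crd, dtype=float)
    # Fractions 0, 1/(pts+1), ..., pts/(pts+1): the interval start plus pts new points
    t = np.arange(pts + 1) / (pts + 1)
    segments = [
        start + t[:, None] * (end - start)
        for start, end in zip(crd[:-1], crd[1:])
    ]
    # Append the final waypoint, which no segment includes
    return np.vstack(segments + [crd[-1:]])

path = oversample_sketch([[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]], pts=3)
```

The denser path can then be used as latent coordinates for structure generation, giving a smoother sequence of conformations between waypoints.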

plot_analysis_surface(MA, dataset, cmap='gist_heat_r', fname=None, **kwargs)[source]

Plot a specific dataset overlaid on its analysis surface.

Parameters:
  • MA (MolearnAnalysis) – Analysis instance containing latent grid axes and the requested surface entry.

  • dataset (str) – Key identifying the surface to render. Use MolearnAnalysis.scan_dataset() (or similar) to populate MA.surfaces[dataset] beforehand.

  • cmap (str) – Matplotlib colormap name applied to the surface.

  • fname (Path) – Destination path for the saved figure. When omitted the plot is displayed only.

Returns:

None

plot_angle_hist(MA, plot_data=None, bins: int = 100, wkdir=None, **kwargs)[source]

Plot bond-angle distributions for original and decoded structures.

Parameters:
  • MA (MolearnAnalysis) – A MolearnAnalysis object with datasets loaded.

  • plot_data (list) – A list of tuples containing the data to plot. Each tuple should contain the key to an original/encoded dataset in the MolearnAnalysis object, a label for the legend, and a colour for the decoded data plot. Format: [(key, label, colour), …]

  • bins (int) – Number of bins in the histogram.

  • wkdir (Path) – A directory where figures are saved.

Returns:

None

plot_bondlength_hist(MA, plot_data=None, bins: int = 100, wkdir=None, **kwargs)[source]

Plot bond-length distributions for original and decoded structures.

Parameters:
  • MA (MolearnAnalysis) – A MolearnAnalysis object with datasets loaded.

  • plot_data (list) – A list of tuples containing the data to plot. Each tuple should contain the key to an original/encoded dataset in the MolearnAnalysis object, a label for the legend, and a colour for the decoded data plot. Format: [(key, label, colour), …]

  • bins (int) – Number of bins in the histogram.

  • wkdir (Path) – A directory where figures are saved.

Returns:

None

plot_dihedral_hist(MA, plot_data=None, bins: int = 100, wkdir=None, **kwargs)[source]

Plot backbone dihedral distributions for dataset and decoded structures.

Parameters:
  • MA (MolearnAnalysis) – A MolearnAnalysis object with datasets loaded.

  • plot_data (list) – A list of tuples containing the data to plot. Each tuple should contain the key to an original/encoded dataset in the MolearnAnalysis object, a label for the legend, and a colour for the decoded data plot. Format: [(key, label, colour), …]

  • bins (int) – Number of bins in the histogram.

  • wkdir (Path) – A directory where figures are saved.

Returns:

None

plot_dope_hist(MA, plot_data=None, fname=None, refine=True, **kwargs)[source]

Plot DOPE score distributions as split violin plots comparing datasets and decoded structures.

Parameters:
  • MA (MolearnAnalysis) – A MolearnAnalysis object with datasets loaded.

  • plot_data (list) – A list of tuples containing the data to plot. Each tuple should contain the key to an original/decoded dataset in the MolearnAnalysis object, a label for the legend, and a colour for the decoded data plot. Format: [(key, label, colour), …]

  • fname (Path) – File name to save the plot.

  • refine (bool) – If True, refine structures before calculating DOPE score. Can also be ‘both’ to plot both refined and unrefined.

Returns:

None

plot_dope_surface(MA, refine=True, truncate_at=None, plot_data=None, cmap='gist_heat_r', fname=None, **kwargs)[source]

Plot the latent grid coloured by decoded DOPE scores.

Parameters:
  • MA (MolearnAnalysis) – Analysis instance containing latent grid axes and precomputed DOPE surfaces via MolearnAnalysis.scan_dope().

  • refine (bool) – When True the refined DOPE surface ('DOPE_refined') is rendered; otherwise the unrefined surface ('DOPE_unrefined') is used.

  • truncate_at (float) – Upper bound for the colour scale. Defaults to the maximum value in the selected surface.

  • plot_data (list) – Optional overlays given as (key, label, colour, plot_type) tuples, matching the semantics described for plot_network_rmsd_surface().

  • cmap (str) – Matplotlib colormap name applied to the surface.

  • fname (Path) – Destination path for the saved figure. When omitted the plot is displayed only.

Returns:

None

plot_inversion_hist(MA, plot_data, fname=None, **kwargs)[source]

Plot the distribution of the number of D-amino acids per structure in each dataset as bar plots.

Parameters:
  • MA (MolearnAnalysis) – A MolearnAnalysis object with datasets loaded.

  • plot_data (list) – A list of tuples containing the data to plot. Each tuple should contain the key to an original/encoded dataset in the MolearnAnalysis object, a label for the legend, and a colour for the decoded data plot. Format: [(key, label, colour), …]

  • fname (Path) – File name of the plot.

  • refine (bool) – if True, refine structures before calculating DOPE score

  • kwargs (dict) – Additional keyword arguments to pass to plt.savefig.

Returns:

None

plot_inversion_surface(MA, plot_data=None, levels=10, cmap='gist_heat_r', fname=None, **kwargs)[source]

Plot the latent grid coloured by predicted D-amino acid counts.

Parameters:
  • MA (MolearnAnalysis) – Analysis instance with latent grid axes and chirality surface data produced by MolearnAnalysis.scan_ca_chirality().

  • plot_data (list) – Optional overlays given as (key, label, colour, plot_type) tuples, matching the semantics described for plot_network_rmsd_surface().

  • levels (int) – Number of discrete contour levels used for the colour map.

  • cmap (str) – Matplotlib colormap name applied to the surface.

  • fname (Path) – Destination path for the saved figure. When omitted the plot is displayed only.

Returns:

None

plot_network_rmsd_surface(MA, plot_data=None, cmap='gist_heat_r', fname=None, **kwargs)[source]

Plot the latent grid coloured by reconstruction RMSD.

Parameters:
  • MA (MolearnAnalysis) – Analysis instance with latent grid axes and a precomputed surfaces['Network_RMSD'] entry (produced by MolearnAnalysis.scan_error()).

  • plot_data (list) – Optional overlays given as (key, label, colour, plot_type) tuples. plot_type must be either 'scatter' or 'kde' and each key must correspond to an encoded dataset available through MolearnAnalysis.get_encoded().

  • cmap (str) – Matplotlib colormap name for the background surface.

  • fname (Path) – Destination path for the saved figure. When omitted the plot is displayed only.

Returns:

None
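The overlay format described for plot_data can be written out as plain Python tuples. In the sketch below the dataset keys, labels and colours are made-up examples, and check_plot_data is a hypothetical validator that is not part of the documented API; it only encodes the constraints stated above.

```python
ALLOWED_PLOT_TYPES = {"scatter", "kde"}

def check_plot_data(plot_data):
    """Hypothetical validator for (key, label, colour, plot_type) overlay tuples."""
    for entry in plot_data:
        if len(entry) != 4:
            raise ValueError(f"expected 4 items, got {len(entry)}: {entry!r}")
        if entry[3] not in ALLOWED_PLOT_TYPES:
            raise ValueError(f"plot_type must be 'scatter' or 'kde', got {entry[3]!r}")
    return True

# Example overlays: keys are assumed to name datasets that have been encoded
plot_data = [
    ("training", "Training set", "tab:blue", "scatter"),
    ("test", "Test set", "tab:orange", "kde"),
]
check_plot_data(plot_data)
```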

plot_rmsd_hist(MA, plot_data=None, fname=None, **kwargs)[source]

Plot distributions of RMSD scores of the chosen datasets.

Parameters:
  • MA (MolearnAnalysis) – A MolearnAnalysis object with datasets loaded.

  • plot_data (list) – A list of tuples containing the data to plot. Each tuple should contain a pair of keys corresponding to two datasets in MolearnAnalysis object which would be plotted on a split violin, a label for the legend, and keywords indicating if datasets are train or test. Format: [(key1, key2, label, kw1, kw2), …]

  • fname (Path) – The file name to save the plots.

  • kwargs (dict) – Additional keyword arguments to pass to plt.savefig.

Returns:

None