Analysis
- class MolearnAnalysis(batch_size=1, processes=1)
This class provides methods dedicated to the quality analysis and structure generation with a trained model.
- generate(crd: ndarray[tuple[int, int], dtype[float64]], pdb_path: str | None = None, relax: bool = False) → ndarray[tuple[int, int, int], dtype[float64]]
Generate a collection of protein conformations, given coordinates in the latent space.
- Parameters:
crd (numpy.array) – coordinates in the latent space, as a (Nx2) array
pdb_path (str) – path where the generated structures should be stored as PDB files named s_i.pdb, where i is the index in the crd array
relax (bool) – if True, relax the generated structures with energy minimisation; relaxed structures are saved as s_i_relaxed.pdb
- Returns:
collection of protein conformations in the Cartesian space (NxMx3, where M is the number of atoms in the protein)
- get_all_dope_score(tensor, refine=True)
Calculate DOPE score of an ensemble of atom coordinates.
- Parameters:
tensor – torch.Tensor or numpy.ndarray with shape [B, N, 3] containing Cartesian coordinates of atoms.
refine (bool) – if True, return DOPE score of input and output structure after refinement
- get_all_ramachandran_score(tensor)
Calculate Ramachandran score of an ensemble of atomic coordinates.
- Parameters:
tensor – torch.Tensor or numpy.ndarray with shape [B, N, 3] containing Cartesian coordinates of atoms.
- Returns:
dictionary with keys ‘favored’, ‘allowed’, ‘outliers’, and ‘total’ containing arrays of Ramachandran scores
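Ramachandran scoring is based on backbone φ/ψ dihedral angles. The following is a minimal, dependency-free sketch of how a single dihedral can be computed from four atom positions. It is not molearn's implementation, and the helper name `dihedral_deg` is hypothetical.

```python
import math

def _sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def _dot(a, b):
    return sum(a[i] * b[i] for i in range(3))

def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dihedral_deg(p0, p1, p2, p3):
    """Dihedral angle in degrees around the p1-p2 axis (hypothetical helper)."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = [c / norm for c in b1]
    # Project the outer bonds onto the plane perpendicular to the central bond
    v = [b0[i] - _dot(b0, b1) * b1[i] for i in range(3)]
    w = [b2[i] - _dot(b2, b1) * b1[i] for i in range(3)]
    # atan2 keeps the sign of the rotation and is stable near 0 and 180 degrees
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))
```

For a protein backbone, φ would use the atoms C(i-1), N(i), CA(i), C(i) and ψ the atoms N(i), CA(i), C(i), N(i+1).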
- get_dataset(key, scale=False)
- Parameters:
key (str) – key pointing to a dataset previously loaded with set_dataset
scale (bool) – if True, return the dataset scaled (i.e. with mean and std applied)
- Returns:
torch.Tensor for dataset with the key
- get_decoded(key, update=False, scale=False)
- Parameters:
key (str) – key pointing to a dataset previously loaded with set_dataset
update (bool) – if True, re-decode and overwrite the existing data
scale (bool) – if True, return the dataset scaled (i.e. with mean and std applied)
- Returns:
torch.Tensor for decoded dataset with the key
- get_dope(key, refine=True, **kwargs)
- Parameters:
key (str) – key pointing to a dataset previously loaded with set_dataset
refine (bool) – if True, refine structures before calculating DOPE score
- Returns:
dictionary containing DOPE score of dataset, and its decoded counterpart
- get_encoded(key, update=False)
- Parameters:
key (str) – key pointing to a dataset previously loaded with set_dataset
update (bool) – if True, re-encode and overwrite the existing data
- Returns:
array containing the encoding in latent space of dataset associated with key
- get_error(key, align=True)
Calculate the reconstruction error of a dataset encoded and decoded by a trained neural network.
- Parameters:
key (str) – key pointing to a dataset previously loaded with set_dataset
align (bool) – if True, the RMSD will be calculated by finding the optimal alignment between structures
- Returns:
1D array containing the RMSD between input structures and their encoded-decoded counterparts
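As an illustration of the quantity returned, the sketch below computes the RMSD between one input structure and its reconstruction without any superposition, which corresponds conceptually to align=False. The function name `rmsd` is hypothetical and the code is not taken from molearn.

```python
import math

def rmsd(a, b):
    """RMSD between two conformations given as sequences of (x, y, z) atoms."""
    if len(a) != len(b):
        raise ValueError("both structures must contain the same number of atoms")
    # Sum of squared per-atom displacements over all three Cartesian axes
    sq = sum((pa[i] - pb[i]) ** 2 for pa, pb in zip(a, b) for i in range(3))
    return math.sqrt(sq / len(a))
```

With align=True an optimal superposition of the two structures would be found first, so the reported value can only be equal to or lower than the unaligned one.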
- get_inversions(key)
Get the chirality of Cα atoms in a dataset and its decoded counterpart.
- get_ramachandran(key)
- Parameters:
key (str) – key pointing to a dataset previously loaded with set_dataset
- num_trainable_params()
- Returns:
number of trainable parameters in the neural network previously loaded with set_network
- reference_dope_score(frame)
- Parameters:
frame (numpy.array) – array with shape [1, N, 3] with Cartesian coordinates of atoms
- Returns:
DOPE score
- scan_bondlength()
Derive bond-length statistics across the latent grid.
Each grid structure is analysed for backbone bond lengths, and the mean and standard deviation for every bond type are cached in surfaces using descriptive keys (e.g. "N-CA" and "N-CA_std").
- Raises:
ValueError – If the latent grid has not been initialised.
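The key naming scheme can be illustrated with a small sketch that computes the mean and standard deviation of one bond type within a single structure. It assumes the statistics are taken over all bonds of that type in the structure and uses the population standard deviation; both are assumptions rather than documented behaviour, and `bond_type_stats` is a hypothetical helper.

```python
import math
import statistics

def bond_type_stats(structure, pairs, name):
    """Mean and std of the bond lengths for the atom index pairs of one bond type."""
    lengths = [math.dist(structure[i], structure[j]) for i, j in pairs]
    # Keys follow the "<bond>" / "<bond>_std" pattern described above
    return {name: statistics.mean(lengths), f"{name}_std": statistics.pstdev(lengths)}

# Toy structure: atoms 0-1 and 2-3 stand in for two N-CA bonds
structure = [[0, 0, 0], [1, 0, 0], [5, 0, 0], [8, 0, 0]]
stats = bond_type_stats(structure, [(0, 1), (2, 3)], "N-CA")
```

Evaluating such a function at every grid point would give one surface per key.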
- scan_ca_chirality()
Populate a surface describing Cα chirality inversions.
Decodes the latent grid and counts the number of chirality inversions per structure (i.e. negative triple products for Cα neighbourhoods). The resulting surface is cached under the "Chirality" key within surfaces.
- Raises:
ValueError – If the latent grid has not been initialised prior to invocation.
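The triple-product test mentioned above can be sketched as follows. Which neighbouring atoms are used, and in which order, is an assumption made here for illustration (N, C and CB around CA); the point is only that mirroring a centre flips the sign of the product. `triple_product` is a hypothetical helper.

```python
def triple_product(centre, a, b, c):
    """Signed volume spanned by the vectors from centre to a, b and c."""
    u = [a[i] - centre[i] for i in range(3)]
    v = [b[i] - centre[i] for i in range(3)]
    w = [c[i] - centre[i] for i in range(3)]
    cross = [v[1] * w[2] - v[2] * w[1],
             v[2] * w[0] - v[0] * w[2],
             v[0] * w[1] - v[1] * w[0]]
    return sum(u[i] * cross[i] for i in range(3))

# Idealised neighbourhood and its mirror image through the xy-plane
ca, n, c, cb = [0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]
cb_mirrored = [cb[0], cb[1], -cb[2]]
original = triple_product(ca, n, c, cb)
inverted = triple_product(ca, n, c, cb_mirrored)
```

Counting the centres whose product has the "wrong" sign gives one inversion count per structure.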
- scan_custom(fct, params, key)
Evaluate a user-defined metric over the latent grid.
Decodes the latent grid, applies fct to each structure, and stores the resulting surface under key within surfaces.
- Parameters:
fct – Callable accepting coordinates shaped (1, N, 3) (after rescaling) and returning a scalar.
params – Iterable of additional parameters forwarded to fct.
key – Cache key for the resulting surface.
- Returns:
Tuple (surface, xvals, yvals) with surface shaped (samples, samples).
- Raises:
ValueError – If the latent grid has not been initialised.
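A metric suitable for fct could look like the sketch below, which returns the radius of gyration of a structure shaped (1, N, 3). The function and its `scale` argument are hypothetical, and the assumption that extra parameters arrive as positional arguments (`fct(coords, *params)`) is not confirmed by this page.

```python
import math

def radius_of_gyration(coords, scale=1.0):
    """Radius of gyration of a single structure shaped (1, N, 3)."""
    atoms = coords[0]  # drop the leading batch dimension
    n = len(atoms)
    centre = [sum(float(atom[i]) for atom in atoms) / n for i in range(3)]
    sq = sum((float(atom[i]) - centre[i]) ** 2 for atom in atoms for i in range(3))
    return scale * math.sqrt(sq / n)

# Works on nested lists as well as on array-like inputs that support iteration
example = [[[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]]
value = radius_of_gyration(example)
```

Because the function only iterates over its input, the same code can be tried on plain lists before being passed to scan_custom.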
- scan_dope(key=None, refine=True, **kwargs)
Score decoded structures with DOPE over the latent grid.
Each point on the latent grid is decoded, optionally refined, and assessed using the DOPE potential. The resulting surface is cached under key for reuse. When refine == "both" the returned array contains two channels: the raw DOPE score and the score after refinement.
- Parameters:
key – Optional cache key. When omitted a descriptive key based on the refine option is generated.
refine – If True (default) the decoded coordinates are minimised prior to scoring. When set to "both" raw and refined scores are computed.
kwargs – Extra keyword arguments forwarded to get_all_dope_score() (e.g. Modeller configuration).
- Returns:
Tuple (surface, xvals, yvals) where surface is of shape (samples, samples) or (samples, samples, 2) when refine is "both".
- Raises:
ValueError – If the latent grid has not been initialised.
- scan_error(s_key='Network_RMSD', z_key='Network_z_drift')
Evaluate autoencoder consistency over the latent grid.
The method decodes the latent grid, re-encodes the resulting structures and performs a second decode to estimate reconstruction drift. Two surfaces are produced: RMSD between first- and second-pass decodes, and the latent space drift between original and re-encoded grid coordinates. Results are cached in surfaces using the provided keys.
- Parameters:
s_key – Cache key for the RMSD surface (overwritten internally to the default value in order to maintain backwards compatibility).
z_key – Cache key for the latent drift surface (also overwritten to the default value).
- Returns:
Tuple (rmsd_surface, z_surface, xvals, yvals) where each surface is shaped (samples, samples).
- Raises:
ValueError – If a latent grid has not been initialised via setup_grid().
- scan_error_from_target(key, index=None, align=True)
Compute an RMSD surface against a specific target structure.
A dataset must already be registered under key and a latent space grid generated via setup_grid(). The method decodes the grid, compares each structure against the selected target, and caches the resulting surface in surfaces under an autogenerated key.
- Parameters:
key – Dataset label previously registered with set_dataset().
index – Optional index selecting a single structure from a dataset containing multiple conformations. If omitted, the dataset is expected to contain exactly one frame.
align – Whether to superimpose decoded structures onto the target prior to RMSD evaluation.
- Returns:
Tuple (surface, xvals, yvals) where surface is a (samples, samples) NumPy array containing RMSD values and xvals/yvals are the latent grid axes.
- Raises:
ValueError – If the latent grid has not been initialised or the target dataset does not contain exactly one conformation.
- scan_ramachandran()
Evaluate Ramachandran statistics for each decoded grid structure.
The method decodes the latent grid, executes Ramachandran scoring for every structure, and stores four resulting surfaces (favoured/allowed/outliers/total) in surfaces with keys prefixed by "Ramachandran_".
- Returns:
Tuple (favoured_surface, xvals, yvals) with the preferred surface for convenience. The remaining surfaces can be retrieved from surfaces using their respective keys.
- Raises:
ValueError – If the latent grid has not been initialised.
- set_dataset(key, data, atomselect='protein')
- Parameters:
key (str) – label to be associated with data
data – PDBData object containing atomic coordinates
atomselect (list/str) – list of atom names to load, or ‘protein’ to indicate that all atoms are loaded.
- set_decoded(key, structures)
- Parameters:
key (str) – key pointing to a dataset previously loaded with set_dataset
structures – torch.Tensor containing the decoded structures to be associated with the key.
- set_encoded(key, coords)
- Parameters:
key (str) – key pointing to a dataset previously loaded with set_dataset
coords – coordinates in latent space to be associated with the key.
- set_network(network)
- Parameters:
network – a trained neural network defined in molearn.models
- setup_grid(samples=64, bounds_from=None, bounds=None, padding=0.1)
Define a NxN point grid regularly sampling the latent space.
- Parameters:
samples (int) – grid size (build a samples x samples grid)
bounds_from (str/list) – Name(s) of datasets to use as reference, either as single string, a list of strings, or ‘all’
bounds (tuple/list) – tuple (xmin, xmax, ymin, ymax) or None
padding (float) – define size of extra spacing around boundary conditions (as ratio of axis dimensions)
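To make the roles of samples, bounds and padding concrete, the sketch below builds a regular grid from explicit bounds. It assumes the padding is added on both sides of each axis as a fraction of that axis' extent, which is an interpretation of the description above rather than confirmed behaviour; `make_grid` is a hypothetical helper and plain lists stand in for NumPy arrays.

```python
def make_grid(bounds, samples=64, padding=0.1):
    """Return (xvals, yvals, grid) for a samples x samples latent grid."""
    if samples < 2:
        raise ValueError("samples must be at least 2")
    xmin, xmax, ymin, ymax = bounds
    pad_x = padding * (xmax - xmin)  # extra margin as a ratio of the axis size
    pad_y = padding * (ymax - ymin)

    def linspace(lo, hi, n):
        step = (hi - lo) / (n - 1)
        return [lo + k * step for k in range(n)]

    xvals = linspace(xmin - pad_x, xmax + pad_x, samples)
    yvals = linspace(ymin - pad_y, ymax + pad_y, samples)
    grid = [(x, y) for x in xvals for y in yvals]
    return xvals, yvals, grid

xvals, yvals, grid = make_grid((0.0, 10.0, 0.0, 20.0), samples=5, padding=0.1)
```

When bounds_from is used instead, the limits would presumably be derived from the encoded coordinates of the named datasets before the same padding is applied.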
- class MolearnGUI(MA=None)
This class produces an interactive visualisation for data stored in a MolearnAnalysis object, viewable within a Jupyter notebook.
- Parameters:
MA – Either a MolearnAnalysis instance, or None (default). If None, an empty GUI will be produced.
- get_path(idx_start, idx_end, landscape, xvals, yvals, smooth=3)
Find shortest path between two points on a weighted grid
- Parameters:
idx_start (int) – index on a 2D grid, as start point for a path
idx_end (int) – index on a 2D grid, as end point for a path
landscape (numpy.array) – 2D grid
xvals (numpy.array) – x-axis values, to yield actual coordinates
yvals (numpy.array) – y-axis values, to yield actual coordinates
smooth (int) – size of kernel for running average (must be >=1, default 3)
- Returns:
array of 2D coordinates each with an associated value on landscape
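The idea of a shortest path on a weighted grid can be sketched with Dijkstra's algorithm. This is an illustrative stand-in, not necessarily the search used by get_path: it assumes non-negative weights, 4-connected moves, a cost equal to the value of each cell entered, and that landscape[i][j] corresponds to (xvals[i], yvals[j]). Smoothing is omitted, and `shortest_path` is a hypothetical name.

```python
import heapq

def shortest_path(landscape, start, end):
    """Cheapest 4-connected path between two (row, col) cells of a 2D grid."""
    rows, cols = len(landscape), len(landscape[0])
    best = {start: 0.0}
    previous = {}
    queue = [(0.0, start)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == end:
            break
        if cost > best[node]:
            continue  # stale queue entry, a cheaper route was already found
        r, c = node
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols:
                new_cost = cost + landscape[nxt[0]][nxt[1]]
                if new_cost < best.get(nxt, float("inf")):
                    best[nxt] = new_cost
                    previous[nxt] = node
                    heapq.heappush(queue, (new_cost, nxt))
    if end not in best:
        raise ValueError("end point is not reachable from start")
    # Walk back from the end point to recover the path
    path = [end]
    while path[-1] != start:
        path.append(previous[path[-1]])
    path.reverse()
    return path

landscape = [[1, 1, 1],
             [9, 9, 1],
             [1, 1, 1]]
xvals, yvals = [0.0, 0.5, 1.0], [0.0, 0.5, 1.0]
path = shortest_path(landscape, (0, 0), (2, 0))
# Convert grid indices to coordinates with their landscape value
coords = [(xvals[r], yvals[c], landscape[r][c]) for r, c in path]
```

In the toy landscape the path goes around the expensive middle row instead of crossing it.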
- get_path_aggregate(crd, landscape, xvals, yvals, input_is_index=False)
Create a chain of shortest paths via given waypoints
- Parameters:
crd (numpy.array) – waypoints coordinates (Nx2 array)
landscape (numpy.array) – 2D grid
xvals (numpy.array) – x-axis values, to yield actual coordinates
yvals (numpy.array) – y-axis values, to yield actual coordinates
input_is_index (bool) – if False, assume crd contains actual coordinates, graph indexing otherwise
- Returns:
array of 2D coordinates each with an associated value on landscape
- get_point_index(crd, xvals, yvals)
Extract index (of 2D surface) closest to a given real value coordinate
- Parameters:
crd (numpy.array/list) – coordinate
xvals (numpy.array) – x-axis of surface
yvals (numpy.array) – y-axis of surface
- Returns:
1D array with x,y coordinates
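A nearest-index lookup of this kind can be sketched as below. The helper name `nearest_index` is hypothetical, and returning the pair as [x index, y index] is an assumption based on the description above.

```python
def nearest_index(crd, xvals, yvals):
    """Indices of the grid axes values closest to the coordinate crd = (x, y)."""
    ix = min(range(len(xvals)), key=lambda i: abs(xvals[i] - crd[0]))
    iy = min(range(len(yvals)), key=lambda j: abs(yvals[j] - crd[1]))
    return [ix, iy]

index = nearest_index([0.4, 2.6], [0.0, 1.0, 2.0], [0.0, 1.0, 2.0, 3.0])
```

Here 0.4 is closest to the first x value and 2.6 to the last y value.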
- oversample(crd, pts=10)
Add extra equally spaced points between a list of points.
- Parameters:
crd (numpy.array) – Nx2 numpy array with latent space coordinates
pts (int) – number of extra points to add in each interval
- Returns:
Mx2 numpy array, with M>=N.
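The effect of pts can be illustrated with a linear interpolation sketch. It assumes that exactly pts points are inserted into every interval, so that M = N + (N - 1) * pts; this counting is an assumption, and `oversample_sketch` is a hypothetical stand-in that uses lists of tuples instead of NumPy arrays.

```python
def oversample_sketch(crd, pts=10):
    """Insert pts equally spaced points between consecutive 2D points."""
    out = []
    for (x0, y0), (x1, y1) in zip(crd[:-1], crd[1:]):
        out.append((x0, y0))
        for k in range(1, pts + 1):
            t = k / (pts + 1)  # fraction of the way along the interval
            out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    out.append(tuple(crd[-1]))  # keep the final waypoint
    return out

dense = oversample_sketch([(0.0, 0.0), (3.0, 0.0), (3.0, 3.0)], pts=2)
```

The original waypoints are preserved and appear in the output in their original order.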
- plot_analysis_surface(MA, dataset, cmap='gist_heat_r', fname=None, **kwargs)
Plot a specific dataset overlaid on its analysis surface.
- Parameters:
MA (MolearnAnalysis) – Analysis instance containing latent grid axes and the requested surface entry.
dataset (str) – Key identifying the surface to render. Use MolearnAnalysis.scan_dataset() (or similar) to populate MA.surfaces[dataset] beforehand.
cmap (str) – Matplotlib colormap name applied to the surface.
fname (Path) – Destination path for the saved figure. When omitted the plot is displayed only.
- Returns:
None
- plot_angle_hist(MA, plot_data=None, bins: int = 100, wkdir=None, **kwargs)
Plot bond-angle distributions for original and decoded structures.
- Parameters:
MA (MolearnAnalysis) – A MolearnAnalysis object with datasets loaded.
plot_data (list) – A list of tuples containing the data to plot. Each tuple should contain the key to an original/encoded dataset in the MolearnAnalysis object, a label for the legend, and a colour for the decoded data plot. Format: [(key, label, colour), …]
bins (int) – Number of bins in the histogram.
wkdir (Path) – A directory where figures are saved.
- Returns:
None
- plot_bondlength_hist(MA, plot_data=None, bins: int = 100, wkdir=None, **kwargs)
Plot bond-length distributions for original and decoded structures.
- Parameters:
MA (MolearnAnalysis) – A MolearnAnalysis object with datasets loaded.
plot_data (list) – A list of tuples containing the data to plot. Each tuple should contain the key to an original/encoded dataset in the MolearnAnalysis object, a label for the legend, and a colour for the decoded data plot. Format: [(key, label, colour), …]
bins (int) – Number of bins in the histogram.
wkdir (Path) – A directory where figures are saved.
- Returns:
None
- plot_dihedral_hist(MA, plot_data=None, bins: int = 100, wkdir=None, **kwargs)
Plot backbone dihedral distributions for dataset and decoded structures.
- Parameters:
MA (MolearnAnalysis) – A MolearnAnalysis object with datasets loaded.
plot_data (list) – A list of tuples containing the data to plot. Each tuple should contain the key to an original/encoded dataset in the MolearnAnalysis object, a label for the legend, and a colour for the decoded data plot. Format: [(key, label, colour), …]
bins (int) – Number of bins in the histogram.
wkdir (Path) – A directory where figures are saved.
- Returns:
None
- plot_dope_hist(MA, plot_data=None, fname=None, refine=True, **kwargs)
Plot DOPE score distributions as split violin plots comparing datasets and decoded structures.
- Parameters:
MA (MolearnAnalysis) – A MolearnAnalysis object with datasets loaded.
plot_data (list) – A list of tuples containing the data to plot. Each tuple should contain the key to an original/decoded dataset in the MolearnAnalysis object, a label for the legend, and a colour for the decoded data plot. Format: [(key, label, colour), …]
fname (Path) – File name to save the plot.
refine (bool) – If True, refine structures before calculating DOPE score. Can also be ‘both’ to plot both refined and unrefined.
- Returns:
None
- plot_dope_surface(MA, refine=True, truncate_at=None, plot_data=None, cmap='gist_heat_r', fname=None, **kwargs)
Plot the latent grid coloured by decoded DOPE scores.
- Parameters:
MA (MolearnAnalysis) – Analysis instance containing latent grid axes and precomputed DOPE surfaces via MolearnAnalysis.scan_dope().
refine (bool) – When True the refined DOPE surface ('DOPE_refined') is rendered; otherwise the unrefined surface ('DOPE_unrefined') is used.
truncate_at (float) – Upper bound for the colour scale. Defaults to the maximum value in the selected surface.
plot_data (list) – Optional overlays given as (key, label, colour, plot_type) tuples, matching the semantics described for plot_network_rmsd_surface().
cmap (str) – Matplotlib colormap name applied to the surface.
fname (Path) – Destination path for the saved figure. When omitted the plot is displayed only.
- Returns:
None
- plot_inversion_hist(MA, plot_data, fname=None, **kwargs)
Plot distributions of number of D-amino acids in each structure in datasets as bar plots
- Parameters:
MA (MolearnAnalysis) – A MolearnAnalysis object with datasets loaded.
plot_data (list) – A list of tuples containing the data to plot. Each tuple should contain the key to an original/encoded dataset in the MolearnAnalysis object, a label for the legend, and a colour for the decoded data plot. Format: [(key, label, colour), …]
fname (Path) – File name of the plot.
refine (bool) – if True, refine structures before calculating DOPE score
kwargs (dict) – Additional keyword arguments to pass to plt.savefig.
- Returns:
None
- plot_inversion_surface(MA, plot_data=None, levels=10, cmap='gist_heat_r', fname=None, **kwargs)
Plot the latent grid coloured by predicted D-amino acid counts.
- Parameters:
MA (MolearnAnalysis) – Analysis instance with latent grid axes and chirality surface data produced by MolearnAnalysis.scan_ca_chirality().
plot_data (list) – Optional overlays given as (key, label, colour, plot_type) tuples, matching the semantics described for plot_network_rmsd_surface().
levels (int) – Number of discrete contour levels used for the colour map.
cmap (str) – Matplotlib colormap name applied to the surface.
fname (Path) – Destination path for the saved figure. When omitted the plot is displayed only.
- Returns:
None
- plot_network_rmsd_surface(MA, plot_data=None, cmap='gist_heat_r', fname=None, **kwargs)
Plot the latent grid coloured by reconstruction RMSD.
- Parameters:
MA (MolearnAnalysis) – Analysis instance with latent grid axes and a precomputed surfaces['Network_RMSD'] entry (produced by MolearnAnalysis.scan_error()).
plot_data (list) – Optional overlays given as (key, label, colour, plot_type) tuples. plot_type must be either 'scatter' or 'kde' and each key must correspond to an encoded dataset available through MolearnAnalysis.get_encoded().
cmap (str) – Matplotlib colormap name for the background surface.
fname (Path) – Destination path for the saved figure. When omitted the plot is displayed only.
- Returns:
None
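The overlay format described for plot_data can be written out and checked before plotting, as in the sketch below. The dataset keys, labels and colours are placeholders, and `check_plot_data` is a hypothetical helper that only validates the tuple layout and the two documented plot types.

```python
VALID_PLOT_TYPES = {"scatter", "kde"}

def check_plot_data(plot_data):
    """Validate a list of (key, label, colour, plot_type) overlay tuples."""
    for entry in plot_data:
        if len(entry) != 4:
            raise ValueError(f"expected (key, label, colour, plot_type), got {entry!r}")
        key, label, colour, plot_type = entry
        if plot_type not in VALID_PLOT_TYPES:
            raise ValueError(f"plot_type for {key!r} must be 'scatter' or 'kde'")
    return True

# Placeholder keys: each must match a dataset registered with set_dataset
plot_data = [
    ("training", "Training set", "tab:blue", "scatter"),
    ("test", "Test set", "tab:orange", "kde"),
]
is_valid = check_plot_data(plot_data)
```

The same list could then be supplied as the plot_data argument of the surface plotting functions that share these semantics.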
- plot_rmsd_hist(MA, plot_data=None, fname=None, **kwargs)
Plot distributions of RMSD scores of the chosen datasets.
- Parameters:
MA (MolearnAnalysis) – A MolearnAnalysis object with datasets loaded.
plot_data (list) – A list of tuples containing the data to plot. Each tuple should contain a pair of keys corresponding to two datasets in MolearnAnalysis object which would be plotted on a split violin, a label for the legend, and keywords indicating if datasets are train or test. Format: [(key1, key2, label, kw1, kw2), …]
fname (Path) – The file name to save the plots.
kwargs (dict) – Additional keyword arguments to pass to plt.savefig.
- Returns:
None