Add PCA and UMAP transformed coordinates and amino acid subtypes based on analysis and clustering of the combined deep mutational landscape dataset.

annotate(x)

Arguments

x

deep_mutational_scan to annotate

Value

An annotated deep_mutational_scan whose data object contains the following added columns:

  • cluster: the assigned amino acid subtype.

  • PC1 - PC20: Principal Component coordinates.

  • umap1/2: UMAP coordinates.

  • base_cluster: The nearest primary cluster centroid (i.e. not outlier or permissive clusters).

  • permissive: The position is identified as permissive (|ER| < 0.4 for all amino acids).

  • ambiguous: The distance to two clusters is very similar, so the assignment is low confidence and is marked as ambiguous.

  • high_distance: The position is distant from all cluster centroids and is marked as an outlier.

  • dist1-8: The cosine distance from the position to each of clusters 1-8 of the wild-type (WT) amino acid.

  • cluster_notes: Notes on the cluster assignment.

Details

PCA and UMAP coordinates are assigned using the original models fit to the deep mutational landscape dataset. The PCA coordinates are then used to assign each position to an amino acid cluster. Clusters are named X1-8 for the main described clusters of amino acid X, XP for the permissive cluster (|ER| < 0.4 for all amino acids), XO for outliers and XA when the assignment is ambiguous.

Clusters are initially assigned based on cosine distance to cluster centroids. Positions at a similar distance from multiple centroids (difference < 0.03) are marked as ambiguous, those more than 0.45 from all centroids are marked as outliers, and those with all |ER| scores < 0.4 are marked as permissive. The first two thresholds were determined by benchmarking assignment on positions in the original dataset; the last is the threshold used for permissive clusters in the original dataset.
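The assignment rules above can be illustrated with a minimal base R sketch. It is not the package's implementation: the helper names (cosine_distance, assign_cluster), the toy centroids, and the order in which the permissive, outlier and ambiguous checks are applied are all assumptions made for illustration, and "similar distance" is interpreted here as the gap between the two nearest centroids.

```r
# Illustrative sketch only; names and rule precedence are assumptions.

# Cosine distance from vector x to each row of a centroid matrix
cosine_distance <- function(x, centroids) {
  sims <- as.vector(centroids %*% x) /
    (sqrt(rowSums(centroids^2)) * sqrt(sum(x^2)))
  1 - sims
}

# pcs: PCA coordinates of one position
# er: ER scores for all amino acids at that position
# centroids: one row per primary cluster of the WT amino acid `aa`
assign_cluster <- function(pcs, er, centroids, aa,
                           ambiguous_diff = 0.03,
                           outlier_dist = 0.45,
                           permissive_er = 0.4) {
  dists <- cosine_distance(pcs, centroids)
  ord <- order(dists)
  nearest <- ord[1]

  suffix <- if (all(abs(er) < permissive_er)) {
    "P" # permissive: all |ER| below threshold (checked first, by assumption)
  } else if (dists[nearest] > outlier_dist) {
    "O" # outlier: further than the threshold from every centroid
  } else if (length(dists) > 1 &&
             dists[ord[2]] - dists[nearest] < ambiguous_diff) {
    "A" # ambiguous: two nearest centroids are almost equally close
  } else {
    as.character(nearest)
  }

  list(cluster = paste0(aa, suffix),
       base_cluster = paste0(aa, nearest),
       dists = dists)
}

# Toy example: two hypothetical centroids in a 2D PCA space
centroids <- rbind(c(1, 0), c(0, 1))
strong_er <- c(0.8, -0.5)

res_clear      <- assign_cluster(c(1, 0.1), strong_er, centroids, "L")
res_ambiguous  <- assign_cluster(c(1, 1), strong_er, centroids, "L")
res_outlier    <- assign_cluster(c(-1, -1), strong_er, centroids, "L")
res_permissive <- assign_cluster(c(1, 0.1), c(0.1, -0.2), centroids, "L")
```

In this sketch base_cluster is always the nearest primary centroid, so a permissive, outlier or ambiguous position still records which main cluster it is closest to, matching the base_cluster column described under Value.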

Examples

dms <- annotate(deepscanscape::deep_scans$p53)