215: Protein Set Transformer for high-diversity viromics

01/12/2025 20 min Season 1 Episode 215

Episode Synopsis

Episode 215: Protein Set Transformer for high-diversity viromics
In this episode of PaperCast Base by Base, we explore the Protein Set Transformer (PST), a protein-based genome language model that represents genomes as sets of proteins to improve genome and protein representations across diverse viral datasets.
Study Highlights: PST embeds proteins with ESM2, concatenates positional and strand vectors, contextualizes proteins with a multi-head attention encoder, and produces genome embeddings via learnable weighted decoder pooling. The foundation PST-TL models were pretrained on >100k dereplicated viral genomes encoding >6M proteins using a triplet-loss objective with PointSwap augmentation, and were evaluated on IMG/VR v4 and MGnify soil virus test sets. PST-TL outperformed other protein- and nucleotide-based methods at recovering genome–genome relationships, including remote relationships, and its protein embeddings clustered structural capsid folds and late-gene functional modules. PST improved annotation transfer for hypothetical proteins via embedding and structure-aware clustering, and boosted viral host-species prediction when used in a graph link-prediction framework.
Conclusion: PST provides transferable genome- and protein-level embeddings that strengthen representation, annotation, and host-prediction tasks for diverse viral and microbial genomics applications.
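For listeners who want a concrete picture of the steps named in the Study Highlights, here is a minimal Python sketch of three of them: concatenating extra vectors onto a protein embedding, pooling a set of protein vectors into one genome vector with softmax weights, and scoring a triplet of genomes. It is an illustration under stated assumptions, not the authors' implementation. The function names, the tiny toy vectors, the dot-product scoring, and the standard margin-based triplet loss are all assumptions chosen for clarity; ESM2, the multi-head attention encoder, and PointSwap are not shown.

```python
import math
from typing import List, Sequence

Vector = List[float]


def concat_features(protein_emb: Sequence[float],
                    position_vec: Sequence[float],
                    strand_vec: Sequence[float]) -> Vector:
    """Concatenate a protein embedding with positional and strand vectors.

    Hypothetical helper: in the paper the protein embedding comes from ESM2;
    here it is just a list of floats.
    """
    return list(protein_emb) + list(position_vec) + list(strand_vec)


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_pool(proteins: Sequence[Sequence[float]],
                  score_weights: Sequence[float]) -> Vector:
    """Pool a set of protein vectors into a single genome vector.

    Each protein gets a scalar score (dot product with score_weights, which
    stands in for learnable parameters), the scores are softmax-normalized,
    and the genome vector is the weighted average of the protein vectors.
    """
    scores = [sum(x * w for x, w in zip(p, score_weights)) for p in proteins]
    weights = softmax(scores)
    dim = len(proteins[0])
    return [sum(w * p[d] for w, p in zip(weights, proteins)) for d in range(dim)]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor: Sequence[float],
                 positive: Sequence[float],
                 negative: Sequence[float],
                 margin: float = 1.0) -> float:
    """Standard margin-based triplet loss (assumed form, not the paper's exact one).

    The loss is zero once the negative genome is at least `margin` farther
    from the anchor than the positive genome is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


if __name__ == "__main__":
    # Toy genome with two "proteins": 2-d embedding + 1-d position + 1-d strand.
    genome = [
        concat_features([1.0, 0.0], [0.0], [1.0]),
        concat_features([0.0, 1.0], [1.0], [-1.0]),
    ]
    genome_vec = weighted_pool(genome, score_weights=[0.0, 0.0, 0.0, 0.0])
    print("genome embedding:", genome_vec)
    print("triplet loss:", triplet_loss(genome_vec, genome_vec, [5.0, 5.0, 5.0, 5.0]))
```

With all-zero score weights every protein receives the same weight, so the pooling reduces to a plain average; training would adjust those weights so that more informative proteins contribute more to the genome embedding.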
Music: Enjoy the music based on this article at the end of the episode.
Reference: Martin, C., Gitter, A., Anantharaman, K. Protein Set Transformer: a protein-based genome language model to power high-diversity viromics. Nat Commun (2025). https://doi.org/10.1038/s41467-025-66049-4
License: This episode is based on an open-access article published under the Creative Commons Attribution 4.0 International License (CC BY 4.0) – https://creativecommons.org/licenses/by/4.0/
Support: Base by Base – Stripe donations: https://donate.stripe.com/7sY4gz71B2sN3RWac5gEg00
Official website https://basebybase.com
Castos player https://basebybase.castos.com
On PaperCast Base by Base you’ll discover the latest in genomics, functional genomics, structural genomics, and proteomics.
Episode link: https://basebybase.castos.com/episodes/protein-set-transformer

Chapters
(00:00:00) - Deep Learning in Viral Biology
(00:02:31) - Preliminary insights into viral biology
(00:08:01) - PSTTL: The Hidden Genome of Viruses
(00:11:29) - PSTTL: The Virality Model
(00:14:22) - Preston 2, Context-aware viral evolution
(00:16:19) - Signs and Numbers in the Code