Additional data for 'iPRESTO: automated discovery of biosynthetic sub-clusters linked to specific natural product substructures'.
Contents:
- antismashdb_dataset_tokenised_bgcs_clusterfile.csv contains the pre-processed BGCs (tokenised and filtered) from the antiSMASH-DB dataset. Genes are separated by commas and domains by semicolons.
- Pfam_100subs_tc.hmm is the Pfam HMM database in unpressed form (not yet processed with hmmpress), additionally containing subPfam HMMs for the 112 most important biosynthetic Pfams.
- PRESTO-STAT_subclusters.txt contains all sub-clusters found by PRESTO-STAT applied to the antiSMASH-DB dataset.
- lda_model and the accompanying lda_model* files together make up the LDA model trained and used by PRESTO-TOP; the model can be loaded with gensim in Python.
- presto_top_model_annotations.xlsx contains annotations for the sub-cluster motifs (S1 File in the paper).
- mibig_gbk_2.0_clusterfile_ipresto_output_visualisation.html contains the visualisation of all MIBiG BGCs queried against the set of sub-clusters generated in this study.
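
Example (Python): a minimal sketch of how the clusterfile and the LDA model might be read. It assumes that each line of the clusterfile describes one BGC and that the first comma-separated field is the BGC identifier; that layout is an assumption and is not stated above. The function names, the example line and its domain names are hypothetical. The model loader relies on gensim's LdaModel.load and assumes all lda_model* files sit in the same directory.

```python
def parse_clusterfile_line(line):
    """Split one clusterfile line into (bgc_name, genes).

    Assumption: the first comma-separated field is the BGC identifier.
    Genes are separated by commas and the domains within a gene by semicolons.
    """
    fields = line.rstrip("\n").split(",")
    bgc_name, gene_fields = fields[0], fields[1:]
    # Each gene becomes a tuple of its domain tokens; empty fields are skipped.
    genes = [tuple(gene.split(";")) for gene in gene_fields if gene]
    return bgc_name, genes


def load_lda_model(path="lda_model"):
    """Load the LDA model with gensim (imported lazily, so parsing works without it)."""
    from gensim.models import LdaModel
    return LdaModel.load(path)


# Hypothetical line, for illustration only (not taken from the dataset).
example_line = "BGC_example,domA;domB,domC"
name, genes = parse_clusterfile_line(example_line)
print(name, genes)
```

To process the whole file, apply parse_clusterfile_line to each line of antismashdb_dataset_tokenised_bgcs_clusterfile.csv in turn.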