Description
I have a question about a draft Nocardiopsis spp. model I've built using PopPUNK. I work in the natural products space, primarily with the phylum Actinomycetota, a group of bacteria that can be incredibly genetically diverse even at the species level. I am working with a novel Nocardiopsis species and am using PopPUNK to try to determine its taxonomic position relative to the Nocardiopsis genomes available via NCBI. Because my strain is a new species (confirmed chemotaxonomically), I need a tool that is appropriate for mixed-species taxonomic characterization. I started with PopPUNK because the 2019 publication highlighted its utility in classifying multi-species cohorts of bacterial pathogens, and also because I'd like to use core + accessory genomes to build a phylogeny.
My question is this: is PopPUNK appropriate for a diverse genus like Nocardiopsis, given the following results?
I've attached all files generated from the database creation here as a .tar.gz file.
Reference Nocardiopsis genomes were downloaded from NCBI using Datasets. In total, 157 genomes (complete or draft) were used.
The following workflow was used to generate this first database:
Sketch:
poppunk --create-db --output nocardiopsis_ref --r-files ref.txt --min-k 13 --max-k 29
QC database:
poppunk --qc-db --ref-db nocardiopsis_ref --qc-keep
This resulted in 128 files failing QC for various reasons, mostly for failing the distance checks.
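To break the QC failures down by reason, a rough Python sketch like the one below could tally them. It assumes the QC step writes a tab-separated report with one line per failing sample (sample name, then a comma-separated list of reasons). That layout, the function name `tally_qc_reasons`, and the example rows are all assumptions for illustration and should be checked against the actual report in the output directory.

```python
from collections import Counter


def tally_qc_reasons(lines):
    """Count how often each failure reason appears.

    Assumed (unverified) line format: <sample>\t<reason1>,<reason2>,...
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _sample, _, reasons = line.partition("\t")
        for reason in reasons.split(","):
            reason = reason.strip()
            if reason:
                counts[reason] += 1
    return counts


# Hypothetical rows, only to show the expected shape of the input.
example_rows = [
    "sample_A\tdistance",
    "sample_B\tdistance,length",
    "sample_C\tlength",
]

for reason, n in tally_qc_reasons(example_rows).most_common():
    print(f"{reason}\t{n}")

# With a real report, pass the open file instead of example_rows:
# with open("path/to/qc_report.txt") as handle:  # hypothetical path
#     print(tally_qc_reasons(handle))
```

A sample that fails for several reasons is counted once under each reason, so the totals can add up to more than the number of failing files.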
Fit model:
poppunk --fit-model bgmm --ref-db nocardiopsis_ref --output nocardiopsis_ref
Fit summary:
Avg. entropy of assignment 0.0160
Number of components used 2
Scaled component means:
[0.69171908 0.07866528]
[0.36643846 0.3914161 ]
Network summary:
Components 1
Density 0.1921
Transitivity 0.5733
Mean betweenness 0.1924
Weighted-mean betweenness 0.1924
Score 0.4631
Score (w/ betweenness) 0.3740
Score (w/ weighted-betweenness) 0.3740
Removing 136 sequences
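As a sanity check on the fit, here is a small Python sketch that computes how far each scaled component mean sits from the origin. It assumes the first column is the core distance and the second is the accessory distance, and that the component nearest the origin is the one treated as within-strain. Both points are my reading of the output, not something confirmed here.

```python
import math

# Scaled component means copied from the fit summary above.
# Assumption: columns are (core distance, accessory distance).
scaled_means = [
    (0.69171908, 0.07866528),
    (0.36643846, 0.39141610),
]

# Euclidean distance of each component mean from the origin (0, 0).
distances = [math.hypot(core, accessory) for core, accessory in scaled_means]

# Index of the component whose mean is nearest the origin.
closest = min(range(len(distances)), key=distances.__getitem__)

for i, ((core, accessory), dist) in enumerate(zip(scaled_means, distances)):
    print(f"component {i}: core={core:.3f} accessory={accessory:.3f} "
          f"distance_to_origin={dist:.3f}")
print(f"closest to origin: component {closest}")
```

With these values, the second component is the nearer of the two, but neither mean lies close to the origin in scaled space.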
Thanks!