Data paper
Issue: № 1 (23), 2024


This study reports the complete genome sequence of Sphaerochaeta associata GLS2T (=VKM B-2742T =DSM 26261T), which was isolated from a consortium with the methanogenic archaeon Methanosarcina mazei JL01. The consortium was collected from permafrost of the Kolyma lowland in Russia. A hybrid approach, combining paired-end Illumina reads with Oxford Nanopore Technologies MinION reads, was used to assemble the genome. The final assembly is a single circular chromosome of 3,554,903 bp. This high-quality assembly serves as a basis for algorithmic pathway reconstruction and postgenomic analysis. To that end, the genome was imported into research portals for the algorithmic reconstruction of metabolic pathways, both in general (KEGG) and with a focus on carbohydrate metabolism (CAZy). These portals offer high-quality environments for in-depth studies.

1. Introduction

Currently, eight species from the family Sphaerochaetaceae are validly described, namely Sphaerochaeta associata, S. globosa, S. halotolerans, S. pleomorpha, Parasphaerochaeta coccoides, Pleomorphochaeta caudata, P. naphthae, and P. multiformis. Whole genome sequences are available for S. globosa (CP002541), S. pleomorpha (CP003155), S. halotolerans (QUWK00000000), and P. coccoides (CP002659).

The draft genome of GLS2T was generated at the DOE Joint Genome Institute (JGI) (Berkeley, CA, USA) under the umbrella of the Genomic Encyclopedia of Type Strains, Phase III project. The GLS2T genome project (Gp0157006) was registered in the Genomes OnLine Database, and the draft sequence was annotated with the IMG annotation pipeline and deposited in GenBank (FXUH00000000.1). The complete genome sequence of GLS2T was subsequently obtained using a combination of the Illumina and Oxford Nanopore Technologies (ONT) sequencing platforms.

2. Materials and Methods

S. associata strain GLS2T was grown anaerobically in 1.1 L flasks containing 0.7 L of MS medium at 30°C. Cells in the exponential growth phase were collected by centrifugation (20 min at 4500 rpm), washed with saline, and used for DNA isolation. Genomic DNA for Illumina sequencing was extracted using guanidinium thiocyanate and Triton X-100, followed by purification using the Cleanup Standard BC022 kit (Evrogen, Russia). Illumina shotgun libraries were constructed with an insert size ranging from 262 to 287 bp, and sequencing was carried out on the Illumina HiSeq 2500 platform, generating 2×151-bp paired-end reads. To ensure data quality, all raw Illumina sequence data (7,415,066 reads) were filtered using BBDuk to remove known Illumina artifacts and PhiX sequences. Reads with more than one N, with an average quality score (before trimming) below 8, or shorter than 51 bp (after trimming) were discarded. The remaining reads were aligned to masked versions of the human, cat, and dog references using BBMap, and reads with identities above 95% were filtered out. Sequence masking was performed using BBMask.
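The three discard thresholds quoted above can be expressed as a single predicate. The following is only a minimal Python sketch of that logic, not the BBDuk implementation: the function names `phred33` and `passes_filter` are hypothetical, Phred+33 quality encoding is assumed, and the N count is assumed to be taken on the trimmed sequence, which the text does not specify.

```python
from statistics import mean


def phred33(quality_string: str) -> list[int]:
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(char) - 33 for char in quality_string]


def passes_filter(
    trimmed_seq: str,
    raw_quals: list[int],
    max_n: int = 1,          # reads with more than one N are discarded
    min_avg_q: float = 8.0,  # average quality before trimming must be >= 8
    min_len: int = 51,       # length after trimming must be >= 51 bp
) -> bool:
    """Return True if a read survives the three thresholds from the text."""
    if trimmed_seq.upper().count("N") > max_n:
        return False
    if not raw_quals or mean(raw_quals) < min_avg_q:
        return False
    return len(trimmed_seq) >= min_len
```

Keeping the thresholds as keyword arguments makes it easy to see how sensitive the retained read count is to each cut-off.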

For long-read sequencing with the ONT MinION, high molecular weight genomic DNA was extracted by guanidinium thiocyanate-Triton X-100 lysis, followed by enzymatic treatment with proteinase K and RNase A. The DNA was then further processed using the Circulomics Nanobind Plant Nuclei Big DNA Kit, and size selection was performed with the Circulomics SRE XL kit. The library was prepared using the SQK-LSK110 kit, following the manufacturer's specifications, with a final selection for long DNA using the Long Fragment Buffer (LFB). Sequencing was conducted on a MinION R9.4.1 flow cell. Basecalling was performed using Guppy v6.0.1+652ffd179 in GPU mode with the super-accurate (sup) model, selected via the --config dna_r9.4.1_450bps_sup.cfg option. A total of 137,155 "passed" reads were obtained, with an N50 of 28,900 bp.
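For readers unfamiliar with the N50 statistic reported for the ONT reads, it is the read length at which the cumulative length of reads, sorted from longest to shortest, first reaches half of all sequenced bases. A minimal Python sketch follows; the function name `n50` is illustrative and is not taken from any tool used in this study.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a collection of read (or contig) lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest read down until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no read lengths given")
```

Applied to the lengths of the 137,155 passed reads, a function of this kind would be expected to return the reported value of 28,900 bp.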

The de novo hybrid assembly was performed using the Unicycler pipeline v.0.5.0. Attempts to assemble a single circular chromosome using all reads from both the Illumina and ONT platforms were unsuccessful. The best result was obtained with moderate coverage from both read types: the optimal assembly used only the first 10^6 Illumina reads from the SRR5832273 file and 8,872 randomly selected ONT reads (submitted as SRR18671923). As a result, the genome was assembled into a single circular contig with an overall sequence coverage of approximately 90X, to which long and short reads contributed roughly equally.
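The coverage figures in this paragraph follow from simple arithmetic: mean coverage is the number of sequenced bases divided by the genome length. The Python sketch below illustrates this under stated assumptions; the function names are hypothetical, and the input of one million 151-bp reads is an illustrative value that ignores any trimming.

```python
import math

GENOME_LENGTH = 3_554_903  # bp, size of the assembled chromosome


def mean_coverage(total_bases: int, genome_length: int = GENOME_LENGTH) -> float:
    """Mean sequencing depth: sequenced bases divided by genome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return total_bases / genome_length


def reads_needed(target_coverage: float, mean_read_length: float,
                 genome_length: int = GENOME_LENGTH) -> int:
    """Number of reads of a given mean length needed for a target depth."""
    return math.ceil(target_coverage * genome_length / mean_read_length)


# Illustrative input: one million untrimmed 151-bp reads.
illumina_depth = mean_coverage(1_000_000 * 151)
print(f"short-read depth: {illumina_depth:.1f}X")
```

With these inputs the short-read depth comes out at a little over 42X, which is in line with roughly half of the approximately 90X total stated above.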

The assembled genome was annotated with the NCBI Prokaryotic Genome Annotation Pipeline (PGAP) v6.1, using the best-placed reference protein set and GeneMarkS-2+ methods.


3. Results and Discussion

The genome consists of a single circular chromosome of 3,554,903 bp with a G+C content of 51%. Annotation predicted 3,302 genes in total: 3,241 CDSs (total), 9 rRNAs (three each of 5S, 16S, and 23S), 49 tRNAs, and 3 noncoding RNAs; 3 pseudogenes were identified. All three copies of the 16S rRNA gene are identical, as are all three copies of the 5S rRNA gene, whereas the 23S rRNA gene sequences show intragenomic variation (T – CC substitution in one copy).
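The feature counts above can be checked for internal consistency with a short tally. The Python sketch below is illustrative only. It assumes that the 3 pseudogenes are already included in the "CDSs (total)" figure rather than counted as a separate gene class; this is an interpretation, not something stated explicitly in the text.

```python
# Feature counts as reported in the annotation summary.
FEATURES = {"CDS (total)": 3241, "rRNA": 9, "tRNA": 49, "ncRNA": 3}
RRNA_COPIES = {"5S": 3, "16S": 3, "23S": 3}
REPORTED_TOTAL_GENES = 3302


def tally(counts: dict[str, int]) -> int:
    """Sum the per-class counts."""
    return sum(counts.values())


# Assumption: pseudogenes are counted inside "CDS (total)", so they are
# not added again here.
print(tally(FEATURES) == REPORTED_TOTAL_GENES)
print(tally(RRNA_COPIES) == FEATURES["rRNA"])
```

Under that assumption the four classes sum exactly to the reported 3,302 genes, and the three rRNA types account for all 9 rRNA genes.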

The genome was also deposited in JGI GOLD (Study ID: Gs0156768; Project ID: Gp0622633; Analysis ID: Ga0536652) as Sphaerochaeta associata VKM B-2742. The annotation was performed with the IMG Annotation Pipeline v.5.1.5. This analysis identified the gene products that are connected to metabolic pathways and gene families. The results are presented as a multi-level text database with convenient hyperlinks.

The complete genome assembly of S. associata GLS2T provides the basis for algorithmic pathway reconstruction and other studies, including comparative genomics and postgenomic analysis, on specialized public Internet portals. Only high-quality, complete genomes are useful for this purpose. The finalized assembly of the S. associata GLS2T genome was readily imported into research portals such as KEGG and CAZy for algorithmic reconstruction of metabolic pathways, with a focus on carbohydrate metabolism. These portals provide high-quality environments for in-depth studies.

4. Data availability

The genome sequence and raw sequencing reads for VKM B-2742T were deposited under GenBank accession number CP094929 (version CP094929.1), BioProject accession number PRJNA822125, BioSample accession number SAMN27176868, and SRA accession numbers SRR18671923 (MinION reads) and SRR5832273 (Illumina reads). The JGI GOLD IDs are Gs0156768 (Study ID), Gp0622633 (Project ID), and Ga0536652 (Analysis ID). The KEGG database T-number is T08141. The reference for S. associata GLS2T in the CAZy database is

