National Microbiology Data Center (国家微生物科学数据中心)

花匠小妙招
2024-11-26 09:57

Analysis tools

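Scientific data submission for Science and Technology Basic Work Special Projects: the science and technology project data submission service platform serves as the project data submission system of the National Basic Science Data Center. For national key special projects in the basic sciences, it provides services such as filing of data submission plans, remote collaborative preparation of scientific data, continuous aggregation of project data by the project team throughout the full project cycle, and issuance of submission certificates, helping science and technology projects at all levels pass acceptance review.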

Categories: Pathogenic genome analysis (12); Plasmids in Pathogens genome analysis (3); SARS-CoV-2 analysis (3); Influenza analysis (3); HIV analysis (5); Mycobacterium Tuberculosis genome analysis (1); Phage genome analysis (2); Prokaryotic regulator analysis (2); gcType genome analysis (3); Fungi genome analysis (8); Others (3); Blast analysis tools (5); Convenient analysis tools (7); Metagenome analysis pipeline (3); Assembly tools (25); Genome structural analysis tools (11); Genome annotation tools (4); Community profiling tools (20); Comparative analysis tools (13)

Pathogenic genome analysis: 12

Reference Guided Assembly

The reference-guided assembly tool is suitable for genome assembly of bacteria and viruses. It mainly uses BWA and Minimap for read mapping and iVar for assembly. The reference genome library for comparison includes 10,600 genomes from 10,401 bacterial strains, and users can also upload their own viral reference genomes for viral genome assembly. Assembly completeness is then assessed with QUAST, and CGView is used to draw the complete genome map.

Pathogen Identification

The species identification tool mainly uses KRAKEN2, RNAmmer, BLASTn, Mash, and FastANI to compare the similarity between the target sequence and the reference library sequences, and reports the best-matching results for species identification of pathogenic bacteria.

BLAST-pathogen

The BLAST-pathogen tool is suitable for sequence alignment of bacteria, viruses, and influenza. It mainly uses NCBI-BLAST for sequence alignment against 465,736 high-quality assemblies and 53,568,325 high-quality contig sequences from 359 pathogenic bacteria that cause human disease, as well as 7,787 high-quality assemblies from 195 viruses and 1,101,148 high-quality contig sequences, including 1,018,990 contig sequences from influenza. chewBBACA is then used to draw a phylogenetic tree for the query sequences and the 20 most similar sequences.

Reference-free Assembly and Genomic Annotation

This tool is suitable for bacterial sequence assembly and annotation. It performs reference-free assembly of the raw reads, followed by principal component analysis and gene annotation of the assembled genome. The main gene annotation databases are KEGG, COG, NR, SwissProt, AntiSMASH, MetaCyc, PHI, Pfam, CARD, and VFDB. The analysis results are sent to users by email.

SNP Analysis

The SNP analysis tool is suitable for SNP analysis of bacteria and viruses. It mainly uses Snippy for SNP calling on bacterial sequences and iVar for SNP calling on viral sequences. The reference genome library for comparison includes 10,600 genomes from 10,401 bacterial strains. Gubbins is then used to construct the core SNP matrix and remove SNPs in recombinant regions. When more than 3 sequences are tested, you can upload a metadata table and use IQ-TREE 2 to draw the phylogenetic tree.

Multilocus Sequence Typing (MLST)

Multilocus Sequence Typing (MLST) scans bacterial genomes against traditional PubMLST typing schemes for sequence typing. The analysis results are sent to users by email.

Core Genome Multilocus Sequence Typing (cgMLST)

Core genome multilocus sequence typing (cgMLST) can be used for sequence typing of pathogenic bacteria. We provide cgMLST analysis and 112 downloadable cgMLST schemas for pathogen species or genera, constructed with chewBBACA, which performs schema creation and allele calling on complete or draft genomes. The cgMLST analysis result is an interactive phylogenetic tree visualization based on the chewBBACA result file and the metadata submitted by users (if provided).

MGEs and Transferable ARGs and VFs Detection

This tool simultaneously identifies and annotates insertion sequences (IS), integrative and conjugative elements (ICE), integrons (IN), plasmids, and transposons (Tn) in the genomes of pathogenic bacteria. It then uses Diamond alignment to predict antibiotic resistance genes (ARGs) and virulence factors (VFs) in the tested genome. By analyzing the positions of ARGs and VFs on the genome, it determines whether they are associated with mobile genetic elements (MGEs). The criteria for judging whether an ARG or VF may be horizontally transferred are: 1) if the same IS/IN sequence is found on both sides within 10 kb upstream and downstream of an ARG or VF, the ARG or VF is considered potentially transferable; 2) if an ARG or VF lies within the sequence range of an IN, Tn, plasmid, or ICE, it is considered potentially horizontally transferable.
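The two criteria can be sketched in Python as follows. This is a minimal illustration and not the platform's implementation: the `Feature` record, its `kind` and `family` labels, and the reading of "within 10 kb" as the gap between the gene boundary and the nearest edge of the flanking element are all assumptions made for this example.

```python
from dataclasses import dataclass

FLANK = 10_000  # 10 kb window on each side of the gene (criterion 1)

@dataclass(frozen=True)
class Feature:
    contig: str
    start: int        # 1-based, inclusive
    end: int          # 1-based, inclusive
    kind: str         # hypothetical labels: "IS", "IN", "Tn", "plasmid", "ICE", "ARG", "VF"
    family: str = ""  # hypothetical element name, used to decide whether two IS/IN copies match

def is_potentially_transferable(gene, mges):
    """Apply the two positional criteria to one ARG/VF and a list of MGE features."""
    same_contig = [m for m in mges if m.contig == gene.contig]

    # Criterion 2: the gene lies inside an IN, Tn, plasmid or ICE.
    for m in same_contig:
        if m.kind in {"IN", "Tn", "plasmid", "ICE"} and m.start <= gene.start and gene.end <= m.end:
            return True

    # Criterion 1: the same IS/IN is present on both sides within 10 kb.
    flanking = [m for m in same_contig if m.kind in {"IS", "IN"}]
    upstream = {(m.kind, m.family) for m in flanking
                if m.end < gene.start and gene.start - m.end <= FLANK}
    downstream = {(m.kind, m.family) for m in flanking
                  if m.start > gene.end and m.start - gene.end <= FLANK}
    return bool(upstream & downstream)
```

For example, an ARG at 20,000 to 21,000 bp flanked by two copies of the same IS at 15,000 to 16,000 bp and 25,000 to 26,000 bp satisfies criterion 1, while a single IS copy on one side does not.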

Genomic Annotation of Pathogenic Fungi

This tool is suitable for sequence assembly and annotation of fungi, mainly by assembling raw reads and annotating the assembled genomes with functional genes. The gene annotation databases include SignalP, VFDB, CARD, CAZy, NR, FungalP450, DFVF, SwissProt, emapper, and AntiSMASH. The analysis results are sent to users by email.

mNGS Online Analysis Platform

This tool is based on mNGS data and can analyze pathogens from infection samples of different organs in the human body, including blood infections, central nervous system infections, respiratory system infections, bone and joint infections, reproductive and urinary tract infections, abdominal and thoracic infections.

Fungal Pan-genome Analysis

This tool is specifically designed for fungal pangenome analysis, capable of efficiently processing large-scale genome data and accurately identifying and annotating core genes, accessory genes, and unique genes in the pangenome. For the first time, it combines the removal of human and bacterial contamination from fungal genome data, sequence quality control, gene prediction, pangenome analysis, and protein annotation, providing a new analytical approach for fungal research and assisting researchers in deeply exploring the mysteries of fungal genomes.

Fungal SNP calling, annotation, and prediction of anti-fungal resistant phenotype

This tool is mainly used for SNP detection and annotation in fungal genomes, and for predicting drug-resistance phenotypes by detecting resistance-associated mutation sites. The reference genomes cover 293 fungal species. The tool can predict 11 drug-resistance phenotypes across 4 major drug classes, based on 233 resistance-associated site mutations in 10 resistance genes.

Plasmids in Pathogens genome analysis: 3

Plasmids identification from assembled bacterial genomes

This tool identifies plasmid contigs in bacterial genomes. It can accurately and quickly predict plasmid contigs from fragmented bacterial genome assemblies, which helps in studying the spread of antimicrobial resistance genes and the adaptive evolution of bacterial genomes.

Pipeline for plasmid annotation

This tool is mainly aimed at annotation of plasmid contigs, which can accurately and rapidly identify drug resistance genes, virulence genes, heavy metal resistance genes, IS, OriT and other plasmid characteristics carried by plasmids, which is helpful for studying the spread of drug resistance genes and the adaptive evolution of bacterial genomes.

BLAST-plasmids

This tool lets users upload plasmid sequences and run a similarity alignment (BLASTn) against all plasmid sequences in PIPdb. The output lists the top 30 hits in descending order of score, with key parameters including query ID, subject ID, qstart, qend, sstart, send, identity, score, and evalue. Users can click any subject ID to view the basic information and distribution of that sequence in PIPdb.
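The ranking step can be sketched as below. This is only an illustration: it assumes the hits arrive as tab-separated rows whose columns follow the order listed in the description, which is a hypothetical layout rather than the documented PIPdb output format, and it reads the ordering as highest score first.

```python
import csv
import io

# Column order assumed for this sketch (mirrors the parameters named in the description).
COLUMNS = ["query_id", "subject_id", "qstart", "qend",
           "sstart", "send", "identity", "score", "evalue"]

def top_hits(tsv_text, limit=30):
    """Parse tab-separated hits and return the `limit` best ones, highest score first."""
    reader = csv.DictReader(io.StringIO(tsv_text), fieldnames=COLUMNS, delimiter="\t")
    hits = []
    for row in reader:
        row["score"] = float(row["score"])  # compare scores numerically, not as text
        hits.append(row)
    hits.sort(key=lambda hit: hit["score"], reverse=True)
    return hits[:limit]
```

Sorting on the numeric score matters here: sorted as strings, a score of "95" would rank above "185".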

SARS-CoV-2 analysis: 3

Variation identification

This tool identifies single nucleotide polymorphisms (SNPs) and amino acid variations within the SARS-CoV-2 nucleotide sequence and provides a detailed evaluation of associated risks, including antibody and receptor binding interactions and the difficulty of amino acid substitutions. The analysis output includes a comprehensive variant risk evaluation table and a visual representation of variant frequencies, facilitated through integration with the VarEPS database. Results are delivered to users via email, enabling convenient access for further study and analysis.

Primer Evaluation

This tool evaluates the effectiveness of SARS-CoV-2 primer sequences by analyzing the variation frequency of the last three nucleotides in the 3' primer region across different lineages. Additionally, it provides a weighted evaluation of the mutation frequency in this critical region. The results are delivered to users via email for convenient access, facilitating further study and analysis.
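A minimal sketch of this kind of 3'-end check is given below. It assumes the primer-binding site has already been extracted from each genome in the same orientation and length as the primer, and the position weights are hypothetical (the page does not state how the weighted evaluation is computed); the terminal base is simply weighted most heavily here.

```python
def three_prime_variation(primer, sites, weights=(1.0, 2.0, 3.0)):
    """Mismatch frequency at each of the last three primer bases, plus a weighted mean.

    `sites` are binding-site sequences already aligned to the primer.
    `weights` are illustrative only; the last value applies to the 3'-terminal base.
    """
    if len(primer) < 3 or not sites:
        raise ValueError("need a primer of at least 3 bases and at least one site")
    tail = primer[-3:].upper()
    freqs = []
    for offset in range(3):
        mismatches = sum(1 for s in sites if s[-3:].upper()[offset] != tail[offset])
        freqs.append(mismatches / len(sites))
    weighted = sum(w * f for w, f in zip(weights, freqs)) / sum(weights)
    return freqs, weighted
```

To compare lineages, the same function could be called once per lineage on that lineage's binding sites.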

EvarEPS

This tool is designed to detect SARS-CoV-2 and its lineages in wastewater samples. Its analysis includes sample quality control, a summary of detected lineages along with newly identified SNPs and amino acid variations, and an evaluation of SNP frequency and associated risks, such as the impact on antibody and receptor binding interactions and the difficulty of amino acid substitutions. Results are delivered to users via email for further research and analysis.

Influenza analysis: 3

Variation identification

This tool identifies single nucleotide polymorphisms (SNPs) and amino acid variations within the influenza nucleotide sequence and provides a detailed evaluation of associated risks, including antibody and receptor binding interactions and the difficulty of amino acid substitutions. The analysis output includes a comprehensive variant risk evaluation table and a visual representation of variant frequencies, facilitated through integration with the VarEPS-Influ database. Results are delivered to users via email, enabling convenient access for further study and analysis.

Primer Evaluation

This tool evaluates the effectiveness of influenza primer sequences by analyzing the variation frequency of the last three nucleotides in the 3' primer region across different subtypes and clades. Additionally, it provides a weighted evaluation of the mutation frequency in this critical region. The results are delivered to users via email for convenient access, facilitating further study and analysis.

A/H9 influenza virus lineage and clade assignment

This tool identifies the lineage and clade assignment of H9 influenza virus based on its HA segment nucleotide sequences. It generates a comprehensive table detailing the clade and lineage assignment for each sequence, along with a phylogenetic tree constructed by integrating the submitted sequences with reference sequences from the corresponding lineages. The analysis results are delivered via email, providing convenient access for further in-depth study and comparative analysis.

HIV analysis: 5

HIV sequence typing

Using the latest HIV sequence reference database together with China-specific CRF01_AE and CRF07_BC reference sequences, this tool genotypes HIV sequences by phylogenetic tree and BLAST methods. For CRF01_AE and CRF07_BC, it also reports the sub-cluster of each sequence.

HIV transmission network

This platform constructs HIV transmission networks based on HIV Trace and Cluster Picker. Users only need to upload a FASTA sequence file and the corresponding metadata to obtain the molecular network and several key network evaluation indicators; the molecular network results support customized display.

HIV sequence quality control

By integrating and recoding existing HIV sequence quality control code, this platform provides an alternative quality control tool for HIV sequencing data. The checks cover sequence length, mixed base ratio, frameshift mutations, stop codon mutations, and hypermutation.
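Three of these checks (length, mixed base ratio, and in-frame stop codons) can be sketched as follows. The thresholds are hypothetical placeholders, and the sketch assumes the sequence begins in reading frame; frameshift and hypermutation detection need a reference alignment and are not attempted here.

```python
IUPAC_MIXED = set("RYSWKMBDHVN")   # IUPAC ambiguity codes, i.e. anything other than A/C/G/T
STOP_CODONS = {"TAA", "TAG", "TGA"}

def basic_qc(seq, min_length=1000, max_mixed_ratio=0.05):
    """Return simple QC metrics for one nucleotide sequence (thresholds are illustrative)."""
    seq = seq.upper().replace("-", "")  # drop alignment gaps before measuring
    mixed_ratio = sum(base in IUPAC_MIXED for base in seq) / len(seq) if seq else 1.0
    # Split into codons from position 0; the final codon is allowed to be a stop.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    internal_stops = sum(codon in STOP_CODONS for codon in codons[:-1])
    return {
        "length": len(seq),
        "length_ok": len(seq) >= min_length,
        "mixed_ratio": mixed_ratio,
        "mixed_ok": mixed_ratio <= max_mixed_ratio,
        "internal_stops": internal_stops,
    }
```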

HIV drug resistance analysis

This tool trains a convolutional neural network (CNN) model on HIV phenotypic drug resistance data. After the phenotypic resistance data were supplemented with non-subtype-B data, the model was fine-tuned to improve the accuracy of drug resistance prediction for non-subtype-B sequences.

HIV transmission prediction and early warning

The molecular network was constructed based on HIV Trace method, and the sequence typing and drug resistance information were supplemented by HIV sequence typing tool and HIV drug resistance analysis tool. On this basis, the transmission risk of HIV transmission clusters and nodes in the network was assessed by the self-defined quantitative screening criteria. At the same time, the risk degree of cross-regional transmission and drug-resistant transmission was given by weight adjustment.

Mycobacterium Tuberculosis genome analysis: 1

MTB Lineage Determination and Drug Resistance Analysis

This method utilizes the official tb-profile 4.3.0 Docker image for MTB lineage determination and drug resistance analysis.

Phage genome analysis: 2

Phage one-stop analysis pipeline

This pipeline is designed for quality control, filtering, assembly, and downstream analysis of the next-generation sequencing paired-end reads of bacteriophages. Users can also directly submit their own genome for analysis.

Prokaryotic Genome Prophage Analysis Pipeline

This pipeline predicts prophage sequences in prokaryotic genomes, annotates protein functions, antibiotic resistance genes, and virulence genes, and predicts the family of each prophage sequence.

Prokaryotic regulator analysis: 2

Prokaryotic Genome Analysis for Global Regulator

The Prokaryotic Genome Analysis for Global Regulator pipeline performs global transcription factor and target gene analysis on the whole-genome sequence submitted by the user. It reports the category, species distribution, and functional annotation of the global regulators and their target genes.

Prokaryotic Global Regulator Identification

This tool identifies global transcription factors among the protein sequences submitted by the user. It compares the input against a database of prokaryotic global transcription factors and reports information such as category, function, and species distribution.

gcType genome analysis: 3

Genome Assembly and Annotation Pipeline

The genome assembly and annotation pipeline is suitable for the assembly and annotation of bacterial sequences. It involves assembling and analyzing the raw sequencing data, and then annotating the assembled genome. The raw sequencing data can be assembled using various software tools for both long reads (PacBio or Nanopore) and Illumina short reads. The genomic component analysis includes identifying CRISPR arrays, detecting repetitive structures, predicting non-coding RNAs, prophage prediction, defense system prediction, mobile element detection, and gene prediction using Prodigal. The gene annotation mainly utilizes several databases, including KEGG, GO, COG, NR, Swiss-Prot, AntiSMASH, MetaCyc, PHI, Pfam, CARD and VFDB.

Species identification pipeline

The new species identification pipeline uses the genome sequence of the new type strain as the query to perform a similarity search against the gcType 16S rRNA gene and genome sequence reference database. This allows for the identification of the closest related species, which can then be used to perform a phylogenetic analysis.

Structure-based alignment of protein sequences

This tool first uses the Swiss-Prot and Pfam databases to annotate the input type strain protein sequences. Then, against a locally built private type strain protein database, the input sequences are matched and aligned using TM-vec and DeepBLAST (https://m.biorxiv.org/content/10.1101/2022.07.25.501437v1.full.pdf), which consider both the sequence and the structural characteristics of the input sequences. Finally, the results of both parts are displayed. If this is your first time using the protein annotation + AI search pipeline, please read the manual before submitting your job.

Fungi genome analysis: 8

Fungal Names: Pairwise Sequence Alignment

This tool identifies fungal species by carrying out pairwise sequence alignment against the ITS sequences of the UNITE+INSD dataset from UNITE. Users can set the maximum number of target sequences, and a phylogenetic tree is constructed from those target sequences with FastTree. The results support customized visual display.

Fungi type: Genomic Annotation of Fungi

This tool is suitable for sequence assembly and annotation of fungi, mainly by assembling raw reads and annotating the assembled genomes with functional genes.

Fungi type: Fungal Pan-genome Analysis

QM: Platform for Pairwise Alignment

The platform for identification of quarantine fungi species is based on multiple gene sequence benchmark databases. By integrating sequence alignment software, phylogenetic analysis software, and genetic distance analysis modules, as well as designing multiple query function modules, such as sequence input, sequence alignment, construction of phylogenetic trees, species confirmation and other functional modules, individual genes can be automatically analyzed for phylogeny and the function of individual barcode identification screening can be realized.

QM: Platform for Multi-locus Identification

The species identification platform for multilocus data is built mainly on multi-gene reference databases. By integrating sequence alignment, phylogenetic analysis, and genetic distance analysis software modules with a series of query function modules, it establishes a 'one-stop' multi-gene lineage typing and screening system for accurate and rapid species identification based on multiple gene sequences.

QM: Platform for Multi-objective Inspection

The high-throughput screening platform establishes a detection method for rapidly screening quarantine species from quarantine samples. By extracting total DNA from quarantine samples, amplifying their ITS2 region, and performing second-generation sequencing, the bioinformatics software modules embedded in this platform are used for primary information analysis to determine whether there is a possibility of multiple quarantine species in the samples to be tested.

Poisonous mushroom: Pairwise sequence alignment

This tool marks the species names of poisonous mushrooms with an asterisk, based on the world list of poisonous mushrooms (He et al. 2022, Fungal Biology Reviews). The pairwise sequence alignment uses the ITS sequences of the UNITE+INSD dataset from UNITE. The results support customized display, and each poisonous mushroom hit links to its species detail page.

Poisonous mushroom: High throughput analysis

This tool supports identification of environmental fungi based on the Internal Transcribed Spacer (ITS). The analysis includes de-duplication, singleton removal (with a customizable frequency threshold), OTU clustering (with customizable sequence identity) based on the UNITE+INSD reference database from UNITE, and species annotation with BLAST+. Outputs include the OTU table, taxonomy table, taxonomy statistics, and their visualizations.
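The first two steps, de-duplication and removal of low-frequency sequences, can be sketched in a few lines. This is an illustration under simple assumptions (exact-match duplicates, case-insensitive comparison); the `min_count` parameter stands in for the customizable frequency threshold and is a name chosen for this example, not the platform's option.

```python
from collections import Counter

def dereplicate(reads, min_count=2):
    """Collapse identical reads and drop those seen fewer than `min_count` times.

    Returns (sequence, count) pairs, most abundant first, ready for OTU clustering.
    """
    counts = Counter(read.upper() for read in reads)
    kept = [(seq, n) for seq, n in counts.items() if n >= min_count]
    kept.sort(key=lambda item: (-item[1], item[0]))  # abundance first, then sequence for stable order
    return kept
```

With the default `min_count=2`, singletons (sequences observed exactly once) are discarded.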

Others: 3

Human Antibody Convergent Evolution Prediction and Analysis System

The website has established a new precision labeling algorithm for antibody V(D)J segments, using a reference library consistent with mainstream IMGT and IgBlast databases to ensure the international applicability of the results obtained by users.

GRAPE-WEB

The GRAPE webserver allows non-expert users to improve protein thermostability following our recently developed GRAPE strategy. The strategy combines the advantages of hybrid method and greedy algorithm to search beneficial combination pathways in an expanded mutation library by reducing the dimensionality of the data.

IPGA v1.09

IPGA (https://nmdc.cn/ipga/) is a one-stop web service for analyzing, comparing, and visualizing pan-genomes as well as individual genomes, without requiring users to install any tools. IPGA features a scoring system that helps users evaluate the reliability of pan-genome profiles generated by different packages.

Blast analysis tools: 5

BlastN

Searches a nucleotide query against a nucleotide database

BlastP

Searches a protein query against a protein database

BlastX

Searches a nucleotide query against a protein database

TblastN

Searches a protein query against a nucleotide database

TblastX

Searches a translated nucleotide query against a translated nucleotide database
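The choice among the five programs depends only on the molecule type of the query and of the database, plus whether a nucleotide-versus-nucleotide search should be carried out at the protein level. A small helper makes the mapping explicit; the function name and argument values are invented for this sketch, while the returned names are the standard NCBI BLAST+ program names.

```python
def choose_blast(query, database, translated=False):
    """Pick a BLAST program from the molecule types ("nucleotide" or "protein").

    `translated` only matters for nucleotide vs nucleotide searches:
    False compares the sequences directly (blastn), True compares
    their six-frame translations (tblastx).
    """
    if query == "nucleotide" and database == "nucleotide":
        return "tblastx" if translated else "blastn"
    programs = {
        ("protein", "protein"): "blastp",
        ("nucleotide", "protein"): "blastx",   # query is translated
        ("protein", "nucleotide"): "tblastn",  # database is translated
    }
    try:
        return programs[(query, database)]
    except KeyError:
        raise ValueError(f"unsupported combination: {query!r} vs {database!r}") from None
```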

Convenient analysis tools: 7

orthoANI

Tool_OrthoANI

MUMmer

MUMmer is a system for rapidly aligning entire genomes, whether in complete or draft form

PILER-CR

PILER-CR is public domain software for finding CRISPR repeats

tRNAscan-SE

tRNAscan-SE identifies transfer RNA genes in genomic DNA or RNA sequences

RNAmmer

The RNAmmer 1.2 server predicts 5s/8s, 16s/18s, and 23s/28s ribosomal RNA in full genome sequences

LEfSe

LEfSe (Linear discriminant analysis Effect Size) determines the features most likely to explain differences between classes by coupling standard tests for statistical significance with additional tests encoding biological consistency and effect relevance

XSTREAM

XSTREAM is a tool for rapidly identifying and modeling the architecture of fundamental Tandem Repeats (TRs) in protein sequences. Due to the general nature of TRs, however, any sequence including DNA (or even numbers!) can be processed

Metagenome analysis pipeline: 3

GenomeAssemblyAnnotation

GenomeAssemblyAnnotationPipe is a pipeline for genome assembly and annotation.

ngsMetaAssembly

ngsMetaAssembly is a pipeline for metagenome assembly and annotation.

simpleMetagenomeAnalysis

simpleMetagenomeAnalysis is a pipeline for Metagenome Annotation.

Assembly tools: 25

SOAPdenovo2

SOAPdenovo is a novel short-read assembly method that can build a de novo draft assembly for human-sized genomes. The program is specially designed to assemble Illumina GA short reads. It creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost-effective way

SPAdes

SPAdes (St. Petersburg genome assembler) is an assembly toolkit containing various assembly pipelines.

MetaVelvet

Velvet is a de novo genomic assembler specially designed for short read sequencing technologies, such as Solexa or 454, developed by Daniel Zerbino and Ewan Birney at the European Bioinformatics Institute (EMBL-EBI), near Cambridge, in the United Kingdom. Velvet currently takes in short read sequences, removes errors, then produces high quality unique contigs. It then uses paired-end read and long read information, when available, to retrieve the repeated areas between contigs

ALLPATH-LG

Tool_ALLPATH_LG

Meta-IDBA

Tool_MetaIDBA

MEGAHIT

MegaHit is an NGS de novo assembler for assembling large and complex metagenomics data in a time- and cost-efficient manner. It finished assembling a soil metagenomics dataset with 252 Gbps in 44.1 and 99.6 h on a single computing node with and without a graphics processing unit, respectively. MegaHit assembles the data as a whole, i.e. no pre-processing like partitioning and normalization was needed. When compared with previous methods on assembling the soil data, MegaHit generated a three-time larger assembly, with longer contig N50 and average contig length; furthermore, 55.8% of the reads were aligned to the assembly, giving a fourfold improvement

RayMeta

Ray is a parallel de novo genome assembler that utilises the message-passing interface everywhere and is implemented using peer-to-peer communication

CANU

Canu is a fork of the Celera Assembler designed for high-noise single-molecule sequencing (such as the PacBio RSII or Oxford Nanopore MinION)

CAP3

The program has a capability to clip 5′ and 3′ low-quality regions of reads. It uses base quality values in computation of overlaps between reads, construction of multiple sequence alignments of reads, and generation of consensus sequences. The program also uses forward–reverse constraints to correct assembly errors and link contigs

SSPACE

SSPACE is not a de novo assembler; it is used after a preassembled run. SSPACE is a script to extend and scaffold preassembled contigs using a number of mate pairs or paired-end libraries

OPERA

OPERA (Optimal Paired-End Read Assembler) is a sequence assembly program. It uses information from paired-end/mate-pair/long reads to order and orient the intermediate contigs/scaffolds assembled in a genome assembly project, in a process known as Scaffolding. OPERA is based on an exact algorithm that is guaranteed to minimize the discordance of scaffolds with the information provided by the paired-end/mate-pair/long reads (for further details see Gao et al, 2011)

QUAST

QUAST stands for QUality ASsessment Tool. The tool evaluates genome assemblies by computing various metrics
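One of the metrics QUAST reports is N50: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The sketch below shows that definition directly; it is not QUAST's own code.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:  # this contig pushes the cumulative sum past 50%
            return length
    raise ValueError("cannot compute N50 of an empty assembly")
```

For contigs of 100, 80, 60, 40 and 20 bp the total is 300 bp; the two longest contigs already cover 180 bp, so the N50 is 80.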

REAPR

REAPR is a tool that evaluates the accuracy of a genome assembly using mapped paired-end reads, without the use of a reference genome for comparison. It can be used at any stage of an assembly pipeline to automatically break incorrect scaffolds and flag other errors in an assembly for manual inspection. It reports misassemblies and other warnings, and produces a new broken assembly based on the error calls

CheckM

CheckM provides a set of tools for assessing the quality of genomes recovered from isolates, single cells, or metagenomes

BUSCO

BUSCO assessments are implemented in open-source software, with a large selection of lineage-specific sets of Benchmarking Universal Single-Copy Orthologs. These conserved orthologs are ideal candidates for large-scale phylogenomics studies, and the annotated BUSCO gene models built during genome assessments provide a comprehensive gene predictor training set for use as part of genome annotation pipelines

cufflinks

Transcriptome assembly and differential expression analysis for RNA-Seq

StringTie

StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. It uses a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus. Its input can include not only the alignments of raw reads used by other transcript assemblers, but also alignments of longer sequences that have been assembled from those reads

Cuffdiff

Comparing expression levels of genes and transcripts in RNA-Seq experiments is a hard problem. Cuffdiff is a highly accurate tool for performing these comparisons, and can tell you not only which genes are up- or down-regulated between two or more conditions, but also which genes are differentially spliced or are undergoing other types of isoform-level regulation

Sailfish

Sailfish is a tool for transcript quantification from RNA-seq data. It requires a set of target transcripts (either from a reference or de novo assembly) to quantify. All you need to run Sailfish is a FASTA file containing your reference transcripts and a (set of) FASTA/FASTQ file(s) containing your reads. Sailfish runs in two phases: indexing and quantification. The indexing step is independent of the reads, and only needs to be run once for a particular set of reference transcripts and choice of k (the k-mer size).

Kallisto

kallisto is a program for quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads

DESeq2

Estimate variance-mean dependence in count data from high-throughput sequencing assays and test for differential expression based on a model using the negative binomial distribution

Ballgown

Tools for statistical analysis of assembled transcriptomes, including flexible differential expression analysis, visualization of transcript structures, and matching of assembled transcripts to annotation

Trinity

Trinity, developed at the Broad Institute and the Hebrew University of Jerusalem, represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data. Trinity combines three independent software modules: Inchworm, Chrysalis, and Butterfly, applied sequentially to process large volumes of RNA-seq reads

Oases

Oases is a de novo transcriptome assembler designed to produce transcripts from short read sequencing technologies, such as Illumina, SOLiD, or 454 in the absence of any genomic assembly

SOAPdenovo-Trans

SOAPdenovo-Trans is a de novo transcriptome assembler based on the SOAPdenovo framework, adapted to alternative splicing and differing expression levels among transcripts. The assembler provides a more accurate, complete and faster way to construct full-length transcript sets.

Genome structural analysis tools: 11

PILER-CR

PILER-CR is public domain software for finding CRISPR repeats

minced

MinCED is a program to find Clustered Regularly Interspaced Short Palindromic Repeats (CRISPRs) in full genomes or environmental datasets such as assembled contigs from metagenomes

tRNAscan-SE

tRNAscan-SE identifies transfer RNA genes in genomic DNA or RNA sequences

RNAmmer

The RNAmmer 1.2 server predicts 5s/8s, 16s/18s, and 23s/28s ribosomal RNA in full genome sequences

Prodigal

Prodigal is a protein-coding gene prediction software tool for bacterial and archaeal genomes. The acronym stands for PROkaryotic DYnamic Programming Genefinding Algorithm

Glimmer

Glimmer is a system for finding genes in microbial DNA, especially the genomes of bacteria, archaea, and viruses. Glimmer (Gene Locator and Interpolated Markov ModelER) uses interpolated Markov models (IMMs) to identify the coding regions and distinguish them from noncoding DNA

GeneMark

A family of gene prediction programs, including gene prediction in bacteria, archaea, metagenomes and metatranscriptomes; gene prediction in eukaryotes; and gene prediction in transcripts

FragGeneScan

FragGeneScan is an application for finding (fragmented) genes in short reads. It can also be applied to predict prokaryotic genes in incomplete assemblies or complete genomes

XSTREAM

PRISM

PRISM is software for split-read mapping (split reads are reads that span a structural variant, SV) and for SV calling from the mapping result. PRISM is able to detect small insertions and arbitrary-size deletions, inversions and tandem duplications using the direction of discordant read pairs

LUMPY

A probabilistic framework for structural variant discovery

Genome annotation tools: 4

Prokka

Whole genome annotation is the process of identifying features of interest in a set of genomic DNA sequences, and labelling them with useful information. Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files

DFAST

DFAST is a flexible and customizable pipeline for prokaryotic genome annotation as well as data submission to the INSDC. It is originally developed as the background engine for the DFAST web service and is also available as a stand-alone command-line tool

InterProScan

InterPro is a database which integrates together predictive information about proteins' function from a number of partner resources, giving an overview of the families that a protein belongs to and the domains and sites it contains

PfamScan

PfamScan searches protein sequences against the Pfam library of profile hidden Markov models (HMMs)

Community profiling tools: 20

QIIME

Obtaining and importing data; demultiplexing sequences; sequence quality control and feature table construction; generating a tree for phylogenetic diversity analyses

LEfSe

PICRUSt

PICRUSt (pronounced “pie crust”) is a bioinformatics software package designed to predict metagenome functional content from marker gene (e.g., 16S rRNA) surveys and full genomes

MetaCV

MetaCV is a composition- and phylogeny-based algorithm that classifies very short metagenomic reads (75-100 bp) into specific taxonomic and functional groups. On simulated short reads, MetaCV performs as well as BlastX-based methods in both sensitivity and specificity but runs 300 times faster, and thus provides effective and efficient analysis of large amounts of NGS data

k-SLAM

k-SLAM is a program for alignment-based metagenomic analysis of large sets of high-throughput sequence data. k-SLAM uses a k-mer based technique to rapidly find alignments between reads and genomes, which are then validated using the Smith-Waterman algorithm. Alignments are chained together into a pseudo-assembly to increase specificity
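
The validation step mentioned above can be illustrated with a minimal Smith-Waterman sketch. It is not k-SLAM's implementation: the function name, the linear gap penalty, and the scoring values (match 2, mismatch -1, gap -2) are assumptions chosen for illustration, and only the best local alignment score is returned.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions, not k-SLAM defaults.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Keeping only two rows of the table is enough when just the score is needed; recovering the alignment itself would require the full table and a traceback.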

Kaiju

Kaiju is a program for sensitive taxonomic classification of high-throughput sequencing reads from metagenomic whole genome sequencing or metatranscriptomics experiments

Centrifuge

Centrifuge is a novel microbial classification engine that enables rapid, accurate, and sensitive labeling of reads and quantification of species on desktop computers. The system uses a novel indexing scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, optimized specifically for the metagenomic classification problem. Centrifuge requires a relatively small index (5.8 GB for all complete bacterial and viral genomes plus the human genome) and classifies sequences at a very high speed, allowing it to process the millions of reads from a typical high-throughput DNA sequencing run within a few minutes. Together these advances enable timely and accurate analysis of large metagenomics data sets on conventional desktop computers
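
To make the indexing idea concrete, here is a toy sketch of a Burrows-Wheeler transform and FM-index style backward search that counts exact occurrences of a pattern. It is not Centrifuge's code: the function names are hypothetical, the BWT is built by sorting all rotations (only practical for tiny inputs), and occurrence counts are recomputed by scanning rather than stored in a compressed table.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy-sized inputs only)."""
    text += "$"  # sentinel that sorts before every letter
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def fm_count(bwt_str, pattern):
    """Count exact occurrences of pattern in the original text by backward search."""
    # first_row[c]: number of characters in the text that sort before c
    first_row, total = {}, 0
    for c in sorted(set(bwt_str)):
        first_row[c] = total
        total += bwt_str.count(c)
    lo, hi = 0, len(bwt_str)  # current range of matching rows, half-open
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        lo = first_row[c] + bwt_str[:lo].count(c)
        hi = first_row[c] + bwt_str[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = bwt("banana")
    print(index, fm_count(index, "ana"))
```

A real index replaces the `count` scans with sampled rank tables, which is where the small memory footprint comes from.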

DUDes

A top-down taxonomic profiler for metagenomics

mOTU

Phylogenetic marker genes are suitable for reconstructing the evolutionary history of organisms and for profiling the taxonomic composition of environmental samples. For this purpose, a set of 40 protein-coding phylogenetic marker genes (MGs) has been identified. In the vast majority of known organisms, these 40 MGs occur in single copy, and they have recently been used to delineate prokaryotic organisms at the species level. Due to these properties, they can be used to detect and accurately quantify not only known species, but also those that still lack genomic information. Based on a subset of these MGs that are suitable for shotgun sequencing data, we developed a method for taxonomic composition profiling of environmental samples

StrainEst

StrainEst is a novel, reference-based method that uses the Single Nucleotide Variants (SNV) profiles of the available genomes of selected species to determine the number and identity of coexisting strains and their relative abundances in mixed metagenomic samples

Mash

Estimate the distance of each query sequence to the reference.

sourmash

sourmash is a command-line tool and Python library for computing MinHash sketches from DNA sequences, comparing them to each other, and plotting the results. This allows you to estimate sequence similarity between even very large data sets quickly and accurately
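
The MinHash idea can be sketched in a few lines. This is an illustration rather than sourmash's API or hashing scheme: the function names, the use of SHA-1, and the tiny `k` and sketch size are assumptions, and reverse complements are ignored. Sequences are assumed to be at least `k` characters long.

```python
import hashlib


def kmers(seq, k):
    """Set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sketch(seq, k=4, size=8):
    """Bottom-k MinHash sketch: the `size` smallest k-mer hash values."""
    hashes = {
        int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")
        for kmer in kmers(seq, k)
    }
    return sorted(hashes)[:size]


def jaccard_estimate(sketch_a, sketch_b, size=8):
    """Estimate the Jaccard similarity of the two underlying k-mer sets."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]  # bottom-k of the union
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


if __name__ == "__main__":
    a = sketch("ACGTACGTTGCAACGTTGCA")
    b = sketch("ACGTACGTTGCAACGTAGCA")
    print(jaccard_estimate(a, b))
```

Because only a fixed number of hash values is kept per sequence, comparison cost is independent of sequence length, which is what makes the approach practical for very large data sets.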

MetaPhlAn2

MetaPhlAn is a computational tool for profiling the composition of microbial communities (Bacteria, Archaea, Eukaryotes and Viruses) from metagenomic shotgun sequencing data (i.e. not 16S) at species-level resolution. With the newly added StrainPhlAn module, it is now possible to perform accurate strain-level microbial profiling

HUMAnN2

HUMAnN2 (the HMP Unified Metabolic Analysis Network) is a method for efficiently and accurately determining the presence, absence, and abundance of metabolic pathways in a microbial community from metagenomic or metatranscriptomic sequencing data. It is appropriate for any type of microbial community shotgun sequence profiling

CONCOCT

CONCOCT “bins” metagenomic contigs. Metagenomic binning is the process of clustering sequences into clusters corresponding to operational taxonomic units of some level

MaxBin2

MaxBin is software for binning assembled metagenomic sequences based on an Expectation-Maximization algorithm. Users can explore the underlying bins (genomes) of the microbes in their metagenomes by simply providing assembled metagenomic sequences together with either read coverage information or the sequencing reads

MetaBAT2

These tools are easy to use and identify genome bins automatically

AbundanceBin

AbundanceBin is an abundance-based tool for binning metagenomic sequences, such that the reads classified in a bin belong to species of identical or very similar abundances. AbundanceBin also gives estimates of species abundances and their genome sizes, two important characteristic parameters for a microbial community

VirFinder

R package for identifying viral sequences from metagenomic data using sequence signatures

VirHostMatcher

Matches viruses to their hosts based on oligonucleotide frequency (ONF) comparison
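
As a rough illustration of oligonucleotide frequency comparison, the sketch below builds k-mer frequency vectors and picks the host whose vector is closest to the virus. The function names, `k=2`, and the Manhattan distance are assumptions made for brevity; they stand in for, and are not, the dissimilarity measures VirHostMatcher itself uses.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=2):
    """Relative frequency of every possible DNA k-mer, in a fixed order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return [counts["".join(p)] / total for p in product("ACGT", repeat=k)]


def manhattan(u, v):
    """Sum of absolute differences between two frequency vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def closest_host(virus, hosts, k=2):
    """Name of the host whose k-mer profile is nearest to the virus profile."""
    virus_profile = kmer_frequencies(virus, k)
    return min(
        hosts,
        key=lambda name: manhattan(virus_profile, kmer_frequencies(hosts[name], k)),
    )


if __name__ == "__main__":
    # Made-up toy sequences, purely for demonstration.
    hosts = {"at_rich": "ATATATATTTAAATAT", "gc_rich": "GCGCGGCCGCGCCGGC"}
    print(closest_host("ATTATAATATTA", hosts))
```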

Comparative analysis tools: 13

orthoANI

Calculates the orthologous average nucleotide identity (OrthoANI) between two genome sequences

CD-hit

CD-HIT stands for Cluster Database at High Identity with Tolerance. The program (cd-hit) takes a fasta format sequence database as input and produces a set of 'non-redundant' (nr) representative sequences as output. In addition cd-hit outputs a cluster file, documenting the sequence 'groupies' for each nr sequence representative
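
The representative-plus-cluster output described above comes from a greedy incremental scheme, which can be sketched as follows. This is not CD-HIT's implementation: the function names and the 0.9 threshold are illustrative, and identity is approximated with Python's `difflib` matching blocks instead of CD-HIT's short-word filter and alignment.

```python
from difflib import SequenceMatcher


def identity(rep, seq):
    """Approximate identity: matched characters divided by the shorter length."""
    blocks = SequenceMatcher(None, rep, seq, autojunk=False).get_matching_blocks()
    return sum(block.size for block in blocks) / min(len(rep), len(seq))


def greedy_cluster(seqs, threshold=0.9):
    """Map each representative name to the names of its cluster members.

    Sequences are processed longest first; each one joins the first
    representative it matches at or above the threshold, otherwise it
    becomes a new representative.
    """
    clusters = {}
    for name in sorted(seqs, key=lambda n: len(seqs[n]), reverse=True):
        for rep in clusters:
            if identity(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


if __name__ == "__main__":
    # Made-up toy sequences, purely for demonstration.
    seqs = {"a": "ACGTACGTACGT", "b": "ACGTACGTACGA", "c": "TTTTGGGGCCCC"}
    print(greedy_cluster(seqs))
```

Processing the longest sequences first means every cluster is represented by its longest member, matching the 'non-redundant' representative set the description refers to.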

MUMmer

MUMmer is a system for rapidly aligning entire genomes, whether in complete or draft form

BWA

BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. BWA-MEM and BWA-SW share similar features such as long-read support and split alignment, but BWA-MEM, which is the latest, is generally recommended for high-quality queries as it is faster and more accurate.

Bowtie2

Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes

samtools

The index command creates a new index file that allows fast look-up of data in a (sorted) SAM or BAM

BLAST

Searches a nucleotide query against a nucleotide database (version: latest)

BLAT

BLAT (BLAST-Like Alignment Tool) is a legacy tool for sequence alignment that is not under active development.

diamond

DIAMOND is a new high-throughput program for aligning DNA reads or protein sequences against a protein reference database

STAR

STAR (Spliced Transcripts Alignment to a Reference) is an ultrafast universal RNA-seq aligner. It not only can perform unbiased de novo detection of canonical junctions, but also can discover non-canonical splices and chimeric (fusion) transcripts, and is capable of mapping full-length RNA sequences.

Tophat2

Tool_TopHat

hisat2

HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of human genomes (as well as to a single reference genome)

blasr

The method BLASR (Basic Local Alignment with Successive Refinement) was used to map Single Molecule Sequencing (SMS) reads that are thousands of bases long, with divergence between the read and genome dominated by insertion and deletion error. The method is benchmarked using both simulated reads and reads from a bacterial sequencing project

Contact us Address: No. 1 Beichen West Road, Chaoyang District, Beijing, Institute of Microbiology, Chinese Academy of Sciences Contact: 010-64806052 Email: nmdc@im.ac.cn QQ: 3415782117