T13 – Modern and scalable tools for efficient analysis of very large metagenomic datasets


Date: Saturday September 8, 2018

Time: 9:00 – 17:00

Venue: Stavros Niarchos Foundation Cultural Center

Room: A_Computer Room & Maker Space



This tutorial is aimed at bioinformatics practitioners with experience in command line usage and scripting who are interested in learning about powerful tools for the efficient analysis of even very large metagenomic datasets. More than 50% of this workshop will consist of hands-on exercises.

The amount of data generated by metagenomics is growing rapidly, making data analysis the main bottleneck on the way to novel biological insights. The goal of this tutorial is to introduce modern bioinformatics tools and pipeline construction methods that will enable you to cope efficiently with the enormous amount of metagenomic data through modular, reproducible, workflow-based analysis.

We will first give a comprehensive summary of metagenomic tools for assembly, binning and taxonomic profiling by reviewing the results of the CAMI challenge [1]. This should give you a sense of which tools best fit your own projects. We will then introduce the Common Workflow Language (CWL [2]), which allows you to build reproducible and flexible metagenomic workflows.

In the afternoon session, we will train you in efficient metagenomic data analysis at the protein level using the MMseqs2 software suite [3]. The exercises will cover topics including efficient protein-level assembly, ultra-fast ORF clustering, sensitive homology search, and building goal-specific custom pipelines. Through hands-on exercises, you will learn how to build your own efficient workflows in MMseqs2 by combining its various modules.

[1]: Sczyrba et al. (2017). Critical Assessment of Metagenome Interpretation—a benchmark of metagenomics software. Nature Methods, 14(11), 1063, https://www.nature.com/articles/nmeth.4458.

[2]: Amstutz et al. (2016). Common Workflow Language, v1.0. Specification, Common Workflow Language working group. https://w3id.org/cwl/v1.0/ doi:10.6084/m9.figshare.3115156.v2

[3]: Steinegger, M., & Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35, 1026–1028, https://www.nature.com/articles/nbt.3988.


Provisional Schedule

Time | Topic | Instructors
09:00 – 09:15 | Intro to shotgun metagenomics & applications | Sczyrba
09:15 – 09:45 | Critical Assessment of Metagenome Interpretation: Results of the 1st CAMI Challenge | Sczyrba
09:45 – 10:30 | CWL Introduction, Metagenomics Pipeline | Henke
10:30 – 11:00 | Coffee break
11:00 – 12:30 | Hands-on: Building your own CWL pipeline | Henke &
12:30 – 13:30 | Lunch
13:30 – 14:00 | MMseqs2 principles & algorithms | Söding
14:00 – 15:00 | Hands-on: standard workflows in MMseqs2 (assembly, clustering, annotation) | Mirdita, Galiez, Söding
15:00 – 15:30 | Coffee break
15:30 – 17:00 | Hands-on: expert tools, how to build custom workflows in MMseqs2 (e.g. abundance analysis) | Mirdita, Galiez, Söding
17:00 | End of workshop

Intended audience and possible prerequisites

Bioinformaticians experienced in command line usage and basic scripting.

Material or infrastructure required

You will need to bring your own notebook with an SSH client installed (Linux and macOS systems already have one; on Windows you can install one, e.g. PuTTY).


For the morning session, you will use the de.NBI Cloud infrastructure.

For the afternoon session, you will run the software either on your own notebook (preferred) or on the de.NBI Cloud. To use your own notebook, it needs VirtualBox installed, at least 8 GB of RAM and 12 GB of free disk space.


Contact information for the organizers

Alexander Sczyrba (asczyrba@cebitec.uni-bielefeld.de, Bielefeld University, Germany)
Christian Henke (chenke@cebitec.uni-bielefeld.de, Bielefeld University, Germany)
Clovis Galiez (clovis.galiez@mpibpc.mpg.de, MPI for Biophysical Chemistry, Germany)
Milot Mirdita (milot.mirdita@mpibpc.mpg.de, MPI for Biophysical Chemistry, Germany)
Johannes Soeding (soeding@mpibpc.mpg.de, MPI for Biophysical Chemistry, Germany)