T10 – Hands-on on Protein Function Prediction with Machine Learning and Interactive Analytics


Date: Saturday September 8, 2018

Time: 9:00 – 17:00

Venue: TBA



This tutorial addresses the functional annotation of proteins by means of Machine Learning. The hands-on sessions will be carried out within the Interactive Analytics framework Zeppelin, using different components of Apache Spark.



Understanding protein functions is crucial to unlocking the value of genomic data for biomedical research and innovation. Delivering personalized health and precision medicine requires a detailed understanding of the consequences of sequence variants in proteins and their impact on phenotype. The widening gap between the number of known proteins and the number with known functions has encouraged the development of methods to infer annotations automatically. Automatic functional annotation of proteins is expected to meet two conflicting requirements: maximizing annotation coverage while minimizing erroneous functional assignments. This trade-off poses a major challenge in designing intelligent automatic annotation systems.
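The trade-off between coverage and erroneous assignments can be made concrete with a small sketch. The code below is illustrative only and is not part of the tutorial material: the record layout, the function name `coverage_and_precision`, and the toy scores are all assumptions. It keeps only predictions whose confidence reaches a threshold, then reports the fraction of proteins that received at least one annotation (coverage) and the fraction of kept annotations that are correct (precision).

```python
from typing import List, Set, Tuple

# Hypothetical record layout: (protein_id, predicted_term, confidence, is_correct).
Prediction = Tuple[str, str, float, bool]


def coverage_and_precision(
    predictions: List[Prediction], n_proteins: int, threshold: float
) -> Tuple[float, float]:
    """Return (coverage, precision) when keeping predictions scored >= threshold."""
    kept = [p for p in predictions if p[2] >= threshold]
    # Coverage: share of all proteins that receive at least one annotation.
    annotated: Set[str] = {p[0] for p in kept}
    coverage = len(annotated) / n_proteins
    # Precision is undefined when nothing is kept; it is reported as 0.0 here.
    precision = sum(1 for p in kept if p[3]) / len(kept) if kept else 0.0
    return coverage, precision


# Made-up toy predictions, for illustration only.
toy_predictions: List[Prediction] = [
    ("P1", "term_A", 0.95, True),
    ("P2", "term_B", 0.80, True),
    ("P3", "term_A", 0.60, False),
    ("P4", "term_C", 0.40, True),
    ("P5", "term_B", 0.30, False),
]

for t in (0.9, 0.5, 0.0):
    cov, prec = coverage_and_precision(toy_predictions, n_proteins=5, threshold=t)
    print(f"threshold={t:.1f} coverage={cov:.2f} precision={prec:.2f}")
```

On this toy data, lowering the threshold annotates more proteins but lets more wrong assignments through, which is the tension described above.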

This topic has attracted growing interest in the computational community. For example, the Critical Assessment of Function Annotation (CAFA) is a challenge that aims to assess computational methods for protein function prediction. Research groups submit their models to this challenge with the aim of providing broader and more accurate predictions of protein functions.

Artificial intelligence and machine learning hold a large repertoire of algorithms and methodologies to discover and infer prediction models. Coupled with the new big data technologies for interactive analytics and data transformation, the AI/ML methods represent valuable assets that could enhance the discovery of protein functions.

In addition to understanding the importance of protein data and the motivation behind functional annotation, attendees will learn to use two emerging technologies in the field of Big Data, namely Apache Spark and Apache Zeppelin. Spark is a fast and general engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. Spark can be used within the Interactive Analytics framework Zeppelin and coupled with other backend languages and tools to provide deeper insights. This tutorial will help you understand how to use Spark and Interactive Analytics to make sense of protein data and build Machine Learning models to infer protein functions.
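As a rough idea of what a data transformation step on protein data might look like, the sketch below turns an amino acid sequence into k-mer counts, a common way to obtain numeric features from sequences. This announcement does not say which features the tutorial uses, so the featurization, the function name `kmer_counts` and the example sequence are assumptions. The code is plain Python; in a Spark job a function of this kind could be applied to each record, but that wiring is left out here.

```python
from collections import Counter
from typing import Dict


def kmer_counts(sequence: str, k: int = 2) -> Dict[str, int]:
    """Count overlapping substrings of length k in a protein sequence."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    seq = sequence.upper()
    # Slide a window of width k over the sequence, one residue at a time.
    # Sequences shorter than k yield no windows and an empty result.
    return dict(Counter(seq[i:i + k] for i in range(len(seq) - k + 1)))


# Made-up short sequence, for illustration only.
features = kmer_counts("MKMK", k=2)
print(features)
```

The resulting dictionaries can be mapped onto a fixed vocabulary of k-mers to give each protein a feature vector of the same length.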

Proposed length

1 day (four 1:30h slots)


  1. Protein data (30min)
    1. Labelled and unlabelled data
    2. Data sources
  2. Functional annotation of proteins (30min)
    1. Motivation
    2. Challenges
  3. Data transformation with Spark (30min)
  4. Interactive analytics with Zeppelin (30min)
  5. Hands-on: Machine learning for protein annotation (4:00h)
    1. Environment setup (Zeppelin + Spark)
    2. Preprocessing and data transformation
    3. Creation and application of prediction models with Spark/MLlib


Target audience

Typical attendees for this tutorial are researchers and practitioners from industry and academia, developers, graduate and senior undergraduate students, and university faculty who want to leverage Machine Learning and Big Data technologies to build Analytics Workflows on Bioinformatics data.


Motivation for attendees

This tutorial has two main aims. The first is to draw the attention of AI and Bioinformatics scholars and practitioners to one of the important questions in Bioinformatics, a field where data volumes are growing rapidly. The second is to practise Machine Learning with a Big Data mindset, i.e., data transformation and interactive analytics. The tutorial focuses mainly on practice: ⅔ of it is dedicated to hands-on work. Attendees will learn how to analyse, transform and visualise data, discover insights, and build models to predict protein functions.



Presenter 1: Rabie Saidi


Rabie Saidi, PhD

Lead Data Scientist

Protein Function Development Team (UniProt)

European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK

Email: rsaidi [at] ebi.ac.uk, Tel: +44 (0) 1223 49 4106


ORCID Publication List



Dr. Rabie Saidi obtained his PhD in Computer Science from Blaise Pascal University. He is currently Lead Data Scientist in the UniProt team at EMBL-EBI (Cambridge, UK), where he conducts research and development in the intersection of data mining, big data, machine learning and bioinformatics. He is responsible for adapting and creating new technologies for descriptive and predictive purposes in Bioinformatics. His main activities are centred on smart data-driven solutions for the annotation of proteins in UniProtKB, the largest knowledgebase of protein data.


Scientific activities and responsibilities

  • Member of the Organisation Committee of “Conference d’Apprentissage artificiel” CAP 2010, held in Clermont-Ferrand (a machine learning conference). http://cap10.isima.fr/
  • Member of the Organisation Committee of “Advances in Bioinformatics and Artificial Intelligence: Bridging the Gap” BAI 2015, held in Buenos Aires within IJCAI 2015 (a Bioinformatics and artificial intelligence workshop). http://bioinfo.uqam.ca/IJCAI_BAI2015/
  • Reviewer in several scientific journals and
  • Teaching experience at Blaise Pascal University (2010 – 2012): Data Mining, Machine Learning, Data Analysis
  • In his current position at the EBI, he hosts and supervises MSc and PhD students who are learning and developing Machine Learning software to analyse Big Bioinformatics data


Selected Publications

  • Saidi R, Boudellioua I, Martin MJ, Solovyev V. Rule Mining Techniques to Predict Prokaryotic Metabolic Pathways. Methods Mol Biol. 2017;1613:311-331. doi:10.1007/978-1-4939-7027-8_12. PMID:
  • Ison J et al. Tools and data services registry: a community effort to document bioinformatics resources. Nucleic Acids Res. 2016 Jan;44(D1):D38-47. doi:10.1093/nar/gkv1116. PMID: 26538599; PMCID:
  • Slim Bouker, Rabie Saidi, Sadok Ben Yahia, and Engelbert Mephu Nguifo. Mining Undominated Association Rules Through Interestingness Measures. International Journal on Artificial Intelligence Tools 2014 23:04
  • Dhifli Wajdi, Saidi Rabie, and Nguifo Engelbert Mephu. Smoothing 3D Protein Structure Motifs Through Graph Mining and Amino Acid Similarities. Journal of Computational Biology. January 2014, 21(2): 162-172. https://doi.org/10.1089/cmb.2013.0092.


Presenter 2: Tunca Dogan


Tunca Dogan, PhD

Adjunct faculty member

Department of Health Informatics, Graduate School of Informatics, METU, 06800 Ankara, Turkey

Email: tdogan@metu.edu.tr


Research fellow

Protein Function Development Team (UniProt)

European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK

Email: tdogan@ebi.ac.uk


ORCID Publication List



Dr Dogan completed his BSc and MSc studies at METU, Faculty of Engineering. He was introduced to computational biology and bioinformatics during his PhD studies in 2010. He received his joint PhD degree in 2013 from the interdisciplinary Bioengineering program, hosted by the Electrical-Electronics Engineering Department at Izmir Institute of Technology and the Graduate School of Health Sciences at Dokuz Eylul University, Turkey. Dr Dogan served as a post-doctoral researcher at the European Bioinformatics Institute (EMBL-EBI) in the UK between 2013 and 2016. He is currently an adjunct faculty member and a senior researcher at the Department of Health Informatics, METU. He also holds a research fellow position in the Protein Function Development team at EMBL-EBI. His research focuses on developing novel computational methods for biomolecular sequence analysis, protein function prediction and computational drug discovery, using statistical learning, data mining, machine learning and graph theory approaches.

Selected Publications in Peer-Reviewed Journals

  • Rifaioglu, A.S., Doğan, T., et al. (2017). Large-scale automated function prediction of protein sequences and an experimental case study validation on PTEN transcript variants. Proteins: Structure, Function, and Bioinformatics, doi:10.1002/prot.25416.
  • Rifaioglu, A.S., Doğan, T., et al. (2017). Multi-task Deep Neural Networks in Automated Protein Function Prediction. arXiv preprint arXiv:1705.04802.
  • UniProt Consortium (2017). UniProt: the universal protein knowledgebase. Nucleic acids research, 45(D1), D158-D169.
  • Dogan, T., et al. (2016). UniProt-DAAC: Domain Architecture Alignment and Classification, a New Method for Automatic Functional Annotation in UniProtKB. Bioinformatics, 32(15): 2264-2271.
  • Dogan, T. and Karaçalı, B. (2013). Automatic Identification of Highly Conserved Family Regions and Relationships in Genome Wide Datasets Including Remote Protein Sequences. PLoS ONE 8(9): e75458. doi:10.1371/journal.pone.0075458.


Selected Oral Presentations in Peer-reviewed Scientific Conferences

  • Dogan, T., et al. (2017). Computational Prediction of Novel Drug Candidate Compound – Target Protein Interactions and Their Verification on PI3K/AKT/mTOR Signalling Pathway. GLBIO 2017: Great Lakes Bioinformatics Conference, May 15-17, 2017, Chicago,
  • Rifaioğlu, A., Dogan, T., et al. (2015). UniGOPred and ECPred: Automated Function Prediction Tools Based on A Combination of Different Classifiers. AFP-CAFA SIG, ISMB/ECCB 2015: 23rd Annual International Conference on Intelligent Systems for Molecular Biology, July 10-14, 2015, Dublin, Republic of
  • Dogan, T. & Karaçalı, B. (2013). 2-D Thresholding of the Connectivity Map Following the Multiple Sequence Alignments of Diverse Datasets. The 10th IASTED International Conference on Biomedical Engineering, February 13-15, 2013, Innsbruck, Austria. DOI:10.2316/P.2013.791-092.


Professional Service

  • Conference organization:
    • Co-chair of “HIBIT-2017: 10th International Symposium on Health Informatics and Bioinformatics”, June 28 – 30, 2017, METU NCC, Northern Cyprus (http://hibit2017.ii.metu.edu.tr).
  • Reviewer for “Bioinformatics”, “BMC Bioinformatics”, “PROTEINS: Structure, Function, and Bioinformatics” and “Molecular Biosystems”

Teaching Experience

  • Course instruction experience:
    • BIN503: Biological Databases & Data Analysis Tools – METU, Bioinformatics MSc (Semesters: Fall 2016, Spring 2017, Fall 2017)
    • BIN590: Graduate Seminar in Bioinformatics – METU, Bioinformatics MSc (Semesters: Spring 2017)
  • Student supervision experience:
    • 3 MSc students in Bioinformatics Program, METU (1 graduated, 2 ongoing)
    • 1 PhD student in Medical Informatics Program, METU (ongoing)