T10 – Hands-on on Protein Function Prediction with Machine Learning and Interactive Analytics

Short Description

This tutorial addresses the functional annotation of proteins by means of Machine Learning. The hands-on session will be carried out within the Interactive Analytics framework Zeppelin, using different components of Apache Spark.

Understanding protein functions is crucial to unlocking the value of genomic data for biomedical research and innovation. Delivering personalized health and precision medicine requires a detailed understanding of the consequences of sequence variants in proteins and their impact on phenotype. The widening gap between known proteins and their functions has encouraged the development of methods to automatically infer annotations. Automatic functional annotation of proteins is expected to meet the conflicting requirements of maximizing annotation coverage while minimizing erroneous functional assignments. This trade-off imposes a great challenge in designing intelligent automatic annotation systems.
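The coverage-versus-error trade-off described above can be made concrete with a small sketch. The following Python example is only an illustration under stated assumptions: the protein identifiers, function labels, scores, and the `precision_recall` helper are all hypothetical and are not taken from the tutorial material. It shows how lowering the confidence threshold of a predictor assigns more annotations (higher recall, i.e. coverage) at the cost of more erroneous ones (lower precision).

```python
from typing import Dict, Set, Tuple

# Hypothetical toy data: confidence scores for (protein, function) pairs
# produced by some predictor, and the set of experimentally known annotations.
predictions: Dict[Tuple[str, str], float] = {
    ("P1", "kinase"): 0.9,
    ("P1", "transporter"): 0.3,
    ("P2", "kinase"): 0.6,
    ("P2", "hydrolase"): 0.8,
    ("P3", "hydrolase"): 0.4,
}
truth: Set[Tuple[str, str]] = {
    ("P1", "kinase"),
    ("P2", "hydrolase"),
    ("P3", "hydrolase"),
}


def precision_recall(threshold: float) -> Tuple[float, float]:
    """Return (precision, recall) when only pairs scoring >= threshold are assigned."""
    assigned = {pair for pair, score in predictions.items() if score >= threshold}
    correct = assigned & truth
    # With no assignments there are no erroneous ones, so precision is taken as 1.0.
    precision = len(correct) / len(assigned) if assigned else 1.0
    recall = len(correct) / len(truth)
    return precision, recall


if __name__ == "__main__":
    for t in (0.7, 0.5, 0.2):
        p, r = precision_recall(t)
        print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, a strict threshold yields few but correct assignments, while a permissive one recovers every known annotation together with some wrong ones. A real annotation system would have to pick an operating point between these extremes.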

This topic has attracted growing interest in the computational community. For example, the Critical Assessment of Function Annotation (CAFA) is a challenge that aims to assess computational methods for protein function prediction. Research groups submit their models to this challenge with the aim of providing broader yet more accurate predictions of protein functions.

Artificial intelligence and machine learning offer a large repertoire of algorithms and methodologies to discover and infer prediction models. Coupled with new big data technologies for interactive analytics and data transformation, AI/ML methods are valuable assets that could enhance the discovery of protein functions.

In addition to understanding the importance of protein data and the motivation behind functional annotation, attendees will learn to use two emerging technologies in the field of Big Data, namely Apache Spark and Apache Zeppelin. Spark is a fast and general engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. Spark can be used within the Interactive Analytics framework Zeppelin and coupled with other backend languages and tools to provide deeper insights. This tutorial will help you understand how to use Spark and Interactive Analytics to make sense of protein data and build Machine Learning models to infer protein functions.