Duccio Malinverni, Besian Sejdiu, Madan M. Babu (St. Jude Children's Research Hospital, Memphis TN, USA)
The advent of high-throughput next-generation sequencing technology has resulted in an astonishing growth of genomic sequence databases. It is now common to deal with protein families containing 10K-100K homologous sequences. Such massive sets, combined with the advent of novel deep-learning methods, have led to spectacular successes in fields ranging from structural bioinformatics to population genetics. However, exploring such large-scale datasets creates new challenges for traditional visualisation and inspection methods, such as phylogenetic trees or sequence alignment viewers. To circumvent their limitations, we present a web-based tool to interactively visualise, annotate, inspect and pre-process massive protein datasets. The proposed app relies on modern sequence embedding techniques. It supports the analysis of datasets comprising more than 100K sequences, visually reveals sub-structure in the data, and automatically incorporates annotations from several public databases. We present a use case of a protein family comprising 86,000 sequences and show how combining sequence embeddings, interactive visualisation, and automatic annotation can help exp