Publication type: Article in scientific journal
Type of review: Peer review (publication)
Title: Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases
Authors: Tørresen, Ole K
Star, Bastiaan
Mier, Pablo
Andrade-Navarro, Miguel A
Bateman, Alex
Jarnot, Patryk
Gruca, Aleksandra
Grynberg, Marcin
Kajava, Andrey V
Promponas, Vasilis J
Anisimova, Maria
Jakobsen, Kjetill S
Linke, Dirk
et al.: No
DOI: 10.1093/nar/gkz841
Published in: Nucleic Acids Research
Volume(Issue): 47
Issue: 21
Page(s): 10994
Pages to: 11006
Issue Date: 4-Oct-2019
Publisher / Ed. Institution: Oxford University Press
ISSN: 0305-1048
Language: English
Subjects: Genomics; Bioinformatics
Subject (DDC): 572: Biochemistry
Abstract: The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with 'ready-to-use' deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where misannotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others.
Fulltext version: Published version
License (according to publishing contract): CC BY 4.0: Attribution 4.0 International
Department: Life Sciences and Facility Management
Organisational Unit: Institute of Computational Life Sciences (ICLS)
Published as part of the ZHAW project: Discovering evolutionary innovations by assessing variation and natural selection in protein tandem repeats
Appears in collections: Publikationen Life Sciences und Facility Management

Files in This Item:
File: 2019Toerresen_tandem-repeats-lead-to-sequence-assembly-errors_NucleidAcidsResearch.pdf
Size: 916.98 kB
Format: Adobe PDF
