SANSparallel: interactive homology search against Uniprot

Panu Somervuo (1,2), Liisa Holm (1,2,*)

1 Institute of Biotechnology, University of Helsinki, PO Box 65, Finland

2 Department of Biosciences, University of Helsinki, PO Box 65, Finland

* To whom correspondence should be addressed. Tel: +358 294 191 59115; Fax: +358 294 59366; Email: liisa.holm@helsinki.fi

Nucleic Acids Research, Volume 43, Issue W1, 1 July 2015, Pages W24–W29, https://doi.org/10.1093/nar/gkv317

Received: 05 February 2015

Revision received: 18 March 2015

Accepted: 28 March 2015

Published: 08 April 2015

Abstract

Proteins evolve by mutations and natural selection. The network of sequence similarities is a rich source for mining homologous relationships that inform on protein structure and function. There are many servers available to browse the network of homology relationships but one has to wait up to a minute for results. The SANSparallel webserver provides protein sequence database searches with immediate response and professional alignment visualization by third-party software. The output is a list, pairwise alignment or stacked alignment of sequence-similar proteins from Uniprot, UniRef90/50, Swissprot or Protein Data Bank. The stacked alignments are viewed in Jalview or as sequence logos. The database search uses the suffix array neighborhood search (SANS) method, which has been re-implemented as a client-server, improved and parallelized. The method is extremely fast and as sensitive as BLAST above 50% sequence identity. Benchmarks show that the method is highly competitive compared to previously published fast database search programs: UBLAST, DIAMOND, LAST, LAMBDA, RAPSEARCH2 and BLAT. The web server can be accessed interactively or programmatically at http://ekhidna2.biocenter.helsinki.fi/cgi-bin/sans/sans.cgi. It can be used to make protein functional annotation pipelines more efficient, and it is useful in interactive exploration of the detailed evidence supporting the annotation of particular proteins of interest.

INTRODUCTION

Recent years have witnessed a remarkable growth in the number of sequences. This has made database searches (1–4) take longer and longer and forced free computing services and pre-computed databases to close down or resort to crowd-sourcing (5–7). SANSparallel is a web server that takes protein sequences as input and returns an approximate set of closest sequence neighbors in the blink of an eye. At the core of our web server is a fast database search engine that only takes a fraction of a second to compare a query protein against 90 million sequences in Uniprot (8). SANSparallel is a re-implemented, improved and parallelized version of our previous suffix array neighborhood search (SANS) algorithm (9). It belongs to a new generation of fast database search programs indexing the database so that short words (seeds) matching to the query can be found efficiently and independent of database size (10–15). Similar sequences can then be identified by seed extension or by counting how many seeds match one database protein. Suffix arrays bring the advantage that seed length can be adapted to increase selectivity. On the other hand, spaced seeds and reduced alphabets have been introduced to increase sensitivity (16). Programs implementing these techniques are orders of magnitude faster than BLAST. However, it is hard to match BLAST's sensitivity. These approaches are very suitable for mapping problems, where the match is very close and gives a clear signal. We have found previously that the approach works reliably in protein database searches above 50% sequence identity (9). Here, we present more benchmarking and show that SANSparallel is highly competitive in comparison with recently published programs.

MATERIALS AND METHODS

System architecture

SANSparallel runs as a client and a server. The server holds the database in memory and performs the search. We have a separate server for each database. Client processes connect to the server and transmit the query sequence to the server and the result to the user. Multiple clients can connect to the server. Concurrent clients are served one query at a time in round-robin fashion. From the users’ perspective this means that the time it takes to process a query increases linearly with server load, but all users experience similar speed. Linearity of response times was maintained up to at least 100 concurrent clients (data not shown).
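
The round-robin behaviour described above can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: every query is assumed to take the same fixed time, and the function and parameter names are hypothetical rather than taken from the SANSparallel source.

```python
from collections import deque


def round_robin_finish_times(queries_per_client, seconds_per_query=0.1):
    """Return {client: time at which its last query finishes}.

    Clients are served one query at a time in round-robin order,
    so each client waits for one query from every other active client.
    """
    queue = deque(
        (client, n) for client, n in enumerate(queries_per_client) if n > 0
    )
    clock = 0.0
    finished = {}
    while queue:
        client, remaining = queue.popleft()
        clock += seconds_per_query  # serve exactly one query for this client
        if remaining > 1:
            queue.append((client, remaining - 1))  # back of the line
        else:
            finished[client] = clock
    return finished
```

With four equally loaded clients, each one finishes roughly four times later than it would alone, and all four finish at nearly the same time, which is the linear and fair scaling the text describes.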

Underlying the web server is a CGI script which calls the client program with appropriate options and post-processes the database search results into the desired output format (Figure 1). Some processing steps use third-party software. The primary result from SANSparallel is a set of sequence-similar proteins retrieved from the database. Pairwise alignments between this set of sequences and the query sequence are generated using FASTA (17). The same program is used to output a BLAST-like report. The pairwise alignments are stacked against the query sequence, omitting insertions to generate gapped alignments. The stacked alignment can be colorized by Mview (18) or sent to Skylign (19) to generate a sequence logo. Aligned or unaligned sequences can be output in FASTA format and sent to Jalview (20) for alignment visualization and editing. Our server does not provide multiple sequence alignments as this can be very time consuming. Instead, multiple sequence alignments can be requested from Jalview Desktop's web service menu. The response of the server is immediate and no user data or results are stored on disk except for results viewed with the Jalview applet, which requires file input.
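
The stacking step can be sketched as follows. This is a minimal illustration, not the server's code: it assumes each pairwise alignment is available as a 0-based query start position plus two gapped strings of equal length, and the function name is hypothetical.

```python
def stack_on_query(query, pairwise):
    """Stack pairwise alignments on the query coordinate system.

    pairwise: list of (query_start, aligned_query, aligned_hit), where the
    two aligned strings have equal length and use '-' for gaps.
    Columns where the query has a gap (insertions in the hit) are omitted,
    so every output row has the same length as the query.
    """
    rows = [query]
    for start, aligned_query, aligned_hit in pairwise:
        row = ["-"] * len(query)  # unaligned query positions stay as gaps
        pos = start
        for q_char, h_char in zip(aligned_query, aligned_hit):
            if q_char == "-":
                continue  # insertion relative to the query: drop the column
            row[pos] = h_char
            pos += 1
        rows.append("".join(row))
    return rows
```

For example, a hit aligned as `MKTA-YIAK` / `MKSAGYLAK` contributes the row `MKSAYLAK`: the inserted `G` has no query column and is dropped.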

Figure 1.

Flowchart of the SANSparallel web server. Computations done by the web server are blue. Results sent to the user include textual outputs (green) and alignment visualizations (orange). Multiple alignment (instantiated from Jalview Desktop) and sequence logo computations utilize third party resources in the cloud (pink).

SANSparallel was developed in a Linux operating system and parallelized using openmpi. The web server runs on a cluster of computers with 500-Gb memory and 64 cores. SANSparallel was written in Fortran using legacy code from SANS (9), socket communications in C and the CGI script in Perl. Storage of the database in memory and additional work space take about 9 bytes per amino acid.

Database search algorithm

SANSparallel is a re-implemented, improved and parallelized version of the suffix array neighborhood search algorithm SANS (9). Briefly, the algorithm accumulates a vote for database proteins that are found within a window of the position where a suffix of the query sequence would be inserted in the suffix array of the database. Database proteins with the highest votes are collected and, optionally, aligned and resorted by the alignment score. The following changes were introduced: (i) A binary search to find the suffix array insertion position replaces the original mergesort. This enables searching single query sequences instead of the original batch processing. (ii) Votes are summed over diagonal bands rather than the whole protein. This improves selectivity. A similar strategy is used in the FASTA algorithm (17). (iii) Alignments are computed by dynamic programming in a diagonal band. This replaces the original program's greedy algorithm to combine high-scoring segment pairs. E-values are computed from the alignment score using Karlin–Altschul statistics (21). (iv) There is a positive but not perfect correlation between the vote and pairwise alignment score. An option was added to keep moving down the sorted list of database proteins until the Hth-best alignment score remains stable. This results in more closely similar hits in the output. (v) The program was parallelized using MPI (Message Passing Interface). We chose a micro parallelization strategy in order to achieve fast response times for a single query. One node is reserved for communication with the client. The other nodes are dedicated to the database search. Each node works on a section of the database. The database search nodes go into hibernation when traffic is low. Search speed increased linearly up to 8–16 nodes; above 32 nodes there was not enough work to match communication overheads (data not shown).
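
The voting idea, together with changes (i) and (ii), can be sketched on a toy scale. This is an illustrative simplification and not the authors' Fortran implementation: the suffix array is built naively, the suffixes are materialised as strings (quadratic memory, acceptable only for tiny inputs), and the window size, band width and all names are assumptions.

```python
from bisect import bisect_left
from collections import Counter


def build_index(proteins):
    """Concatenate proteins with '$' separators and sort all suffixes."""
    text, owner, offset = "", [], []
    for pid, seq in enumerate(proteins):
        for i in range(len(seq)):
            owner.append(pid)    # protein that this text position belongs to
            offset.append(i)     # position inside that protein
        owner.append(-1)         # the separator belongs to no protein
        offset.append(-1)
        text += seq + "$"
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    return text, suffix_array, owner, offset


def sans_votes(query, index, window=2, band=8):
    """Vote for proteins whose suffixes lie near each query suffix."""
    text, sa, owner, offset = index
    suffixes = [text[i:] for i in sa]  # materialised only to keep this short
    votes = Counter()
    for qpos in range(len(query)):
        # (i) binary search for the insertion position of the query suffix
        ins = bisect_left(suffixes, query[qpos:])
        for rank in range(max(0, ins - window), min(len(sa), ins + window)):
            pid = owner[sa[rank]]
            if pid < 0:
                continue
            # (ii) votes are accumulated per diagonal band, not per protein
            diagonal = offset[sa[rank]] - qpos
            votes[(pid, diagonal // band)] += 1
    best = Counter()
    for (pid, _band), count in votes.items():
        best[pid] = max(best[pid], count)
    return best.most_common()  # proteins sorted by their best band vote
```

In this sketch a protein identical to the query collects at least one vote per query position on the main diagonal, whereas unrelated proteins only pick up scattered votes from chance neighbours in the suffix array.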

Databases

The Uniprot, UniRef90, UniRef50 and Swissprot databases are downloaded monthly from ftp.ebi.ac.uk. The sequences of Protein Data Bank entries are taken weekly from the Dali server (22).

Benchmark data sets

The server was benchmarked using the same test set and database as in (9). The test set consists of 4174 predicted proteins of Dickeya solani, an emerging plant pathogen (23). The reference database is Uniprot frozen in 2012, which did not yet contain D. solani. The reference set of TRUE hits was generated using SSEARCH (17) and an e-value cutoff of 1.0. Others have observed before us that implementations differ between programs and e-values are not directly comparable between programs (12). Therefore programs being evaluated were asked to output 1000 best hits. Hits found in the reference set were counted as true positives. Most programs compute an e-value for the hits, which operationally eliminates false positives. The hits were also subdivided into bins according to the sequence identity of the pair in the reference set. The wall-clock time to process the test set was also recorded to compare speeds.
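
The evaluation described above amounts to counting, per sequence-identity bin, how many reference hits each program recovered. The following is a hedged sketch of that bookkeeping; the input layout, the 10% bin width and the function name are assumptions made for illustration, not details given in the paper.

```python
from collections import Counter


def bin_true_positives(reported, reference, bin_width=10):
    """Count recovered reference hits per sequence-identity bin.

    reported:  {query id: list of hit ids output by the program under test}
    reference: {query id: {hit id: percent identity in the reference set}}
    Returns {bin lower bound: (true positives found, reference hits in bin)}.
    """
    found, total = Counter(), Counter()
    for query, truth in reference.items():
        hits = set(reported.get(query, []))
        for hit, identity in truth.items():
            # 100% identity is folded into the top bin
            lo = min(int(identity // bin_width) * bin_width, 100 - bin_width)
            total[lo] += 1
            if hit in hits:
                found[lo] += 1
    return {lo: (found[lo], total[lo]) for lo in sorted(total)}
```

Hits reported by a program but absent from the reference set are simply ignored here, which mirrors the text's choice to count only hits found in the reference set as true positives.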

BLAST, UBLAST, LAMBDA, RAPSEARCH2 and SANSparallel are natively parallel. LAST was run with GNUparallel using blocksize 36 000. We used pre-compiled LAMBDA v0.4.7, which could not output more than 500 hits per query; this bug was fixed but a new version was not available in time for our benchmarks (Hannes Hauswedell, personal communication). All programs used were 64-bit versions except UBLAST, of which only a 32-bit version is freely available. Because of this 32-bit limitation, the reference database had to be split into several chunks in order to index it. BLAT also required the reference data to be split into several segments in order to work. The e-value threshold was set to 1.0 in all software where this option was available. In LAST, the score threshold was calculated to correspond to e-value 1.0 and was set accordingly. LAST parameter –m 500 was used in order to get more hits. Otherwise default parameters were used.

RESULTS

Benchmarking

We tested SANSparallel against BLAST (1), UBLAST (14), LAMBDA (12), LAST (13), DIAMOND (15), BLAT (10) and RAPSEARCH2 (11) using the same benchmark as in (9). Four modes of SANSparallel (verifast, fast, slow and verislow) were used, which differ in the depth and speed of the search. LAMBDA outputs at most 500 hits, therefore comparisons are shown for 1000 hits and 500 hits. The performance of all methods is quite similar above 50% sequence identity; differences are mainly seen in the detection of remote homologs below 50% sequence identity (Figure 2). The sensitivity of UBLAST is closest to BLAST. RAPSEARCH2 and BLAT are both slower and less sensitive than at least one competing method. Some aligners have tunable parameters whereby one can arbitrarily trade speed for sensitivity. SANSparallel also gets faster when fewer hits are output (Table 1). Considering both speed and sensitivity, a group of four methods emerges with small differences between them: SANSparallel fast mode, DIAMOND, LAMBDA and LAST. Fast is the default mode in the SANSparallel web server.

Figure 2.

Benchmark results showing the number of true positives detected in the top-1000 hits and top-500 hits binned by sequence identity.

Table 1.

Speed comparison of database search programs: time taken to search 4174 queries of the Dickeya solani benchmark

Program       Hits   Cores     Time (s)   Relative speed
verifast       100   16              62             5903
fast           100   16              65             5631
verifast       500   16             111             3298
verifast      1000   16             170             2153
fast           500   16             178             2056
LAMBDA         500   16             216             1695
slow           100   16             235             1558
fast          1000   16             324             1130
LAST          1000   16 (a)         327             1119
slow           500   16             406              902
DIAMOND       1000   16             446              821
slow          1000   16             612              598
verislow       500   16             624              587
verislow      1000   16             792              462
verifast      1000   1             1009              363
UBLAST (b)    1000   16 (a)        1310              279
RAPSEARCH2    1000   16            1469              249
LAMBDA         500   1             2052              178
LAST          1000   1             2957              124
fast          1000   1             3297              111
SANS (c)      1000   1             3809               96
BLAT (b)      1000   1             4307               85
slow          1000   1             5015               73
verislow      1000   1             7094               52
RAPSEARCH2    1000   1            18761               20
UBLAST (b)    1000   1            28399               13
BLAST         1000   16 (a)       32149               11
BLAST         1000   1           366046                1

(a) GNUparallel.

(b) Database split to chunks (UBLAST: 19, BLAT: 5) due to program's size limit.

(c) Serial implementation (9).
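
The paper does not define the "Relative speed" column of Table 1 explicitly. The values appear to be the single-core BLAST time divided by each program's time; the short check below is a sketch under that assumption and reproduces the tabulated values to within one unit.

```python
# Time for BLAST, 1000 hits, 1 core, taken from Table 1 (seconds).
BLAST_SINGLE_CORE_SECONDS = 366046


def relative_speed(seconds):
    """Speed-up relative to single-core BLAST (assumed definition)."""
    return BLAST_SINGLE_CORE_SECONDS / seconds


# Times from Table 1: fast/100 hits, DIAMOND, verifast/100 hits.
for label, seconds in [("fast, 100 hits", 65), ("DIAMOND", 446), ("verifast, 100 hits", 62)]:
    print(f"{label}: {relative_speed(seconds):.1f}")
```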

User interface

Inputs and outputs

The website is free and open to all and there is no login requirement. The server takes FASTA-formatted sequences as input. One or multiple query sequences can be submitted in one request. The user can also choose the maximum number of hits to be output (H), the database to be searched (Uniprot, UniRef90, UniRef50, Swissprot or PDB) and a search protocol. The protocols are pre-set parameter combinations: (i) Verifast mode reports H proteins with the highest vote; no alignments are computed. (ii) Fast mode is like the previous mode but reports alignment scores. (iii) Slow mode inspects 2H proteins with the highest vote and sorts them by alignment score. (iv) Verislow mode maximizes accuracy when H is small. It always inspects 4000 proteins with the highest vote and sorts them by alignment score. The vote threshold of verifast mode is set so that the false positive rate is 1–2% in our benchmark. The other modes only report hits with an e-value below 1. Figure 3 illustrates the search result for a predicted protein from the butterfly Melitaea cinxia (24), which the cgi-script generated in 51 milliseconds. The primary output of the server is a tabular report of the hits with links to different output options (Figure 3). For example, we generate stacked alignments that are automatically loaded to Jalview (20) for alignment editing/visualization or to Skylign (19) for drawing sequence logos. Jalview Desktop is a standalone Java application that can be downloaded from http://www.jalview.org/download. The Jalview applet is launched from our website, which must be added to the user's list of trusted sites as instructed in the tutorial (http://ekhidna2.biocenter.helsinki.fi/sans/Tutorial.html#exercises). Skylign outputs HTML5 which works on modern web browsers.
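
The four protocols can be summarised as a small lookup table. This is one reading of the description above, written as a sketch: the class and field names are hypothetical, and whether fast mode re-sorts its hits is not stated in the text, so it is assumed here that it does not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protocol:
    candidates: int  # proteins taken from the top of the vote-sorted list
    align: bool      # whether alignment scores are computed
    resort: bool     # whether hits are re-sorted by alignment score


def protocol(mode, H):
    """Return the assumed parameter combination for a mode and hit limit H."""
    table = {
        "verifast": Protocol(H, False, False),
        "fast": Protocol(H, True, False),   # re-sorting assumed off
        "slow": Protocol(2 * H, True, True),
        "verislow": Protocol(4000, True, True),
    }
    return table[mode]
```

For H = 100, for instance, slow mode would inspect 200 candidates while verislow mode would always inspect 4000, which is why verislow pays off mainly when H is small.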

Figure 3.

Example output.

Programmatic access

SANSparallel can be used for both interactive and high-throughput analyses. All input and output options of the cgi-script can be included in the URL as explained in the web tutorial (http://ekhidna2.biocenter.helsinki.fi/sans/Tutorial.html#external). Thus, another web server can link to SANSparallel to retrieve information about the sequence neighbors of a particular protein. Another use of SANSparallel is in high-throughput functional annotation of proteomes or transcriptomes. For example, the web tutorial demonstrates (http://ekhidna2.biocenter.helsinki.fi/sans/Tutorial.html#perl) how to build a simple annotation pipeline where (i) the predicted protein sequences (in FASTA format) are sent to the server, (ii) the result is parsed and filtered, (iii) the best informative hit is selected as a source of annotation of the query sequence and (iv) a summary table is generated which reports the predicted annotation of each query protein and links its sequence back to SANSparallel so that anyone interested can study the evidence for the prediction interactively. Finally, it is possible to download the client-server programs in source code (http://ekhidna2.biocenter.helsinki.fi/sans/download/) and run the programs locally on local databases.
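
Steps (ii) to (iv) of such a pipeline can be sketched offline. The example below deliberately omits step (i), because the URL parameters are documented in the web tutorial and are not reproduced here. The hit record layout (`description`, `evalue`, `score`), the list of uninformative words and all function names are assumptions for illustration only; they do not describe the tutorial's Perl script or the server's output format.

```python
# Words treated as uninformative in a hit description (assumed list).
UNINFORMATIVE = ("uncharacterized", "hypothetical", "unknown")


def best_informative_hit(hits, max_evalue=1.0):
    """Filter hits by e-value and return the best-scoring informative one."""
    kept = [h for h in hits if h["evalue"] <= max_evalue]
    kept.sort(key=lambda h: h["score"], reverse=True)
    for hit in kept:
        text = hit["description"].lower()
        if not any(word in text for word in UNINFORMATIVE):
            return hit
    return None  # nothing usable as a source of annotation


def annotate(results):
    """results: {query id: list of hit dicts}; returns summary rows."""
    rows = []
    for query, hits in results.items():
        hit = best_informative_hit(hits)
        rows.append((query, hit["description"] if hit else "no informative hit"))
    return rows
```

A real pipeline would additionally store, for each row, a link back to the server for the query sequence, so that the evidence behind the annotation can be inspected interactively as the text describes.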

DISCUSSION

We have improved and parallelized the suffix array neighborhood search algorithm SANS (9). Our benchmarking results were in line with previously published comparisons identifying UBLAST as sensitive and LAST and LAMBDA as fast. SANSparallel is competitive with DIAMOND, LAST and LAMBDA. All these programs are based on similar principles but with different implementations. Benchmarking showed that they miss few hits when sequence identity is above 50% but fall behind BLAST when sequence identity gets lower (Figure 2). Future work will focus on improving sensitivity by increasing the sequence space coverage of the seeds. The speed of SANSparallel depends on the amount of output (Table 1). LAST has no direct control on the number of hits, but this is influenced by the –m parameter for the uniqueness of seeds in the database (13). DIAMOND (15) and LAMBDA (12) are designed for batch processing of large query sets like the original SANS algorithm (9). The SANSparallel server supports both interactive analysis of individual queries and high-throughput analysis of genomes or transcriptomes. It is simple to link to other tools, as inputs and outputs are FASTA-formatted sequences or alignments. Much can be learned by studying groups of homologous proteins instead of individual proteins. Evolutionary conservation sharpens the signal for function (25,26), secondary structure (27) and deeper homology detection (1). SANSparallel facilitates such analyses by retrieving homologs from the database and performing an alignment. It is so fast that the user can change output formats, search parameters or the database interactively. Speed opens up new ways to operate. For example, functional annotations of genomes could be updated on demand, database clustering need not store all-against-all search results on disk, and sequence similarity based data integration could be done on the fly.

FUNDING

Biocenter Finland. Funding for open access charge: Biocenter Finland.

Conflict of interest statement. None declared.

REFERENCES

1. Altschul S.F., Madden T.L., Schäffer A.A., Zhang J., Zhang Z., Miller W., Lipman D.J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997; 25:3389–3402.

2. McGinnis S., Madden T.L. BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res. 2004; 32:W20–W25.

3. Analysis Tool Web Services from the EMBL-EBI. Nucleic Acids Res. 2013; 41:W597–W600.

5. Sun S., Chen J., Li W., Altinatas I., Lin A., Peltier S., Stocks K., Allen E.E., Ellisman M., Grethe J. et al. Community cyberinfrastructure for Advanced Microbial Ecology Research and Analysis: the CAMERA resource. Nucleic Acids Res. 2011; 39:D546–D551.

6. Heger A., Korpelainen E., Hupponen T., Mattila K., Ollikainen V., Holm L. PairsDB atlas of protein sequence space. Nucleic Acids Res. 2008; 36:D276–D280.

7. Rattei T., Arnold R., Tischler P., Lindner D., Stümpflen V., Mewes H.W. SIMAP: the similarity matrix of proteins. Nucleic Acids Res. 2006; 34:D252–D256.

8. The UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 2015; 43:D204–D212.

9. Koskinen P., Holm L. SANS: high-throughput retrieval of protein sequences allowing 50% mismatches. Bioinformatics. 2012; 28:i438–i443.

10. Kent W.J. BLAT—the BLAST-like alignment tool. Genome Res. 2002; 12:656–664.

11. Zhao Y., Tang H., Ye Y. RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data. Bioinformatics. 2012; 28:125–126.

12. Hauswedell H., Singer J., Reinert K. Lambda: the local aligner for massive biological data. Bioinformatics. 2014; 30:i349–i355.

13. Kielbasa S.M., Wan R., Sato K., Horton P., Frith M.C. Adaptive seeds tame genomic sequence comparison. Genome Res. 2011; 21:487–493.

14. Edgar R.C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010; 26:2460–2461.

15. Buchfink B., Xie C., Huson D.H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods. 2014; 12:59–60.

16. Roytberg M., Gambin A., Noé L., Lasota S., Furletova E., Szczurek E., Kucherov G. On subset seeds for protein alignment. IEEE/ACM Trans. Comput. Biol. Bioinform. 2009; 6:483–494.

17. Pearson W.R. Effective protein sequence comparison. Methods Enzymol. 1996; 266:227–258.

18. Brown N.P., Leroy C., Sander C. MView: a web-compatible database search or multiple alignment viewer. Bioinformatics. 1998; 14:380–381.

19. Wheeler T.J., Clements J., Finn R.D. Skylign: a tool for creating informative, interactive logos representing sequence alignments and profile hidden Markov models. BMC Bioinformatics. 2014; 15:7.

20. Waterhouse A.M., Procter J.B., Martin D.M., Clamp M., Barton G.J. Jalview Version 2—a multiple sequence alignment editor and analysis workbench. Bioinformatics. 2009; 25:1189–1191.

21. Korf I., Yandell M., Bedell J. 2003. Sebastopol, CA: O'Reilly & Associates. ISBN-13: 978-0596002992.

22. Holm L., Rosenström P. Dali server: conservation mapping in 3D. Nucleic Acids Res. 2010; 38:W545–W549.

23. Garlant L., Koskinen P., Liu Y., Nykyri J., Ahamed S., Rouhiainen L., Laine P., Paulin L., Auvinen P., Holm L. Genome sequence of Dickeya solani, a new soft rot pathogen of potato, suggests its emergence may be related to a novel combination of non-ribosomal peptide/polyketide synthetase clusters. Diversity. 2013; 5:824–842.

24. Ahola V., Lehtonen R., Somervuo P., Salmela L., Koskinen P., Rastas P., Välimäki N., Paulin L., Kvist J., Wahlberg N. et al. The Glanville fritillary butterfly retains an ancient karyotype and reveals selective chromosomal fusions in Lepidoptera. Nat. Commun. 2014; 5:4737.

25. Koskinen P., Toronen P., Nokso-Koivisto J., Holm L. PANNZER—high-throughput functional annotation of uncharacterized proteins in an error-prone environment. Bioinformatics. 2014; doi:10.1093/bioinformatics/btu851.

26. O'Donoghue S.I., Sabir K.S., Kalemanov M., Stolte C., Wellmann B., Ho V., Roos M., Perdigão N., Buske F.A., Heinrich J. et al. Aquaria: simplifying discovery and insight from protein structures. Nat. Methods. 2015; 12:98–99.

27. Rost B., Sander C. Combining evolutionary information and neural networks to predict protein secondary structure. Proteins. 1994; 19:55–72.

© The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Issue Section:

Web Server issue


