Formatdb and Fastacmd

http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/formatdb_fast...
Program Parameters for formatdb and fastacmd - Two BLAST Database Related Tools

Tao Tao, Ph.D.
User Service
NCBI, NLM, NIH

TOC
1. Introduction
   1.1 Biological sequences in FASTA format
   1.2 Conversion to blastable format
   1.3 Sequence retrieval from formatted BLAST databases
2. Installation and configuration
3. Program parameters
   3.1 Command line parameters for formatdb
   3.2 Command line parameters for fastacmd
4. Practical usage
   4.1 Using formatdb
       4.1.1 Number of formatdb output files: 3, 5, 7 and 9
       4.1.2 Format a custom database with entries from NCBI Entrez
       4.1.3 Use a GI list to create an alias file for a master database
       4.1.4 Format multiple input files
       4.1.5 Format custom databases
       4.1.6 Alias file structure
   4.2 On fastacmd
       4.2.1 Database information and database to FASTA sequence conversion
       4.2.2 Specific sequence and subsequence retrieval
       4.2.3 Taxonomic information
5. Technical Assistance

1. Introduction

When discussing biological sequences, we generally refer to text strings in which each letter represents a nucleotide or an amino acid residue in the actual biological sequence. There are many ways to display a biological sequence, with FASTA being the most widely used one. The FASTA format was first adopted by the authors of the FASTA sequence alignment program [1]. In this format, a definition line (defline for short) beginning with ">" precedes the actual sequence. This defline contains a brief description of the sequence. NCBI further expanded the defline by using the first string in the FASTA defline as the seqID and breaking this string into pipe (|) separated fields to encode additional information.

In addition to FASTA, NCBI also provides sequences in GenBank (GenPept for protein), ASN.1, and XML formats. However, only FASTA formatted sequences can be used as the input query with command line standalone BLAST or client BLAST (blastcl3). We cannot use sequences in the above formats as BLAST databases directly. Rather, we need to convert them into BLASTable format using formatdb, which takes only sequences in FASTA or ASN.1 format as input. Once we identify a hit of interest, we often need to get the entries out of a formatted BLAST database in human readable FASTA format. We can use the fastacmd program to accomplish this task.

In this document, we will go over the technical details of FASTA sequence deflines, the two database related programs, formatdb and fastacmd, their program parameters, and their practical usage.

We would like to point out that NCBI provides the common set of BLAST databases in preformatted form, generated directly from our relational databases. We break large databases into smaller, easy to handle volumes. These databases are readily blastable after inflation and extraction. Use these preformatted databases whenever you can. Users can quickly regenerate the FASTA sequences from these preformatted databases if they are needed. See this page for more information on available BLAST databases:
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastdb.html
1.1 Biological sequences in FASTA format

A FASTA formatted sequence consists of a single comment line called the defline, which is marked by a ">" sign at the beginning followed by the description of the sequence. The defline terminates with a newline character and is followed by one or more lines of actual sequence, each terminated with a newline character. Deflines for FASTA sequences from NCBI follow a distinctive structure, which has several pipe (|) separated fields. Details are given below.

Table 1.1 Sequence ID (seqID) Fields in the FASTA Deflines of Sequences from NCBI

Database Name                        Identifier Syntax
GenBank (1)                          >gi|digits|gb|accession|locus
EMBL Data Library (1)                >gi|digits|emb|accession|locus
DNA Database of Japan                >gi|digits|dbj|accession|locus
NBRF PIR (2)                         >gi|digits|pir||entry
Protein Research Foundation (2)      >gi|digits|prf||name
SWISS-PROT (2)                       >gi|digits|sp|accession|entry name
Protein Data Bank (2)                >gi|digits|pdb|entry|chain
Patents (2)                          >gi|digits|pat|country|number
GenInfo Backbone Id                  >gi|digits|bbs|number
NCBI Reference Sequence (1, 2)       >gi|digits|ref|accession|
General database identifier (3)      >gnl|database|identifier
Local Sequence identifier (3)        >lcl|identifier

NOTE:
(1) Nucleotide Defline Examples:
    >gi|304804|gb|L17338.1|DROZENA Drosophila pseudoobscura zen gene
    >gi|293667|gb|L13590.1|MUSIMSWAL Mus musculus DNA sequence
    >gi|18917|emb|X52321.1|HVBAMYL Barley mRNA for beta-amylase
    >gi|15042013|dbj|AB055093.1| Bacillus sp. KSM-KP43 gene for 16S rRNA
(2) Protein Defline Examples:
    >gi|68510037|ref|NP_766538.2| lipin 1 isoform a [Mus musculus]
    >gi|24430466|gb|AAN61186.1| maturase [Perilla frutescens]
    >gi|46090801|dbj|BAD13538.1| cytochrome b [Tanakia lanceolata]
    >gi|223168|prf||0602187A protein,Leu,Ile,Val binding
    >gi|50552|emb|CAA46200.1| protein kinase [Mus musculus]
    >gi|51315829|sp|P84108|FLL1_ACEDI Flagellin-like protein
    >gi|230326|pdb|1SGT| Trypsin (SGT) (E.C.3.4.21.4)
    >gi|1082471|pir||S52920 disintegrin (EC 3.4.24.-) - human (fragment)
(3) For records that are not included in the NCBI Entrez database.

1.2 Conversion to blastable format
Text files with sequences in FASTA or ASN.1 format cannot be used directly as BLAST databases during a BLAST search. To make them recognizable by BLAST, we need to format them using formatdb. When formatting a database with the "-o F" setting, formatdb generates 3 files for the input sequence file, all of which are required by BLAST programs. If the sequence file is from NCBI, or its deflines conform to NCBI convention, formatdb is capable of parsing the seqIDs from the deflines to generate additional indexing files with the "-o T" setting. For sequence files with custom deflines, the exact number of files generated with the "-o T" setting will depend on the actual format of the deflines. For large databases, with size larger than one GB, formatdb automatically splits up the file and generates multiple volumes, each no larger than one GB (default setting: -v 1000). Each volume will have its own specific set of database files. An alias file, generated automatically, ties the individual volumes together to form a large virtual database.

1.3 Sequence retrieval from formatted BLAST databases

We often need to retrieve the FASTA sequence for a specific entry within a BLAST database for visual inspection or other analyses. We can do so if the database was formatted with the "-o T" setting. What we need is fastacmd, another database related tool from the standalone BLAST package, in combination with the seqID for the entry. When using preformatted databases from NCBI, we can also obtain taxonomic information. For details see Section 4.2.

2. Installation and configuration

There is no specific setting for formatdb or fastacmd if the BLAST package was installed properly. For more information on the installation of the BLAST package, see pc_setup.html or unix_setup.html.

3. Program parameters

We will list the detailed program parameters for formatdb and fastacmd separately below.
3.1 Command line parameters for formatdb

Command line parameters for formatdb are discussed here, with each parameter listed in its own table.

Table 3.1.1
Parameter:     -i
Function:      Specifies the input file(s) to be formatted
Default:       N/A
Input format:  [File In]
Example:       To format an input FASTA file my_seq.txt, use: -i my_seq.txt
Note:          This parameter is mandatory. It requires the full file name with extension. The input file should have sequences in FASTA or ASN.1 format, except when converting a gi list to binary form. To format multiple input files, quote the input file names as in -i "db1 db2". The FASTA output from other programs can be piped to this option using "-i stdin". Renaming the database is recommended when reading from stdin and mandatory when formatting multiple input files. See Table 3.1.9.

Table 3.1.2
Parameter:     -p
Function:      The input type is protein
Default:       T
Input format:  [T/F]
Example:       To format a nucleotide database, use: -p F
Note:          T: true, input is protein; F: false, input is nucleotide.

Table 3.1.3
Parameter:     -o
Function:      Parses deflines and indexes seqIDs
Default:       F
Input format:  [T/F]
Example:       To enable seqID parsing and indexing, use: -o T
Note:          T: parse SeqID and create indexes; F: do not parse SeqID and do not create indexes. For an input FASTA sequence file with NCBI styled deflines, use "-o T". Otherwise, use "-o F".

Table 3.1.4
Parameter:     -t
Function:      Adds custom title to the database
Default:       N/A
Input format:  [String]
Example:       To add the title "combined nt, est, and htgs", use: -t "combined nt, est, and htgs"
Note:          This adds a more descriptive title to the database, which is displayed in the header section of the BLAST output.

Table 3.1.5
Parameter:     -l
Function:      Specifies the logfile name
Default:       formatdb.log
Input format:  [File Out]
Example:       None
Note:          The default setting is usually sufficient. We recommend users check this log file after each formatdb run to make sure there is no obvious error.

Table 3.1.6
Parameter:     -a
Function:      The input file is in ASN.1 format
Default:       F
Input format:  [T/F]
Example:       To format an ASN.1 file, use: -a T
Note:          The default is to expect sequences in FASTA format. Currently, multiple sequences in ASN.1 format downloaded from Entrez do NOT work properly with formatdb - only the first entry will be formatted.

Table 3.1.7
Parameter:     -b
Function:      The input ASN.1 file is in binary form
Default:       F
Input format:  [T/F]
Example:       To format a binary ASN.1 file, use: -b T
Note:          T: binary ASN.1 file; F: text ASN.1 file expected. Use this with the -a T option.

Table 3.1.8
Parameter:     -e
Function:      The input ASN.1 is a seq-entry file
Default:       F
Input format:  [T/F]
Example:       To set this to true, use: -e T
Note:          To format a sequence in ASN.1 form downloaded from Entrez, use: -e T

Table 3.1.9
Parameter:     -n
Function:      Renames the resulting database
Default:       N/A
Input format:  [String]
Example:       To rename the formatted database to combined_nt, use: -n combined_nt
Note:          This parameter renames the formatted database to a name different from the input file, which is recommended when formatting input sequences piped from stdin. Mandatory when formatting multiple input files. Do NOT combine -n with -L.

Table 3.1.10
Parameter:     -v
Function:      Sets the upper limit of database volume size, input in MILLIONS of letters
Default:       0
Input format:  [Integer]
Example:       To break the formatted database into 100 megabase volumes, use: -v 100
Note:          Zero invokes the default of 1000, which is one gigabase or 10^9 letters. If an input database is broken into multiple volumes, formatdb will automatically create an alias file with the db_name.nal extension, which ties all the volumes together. The complete database can be called using "-d db_name". See Table 3.1.13 and section 3.1.

Table 3.1.11
Parameter:     -s
Function:      Creates sparse indexes - limited only to accessions
Default:       F
Input format:  [T/F]
Example:       To activate this option, use: -s T
Note:          Activation of this parameter will reduce the size of the database indexing files.

Table 3.1.12
Parameter:     -V
Function:      Activates verbose mode and checks for non-unique IDs
Default:       F
Input format:  [T/F]
Example:       To activate this warning, use: -V T
Note:          This prints warnings on screen if duplicate IDs are found.

Table 3.1.13
Parameter:     -L
Function:      Creates an alias file with this name
Default:       N/A
Input format:  [File Out]
Example:       To create a nucleotide database alias named mouse_subset for nt from a gi list named mouse.gil, use: formatdb -i all -p F -F mouse.gil -L mouse_subset
Note:          It will use the GI file argument from -F to calculate the database size. Do NOT combine -L with -n.

Table 3.1.14
Parameter:     -F
Function:      Specifies an input GI file
Default:       N/A
Input format:  [File In]
Example:       To input a GI file named mouse_gi, use: -F mouse_gi
Note:          It takes a text file with a list of GIs from Entrez or Eutils.

Table 3.1.15
Parameter:     -B
Function:      Generates binary GI file from the text GI file specified in -F
Default:       N/A
Input format:  [File Out]
Example:       To generate binary GI file worm.gil from input text GI list worm, use: formatdb -F worm -B worm.gil
Note:          This converts the -F input to a more efficient binary format. The resulting file can be used in database aliases or by the -l parameter of BLAST programs while searching against a preformatted database from NCBI.

Table 3.1.16
Parameter:     -T
Function:      Reads in taxonomic information and writes the group bit to the ASN.1 defline
Default:       Optional
Input format:  [File In]
Example:       To read in gi_taxid_prot.dmp, use: -T gi_taxid_prot.dmp
Note:          This parameter allows formatdb to read in a file with gi/taxid information and write the taxid information to the ASN.1 defline. The inputs are gi_taxid*.dmp files from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/
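
As a rough illustration of how the formatdb parameters combine, the Python sketch below assembles an argument list and enforces one rule stated in the tables: -n is mandatory when several input files are given. The function name build_formatdb_args and its keyword arguments are our own invention for this example and are not part of the BLAST package; only the flags themselves come from the tables above.

```python
from typing import List, Optional, Sequence


def build_formatdb_args(
    inputs: Sequence[str],
    protein: bool = True,
    parse_seqids: bool = False,
    title: Optional[str] = None,
    name: Optional[str] = None,
    volume_millions: int = 0,
) -> List[str]:
    """Assemble a formatdb argument list (hypothetical helper, for illustration only)."""
    if not inputs:
        raise ValueError("-i is mandatory: give at least one input file")
    if len(inputs) > 1 and name is None:
        raise ValueError("-n is mandatory when formatting multiple input files")
    # Several input files are passed as ONE argument, the equivalent of
    # quoting them on the command line: -i "db1 db2"
    args = ["formatdb", "-i", " ".join(inputs)]
    args += ["-p", "T" if protein else "F"]
    args += ["-o", "T" if parse_seqids else "F"]
    if title is not None:
        args += ["-t", title]
    if name is not None:
        args += ["-n", name]
    if volume_millions:
        # -v is given in millions of letters; 0 leaves the default in place
        args += ["-v", str(volume_millions)]
    return args
```

The returned list could be handed to subprocess.run on a machine where formatdb is installed. Because the joined file names travel as a single list element, no extra shell quoting is needed.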
3.2 Command Line Parameters for fastacmd

The program fastacmd works on a formatted BLAST database generated by formatdb. Its program parameters are discussed below in their individual tables.

Table 3.2.1
Parameter:     -d
Function:      Specifies the input database
Default:       nr
Input format:  [String]
Example:       To use the ecoli database, use: -d ecoli
Note:          fastacmd will NOT be able to retrieve specific entries from databases formatted with "-o F". To work with multiple databases simultaneously, use -d "db1 db2". Databases in quotes must be of the same type.

Table 3.2.2
Parameter:     -p
Function:      Specifies the database type
Default:       G
Input format:  [G/T/F]
Example:       To work with a protein database, use: -p T
Note:          G: guess mode, looking for protein first, then nucleotide; T: protein; F: nucleotide.

Table 3.2.3
Parameter:     -s
Function:      Specifies the seqID or text strings
Default:       N/A
Input format:  [String]
Example:       To retrieve sequences/information for gi 5556, use: -s 5556
Note:          Use GI, accession, or text string (for custom database). Multiple entries need to be comma-delimited as in "-s AF123456,U12345". Specific sequence retrieval from a custom database with deflines conforming to NCBI format will need the first text string. See Section 4.2.2.

Table 3.2.4
Parameter:     -i
Function:      Specifies the input file containing GIs, accessions, or text strings for batch retrieval
Default:       N/A
Input format:  [String]
Example:       To batch retrieve sequences specified by gi_list, use: -i gi_list
Note:          The input file must be a text file with one entry per line. The complete file name with extension should be used.

Table 3.2.5
Parameter:     -a
Function:      Retrieves duplicate accessions
Default:       F
Input format:  [T/F]
Example:       None
Note:          The "-a T" setting retrieves all entries with deflines containing the input text string. For example, it is quite common for pdb entries of different chains to have the same record id, but with different chain numbers. The command line "fastacmd -d pdb -s 2hhd -a T" retrieves all the entries, while "fastacmd -d pdb -s 2hhd" retrieves only some of them.

Table 3.2.6
Parameter:     -l
Function:      Specifies the line length of the returned sequences
Default:       80
Input format:  [Integer]
Example:       To set line length to 70, use: -l 70
Note:          Changing the default is not recommended.

Table 3.2.7
Parameter:     -t
Function:      Requires defline to contain the target GI
Default:       F
Input format:  [T/F]
Example:       To make sure "-s 511603" retrieves only the entry with this gi in the defline, add: -t T
Note:          It only affects compound entries from a non-redundant database such as protein nr.

Table 3.2.8
Parameter:     -o
Function:      Specifies the output file
Default:       stdout, print to screen
Input format:  [String]
Example:       To save the output to a file called my_hit, use: -o my_hit
Note:          Redirection using "|" or ">" also works.

Table 3.2.9
Parameter:     -c
Function:      Uses Ctrl-A's as non-redundant defline separator
Default:       F
Input format:  [T/F]
Example:       To keep the Ctrl-A in the defline, use: -c T
Note:          It is only for non-redundant databases, such as protein nr.

Table 3.2.10
Parameter:     -D
Function:      Dumps the entire database
Default:       0
Input format:  [Integer]
Example:       To dump the database in FASTA format, use: -D 1
Note:          It overrides all other options except -I, and accepts the following values: 0: No dump; 1: FASTA; 2: GI list; 3: Accession.version

Table 3.2.11
Parameter:     -L
Function:      Specifies the subsequence range
Default:       0,0
Input format:  [String]
Example:       To get the subsequence from 10 to 200, use: -L 10,200
Note:          In the default setting, 0 in 'start' refers to the beginning of the sequence and 0 in 'stop' refers to the end of the sequence. Double quote the input if it contains white space(s). fastacmd will apply the input to all retrieved records in a batch retrieval.

Table 3.2.12
Parameter:     -S
Function:      Specifies the strand of nucleotide sequence to retrieve
Default:       1
Input format:  [Integer]
Example:       To retrieve the reverse complement, use: -S 2
Note:          fastacmd will apply the input to all retrieved entries during batch retrieval. Values and functions are: 1: The entry itself; 2: The reverse complement of the entry

Table 3.2.13
Parameter:     -T
Function:      Prints taxonomic information for requested sequence(s)
Default:       F
Input format:  [T/F]
Example:       To print taxonomic information, use: -T T
Note:          This works only for preformatted databases provided by NCBI and requires the installation of the taxdb.tar.gz archive. See Section 4.2.3 for more information.

Table 3.2.14
Parameter:     -I
Function:      Prints database information only
Default:       F
Input format:  [T/F]
Example:       To get the database information, use: -I T
Note:          Prints title, database type, total length, and number of sequences for the target database specified by -d. Overrides all other parameters.

Table 3.2.15
Parameter:     -P
Function:      Retrieves sequences with PIG ID
Default:       N/A
Input format:  [Integer]
Example:       To retrieve PIG 234 from nr, use: -d nr -P 234
Note:          PIG stands for "Protein Identification Group". Each PIG contains one or more protein entries with the exact same sequence. The PIG number list or table is NOT available to the public at this time.

4. Practical usage

In this section, we will present some additional information on these two programs and discuss their
practical usages.

4.1 Using formatdb

The function of this program is to convert sequence files to blastable format, index the entries, and encode additional information to the ASN.1 defline to make the resulting databases more useful.

4.1.1 Number of formatdb output files: 3, 5, 7 and 9

The program formatdb processes the input sequence file and generates a varying number of files. The exact number of files generated will depend on whether the "-o T" option is used, whether the sequence deflines conform to NCBI format, whether they are from the NCBI Entrez database, and whether they are protein or nucleotide.

Table 4.1.1 Formatdb Output List

Nucleotide db file extension   Protein db file extension   Content          Format
.nhr                           .phr                        Deflines         binary
.nin                           .pin                        Indices          binary
.nsq                           .psq                        Sequence data    binary

Additional files generated using "-o T" *
.nnd                           .pnd                        GI data          binary
.nni                           .pni                        GI indices       binary
.nsd                           .psd                        Non-GI data      binary
.nsi                           .psi                        Non-GI indices   binary
                               .ppd                        PIG data         binary
                               .ppi                        PIG indices      binary

NOTE: * The number of files produced with "-o T" depends on the defline format.

We recommend that users use NCBI preformatted BLAST databases whenever possible. These preformatted BLAST databases are split into smaller volumes and are easy to handle. They also contain added Linkouts and taxonomic information. Users can fully exploit the linkout information under the wwwblast setup [2].

4.1.2 Format a custom database with entries from NCBI Entrez

Sequences obtained from the NCBI Entrez database or Eutilities [3] should be formatted with the "-o T" option. For batch sequences, only FASTA format should be used at this time. For an annotated genome or genomic segment, it is advantageous to use the ASN.1 format as input to formatdb, since we can use it as a nucleotide or a protein input. When such a record is used as a protein input to formatdb, we will format the annotated CDS, or the protein products, into a BLASTable protein database.

If the preformatted database containing the target entries of interest is available from NCBI's BLAST db ftp directory, users can instead create a database alias from a GI list without formatting a separate database. We will discuss the details in Section 4.1.3.
The "-T" option was added to formatdb 2.2.12, which allows the incorporation of taxonomic information into the ASN.1 defline using the gi_taxid_nucl.dmp or gi_taxid_prot.dmp from the taxonomy ftp site at:
ftp.ncbi.nlm.nih.gov/pub/taxonomy/
The following command line formats the bacterial_protein FASTA file and adds the taxid information to the ASN.1 deflines:
formatdb -i bacterial_protein -p T -o T -T gi_taxid_prot.dmp
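
Before running formatdb with -T, it can be useful to check which taxid the dmp file assigns to the entries of interest. The Python sketch below is a minimal illustration of such a check. It assumes that each line of a gi_taxid*.dmp file holds a GI and a taxid separated by whitespace, which is our assumption about the file layout and not something stated in this document. The function name gis_for_taxid is made up for this example.

```python
from typing import Iterable, Iterator


def gis_for_taxid(lines: Iterable[str], taxid: int) -> Iterator[int]:
    """Yield the GIs mapped to `taxid` (assumed layout: "<gi> <taxid>" per line)."""
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines instead of guessing
        gi, tid = fields
        if int(tid) == taxid:
            yield int(gi)


if __name__ == "__main__":
    # gi 4557757 with taxid 9606 appears in Section 4.2.3 of this document;
    # the second line is a made-up placeholder entry.
    sample = ["4557757\t9606\n", "1\t10090\n"]
    print(list(gis_for_taxid(sample, 9606)))
```

With a real file, the same function can be fed an open file handle, for example gis_for_taxid(open("gi_taxid_prot.dmp"), 9606), so the large dmp file is read line by line rather than loaded into memory.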
4.1.3 Use a GI list to create an alias file for a master database

All preformatted BLAST databases from NCBI can be used in conjunction with a GI list fed to the parameter -l to restrict a given BLAST search to the subset of entries delimited by that list. A more efficient and informative way to use a GI list, however, is to generate a binary GI list based database alias using formatdb. We can readily generate a GI list by searching in the Entrez Nucleotide or Protein database.

Correct use of a GI list with BLAST requires a good understanding of how sequences are partitioned among the available BLAST databases. We also need to know that BLAST programs will not report an error or warning if a sequence specified by a GI is missing from the target database. For example, a GI list representing human mRNAs, obtained from Entrez Nucleotide using "human[orgn] AND biomol_mrna[prop]", contains GIs for ESTs as well. When this GI list is used as input to -l to limit a BLAST search against nt, the GIs representing these ESTs have no corresponding entries in nt, yet BLAST will not report errors on the missing entries. formatdb, on the other hand, checks the GI list during alias construction to verify the presence of the entries specified by the GI list. This function helps avoid confusion further downstream, e.g. at the BLAST result analysis stage. For reference, we list the Entrez query approximation for the available BLAST databases below.

Table 4.1.3 Entrez approximation for preformatted databases (1)

Database Name               Entrez Query Approximation

Protein (2)
nr                          all[filter] NOT environmental sample[filter] NOT gbdiv_pat[prop]
swissprot                   srcdb_swiss_prot[prop]
pdb                         srcdb_pdb[prop]
refseq_protein              srcdb_refseq[prop]
env_nr (3)                  environmental sample[filter]
pat                         gbdiv_pat[prop]

Nucleotide
nt                          all[filter] NOT (gbdiv_est[prop] OR gbdiv_gss[prop] OR gbdiv_sts[prop] OR gbdiv_pat[prop] OR gbdiv_htg[prop] OR (srcdb_refseq[prop] AND biomol_genomic[prop]) OR environmental sample[filter] OR wgs[prop])
refseq_rna                  srcdb_refseq[prop] AND biomol_rna[prop]
refseq_genomic (4)          N/A
est                         gbdiv_est[prop]
est_human                   gbdiv_est[prop] AND human[orgn]
est_mouse                   gbdiv_est[prop] AND mouse[orgn]
est_others                  gbdiv_est[prop] NOT (mouse[orgn] OR human[orgn])
human_genomic_transcript    N/A
mouse_genomic_transcript    N/A
htgs                        gbdiv_htg[prop]
gss                         gbdiv_gss[prop]
sts                         gbdiv_sts[prop]
wgs                         wgs[prop]
pat                         gbdiv_pat[prop]
pdb                         gbdiv_pdb[prop]
env_nt (3)                  environmental sample[filter]
human_genomic               NC_000001:NC_000024[accn] OR AC_000044:AC_000068[accn]
other_genomic (3)           N/A

NOTE:
(1) The query is only an approximation, provided for use in combination with other terms to get the GIs for a subset of sequences, which can be used to limit a BLAST search to that subset through the -l option. Due to the size of the databases and the time needed to update them, the content of BLAST databases will LAG behind Entrez.
(2) Protein BLAST databases are non-redundant, while the Entrez approximation is NOT.
(3) Some entries in the Entrez approximation are in protein nr or nucleotide nt.
(4) Currently, there is no Entrez approximation for these two databases.
We can combine additional Entrez query with the approximation, using boolean operator AND or NOT, to retrieve a subset for that database. We can save the GIs of the retrieved records by first displaying them as "GI list" followed by using the "Send to" file button to save the GI. For more information, see Entrez Help. To turn a text GI list file, mouse.n.gi, into a more efficient binary form, we can use the formatdb command line below.
formatdb -F mouse.n.gi -B mouse.n.gil
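
Since BLAST stays silent about GIs that are missing from the target database, it can help to inspect a text GI list before converting it. The Python sketch below is one possible way to do so, assuming the text GI list holds one numeric GI per line and that the GIs present in the parent database have been dumped separately (for instance with "fastacmd -D 2", see Section 4.2.1). The function names read_gi_list and missing_gis are hypothetical helpers for this illustration only.

```python
from typing import Iterable, List


def read_gi_list(text: str) -> List[int]:
    """Parse a text GI list (one GI per line), dropping blanks and duplicates."""
    seen = set()
    gis: List[int] = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        token = raw.strip()
        if not token:
            continue
        if not token.isdigit():
            raise ValueError(f"line {lineno}: {token!r} is not a numeric GI")
        gi = int(token)
        if gi not in seen:
            seen.add(gi)
            gis.append(gi)
    return gis


def missing_gis(wanted: Iterable[int], available: Iterable[int]) -> List[int]:
    """Return the wanted GIs that do not occur in the database GI dump."""
    present = set(available)
    return [gi for gi in wanted if gi not in present]
```

Any GI reported by missing_gis belongs to an entry that a search restricted by this list would never reach, which is the situation described above for EST GIs used against nt.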
To generate a database alias using the resulting binary mouse.n.gil file, we can use this formatdb command line.
formatdb -i parent_db -p F -F mouse.n.gil -L mouse.n.subset
Here the parent_db is the name of the preformatted parent database, and the alias generated by the command line is named mouse.n.subset. A search against this alias database will be against the subset of sequences specified by the GI list. A database alias also allows one to give a database subset a more meaningful name, use a shorter command line in actual searching, keep fewer sets of actual databases, and reduce the maintenance needed in a group environment.

4.1.4 Format multiple input files

We can use formatdb to format multiple input files into a single BLAST database. To do so, we need to quote the input files to be formatted and provide them to the -i parameter. It is mandatory to use the -n option to name the resulting database, since formatdb cannot derive a database name from multiple input file names.
formatdb -i "db1.fa db2.fa" -n all.db -t "combined db1 db2"
The -t parameter in the above command line creates a descriptive title for the resulting database, which will appear in the header of the search result and help us track the BLAST result.

4.1.5 Format custom databases

The term "custom databases" here refers to sequences from users or other third party sources. Most of those databases should be formatted with the "-o F" setting, unless their deflines follow NCBI convention. Formatting a custom database with deflines in NCBI convention and the "-o T" setting requires that each defline start with a unique first string, since this is the field formatdb indexes to generate additional indexing files. The additional indexing will allow specific retrieval of sequences using their unique first string and fastacmd.

Sometimes, we may encounter problems when searching a custom BLAST database formatted with the "-o T" setting. To resolve the issue, we need to go back to the FASTA sequences and reformat them using the "-o F" setting. If the FASTA sequences are not available, we can dump them out of the formatted database using fastacmd.

4.1.6 Alias file structure

BLAST database aliases are text files with database configuration information. They can be created automatically by formatdb or manually. The alias file name follows the database.##.*** convention, where the .## are optional volume numbers and the .*** are the .pal or .nal file extensions, representing protein and nucleotide aliases, respectively. An alias file can tie multiple databases together to form a larger virtual database. It can also specify a subset of sequences within a large master database to form a smaller virtual database. Information on the number of sequences and their total length in the virtual database can be included in an alias file. BLAST will use them for the Expect value calculation. The alias below, named zebrafish.pal, specifies a virtual database for zebrafish entries found in the nr protein database.
#
# Alias file created Thu Jul  5 15:04:29 2001
#
TITLE My zebrafish database
#
DBLIST nr
#
GILIST zebrafish.gi
#
#OIDLIST
#
NSEQ 1836
LENGTH 640724
#
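
To make the structure of such a file concrete, the Python sketch below reads an alias file into a dictionary. It assumes each active line consists of a keyword followed by its value and treats any line starting with # as a comment. The function name parse_alias is our own and is not a tool from the BLAST package.

```python
from typing import Dict


def parse_alias(text: str) -> Dict[str, str]:
    """Collect KEYWORD/value pairs from alias file text, ignoring # comment lines."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a commented-out entry such as #OIDLIST
        parts = line.split(None, 1)
        keyword = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        fields[keyword] = value
    return fields


if __name__ == "__main__":
    alias_text = """#
# Alias file created Thu Jul  5 15:04:29 2001
#
TITLE My zebrafish database
#
DBLIST nr
#
GILIST zebrafish.gi
#
#OIDLIST
#
NSEQ 1836
LENGTH 640724
#
"""
    for keyword, value in parse_alias(alias_text).items():
        print(keyword, "=", value)
```

Values are kept as strings on purpose: a DBLIST entry may name several volumes separated by spaces, and callers can convert NSEQ and LENGTH to integers where needed.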
The alias content below is from est_others.nal, which was generated automatically by formatdb. It ties up the individual est_others.## volumes into one complete database.
#
# Alias file created Tue Jan 30 18:04:04 2007
#
TITLE GenBank non-mouse and non-human EST entries
#
DBLIST est_others.00 est_others.01 est_others.02 est_others.03 est_others.04
#
#GILIST
#
#OIDLIST
#
Note: # marks a commented line. All other lines should contain no line break.

4.2 On fastacmd

The database tool fastacmd allows us to work with a formatted BLAST database for purposes other than sequence alignment. These include dumping FASTA sequences, getting summary information, retrieving a specific sequence or subsequence, and extracting the taxonomic information for specific entries. Dealing with specific entries requires a database formatted with "-o T".

4.2.1 Database information and database to FASTA sequence conversion

To get a brief summary of a BLAST database, we can use the -I parameter of fastacmd. This parameter overrides all others in the command line. The output given below is for an old version of the refseq_protein database.
C:\blast2210>fastacmd -d refseq_protein -I T
Database: NCBI Protein Reference Sequences
          902,672 sequences; 324,856,552 total letters
File name: C:\blast2210\blast2210p\db\refseq_protein
Date: May 12, 2005 8:14 PM
Version: 4
Longest sequence: 37,777 res
To convert a formatted BLAST database back to its FASTA form, we use the "-D 1" setting. For databases from NCBI, preformatted or formatted locally from sequences downloaded from Entrez, we can also selectively dump out the GIs using "-D 2". The first example command line below dumps out the FASTA sequences from refseq_protein and saves the output to a file called refp.fasta. The second command line dumps out the GIs and saves the output to refp.gi.

C:\blast2210>fastacmd -d refseq_protein -D 1 -o refp.fasta
C:\blast2210>fastacmd -d refseq_protein -D 2 -o refp.gi
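
A dumped FASTA file can be sanity-checked against the numbers reported by "-I T". The Python sketch below is one simple way to do that: it counts the deflines, the total letters, and the longest sequence in FASTA text. The function name summarize_fasta is hypothetical, and the sketch assumes plain FASTA input in which every record starts with a ">" line.

```python
from typing import Iterable, Tuple


def summarize_fasta(lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (number of sequences, total letters, longest sequence length)."""
    count = 0
    total = 0
    longest = 0
    current = 0  # length of the record being read
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            longest = max(longest, current)
            current = 0
            count += 1
        elif line:
            current += len(line)
            total += len(line)
    longest = max(longest, current)  # account for the last record
    return count, total, longest


if __name__ == "__main__":
    demo = [">seq1 toy record", "ACGT", "AC", ">seq2 toy record", "ACGTACGT"]
    print(summarize_fasta(demo))
```

For a real dump, pass an open file handle such as open("refp.fasta") so the file is streamed line by line. If the three numbers disagree with the "-I T" summary, the dump was probably truncated or taken from a different database version.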
4.2.2 Specific sequence and subsequence retrieval

Specific sequence retrieval requires that the target BLAST database be formatted with "-o T". To use this setting, the first strings in the FASTA deflines must be unique for indexing purposes. Preferably, the deflines should conform to NCBI format (Table 1.1). In addition, we need to have the seqID or first text string from the defline of the target sequence. For NCBI provided databases, the ids can be GI or accession numbers. The following example command lines demonstrate the retrieval of a full sequence and a subsequence (with -L 100,160) for NP_112245 from the refseq_protein database.
C:\blast2210p>fastacmd -d refseq_protein -s NP_112245
>gi|14195630|ref|NP_112245.1| microtubule-associated protein 4 [Homo sapiens]
MADLSLADALTEPSPDIEGEIKRDFIATLEAEAFDDVVGETVGKTDYIPLLDVDEKTGNSES
KKKPCSETSQIEDTPSSK

C:\blast2210p>fastacmd -d refseq_protein -s NP_112245 -L 100,160
>gi|14195630:100-160 microtubule-associated protein 4 [Homo sapiens]
PTEFLEEKMAYQEYPNSQNWPEDTNFCFQPEQVVDPIQTDPFKMYHDDDLADLVFPSSATA
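
The coordinate convention of -L can be mimicked in a few lines of Python. The sketch below assumes that positions are 1-based and inclusive, which agrees with the 61 residues returned for "-L 100,160" above, and that 0 stands for the beginning (in 'start') or the end (in 'stop') of the sequence as described in Table 3.2.11. The function name subsequence is made up for this illustration and does not reproduce fastacmd's actual implementation.

```python
def subsequence(seq: str, start: int = 0, stop: int = 0) -> str:
    """Return seq[start..stop], 1-based and inclusive; 0 means 'from/to the end'."""
    begin = 1 if start == 0 else start
    end = len(seq) if stop == 0 else stop
    if not 1 <= begin <= end <= len(seq):
        raise ValueError(f"range {start},{stop} does not fit a sequence of length {len(seq)}")
    # Convert the 1-based inclusive range to Python's 0-based half-open slice.
    return seq[begin - 1:end]


if __name__ == "__main__":
    toy = "ABCDEFGHIJ"
    print(subsequence(toy, 3, 5))   # third to fifth letter
    print(subsequence(toy, 0, 0))   # default range: the whole sequence
```

The only subtle step is the slice seq[begin - 1:end]: subtracting one from the start while leaving the stop untouched turns an inclusive 1-based range into the half-open form Python expects.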
We can batch retrieve multiple sequences by using a list of comma-separated ids, as in "-s NP_000240,NP_024931". Alternatively, we can generate an id list and provide the list to fastacmd's -i parameter as in "-i input_file". Here the input_file is the name of a text file containing ids, one record per line.

For custom databases, specific retrieval is a bit different from that for the preformatted databases. For example, entries from NCBI Trace databases do have NCBI styled deflines, but since they are not part of the Entrez Nucleotide database, they do not have a GI or accession. We can format them with "-o T", but specific retrieval will require the quoted first string. In the example below, we need to use the first string of the defline, gnl|ti|127084115, in quoted form to retrieve that record. We need to quote the id due to the presence of pipe symbols.

>gnl|ti|127084115 name:avt02g01.x1 AC110665 mate:127084142 mate_name:avt02g01.y1

C:\blast2210p>fastacmd -d dog_trace.nt -s "gnl|ti|127084115"
>gnl|ti|127084115 name:avt02g01.x1 AC110665 mate:127084142 []
ACCTGGGTGATCTGATCCCATCGTCCTGTGGTGGAATTCTTCCCATTCTGAGAGTGAATAATAATTCACT
CACTCTGAATAATTATTCACTCTCAGAATCCATCCTTCGAATTTCTGTTCAATTTTTCTGCTCCTCTTCA
TCAAAATTTTCTTCAGTGTTATCTAGAGTTGCTGCCTTTACTTTTTCTTTTCTTTTTTTTTTTTAAGATT
TTATTTATTTATTCATGAGAGACAGAGAGAGAGAGAGAGCCGCCNNCCCATAGGCAGAGCCTGAGGCCCC
GGAAGAAGCAGGCTCCATGCAGGGAGCCCGAGGAGGGAC
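
Because "-o T" indexing depends on the first string of every defline being unique, it may be worth checking a FASTA file before formatting it. The Python sketch below extracts the first string from each defline and reports any that occur more than once. The helper names first_string and duplicate_ids are our own, and the sketch simply treats everything between ">" and the first white space as the first string.

```python
from collections import Counter
from typing import Iterable, List


def first_string(defline: str) -> str:
    """Return the text between '>' and the first white space of a defline."""
    if not defline.startswith(">"):
        raise ValueError("a FASTA defline must start with '>'")
    parts = defline[1:].split(None, 1)
    return parts[0] if parts else ""


def duplicate_ids(deflines: Iterable[str]) -> List[str]:
    """Return the first strings that are shared by more than one defline."""
    counts = Counter(first_string(d) for d in deflines)
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)


if __name__ == "__main__":
    defline = ">gnl|ti|127084115 name:avt02g01.x1 AC110665 mate:127084142"
    seq_id = first_string(defline)
    print(seq_id)             # the string to quote after -s
    print(seq_id.split("|"))  # the pipe separated fields of the seqID
```

An empty list from duplicate_ids suggests the file meets the uniqueness requirement. It does not guarantee that the deflines follow the NCBI conventions in Table 1.1.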
4.2.3 Taxonomic information

In Version 4 of the BLAST database, we adopted the ASN.1 formatted defline. The extra space available was used to encode the taxonomic id and other Linkout group bits. Entries from NCBI preformatted BLAST databases will have the information embedded in their deflines. When the taxdb.tar.gz archive is installed, fastacmd will be able to provide the taxonomic information for specific entries. Sample command lines for taxonomic information retrieval are given below.
C:\blast2210p>fastacmd -d refseq_protein -s NP_000240 -T T
NCBI sequence id: gi|4557757|ref|NP_000240.1|
NCBI taxonomy id: 9606
Common name: human
Scientific name: Homo sapiens

C:\blast2210p>fastacmd2210p -d refseq_protein -D 2 -o accession
C:\blast2210p>fastacmd2210p -d refseq_protein -i accession -T T -o tax_info
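
When the taxonomic report is saved to a file, it is often more convenient to work with it as structured data. The Python sketch below turns a single report into a dictionary. It assumes that fastacmd prints one "label: value" pair per line, as in the sample output above; this layout is our reading of that sample and not a documented format. The function name parse_tax_report is hypothetical.

```python
from typing import Dict


def parse_tax_report(text: str) -> Dict[str, str]:
    """Parse one taxonomic report made of 'label: value' lines into a dict."""
    record: Dict[str, str] = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if sep:  # keep only lines that actually contain a colon
            record[label.strip()] = value.strip()
    return record


if __name__ == "__main__":
    report = (
        "NCBI sequence id: gi|4557757|ref|NP_000240.1|\n"
        "NCBI taxonomy id: 9606\n"
        "Common name: human\n"
        "Scientific name: Homo sapiens\n"
    )
    info = parse_tax_report(report)
    print(info["NCBI taxonomy id"], info["Scientific name"])
```

The sketch handles one record at a time. For batch output such as the tax_info file, the text would first have to be split into individual reports, since repeated labels would otherwise overwrite each other.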
Currently, it is not possible to dump out this piece of information for the complete database in one step. This function will not work for custom databases with non-NCBI entries, or with NCBI entries formatted without the -T setting.

5. Feedback

For questions and comments on this document and BLAST in general, please write to: blast-help@ncbi.nlm.nih.gov

Questions and comments on other NCBI resources should be addressed to: info@ncbi.nlm.nih.gov

Reference
[1] Pearson and Lipman. "Improved tools for biological sequence comparison", 1988. PNAS 85(8): 2444 - 2448
[2] wwwblast: Setup and Usage: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/wwwblast/
[3] Entrez Utilities Help Document: http://www.ncbi.nlm.nih.gov/entrez/eutils/
Updated on 12/17/2007 07:44:02