THE WILDCAT TOOLBOX
The Wildcat Toolbox provides Perl scripts developed for use in mass spectral data analysis and sorting. A .zip file of the toolbox is provided as a download. Installation requirements are given first, followed by a brief description of each tool.
All of these programs require Perl build 5.8.1 or higher, available as a free download from ActiveState. The Perl scripts should be installed in the /perl/bin directory and run directly from a command prompt window. They also use other Perl modules such as SeqIO.pm and WriteExcel.pm. The complete BioPerl 1.5 can be downloaded from the BioPerl site and should be unarchived into the c:\perl\site\lib folder. The WriteExcel.pm module can be downloaded from the CPAN website and should also be unarchived into the c:\perl\site\lib folder.
BRIEF DESCRIPTIONS OF EACH PROGRAM
Count.pl
A utility for counting the number of entries in a fasta format protein database that you have compiled or downloaded to use in MS/MS database searching.
Usage
In the directory where the fasta protein file is located, type:
Count.pl [filename.fasta]
The program displays on screen the number of protein headers in the file. Write it down.
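The counting step is simple enough to sketch. The following is a minimal Python illustration of the idea, not the Perl source of Count.pl: it assumes that every entry in the file begins with a header line starting with ">", and the function name count_fasta_entries is made up for this example.

```python
def count_fasta_entries(lines):
    """Count FASTA entries by counting header lines (lines starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


# A tiny in-memory database with two entries (illustrative data only).
example = [
    ">sp|P1|first protein",
    "MKTAYIAKQR",
    ">sp|P2|second protein",
    "GAVLIPFMW",
]
print(count_fasta_entries(example))
```

To count a real file you could pass an open file handle instead of a list, since the function only needs something that yields lines.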
Reverse.pl
A utility for reversing the protein sequences in a fasta format protein database, so that you can re-search your data against the reversed database and use the results to assess the rate of false positive assignments (see Peng, Gygi et al., Journal of Proteome Research 2003; Elias, Gygi et al., Nature Biotech 2003).
Usage
In the directory where the fasta protein file is located, type:
reverse.pl [filename.fasta] [filenameREV.fasta]
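As a rough illustration of what reversing a database involves, here is a minimal Python sketch (not the Perl source of reverse.pl). It assumes that sequences may be wrapped over several lines, and it adds a "REV_" prefix to each header so that decoy hits can be told apart. That prefix is an assumption made for this example; reverse.pl may label the reversed entries differently or not at all.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reverse_records(lines, prefix="REV_"):
    """Return FASTA lines in which every protein sequence is reversed.

    The header prefix is hypothetical; it is only there so the reversed
    entries are recognisable in search results.
    """
    out = []
    for header, sequence in read_fasta(lines):
        out.append(">" + prefix + header)
        out.append(sequence[::-1])  # reverse the residue order
    return out
```

Each reversed sequence is written on a single line here; a real tool might re-wrap it to a fixed line width.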
comment.pl
A simple program for adding a text string (i.e. a comment) to the descriptive header of ALL of the entries in a protein sequence database file. This is useful when you have assembled a database from various sources and want to keep track of where each entry came from. You can add a comment to the headers in one .fasta file before joining it to another one. For example, you can add the word “artifact” (or artichoke, or burgenflickel) to the headers in the contaminant database. Then, when that is added to your genome sequence database prior to searching, all the contaminant headers contain the word “artifact”. You can then use DTASelect to display the results with the contaminants removed using -l artifact.
Usage
In the directory where the fasta protein file is located, type:
Comment.pl -i [filename.fasta] -o [new filename.fasta] -c Comment
Example
Comment.pl -i contaminants.fasta -o contaminants_artifacts.fasta -c ARTIFACT
This will output a new .fasta file called contaminants_artifacts.fasta that has the word ARTIFACT inserted at the start of each descriptive header.
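The header edit can be sketched in a few lines of Python. This is an illustration only, not the Perl source of comment.pl. It assumes the comment goes at the start of the description, after the first space-delimited identifier, so that the locus name is left intact; the exact placement used by comment.pl is not specified here, and the function name add_comment is made up for the example.

```python
def add_comment(lines, comment):
    """Insert a comment at the start of the description in every FASTA header."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Split the header into the identifier and the rest of the line.
            identifier, _, description = line[1:].partition(" ")
            new_header = (identifier + " " + comment + " " + description).rstrip()
            out.append(">" + new_header)
        else:
            out.append(line)  # sequence lines pass through unchanged
    return out
```

With this layout, a header such as ">gi|1| keratin" becomes ">gi|1| ARTIFACT keratin", which a later text filter on the word ARTIFACT would still match.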
organizer.pl
A very useful tool for assembling DTASelect result pages for a set of experiments into a single Excel file, with one worksheet per experiment. Use this, for example, when you have just run nanoLC-MS/MS on a set of 32 gel bands and Sequest searched all of them. Rather than running DTASelect in each individual directory (which takes a while) and saving 32 individual result files, or cutting and pasting them together, use organizer.pl.
There are three criteria sets stored in the program:
Low -1 1.5 -2 2.0 -3 3.0 -d .05 -y 1 -p 1
High -1 1.8 -2 2.5 -3 3.5 -d .08 -y 1 -p 1
Vhigh -1 1.8 -2 2.5 -3 3.5 -d .1 -y 2 -p 2
Usage
Go to the parent directory, which contains a Sequest.params file and a number of subdirectories, each of which contains multiple .dta and .out files. Type:
organizer.pl -type [low|high|vhigh|none] -loc [full path to parent directory]
Other optional arguments are:
-excel [resultfilename.xls]
-- [other DTAselect arguments to apply]
Example 1
organizer.pl -type high -loc c:\xcalibur\Sequest\ph032505 -excel ph032505hiresults.xls -- -l artifact
This will run DTASelect in each of the subdirectories using the 'high' cutoffs, exclude any proteins with 'artifact' in the descriptive header, and generate an Excel results file called ph032505hiresults.xls, which contains one worksheet with a DTASelect results page for each of the subdirectories.
Example 2
If you want to use a set of DTASelect parameters completely different from any of those listed above, just set the type to 'none' and enter your own parameters after the -- switch.
organizer.pl -type none -loc c:\xcalibur\Sequest\ph032805 -excel ph032805results.xls -- -1 1.2 -2 2.9 -3 4.6 -d .02 -y 2 -p 3 -l human
This will run DTASelect in each of the subdirectories using the DTASelect parameters entered (i.e. -1 1.2 -2 2.9 -3 4.6 -d .02 -y 2 -p 3 -l human), and generate an Excel results file called ph032805results.xls, which contains one worksheet with a DTASelect results page for each of the subdirectories.
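The way the -type setting and the arguments after -- combine can be pictured as a simple lookup followed by concatenation. The Python sketch below is an illustration under that assumption, not the Perl source of organizer.pl; the names CRITERIA and dtaselect_args are made up for the example, and the order in which organizer.pl actually passes the arguments to DTASelect is not documented here.

```python
# The three stored criteria sets, copied from the table above, plus 'none'.
CRITERIA = {
    "low": ["-1", "1.5", "-2", "2.0", "-3", "3.0", "-d", ".05", "-y", "1", "-p", "1"],
    "high": ["-1", "1.8", "-2", "2.5", "-3", "3.5", "-d", ".08", "-y", "1", "-p", "1"],
    "vhigh": ["-1", "1.8", "-2", "2.5", "-3", "3.5", "-d", ".1", "-y", "2", "-p", "2"],
    "none": [],
}


def dtaselect_args(criteria_type, extra_args=()):
    """Return the stored cutoffs for a type, followed by any extra arguments."""
    key = criteria_type.lower()
    if key not in CRITERIA:
        raise ValueError("unknown criteria type: " + criteria_type)
    return CRITERIA[key] + list(extra_args)


# Example 1 above: 'high' cutoffs plus the artifact filter.
print(" ".join(dtaselect_args("high", ["-l", "artifact"])))
```

With the type set to 'none', the result is just the arguments you supplied, which matches the behaviour described in Example 2.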
Append.pl
An essential utility for preparing concatenated dta files as input for database searching using Xtandem. The program starts with a directory containing a set of dta files produced from a nanoLC-MS/MS run. This requires first loading an Xcalibur raw file into either the Sequest Bioworks browser or scandenser and making a set of dta files, usually several thousand individual files. The append program joins all these spectra together in a single file, with each spectrum separated from the next by an empty line. The first line of each spectrum in the file lists the parent ion mass, charge state, and spectrum filename, and the remaining lines are m/z and intensity pairs. The program needs to be run from the directory where all the dta files are located.
Usage
Append.pl -i [files to be concatenated] -o [outputfilename]
Example
Go to the directory where all the dta files are located, and type:
append.pl -i *.dta -o combine
This will take a few minutes and will create a single file called combine.dta that contains all of the spectra concatenated together. This file is ready for searching via an XML input file for Xtandem or Mascot.
Note: the inclusion of the original .dta filename in the header line for each spectrum is essential for referencing Xtandem peptide identification results back to the original spectra.
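The concatenated layout described above can be sketched as follows. This is a minimal Python illustration, not the Perl source of append.pl. It assumes that the first line of each individual .dta file already holds the parent ion mass and charge state, that the filename is added to that line after a space, and that the spectra are supplied as (filename, text) pairs; the exact field separators used by append.pl may differ.

```python
def concatenate_dta(spectra):
    """Join spectra into one text block, separated by empty lines.

    spectra is an iterable of (filename, text) pairs, where text is the
    content of one .dta file.
    """
    blocks = []
    for filename, text in spectra:
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        if not lines:
            continue  # skip empty files
        # Header line: parent mass and charge from the file, then the filename.
        header = lines[0] + " " + filename
        blocks.append("\n".join([header] + lines[1:]))
    return "\n\n".join(blocks) + "\n"
```

To run this over a real directory you would read each .dta file into a string first, for example by looping over the filenames returned by glob.glob("*.dta").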
sub_append.pl
This is an enhancement of the append.pl program described above. It runs in a parent directory containing a number of subdirectories, each of which contains a set of dta files produced from a nanoLC-MS/MS run.
Usage
Sub_append.pl
That's it. No arguments or modifiers.
The output is one concatenated file of spectra for each subdirectory, which appears in the parent directory and is already named as a .dta file.
Note: say you have a MudPIT data set with 13 subdirectories' worth of dta files. Run sub_append.pl from the parent directory and you get 13 concatenated dta files, one for each chromatographic step. Use Windows to make a new parent directory elsewhere with just one subdirectory, and copy all the concatenated dta files into that new subdirectory. Run sub_append.pl again from the newly created parent directory and you now have a single dta file that contains all of the spectra from the MudPIT run, typically 40-50K spectra. This can then be searched using Xtandem to generate a single MudPIT results file. These searches take a while, especially the post-processing.
Fasta_labeler.pl
This is a very useful program when you are working in an organism for which there is no sequence information in NCBI, but you have a homemade database file containing protein sequences that are gene predictions and translations from raw DNA sequencing results, such as Coccidioides immitis, Drosophila mojavensis or Bemisia tabaci. Once you have done some mass spec and database searching, your results are multiple peptides assigned to a sequence that has a locus name, e.g. AN6772.1, and a descriptive header saying "protein translation", or nothing at all. What you would typically do is BLAST search the protein sequence from the database against the NCBI nr database, and in the case of AN6772.1, for example, you would get back a result indicating a very high homology to a quercitinase from Aspergillus. If you have a lot of results, you end up doing a lot of BLAST searching, which you really don't want to repeat every time.
The fasta_labeler program takes as input the protein sequence database file you used for searching, and a second text file that you create manually, which contains two tab-separated columns: locus names and the annotated descriptions you want added. It then creates a new version of the protein sequence database file with your BLAST search results or other comments added to the descriptive headers, along with a date modified stamp.
Usage:
Fasta_labeler.pl -oldfasta [filename.fasta] [newfilename.fasta] [descriptionsfile.txt]
-oldfasta
-m Merge description file into fasta file
-d Description file to merge into new fasta file
-r Replace the descriptions
-a Append the descriptions
-tag The tag that will surround the newly merged comments
-s Strips the descriptions from the sequences
-e Extracts descriptions in tab delimited format
-do File to output extracted descriptions to; defaults to stdout
Notes:
There are 3 basic functions of this script:
Merge, Strip, Extract
MERGE:
Given a fasta file and a tab delimited file of descriptions, the script will:
-append: put description at the end of the (already existing) comment line.
-replace: remove any previous descriptions for that sequence and replace them with ones found in the descriptions file.
You can specify a tag for the newly added comments by using the -tag option.
The tag value defaults to UA.
STRIP:
Strip will copy the old fasta file to the new fasta file without the descriptions.
EXTRACT:
Extract will read the descriptions from a fasta file and output them into a tab delimited file.
The -do option tells the script the file to put the extracted descriptions in.
If no file is given, it prints the descriptions to the screen.
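The merge function, with its append and replace modes, can be illustrated with a short Python sketch. This is not the Perl source of fasta_labeler.pl. It assumes that the locus name is the text between ">" and the first space, and that the tag "surrounds" a new comment as "[UA: description]". That bracketed layout is a guess made for the example, and the date modified stamp is left out. The function names are hypothetical.

```python
def parse_descriptions(lines):
    """Read 'locus<TAB>description' lines into a dictionary."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        locus, _, description = line.partition("\t")
        table[locus.strip()] = description.strip()
    return table


def merge_descriptions(fasta_lines, descriptions, mode="append", tag="UA"):
    """Append or replace header descriptions for loci found in the table."""
    if mode not in ("append", "replace"):
        raise ValueError("mode must be 'append' or 'replace'")
    out = []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            locus, _, old = line[1:].partition(" ")
            if locus in descriptions:
                # Assumed tag layout; the real script may format this differently.
                tagged = "[" + tag + ": " + descriptions[locus] + "]"
                if mode == "append" and old:
                    new = old + " " + tagged
                else:
                    new = tagged
                line = ">" + locus + " " + new
        out.append(line)
    return out
```

Headers whose locus name is not in the descriptions file are passed through unchanged, so the output database has the same entries in the same order as the input.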
DTA_sorter.pl
A utility for sorting a set of dta files in a directory into three subdirectories based on the Sequest results contained in a DTASelect-filter.txt file. The three newly created subdirectories are called inexcel, notinexcel and singlexcel. The program starts with a directory full of .dta and .out files that you have already run DTASelect on according to your specified criteria. It then looks at the DTASelect-filter.txt file and determines, for each .dta file, whether the corresponding .out file is found in the DTASelect results as part of a multiple peptide protein identification hit, as a single peptide hit, or not at all. It then moves the spectra into the appropriate subdirectory and creates a concatenated dta file for each one, in the same way as the append program described above. You can then use these to re-search the same data with XTandem and, for example, see how many of the Sequest single peptide hits are validated in XTandem. To use it, you must start in a directory containing multiple .dta and .out files and a DTASelect-filter.txt file.
Usage:
DTA_sorter.pl -a
The -a switch tells the program to use the DTAselect-filter.txt file as input, or you can specify a different filename by typing it after that.
The -d switch is used to specify the filename of a file equivalent to DTAselect-filter.txt but called something different.
The -m switch is used when you want the program to operate on multiple subdirectories, treating each one as a separate data set (which requires a DTASelect-filter.txt file in each subdirectory). Without the -m switch, the program will treat all data in all subdirectories as a single experimental group and base assignments on the DTASelect-filter.txt file in the parent directory. This is used when, for example, you want to sort all the spectra from the 12 steps of a MudPIT experiment into the three categories.
Note: this process is not easily reversible, so make sure you have prepared a copy of the original directory and file structure before trying this out on anything important.
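The three-way sorting decision can be sketched in Python as below. This is an illustration of the logic only, not the Perl source of DTA_sorter.pl: it does not parse DTASelect-filter.txt or move any files. It assumes you already have two sets of spectrum names (without the .dta extension), one for spectra in multiple peptide hits and one for spectra in single peptide hits, and that inexcel holds the former and singlexcel the latter. The function name and the rule that a multiple peptide hit takes precedence are assumptions.

```python
def sort_spectra(dta_files, multi_hit_spectra, single_hit_spectra):
    """Assign each .dta filename to inexcel, singlexcel or notinexcel."""
    groups = {"inexcel": [], "singlexcel": [], "notinexcel": []}
    for name in dta_files:
        # Compare on the name without the .dta extension.
        stem = name[:-4] if name.lower().endswith(".dta") else name
        if stem in multi_hit_spectra:
            groups["inexcel"].append(name)
        elif stem in single_hit_spectra:
            groups["singlexcel"].append(name)
        else:
            groups["notinexcel"].append(name)
    return groups
```

Because this version only returns lists of names, you can check the assignments before anything is moved, which is a cheap safeguard given that the real sorting step is not easily reversible.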
extract.pl
A utility for retrieving a specified set of proteins from a *.fasta protein database file based on locus names. This is useful when you want to take proteins identified in MS experiments and run them in a batch through BLAST or some other protein analysis programs. The program expects an input file containing fasta format sequences and a text file with sequence names, one per line with comma separated optional start and stop values.
Usage:
extract.pl -i fasta_file -o output_file -l list
Example:
extract.pl -i nr120804.fasta -o phresults.fasta -l mylist.txt
where mylist.txt would contain things like:
11670.m02547|LOC_Os04g27060|protein
gi|3043415|
gi|23397097|gb|AAN31833.1|
gi|23397097|gb|AAN31833.1|, 10,100 (if you only wanted to include the protein sequence from residues 10 to 100)
Note: It's fairly picky about the definition of a locus name. Only include the characters from after the > to the first space in the header, or it won't recognize it. The program prints to the screen how many sequences were added to the specified output file, so if you know how many identifiers are in your input list you can easily check whether it found them all.
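The lookup by locus name, including the optional start and stop values, can be illustrated with a short Python sketch. This is not the Perl source of extract.pl. It assumes that the locus name is everything between ">" and the first space, that start and stop are 1-based and inclusive, and that locus names contain no commas; the function names are made up for the example.

```python
def parse_request(line):
    """Split a list entry into (locus name, start, stop); start/stop may be None."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) >= 3 and parts[1] and parts[2]:
        return parts[0], int(parts[1]), int(parts[2])
    return parts[0], None, None


def extract(fasta_lines, request_lines):
    """Return (output FASTA lines, number of sequences found)."""
    records = {}
    current = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            records[current] = [line, []]  # keep the full original header
        elif line and current is not None:
            records[current][1].append(line)
    out, found = [], 0
    for request in request_lines:
        if not request.strip():
            continue
        name, start, stop = parse_request(request)
        if name not in records:
            continue  # unrecognized locus names are silently skipped
        header, chunks = records[name]
        sequence = "".join(chunks)
        if start is not None:
            sequence = sequence[start - 1:stop]  # 1-based, inclusive range
        out.extend([header, sequence])
        found += 1
    return out, found
```

Returning the count alongside the sequences mirrors the check suggested in the note above: compare the number found with the number of identifiers in your list.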
Run_tandem.pl
This utility runs Tandem on all .dta files in a specified folder. There are two ways to run the program: 1) specify a database and the directory containing the .dta files, or 2) do the same but add an option to supply your own input.xml for Tandem to use. If you don't specify an input file, the program runs Tandem using an XML input file that it creates from a default file called run_tandem_file.xml. If you need different values from those in this default file, make a copy of run_tandem_file.xml, rename it, make your modifications, and then pass the renamed file as your custom input.xml.
Requirements:
1) run_tandem.pl and run_tandem_file.xml need to be in the \tandem\bin directory.
2) Use run_tandem_file.xml as a template to make your custom input.xml file, because it has tags in it which control the program's operation (tag1, tag2, and tag3). These tags are required by run_tandem.pl, so do not change them.
Usage
In \tandem\bin, type the following at the command line (the -i argument and its filename are optional):
run_tandem.pl [database name] [directory path] -i [name of your xml input file]
For example, to run Tandem on the .dta files in C:\data\PH030705 using PH_input_template.xml found in C:\data\PH030705, you would enter the following at the command line:
C:\program files\tandem\bin>run_tandem.pl yeast C:\data\PH030705 -i C:\data\PH030705\PH_input_template.xml
CommonSingles.pl
This is an initial version of a script we use for comparing the output of Sequest and Xtandem database searches, specifically to see which of the single-peptide based identifications from large datasets are found in both sets of search results. The current version takes .txt files as input, due to some unexpected difficulties with Excel reading parameters, but we plan to make it able to take in Excel files directly in a later version. The Sequest results file we usually use for this is the DTASelect-filter.txt file. The Xtandem results file is produced from the Global Proteome Machine output, either via their website or by local installation. Display the search results in "table" format, go to the "excel" option and save the results as an Excel file. This can then be re-saved as a .txt file. The way we have been using this is to first use the DTA_sorter script to sort out all the .dta files corresponding to single peptide based Sequest matches from a large dataset, then re-search that subset using Xtandem, then use the CommonSingles program to merge the results.
Usage:
In the data directory where your two search results files are located, type:
CommonSingles.pl [sequestfile.txt] [xtandemfile.txt]
For example to analyze the DTASelect-filter.txt files of results from a yeast mudpit experiment named 0925, and see which of the single peptide based identifications were also found when the single peptide data was re-searched using XTandem, you would go to the directory where both input files are present, and enter the following at the command line.
C:\data> CommonSingles.pl DTASelect_filter0925.txt XTandemSingles0925.txt
The program will create a results output file called CommSngl_DTASelect-filter0925.txt, which includes the Sequest identification data and XTandem identification data for each single-peptide identification found in both searches. This can be opened in Excel for easier viewing.
If you run it with a -c switch before the input filenames, it creates an additional output file called XtDTA_DTASelect-filter040305, which contains all the original multi-peptide hits from the Sequest results and the common singles as well, with Sequest identification data and XTandem identification data for each.
This program and how to use it are described in more detail in a manuscript submitted to Proteomics on 10th November 2005.