VOCABULON.f90

This is a FORTRAN implementation of the algorithm Vocabulon described in:

Sabatti, C. and K. Lange (2002) "Genomewide motif identification using a dictionary model," IEEE Proceedings 90: 1803-1810.

Sabatti, C., L. Rohlin, K. Lange, and J. Liao (2005) "Vocabulon: a dictionary model approach for reconstruction and localization of transcription factor binding sites," Bioinformatics 21: 922-931.

You are welcome to modify the code and use it as you see fit. The code comes without warranties. Please acknowledge the original author by citing the publications above.

INPUT FILES
===========

There are two input files: one contains the sequences to be analyzed, and the other contains the dictionary structure and the pseudocounts for the Dirichlet prior probabilities. The names of the input files have to be specified within the program, in the module CONSTANTS.

TEXT_FILE
This input file contains the DNA sequence data to be analyzed. Each sequence is described with two lines. The first contains a sequence identifier, the second the letters of the sequence. The program currently looks in one direction only, so you have to include the reverse strand here as well.

Ex.
> b0001
TTCTGACTGCAACGGG
> b0001rev
CCCGTTGCAGTCAGAA

WORD_FILE
This input file contains the prior information on spelling. Each word is described first by a line giving the name and length of the word, as well as the pseudocount used for the prior probability of the appearance of the word. This line is followed by a matrix of pseudocounts for each letter at each position of the word, used for the prior on the spelling: each position corresponds to one line, and the first column stands for A, the second for C, the third for G, and the fourth for T. Every line should total the same count. The constant PRIOR_STRENGTH (see below) also contributes to the strength of the priors.

Ex. description of the background word, one letter long, with a pseudocount of 1000.
>back 1 1000
1 1 1 1

OUTPUT FILES
=============

There are a number of output files that present different summaries of the analysis. The goal is to provide the user with the formats we thought would be useful. This is also one of the easiest portions of the code to modify, should you need some other output. The names of the output files have to be specified within the program, in the module CONSTANTS.

OUTPUT_FILE
This file logs the values of the posterior distribution as the iterations progress. It also reports any error messages.

PARAMETER_FILE
This file reports the estimated values of the model parameters. Parameters are organized by blocks. For each block, the first line reports the block length and the estimated block probability. Then each word in the block is described by its estimated conditional probability within the block and its estimated spelling probabilities.

Ex.
BLOCK LENGTH = 1 BLOCK PROBABILITY = 0.999985
MOTIF = >back PROBABILITY = 1.00000
   A        C        G        T
1  0.26152  0.23846  0.23846  0.26153

SPELLING_FILE
The spelling file contains part of the information already written to the parameter file. Each line represents a position in a word and gives the spelling probabilities.

Ex.
WORD   POSITION  A         C         G         T
>back  1         0.261530  0.238467  0.238469  0.261534
>lexA  1         0.184684  0.061793  0.000000  0.753523

LOCATION_FILE
The location file reports all the locations where a non-background word was detected with probability larger than a certain threshold (see MOTIF_CUTOFF_PROB below). Each row corresponds to one instance of a motif and contains the name of the sequence, the name of the motif, the position in the sequence where the motif starts, and the probability with which it is evaluated to be present.

Ex.
SEQUENCE  MOTIF  POSITION  PROBABILITY
> b0226   >lexA  567       0.884487
> b0227   >lexA  406       0.848320

MEAN_FILE
This file contains one row per sequence analyzed, and for each sequence it reports the expected number of non-background words found.

VERBOSE_FILE
This file contains the same information as the location file. However, it is arranged in a more expensive way, with as many rows as the number of sequences analyzed, as many columns as the number of non-background words in the dictionary, and a zero in each entry that did not appear in the LOCATION_FILE. Typically, there is no need to produce this file.

Q_FILE
This file reports the block probabilities iteration by iteration.

RUNNING PARAMETERS
======================

There are a few parameters that can be modified. We report them here with their default values and add explanations where their interpretation is not obvious.

MAX_ITERATIONS = 1000

PRIOR_STRENGTH = 10
This is a value by which all the prior counts of letters in a word are multiplied. It is introduced so that the strength of the prior can be controlled without modifying the input file. The input file can then be based on counts observed in databases, and only the relative sizes of those counts matter.

CONVERGENCE_CRITERION = TEN**(-6)

DIRICHLET_LETTER_PRIOR = 0.001
This is added to the prior counts, to make sure that each letter has a positive pseudocount a priori.

MOTIF_CUTOFF_PROB = 0.5
This is the cut-off used in determining which motif positions are reported in the LOCATION_FILE.

EXAMPLE FILES
=================

The file VOCABULON.f90 refers to the following files, which are included here as examples.

TEXT_FILE = "proecoli8reverse.in"
WORD_FILE = "mypriorlexA.in"
OUTPUT_FILE = "Globfindlexm.out"
PARAMETER_FILE = "Motiffindlexm.out"
SPELLING_FILE = "Wordsfindlexm.out"
LOCATION_FILE = "LocOccfindlexm.out"
MEAN_FILE = "MeanOccfindlexm.out"
Q_FILE = "qfindlexm.out"
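
As noted under TEXT_FILE, the program scans one direction only, so the reverse strand of every sequence has to be supplied by the user. The Python sketch below shows one way this could be prepared. It is not part of VOCABULON.f90: the function names reverse_complement and add_reverse_strands are hypothetical, and the "rev" suffix on the identifier is an assumption taken from the b0001 / b0001rev example above. It also assumes the two-line record layout described for the TEXT_FILE.

```python
# Hypothetical helper, not part of VOCABULON.f90: build TEXT_FILE records
# that contain both the forward and the reverse strand of each sequence.

# Letters other than A, C, G, T (for example N) pass through unchanged.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def add_reverse_strands(lines):
    """Return the input records followed, one by one, by their reverse strand.

    Assumes two lines per record: an identifier line starting with '>',
    then a single line holding the letters of the sequence.
    """
    records = [ln.strip() for ln in lines if ln.strip()]
    if len(records) % 2 != 0:
        raise ValueError("expected two lines per sequence")
    out = []
    for ident, seq in zip(records[0::2], records[1::2]):
        if not ident.startswith(">"):
            raise ValueError("identifier line must start with '>': " + ident)
        # The 'rev' suffix is an assumed naming convention.
        out += [ident, seq, ident + "rev", reverse_complement(seq)]
    return out


# Small demonstration using the sequence from the TEXT_FILE example.
example = ["> b0001", "TTCTGACTGCAACGGG"]
print("\n".join(add_reverse_strands(example)))
```

Applied to the b0001 record, the helper reproduces the four lines shown in the TEXT_FILE example. To process a whole file, the lines read from it can be passed to add_reverse_strands and the result written out, one entry per line.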