Motifs in exons, the coding part of the genome, determine the structure of the protein or label proteins to be sent to certain parts of the cell for processes like phosphorylation. Motifs in introns, which make up part of the non-coding genome, are usually regulatory sequences that determine the amount of gene expression and act as binding sites for proteins. Satellite DNA, which is the main component of centromeres and heterochromatin, is an example of a motif found in the so-called junk parts of the genome.
Different occurrences of a sequence motif may differ from each other even if they perform the same function. As such, we define what is known as the consensus sequence for a set of sequences.
Writing the sequences one below the other makes it easier to see this: A is the most common residue in position 1, G is most common in position 2, and so on. Given that motifs are substring patterns found at multiple places, each of which may or may not have mutations, there are specific ways of representing these sequences.
Text representation: We will use the following example of a nucleic acid sequence to introduce various aspects of the representation.
In this representation, A, T, G and C denote the four possible nucleotide bases adenine, thymine, guanine, and cytosine.
Graphical representation: There is also a graphical method of displaying consensus sequences. A consensus logo conveys information about the conservation of each position of a sequence motif. The consensus logo depicts the degree of conservation of each position using the height of the consensus character at that position (note that the degree of conservation is different from the frequency of each nucleotide at each position).
Positions where there is less conservation have a shorter total height (positions 3, 5, 19, 21, etc.). Positions with high conservation have the greatest total height, usually taken up by a single character (for example, positions 6, 10, 14, 18, etc.).
The exact height of each character comes from the entropy of each position, which is described in detail later in the article. Motif finding is described as the problem of discovering motifs without any prior knowledge of what the motifs look like. Computationally, the motif finding problem can be defined as: Given a set of T sequences each of length N, find the best pattern of length L that appears in each of the T sequences.
Different types of scores exist to help us score a given pattern and choose which one is the best. That is, in our example we are given 5 sequences, and we want to find one sequence motif of length 8 in each sequence such that the motifs are similar to each other.
First, we'll see how to score a given set of motifs. That is, we'll assume that we are given one substring of length L from each of the T sequences; these are our motif candidates. Given this, we'll see how to score these motif candidates.
In the next section, we'll see how we can find the motifs in the original sequences. Suppose we have the following set of motif candidates obtained from a set of sequences (5 motif candidates, each of length 8). From the given motifs, we first construct a profile matrix, which is simply the frequency of each nucleotide base in each position.
Hence, for the example above, we obtain the corresponding profile matrix. The most common letter in each column, taken together, gives us the consensus string.
Hence in this example, our consensus sequence would have A at the first position because it appears most often in column 1; similarly, we will have C at the second position, and so on, to get the following consensus sequence: ACGTACGT. Our objective is to minimize this score. A more appropriate score is the entropy of the profile matrix. Entropy is a measure of how conserved each position is. Higher entropy implies low conservation, and low entropy implies high conservation.
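The profile matrix, consensus string, and entropy score described above can be sketched in a few lines of Python. This is a minimal illustration, not the article's own code: the five candidate motifs are hypothetical (chosen only so that the consensus comes out as ACGTACGT), the function names are made up for this example, and the entropy is assumed to be the standard Shannon entropy of each column, summed over all columns.

```python
import math
from collections import Counter

BASES = "ACGT"

def profile_matrix(motifs):
    """For each column, map every base to its relative frequency."""
    n = len(motifs)
    profile = []
    for col in range(len(motifs[0])):
        counts = Counter(m[col] for m in motifs)
        profile.append({base: counts.get(base, 0) / n for base in BASES})
    return profile

def consensus(profile):
    """Most frequent base in each column (ties go to the first base in BASES)."""
    return "".join(max(BASES, key=lambda b: col[b]) for col in profile)

def column_entropy(col):
    """Shannon entropy of one column in bits (assumed definition)."""
    return -sum(p * math.log2(p) for p in col.values() if p > 0)

def profile_entropy(profile):
    """Total entropy of the profile: lower means better conserved."""
    return sum(column_entropy(col) for col in profile)

# Hypothetical motif candidates: 5 candidates, each of length 8.
motifs = ["ACGTACGT", "ACGTACGA", "ACCTACGT", "TCGTACGT", "ACGTTCGT"]

profile = profile_matrix(motifs)
print(consensus(profile))
print(profile_entropy(profile))
```

A fully conserved column contributes an entropy of 0, while a column in which all four bases are equally likely contributes the maximum of 2 bits, which matches the statement that low entropy means high conservation.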
Let prob(R, l) be the probability that residue R appears at location l.
Autonormalization calculates the desired weights for each background sequence to help minimize the error. Due to the complexity of the problem, HOMER uses a simple hill-climbing approach, making small adjustments in background weight one at a time.
It also penalizes large changes in background weight to avoid trivial solutions that increase or decrease the weights of outlier sequences to extreme values. If you wish to use the old version when running any of the HOMER family of programs, add "-homer1" to the command line.
Parsing input sequences into an Oligo Table: Input sequences are parsed into oligos of the desired motif length and read into an Oligo Table. The Oligo Table holds each unique oligo in the data set, remembering how many times it occurs in the target and background sequences.
This is done to make searching for motifs (which are essentially collections of oligos) much more efficient. However, this also destroys the relationship between individual oligos and their sequence of origin. While the autonormalization described in step 4 above is applied to full sequences, a similar normalization can be applied to the oligos themselves. The idea is still to equalize the smaller oligos. This is a little more dangerous since the total number of motif-length oligos can be very large.
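The Oligo Table described above can be pictured as a dictionary keyed by oligo. The sketch below is an illustration under simplifying assumptions, not HOMER's actual implementation: the function name `build_oligo_table` is hypothetical, strands are not merged, and no sequence weights are applied.

```python
def build_oligo_table(target_seqs, background_seqs, length):
    """Map each unique oligo to [target_count, background_count].

    Only per-oligo totals are kept, so the link between an oligo and
    the sequence it came from is lost, as noted in the text.
    """
    table = {}
    for idx, seqs in enumerate((target_seqs, background_seqs)):
        for seq in seqs:
            # Slide a window of the motif length along the sequence.
            for i in range(len(seq) - length + 1):
                oligo = seq[i:i + length]
                table.setdefault(oligo, [0, 0])[idx] += 1
    return table

# Hypothetical toy input.
table = build_oligo_table(["ACGTACG"], ["ACGA"], 3)
for oligo, (n_target, n_background) in sorted(table.items()):
    print(oligo, n_target, n_background)
```

Searching over the table means touching each unique oligo once instead of rescanning every sequence, which is where the efficiency gain comes from.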
The basic idea is that if a "Motif" is going to be enriched, then the oligos considered part of the motif should also be enriched. To speed up this process, which can be very resource consuming for longer oligos with a large number of possible mismatches, HOMER will skip oligos when allowing multiple mismatches if they were not promising, for example if they had more background instances than target instances, or if allowing more mismatches results in a lower enrichment value.
Calculating Motif Enrichment: Motif enrichment is calculated using either the cumulative hypergeometric or cumulative binomial distributions. These two statistics assume that the classification of input sequences (target vs. background) is independent of the occurrence of the motif within them. The statistics consider the total number of target sequences, the total number of background sequences, and how many of each type contain the motif that is being checked for enrichment.
From these numbers we can calculate the probability of observing the given number or more of target sequences with the motif by chance if we assume there is no relationship between the target sequences and the motif.
The hypergeometric and binomial distributions are similar, except that the hypergeometric assumes sampling without replacement, while the binomial assumes sampling with replacement. The motif enrichment problem is more accurately described by the hypergeometric; however, the binomial has advantages. When the background contains a large number of sequences, the binomial is preferred since it is faster to calculate. As a result it is the default statistic for findMotifsGenome.
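To make the two statistics concrete, here is a small sketch that computes both upper-tail probabilities with the standard library only. It is not HOMER's code: the function names and the counts are hypothetical, and the binomial success probability is assumed to be the overall fraction of sequences containing the motif, which is one reasonable choice among several.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k): n target sequences drawn without replacement from N
    sequences in total, K of which contain the motif."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)

def binom_sf(k, n, p):
    """P(X >= k): n independent draws (with replacement), each
    containing the motif with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical counts, for illustration only.
n_target, n_background = 100, 10_000
target_hits, background_hits = 30, 1_000

N = n_target + n_background
K = target_hits + background_hits
p_hyper = hypergeom_sf(target_hits, N, K, n_target)
p_binom = binom_sf(target_hits, n_target, K / N)
print(p_hyper, p_binom)
```

The hypergeometric sum involves binomial coefficients over the whole sequence pool, whereas the binomial only depends on the number of target sequences, which illustrates why the latter is cheaper when the background is large.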
However, if you use your own background that has a limited number of sequences, it might be a good idea to switch to the hypergeometric (use "-h" to force use of the hypergeometric). One important note: since HOMER uses an Oligo Table for much of the internal calculation of motif enrichment, where it does not explicitly know how many of the original sequences contain the motif, it approximates this number using the total number of observed motif occurrences in background and target sequences.
It assumes the occurrences were equally distributed among the target or background sequences with replacement, where some of the sequences are likely to have more than one occurrence. It uses the expected number of sequences to calculate the enrichment statistic (the final output reflects the actual enrichment based on the original sequences).
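One way to read this approximation is as a balls-in-bins calculation: if each occurrence lands in a uniformly chosen sequence, independently of the others, the expected number of sequences hit at least once follows directly. The sketch below encodes that reading; the formula and the function name are assumptions made for illustration and are not taken from HOMER's source.

```python
def expected_sequences_with_motif(occurrences, n_sequences):
    """Expected number of sequences holding at least one occurrence when
    `occurrences` are placed uniformly at random, with replacement,
    among `n_sequences` sequences (assumed model)."""
    if n_sequences <= 0:
        return 0.0
    # Probability that a given sequence receives none of the occurrences.
    p_missed = (1 - 1 / n_sequences) ** occurrences
    return n_sequences * (1 - p_missed)

# Hypothetical numbers: 150 occurrences spread over 100 sequences.
print(expected_sequences_with_motif(150, 100))
```

Under this model the expected number of sequences is always at most the number of occurrences, reflecting that some sequences receive more than one occurrence.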
HOMER takes the most enriched oligos from the global optimization step, transforms them into simple position specific probability matrices, and further optimizes them with a sensitive local optimization algorithm.
This step is performed separately for each oligo, and will create the "motif probability matrix" as well as determine the optimal detection threshold to maximize the enrichment of the motif in the target vs. background sequences. The detection threshold is determined by scoring each oligo in the data against the probability matrix, and then sorting the oligos by their similarity to the matrix.
HOMER then steps down the list, effectively decreasing the detection threshold, including more and more oligos until an optimal enrichment is found. After this step, HOMER will create several new probability matrices based on the oligos found at different detection thresholds and check which one has the highest enrichment. This process is repeated until the enrichment can no longer be improved, producing a final motif.
After the first "promising oligo" is optimized into a motif, the sequences bound by the motif are removed from the analysis and the next promising oligo is optimized for the 2nd motif, and so on.
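The threshold search described above (score every oligo, sort, then walk down the list) can be sketched as follows. This is a simplified illustration rather than HOMER's algorithm: the log-probability score, the function names, and the toy enrichment function (target count minus background count, standing in for the hypergeometric or binomial statistic) are all assumptions, and the matrix entries are assumed to be strictly positive.

```python
import math

def oligo_score(oligo, matrix):
    """Log-probability of the oligo under a position-specific
    probability matrix (one dict of base -> probability per position)."""
    return sum(math.log(matrix[i][base]) for i, base in enumerate(oligo))

def best_threshold(oligo_table, matrix, enrichment):
    """Walk down the oligos from best to worst match, accumulating
    target/background counts, and keep the score cutoff at which
    `enrichment(target_total, background_total)` is highest."""
    ranked = sorted(oligo_table, key=lambda o: oligo_score(o, matrix), reverse=True)
    best_cutoff, best_value = None, float("-inf")
    target_total = background_total = 0
    for oligo in ranked:
        n_target, n_background = oligo_table[oligo]
        target_total += n_target
        background_total += n_background
        value = enrichment(target_total, background_total)
        if value > best_value:
            best_cutoff, best_value = oligo_score(oligo, matrix), value
    return best_cutoff, best_value

# Hypothetical 2-position matrix and oligo table: oligo -> [target, background].
matrix = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
oligo_table = {"AC": [5, 1], "AG": [3, 1], "TT": [1, 6]}

cutoff, value = best_threshold(oligo_table, matrix, lambda t, b: t - b)
print(cutoff, value)
```

In this toy run the cutoff lands at the score of "AG": including "TT" as well would add more background than target occurrences and lower the enrichment.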
This is where there is an important difference between the old (homer) and new (homer2) versions. The old version of homer would simply mask the oligos bound by the motif from the Oligo Table. This would cause homer to find multiple versions of the same motif and cause a little bit of confusion in the results. To avoid this problem in the new version of HOMER (homer2), once a motif is optimized, HOMER revisits the original sequences and masks out the oligos making up each instance of the motif as well as oligos immediately adjacent to the site that overlap it by at least one nucleotide.
This helps provide much cleaner results, and allows greater sensitivity when finding co-enriched motifs. To revert to the old way of motif masking with homer2, specify "-quickMask" at the command line. You can also run the old version with "-homer1".
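The difference between the two masking schemes comes down to which oligo start positions are excluded around a motif instance. The following sketch computes the positions that overlap a site by at least one nucleotide; the function name and the coordinate convention (0-based starts, oligos and site of equal length) are assumptions for illustration, not HOMER internals.

```python
def masked_oligo_starts(seq_len, site_start, length):
    """Start positions of all oligos of the given length that share at
    least one nucleotide with the motif instance at `site_start`."""
    first = max(0, site_start - length + 1)
    last = min(seq_len - length, site_start + length - 1)
    return list(range(first, last + 1))

# Hypothetical example: a 4-nt motif instance at position 8 of a 20-nt sequence.
print(masked_oligo_starts(20, 8, 4))
```

Masking only the single oligo at the site itself would leave the shifted, partially overlapping oligos in play, which is how slightly offset versions of the same motif can be rediscovered.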
To find the enrichment for each motif, HOMER scans each sequence for instances of the motif and calculates the final enrichment by considering how many target vs. background sequences contain the motif.
ZOOPS (zero or one occurrence per sequence) counting is used, and the hypergeometric or binomial distribution is used to calculate the significance.
Motif Files (homer2, findMotifs): Motif files are reported in the output directories from findMotifs. The header row is actually TAB delimited, and contains the following information:
Multiplicity: The average number of occurrences per sequence in sequences with 1 or more binding sites.
You can easily create your own motif files; just remember that the first 3 columns are required!
HOMER takes the motifs identified in the de novo motif discovery step and tries to process and present them in a useful manner. These pages are explicitly created by running a subprogram called "compareMotifs".
Comparison of Motif Matrices: Motifs are first checked for redundancy to avoid presenting the same motifs over and over again.
This is done by aligning each pair of motifs at each position (including their reverse opposites) and scoring their similarity to determine their best alignment. Neutral frequencies 0.