In 2006 I completed a thesis for my Master of Liberal Arts in Information Technology degree at Harvard Extension School. Here is the summary:
The form and function of all living organisms are determined and managed in large part by modulations in the transcription of their DNA. Identifying the sites where regulatory proteins bind to DNA is thus crucial to understanding regulatory mechanisms. Working with sequences containing regulatory regions (cis-regulatory modules, or CRMs) at known positions, we re-implemented a published CRM-finding algorithm and developed a test suite to identify parameters that would provide optimal and consistent performance across different test sequences. The optimized parameters yielded a modest but consistent improvement over the published results, and in some cases the improvement was dramatic. We then refined the results in two phases. First, we ran the algorithm multiple times with different high-performing parameter sets and kept only the regions (putative CRMs, or pCRMs) shared among all runs. Second, we searched for motifs in the pCRMs of coregulated sequences and removed regions with low densities of these shared motifs, a step that ultimately proved inconclusive.
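The first refinement phase, keeping only the regions shared among all runs, can be illustrated with a small interval-intersection sketch. This is not the thesis code: the names `Region`, `intersect_two`, and `shared_regions` are hypothetical, and it assumes that each run reports its pCRMs as non-overlapping half-open `(start, end)` coordinates and that "shared" means overlapping at the base-pair level.

```python
from typing import List, Tuple

# Assumption: a pCRM is a half-open interval [start, end) in sequence coordinates.
Region = Tuple[int, int]


def intersect_two(a: List[Region], b: List[Region]) -> List[Region]:
    """Return the overlaps between two sorted, non-overlapping region lists."""
    result: List[Region] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:  # the two regions overlap by at least one position
            result.append((start, end))
        # Advance whichever region ends first; the other may still overlap
        # with the next region in the opposite list.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


def shared_regions(runs: List[List[Region]]) -> List[Region]:
    """Keep only the stretches of sequence predicted by every run."""
    if not runs:
        return []
    shared = sorted(runs[0])
    for run in runs[1:]:
        shared = intersect_two(shared, sorted(run))
    return shared


# Made-up predictions from three runs with different parameter sets.
runs = [
    [(100, 300), (500, 800)],
    [(150, 350), (600, 900)],
    [(120, 280), (650, 700)],
]
print(shared_regions(runs))  # [(150, 280), (650, 700)]
```

Intersecting the runs trades sensitivity for specificity: a region survives only if every parameter set predicts it, so the shared regions shrink as more runs are added.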
And here are the details: