**Bioinformatics.** My work focuses
on theoretical as well as practical aspects of biological sequence
analysis. While the more theoretical side of my research has
considerable overlap with computational statistics, the applied side
has more of a data-driven discovery flavor. The following
examples of projects I am working on with my students
Patrick Ng and Niranjan Nagarajan (now at U. Maryland) will give you
a better idea of what I mean.

__Motif Finding__ - The
identification of transcription factor binding sites is an important
step in understanding the regulation of gene expression. To address
this need, many motif-finding tools (or finders) have been described
that can find short sequence motifs given only an input set of
sequences. The motifs returned by these tools are evaluated and
ranked according to some measure of statistical over-representation,
the most popular of which is based on the information content, or
entropy. Our approach to motif finding focuses on analyzing the
problem from two perspectives. First, we seek to characterize
twilight-zone motif finding: when the “real” motifs are
barely significant statistically when compared to top scoring random
motifs. Delineation of this zone is important for understanding how
far our current finders are from being optimal. We invest
considerably more effort in our second goal: analyzing the
statistical significance of a finder's output. This important area
has lagged considerably behind the extensive development of the
finders. Our main interest is to design a reliable and usable
significance analysis. Nevertheless, we also show how such analysis
can be leveraged to improve the actual motif finding process.
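
As a rough illustration of the information-content (entropy) score mentioned above, the following minimal sketch scores an ungapped alignment of equal-length DNA sites. The function name `information_content`, the uniform default background, and the use of natural logarithms are assumptions made for this example; it is not code from any of the tools listed on this page.

```python
import math
from collections import Counter


def information_content(alignment, background=None):
    """Information content of an ungapped alignment, in nats.

    For each column, adds sum_k n_k * log(n_k / (n * b_k)), where n_k is the
    count of letter k in that column, n is the number of aligned sites, and
    b_k is the background frequency of letter k (assumed strictly positive).
    """
    if not alignment:
        raise ValueError("alignment must contain at least one site")
    width = len(alignment[0])
    if any(len(site) != width for site in alignment):
        raise ValueError("all sites must have the same length")
    if background is None:
        # Assumption for this sketch: uniform background over A, C, G, T.
        background = {letter: 0.25 for letter in "ACGT"}

    n = len(alignment)
    total = 0.0
    for j in range(width):
        column_counts = Counter(site[j] for site in alignment)
        for letter, count in column_counts.items():
            total += count * math.log(count / (n * background[letter]))
    return total


if __name__ == "__main__":
    sites = ["ACGT", "ACGA", "ACGT", "TCGT"]
    print(information_content(sites))
```

A perfectly conserved column contributes the maximum possible amount under a uniform background, while a column whose letter frequencies match the background contributes zero. The harder question, and the one the paragraph above is concerned with, is how significant a given score is.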

__Study of Replication Origins__
- DNA replication is a fundamental process essential for cell
proliferation. While the proteins involved in initiating DNA
replication are essentially conserved from yeast to humans, the
implicated sequence motifs that these conserved factors interact with
are poorly understood outside of *S. cerevisiae* (baker's
yeast). Moreover, even for *S. cerevisiae* the replication
initiation process is not completely understood. For example, it is
known that the roughly 400 replication origins in *S. cerevisiae*,
called ARSs (Autonomously Replicating Sequences), differ from one
another in several important respects: the times and frequencies at
which they initiate replication, and how they respond to mutations in
proteins known to be involved in forming the pre-replication
complex. Still, much of this variability is yet to be explained. We
are collaborating with Cornell molecular biologist Bik Tye on gaining
a better characterization of replication origins in *Saccharomyces*
species. In particular, we are interested in characterizing the
sequence elements that account for the variability among replication
origins in *S. cerevisiae*, as well as in detecting and analyzing
new replication origins in related *Saccharomyces* species.

__Computational Statistics__ -
Our search for an efficient and accurate computation of motif
significance led us to develop a new approach for exact tests (exact
tests are ones where the significance of the test is evaluated
directly from the underlying distribution rather than using an
approximation). Borrowing ideas from large-deviation theory, the
underlying mechanism of our approach is the exact numerical
calculation of the exponentially shifted characteristic function of
the test statistic. So far, we have used this approach to develop faster
exact algorithms for the classical multinomial goodness-of-fit test
and the Mann-Whitney test.
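
To make concrete what "exact" means here, the sketch below computes the exact p-value of the multinomial log-likelihood-ratio (llr) goodness-of-fit statistic by brute-force enumeration of every possible count vector. This is a naive baseline, not the shifted characteristic function approach described above: its cost grows rapidly with the sample size and the number of categories, which is the kind of cost faster exact algorithms aim to avoid. All function names are hypothetical, and the null probabilities are assumed to be strictly positive.

```python
import math


def compositions(n, k):
    """Yield every tuple of k non-negative integers that sums to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest


def llr(counts, probs):
    """Log-likelihood-ratio statistic: sum of c * log(c / (n * p)) over c > 0."""
    n = sum(counts)
    return sum(c * math.log(c / (n * p)) for c, p in zip(counts, probs) if c > 0)


def multinomial_pmf(counts, probs):
    """Probability of a count vector under the null multinomial (probs > 0)."""
    log_p = math.lgamma(sum(counts) + 1)
    for c, p in zip(counts, probs):
        log_p += c * math.log(p) - math.lgamma(c + 1)
    return math.exp(log_p)


def exact_llr_pvalue(observed, probs):
    """P(llr >= observed llr) under the null, summed over all outcomes."""
    n = sum(observed)
    # Small tolerance so ties with the observed statistic are included.
    threshold = llr(observed, probs) - 1e-12
    return sum(
        multinomial_pmf(counts, probs)
        for counts in compositions(n, len(probs))
        if llr(counts, probs) >= threshold
    )


if __name__ == "__main__":
    print(exact_llr_pvalue((7, 1, 2), (1 / 3, 1 / 3, 1 / 3)))
```

The number of count vectors enumerated is the binomial coefficient C(n + k - 1, k - 1) for n observations and k categories, so this sketch is only practical for small inputs.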

CS 280 - Discrete Structures: Spring 04, Fall 06

CS 4520 (aka CS 426) - Introduction to Bioinformatics: Fall 05, Spring 07, Spring 08, Fall 08

CS 628 - Biological Sequence Analysis: Fall 04, Spring 06, Fall 07

CS 726 - Problems and perspective in computational molecular biology: Fall 03, Spring 04, Spring 05

Sequence Analysis Journal Club (run with Tomas Vinar and Brona Brejova): Fall 06, Spring 07, Fall 07

ENGRG 150 - Engineering Seminar: Fall 06

**GIMSAN** – a novel tool for de novo motif finding that includes a reliable significance analysis

**SADMAMA** – a computational tool for detection of significant variation in binding affinity across two sets of sequences

**The FAST package** – Fourier transform based Algorithms for Significance Testing of ungapped multiple alignments

**csFFT/sFFT** – computing the p-value of the information content (entropy score) of a sequence motif

**BagFFT** – computing the exact p-value of the llr statistic for the multinomial goodness-of-fit test

**GibbsILR** – a Gibbs sampler based motif finder

**Ph.D.** in Mathematics, Courant Institute, New York
University

Thesis title: *Stationary Approximations to
Non-Stationary Stochastic Processes.*

Advisor: Prof. H. P. McKean

**M.Sc.** in Mathematics, Department of Mathematics, Technion -
Israel Institute of Technology

Thesis title: *A Generalization
of the "Ahlswede Daykin Inequality".*

Advisor: Prof.
R. Aharoni

**B.Sc.** in Computer Science and Mathematics, Hebrew
University of Jerusalem

**NSF CAREER Award** No. 0644136, 7/2007-1/2012.

- **July 2003 - present:** Assistant Professor at the Computer Science Department of Cornell University
- 2001 - 2003: Project scientist at the Department of Computer Science and Engineering of the University of California, San Diego
- 1999 - 2000: Assistant Professor at the Department of Mathematics of the University of California, Riverside
- 1996 - 1999: Von Karman Instructor at the Applied Mathematics Department of the California Institute of Technology
- 1991 - 1996: Research and Teaching assistant at the Courant Institute of New York University

Ng P., Keich U. **Factoring local sequence composition in motif significance analysis.**
*GIW 2008*, In Press.

Keich U., Gao H., Garretson JS., Bhaskar A., Liachko I., Donato J.,
Tye B. **Computational detection of significant variation in binding affinity across two sets
of sequences with application to the analysis of replication origins in yeast.**
*BMC Bioinformatics*, 9:372, 2008. (paper)

Ng P., Keich U. **GIMSAN: a Gibbs motif finder with significance analysis.**
*Bioinformatics*, In Press.

Keich U., Ng P. **A conservative parametric approach to motif
significance analysis.** *Genome Informatics*, 19:61-72, 2007.
(preprint)

Nagarajan N., Keich U. **FAST: Fourier transform based Algorithms
for Significance Testing of ungapped multiple alignments.**
*Bioinformatics*, 24(4):577-8, 2008.

Ng P., Nagarajan N., Jones N., and Keich U. **Apples to apples:
improving the performance of motif finders and their significance
analysis in the Twilight Zone.** *Bioinformatics*,
22(14):e393-401, ISMB 2006. (preprint)

Nagarajan N., Ng P., Keich U. **Refining motif finders with
E-value calculations.** *Proceedings of the 3rd RECOMB Satellite
Workshop on Regulatory Genomics*, Singapore 2006. (preprint)

Keich U., Nagarajan N. **A fast and numerically robust method for
exact multinomial goodness-of-fit test.** *Journal of
Computational and Graphical Statistics*, 15(4):779-802, 2006.
(preprint)

Nagarajan N., Jones N., and Keich U. **Computing the p-value of
the information content from an alignment of multiple sequences.**
*Bioinformatics*, Vol. 21, Suppl 1, ISMB 2005, i311-i318.
(preprint)

Buhler J., Keich U., Sun Y. **Designing Seeds for Similarity
Search in Genomic DNA.** *Journal of Computer and System
Sciences*, Volume 70, Issue 3, May 2005, Pages 342-363. (preprint)

Keich U., and Nagarajan N. **A Faster Reliable Algorithm to
Estimate the p-Value of the Multinomial llr Statistic.**
*Proceedings of the 4th International Workshop on Algorithms in
Bioinformatics (WABI 2004)*, September 2004, Bergen, Norway.
(preprint)

Keich U. **sFFT: a faster accurate computation of the p-value of
the entropy score.** *Journal of Computational Biology*,
Volume 12, Number 4, May 2005, Pages 416-430. (preprint)

Zhi D., Keich U., Pevzner P., Heber S., and Tang H. **Checking
for base-calling errors in repeats.** *IEEE/ACM Transactions on
Computational Biology and Bioinformatics*, 4(1):54-64, (2007).
(preprint)

Keich U., Li M., Ma B., and Tromp J. **On Spaced Seeds for
Similarity Search.** *Discrete Applied Mathematics*,
138(3):253-263, 2004. (preprint)

Buhler J., Keich U., Sun Y. **Designing Seeds for Similarity
Search in Genomic DNA.** *Proceedings of the Seventh Annual
International Conference on Research in Computational Molecular
Biology (RECOMB-2003)*, April 2003, Berlin, Germany. (preprint)

Eskin E., Keich U., Gelfand M.S., Pevzner P.A. **Genome-Wide
Analysis of Bacterial Promoter Regions**. *Proceedings of the
Pacific Symposium on Biocomputing (PSB-2003)*, January 2003,
Kaua'i, Hawaii. (preprint)

Keich U., and Pevzner, P.A. **Finding motifs in the twilight
zone.** *Bioinformatics*, Vol. 18 (2002), Issue 10, 1374-1381.
(preprint)

Keich U., and Pevzner P.A. **Subtle motifs: defining the limits
of motif finding algorithms.** *Bioinformatics*, Vol. 18
(2002), Issue 10, 1382-1390. (preprint)

Keich U. and Pevzner P.A. **Finding motifs in the twilight zone.**
*Proceedings of the Sixth Annual International Conference on
Research in Computational Molecular Biology (RECOMB-2002)*, April
2002, Washington DC, USA, ACM Press. (preprint)

Keich U., **A Stationary Tangent - the Discrete and Non-smooth
Cases.** *Journal of Time Series Analysis*, March 2003, vol.
24, no. 2, pp. 173-192(20). (preprint)

Cwikel M. and Keich U., **Optimal decompositions for the
K-functional for a couple of Banach lattices.** *Arkiv för
Matematik*, 39 (2001), No. 1, 27-64. (preprint)

Keich U., **A Possible Definition of A Stationary Tangent.**
*Stochastic Processes and Their Applications*, 88 (2000), No. 1,
1-36. (preprint)

Keich U., **Krein's Strings, the Symmetric Moment Problem, and
Extending a Real Positive Definite Function.** *Communications
on Pure and Applied Mathematics*, 52 (1999), no. 10, 1315-1334.
(preprint)

Keich U., **On L ^{p}**

Keich U., **Absolute Continuity Between the Wiener and Stationary
Gaussian Measures.**, *Pacific Journal of Mathematics*, Vol.
88 (1999), No. 1, 95-108. (preprint)

Keich U., **The Entropy Distance Between the Wiener and
Stationary Gaussian Measures.**, *Pacific Journal of Mathematics*,
Vol. 88 (1999), No. 1, 109-128. (preprint)

Aharoni R. and Keich U., **A Generalization of the Ahlswede Daykin
Inequality.** *Discrete Mathematics*, 152 (1996), 1-12.