Note 01/12/06: This is a version of my application materials that currently serves as my website. I will be continuously updating it in the next week.

KAMEN Y. YOTOV

Department of Computer Science

Cornell University

Ithaca, NY 14853

(607) 262-0792

kamen@yotov.org

http://yotov.org

 

PERSONAL DATA:

  • Date of birth: June 03, 1977
  • Citizenship: Bulgarian
  • F-1 Visa, application for H-1B pending (IBM T. J. Watson Research Center)

FIELDS OF INTEREST:

Programming languages and compilers for high-performance and high productivity computing, adaptive and self-optimizing systems, empirical and model driven optimization techniques, machine learning in the context of computer systems, parallel and distributed computing, multicore processors.

EDUCATION:

  • Ph.D. in Computer Science, January 2006, Cornell University, Ithaca, NY.
    Title of Ph.D. Thesis: On the Role of Search in Generating High-Performance BLAS Libraries
    Advisor: Professor Keshav Pingali
  • M.S. in Computer Science, January 2003, Cornell University, Ithaca, NY
  • B.S. and M.S. in Computer Science, June 1999, Sofia University, Sofia, Bulgaria

PROFESSIONAL EXPERIENCE:

  • June 2005 – present: Postdoctoral Researcher, I.B.M. T.J. Watson Research Center, Yorktown Heights, NY
  • August 2001 – June 2005: Research Assistant, Cornell University, Ithaca, NY
  • August 2002 – June 2005: Instructor, Cornell University, Ithaca NY
  • June 2002 – August 2002: Summer Intern, Microsoft Research, Seattle, WA
  • June 2001 – August 2001: Summer Intern, Microsoft, Seattle, WA
  • January 2001 – June 2001: Teaching Assistant, Cornell University, Ithaca, NY
    CS430: Information Discovery, Professor Bill Arms
  • August 2000 – December 2000: Teaching Assistant, Cornell University, Ithaca, NY
    CS472:
    Foundations of Artificial Intelligence, Professor Clair Cardie
  • September 1997 – June 1999: Held various teaching assistant positions at Sofia University, Sofia, Bulgaria
  • 1994 – 2003: Held various consulting positions, involving industrial software design, development, deployment, and maintenance.

AWARDS AND HONORS:

  • 2003 – 2005: Microsoft Research Graduate Student Fellowship
  • 2001: Semi-finalist of TopCoder Invitational Contest
  • 1999: Most Successful Student Scholarship Award, Sofia University
  • 1999: Graduated with Honors, Ranked 1st (out of 168) in the class of 2000, Sofia University
  • 1998: Finalist at 22nd ACM International Collegiate Programming Contest, Atlanta, GA
  • 1994 – 1995: Two bronze medals and one gold medal from International Programming Contests

PUBLICATIONS:

  • [PLDI’06?] The Price of Cache Obliviousness
    Submitted to Programming Language Design and Implementation (PLDI)
    (with Tom Roeder, Keshav Pingali, Fred Gustavson, and John Gunnels)
  • [LCPC’05a] Automatic Measurement of Instruction Cache Capacity
    In Proceedings of the 18th Workshop on Languages and Compilers for Parallel Computing (LCPC)
    (with Sandra Jackson, Tyler Steele, Keshav Pingali, and Paul Stodghill)
  • [LCPC’05b] A Language for the Compact Representation of Multiple Program Versions
    In Proceedings of the 18th Workshop on Languages and Compilers for Parallel Computing (LCPC)
    (with Sebastien Donadio, James Brodman, Thomas Roeder, Denis Barthou, Albert Cohen, Maria Garzaran, David Padua, and Keshav Pingali)
  • [LCPC’05c] Analytic Models and Empirical Search: A Hybrid Approach to Code Optimization
    In Proceedings of the 18th Workshop on Languages and Compilers for Parallel Computing (LCPC)
    (with Arkady Epshteyn, Maria Garzaran, Gerald DeJong, David Padua, Gang Ren, Xiaoming Li, and Keshav Pingali)
  • [IEEE’05] Is Search Really Necessary to Generate High-Performance BLAS?
    In Proceedings of the IEEE 93(2), Special Issue on “Program Generation, Optimization, and Adaptation”
    (with Xiaoming Li, Gang Ren, Maria Garzaran, David Padua, Keshav Pingali, and Paul Stodghill)
  • [ICS’05] Think Globally, Search Locally
    In Proceedings of the 19th ACM International Conference on Supercomputing (ICS)
    Extended version to appear in Transactions on Architecture and Code Optimization (TACO)

    (with Keshav Pingali and Paul Stodghill)
  • [QEST’05] X-Ray: A Tool for Automatic Measurement of Hardware Parameters
    In Proceedings of the 2nd International Conference on Quantitative Evaluation of SysTems (QEST)
    (with Keshav Pingali and Paul Stodghill)
  • [SIGMETRICS’05] Automatic Measurement of Memory Hierarchy Parameters
    In Proceedings of International Conference on Measurement & Modeling of Computer Systems (SIGMETRICS)
    (with Keshav Pingali and Paul Stodghill)
    (runner-up for best student paper)
  • [PLDI’03] A Comparison of Empirical and Model-driven Optimization
    In Proceedings of Programming Language Design and Implementation (PLDI)
    (with Xiaoming Li, Gang Ren, Michael Cibulskis, Gerald DeJong, Maria Garzaran, David Padua, Keshav Pingali, Paul Stodghill, and Peng Wu)
    (runner-up for best paper)
  • [’98] A Multipurpose Functional Programming Language without Special Forms and Lambda Expressions
    In Proceedings of the 27nd Spring Conference of the Union of Bulgarian Mathematicians

RESEARCH STATEMENT:

My research interests lie at the intersection of programming languages, compilers, and computer architecture. In particular, I am interested in adaptive, self-optimizing systems in the context of high-performance and high-productivity computing.

Contributions

I have made three significant contributions in this area.

1) Using models in self-optimizing library generators

This work was an investigation into the importance of empirical search in library generators like ATLAS, FFTW and PhiPAC. It began as an effort to understand why compilers are unable to generate code competitive with code produced by self-optimizing library generators. Conventional wisdom in our field held that compilers were unable to produce good code because they used simple architectural models to compute optimal values for transformation parameters such as tile sizes and loop unrolling factors; in contrast, self-optimizing systems typically use heuristic search by running programs with different parameter values on the actual hardware, and determining the parameter values that perform best (this is called empirical optimization). It was believed that the models used in compilers were too simplistic to provide accurate values for transformation parameters.

In collaboration with David Padua’s group at UIUC, we decided to test this hypothesis by building a modified version of the ATLAS system that uses analytical models to choose optimization parameter values. We did not change the code generation module, so any difference in the performance of code produced by the two systems arises only from the replacement of empirical optimization with model-driven optimization. Surprisingly, our model-driven version of ATLAS produced code that obtains performance that is within 5% of the performance of the code produced using empirical search. This proved conclusively that the inability of modern compilers to produce high quality code could not be attributed to simplistic models, at least in the domain of dense linear algebra.

These results were first published in [PLDI’03], and the paper was a runner-up for the best paper award. An extended version appeared in the IEEE Journal as an invited paper in a special issue on Program Generation, Optimization, and Adaptation [IEEE’05]. In a signed review of the IEEE paper, Ken Kennedy wrote: “Every once in a while, a truly outstanding paper comes along that serves as a model for other papers in the area. This is such a paper…I must admit I was amazed by the depth and breadth of the experimental evaluation… a fascinating study, filled with rich insights.”

In more recent work [ICS’05], I showed that the remaining gap between the performance of the codes produced by the model-driven version of the ATLAS and by the original ATLAS system can be removed by doing a small amount of local search around the parameter values predicted by the model (on some architectures, the resulting system actually generates better code than the original ATLAS system). Moreover, the time required to find parameter values is a few seconds, compared to the many hours taken by empirical search. We have coined the phrase “Think globally, search locally” to describe this approach. An extended version of this work has been accepted for publication in the ACM Transactions on Architectures and Code Optimization.

In all these experiments, analytical models were constructed manually. A natural question is whether these models can be constructed automatically using machine learning techniques. In [LCPC’05c], we describe some joint work with Gerald DeJong’s group at UIUC in this direction.

2) X-Ray: Measuring architectural parameters for use in self-optimizing systems

Self-optimizing systems need the values of architectural parameters. For example, the original ATLAS system uses the capacity of the L1 cache to bound the search space for the tile size parameter used by the code generator; similarly, it uses the number of registers available for holding floating-point values to bound the search space for certain loop-unrolling parameters. Model-driven self-optimizing systems need very accurate values of architectural parameters because these values are used directly in analytical formulas. Furthermore, to be truly self-optimizing, the values of architectural parameters must be determined automatically, without user input.

It is important to note that self-optimizing systems need effective architectural parameter values that are relevant for program optimization, rather than the actual values found in hardware manuals. For example, as far as ATLAS is concerned, the effective number of floating point registers on Intel Itanium 2 is 126 as opposed to 128 because two of them are hardwired to the values 0.0 and 1.0.

I have built an open system called X-Ray for measuring effective values of a wide range of architectural parameters relevant to self-optimizing systems, including the existence, latency and throughput of instructions and instruction sequences (such as fused-multiply-add), the number of registers, the capacity, associativity, and line size of all cache levels. I have implemented X-Ray to make it easy to add new micro-benchmarks as needed.

Some of the micro-benchmarks in X-Ray implement novel algorithms. In particular, our approach to measuring memory hierarchy parameters is far more accurate than the well-known one due to Saavedra. This work was described in a paper which appeared in [SIGMETRICS’05], and which was the runner-up for Best Student Paper prize. The other major part of X-Ray, for measuring important CPU parameters, was published in [QEST’05]. A recent addition to X-Ray is the first micro-benchmark I know of that measures the capacity of the instruction cache [LCPC’05a]. It was implemented in X-Ray by Tyler Steele, one of our undergraduates, under my supervision.

X-Ray is available for download. It is being used in CS 612, which is Keshav Pingali’s graduate compiler course, and by Markus Puschel at CMU in his courses.

3) The price of adaptivity in cache-oblivious algorithms

Another way of writing programs that adapt to the memory hierarchy is to use cache-oblivious algorithms, as proposed by Charles Leiserson’s group at MIT. These are usually divide-and-conquer algorithms; each step of the division process produces sub-problems with smaller working sets, and when the working set fits in some level of the memory hierarchy, the sub-problem can be executed without further cache misses. In fact, Hong and Kung proved in 1981 that divide-and-conquer algorithms for matrix multiplication and FFT enjoyed a certain theoretical optimality property called I/O optimality (they perform an asymptotically minimal number of transfers between cache and memory for a simple memory model). These divide-and-conquer implementations can be viewed as self-optimizing programs because they adapt automatically to the different levels of the memory hierarchy.

In recent work with Fred Gustavson and John Gunnels at IBM, we performed the first careful study comparing cache-oblivious algorithms with optimized, cache-aware algorithms such as the ones implemented in the ATLAS system. Starting with a totally cache-oblivious program, we added “awareness” of the processor pipeline, registers, and different cache levels, obtaining an optimized, fully cache-aware and processor-aware program in the end. Contrary to what most people in our community believed, there is a substantial price for cache-obliviousness since the performance of the cache-oblivious code was less than 50% of that of the cache-aware code on the four machines we used. We have submitted these results to [PLDI’06?].

Future Directions

In the next few years, I want to build on these accomplishments to address the problem of producing software for multicore processors. Recently Intel’s Justin Rather pointed out that “We are at the cusp of a transition to multicore, multithreaded architectures, and we still haven't demonstrated the ease of programming the move will require. It's frustrating that, after decades of work in the area, we still don't have the tools and techniques to overcome these big programming obstacles.” Rick Rashid, the head of Microsoft Research, has placed this problem at or near the top of the list of research priorities for Microsoft.

Past efforts at automatic parallelization have had limited success. From my experience, I believe that this is in part because languages like FORTRAN and C are so low-level and so far removed from the application domain that restructuring compilers have to perform a lot of “reverse engineering” before they can determine what is going on in the program. Unfortunately, the analysis techniques that our community has developed, such as alias analysis and shape analysis, have not been able to deal with the complexities of real programs.

My research direction is based on the belief that synthesis is easier than analysis. Rather than start with a low-level language like FORTRAN and C, I want to investigate the use of domain-specific languages in which abstractions such as the operators and data structures relevant to the domain are expressed directly in the programming language. If the compiler knows the algebraic properties of the operators and data structures, it can perform parallelization and optimizations which may be very difficult to engineer at the FORTRAN or C level.

The domains of interest to me are linear algebra, finite-element mesh generators, and graphics. In collaboration with Fred Gustavson and John Gunnels at IBM, I am working on BRILA, a domain-specific language and compiler for dense linear algebra. A program in BRILA is a high-level, recursive divide-and-conquer description of the algorithm that refers only to abstract data structures like matrices and vectors, rather than to concrete data structures like multi-dimensional arrays. The BRILA compiler restructures the high-level program, chooses the appropriate concrete data structure, and generate high performance code, using domain knowledge at the higher level and well-known restructuring transformations at the lower level. I hope to use BRILA very soon to produce threaded code for IBM’s multicore Power architecture.

I am also interested in finite-element mesh generation, which is central to both the numerical solution of partial differential equations and to physical simulations in graphics. One successful method for generating large meshes is Delaunay mesh generation. Our studies have shown that the combination of appropriate high-level abstractions and the idea of transactional memory are key to automatic parallelization of these codes, even though they are very complex and deal with very irregular data structures like graphs.

One problem that does not go away even if we use domain-specific languages is the so-called “needle in a haystack” problem; that is, a compiler may be able to produce many restructured versions of an input program, but it may not know which one will perform best. As an analogy, consider a chess-playing program – it is easy to write a program that knows legal moves for all the pieces on a chess board, but it is much harder to write a program that can actually synthesize winning strategies. Empirical search in the ATLAS or FFTW style is one solution, but it does not scale to large programs or complex architectures. In addition, I must confess that I personally do not find search intellectually appealing. In contrast, models give you insight, and I would argue that finding adequate models to explain empirically observed behavior is what science is all about. However, the models that we have used in our work in self-optimizing systems have all been developed manually, and have taken many weeks of thinking and experimentation. In my future research, I want to study if techniques from machine learning can be used to automate or at least guide the generation of models. Some initial results in this direction are reported in [LCPC’05c].

TEACHING STATEMENT:

I enjoy teaching. I am qualified to teach undergraduate and graduate level courses in the areas of programming languages, compilers, computer architecture, and algorithms and data structures. At Cornell I had the privilege to completely instruct two undergraduate courses, which had a strong impact on my current and future desire to teach.

My most enjoyable teaching experience was being the instructor for CS215: Introduction to C# and .NET. A major challenge in this course is that the students come from very diverse backgrounds, ranging from freshmen to Ph.D. students and staff. I designed the first offering of this course in Fall 2002, and I ran it for eight semesters (till Spring 2005). I handled all lectures, assignments, and grading. I have been running a portal website, which has all discussions and assignments from the different instantiations of this course.

In Summer 2003, I was the instructor for CS130: Creating Web Documents. The course attracted students and staff members from a number of departments. The pace was very challenging as there were lectures daily for six weeks. For me, the most gratifying part of the course was looking at the term projects created by students who, when they started the course, had no idea how the web works.

In addition to my experience as course instructor for CS215 and CS130 at Cornell, I have held various laboratory and teaching assistant positions throughout the years of my undergraduate and graduate studies.

My first teaching appointment was as a laboratory assistant for Introduction to Programming I (a CS 101 course) at Sofia University. I taught lab sections that covered the principles of structured programming languages and basic data structures. I had this position for six semesters (two semesters each of Pascal, C/C++, and Java). In my senior year, I was a Teaching Assistant for my peers for the classes Functional and Logical Programming and Introduction to Artificial Intelligence.

The Ph.D. program at Cornell University gave me the privilege of participating in the delivery of world-class courses in Computer Science. As a teaching assistant for CS472: Foundations of Artificial Intelligence with Professor Claire Cardie, I designed problem sets and participated in the grading of homework and exams. As a teaching assistant for CS430: Information Discovery, I completely organized the course backend, participated in the design of problem sets and grading of submissions, and taught several lectures when Professor Bill Arms was out of town.

Teaching at the graduate level provides an opportunity to get feedback on research ideas. This is the reason why I have worked with my advisor, Professor Keshav Pingali, to incorporate some of my software tools into his graduate compiler course CS612: Software Design for High-performance Architectures. Currently, all assignments in this course use infrastructure that I have developed, such as the following.

  • My X-Ray system is used by students to automatically determine hardware parameters when they are doing assignments on memory hierarchy behavior.
  • I have written a Cache Simulator and Hardware Performance Counter device driver, which are used in studying the behavior of scientific applications and correlating the observed behavior to machine parameters found by X-Ray.
  • I have implemented the ISSFX ANSI C99 restructuring compiler that is used by students to perform source-to-source transformations such as unimodular loop transformations.

I have also given lectures in several classes, like CS614: Advanced Systems, CS717: Introduction to Fault Tolerance, and at several other events hosted by the Association of the Computer Science Undergraduates at Cornell.

In conclusion, I firmly believe that teaching is an invaluable complement to scientific research in academia. Even at the undergraduate level, explaining ideas improves one’s understanding of those ideas. At the graduate level, courses provide a forum for a first-cut evaluation of research ideas, and just as importantly, provide opportunities for networking with bright and intelligent people, often creating life-long friendships.

REFERENCES:

Professor Keshav Pingali (pingali@cs.cornell.edu)

Department of Computer Science

Cornell University

Ithaca, NY 14853-7501

Professor David Padua (padua@uiuc.edu)

Siebel Center for Computer Science

University of Illinois at Urbana-Champaign

201 N. Goodwin Avenue

Urbana, IL 61801-2302

Professor Ken Kennedy (ken@cs.rice.edu)

Rice University

HiPerSoft-MS41

6100 Main Street

Houston, TX 77005-1892

Additional References:

Dr. Fred Gustavson (fg2@us.ibm.com)

IBM T. J. Watson Research Center

1101 Kitchawan Rd, Route 134 / P.O.Box 218

Yorktown Heights, NY 10598

Dr. Calin Cascaval (cascaval@us.ibm.com)

IBM T. J. Watson Research Center

1101 Kitchawan Rd, Route 134 / P.O.Box 218

Yorktown Heights, NY 10598

Dr. Trishul Chilimbi (trishulc@microsoft.com)

Microsoft Research

One Microsoft Way

Redmond, WA 98052