CS312 Lecture 1: Course Overview, Notes on Programming, Background on ML

Who are we?

Prof. Zabih and a staff of about 10. See web page for details: http://www.cs.cornell.edu/courses/cs312 (single most important piece of info in today's lecture). Two members of the course staff are here in lecture today - Maya Haridasan (TA) and Rohan Murthy (consultant). Please stand up...

You will meet the rest of the staff in section and in consulting. More administrivia later on in the lecture.

What is CS312 About?

CS312 is the third programming course in the Computer Science curriculum, following CS100 and CS211. The primary goal of the course is to give students a firm foundation in the fundamental principles of programming and computer science.

A major goal in CS312 is to teach you how to program well. Just about anyone can learn how to throw code together and get simple programs running, but it takes a deep understanding of the principles of computer science to write truly elegant and efficient programs with lasting value. We will try to give you that understanding and teach you some of the craft of programming as well. And practice makes perfect.

Some notes on programming and programming languages

Lots of people vastly overstate the importance of knowing one computer language versus another. In particular, students tend to want to know "What language is the course in?" This is actually a fairly dull question. It's like worrying about what book you use when you first learn to read (Dick and Jane? Dr. Seuss?) In fact, it's actually fairly silly to even list the computer languages you know on your CV; most professors can't answer this question!

There is an important reason for this: computer languages have a lot in common. If you know almost any language really well, you can pick up any other language in a few days (at most). If only foreign languages were so easy (based upon my expertise in Portuguese, I will learn Chinese in under a week...). In fact, many of the distinctions between programming languages are syntactic in nature, which means that all you really need is some examples to start coding in that language.

The key notion here, though, is to know a language really well. This means having a good mental model of what the computer does with a program. How can you tell if you don't have a good model? Suppose your program isn't quite working (it gives the number you want, but is often off by 1, i.e. there is a fencepost error somewhere). If you don't have a good mental model, you tend to change some <'s to <='s. Or worse yet, subtract 1 at the very end.

Why is this bad? There are LOTS of wrong programs out there, and very few right ones. What are the chances that you will stumble onto the right one? The lottery has MUCH better odds.

If you have a good mental model, you can just look at the program and think. It's harder, but much more likely to succeed.
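To make the fencepost discussion concrete, here is a small sketch in SML (the language introduced later in these notes). The function names and the specification are made up for illustration. It shows an off-by-one bug, a "patch at the very end" that appears to work, and the fix you get by thinking about the base case.

```sml
(* Specification (assumed for this sketch): sumTo n = 1 + 2 + ... + n,
   and sumTo n = 0 whenever n <= 0. *)

(* Buggy: the recursion stops one step early, so the final term 1
   is never added. *)
fun sumToBuggy n = if n <= 1 then 0 else n + sumToBuggy (n - 1)

(* Typing instead of thinking: adjust the result at the very end.
   This repairs every n >= 1 but gives the wrong answer for n <= 0. *)
fun sumToPatched n = sumToBuggy n + 1

(* Thinking instead: the sum of no terms is 0, so the base case
   belongs at n <= 0. *)
fun sumTo n = if n <= 0 then 0 else n + sumTo (n - 1)

val buggy   = sumToBuggy 4     (* 4 + 3 + 2 = 9 *)
val patched = sumToPatched 0   (* 1, but the specification says 0 *)
val fixed   = sumTo 4          (* 4 + 3 + 2 + 1 = 10 *)
```

The patched version passes any test that only tries positive inputs, which is exactly why guessing at fixes is dangerous.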

VERY important piece of advice: if you find yourself typing instead of thinking, stop. This is probably your first course with just CS majors, and therefore the assignments will be a lot more work (especially problem sets 4 and 5). You can trash your life (i.e., pull multiple all-nighters) by typing instead of thinking.

CS312 slogan #1 (first of many): Thinking is better than typing.

Many of the problem sets, and all of the exams (very important!) have short, elegant answers. It's in your interest to find them.

Given this, you should learn one language well. Ideally, that language should have a simple and elegant model!

Some fundamental principles of programming

The most important principles arise most clearly in big programs. This makes Microsoft a great source of examples (and counter-examples). If you like Microsoft, you would say that they have clearly found a way to build software that sells, literally, hundreds of millions of copies. If you don't like Microsoft, you would say that they provide a showcase for every programming mistake that it is possible to make. Both views have their adherents...

Principle #1: Never write the same code twice

Many of the basic features in programming languages have a single primary goal, which is to make it easy to avoid duplicating functionality. Examples: subroutines, inheritance. In fact, a good way to think about a programming language is to consider its particular features to encourage code to be used in multiple places.

Why is this so important? Well, imagine that you are a programmer in a large company, and you write a spelling checker. Meanwhile, in a different department (project, continent) some other programmer also writes a spelling checker. Now there are two copies. Why is this bad?

Well, imagine that the other programmer spends a lot of time looking at clever data structures (like we'll cover in CS312), and finds a way to make his spelling checker much faster (say, 10x as fast). You don't benefit from his effort. Or suppose that he finds a subtle bug and fixes it; again, you don't benefit. Or they may add a cool feature, such as stemming. Again, you gain nothing from this effort.

Multiply this by a large number of modules and things get really out of control. Microsoft example: Word, Excel and PowerPoint were originally completely separate programs, written by different groups (in fact, different companies). Various shared features, such as spell checkers or even things like the File Open dialog box, were duplicated. Even the fact that File Open looked the same in Excel and Word was just coincidence; they were different pieces of code, and the designers had to keep them in sync. Between Office95 and Office97, Microsoft rewrote these programs internally so now they share code. (As someone will no doubt add, now they share bugs as well...)
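On a much smaller scale than a spelling checker, the same principle can be sketched in a few lines of SML. The function names are hypothetical. The first pair of functions repeats the list traversal; the second version writes the traversal once and passes in the part that differs.

```sml
(* Duplicated: two nearly identical traversals. An improvement to
   one of them does not reach the other. *)
fun sumSquares [] = 0
  | sumSquares (x :: xs) = x * x + sumSquares xs

fun sumCubes [] = 0
  | sumCubes (x :: xs) = x * x * x + sumCubes xs

(* Shared: the traversal is written once; the per-element work f
   is supplied by the caller. foldl comes from the SML Basis Library. *)
fun sumWith f xs = foldl (fn (x, acc) => f x + acc) 0 xs

val sumSquares' = sumWith (fn x => x * x)
val sumCubes'   = sumWith (fn x => x * x * x)

val a = sumSquares' [1, 2, 3]   (* 1 + 4 + 9 = 14 *)
val b = sumCubes' [1, 2, 3]     (* 1 + 8 + 27 = 36 *)
```

If sumWith is later made faster or has a bug fixed, every function built from it benefits.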

Principle #2: Catch errors as early as possible

Program development takes time, often years. Once a program has been written, it can then be used for many more years (various banks are still running programs written in COBOL in the 1960's, and let's not even talk about the air traffic control system!)

It's very important to catch bugs when they are introduced. The obvious reason is that programmers forget what their code did (if you ever want to understand why it is so important to comment your code, go and look at a program you wrote a year ago and try to figure out what the heck you were thinking!) Moreover, programmers move on to other jobs, or occasionally even retire.

But there is a subtler reason for this, which is that if you don't catch a bug early, it is much much harder to find it at all. This is because bugs can manifest themselves in very subtle ways; for example, an errant pointer can corrupt some unrelated datastructure, and many days later some completely different piece of code might then crash.

On the other hand, if you add a module and suddenly something breaks, you know that what you just did caused the problem!

Languages that make it hard to write buggy code, like the one we will use, are good. Some languages simply will not permit certain kinds of bugs (ML is one of them). Of course, the trade-off here is that you can't write certain kinds of fast programs. Another feature is that for some (strongly typed) languages the compiler will catch lots of errors. Various students in past CS312 classes have said that if their code compiled, it would run correctly. This is, in fact, high praise.
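Here is one small illustration, in SML, of a compiler catching a mistake before the program ever runs. The lookup function and the sample data are invented for this sketch. Because "not found" is part of the result type, code that forgets about that case does not compile.

```sml
(* lookup returns an option: SOME v if the key is present, NONE if
   it is not. The possibility of failure is visible in the type. *)
fun lookup (key : string) [] = NONE
  | lookup key ((k, v) :: rest) =
      if k = key then SOME v else lookup key rest

val ages = [("alice", 30), ("bob", 25)]

(* Rejected by the type checker, because lookup yields an int option
   rather than an int:

     val wrong = lookup "bob" ages + 1
*)

(* Accepted: both cases are handled explicitly. *)
fun ageOrZero name =
  case lookup name ages of
      SOME age => age
    | NONE => 0

val bob   = ageOrZero "bob"     (* 25 *)
val carol = ageOrZero "carol"   (* 0, since "carol" is not in the list *)
```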

Software companies actually devote enormous effort to testing their code, in an attempt to find bugs early. Microsoft has about twice as many testers as programmers (and, to quote Bill Gates from a 1997 meeting, "probably there should be more").

Principle #3: Be wary of over-stressing performance

Performance (speed) is important, but a lot of programmers over-stress it and cause all kinds of difficulties. For example, when you start writing a program, it is tempting to take some subroutine and try to make that incredibly fast, by spending a lot of time and effort on it. However, if you're not careful, you can discover that this subroutine doesn't even get used, especially if instead of thinking up a good design you simply start coding. (I view this as a special case of typing instead of thinking.) Even if it does get used, it may not be where the program spends most of its time.

Moreover, it is very tempting to sink a large amount of programming time into making a piece of code really fast, thus causing a long delay before any version of the software is ready. Worse, you often don't know what the bottlenecks are, or what the input pattern will be, until you have at least an initial prototype running. It may turn out that in practice one special case in your subroutine gets called 99% of the time, and by special casing this you could have made your code very fast but also simple.

The solution to this problem is called rapid prototyping. It's really under-used in the software industry, although various academics have been yelling about it for decades. In rapid prototyping, you optimize for speed of development, and build an initial version as quickly as possible. In particular, you don't spend time worrying about performance. Then once you have your prototype you can find out (for example) whether it's even close to what your customer wants, as well as what the usage pattern is and where bottlenecks lie.

Industry used to ignore this idea, and suffered greatly. In recent years it has rediscovered it, and fallen for languages that support it (Java and C# are two examples, as is ML).

Principle #4: Keep a clear distinction between specification and implementation

A specification describes the input-output behavior of a program. Any individual implementation of a specification will have various quirks (accidental details).

You should learn to be extremely wary of having someone else depend upon your specific implementation. This can lead to nightmarish situations.

For example, suppose that you are writing a library of code to draw text and graphics on the screen. Perhaps you have a well-documented interface, but your implementation has some quirks. For instance, perhaps drawing a particular set of characters rapidly at some special location erases the whole screen. (This kind of thing can happen.) Now, some user happens to take advantage of this “feature”. Perhaps it’s some other company’s software which uses your library, but which is very important.

At this point, your specification has essentially changed out from under you, and you are now bound to continue with this accidental feature. Worse, you may have to do this essentially forever. So your new drawing library, which is completely different from the old code, has to be sure that it supports this, essentially as a special case.

Now imagine that the same thing happens with your new drawing library. Someone else takes advantage of another “undocumented feature” of your new implementation.

Fast-forward a few generations, and your code is now littered with special cases that generate bizarre behavior that someone once depended upon (and perhaps still does).

If this happens to several of your libraries, you might end up essentially implementing “bug sets”, where all your code knows to behave as if it is version 1, or version 2, or version 3. Then you somehow annotate the application as to which bug set it uses.

Perhaps this sounds fanciful, but it actually happens on a regular basis. Anyone know the source of my example? Microsoft GDI and Lotus 123. Even under Windows NT or Win2K, certain GDI calls crawl up the stack, see if the application calling them is lotus123.exe, and emulate weird bugs. Moreover, WinXP has essentially institutionalized bug sets.

Motto: (Well, there can be several, but here is one…) He who controls your specification controls your life. Write a good one, and stick to it firmly.
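SML lets you write the specification down separately from the implementation and have the compiler enforce the boundary. The following is a minimal sketch; the COUNTER interface and both implementations are made up for illustration. A signature plays the role of the specification, and the :> operator hides everything about the implementation that the signature does not mention.

```sml
(* The specification: everything a client is allowed to rely on. *)
signature COUNTER =
sig
  type counter
  val zero : counter
  val increment : counter -> counter
  val value : counter -> int
end

(* One implementation. Representing a count as a list of units is
   an accidental detail (a quirk). *)
structure ListCounter :> COUNTER =
struct
  type counter = unit list
  val zero = []
  fun increment c = () :: c
  val value = length
end

(* A completely different implementation of the same specification. *)
structure IntCounter :> COUNTER =
struct
  type counter = int
  val zero = 0
  fun increment c = c + 1
  fun value c = c
end

(* Rejected by the compiler: outside the structure, a counter is
   abstract, so a client cannot depend on its being a list.

     val n = length ListCounter.zero
*)

val two  = ListCounter.value
             (ListCounter.increment (ListCounter.increment ListCounter.zero))
val two' = IntCounter.value
             (IntCounter.increment (IntCounter.increment IntCounter.zero))
```

Since clients can only use what the signature exposes, ListCounter can be replaced by IntCounter without breaking them.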

By the way, can anyone guess WHY Lotus 123 would do such a crazy thing? Performance, of course!

How this relates to CS312

Our primary emphasis is going to be on understanding code really well. We’re going to use a language that supports lots of ways of avoiding repeating code (including some, like HOPs, or higher-order procedures, that you probably haven’t seen). It also does a very good job of catching bugs at compile time.

Re: performance and specification. The biggest performance improvements come from better algorithms and data structures, not from low-level tweaks (which can absorb huge amounts of programmer effort, as well as resulting in un-maintainable code). One of the thrills of CS is to be able to keep the interface intact but add a clever new data structure or algorithm, thus causing things to go faster. This is another important part of the course.
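As a small, standard illustration of an algorithmic improvement behind an unchanged specification, consider computing Fibonacci numbers. This example is not from the lecture, and the function names are made up. Both functions below satisfy the same specification, but the first takes time exponential in n while the second takes time linear in n.

```sml
(* Specification (for both versions): for n >= 0, return the n-th
   Fibonacci number, where fib 0 = 0 and fib 1 = 1. *)

(* Direct transcription of the definition: exponential time. *)
fun fibSlow n = if n < 2 then n else fibSlow (n - 1) + fibSlow (n - 2)

(* Same specification, better algorithm: linear time.
   After i steps of loop, a holds fib i and b holds fib (i + 1). *)
fun fibFast n =
  let
    fun loop (0, a, _) = a
      | loop (k, a, b) = loop (k - 1, b, a + b)
  in
    loop (n, 0, 1)
  end

val slow = fibSlow 10   (* 55 *)
val fast = fibFast 10   (* 55 *)
```

No amount of low-level tweaking of fibSlow would close the gap that the change of algorithm does.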

Administrivia

Lectures and Sections

Lectures are Tuesday and Thursday, 10:10 to 11am in Kimball B11. Most of them by RDZ, a few by Tibor Janosi, maybe a guest lecture. Note: please don't take CS314 and CS312 at the same time; your life and GPA are likely to suffer greatly.

Sections are Monday and Wednesday at three times (see web page). There will be a 3rd section added, but we don't yet know what time it will be. We'll try to make it convenient for you, subject to various constraints. Keep an eye on the web page, and try to make it to one of the other 2 sections on Monday. This will all be settled early next week.

You are expected to attend both lectures and recitations. You may attend any recitation you want to, but it's probably in your interest to stick with one. Feel free to load-balance.

Course web site

The course web site is at http://www.cs.cornell.edu/courses/cs312. You should keep a close eye on this web page. We will post announcements about the course there. The programming assignments will all be posted there too.

Newsgroup

The best way to reach the course staff is by posting questions or comments to the course newsgroup, cornell.class.cs312. There are many members of the course staff reading the newsgroup who can answer your questions. Read the guidelines on the web page for some tips about newsgroup etiquette.

Email

For questions that would be inappropriate to post to the newsgroup, you can also reach the course staff by sending mail to cs312@cs.cornell.edu. The newsgroup is preferred, however.

Software

You can download a copy of Standard ML of New Jersey (SML/NJ) from the course web site. This includes the Emacs editing environment that you will use to interact with SML and do your programming and debugging.

We will have a few sessions demoing this environment next week. Keep your eye on the course web site for updates about the demos.

Consulting hours

The TAs will have regular office hours during the day; consultants have evening consulting hours. Office hours and consulting hours will be on the web shortly. There will be two different consulting schedules; a light schedule, for weeks when nothing is due, and a heavy schedule before problem sets and prelims. In addition, the night that every problem set is due, we will hold extended consulting hours from 7pm-12 midnight. Consulting hours will not be held the day after a problem set is due.

Problem Sets

The homework in this class will consist of five problem sets. The first of these problem sets will be available on the course web site by Monday. Problem sets are due 11:59PM on Wednesday and will be handed back in section on Monday. The first problem set is due on Wednesday September 10.

Prelims & Final

There will be two prelims, October 16 and November 18, held in the evenings. Location is on web site.

The final is December 17.

Make-up exams are oral; let's try not to have them.

Grading

Last year: 30% A, 40% B, 30% C or less (mostly C). Past performance is no guarantee of future outcomes.

Everything counts, but exams count more. Especially the final, since I have it in front of me when I assign grades.

Our language choice: SML

We use the Standard ML (SML) programming language throughout the course. SML is a modern functional programming language with an advanced type and module system. The course is not about programming in SML. Rather, SML provides a convenient framework in which we can achieve the objectives of the course. Like the object-oriented model of Java, the functional paradigm of SML is an important programming model with which all students should be familiar, as it underlies the core of almost any high-level programming language. In addition the SML type and module systems provide frameworks for ensuring code is modular, correct, reusable, and elegant. The lessons you learn in programming with SML will be applicable to other programming languages such as Java. By studying alternative ways to write programs, you will be better equipped to use, implement or even design future programming environments.

Another important reason we use SML is that it has a relatively clean and simple model that makes it easier to reason about the correctness of programs. Indeed, SML was one of the first major programming languages to have a formal semantic definition. In our studies, we will see that we can reason formally about the functional correctness of code, and also about the space, time, and other resources used in a computation.

Background on ML

Our first order of business in this course is to learn how to use ML. Why learn another language?

We use a zillion different programming languages to communicate with machines and each other:

· general purpose programming: Fortran, Lisp, Basic, C, Pascal, C++, Java, etc.

· scripting: Visual Basic, awk, sed, perl, tcl, sh, csh, bash, REXX, Scheme, etc.

· search: regular expressions, browser queries, SQL, etc.

· display and rendering: PostScript, HTML, XML, VRML, etc.

· hardware: CCS, VHDL, Esterel

· theorem proving and mathematics: Mathematica, Maple, Matlab, NuPRL, Coq

· others?

Though there are only a handful of general-purpose languages that you will learn and use, you'll be learning and using special-purpose languages for the rest of your life. Even general-purpose languages come and go. Today, it's Java and C++. Yesterday, it was Pascal and C, before that Fortran and Lisp. Who knows what it will be like tomorrow? You have to learn how to learn new languages.

In addition, many projects will require that you build "little" languages for gluing things together.

· Javascript grew out of a little language to make web pages interactive

· protocols, like HTTP or TCP, are little languages that allow devices to talk to one another

· the command prompt of DOS handles a little shell language

· search engines on the web accept queries in a little language

· others?
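To give a feel for how small a "little" language can be, here is a sketch in SML of a language of arithmetic expressions together with its interpreter. The datatype and function names are invented for this example, and the syntax will be explained in later lectures.

```sml
(* The language: numbers, sums, and products. *)
datatype expr =
    Num of int
  | Add of expr * expr
  | Mul of expr * expr

(* The interpreter: one case per form of expression. *)
fun eval (Num n) = n
  | eval (Add (a, b)) = eval a + eval b
  | eval (Mul (a, b)) = eval a * eval b

(* The expression (2 + 3) * 4. *)
val example = Mul (Add (Num 2, Num 3), Num 4)
val result = eval example   (* 20 *)
```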

We gain a lot of leverage by having good notation and good language support for a given domain.

· perl is extremely useful for searching through documents because of its built-in support for regular expressions

· SQL is a very high-level language that makes it easy to do database transactions in a scalable way.

So it's important to understand programming models and programming paradigms because in this fast changing field, you need to be able to rapidly adapt.

It's crucial that you understand the principles behind programming that transcend the specifics of today.

There's no better way to get at these principles than to approach programming from a completely different perspective.

This is one reason why we're using ML -- it's very different from what most of you will have seen.

A great general-purpose programming language:

· lets you say things concisely and understandably at the right level of abstraction

· lets you extend the language with new features that are specific to a domain but blend in well with the rest of the language.

· makes it easy to write correct code, with good performance

· makes it easy to change the code when you find out the specification has changed

· makes it easy to re-use code

· is easy to learn

Fact: there are thousands of general purpose languages.

Corollary: there are no great programming languages.

But there are some pretty good ones. Java and ML are pretty good general-purpose languages (at least when compared to their predecessors.)

SML is a functional programming language.

· genealogically, it fits into the Lisp, Scheme, Miranda, Hope, Haskell, etc. line of programming languages.

· Lisp vs. FORTRAN: functional vs. imperative

· the key linguistic abstraction of this family: programmers can build new functions

· forms the core of almost any general-purpose language

· casting everything in terms of functions has its benefits: uniform, simple

· functions are first-class: you can pass them to other functions, return "new" functions from functions, put functions in data structures, compose new functions out of old ones, etc.

· you don't need to build in loops (e.g., while-loops, for-loops, do-loops, iterators, etc.) because these can be coded easily using functions.

· constructing models of and reasoning about functional languages is generally easier than for other languages (since you have to at least model the functional subset)

· SML does support imperative programming, but doesn't encourage it.

· SML is not object-oriented, although there are versions of ML that are (Objective Caml, for example).
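A few of the points above can be sketched directly in SML. All of the names below are made up for illustration. The sketch shows functions passed as arguments, returned as results, composed, and stored in a list, plus a "for loop" written as an ordinary function.

```sml
(* Passing a function as an argument. *)
fun twice f x = f (f x)

(* Returning a "new" function from a function. *)
fun adder n = fn x => x + n

(* Composing new functions out of old ones; o is SML's built-in
   composition operator, so this adds 3 and then doubles. *)
val addThenDouble = (fn x => x * 2) o adder 3

(* Putting functions in a data structure and applying them in order. *)
val steps = [adder 1, fn x => x * x, adder ~2]
fun applyAll fs x = foldl (fn (f, acc) => f acc) x fs

(* A for-loop coded as a function: run body for i = lo, ..., hi,
   threading an accumulator through the iterations. *)
fun forLoop (lo, hi) body acc =
  if lo > hi then acc
  else forLoop (lo + 1, hi) body (body (lo, acc))

val sixteen  = twice (adder 3) 10                        (* 10 + 3 + 3 = 16 *)
val fourteen = addThenDouble 4                           (* (4 + 3) * 2 = 14 *)
val chained  = applyAll steps 3                          (* 3 -> 4 -> 16 -> 14 *)
val sumTo10  = forLoop (1, 10) (fn (i, acc) => acc + i) 0   (* 55 *)
```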

SML is a statically typed, type-safe programming language.

· a type-safe language ensures that you don't apply the wrong operations to the wrong data.

· In practice, this prevents a lot of silly errors (e.g., treating an integer as a function) and also prevents a lot of security problems -- over half of the reported break-ins at CERT were due to buffer overflows -- something that's impossible in a type-safe language.

· Functional languages like Scheme and Lisp are type-safe, but dynamically typed. That is, type-errors are caught only at run-time.

· C and C++ are statically typed but not type-safe. There's no guarantee that a type-error won't occur.

· Java and SML are type-safe and statically typed. This means that most errors are caught before running the program.

· Fact: statically determining exactly which programs will have a type-error at run time is impossible in general (the problem is undecidable).

· Corollary: all statically-typed languages are conservative and may reject some programs that are perfectly okay.

· A good statically-typed language rules out lots of bad code, while admitting lots of good code.
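Here is a hedged sketch in SML of what "conservative" means in practice. The datatype and function names are invented for the example. The first commented-out declaration could never go wrong at run time, yet the type checker rejects it; the second is rejected for good reason. The code after the comments shows one way to write the first program so that it is accepted.

```sml
(* Rejected, although perfectly okay at run time: the else branch
   can never execute, but the checker still requires both branches
   to have the same type.

     val n = if true then 1 else "one"
*)

(* Rejected, and rightly so: treating an integer as a function.

     val bad = 3 4
*)

(* Accepted: say explicitly that a value is either an integer or a
   string, and handle both cases wherever it is used. *)
datatype intOrString = I of int | S of string

fun describe (I n) = "the integer " ^ Int.toString n
  | describe (S s) = "the string " ^ s

val v = if true then I 1 else S "one"
val description = describe v   (* "the integer 1" *)
```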

SML (and SML/NJ in particular) supports a number of advanced features.

· garbage collection: as in Java, the automatic memory management of SML lifts the burden of having to worry about memory management -- a common source of bugs in languages such as C or C++.

· type inference: you do not have to write type information down everywhere. The compiler automatically figures out most types. This makes the code a bit more terse which can make it easier to read and maintain. (But this is a double-edged sword. Too little type information can make code harder to read.)

· parametric polymorphism: ML lets you write functions and data structures that can be used with any type. This is crucial for being able to re-use code. Java provides a form of subtype polymorphism which also lets you re-use code. We'll learn more about parametric and subtype polymorphism and their relative strengths and weaknesses in class.

· algebraic datatypes: you can build sophisticated data structures in ML very easily, without fussing with pointers and memory management. Pattern matching makes them even more convenient.

· exceptions, threads, and continuations: as in Java, SML/NJ supports exceptions and threads, which are crucial for building real systems. The thread model of SML/NJ is radically different from that of Java, however. In addition, SML/NJ supports continuations, which are an advanced control construct out of which you can build things like loops, exceptions, and threads.

· advanced modules: SML makes it easy to structure large systems through the use of modules. Modules (called structures) are used to encapsulate implementations behind interfaces (called signatures). SML goes well beyond the functionality of most languages with modules by providing functions that manipulate modules (functors), module variables, multiple interfaces per module, and nested modules.
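Several of the features listed above fit into one short sketch. All names here (tree, treeSize, EmptyTree, ORDERED, MakeSort, and so on) are made up for illustration, and everything will be covered properly in later lectures. The sketch shows an algebraic datatype with pattern matching, type inference, parametric polymorphism, an exception, and a functor.

```sml
(* Algebraic datatype, parametric in the element type 'a:
   a tree is either a Leaf or a Node with two subtrees and a value. *)
datatype 'a tree = Leaf | Node of 'a tree * 'a * 'a tree

(* Type inference: no annotations are written, and the compiler
   infers the polymorphic type 'a tree -> int. *)
fun treeSize Leaf = 0
  | treeSize (Node (l, _, r)) = treeSize l + 1 + treeSize r

(* In-order traversal, inferred as 'a tree -> 'a list. *)
fun toList Leaf = []
  | toList (Node (l, x, r)) = toList l @ (x :: toList r)

(* Exceptions: an empty tree has no leftmost element. *)
exception EmptyTree

fun leftmost Leaf = raise EmptyTree
  | leftmost (Node (Leaf, x, _)) = x
  | leftmost (Node (l, _, _)) = leftmost l

(* The same functions work on trees of integers and trees of strings. *)
val numbers = Node (Node (Leaf, 1, Leaf), 2, Node (Leaf, 3, Leaf))
val words   = Node (Leaf, "a", Node (Leaf, "b", Leaf))
val first   = leftmost numbers handle EmptyTree => 0

(* Modules: a signature (interface) and a functor, which is a
   function from structures to structures. *)
signature ORDERED =
sig
  type t
  val less : t * t -> bool
end

functor MakeSort (O : ORDERED) =
struct
  (* Insert x into an already sorted list. *)
  fun insert (x, []) = [x]
    | insert (x, y :: ys) =
        if O.less (x, y) then x :: y :: ys else y :: insert (x, ys)

  (* Insertion sort, written once for any ordered type. *)
  fun sort xs = foldl insert [] xs
end

structure IntSort =
  MakeSort (struct type t = int fun less (a : int, b) = a < b end)

val sorted = IntSort.sort [3, 1, 2]
```

Applying MakeSort to a different structure, for example one that compares strings, would give a string sorter without rewriting insert or sort.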