CS212: Summer 2003
Lecture 2 - From code to running program
-----------------------------------------------------------------------
0) Announcements

-----------------------------------------------------------------------
1) Languages

There are 3 different types of programming languages:

+ MACHINE LANGUAGE: the instructions and coding system that the computer
  will act upon (not usually something in which humans write programs)
  - has a number of fields: op-code field (operation code) and operand 
    fields
  - each field has a pattern of bits
  - the op-code specifies the operation, like STORE, ADD, JUMP, etc.
  - each operand field has info associated with the operation (like an 
    address or data to manipulate)
  - the complete set of instructions for a computer is the INSTRUCTION
    SET
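
To make the idea of fields concrete, here is a minimal Python sketch that
splits one instruction word into its op-code and operand fields.  The
16-bit layout and the op-code table are invented for illustration; every
real instruction set defines its own.

```python
# Hypothetical 16-bit instruction format (not a real machine):
#   bits 15-12: op-code, bits 11-6: operand A, bits 5-0: operand B
OPCODES = {0b0001: "ADD", 0b0010: "STORE", 0b0011: "JUMP"}  # made-up encoding

def decode(word):
    """Split a 16-bit instruction word into (operation, operand_a, operand_b)."""
    opcode = (word >> 12) & 0xF      # top 4 bits
    operand_a = (word >> 6) & 0x3F   # next 6 bits
    operand_b = word & 0x3F          # low 6 bits
    return OPCODES.get(opcode, "UNKNOWN"), operand_a, operand_b

print(decode(0b0001_000011_000101))  # ('ADD', 3, 5)
```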
 
+ ASSEMBLY LANGUAGE
  - Programming directly in binary numbers is tedious and error-prone
  - Symbolic notation for binary machine instructions.  So, to add two 
    numbers, a programmer would write "add A, B" instead of 10001110010
  - An "assembler" converts assembly language into machine code
  - Still only one machine instruction per line, so programs are long
    and slow to write.  That's why we have "high-level programming
    languages", which are easier for a human to read and write, and
    which are more concise.

+ HIGH LEVEL LANGUAGE
  - Though assembly is a great improvement over machine code, writing
    it is still slow: each line holds only one machine instruction
  - High level languages are easier to read and write, look more like 
    English, and are more concise
  - Provide MANY conveniences: loops, variable types (int, float, etc.),
    classes, objects, etc.
  - Also may have large libraries of classes, packages, and other pieces
    of code that can be reused.
  - Can be designed for intended use. Perl for working with text, Fortran
    for scientific computation, Cobol for business data processing
  - Also allow programs to be written independent of the platform type,
    since compilers and assemblers can translate high-level languages
    to the machine language of any computer.
  - Overall: more efficient for the programmer!

Machine languages came first.  Computer scientists then wrote assemblers
and assembly languages.  Assemblers convert assembly code into machine 
code, letting the computer do the work.  Then came high-level languages.  
Compilers were created to compile high-level code (e.g. C/C++) into 
assembly code.  Assemblers took over from there to create machine code.

-----------------------------------------------------------------------
2) Beyond Languages

+ LIBRARIES
  - Reusing programs is much more efficient than rewriting everything
    from scratch.  Thus, programmers could pool together useful, 
    widely-used routines into libraries.
  - The first library was a subroutine library for inputting and 
    outputting data, which included, for example, routines to control 
    printers, hard disk drives, and displays.

+ OPERATING SYSTEMS
  - A set of programs could be run more efficiently if a separate
    program existed to supervise all running programs: when one 
    program finished, it would start the next program in the queue,
    reducing delays.
  - Soon came to include I/O subroutine libraries, and then began
    to manage system resources
  - These programs were the basis for today's operating systems

From "Computer Organization and Design: The Hardware/Software Interface"
by John Hennessy, David Patterson
-----------------------------------------------------------------------
SIDE-NOTE: QUEUE

A queue is a simple data structure that acts in FIFO (First In First 
Out) order.  Compare this to a stack, which acts in LIFO (Last In First
Out) order.

Think of a queue as a set of data objects placed side-by-side.  A good
analogy would be a line in which you wait for something, like tickets
at the movie theater (the British use the word "queue" instead of 
"line").

A queue has the following operations:
ENQUEUE - place an object at the *end* of the queue (which is just like
          getting in the back of a line)
DEQUEUE - take out the object from the *front* of the queue (just like 
          when you get your movie ticket and get out of line)

As you follow a single object in the queue, you can see that the first 
object in is the first one out (FIFO).
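
As a small illustration, the two operations can be sketched in Python
using the standard library's collections.deque.  The names in the line
are made up, and the stack at the end is shown only for contrast.

```python
from collections import deque

queue = deque()          # empty queue
queue.append("Alice")    # ENQUEUE: join the back of the line
queue.append("Bob")
queue.append("Carol")

first = queue.popleft()  # DEQUEUE: leave from the front of the line
second = queue.popleft()
print(first, second)     # Alice Bob (first in, first out)

# A stack, by contrast, removes the most recently added item (LIFO).
stack = ["Alice", "Bob", "Carol"]
print(stack.pop())       # Carol
```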


-----------------------------------------------------------------------
3) How a Program is Run

+ STEPS
  - The following steps are taken to transform a program written in a
    high-level language into an executable file.  Let's pretend the 
    program was written in C.
  1 Start with the C program.  A "Compiler" translates the program
    into an assembly language program
  2 An "Assembler" converts the assembly language into machine code. 
    The machine code is placed into various "object files", which 
    also contain extra information about dependencies on other
    object files.
  3 A "Linker" takes the object files (possibly along with other object
    files from libraries) and links them to produce an executable file.
  4 A "Loader" copies the executable file into memory and starts it
    running.

Read on to find out more about how all of this works...

-----------------------------------------------------------------------
4) Compilers

+ COMPILER
  - A computer program that converts one programming language (the 
    "source language") into another (the "target language").
    (i.e. high-level language to assembly or machine language)
  - Also have "decompilers", which go from low-level to high-level
    languages.
  - The output from a compiler is usually run directly by the machine.
    However, sometimes it is run by a "virtual machine", as in Java.
  - Compilers typically output "object" files.

+ OBJECT FILE
  - Basically contains machine code, with information about the name and
    location of entry points (to internal functions) and external calls 
    (to functions not in the object)
  - A set of object files can be "linked" together to create the final
    executable file, which can be run directly by the user

+ COMPILER DESIGN
  - Compiler functionality divided into several passes.  Most have a 
    'two stage' design.
  - Compiler frontend: translates source language into an internal 
    representation, called a "Parse Tree"
  - Compiler backend: translates parse tree to target language.
  - This design allows designers to tweak/modify a stage, and also to
    exchange the frontend or backend to retarget the source or output
    language, respectively

+ COMPILER FRONTEND PHASES
  - Lexical Analysis: breaking the source code text into small pieces
    ('tokens' or 'terminals').  Each token represents a single atomic 
    unit of the language: a keyword (float, int, for, if), an
    identifier (an object or function name), or a symbol name
  - Syntax Analysis: basically, identifying the order of the tokens and 
    understanding the hierarchical structures in the code.
  - Semantic Analysis: understanding the *meaning* of the program code
    to prepare for output.  Type checking is done here.
  - Intermediate language representation: a parse tree is built here,
    which can be traversed by the backend to form the output
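
Lexical analysis is the easiest phase to try out.  The sketch below
tokenizes a tiny, made-up C-like language with regular expressions; the
token categories and the set of keywords are assumptions chosen to match
the examples above, not the rules of any real compiler.

```python
import re

# Token categories for a tiny, made-up C-like language (an assumption).
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:int|float|for|if)\b"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("NUMBER",     r"\d+(?:\.\d+)?"),
    ("SYMBOL",     r"[=+\-*/;(){}<>]"),
    ("SKIP",       r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Break source text into (kind, text) tokens, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print(tokenize("int x = y + 42;"))
```

Each (kind, text) pair is one token; the syntax analysis phase would
then work on this list rather than on raw characters.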

+ COMPILER BACKEND PHASES
  - Optimization: alter the parse tree to a functionally equivalent but
    *faster* form.  (in other words, rewrite the program so it runs 
    faster)
  - Code generation: traverse the parse tree to translate it into the 
    target language, usually the native machine language of the system.
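
Both backend phases can be sketched on a toy parse tree.  In this
Python sketch the tree is a nested tuple, the only optimization is
constant folding, and the target is an invented stack-machine language;
all three choices are assumptions made to keep the example short, and
only "+" and "*" are supported.

```python
# Parse tree nodes are nested tuples: ("num", value), ("var", name),
# or (operator, left, right).  This representation is an assumption.

def fold_constants(node):
    """Optimization: replace constant subexpressions with their value."""
    if node[0] in ("num", "var"):
        return node
    op, left, right = node
    left, right = fold_constants(left), fold_constants(right)
    if left[0] == "num" and right[0] == "num":
        value = left[1] + right[1] if op == "+" else left[1] * right[1]
        return ("num", value)
    return (op, left, right)

def generate(node):
    """Code generation: post-order traversal emitting stack-machine code."""
    if node[0] == "num":
        return [f"PUSH {node[1]}"]
    if node[0] == "var":
        return [f"LOAD {node[1]}"]
    op, left, right = node
    return generate(left) + generate(right) + ["ADD" if op == "+" else "MUL"]

tree = ("+", ("var", "x"), ("*", ("num", 2), ("num", 3)))   # x + 2 * 3
print(generate(tree))                  # 5 instructions
print(generate(fold_constants(tree)))  # 3 instructions
```

The optimized tree computes the same result, but the multiplication is
done once by the compiler instead of every time the program runs.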
 
From: http://www.wikipedia.org/wiki/Compiler
-----------------------------------------------------------------------
5) Assemblers, Linkers, Loaders

+ ASSEMBLER
  - translates assembly language into object code (see note above for
    explanation of object file).
  - far simpler to write than compilers, since each assembly language
    opcode corresponds to a specific machine code operation.
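
Because of that one-to-one correspondence, the core of an assembler can
be little more than a table lookup.  The Python sketch below assumes a
made-up instruction set with a 4-bit op-code and an 8-bit numeric
operand; real assemblers also handle labels, symbols, and object file
output.

```python
# Made-up instruction set: mnemonic -> 4-bit op-code (not a real machine).
OPCODES = {"LOAD": 0b0001, "ADD": 0b0010, "STORE": 0b0011}

def assemble(lines):
    """Translate 'MNEMONIC operand' lines into 12-bit machine words.

    Word layout (assumed): 4-bit op-code followed by an 8-bit operand.
    """
    words = []
    for line in lines:
        mnemonic, operand = line.split()
        words.append((OPCODES[mnemonic] << 8) | (int(operand) & 0xFF))
    return words

program = ["LOAD 10", "ADD 11", "STORE 12"]
for word in assemble(program):
    print(f"{word:012b}")
```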

+ LINKER
  - A program that takes one or more objects from a compiler (or 
    assembler), and assembles them into one executable program.
  - Uses the machine code and additional info in the object file 
    to create the executable
  - The additional info describes functions or variables that are
    present in the object and can be used by other objects.  It also
    describes functions or variables that are referenced by the 
    object, but not internally defined (i.e. they exist in a different
    object file)
  - The main job of the linker is to resolve the references to 
    undefined symbols by finding out which other object defines a
    symbol in question
  - Can also take objects from a library of objects
  - Some operating systems allow for "Dynamic Linking", in which some
    symbols are resolved when the program is run (at "run time").  This
    allows us to keep only one copy of often-used libraries.  This 
    saves space, and allows multiple programs to benefit if the 
    library is updated/improved.  An example of such a library is any
    .dll file in your system (DLL = Dynamic-Link Library)
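
The symbol resolution job can be sketched in a few lines of Python.
Here each "object file" is just a dictionary listing the symbols it
defines and the symbols it needs; the file names, symbol names, and
dictionary layout are all hypothetical, and a real linker also has to
relocate addresses and merge the machine code.

```python
# Each "object file" lists the symbols it defines and the symbols it
# references but does not define.  The dict layout is an assumption.
objects = {
    "main.o": {"defines": {"main"}, "needs": {"print_sum", "add"}},
    "math.o": {"defines": {"add"}, "needs": set()},
    "io.o":   {"defines": {"print_sum"}, "needs": {"add"}},
}

def resolve(objects):
    """Map each referenced symbol to the object file that defines it."""
    defined_in = {}
    for name, obj in objects.items():
        for symbol in obj["defines"]:
            if symbol in defined_in:
                raise ValueError(f"duplicate definition of {symbol}")
            defined_in[symbol] = name
    resolved = {}
    for name, obj in objects.items():
        for symbol in obj["needs"]:
            if symbol not in defined_in:
                raise ValueError(f"undefined symbol {symbol} in {name}")
            resolved[(name, symbol)] = defined_in[symbol]
    return resolved

table = resolve(objects)
print(table[("main.o", "add")])  # math.o
```

An "undefined symbol" error from this sketch corresponds to the kind of
error a linker reports when no object or library defines a function
that the program calls.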

+ LOADER
  - Responsible for running an executable file.  The steps are:
  1 Reads the executable to determine size of various components
  2 Creates an address space in memory that is large enough for the 
    program to run
  3 Copies the instructions and data from the executable into memory
  4 Copies the parameters to the main program onto the stack (remember
    the Stack Machine concept we talked about?)
  5 Initializes machine registers, sets stack pointer to first free
    location
  6 Jumps to a start-up routine that copies the parameters into the
    argument registers and calls the main routine of the program

Taken from: 
http://www.wikipedia.org/wiki/Assembler
http://www.wikipedia.org/wiki/Linker
"Computer Organization and Design: The Hardware/Software Interface"
by John Hennessy, David Patterson, p 163

-----------------------------------------------------------------------
6) Compiled vs. Interpreted Languages

+ COMPILED LANGUAGES
  - like we talked about: a compiled language is compiled into an 
    executable file, which is run directly by the machine.
  - The source language (usually a high-level language) must be compiled
    for a specific system type, such as Windows or Unix
  - These executable files run fast :)

+ INTERPRETED LANGUAGES
  - A program is written in the source language.  It may be compiled
    into some sort of bytecode, or intermediate language (Java does this)
  - An "interpreter" then reads and runs the program.
  - Slower to run a program under an interpreter than to run a compiled
    executable file.
  - Perl is an interpreted language

Note that the difference between a compiled and interpreted language
is often vague.  Any language can be compiled or interpreted.  Also, 
some languages (like Java) are both compiled and interpreted.
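
Python is another language that does both, and its standard library
lets us watch the two steps separately.  The sketch below uses the
built-in compile() and eval() functions and the dis module; the
expression and the variable names price and quantity are made up for
the example.

```python
import dis

source = "price * quantity"

# Step 1: compile the source text into a bytecode object.
code = compile(source, "<example>", "eval")

# Step 2: the interpreter (CPython's virtual machine) runs the bytecode.
result = eval(code, {"price": 3, "quantity": 4})
print(result)   # 12

# Show the bytecode instructions; the exact listing varies by Python version.
dis.dis(code)
```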

http://www.wikipedia.org/wiki/Interpreted_language

-----------------------------------------------------------------------
7) How is Java different?

Java first compiles the Java program code into Java byte code.  The 
Java Virtual Machine (JVM) then runs the program by interpreting it.

Why does Java both compile and interpret?  Portability.  The compiled 
Java program does not need to be compiled for any specific system;
only the JVM must be tailored to specific system types.  As long as a
system has the proper JVM, it can run the same Java program.  Thus, a
program can be delivered to and run on multiple platforms, and the
developers never have to worry about what system their program will 
run on.
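
To see why bytecode is portable, consider a toy virtual machine.  The
Python sketch below is not the JVM and its three instructions are
invented; it only illustrates that the bytecode program says nothing
about the hardware, so any platform with an interpreter like run() can
execute it unchanged.

```python
def run(bytecode):
    """Interpret a list of (instruction, argument) pairs on a stack machine."""
    stack = []
    for instruction, argument in bytecode:
        if instruction == "PUSH":
            stack.append(argument)
        elif instruction == "ADD":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif instruction == "MUL":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        else:
            raise ValueError(f"unknown instruction {instruction}")
    return stack.pop()

# Bytecode for (2 + 3) * 4; it never mentions the underlying hardware.
program = [("PUSH", 2), ("PUSH", 3), ("ADD", None), ("PUSH", 4), ("MUL", None)]
print(run(program))  # 20
```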

Why do we have to compile first?  Why not just interpret?
  - Probably for speed.  Work such as parsing and type checking is done
    once, at compile time, so it does not have to be repeated every
    time the program runs.
  - I'm guessing here :)

It is slower than a purely compiled language though :(
"Java, being an interpreted system, is currently an order of magnitude
slower than C." (Just Java, 302)