CS212: Summer 2003
Lecture 2 - From code to running program

-----------------------------------------------------------------------
0) Announcements

-----------------------------------------------------------------------
1) Languages

There are 3 different types of programming languages:

+ MACHINE LANGUAGE: the instructions and coding system that the computer
  will act upon (not usually something in which humans write programs)
  - each instruction has a number of fields: an op-code field (operation
    code) and operand fields
  - each field has a pattern of bits
  - the op-code names the operation, like STORE, ADD, JUMP, etc.
  - each operand field has info associated with the operation (like an
    address or data to manipulate)
  - the complete set of instructions for a computer is the INSTRUCTION SET

+ ASSEMBLY LANGUAGE
  - Programming in binary numbers is tedious (and inefficient!)
  - Symbolic notation for binary machine instructions. So, to add two
    numbers, a programmer would write "add A, B" instead of 10001110010
  - An "assembler" converts assembly language into machine code
  - Only one instruction per line, however... still inefficient. That's
    why we have "high-level programming languages", which are easier for
    a human to read and write, and which are more concise.

+ HIGH LEVEL LANGUAGE
  - Though assembly is a great improvement over machine code, it is still
    inefficient: only one instruction per line
  - High level languages are easier to read and write, look more like
    English, and are more concise
  - Provide MANY conveniences: loops, variable types (int, float, etc.),
    classes, objects, etc.
  - Also may have large libraries of classes, packages, and other pieces
    of code that can be reused.
  - Can be designed for an intended use: Perl for working with text,
    Fortran for scientific computation, COBOL for business data
    processing
  - Also allow programs to be written independent of the platform type,
    since compilers and assemblers can translate high-level languages to
    the machine language of any computer.
  - Overall: more efficient!

Machine languages came first. Computer scientists then wrote assemblers
and assembly languages. Assemblers convert assembly code into machine
code, letting the computer do the work. Then came high-level languages.
Compilers were created to compile high-level code (e.g. C/C++) into
assembly code. Assemblers took over from there to create machine code.

-----------------------------------------------------------------------
2) Beyond Languages

+ LIBRARIES
  - Reusing programs is much more efficient than rewriting everything
    from scratch. Thus, programmers could pool together useful,
    widely-used routines into libraries.
  - The first library was a subroutine library for inputting and
    outputting data, which included, for example, routines to control
    printers, hard disk drives, and displays.

+ OPERATING SYSTEMS
  - A set of programs could be run more efficiently if a separate program
    existed to supervise all running programs: when one program finished,
    it would start the next program in the queue, reducing delays.
  - These supervising programs soon came to include I/O subroutine
    libraries, and then began to manage system resources
  - These programs were the basis for today's operating systems

From "Computer Organization and Design: The Hardware/Software Interface"
by John Hennessy, David Patterson

-----------------------------------------------------------------------
SIDE-NOTE: QUEUE

A queue is a simple data structure that acts in FIFO (First In First Out)
order. Compare this to a stack, which acts in LIFO (Last In First Out)
order. Think of a queue as a set of data objects placed side-by-side. A
good analogy would be a line in which you wait for something, like
tickets at the movie theater (the British use the word "queue" instead of
"line").
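
The FIFO/LIFO contrast above can be sketched in a few lines of Python.
This is only an illustration, not part of the lecture: the function names
fifo_order and lifo_order are made up for this example, and the queue is
modelled with the standard library's collections.deque.

```python
from collections import deque


def fifo_order(items):
    """Put every item into a queue, then take them all out of the front."""
    queue = deque()
    for item in items:
        queue.append(item)              # join the back of the line
    served = []
    while queue:
        served.append(queue.popleft())  # leave from the front of the line
    return served


def lifo_order(items):
    """Put every item onto a stack, then take them all off the top."""
    stack = []
    for item in items:
        stack.append(item)              # place on top of the stack
    removed = []
    while stack:
        removed.append(stack.pop())     # remove from the top of the stack
    return removed


if __name__ == "__main__":
    people = ["Ann", "Bob", "Cy"]
    print(fifo_order(people))  # same order they arrived in
    print(lifo_order(people))  # reverse of the order they arrived in
```

The queue hands items back in arrival order, while the stack hands them
back in reverse.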

A queue has the following operations:

  ENQUEUE - place an object at the *end* of the queue (which is just like
            getting in the back of a line)
  DEQUEUE - take out the object from the *front* of the queue (just like
            when you get your movie ticket and get out of line)

As you follow a single object in the queue, you can see that the first
object in is the first one out (FIFO).

-----------------------------------------------------------------------
3) How a Program is Run

+ STEPS
  - The following steps are taken to transform a program written in a
    high-level language into an executable file. Let's pretend the
    program was written in C.

    1 Start with the C program. A "Compiler" translates the program into
      an assembly language program.
    2 An "Assembler" converts the assembly language into machine code.
      The machine code is placed into various "object files", which also
      contain extra information about dependencies on other object files.
    3 A "Linker" takes the object files (possibly along with other object
      files from libraries) and links them to produce an executable file.
    4 A "Loader" takes the executable file and runs it.

Read on to find out more about how all of this works...

-----------------------------------------------------------------------
4) Compilers

+ COMPILER
  - A computer program that converts one programming language (the
    "source language") into another (the "target language"), e.g. a
    high-level language into assembly or machine language.
  - There are also "decompilers", which go from low-level to high-level
    languages.
  - The output from a compiler is usually run directly by the machine.
    However, sometimes it is run by a "virtual machine", as in Java.
  - Compilers typically output "object" files.
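
To make "source language in, target language out" concrete, here is a
minimal sketch of a toy compiler. Everything in it is an assumption made
for illustration: the source language is just integer arithmetic with +,
- and *, the target is a made-up stack-machine assembly (PUSH, ADD, SUB,
MUL), and the parsing is borrowed from Python's standard ast module
rather than written by hand. A real C compiler is far more involved.

```python
import ast

# Hypothetical stack-machine mnemonics for the operators we support.
OPCODES = {ast.Add: "ADD", ast.Sub: "SUB", ast.Mult: "MUL"}


def compile_expr(source):
    """Translate an arithmetic expression into toy stack-machine assembly."""
    tree = ast.parse(source, mode="eval")  # parse the source text
    output = []
    _emit(tree.body, output)
    return output


def _emit(node, output):
    """Walk the tree, emitting operands before the operator that uses them."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        output.append(f"PUSH {node.value}")
    elif isinstance(node, ast.BinOp) and type(node.op) in OPCODES:
        _emit(node.left, output)
        _emit(node.right, output)
        output.append(OPCODES[type(node.op)])
    else:
        raise ValueError("unsupported construct in source program")


if __name__ == "__main__":
    for line in compile_expr("1 + 2 * 3"):
        print(line)
```

For the input "1 + 2 * 3" the sketch emits PUSH 1, PUSH 2, PUSH 3, MUL,
ADD: the multiplication comes first because it binds tighter in the
source expression.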

+ OBJECT FILE
  - Basically contains machine code, with information about the name and
    location of entry points (to internal functions) and external calls
    (to functions not in the object)
  - A set of object files can be "linked" together to create the final
    executable file, which can be run directly by the user

+ COMPILER DESIGN
  - Compiler functionality is divided into several passes. Most have a
    'two stage' design.
  - Compiler frontend: translates the source language into an internal
    representation, called a "Parse Tree"
  - Compiler backend: translates the parse tree to the target language.
  - This design allows designers to tweak/modify a stage, and also to
    exchange the frontend or backend to retarget the source or output
    language, respectively

+ COMPILER FRONTEND PHASES
  - Lexical Analysis: breaking the source code text into small pieces
    ('tokens' or 'terminals'). Each token represents a single atomic unit
    of the language, such as a keyword (float, int, for, if), an
    identifier (an object or function name), or a symbol
  - Syntax Analysis: basically, identifying the order of the tokens and
    understanding the hierarchical structures in the code.
  - Semantic Analysis: understanding the *meaning* of the program code to
    prepare for output. Type checking is done here
  - Intermediate language representation: a parse tree is built here,
    which can be traversed by the backend to form the output

+ COMPILER BACKEND PHASES
  - Optimization: alter the parse tree to a functionally equivalent but
    *faster* form. (in other words, rewrite the program so it runs
    faster)
  - Code generation: traverse the parse tree to translate it into the
    target language, usually the native machine language of the system.

From: http://www.wikipedia.org/wiki/Compiler

-----------------------------------------------------------------------
5) Assemblers, Linkers, Loaders

+ ASSEMBLER
  - translates assembly language into object code (see note above for
    explanation of object file).
  - far simpler to write than compilers, since each assembly language
    opcode corresponds to a specific machine code operation.

+ LINKER
  - A program that takes one or more objects from a compiler (or
    assembler), and combines them into one executable program.
  - Uses the machine code and additional info in the object file to
    create the executable
  - The additional info describes functions or variables that are present
    in the object and can be used by other objects. It also describes
    functions or variables that are referenced by the object, but not
    internally defined (i.e. they exist in a different object file)
  - The main job of the linker is to resolve the references to undefined
    symbols by finding out which other object defines the symbol in
    question
  - Can also take objects from a library of objects
  - Some operating systems allow for "Dynamic Linking", in which some
    symbols are resolved when the program is run (at "run time"). This
    allows us to keep only one copy of often-used libraries. This saves
    space, and allows multiple programs to benefit if the library is
    updated/improved. An example of such a library is any .dll file on a
    Windows system (DLL = Dynamic-Link Library)

+ LOADER
  - Responsible for running an executable file. The steps are:

    1 Reads the executable to determine the size of its various
      components
    2 Creates an address space in memory that is large enough for the
      program to run
    3 Copies the instructions and data from the executable into memory
    4 Copies the parameters to the main program onto the stack (remember
      the Stack Machine concept we talked about?)
    5 Initializes machine registers, sets stack pointer to first free
      location
    6 Copies parameters of the main function into the argument registers
      and calls the main routine of the program

Taken from:
http://www.wikipedia.org/wiki/Assembler
http://www.wikipedia.org/wiki/Linker
"Computer Organization and Design: The Hardware/Software Interface"
by John Hennessy, David Patterson, p 163

-----------------------------------------------------------------------
6) Compiled vs. Interpreted Languages

+ COMPILED LANGUAGES
  - like we talked about: a compiled language is compiled into an
    executable file, which is run directly by the machine.
  - The source language (usually a high-level language) must be compiled
    for a specific system type, such as Windows or Unix
  - These executable files run fast :)

+ INTERPRETED LANGUAGES
  - A program is written in the source language. It may be compiled into
    some sort of bytecode, or intermediate language (Java does this)
  - An "interpreter" then reads and runs the program.
  - Slower to run a program under an interpreter than to run a compiled
    executable file.
  - Perl is an interpreted language

Note that the difference between a compiled and an interpreted language
is often vague. Any language can be compiled or interpreted. Also, some
languages (like Java) are both compiled and interpreted.

http://www.wikipedia.org/wiki/Interpreted_language

-----------------------------------------------------------------------
7) How is Java different?

Java first compiles the Java program code into Java byte code. The Java
Virtual Machine (JVM) then runs the program by interpreting it.

Why does Java both compile and interpret? Portability. The compiled Java
program does not need to be compiled for any specific system; only the
JVM must be tailored to specific system types. As long as a system has
the proper JVM, it can run the same Java program.
Thus, a program can be delivered to and run on multiple platforms, and
the developers never have to worry about what system their program will
run on.

Why do we have to compile first? Why not just interpret?
  - Probably for speed. Necessary work may be done during compilation.
  - I'm guessing here :)

It is slower than a purely compiled language though :(

"Java, being an interpreted system, is currently an order of magnitude
slower than C." (Just Java, 302)
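
The "compile once, interpret anywhere" idea can be sketched with a toy
virtual machine. This is an illustration only, not how the real JVM
works: the instruction names (PUSH, ADD, MUL), the tuple encoding, and
the run_bytecode function are all made up for this example. The point is
that PROGRAM is plain data, so it can run on any system that has an
interpreter for it.

```python
def run_bytecode(program):
    """Interpret a list of toy stack-machine instructions, one at a time."""
    stack = []
    for instruction in program:
        op, *args = instruction
        if op == "PUSH":
            stack.append(args[0])        # push a constant onto the stack
        elif op == "ADD":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == "MUL":
            right = stack.pop()
            left = stack.pop()
            stack.append(left * right)
        else:
            raise ValueError(f"unknown instruction: {op}")
    return stack.pop()                   # the result is left on top


# Hypothetical "byte code" for the expression 1 + 2 * 3.
PROGRAM = [("PUSH", 1), ("PUSH", 2), ("PUSH", 3), ("MUL",), ("ADD",)]

if __name__ == "__main__":
    print(run_bytecode(PROGRAM))
```

The interpreter decodes every instruction each time it runs, which hints
at why interpreting is slower than running native machine code directly.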