The Function and Purpose of Translators

4.2.1 Interpreters and Compilers

When electronic computers were first used, programs had to be written in machine code. This code consisted of simple instructions, each of which was represented by a binary pattern in the computer. To produce these programs, programmers had to write the instructions in binary. This not only took a long time, it was also prone to errors. To improve program writing, assembly languages were developed. Assembly languages allowed the use of mnemonics and names for locations in memory. Each assembly instruction mapped to a single machine instruction, which meant that it was fairly easy to translate a program written in assembly language into machine code. To speed up this translation process, assemblers were written which could be loaded into the computer so that the computer itself could translate the assembly language into machine code. Writing programs in assembly language, although easier than using machine code, was still tedious and took a long time.

After assembly languages came high-level languages, which used the type of language familiar to the person writing the program. Thus FORTRAN (FORmula TRANslation) was developed for science and engineering programs, and it used formulae in the same way as scientists and engineers do. COBOL (COmmon Business Oriented Language) was developed for business applications. Programs written in these languages needed to be translated into machine code, and this led to the birth of compilers.

A compiler takes a program written in a high-level language and translates it into an equivalent program in machine code. Once this is done, the machine code version can be loaded into the machine and run without any further help, as it is complete in itself. The high-level language version of the program is usually called the source code and the resulting machine code program is called the object code. The relationship between them is shown in Fig. 4.2.1.

Fig. 4.2.1

The problem with using a compiler is that it uses a lot of computer resources. It has to be loaded into the computer's memory at the same time as the source code, and there has to be sufficient memory to hold the object code as well as working storage while the translation takes place. Another disadvantage is that when an error occurs in a program, it is difficult to pinpoint its source in the original program.

An alternative system is interpretation. In this system each instruction is taken in turn, translated to machine code and executed before the next instruction is translated. This system was developed because early personal computers lacked the power and memory needed for compilation. It also has the advantage of producing error messages as soon as an error is encountered, which means that the instruction causing the problem can be easily identified. Against interpretation is the fact that execution of a program is slow compared with that of a compiled program, because the original program has to be translated every time it is executed; moreover, instructions inside a loop have to be translated each time the loop is entered. However, interpretation is very useful during program development, as errors can be found and corrected as soon as they are encountered. In fact many languages, such as Visual Basic, use both an interpreter and a compiler. This enables the programmer to use the interpreter during program development and, when the program is fully working, to translate it with the compiler into machine code. This machine code version can then be distributed to users who do not have access to the original code.
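To make the distinction concrete, here is a minimal sketch in Python. Everything in it is invented for the illustration: the three-instruction toy language, the accumulator, and the use of Python closures to stand in for machine code. It shows only the shape of the two techniques, not how any real translator works.

    PROGRAM = ["LOAD 5", "ADD 3", "PRINT"]   # a toy source program

    def translate(line):
        """Translate one toy instruction into a closure (our stand-in
        for machine code)."""
        op, *arg = line.split()
        if op == "LOAD":
            return lambda acc: int(arg[0])
        if op == "ADD":
            return lambda acc: acc + int(arg[0])
        if op == "PRINT":
            return lambda acc: (print(acc), acc)[1]
        raise SyntaxError("unknown instruction: " + line)

    def interpret(program):
        # Translate each instruction and execute it before moving on;
        # a loop in the source would be re-translated on every pass.
        acc = 0
        for line in program:
            acc = translate(line)(acc)

    def compile_and_run(program):
        # Translate the whole program first (the 'object code'); the
        # result could be saved and run repeatedly without retranslation.
        object_code = [translate(line) for line in program]
        acc = 0
        for instruction in object_code:
            acc = instruction(acc)

    interpret(PROGRAM)         # prints 8
    compile_and_run(PROGRAM)   # prints 8

Note that in the interpreted version an invalid instruction part-way through the program is only reported when execution reaches it, whereas compile_and_run reports it before anything is executed at all, which mirrors the diagnostic behaviour described above.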
Whether a compiler or an interpreter is used, the translation from a high-level language to machine code has to go through various stages; these are shown in Fig. 4.2.2.

Fig. 4.2.2: lexical analysis, syntax analysis, semantic analysis, code generation and code optimisation

4.2.2 Lexical Analysis

The lexical analyser takes the source program as input and creates a stream of tokens for the syntax analyser. Tokens are normally 16-bit unsigned integers. Each group of characters is replaced by a token. Single characters which have a meaning in their own right are replaced by their ASCII values; multiple-character tokens are represented by integers greater than 255. Variable names need extra information stored about them. This is done by means of a symbol table, which is used throughout compilation to build up information about the names used in the program. During lexical analysis only the variable's name is noted; it is during syntax and semantic analysis that details such as the variable's type and scope are added. The lexical analyser may also output some error messages and diagnostics; for example, it will report errors such as an illegal identifier or an identifier that is too long.

At various stages during compilation it is necessary to look up details about the names in the symbol table. This must be done efficiently, so a linear search is not sufficiently fast. In fact, it is usual to use a hash table and to calculate the position of a name by hashing the name itself. When two names hash to the same address, a linked list can be used to avoid the symbol table filling up.

The lexical analyser also removes redundant characters such as white space (spaces, tabs, etc.) and comments. Often lexical analysis takes longer than the other stages of compilation. This is because it has to handle the original source code, which can have many formats. For example, the following two pieces of code are equivalent although their formats are considerably different.

    IF X = Y THEN 'square X
      Z := X * X
    ELSE 'square Y
      Z := Y * Y
    ENDIF
    PRINT Z

    IF X = Y THEN Z := X * X
    ELSE Z := Y * Y
    PRINT Z

When the lexical analyser has completed its task, the code is in a standard format, which means that the syntax analyser can always expect the format of its input to be the same.

Consider the instructions

    X := 54
    RETURN X

The lexical analyser turns these into paired tokens: the first part describes the token and the second part, where necessary, defines what the token represents. Thus the above input is turned into

    Variable            X
    Assignment symbol
    Integer             54
    Return symbol
    Variable            X
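The following sketch, in Python, shows the flavour of this process. The token codes, the pattern used to split the source into lexemes, and the function names are all invented for the example. Note also that the Python dictionary standing in for the symbol table is itself a hash table, so names are located by hashing the name, although its collision handling is the library's own rather than the linked-list scheme described above.

    import re

    KEYWORDS = {"RETURN": 256}   # multi-character tokens get codes > 255
    ASSIGN = 257
    symbol_table = {}            # a dict: effectively a hash table keyed by name

    def lex(source):
        tokens = []
        # the pattern skips white space, so redundant characters vanish here
        for lexeme in re.findall(r":=|\w+|\S", source):
            if lexeme == ":=":
                tokens.append(("Assignment symbol", ASSIGN))
            elif lexeme in KEYWORDS:
                tokens.append(("Return symbol", KEYWORDS[lexeme]))
            elif lexeme.isdigit():
                tokens.append(("Integer", int(lexeme)))
            elif lexeme.isidentifier():
                # only the name is noted now; type and scope are added
                # later, during syntax and semantic analysis
                symbol_table.setdefault(lexeme, {"type": None, "scope": None})
                tokens.append(("Variable", lexeme))
            else:
                print("error: illegal identifier", lexeme)
        return tokens

    print(lex("X := 54 RETURN X"))
    # [('Variable', 'X'), ('Assignment symbol', 257), ('Integer', 54),
    #  ('Return symbol', 256), ('Variable', 'X')]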
4.2.3 Syntax Analysis

This section should be read in conjunction with Section 4.5.11, which discusses Backus-Naur Form (BNF) and syntax diagrams. During this stage of compilation the output of the lexical analyser is parsed (broken into small units) to check that it is grammatically correct. All languages have rules of grammar, and computer languages are no exception. The grammar of programming languages is defined by means of BNF notation or syntax diagrams, and it is against these rules that the code has to be checked.

For example, taking a very elementary language, an assignment statement may be defined to be of the form

    <variable> <assignment_operator> <expression>

where an expression is

    <variable> <arithmetic_operator> <variable>

and the parser must take the output from the lexical analyser and check that it is of this form. If the statement is

    sum := sum + number

the parser will receive

    <variable> <assignment_operator> <variable> <arithmetic_operator> <variable>

which becomes

    <variable> <assignment_operator> <expression>

and then

    <assignment statement>

which is valid. If the original statement is

    sum := sum + + number

this will be input as

    <variable> <assignment_operator> <variable> <arithmetic_operator> <arithmetic_operator> <variable>

which does not represent a valid statement, so an error message will be returned. It is at this stage that invalid names such as PRIT instead of PRINT can be found, as PRIT will be read as a variable name instead of a reserved word. This means that the statement containing PRIT will not parse to a valid statement. Note that in languages that require variables to be declared before being used, the lexical analyser may pick up this error, because PRIT has not been declared as a variable and so is not in the symbol table. Most compilers will report errors found during syntax analysis as soon as they are found and will attempt to show where the error has occurred. However, they may not be very accurate in their conclusions, nor may the error message be very clear.

During syntax analysis certain semantic checks are carried out. These include label checks, flow-of-control checks and declaration checks. Some languages allow GOTO statements (not recommended by the authors), which allow control to be passed, unconditionally, to another statement that has a label. The GOTO statement specifies the label to which control must pass, and the compiler must check that such a label exists. Certain control constructs can only be placed in certain parts of a program. For example, in C (and C++) the CONTINUE statement can only be placed inside a loop, and the BREAK statement can only be placed inside a loop or a SWITCH statement. The compiler must ensure that statements like these are used in the correct place. Many languages insist on the programmer declaring variables and their types; it is at this stage that the compiler verifies that all variables have been properly declared and that they are used correctly.
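As an illustration, the checking of this elementary grammar can be sketched in Python. The token names and the reduction-based approach are invented for the example; a real syntax analyser is driven by the full BNF of the language, not by a single hard-coded rule.

    def reduce_expression(kinds):
        """Replace <variable> <arithmetic_operator> <variable>
        with <expression>, as in the worked example above."""
        pattern = ["variable", "arithmetic_operator", "variable"]
        for i in range(len(kinds) - 2):
            if kinds[i:i + 3] == pattern:
                return kinds[:i] + ["expression"] + kinds[i + 3:]
        return kinds

    def parse_assignment(kinds):
        kinds = reduce_expression(kinds)
        if kinds == ["variable", "assignment_operator", "expression"]:
            return "valid <assignment statement>"
        return "syntax error"

    # sum := sum + number
    print(parse_assignment(["variable", "assignment_operator", "variable",
                            "arithmetic_operator", "variable"]))
    # valid <assignment statement>

    # sum := sum + + number
    print(parse_assignment(["variable", "assignment_operator", "variable",
                            "arithmetic_operator", "arithmetic_operator",
                            "variable"]))
    # syntax error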
4.2.4 Code Generation

It is at this stage, when all the errors due to incorrect use of the language have been removed, that the program is translated into code suitable for use by the computer's processor. During lexical and syntax analysis a symbol table has been built up which includes each variable's name, its type and the block in which it is valid. The address of each variable is now calculated and stored in the symbol table; this is done as soon as the variable is encountered during code generation. Before the final code can be generated, an intermediate code is produced. This intermediate code can then be interpreted or translated into machine code; in the latter case the code can be saved and distributed as an executable program. Two methods can be used to represent the high-level language program: one uses a tree structure and the other three-address code (TAC).

TAC allows no more than three operands, and instructions take the form

    Operand1 := Operand2 Operator Operand3

For example, using three-address code, the assignment statement

    A := (B + C) * (D - E) / F

can be evaluated in the following order, where Ri represents an intermediate result.

    R1 := B + C
    R2 := D - E
    R3 := R1 * R2
    A := R3 / F

This can also be represented by the tree shown in Fig. 4.2.4.1.

Fig. 4.2.4.1

Other statements can be represented in similar ways, and the final stage of compilation can then take place. At this point the compiler has to consider the type of code that is required: code can be optimised for speed of execution or for size of program, and compilers often try to compromise between the two. An example of code optimisation is shown in Fig. 4.2.4.2, where the code on the left has been changed to that on the right so that r1 * b is only evaluated once.

    a := 5 + 3        a := 5 + 3
    b := 5 * 3        b := 5 * 3
    r1 := a + b       r1 := a + b
    r2 := r1 * b      r2 := r1 * b
    r3 := r2 / a      r3 := r2 / a
    r4 := r1 * b      r4 := r2
    r5 := r4 + 6      r5 := r4 + 6
    c := r3 - r5      c := r3 - r5

Fig. 4.2.4.2

However, care must be taken when doing this, as it does not work in the case shown in Fig. 4.2.4.3: the value of b changes between the two evaluations of r1 * b, so the code on the right is not equivalent to that on the left.

    a := 5 + 3        a := 5 + 3
    b := 5 * 3        b := 5 * 3
    r1 := a + b       r1 := a + b
    r2 := r1 * b      r2 := r1 * b
    r3 := r2 / a      r3 := r2 / a
    b := b + 2        b := b + 2
    r4 := r1 * b      r4 := r2
    r5 := r4 + 6      r5 := r4 + 6
    c := r3 - r5      c := r3 - r5

Fig. 4.2.4.3

In the first example the revised code contains the copy statement r4 := r2. This means that r4 and r2 represent the same value, and hence one of them can be removed. This leads to the code in Fig. 4.2.4.4.

    a := 5 + 3
    b := 5 * 3
    r1 := a + b
    r2 := r1 * b
    r3 := r2 / a
    r5 := r2 + 6
    c := r3 - r5

Fig. 4.2.4.4

Another way of optimising code is to use algebraic transformations. For example, the statement

    a := b * 1

can be replaced by the statement

    a := b

and

    a := b + 0

by

    a := b

There are many other ways of optimising code, but they are beyond what is expected at this level.
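Returning to the three-address code at the start of this section: the evaluation order shown for A := (B + C) * (D - E) / F can be produced mechanically by walking the expression tree of Fig. 4.2.4.1 from the leaves upwards. The following Python sketch does this; the tuple representation of the tree and the function names are invented for the illustration.

    counter = 0

    def place(node, code):
        """Return the name holding node's value, emitting TAC for subtrees."""
        global counter
        if isinstance(node, str):        # a leaf: a variable name
            return node
        op, left, right = node           # an interior node: (operator, l, r)
        l, r = place(left, code), place(right, code)
        counter += 1
        code.append("R%d := %s %s %s" % (counter, l, op, r))
        return "R%d" % counter

    def assign(target, node):
        """Compile target := node, writing the final result straight
        to target rather than to one more intermediate Ri."""
        code = []
        op, left, right = node
        l, r = place(left, code), place(right, code)
        code.append("%s := %s %s %s" % (target, l, op, r))
        return code

    tree = ("/", ("*", ("+", "B", "C"), ("-", "D", "E")), "F")
    print("\n".join(assign("A", tree)))
    # R1 := B + C
    # R2 := D - E
    # R3 := R1 * R2
    # A := R3 / F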
4.2.5 Linkers and Loaders

Programs are usually built up in modules, which are compiled into machine code that has to be loaded into the computer's memory. This is done by the loader, which decides where in memory to place the code and then adjusts memory addresses as described in Chapter 4.1. As the whole program may consist of many modules, all of which have been separately compiled, the modules have to be correctly linked once they have been loaded. This is the job of the linker, which calculates the addresses of the separate pieces that make up the whole program and then links them so that all the modules can interact with one another. The idea of using modules that can be used in many programs was explained in Section 1.3.2. This method of creating programs is important, as it reduces the need to keep rewriting code; it will be discussed further under object-oriented programming in Section 4.5.6.

4.2.6 Example Questions

1. Explain why the size of the memory available is particularly relevant to the process of compilation. (4)

A. The computer must be able to hold in memory simultaneously:
- The compiler software / without which the process cannot be carried out.
- The source code / the code that needs to be compiled.
- The object code / because the compilation process produces the program in machine code form.
- Working space / a processing area to be used by the compiler during the process. (4)

Notes: This question is trying to make the candidate think about the implications of creating a machine code version of the high-level language program. The size of the memory is not as significant as it used to be, because the memory in a microcomputer is now generally large enough for the problem not to arise, but in the past the compilation of programs could cause trouble on a micro, which led to interpretation becoming the standard translation technique.

2. a) Explain the difference between the two translation techniques of interpretation and compilation. (2)
b) Give one advantage of the use of each of the two translation techniques. (2)

A. a)
- Interpretation involves the translation of an instruction and the execution of that instruction before the translation of the next.
- The compilation of a program involves creating the machine code version of the whole program before allowing any of it to be run. (2)

b)
- Interpretation provides the programmer with better error diagnostics, because the source code is always present and can be used as a reference whenever an error occurs.
- When a program is compiled, no further translation is necessary no matter how many times the program is run; consequently there is nothing to slow down the execution of the program. (2)

Notes: The question is probably not in its best form, as there are many responses that could justifiably be given for the difference between the two processes. A perfectly acceptable response here would be that interpretation does not create object code while compilation does.

3. State the three stages of compilation and describe, briefly, the purpose of each. (6)

A.
- Lexical analysis: puts each statement into the form best suited to the syntax analyser.
- Syntax analysis: language statements are checked against the rules of the language.
- Code generation: the machine code (object code) is produced. (6)

Notes: The number of marks for the question plays a big part in this answer. There are only six marks, three of which must be for stating the three stages. This means that there is only one mark each for saying what the purpose of each stage is. Do not launch into a long essay; you don't have time within the constraints of the examination room, and the examiner is simply looking for an outline description of what each stage does. Be careful about writing down all you know about each stage: there is a danger that the first thing you write down may be wrong. There is only one mark available and, if the answer is very long, the mark will be lost immediately. Also, don't think that marks can be carried across from another part. You may not know anything about code generation but know a lot about lexical analysis; sorry, the marks cannot be transferred in a question like this.

4. Explain, in detail, the stage of compilation known as lexical analysis. (6)

A.
- The source program is used as the input.
- Tokens are created from the individual characters and from...
- the special reserved words in the program.
- A token is a string of binary digits.
- Variable names are loaded into a look-up table known as the symbol table.
- Redundant characters (e.g. spaces) are removed.
- Comments are removed.
- Error diagnostics are issued. (6)

Notes: Compare this question with the last one. This one is asking for detail, so it becomes important to say as much as possible.
The examiner may be a little more lenient about something in a student's list that is wrong, but only a little! After all, the main thing about the stages of compilation is to know what each of them does, so don't make too many errors.