The Function and Purpose of Translators

4.2.1 Interpreters and Compilers

When electronic computers were first used, programs had to be written in machine
code. This code consisted of simple instructions, each of which was represented by a
binary pattern in the computer. To produce these programs, programmers had to write the
instructions in binary. This not only took a long time, it was also prone to errors. To improve
program writing, assembly languages were developed. Assembly languages allowed the use
of mnemonics and names for locations in memory. Each assembly instruction mapped to a
single machine instruction, which meant that it was fairly easy to translate a program written
in assembly language into machine code. To speed up this translation process, assemblers
were written that could be loaded into the computer so that the computer itself could
translate the assembly language into machine code. Writing programs in assembly language,
although easier than using machine code, was still tedious and took a long time.

After assembly languages came high-level languages, which used the kind of notation
familiar to the person writing the program. Thus FORTRAN (FORmula TRANslation) was
developed for science and engineering programs, and it used formulae in the same way as
scientists and engineers would. COBOL (Common Business Oriented Language) was developed
for business applications. Programs written in these languages needed to be translated into
machine code, and this led to the birth of compilers.

A compiler takes a program written in a high-level language and translates it into an
equivalent program in machine code. Once this is done, the machine code version can be
loaded into the machine and run without any further help as it is complete in itself. The
high-level language version of the program is usually called the source code and the
resulting machine code program is called the object code. The relationship between them is
shown in Fig. 4.2.1.







[Figure: the compiler translates the source code into the equivalent object code]

Fig. 4.2.1

The problem with using a compiler is that it uses a lot of computer resources. It has to be
loaded into the computer's memory at the same time as the source code, and there has to be
sufficient memory to hold the object code and to provide working storage while the
translation is taking place. Another disadvantage is that when an error occurs in a program
it is difficult to pinpoint its source in the original program.

An alternative system is to use interpretation. In this system each instruction is taken in
turn, translated to machine code and executed before the next instruction is translated.
This system was developed because early personal computers lacked the power and memory
needed for compilation. The method also has the advantage of producing error messages as
soon as an error is encountered, which means that the instruction causing the problem can
be easily identified. Against interpretation is the fact that execution of an interpreted
program is slow compared with that of a compiled program. This is because the original
program has to be translated every time it is executed; instructions inside a loop, for
example, have to be translated each time the loop is entered.

However, interpretation is very useful during program development as errors can be found
and corrected as soon as they are encountered. In fact many languages, such as Visual
Basic, use both an interpreter and a compiler. This enables the programmer to use the
interpreter during program development and, when the program is fully working, it can be
translated by the compiler into machine code. This machine code version can then be
distributed to users who do not have access to the original code.

Whether a compiler or interpreter is used, the translation from a high-level language to
machine code has to go through various stages and these are shown in Fig. 4.2.2.






[Fig. 4.2.2: the stages of translation: Lexical Analysis → Syntax Analysis → Semantic Analysis → Code Generation → Code Optimisation]

4.2.2 Lexical Analysis

The lexical analyser uses the source program as input and creates a stream of tokens for
the syntax analyser. Tokens are normally 16-bit unsigned integers. Each group of
characters is replaced by a token. Single characters, which have a meaning in their own
right, are replaced by their ASCII values. Multiple character tokens are represented by
integers greater than 255. Variable names will need to have extra information stored about
them. This is done by means of a symbol table. This table is used throughout compilation to
build up information about names used in the program. During lexical analysis only the
variable's name will be noted. It is during syntax and semantic analysis that details such as
the variable's type and scope are added. The lexical analyser may output some error
messages and diagnostics. For example, it will report errors such as illegal identifier or
identifier too long.
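
As an illustration, the following Python sketch shows one way such token codes might be
assigned. The particular values above 255 are arbitrary choices made for this example;
they are not fixed by any standard.

# Hypothetical token codes, following the scheme described above:
# single characters stand for themselves via their ASCII values,
# and multi-character tokens are given integers greater than 255.
TOKEN_CODES = {
    "+": ord("+"),        # 43, the ASCII value of '+'
    "=": ord("="),        # 61
    ":=": 256,            # multi-character tokens start above 255
    "IF": 257,
    "THEN": 258,
    "RETURN": 259,
    "identifier": 260,    # all variable names share one code; the name
                          # itself is recorded in the symbol table
    "integer": 261,
}

print(TOKEN_CODES["+"], TOKEN_CODES[":="])   # 43 256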

At various stages during compilation it will be necessary to look up details about the names
in the symbol table. This must be done efficiently, and a linear search is not fast enough.
It is usual, therefore, to use a hash table and to calculate the position of a name by hashing
the name itself. When two names hash to the same address, a linked list of entries at that
address can be used so that the symbol table does not fill up.
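
A sketch of such a symbol table in Python is given below. The simple additive hash
function is an assumption made for the illustration; a real compiler would use a better
one. Names that collide are chained in a list held in each slot.

# A hashed symbol table with chaining. Python's built-in dict would
# normally be used instead; this shows the mechanism itself.
class SymbolTable:
    def __init__(self, size=101):
        self.size = size
        self.buckets = [[] for _ in range(size)]    # one chain per slot

    def _hash(self, name):
        # Calculate the position of a name by hashing the name itself.
        return sum(ord(c) for c in name) % self.size

    def insert(self, name, details):
        chain = self.buckets[self._hash(name)]
        for entry in chain:
            if entry[0] == name:          # name already present:
                entry[1].update(details)  # add the new details to it
                return
        chain.append((name, dict(details)))   # new name (or a collision):
                                              # extend the chain

    def lookup(self, name):
        for entry in self.buckets[self._hash(name)]:
            if entry[0] == name:
                return entry[1]
        return None

table = SymbolTable()
table.insert("sum", {"type": "integer"})    # noted during lexical analysis
table.insert("sum", {"scope": "global"})    # added during later analysis
print(table.lookup("sum"))   # {'type': 'integer', 'scope': 'global'}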

The lexical analyser also removes redundant characters such as white space (spaces, tabs,
etc.) and comments. Often the lexical analysis takes longer than the other stages of
compilation. This is because it has to handle the original source code, which can have many
formats. For example, the following two pieces of code are equivalent although their format
is considerably different.


IF X = Y THEN      'square X
    Z := X * X
ELSE               'square Y
    Z := Y * Y
ENDIF
PRINT Z

IF X = Y THEN Z := X * X ELSE Z := Y * Y
PRINT Z

When the lexical analyser has completed its task, the code will be in a standard format
which means that the syntax analyser can always expect the format of its input to be the
same. Consider the instructions

X := 54
RETURN X

the lexical analyser will turn this into paired tokens: the first part identifies the type of
the token and the second part, where necessary, gives the value it represents. Thus the
above input is turned into


Variable X
Assignment symbol
Integer 54
Return symbol
Variable X
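
A minimal lexical analyser along these lines can be sketched in Python using regular
expressions. The token names and patterns are illustrative only, chosen to mirror the
pairs above; they are not those of any particular compiler.

import re

# Each pattern is tried in order; the reserved word RETURN comes before
# the general variable pattern so that it is not read as a name.
TOKEN_SPEC = [
    ("RETURN_SYMBOL", r"RETURN\b"),
    ("VARIABLE",      r"[A-Za-z_]\w*"),
    ("ASSIGNMENT",    r":="),
    ("INTEGER",       r"\d+"),
    ("SKIP",          r"\s+"),          # redundant white space is removed
]
PATTERN = re.compile("|".join(f"(?P<{name}>{pattern})"
                              for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Yield (token type, text) pairs, discarding white space."""
    for match in PATTERN.finditer(source):
        if match.lastgroup != "SKIP":
            yield (match.lastgroup, match.group())

for pair in tokenize("X := 54\nRETURN X"):
    print(pair)
# ('VARIABLE', 'X'), ('ASSIGNMENT', ':='), ('INTEGER', '54'),
# ('RETURN_SYMBOL', 'RETURN'), ('VARIABLE', 'X')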

4.2.3 Syntax Analysis

This Section should be read in conjunction with Section 4.5.11 which discusses Backus-Naur
Form (BNF) and syntax diagrams.

During this stage of compilation the code generated by the lexical analyser is parsed
(broken into small units) to check that it is grammatically correct. All languages have rules
of grammar and computer languages are no exception. The grammar of programming
languages is defined by means of BNF notation or syntax diagrams. It is against these rules
that the code has to be checked.

For example, taking a very elementary language, an assignment statement may be defined
to be of the form

<variable> <assignment_operator> <expression>

and an expression is defined as

<variable> <arithmetic_operator> <variable>

and the parser must take the output from the lexical analyser and check that it is of this
form.

If the statement is

sum := sum + number

the parser will receive

<variable> <assignment_operator> <variable> <arithmetic_operator> <variable>

which becomes

<variable> <assignment_operator> <expression>

and then

<assignment statement>

which is valid.

If the original statement is

sum := sum + + number

this will be input as

<variable> <assignment_operator> <variable> <arithmetic_operator>
<arithmetic_operator> <variable>

and this cannot be reduced to a valid statement, hence an error message will be returned.
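
This reduction process can be sketched in Python. The two functions below simply test
whether a list of token categories matches the two grammar rules given above; real parsers
are table-driven or recursive over a full grammar, but the principle is the same.

def is_expression(tokens):
    # <expression> ::= <variable> <arithmetic_operator> <variable>
    return (len(tokens) == 3
            and tokens[0] == "variable"
            and tokens[1] == "arithmetic_operator"
            and tokens[2] == "variable")

def is_assignment(tokens):
    # <assignment> ::= <variable> <assignment_operator> <expression>
    return (len(tokens) > 2
            and tokens[0] == "variable"
            and tokens[1] == "assignment_operator"
            and is_expression(tokens[2:]))

# sum := sum + number reduces to a valid assignment statement
print(is_assignment(["variable", "assignment_operator",
                     "variable", "arithmetic_operator", "variable"]))   # True

# sum := sum + + number does not: the doubled operator cannot be reduced
print(is_assignment(["variable", "assignment_operator",
                     "variable", "arithmetic_operator",
                     "arithmetic_operator", "variable"]))               # False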

It is at this stage that invalid names can be found, such as PRIT instead of PRINT: PRIT
will be read as a variable name instead of a reserved word, which means that the statement
containing PRIT will not parse to a valid statement. Note that in languages that require
variables to be declared before being used, the lexical analyser may pick up this error
because PRIT has not been declared as a variable and so is not in the symbol table.

Most compilers will report errors found during syntax analysis as soon as they are found
and will attempt to show where each error has occurred. However, they may not be very
accurate in locating an error, nor may the error message be very clear.

During syntax analysis certain semantic checks are carried out. These include label checks,
flow of control checks and declaration checks.

Some languages allow GOTO statements (not recommended by the authors) which allow
control to be passed, unconditionally, to another statement which has a label. The GOTO
statement specifies the label to which the control must pass. The compiler must check that
such a label exists.

Certain control constructs can only be placed in certain parts of a program. For example, in
C (and C++) the continue statement can only be placed inside a loop and the break
statement can only be placed inside a loop or switch statement. The compiler must ensure
that statements like these are used in the correct place.

Many languages insist on the programmer declaring variables and their types. It is at this
stage that the compiler verifies that all variables have been properly declared and that they
are used correctly.
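
A declaration check of this kind can be sketched in Python: every name used in a statement
is looked up, and any name absent from the set of declared variables is reported. The
statement format here is invented purely for the illustration.

def check_declarations(statements, declared):
    """Report every use of a name that has not been declared."""
    errors = []
    for line_number, names_used in statements:
        for name in names_used:
            if name not in declared:
                errors.append(f"line {line_number}: '{name}' not declared")
    return errors

declared = {"sum", "number", "Z"}           # built up from the declarations
statements = [(10, ["sum", "number"]),      # sum := sum + number
              (20, ["PRIT", "Z"])]          # PRIT read as a variable name
print(check_declarations(statements, declared))
# ["line 20: 'PRIT' not declared"]
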
4.2.4 Code Generation

It is at this stage, when all the errors due to incorrect use of the language have been
removed, that the program is translated into code suitable for use by the computer's
processor.

During lexical and syntax analysis a table of variables has been built up which includes
details of the variable name, its type and the block in which it is valid. The address of the
variable is now calculated and stored in the symbol table. This is done as soon as the
variable is encountered during code generation.

Before the final code can be generated, an intermediate code is produced. This intermediate
code can then be interpreted, or translated into machine code; in the latter case the code
can be saved and distributed as an executable program. Two methods can be used to
represent the intermediate code: one uses a tree structure and the other uses three
address code (TAC). TAC allows no more than three operands per instruction, and
instructions take the form

Operand1 := Operand2 Operator Operand3

For example, using three address code, the assignment statement

A := (B + C) * (D – E) / F

can be evaluated in the following order, where Ri represents an intermediate result.

R1 := B + C

R2 := D – E

R3 := R1 * R2

A := R3 / F

This can also be represented by the tree shown in Fig. 4.2.4.1.






















[Figure: the tree for A := (B + C) * (D – E) / F. The root is :=, with children A and /;
the / node has children * and F; the * node has children + (over B and C) and – (over D
and E)]

Fig. 4.2.4.1
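
Generating three address code from such a tree is a simple post-order traversal: each
operator node emits one TAC instruction once its operands are available. The Python sketch
below reproduces the worked example; the tuple-based tree format is an assumption made
for this illustration.

counter = 0    # numbers the intermediate results R1, R2, ...

def emit(tree, code):
    """Walk the tree post-order, appending one TAC line per operator."""
    global counter
    if isinstance(tree, str):            # a leaf: a variable name
        return tree
    operator, left, right = tree
    left_name = emit(left, code)         # operands are produced first
    right_name = emit(right, code)
    counter += 1
    result = f"R{counter}"
    code.append(f"{result} := {left_name} {operator} {right_name}")
    return result

# The tree for (B + C) * (D - E), as in Fig. 4.2.4.1
tree = ("*", ("+", "B", "C"), ("-", "D", "E"))
code = []
product = emit(tree, code)
code.append(f"A := {product} / F")       # the final division and assignment
print("\n".join(code))
# R1 := B + C
# R2 := D - E
# R3 := R1 * R2
# A := R3 / F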

Other statements can be represented in similar ways and the final stage of compilation can
then take place.

The compiler has to consider, at this point, the type of code that is required. Code can be
optimised for speed of execution or for size of program. Often compilers try to compromise
between the two.

An example of code optimisation is shown in Fig. 4.2.4.2 where the code on the left has
been changed to that on the right so that r1 * b is only evaluated once.


a := 5 + 3        a := 5 + 3
b := 5 * 3        b := 5 * 3
r1 := a + b       r1 := a + b
r2 := r1 * b      r2 := r1 * b
r3 := r2 / a      r3 := r2 / a
r4 := r1 * b      r4 := r2
r5 := r4 + 6      r5 := r4 + 6
c := r3 – r5      c := r3 – r5

Fig. 4.2.4.2

However, care must be taken when doing this as it does not work in the case shown in Fig.
4.2.4.3 because the value of b changes between the two evaluations of r1 * b. Hence the
code on the right is not equivalent to that on the left.






a := 5 + 3        a := 5 + 3
b := 5 * 3        b := 5 * 3
r1 := a + b       r1 := a + b
r2 := r1 * b      r2 := r1 * b
r3 := r2 / a      r3 := r2 / a
b := b + 2        b := b + 2
r4 := r1 * b      r4 := r2
r5 := r4 + 6      r5 := r4 + 6
c := r3 – r5      c := r3 – r5

Fig. 4.2.4.3

In the first example we have, in the revised code, the copy statement r4 := r2. This means
that r4 and r2 represent the same value and hence one of them can be removed. This leads
to the code in Fig. 4.2.4.4.


a := 5 + 3
b := 5 * 3
r1 := a + b
r2 := r1 * b
r3 := r2 / a
r5 := r2 + 6
c := r3 – r5

Fig. 4.2.4.4
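
The elimination of a repeated expression can be sketched in Python. The optimiser below
remembers each expression and the variable holding its value; crucially, as Fig. 4.2.4.3
shows, it must forget an expression as soon as one of its operands is reassigned. This is
a fragment for illustration, not a complete optimiser.

def eliminate_common_subexpressions(code):
    seen = {}          # maps "operand operator operand" -> holding variable
    optimised = []
    for target, expression in code:
        # Reassigning `target` invalidates any remembered expression
        # that uses it or is held in it.
        seen = {e: v for e, v in seen.items()
                if target not in e.split() and v != target}
        if expression in seen:
            optimised.append((target, seen[expression]))   # copy statement
        else:
            optimised.append((target, expression))
            if target not in expression.split():   # b := b + 2 is not reusable
                seen[expression] = target
    return optimised

code = [("r1", "a + b"), ("r2", "r1 * b"),
        ("r3", "r2 / a"), ("r4", "r1 * b")]
for target, expression in eliminate_common_subexpressions(code):
    print(f"{target} := {expression}")
# r1 := a + b
# r2 := r1 * b
# r3 := r2 / a
# r4 := r2        (as in Fig. 4.2.4.2)
# Inserting ("b", "b + 2") before r4 makes the optimiser forget r1 * b,
# so r4 := r1 * b is kept, exactly as required by Fig. 4.2.4.3.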

Another way of optimising code is to use algebraic transformations. For example, the
statement

a := b * 1

can be replaced by the statement

a := b

and

a := b + 0

by

a := b
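
These transformations are easily mechanised. The sketch below applies the two identities
given above to single TAC statements; it shows the idea only, under the assumption that
each statement has already been split into its parts.

def simplify(target, left, operator, right):
    """Apply the algebraic identities x * 1 = x and x + 0 = x."""
    if (operator == "*" and right == "1") or (operator == "+" and right == "0"):
        return f"{target} := {left}"
    return f"{target} := {left} {operator} {right}"

print(simplify("a", "b", "*", "1"))   # a := b
print(simplify("a", "b", "+", "0"))   # a := b
print(simplify("a", "b", "+", "c"))   # a := b + c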

There are many other ways of optimising code but they are beyond that expected at this
level.



4.2.5 Linkers and Loaders

Programs are usually built up in modules. These modules are then compiled into machine
code that has to be loaded into the computer's memory. This process is done by the loader.
The loader decides where in memory to place the code and then adjusts memory addresses
as described in Chapter 4.1. As the whole program may consist of many modules, all of
which have been separately compiled, the modules will have to be correctly linked once
they have been loaded. This is the job of the linker. The linker calculates the addresses of
the separate pieces that make up the whole program and then links these pieces so that all
the modules can interact with one another.

The idea of using modules that can be used in many programs was explained in Section
1.3.2. This method of creating programs is important as it reduces the need to keep
rewriting code and will be further discussed under object oriented programming in Section
4.5.6.
4.2.6 Example Questions

1. Explain why the size of the memory available is particularly relevant to the process of
compilation. (4)

A. The computer must be able to simultaneously hold in memory:
-The compiler software/without which the process cannot be carried out.
-The source code/the code that needs to be compiled
-The object code/because the compilation process produces the program in machine code
form.
-Working space/processing area to be used by the compiler during the process.
(4)
Notes: This question is trying to make the candidate think about the implications of
creating a machine code version of the high-level language program. The size of the
memory is not as significant as it used to be, because the memory in a micro is now
usually large enough for the problem not to arise; but in the past the compilation of
programs could cause trouble on a micro, which led to interpretation becoming the
standard translation technique there.


2. a) Explain the difference between the two translation techniques of interpretation and
compilation. (2)

b) Give one advantage of the use of each of the two translation techniques. (2)

A.a) -Interpretation involves the translation of an instruction and the execution of that
instruction before the translation of the next.
-The compilation of a program involves creating the machine code version of the program
before allowing any of it to be run. (2)

b) -Interpretation provides the programmer with better error diagnostics because the
source code is always present and hence can be used to provide a reference whenever the
error occurs.
-When a program is compiled no further translation is necessary no matter how many
times the program is run, consequently there is nothing to slow down the execution of the
program. (2)

Notes: The question is probably not in its best form, as there are many responses that
could justifiably be given for the difference between the two processes. A perfectly
acceptable response here would be that interpretation does not create object code while
compilation does.

3. State the three stages of compilation and describe, briefly, the purpose of each. (6)
A. -Lexical analysis
-puts each statement into the form best suited to the syntax analyser.

-Syntax analysis
-Language statements are checked against the rules of the language.

-Code generation
-The machine code (object code) is produced. (6)

Notes: The number of marks for the question plays a big part in this answer. There are
only six marks, three of which must be for stating the three stages. This means that there
is only one mark each for saying what the purpose of each stage is. Do not launch into a
long essay; you don't have time within the constraints of the examination room, and the
examiner is simply looking for an outline description of what each stage does. Be careful
about writing down everything you know about each stage: there is a danger that the first
thing you write down may be wrong, and as there is only one mark available, a very long
answer can lose it immediately. Also, don't think that the marks can be carried across
from another part. You may not know anything about code generation, but you do know a lot
about lexical analysis – sorry, the marks cannot be transferred over in a question like this.


4. Explain, in detail, the stage of compilation known as lexical analysis. (6)

A. -Source program is used as the input
-Tokens are created from the individual characters and from…
-the special reserved words in the program.
-A token is a string of binary digits.
-Variable names are loaded into a look-up table known as the symbol table
-Redundant characters (e.g. spaces) are removed
-Comments are removed
-Error diagnostics are issued. (6)

Notes: Compare this question with the last one. This one is asking for the details, so it
becomes important to say as much as possible. The examiner may be a little more lenient
about something in a student’s list that is wrong, but only a little! After all, the main thing
about the stages of compilation is to know when each of them is appropriate, so don’t make
too many errors.