Data representation, structure, and management

Data: Its Representation, Structure and
Management

1.4.1 Number Systems

Counting is one of the first skills that a young child masters, and none of us consider
counting from 1 to 100 difficult. However, to count, we have to learn, by heart, the
meanings of the symbols 0,1,2…9 and also to understand that two identical symbols mean
totally different things according to their ‘place’ in the number. For instance, in 23 the 2
actually means 2 * 10. But why multiply by 10? Why not multiply by 6? The answer is
simply that we were taught to do that because we have 10 fingers, so we can count on our
fingers until we get to the last one, which we remember in the next column and then start
again.
We don’t need to count in tens. The ancient Babylonians counted in a system, which is
similar to counting in sixties. This is very difficult to learn because of all the symbols
needed, but we still use a system based on sixties today: 60 minutes = 1 hour; 60 seconds
= 1 minute; 6 * 60 degrees = 1 revolution.
Instead of increasing the number of symbols in a system, which makes the system more
difficult, it seems reasonable that if we decrease the number of symbols the system will be
easier to use.
A computer is an electronic machine. Electricity can be either on or off. If electricity is not
flowing through a wire that can stand for 0. If electricity is flowing, then it stands for 1. The
difficulty is what to do for the number 2. We can’t just pump through twice as much
electricity, what we need is a carry system, just like what happens when we run out of
fingers. What we need is another wire.

The ‘units’ wire no electricity 0
= 0
The ‘twos’ wire no electricity 0

ADD 1

electricity 1
= 1
no electricity 0

ADD 1

no electricity 0
= 2
Carry electricity 1

ADD 1

electricity 1
=3
electricity 1

ADD 1

no electricity 0
Carry
no electricity 0 =4

Carry electricity 1

The computer can continue like this for ever, just adding more wires when it gets bigger
numbers.
This system, where there are only two digits, 0 and 1, is known as the binary system. Each
wire, or digit, is known as a binary digit. This name is normally shortened to BIT. So each
digit, 0 or 1, is one bit. A single bit has very few uses so they are grouped together. A
group of bits is called a BYTE. Usually a byte has 8 bits in it. In section 4 of this chapter we
will see that bytes can be of different sizes, but for now we will say that there are 8 bits in
a byte.

The first thing we must be able to do with the binary system is to change numbers from
our system of 10 numbers (the denary system) into binary, and back again. There are a
number of methods for doing this, the simplest being to use the sort of column diagrams,
which were used in primary school to do simple arithmetic
Thousands Hundreds Tens Units

except, this time we are using binary, so the column headings go up in twos instead of tens
32s 16s 8s 4s 2s units

To turn a denary number into a binary number simply put the column headings, start at the
left hand side and follow the steps:
If the column heading is less than the number, put a 1 in the column and then subtract the
column heading from the number. Then start again with the next column on the right.
If the column heading is greater than the number, put a 0 in the column and start again
with the next column on the right.
Note: You will be expected to be able to do this with numbers up to 255, because that is
the biggest number that can be stored in one byte of eight bits.
e.g. Change 117 (in denary) into a binary number.
Answer: Always use the column headings for a byte (8 bits)
128 64 32 16 8 4 2 1

Follow the algorithm.
128 is greater than 117 so put a 0 and repeat.
128 64 32 16 8 4 2 1
0
64 is less than 117 so put a 1.
128 64 32 16 8 4 2 1
0 1
Take 64 from 117 = 53, and repeat.
32 is less than 53, so put a 1.
128 64 32 16 8 4 2 1
0 1 1
Take 32 from 53 = 21, and repeat.
If you continue this the result (try it) is
128 64 32 16 8 4 2 1
0 1 1 1 0 1 0 1
So 117 (in denary) = 01110101 (in binary).

To turn a binary number into denary, simply put the column headings above the binary
number and add up all the columns with a 1 in them.
e.g. Change 10110110 into denary.
Answer:
128 64 32 16 8 4 2 1
1 0 1 1 0 1 1 0
So 10110110 = 128 + 32 + 16 + 4 + 2 = 182 (in denary).

This principle can be used for any number system, even the Babylonians’ sixties if you can
learn the symbols.
e.g. If we count in eights (called the OCTAL system) the column headings go up in 8’s.
512 64 8 1

So 117 (in denary) is 1 lot of 64, leaving another 53.
53 is 6 lots of 8 with 5 left over. Fitting this in the columns gives
512 64 8 1
0 1 6 5
So 117 in denary is 165 in octal.

Why bother with octal?
Octal and binary are related. If we take the three digits of the octal number 165 and turn
each one into binary using three bits each we get
1 = 001 6 = 110 5 = 101
Put them together and we get 001110101 which is the binary value of 117 which we got
earlier.
The value of this relationship is not important now, but it is the reason why octal is in the
syllabus.

Another system is called HEXADECIMAL (counting in 16’s). This sounds awful, but just use
the same principles.
256 16 1

So 117 (in denary) is 7 lots of 16 (112) plus an extra 5. Fitting this in the columns gives
256 16 1
0 7 5
Notice that 7 in binary is 0111 and that 5 is 0101, put them together and we get 01110101
which is the binary value of 117 again. So binary, octal and hexadecimal are all related in
some way.
There is a problem with counting in 16’s instead of the other systems. We need symbols
going further than 0 to 9 (only 10 symbols and we need 16!).
We could invent 6 more symbols but we would have to learn them, so we use 6 that we
already know, the letters A to F. In hexadecimal A stands for 10, B stands for 11 and so on
to F stands for 15.
So a hexadecimal number BD stands for 11 lots of 16 and 13 units
= 176 + 13
= 189 ( in denary)
Note: B = 11, which in binary = 1011
D = 13, which in binary = 1101
Put them together to get 10111101 = the binary value of 189.

Binary Coded Decimal
Some numbers are not proper numbers because they don’t behave like numbers. A
barcode for chocolate looks like a number, and a barcode for sponge cake looks like a
number, but if the barcodes are added together the result is not the barcode for chocolate
cake. The arithmetic does not give a sensible answer. Values like this that look like
numbers but do not behave like them are often stored in binary coded decimal (BCD).
Each digit is simply changed into a four bit binary number which are then placed after one
another in order.
e.g. 398602 in BCD
Answer: 3 = 0011 9 = 1001
8 = 1000 6 = 0110
0 = 0000 2 = 0010
So 398602 = 001110011000011000000010 (in BCD)
Note: All the zeros are essential otherwise you can’t read it back.
1.4.2 Negative Integers

If a computer system uses a byte to store a number in the way that was suggested in 1.4.1
there are three problems that arise. The first is that the biggest number that can be
represented is 255 because there aren’t enough bits to store bigger numbers. This is easily
solved by using more than one byte to represent a number. Most computer systems use
either two or four bytes to store a number. There is still a limit on the size that can be
represented, but it is now much larger. The second problem is not so easy to solve, how to
represent fractions. This will be looked at later in this chapter. The third problem is how to
store negative numbers.
The example we used for binary storage was 117 which becomes 01110101 in binary. If
we want to store +117 or –117, these numbers need a second piece of data to be stored,
namely the sign. There are two simple ways to store negative numbers.

Sign and Magnitude.
Use the first bit in the byte (the most significant bit (MSB)) to represent the sign (0 for +
and 1 for -) instead of representing 128. This means that
+117 = 01110101 and -117 = 11110101
Notes: The range of numbers possible is now –127 to +127.
The byte does not represent just a number but also a sign, this makes arithmetic difficult.

Two’s Complement
The MSB stays as a number, but is made negative. This means that the column headings
are
-128 64 32 16 8 4 2 1

+117 does not need to use the MSB, so it stays as 01110101.
-117 = -128 + 11
= -128 + (8 + 2 + 1) fitting this in the columns gives 10001011
Two’s complement seems to make everything more complicated for little reason at the
moment, but later it becomes essential for making the arithmetic easier.

1.4.3 Binary Arithmetic

The syllabus requires the addition of two binary integers, and the ability to take one away
from another. The numbers and the answers will be limited to one byte.

Addition.
There are four simple rules 0 + 0 = 0
0 + 1 = 1
1 + 0 = 1
and the difficult one 1 + 1 = 0 (Carry 1)
e.g. Add together the binary equivalents of 91 and 18
Answer: 91 = 0 1 0 1 1 0 1 1
18 = 0 0 0 1 0 0 1 0 +
0 1 1 0 1 1 0 1 = 109
1 1
Subtraction.
This is where two’s complement is useful. To take one number away from another, simply
write the number to be subtracted as a two’s complement negative number and then add
them up.
e.g. Work out 91 – 18 using their binary equivalents.
Answer: 91 = 01011011
-18 as a two’s complement number is –128 + 110
= -128 +(+64 +32 +8 +4 +2)
= 11101110
Now add them 0 1 0 1 1 0 1 1
1 1 1 0 1 1 1 0 +
1 0 1 0 0 1 0 0 1
1 1 1 1 1 1 1
But the answer can only be 8 bits, so cross out the 9th bit giving
01001001 = 64 + 8 + 1 = 73.
Notes: Lots of carrying here makes the sum more difficult, but the same rules are used.
One rule is extended slightly because of the carries, 1+1+1 = 1 (carry 1)
Things can get harder but this is as far as the syllabus goes.
1.4.4 Character Sets.

There are other types of data that have to be stored in computers besides numbers, for
instance, the letters of the alphabet. These are stored as codes which look like binary
numbers. For instance A could be stored as 000, B as 001 and so on. Unfortunately, there
are only 8 possible codes using 3 bits, so we could store the letters A to H but not the rest,
and what about the lower case letters and punctuation and…? The computer can store as
many characters as necessary simply by using more and more bits for the code. Some
systems don’t need to be able to recognise a lot of characters so they only use a few bits
for each character. The number of bits needed to store one character is called a byte which
is usually said to have 8 bits because most systems use 8 bits to store the code for each
character.
We now have enough codes, but another problem arises. If my computer stores A as
01000001 and your computer stores A as 01000010 then the computers cannot
communicate because they cannot understand each other’s codes. In the 1960’s a meeting
in America agreed a standard set of codes so that computers could communicate with each
other. This standard set of codes is known as the ASCII set. Most systems use ASCII so
you can be fairly sure that when you type in A it is stored in the computer’s memory as
01000001. However, you can’t be sure because some systems use other codes. A less
common code is called EBCDIC, it was developed for use by larger scale computer
systems and differs in that the code for each character is different than that used in ASCII.
Notes: All the characters that a system can recognise are called its character set.
ASCII uses 8 bits so there are 256 different codes that can be used and hence 256
different characters. (This is not quite true, we will see why in chapter 1.6.)
A problem arises when the computer retrieves a piece of data from its memory. Imagine
that the data is 01000001. Is this the number 65, or is it A?
They are both stored in the same way, so how can it tell the difference?
The answer is that characters and numbers are stored in different parts of the memory, so
it knows which one it is by knowing whereabouts it was stored.
1.4.5 Data Types

The computer needs to use different types of data in the operation of the system. All of
these different types of data will look the same because they all have to be stored as
binary numbers. The computer can distinguish one type of data from another by seeing
whereabouts in memory it is stored.

Numeric data.
There are different types of numbers that the computer must be able to recognise.
Numbers can be restricted to whole numbers, these are called INTEGERS and are stored
by the computer as binary numbers using a whole number of bytes. It is usual to use
either 2 bytes (called short integers) or 4 bytes (called long integers), the difference being
simply that long integers can store larger numbers. Section 1.4.2 described how negative
integers could be stored, these have to be treated as different types of data by the
computer because the MSB of the data stands for something different than in an ordinary
unsigned integer.
Another type of number is a fraction (or decimal). This is stored by changing the position of
the point so that it is always at the front of the number and then storing the fraction that is
left (there is no whole part because the point comes first). However, 2.3 and .23 will now
both be stored as .23 so we need an extra piece of data to be stored, the number of places
the point has had to move. Fractions, then, are stored in two parts: the fraction part (called
the mantissa) and the number of places the point has had to move (called the exponent).
Because the first stage is always to move the point to the front of the number, and it is like
the point “floating to the top”, fractions are stored in FLOATING POINT form. Obviously
there is a lot missing from this explanation, but that is more than enough for this part of
the course. Simply remember that fractions are stored in binary with the point at the front,
and then a separate part is the number of places the point has had to move to get it to the
front.

Boolean data
Sometimes the answer to a question is either yes or no, true or false. There are only two
options. The computer uses binary data which consists of bits of information that can be
either 0 or 1, so it seems reasonable that the answer to such questions can be stored as a
single bit with 1 standing for true and 0 standing for false. Data which can only have two
states like this is known as BOOLEAN data.
A simple example of its use would be in the control program for an automatic washing
machine. One of the important pieces of information for the processor would be to know
whether the door was shut. A boolean variable could be set to 0 if it was open and to 1 if it
was shut. A simple check of that value would tell the processor whether it was safe to fill
the machine with water.

Characters
A character can be anything, which is represented in the character set of the computer by
a character code in a single byte.
1.4.6 Arrays

Data stored in a computer is stored at any location in memory that the computer decides
to use. This means that similar pieces of data can be scattered all over memory. This, in
itself, doesn’t matter to the user, except that to find each piece of data it has to be referred
to by a variable name.
e.g. If it is necessary to store the 20 names of students in a group then each location
would have to be given a different variable name. The first, Jane, might be stored in
location NAME, the second, John, might be stored in FORENAME, the third, Penny, could be
stored in CHRINAME, but I’m now struggling, and certainly 20 different variable names that
made sense will be very taxing to come up with. Apart from anything else, the variable
names are all going to have to be remembered.
Far more sensible would be to force the computer to store them all together using the
variable name NAME. However, this doesn’t let me identify individual names, so if I call the
first one NAME(1) and the second NAME(2) and so on, it is obvious that they are all
peoples’ names and that they are distinguishable by their position in the list. Lists like this
are called ARRAYS.
Because the computer is being forced to store all the data in an array together, it is
important to tell the computer about it before it does anything else so that it can reserve
that amount of space in its memory, otherwise there may not be enough space left when
you want to use it. This warning of the computer that an array is going to be used is called
INITIALISING the array. Initialising should be done before anything else so that the
computer knows what is coming.
Initialising consists of telling the computer
what sort of data is going to be stored in the array so that the computer knows what part
of memory it will have to be stored in
how many items of data are going to be stored, so that it knows how much space to
reserve
the name of the array so that it can find it again.
Different programming languages have different commands for doing this but they all do
the same sort of thing, a typical command would be
DIM NAME$(20)
DIM is a command telling the computer that an array is going to be used
NAME is the name of the array
$ tells the computer that the data is going to be characters
(20) tells it that there are going to be up to 20 pieces of data.
Notes: Just because the computer was told 20 does not mean that we have to fill the array,
the 20 simply tells the computer the maximum size at any one time.
The array that has been described so far is really only a list of single data items. It would
be far more useful if each student had a number of pieces of information about them,
perhaps their name, address, date of birth. The array would now have 20 students and
more than one thing about each, this is called a two dimensional array. Obviously
everything gets more complicated now, but don’t worry as use of multi-dimensional arrays
is for later in the course.

We should now have a picture of a part of memory which has been reserved for the array
NAME$
Jane Name$(1)
John Name$(2)
- -
NAME$ - -
- -
Zaid Name$(20)

To read data into the array simply tell the computer what the data is and tell it the position
to place it in e.g. NAME$(11) = Thomas will place Thomas in position 11 in the array
(incidentally, erasing any other data that happened to be in there first.
To read data from the array is equally simple, tell the computer which position in the array
and assign the data to another value
e.g. RESULT = NAME$(2) will place John into a variable called RESULT.
Searching for a particular person in the array involves a simple loop and a question
e.g. search for Vera in the array NAME$
Answer:
Counter = 1
While Counter is less than 21, Do
If NAME$(Counter) = Vera Then Print “Found” and End.
Else Add 1 to Counter
Endwhile
Print “Name not in array”
Try to follow the logic of the steps above.
1.4.7 Linked Lists

When data is stored in a computer the processor can store it in any location provided it can
get the data back. In order to get the data back, each of the locations where the data is
stored is given an address so that if the computer can remember the address of where it
put some information it can easily retrieve it. If everything has to be in the index then the
index can get very large, it seems reasonable to try to cut down the size of the index by
grouping things together under one index entry. One method for doing this is by using an
array, the twenty people in the set could all be found by the one reference to NAME which
pointed to the location of the array. There are two major problems with arrays. The first is
that if the size of the set grows because a new student joins there is no room in the array
to store the information. This is because the array size has to be predetermined. The array
could be made much bigger, say size 50, so that we are sure it will never be too small, but
this leads to the second problem, that most of this space will never be used, consequently
wasting valuable memory. These problems can be overcome by using a linked list.
A linked list of data items tells the computer to store the data in any location and to link it
to the previous data item by giving the previous data item the address of the new one.
That sounds very complex, the idea is simple if we look at it in diagram form.
Imagine the list of names used in the example for the array.

Address Data Next address
Start address 60 Penny 75
61 61 Jane 86
62
63 Zaid XX
-
-
85
86 John 60
-

The diagram is intended to show a portion of memory. Each of the locations that can hold
data are given an address, and the address of the first data in the list is stored separately.
This tells the computer where the first piece of data is. If the first data item is not the one
that is wanted, the computer goes to the next piece of data by using the address that is
stored with the data. These addresses are called pointers, the last one of which has to be
one that cannot possibly exist so that the computer knows it has reached the end of the
list. Notice that the two problems associated with arrays no longer exist.
Linked lists are often drawn showing the data in boxes and the linking addresses as
pointers. This is the better method for describing a list, just don’t lose sight of the fact that
the arrows are really address references.

Start Jane Penny

John Zaid XX

Note: The jagged line signifies that there are a number of others which would fit in there,
but they are not shown.

To initialise a list, all that needs to be done is to create a new start pointer for this list and
add it to the index of start pointers for all the other lists.

To search through the list for a particular piece of data follow these rules
1. Find the correct list in the index of lists
2. Follow the pointer to the next item
3. If the item is the one being searched for, report that it is found and end.
4. If the pointer shows that the end of the list has been reached, report that the item is not
there and end.
5. Go to step 2.

To remove a value from a list, simply change the pointer that points to it into one that
points to the next value after it. E.g. to remove John from the example

Start Jane Penny

John Zaid XX

Note that John’s data is still there, its just that there is no way of getting to it so it might
just as well not be. 1.4.8 Stacks and Queues

Queues.
Information arrives at a computer in a particular order, it may not be numeric, or
alphabetic, but there is an order dependent on the time that it arrives. Imagine Zaid, Jane,
John and Penny send jobs for printing, in that order. When these jobs arrive they are put in
a queue awaiting their turn to be dealt with. It is only fair that when a job is called for by
the printer that Zaid’s job is sent first because his has been waiting longest. These jobs are
held, just like the other data we have been talking about, in an array. The jobs are put in
at one end and taken out of the other. All the computer needs is a pointer showing it which
one is next to be done (start pointer(SP)) and another pointer showing where the next job
to come along will be put (end pointer(EP))
1. Zaid is in the queue for printing, the end pointer is pointing at where the next job will go.
2. Jane’s job is input and goes as the next in the queue, the end pointer moves to the next
available space.
3. Zaid’s job goes for printing so the start pointer moves to the next job, also John’s job
has been input so the end pointer has to move.

EP
EP John
EP Jane SP Jane
SP Zaid SP Zaid
1. 2. 3.

Notes: The array is limited in size, and the effect of this seems to be that the contents of
the array are gradually moving up. Sooner or later the queue will reach the end of the
array. The queue does not have to be held in an array, it could be stored in a linked list.
This would solve the problem of running out of space for the queue, but does not feature in
this course until the second year.

Stacks.
Imagine a queue where the data was taken off the array at the same end that it was put
on. This would be a grossly unfair queue because the first one there would be the last one
dealt with. This type of unfair queue is called a stack.
A stack will only need one pointer because adding things to it and taking things off it are
only done at one end
1. Zaid and Jane are in the stack. Notice that the pointer is pointing to the next space.
2. A job has been taken off the stack. It is found by the computer at the space under the
pointer (Jane’s job), and the pointer moves down one.
3. John’s job has been placed on the stack in the position signified by the pointer, the
pointer then moves up one. This seems to be wrong, but there are reasons for this being
appropriate in some circumstances which we will see later in the course.

Pointer Pointer
Jane Pointer John
Zaid Zaid Zaid
1. 2. 3.

In a queue, the Last one to come In is the Last one to come Out. This gives the acronym
LILO, or FIFO (First in is the first out).
In a stack, the Last one In is the First one Out. This gives the acronym LIFO, or FILO (First
in is the last out).
1.4.9 Files, Records, Items, Fields.

Data stored in computers is normally connected in some way. For example, the data about
the 20 students in the set that has been the example over the last three sections has a
connection because it all refers to the same set of people. Each person will have their own
information stored, but it seems sensible that each person will have the same information
stored about them, for instance their name, address, telephone number, exam grades…
All the information stored has an identity because it is all about the set of students, this
large quantity of data is called a FILE.
Each student has their own information stored. This information refers to a particular
student, it is called their RECORD of information. A number of records make up a file.
Each record of information contains the same type of information, name, address and so
on. Each type of information is called a FIELD. A number of fields make up a record and all
records from the same file must contain the same fields.
The data that goes into each field, for example “Jane Smith”, “3 Canal St.” will be different
in most of the records. The data that goes in a field is called an ITEM of data.
Note: Some fields may contain the same items of data in more than one record. For
example, there may be two people in the set who happen to be called Jane Smith. If Jane
Smith’s brother Tom is in this set he will presumably have the same address as Jane. It is
important that the computer can identify individual records, and it can only do this if it can
be sure that one of the fields will always contain different data in all the records. Because
of this quality, that particular field in the record is different from all the others and is known
as the KEY FIELD. The key field is unique and is used to identify the record. In our example
the records would contain a field called school number which would be different for each
student.
Note: Jane Smith is 10 characters (1 for the space), Christopher Patterson is 21
characters. It makes it easier for the computer to store things if the same amount of space
is allocated to the name field in each record. It might waste some space, but the searching
for information can be done more quickly. When each of the records is assigned a certain
amount of space the records are said to be FIXED LENGTH. Sometimes a lot of space is
wasted and sometimes data has to be abbreviated to make it fit. The alternative is to be
able to change the field size in every record, this comes later in the course.
1.4.10 Record Formats.

To design a record format, the first thing to do is to decide what information would be
sensible to be stored in that situation.
e.g. A teacher is taking 50 students on a rock-climbing trip. The students are being
charged 20 pounds each and, because of the nature of the exercise, their parents may
need to be contacted if there is an accident. The teacher decides to store the information
as a file on a computer. Design the record format for the file.
Answer.
The fields necessary will be Student name, Amount paid, Emergency telephone number,
Form (so that contact can be made in school). There are other fields that could be included
but we will add just one more, the school number (to act as the key field).
For each one of these fields it is necessary to decide what type of data they will be and
also to decide how many characters will be allowed for the data in that field, remember
these are fixed length records.
The easiest way is to write them in a table
Student number Integer 1 byte
Student name Character 20 bytes
Amount paid Integer 1 byte
Emergency number Character 12 bytes
Form (e.g.3RJ) Character 3 bytes
Notes: It would be perfectly reasonable to say that the school number was not a proper
number so it should be stored as characters (probably 4 bytes) or in BCD (2 bytes).
The student name is quite arbitrary. 15 bytes would be perfectly reasonable, as would 25
bytes, but 5 bytes would not. In other words there is no single right answer, but there are
wrong ones.
The amount paid is listed as an integer as the teacher will store the number of whole
pounds so far paid. This is not the best data type for an amount of money but it is the only
one that fits so far. When you have learned about other data types use them if they are
more suitable.
Many students expect that the emergency number should be an integer, but phone
numbers start with a 0, and integers are not allowed to. As most numbers do start with 0,
the computer can be programmed to put them in at the start of the rest of the number, but
you would have to say this in your answer. As we are not going to do any arithmetic with
these numbers, why make life more complicated than necessary.
3 characters were allowed for form. If in doubt, give an example of what you mean by the
data, it can’t hurt and it may save you a mark in the question.

1.4.11 Sizing a File

We have just designed the record format for a given situation. It may be necessary to
calculate how large the file is going to be.
Having decided on the size of each field, it is a simple matter of adding up the individual
field sizes to get the size of a record, in this case 37 bytes.
There are 50 students going on the trip, each of them having their own record, so the size
of the data in the file will be 50 * 37 = 1,850 bytes.
All files need a few extra pieces of information that the user may not see such as
information at the start of the file saying when it was last updated, which file it is, is it
protected in any way? These sort of extra pieces of information are known as overheads,
and it is usual to add 10% to the size of a file because of the need for overheads.
Therefore the size of the file is 1,850 bytes + (10% of 1,850 bytes) =2,035 bytes.
The final stage is to ensure that the units are sensible for the size of the file.
There are 1024 bytes in 1 Kbyte, so the size of this file is 2,035/1024 =
1.99Kbytes.
Note: Don’t worry about dividing by 1024, because, after all, this is only an approximation
anyway. If you gave the final answer as 2 (approx) then that is just as acceptable. Just
make sure that you write down somewhere that you know that there are 1024 bytes in a
Kbyte, otherwise you can’t be given the mark for knowing that.
1.4.12 Access Methods to Data

Computers can store large volumes of data. The difficulty is to be able to get it back. In
order to be able to retrieve data it must be stored in some sort of order. Imagine the
phone book. It contains large volumes of data which can be used, fairly easily, to look up a
particular telephone number because they are stored in alphabetic order of the
subscriber’s name. Imagine how difficult it would be to find a number if they had just been
placed in the book at random. The value of the book is not just that it contains all the data
that may be needed, but that it has a structure that makes it accessible. Similarly, the
structure of the data in a computer file is just as important as the data that it contains.
There are a number of ways of arranging the data that will aid access under different
circumstances.

Serial access.
Data is stored in the computer in the order in which it arrives. This is the simplest form of
storage, but the data is effectively unstructured, so finding it again can be very difficult.
This sort of data storage is only used when it is unlikely that the data will be needed again,
or when the order of the data should be determined by when it is input. A good example of
a serial file is what you are reading now. The characters were all typed in, in order, and
that is how they should be read. Reading this book would be impossible if all the words
were in alphabetic order. Another example of the use of a serial file will be seen in section
1.4.15.

Sequential access.
In previous sections of this chapter we used the example of a set of students whose data
was stored in a computer. The data was stored in alphabetic order of their name. It could
have been stored in the order that they came in a Computing exam, or by age with the
oldest first. However it is done the data has been arranged so that it is easier to find a
particular record. If the data is in alphabetic order of name and the computer is asked for
Zaid’s record it won’t start looking at the beginning of the file, but at the end, and
consequently it should find the data faster.
A file of data that is held in sequence like this is known as a sequential file.

Indexed sequential.
Imagine a large amount of data, like the names and numbers in a phone book. To look up
a particular name will still take a long time even though it is being held in sequence.
Perhaps it would be more sensible to have a table at the front of the file listing the first
letters of peoples’ names and giving a page reference to where those letters start. So to
look up Jones, a J is found in the table which gives the page number 232, the search is
then started at page 232 (where all the Js will be stored). This method of access involves
looking up the first piece of information in an index which narrows the search to a smaller
area, having done this, the data is then searched alphabetically in sequence. This type of
data storage is called Index Sequential.

Random access.
A file that stores data in no order is very useful because it makes adding new data or
taking data away very simple. In any form of sequential file an individual item of data is
very dependent on other items of data. Jones cannot be placed after Monks because that is
the wrong ‘order’. However, it is necessary to have some form of order because otherwise
the file cannot be read easily. What would be wonderful is if, by looking at the data that is
to be retrieved, the computer can work out where that data is stored. In other words, the
user asks for Jones’ record and the computer can go straight to it because the word Jones
tells it where it is being stored. How this can be done is explained in section 1.4.13.
1.4.13 Implementation of File Access Methods

This section is about how the different access methods to data in files can be put into
practice. There will not be a lot of detail, and some questions will remain unanswered, don’t
worry because those will appear in further work.

Serial access.
Serial files have no order, no aids to searching, and no complicated methods for adding
new data. The data is simply placed on the end of the existing file and searches for data
require a search of the whole file, starting with the first record and ending, either with
finding the data being searched for, or getting to the end of the file without finding the
data.

Sequential access.
Because sequential files are held in order, adding a new record is more complex, because
it has to be placed in the correct position in the file. To do this, all the records that come
after it have to be moved in order to make space for the new one.
e.g. A section of a school pupil file might look like this
…
James, Amanda, 21 Church St, F, 21/03/……..
Jones, David, 134 New Gardens, M, ………
Marsden, Thomas, ………..
Newsome, Claire, ………..
…
If a new pupil arrives whose name is Johnson, space must be found between James and
Jones. To do this all the other records have to be moved down one place, starting with
Newsome, then Marsden, and then Jones.
…
James, Amanda, 21 Chur…………

Jones, David, 134 Ne……….
Marsden, Thomas, …….
Newsome, Claire, ……..
This leaves a space into which Johnson’s record can be inserted and the order of the
records in the file can be maintained.
Having to manipulate the file in this way is very time consuming and consequently this type
of file structure is only used on files that have a small number of records or files that
change very rarely.
Larger files might use this principle, but would be split up by using indexing into what
amounts to a number of smaller sequential files.
e.g. the account numbers for a bank’s customers are used as the key to access the
customer accounts. The accounts are held sequentially and there are approximately 1
million accounts. There are 7 digits in an account number.
Indexes could be set up which identify the first two digits in an account number. Dependent
on the result of this first index search, there is a new index for the next two digits, which
then points to all the account numbers, held in order, that have those first four digits.
There will be one index at the first level, but each entry in there will have its own index at
the second level, so there will be 100 indexes at the second level. Each of these indexes
will have 100 options to point to, so there will be 10,000 blocks of data records. But each
block of records will only have a maximum of 1000 records in it, so adding a new record in
the right place is now manageable which it would not have been if the 1million records
were all stored together.
00
01
00 02
01 …
02 99
...
… 00 0102000 DATA
99 01 0102001
02 0102002
One … 0102003
First level index 99 ……….
(first two digits 0102999
in account number)

100 10,000
Second level indexes Final index blocks
(third and fourth digits each containing up to
in account number) 1000 account numbers

Random access.
To access a random file, the data itself is used to give the address of where it is stored.
This is done by carrying out some arithmetic (known as pseudo arithmetic because it
doesn’t make much sense) on the data that is being searched for.
E.g. imagine that you are searching for Jones’ data.
The rules that we shall use are that the alphabetic position of the first and last letters in the
name should be multiplied together, this will give the address of the student’s data.
So Jones = 10 * 19 = 190. Therefore Jones’ data is being held at address 190 in memory.
This algorithm is particularly simplistic, and does not give good results, as we shall soon
see, but it illustrates the principle. Any algorithm can be used as long as it remains the
same for all the data.
This type of algorithm is known as a HASHING algorithm.
The problem with this example can be seen if we try to find James’s data.
James = 10 * 19 = 190. The data for James cannot be here because Jones’s data is here.
This is called a CLASH. When a clash occurs, the simple solution is to work down
sequentially until there is a free space. So the computer would inspect address 191, and if
that was being used, 192, and so on until a blank space. The algorithm suggested here will
result in a lot of clashes which will slow access to the data. A simple change in the
algorithm will eliminate all clashes. If the algorithm is to write down the alphabetic position
of all the letters in the name as 2 digit numbers and then join them together there could be
no clashes unless two people had the same name.
e.g. Jones = 10, 15, 14, 05, 19 giving an address 1015140519
James = 10, 01, 13, 05, 19 giving an address 1001130519
The problem of clashes has been solved, but at the expense of using up vast amounts of
memory (in fact more memory than the computer will have at its disposal). This is known
as REDUNDANCY. Having so much redundancy in the algorithm is obviously not acceptable.
The trick in producing a sensible hashing algorithm is to come up with a compromise that
minimizes redundancy without producing too many clashes.
1.4.14 Selection of Data Types and Structures

Data types.
When the computer is expected to store data, it has to be told what type of data it is going
to be because different types of data are stored in different areas of memory. In addition
to the types of data that we have already described, there are other, more specialised,
data types. Most can be covered by calling them characters (or string data, which is just a
set of characters one after the other), but there are two others that are useful. Currency
data is, as the name suggests, set up to deal with money. It automatically places two digits
after the point and the currency symbol in. The other is date, this stores the date in either
6 or 8 bytes dependent on whether it is to use 2 or 4 digits for the year. Care should be
taken with the date because different cultures write the three elements of a date in
different orders, for example, Americans put the month first and then the day.

Data structures
Students should be able to justify the use of a particular type of structure for storing data
in given circumstances. Questions based on this will be restricted to the particular
structures mentioned in 1.4.7 and 1.4.8 and will be non-contentious. E.g. Jobs are sent to a
printer from a number of sources on a network. State a suitable data structure for storing
the jobs that are waiting to be printed giving a reason for your answer.
Answer: A queue, because the next one to be printed should be the one that has been
waiting longest.
1.4.15 Backing up and Archiving Data

Backing up data.
Data stored in files is very valuable. It has taken a long time to input to the system, and
often, is irreplaceable. If a bank loses the file of customer accounts because the hard disk
crashes, then the bank is out of business.
It makes sense to take precautions against a major disaster. The simplest solution is to
make a copy of the data in the file, so that if the disk is destroyed, the data can be
recovered. This copy is known as a BACK-UP. In most applications the data is so valuable
that it makes sense to produce more than one back-up copy of a file, some of these copies
will be stored away from the computer system in case of something like a fire which would
destroy everything in the building.
The first problem with backing up files is how often to do it. There are no right answers, but
there are wrong ones. It all depends on the application. An application that involves the file
being altered on a regular basis will need to be backed up more often than one that is very
rarely changed (what is the point of making another copy if it hasn’t changed since the
previous copy was made?). A school pupil file may be backed up once a week, whereas a
bank customer file may be backed up hourly.
The second problem is that the back-up copy will rarely be the same as the original file
because the original file keeps changing. If a back up is made at 9.00am and an alteration
is made to the file at 9.05am, if the file now crashes, the back up will not include the
change that has been made. It is very nearly the same, but not quite. Because of this, a
separate file of all the changes that have been made since the last back up is kept. This file
is called the transaction log and it can be used to update the copy if the original is
destroyed. This transaction log is very rarely used. Once a new back up is made the old
transaction log can be destroyed. Speed of access to the data on the transaction log is not
important because it is rarely used, so a transaction log tends to use serial storage of the
data and is the best example of a serial file if an examination question asks for one.

Archiving data.
Data sometimes is no longer being used. A good example would be in a school when pupils
leave at the end of the sixth form. All their data is still on the computer file of pupils taking
up valuable space. It is not sensible to just delete it, there are all sorts of reasons why the
data may still be important, for instance a past pupil may ask for a reference. If all the
data has been erased it may make it impossible to write a sensible reference. Data that is
no longer needed on the file but may be needed in the future should be copied onto long
term storage medium and stored away in case it is needed. This is known as producing an
ARCHIVE of the data. (Schools normally archive data for 7 years before destroying it).
Note: Archived data is NOT used for retrieving the file if something goes wrong, it is used
for storing little used or redundant data in case it is ever needed again, so that space on
the hard drive can be freed up.
Example Questions

1. a) Express the number 113 (denary) in
(i) binary
(ii) in BCD
using an appropriate number of bytes in each case. (4)
b) Using the answer obtained in part (a) show how 113 (denary) can be expressed in
(i) octal
(ii) hexadecimal. (4)
A. a) (i) 128 64 32 16 8 4 2 1
0 1 1 1 0 0 0 1 =01110001.
(ii) 1 = 0001
1 = 0001
3 = 0011
Therefore 113 = 0000000100010011
b) (i) 113 = 001 110 001 in binary
= 1 6 1 in octal.
(ii) 113 = 0111 0001 in binary
= 7 1 in hexadecimal.
Notice: (i) and (ii) It was necessary to show a method of working out. Many students have
calculators that will automatically change from one number system to another, so it is
necessary to show that you know what you are doing, even if you use a calculator to check
the results. Also, the question stated that the appropriate number of bytes be used, in part
(i) this is obviously 8 bits and is an easy mark, but in part (ii) it is necessary to add a set of
zeros to the front of the answer to make it a whole number of bytes.
(iii) and (iv) The question stated that the first answer had to be used, so one of the two
marks is going to be given for showing the relationship between binary and each of these
two representations. Notice that for the octal answer it was necessary to add a 0 to the
front of the binary number to give 9 bits (3 lots of 3).

2. Explain how the denary number –27 can be represented in binary in
(i) sign and magnitude
(ii) two’s complement
notation, using a single byte for each answer. (4)
A. (i) +/- 64 32 16 8 4 2 1
1 0 0 1 1 0 1 1 = 10011011
(ii) –27 = -128 + 101
= -128 + (+64 +32 +4 +1)
-128 64 32 16 8 4 2 1
1 1 1 0 0 1 0 1 = 11100101
Notice that the question asked “Explain…”, so just writing the answer down is not
acceptable. There will be a mark for showing the column headings, particularly the value of
the MSB in each case.

3. Add together the binary equivalents of 34 and 83, using single byte arithmetic, showing
you working. (3)
A. 34 = 0 0 1 0 0 0 1 0
83 = 0 1 0 1 0 0 1 1 +
0 1 1 1 0 1 0 1 = 117
1
Note that the question asked for the working. The part that shows that you are capable of
doing the arithmetic is the carry, don’t miss it out. In a question like this try to ask yourself
what evidence the examiner is expecting for the mark. The other marks are for using 8
bits for each value, and for the answer.

4. Describe how characters are stored in a computer. (3)
A. -Each character is given a code…
-as a binary number
-Each character code occupies one byte of data
-Typically one byte is eight bits
-Most computers use a standard set of characters like the ASCII set.
There are lots of marks available here. The question has been left deliberately open so
that there are plenty of mark points available.

5.a) Explain the difference between integer and floating point data types. (2)
b) State what is meant by Boolean data. (1)
A. a)-Integers are whole numbers
-Floating point numbers allow for the storage of fractions.
b)-Boolean data is data that can exist in two states, e.g. true or false.
Note that in part (a) there may be lots of other differences that you know, but they are not
really part of this syllabus. If your answer is valid though, it will be given the marks. In
part (b), this is another standard definition, don’t try to elaborate too much or you may
wander too far from the answer.

6. An array is to be used to store information.
State three parameters that need to be given about the array before it can be used,
explaining the reason why each is necessary. (6)
A. -The size of the array (how many data items it will hold)…
-so that this amount of space can be reserved in memory.
-The type of data to be stored in the array…
-so that it can be set up in the correct area of memory.
-The name of the array…
-so that it can be identified when it needs to be used.
-The number of dimensions of the array…
-so that the computer knows what form the data is going to be stored in.
Note: these marks go in pairs. Don’t worry if you are not happy with the last point, this is
a more advanced feature of arrays – look at the notes on section 1.4.6.

7. A garden centre stores details of each of the types of plant that it has for sale on a
computer system. The details of the plants are stored in alphabetical order in a linked list.
a)By drawing a diagram show how the plants are arranged in the list. You may use the
following plants to illustrate your answer
Pansy, Dahlia, Clematis, Sweet pea. (4)
b)Describe how a new linked list, of plants that like shaded conditions, can be created
from the original one. (4)
A. a)
Head of list table Dahlia
(Start Pointer) Clematis
Sweet pea XX
Pansy
Mark points:
-Some start position for list
-List in alphabetic order
-Use of pointers
-Use of null pointer to finish list.
b)-Create new linked list by creating a new start point.
-Read values from old list in turn
-If plant in old linked list likes shade then…
-add to new list and put in pointer.
-Repeat until null pointer (end of list)
-Place null pointer after last entry in new list
Note: The first part is fairly simple. Stick to the simple type of diagram and don’t forget
how to start and end the list. The second part is more difficult because it means a certain
degree of analysis of the problem to be able to solve it. This is why there are rather more
mark points available. When dealing with linked lists, if you draw a diagram and remember
to start and finish it, you won’t go far wrong.

8. A stack is being held in an array. Items may be read from the stack or added to the
stack.
a) State a problem that may arise when
(i) adding a new value to the stack
(ii) reading a value from the stack. (2)
b) Explain how the stack pointer can be used by the computer to recognise when such
problems may occur. (2)
A. a)(i) The array may be full, consequently no new value can be entered.
(ii) The array may be empty, there is no value to be read.
b)(i) The stack pointer will be pointing outside the array.
(ii) The stack pointer will be pointing at the first location in the array.
Notes. When discussing stacks and queues it is important to have a picture in your mind of
what it looks like. The simplest picture is to imagine them being held in an array. This is
not the only way to store them and you may have been shown other methods, but the
array is perfectly adequate for this course.

9. A library stores details of the books that are available.
a) Apart from title and author, state 3 other fields that it would be sensible for the library
to store in this file, giving a reason why each of your chosen fields would be necessary. (6)
b) State which field would be used as the key field of the record and explain why a key
field is necessary. (2)
c) State the size of each of the fields in your record. (2)
d) If the library stores approximately 20,000 books, estimate the size the book file. (3)
A. a)One mark for each of three sensible fields with an extra mark for an explanation of
the need for that field.
E.g. ISBN/to identify book
Shelf number/ to allow for ease of search for book
Fiction or reference or childrens’ (some form of category)/ to decide whereabouts in
library it should go
b)-Book number (ISBN)
-because it is unique to that record and hence can be used as an identifier.
c) Title 40 bytes
Author 20 bytes
ISBN 10 bytes
Shelf 3 bytes
Category 1 byte
Mark points:
-Use of bytes
-Sensible field sizes.
d)Total size of one record = 74 bytes
-Size of file = 74 * 20000 = 1480000 bytes
-+ Overheads of 10% = 1,480,000 + 148,000 = 1,628,000 bytes.
- (Divide by 1024 until sensible units) = 1.55 Mbytes.
Notes: In part (a), any sensible fields (can you justify it?) are acceptable. For instance,
Publisher/so that more copies may be ordered, would be a sensible field. In part (b) you
would be expected to provide the answer ISBN, although an explanation would be accepted
if it was clear what was meant. Try to be aware of the standard key fields (account
numbers, ISBN, barcodes, school numbers). The sizes of the fields are only given as
guidance. If you suggested 100 bytes for the title it would be accepted as being
reasonable. On the other hand 1000 bytes would be way over the top and 10 bytes would
be unreasonably small. Don’t worry about trying to get the ‘correct’ answer, there isn’t
one, just make it sensible. There is a correct answer for the ISBN, it is 10 characters long,
but it is not part of your syllabus to learn things like that off by heart, so as long as you
don’t say something silly like 30 bytes, which demonstrates that you don’t know what an
ISBN is, you will get the mark. In part (d) make sure that you convert your answer to
sensible units. Don’t worry about using a calculator, as long as you show that you did know
the correct method is to divide by 1024, an approximate answer is all that is necessary. In
this case it would be quite acceptable to write “Keep dividing by 1024, this gives
approximately 1.6 Mbytes”. When you answer this question your answer is highly unlikely
to be the same as this, don’t worry, the examiner will mark your answer and not expect it
to be the same as a model answer.

10. a)Explain the difference between a serial file and a sequential file. (2)
b)Describe what is meant by a hashing algorithm and explain why such an algorithm can
lead to clashes. (3)
A. a)-Serial file holds data in the order in which it was received.
-Sequential file holds the data according to some order defined on the data.
b)-A hashing algorithm is pseudo arithmetic…
-carried out on the data…
-in order to determine the location of the data in the memory.
-A clash occurs when the answer to the pseudo arithmetic is the same despite the data
used in the calculation being different.
Note: This is a question where it is very easy to try to answer too much. Restrict yourself
to what the question is asking. Many candidates faced with part (b) will go into great detail
about what to do after clashes have occurred, this is not required and will earn no marks.

11. A library keeps both a book file and a member file. The library does a stock take twice
a year and orders new books only once a year. Members can join or cancel their
membership at any time.
a) Describe how the library can implement a sensible system of backing up their files. (4)
b) Explain the part that would be played by archiving in the management of the files. (4)
A. a)-Book file needs to backed up twice (or three) times a year…
-when the stock take or book purchasing has made the file alter.
-Member file needs backing up daily(at least weekly) because of constant changes…
-would also need to keep a transaction log for the member file.
-Back up copies would be stored away from the building with the computer system in it to
ensure that a copy of files survived in case of fire.
-Multiple copies of the book file would be made
-Member file copies and transaction logs may be kept for a number of back up periods.
b)-When books are discovered to be missing, or if a book is replaced by a more
up-to-date edition, the old records should be kept…
-but they are no longer live so are taken from the hard disk…
-after a copy (archive) has been made.
-When members leave the library the data should be archived…
-also when they have not taken a book out for a long period of time their record can be
considered to be dormant.
Notes: Backing up and archiving are topics which cause confusion among candidates.
There really is no need for such confusion if you remember that archiving has nothing to
do with recovery after a disk failure. Note that there are plenty of mark points available in
this question.

Note: This chapter, or section of the syllabus, is by far the largest portion of module 1,
and consequently, candidates should expect a higher proportion of marks on the exam
paper to relate to this work than to the other sections.
Back