|
Data: Its Representation, Structure and Management 1.4.1 Number Systems Counting is one of the first skills that a young child masters, and none of us consider counting from 1 to 100 difficult. However, to count, we have to learn, by heart, the meanings of the symbols 0,1,2…9 and also to understand that two identical symbols mean totally different things according to their ‘place’ in the number. For instance, in 23 the 2 actually means 2 * 10. But why multiply by 10? Why not multiply by 6? The answer is simply that we were taught to do that because we have 10 fingers, so we can count on our fingers until we get to the last one, which we remember in the next column and then start again. We don’t need to count in tens. The ancient Babylonians counted in a system, which is similar to counting in sixties. This is very difficult to learn because of all the symbols needed, but we still use a system based on sixties today: 60 minutes = 1 hour; 60 seconds = 1 minute; 6 * 60 degrees = 1 revolution. Instead of increasing the number of symbols in a system, which makes the system more difficult, it seems reasonable that if we decrease the number of symbols the system will be easier to use. A computer is an electronic machine. Electricity can be either on or off. If electricity is not flowing through a wire that can stand for 0. If electricity is flowing, then it stands for 1. The difficulty is what to do for the number 2. We can’t just pump through twice as much electricity, what we need is a carry system, just like what happens when we run out of fingers. What we need is another wire. The ‘units’ wire no electricity 0 = 0 The ‘twos’ wire no electricity 0 ADD 1 electricity 1 = 1 no electricity 0 ADD 1 no electricity 0 = 2 Carry electricity 1 ADD 1 electricity 1 =3 electricity 1 ADD 1 no electricity 0 Carry no electricity 0 =4 Carry electricity 1 The computer can continue like this for ever, just adding more wires when it gets bigger numbers. This system, where there are only two digits, 0 and 1, is known as the binary system. Each wire, or digit, is known as a binary digit. This name is normally shortened to BIT. So each digit, 0 or 1, is one bit. A single bit has very few uses so they are grouped together. A group of bits is called a BYTE. Usually a byte has 8 bits in it. In section 4 of this chapter we will see that bytes can be of different sizes, but for now we will say that there are 8 bits in a byte. The first thing we must be able to do with the binary system is to change numbers from our system of 10 numbers (the denary system) into binary, and back again. There are a number of methods for doing this, the simplest being to use the sort of column diagrams, which were used in primary school to do simple arithmetic Thousands Hundreds Tens Units except, this time we are using binary, so the column headings go up in twos instead of tens 32s 16s 8s 4s 2s units To turn a denary number into a binary number simply put the column headings, start at the left hand side and follow the steps: If the column heading is less than the number, put a 1 in the column and then subtract the column heading from the number. Then start again with the next column on the right. If the column heading is greater than the number, put a 0 in the column and start again with the next column on the right. Note: You will be expected to be able to do this with numbers up to 255, because that is the biggest number that can be stored in one byte of eight bits. e.g. Change 117 (in denary) into a binary number. Answer: Always use the column headings for a byte (8 bits) 128 64 32 16 8 4 2 1 Follow the algorithm. 128 is greater than 117 so put a 0 and repeat. 128 64 32 16 8 4 2 1 0 64 is less than 117 so put a 1. 128 64 32 16 8 4 2 1 0 1 Take 64 from 117 = 53, and repeat. 32 is less than 53, so put a 1. 128 64 32 16 8 4 2 1 0 1 1 Take 32 from 53 = 21, and repeat. If you continue this the result (try it) is 128 64 32 16 8 4 2 1 0 1 1 1 0 1 0 1 So 117 (in denary) = 01110101 (in binary). To turn a binary number into denary, simply put the column headings above the binary number and add up all the columns with a 1 in them. e.g. Change 10110110 into denary. Answer: 128 64 32 16 8 4 2 1 1 0 1 1 0 1 1 0 So 10110110 = 128 + 32 + 16 + 4 + 2 = 182 (in denary). This principle can be used for any number system, even the Babylonians’ sixties if you can learn the symbols. e.g. If we count in eights (called the OCTAL system) the column headings go up in 8’s. 512 64 8 1 So 117 (in denary) is 1 lot of 64, leaving another 53. 53 is 6 lots of 8 with 5 left over. Fitting this in the columns gives 512 64 8 1 0 1 6 5 So 117 in denary is 165 in octal. Why bother with octal? Octal and binary are related. If we take the three digits of the octal number 165 and turn each one into binary using three bits each we get 1 = 001 6 = 110 5 = 101 Put them together and we get 001110101 which is the binary value of 117 which we got earlier. The value of this relationship is not important now, but it is the reason why octal is in the syllabus. Another system is called HEXADECIMAL (counting in 16’s). This sounds awful, but just use the same principles. 256 16 1 So 117 (in denary) is 7 lots of 16 (112) plus an extra 5. Fitting this in the columns gives 256 16 1 0 7 5 Notice that 7 in binary is 0111 and that 5 is 0101, put them together and we get 01110101 which is the binary value of 117 again. So binary, octal and hexadecimal are all related in some way. There is a problem with counting in 16’s instead of the other systems. We need symbols going further than 0 to 9 (only 10 symbols and we need 16!). We could invent 6 more symbols but we would have to learn them, so we use 6 that we already know, the letters A to F. In hexadecimal A stands for 10, B stands for 11 and so on to F stands for 15. So a hexadecimal number BD stands for 11 lots of 16 and 13 units = 176 + 13 = 189 ( in denary) Note: B = 11, which in binary = 1011 D = 13, which in binary = 1101 Put them together to get 10111101 = the binary value of 189. Binary Coded Decimal Some numbers are not proper numbers because they don’t behave like numbers. A barcode for chocolate looks like a number, and a barcode for sponge cake looks like a number, but if the barcodes are added together the result is not the barcode for chocolate cake. The arithmetic does not give a sensible answer. Values like this that look like numbers but do not behave like them are often stored in binary coded decimal (BCD). Each digit is simply changed into a four bit binary number which are then placed after one another in order. e.g. 398602 in BCD Answer: 3 = 0011 9 = 1001 8 = 1000 6 = 0110 0 = 0000 2 = 0010 So 398602 = 001110011000011000000010 (in BCD) Note: All the zeros are essential otherwise you can’t read it back. 1.4.2 Negative Integers If a computer system uses a byte to store a number in the way that was suggested in 1.4.1 there are three problems that arise. The first is that the biggest number that can be represented is 255 because there aren’t enough bits to store bigger numbers. This is easily solved by using more than one byte to represent a number. Most computer systems use either two or four bytes to store a number. There is still a limit on the size that can be represented, but it is now much larger. The second problem is not so easy to solve, how to represent fractions. This will be looked at later in this chapter. The third problem is how to store negative numbers. The example we used for binary storage was 117 which becomes 01110101 in binary. If we want to store +117 or –117, these numbers need a second piece of data to be stored, namely the sign. There are two simple ways to store negative numbers. Sign and Magnitude. Use the first bit in the byte (the most significant bit (MSB)) to represent the sign (0 for + and 1 for -) instead of representing 128. This means that +117 = 01110101 and -117 = 11110101 Notes: The range of numbers possible is now –127 to +127. The byte does not represent just a number but also a sign, this makes arithmetic difficult. Two’s Complement The MSB stays as a number, but is made negative. This means that the column headings are -128 64 32 16 8 4 2 1 +117 does not need to use the MSB, so it stays as 01110101. -117 = -128 + 11 = -128 + (8 + 2 + 1) fitting this in the columns gives 10001011 Two’s complement seems to make everything more complicated for little reason at the moment, but later it becomes essential for making the arithmetic easier. 1.4.3 Binary Arithmetic The syllabus requires the addition of two binary integers, and the ability to take one away from another. The numbers and the answers will be limited to one byte. Addition. There are four simple rules 0 + 0 = 0 0 + 1 = 1 1 + 0 = 1 and the difficult one 1 + 1 = 0 (Carry 1) e.g. Add together the binary equivalents of 91 and 18 Answer: 91 = 0 1 0 1 1 0 1 1 18 = 0 0 0 1 0 0 1 0 + 0 1 1 0 1 1 0 1 = 109 1 1 Subtraction. This is where two’s complement is useful. To take one number away from another, simply write the number to be subtracted as a two’s complement negative number and then add them up. e.g. Work out 91 – 18 using their binary equivalents. Answer: 91 = 01011011 -18 as a two’s complement number is –128 + 110 = -128 +(+64 +32 +8 +4 +2) = 11101110 Now add them 0 1 0 1 1 0 1 1 1 1 1 0 1 1 1 0 + 1 0 1 0 0 1 0 0 1 1 1 1 1 1 1 1 But the answer can only be 8 bits, so cross out the 9th bit giving 01001001 = 64 + 8 + 1 = 73. Notes: Lots of carrying here makes the sum more difficult, but the same rules are used. One rule is extended slightly because of the carries, 1+1+1 = 1 (carry 1) Things can get harder but this is as far as the syllabus goes. 1.4.4 Character Sets. There are other types of data that have to be stored in computers besides numbers, for instance, the letters of the alphabet. These are stored as codes which look like binary numbers. For instance A could be stored as 000, B as 001 and so on. Unfortunately, there are only 8 possible codes using 3 bits, so we could store the letters A to H but not the rest, and what about the lower case letters and punctuation and…? The computer can store as many characters as necessary simply by using more and more bits for the code. Some systems don’t need to be able to recognise a lot of characters so they only use a few bits for each character. The number of bits needed to store one character is called a byte which is usually said to have 8 bits because most systems use 8 bits to store the code for each character. We now have enough codes, but another problem arises. If my computer stores A as 01000001 and your computer stores A as 01000010 then the computers cannot communicate because they cannot understand each other’s codes. In the 1960’s a meeting in America agreed a standard set of codes so that computers could communicate with each other. This standard set of codes is known as the ASCII set. Most systems use ASCII so you can be fairly sure that when you type in A it is stored in the computer’s memory as 01000001. However, you can’t be sure because some systems use other codes. A less common code is called EBCDIC, it was developed for use by larger scale computer systems and differs in that the code for each character is different than that used in ASCII. Notes: All the characters that a system can recognise are called its character set. ASCII uses 8 bits so there are 256 different codes that can be used and hence 256 different characters. (This is not quite true, we will see why in chapter 1.6.) A problem arises when the computer retrieves a piece of data from its memory. Imagine that the data is 01000001. Is this the number 65, or is it A? They are both stored in the same way, so how can it tell the difference? The answer is that characters and numbers are stored in different parts of the memory, so it knows which one it is by knowing whereabouts it was stored. 1.4.5 Data Types The computer needs to use different types of data in the operation of the system. All of these different types of data will look the same because they all have to be stored as binary numbers. The computer can distinguish one type of data from another by seeing whereabouts in memory it is stored. Numeric data. There are different types of numbers that the computer must be able to recognise. Numbers can be restricted to whole numbers, these are called INTEGERS and are stored by the computer as binary numbers using a whole number of bytes. It is usual to use either 2 bytes (called short integers) or 4 bytes (called long integers), the difference being simply that long integers can store larger numbers. Section 1.4.2 described how negative integers could be stored, these have to be treated as different types of data by the computer because the MSB of the data stands for something different than in an ordinary unsigned integer. Another type of number is a fraction (or decimal). This is stored by changing the position of the point so that it is always at the front of the number and then storing the fraction that is left (there is no whole part because the point comes first). However, 2.3 and .23 will now both be stored as .23 so we need an extra piece of data to be stored, the number of places the point has had to move. Fractions, then, are stored in two parts: the fraction part (called the mantissa) and the number of places the point has had to move (called the exponent). Because the first stage is always to move the point to the front of the number, and it is like the point “floating to the top”, fractions are stored in FLOATING POINT form. Obviously there is a lot missing from this explanation, but that is more than enough for this part of the course. Simply remember that fractions are stored in binary with the point at the front, and then a separate part is the number of places the point has had to move to get it to the front. Boolean data Sometimes the answer to a question is either yes or no, true or false. There are only two options. The computer uses binary data which consists of bits of information that can be either 0 or 1, so it seems reasonable that the answer to such questions can be stored as a single bit with 1 standing for true and 0 standing for false. Data which can only have two states like this is known as BOOLEAN data. A simple example of its use would be in the control program for an automatic washing machine. One of the important pieces of information for the processor would be to know whether the door was shut. A boolean variable could be set to 0 if it was open and to 1 if it was shut. A simple check of that value would tell the processor whether it was safe to fill the machine with water. Characters A character can be anything, which is represented in the character set of the computer by a character code in a single byte. 1.4.6 Arrays Data stored in a computer is stored at any location in memory that the computer decides to use. This means that similar pieces of data can be scattered all over memory. This, in itself, doesn’t matter to the user, except that to find each piece of data it has to be referred to by a variable name. e.g. If it is necessary to store the 20 names of students in a group then each location would have to be given a different variable name. The first, Jane, might be stored in location NAME, the second, John, might be stored in FORENAME, the third, Penny, could be stored in CHRINAME, but I’m now struggling, and certainly 20 different variable names that made sense will be very taxing to come up with. Apart from anything else, the variable names are all going to have to be remembered. Far more sensible would be to force the computer to store them all together using the variable name NAME. However, this doesn’t let me identify individual names, so if I call the first one NAME(1) and the second NAME(2) and so on, it is obvious that they are all peoples’ names and that they are distinguishable by their position in the list. Lists like this are called ARRAYS. Because the computer is being forced to store all the data in an array together, it is important to tell the computer about it before it does anything else so that it can reserve that amount of space in its memory, otherwise there may not be enough space left when you want to use it. This warning of the computer that an array is going to be used is called INITIALISING the array. Initialising should be done before anything else so that the computer knows what is coming. Initialising consists of telling the computer what sort of data is going to be stored in the array so that the computer knows what part of memory it will have to be stored in how many items of data are going to be stored, so that it knows how much space to reserve the name of the array so that it can find it again. Different programming languages have different commands for doing this but they all do the same sort of thing, a typical command would be DIM NAME$(20) DIM is a command telling the computer that an array is going to be used NAME is the name of the array $ tells the computer that the data is going to be characters (20) tells it that there are going to be up to 20 pieces of data. Notes: Just because the computer was told 20 does not mean that we have to fill the array, the 20 simply tells the computer the maximum size at any one time. The array that has been described so far is really only a list of single data items. It would be far more useful if each student had a number of pieces of information about them, perhaps their name, address, date of birth. The array would now have 20 students and more than one thing about each, this is called a two dimensional array. Obviously everything gets more complicated now, but don’t worry as use of multi-dimensional arrays is for later in the course. We should now have a picture of a part of memory which has been reserved for the array NAME$ Jane Name$(1) John Name$(2) - - NAME$ - - - - Zaid Name$(20) To read data into the array simply tell the computer what the data is and tell it the position to place it in e.g. NAME$(11) = Thomas will place Thomas in position 11 in the array (incidentally, erasing any other data that happened to be in there first. To read data from the array is equally simple, tell the computer which position in the array and assign the data to another value e.g. RESULT = NAME$(2) will place John into a variable called RESULT. Searching for a particular person in the array involves a simple loop and a question e.g. search for Vera in the array NAME$ Answer: Counter = 1 While Counter is less than 21, Do If NAME$(Counter) = Vera Then Print “Found” and End. Else Add 1 to Counter Endwhile Print “Name not in array” Try to follow the logic of the steps above. 1.4.7 Linked Lists When data is stored in a computer the processor can store it in any location provided it can get the data back. In order to get the data back, each of the locations where the data is stored is given an address so that if the computer can remember the address of where it put some information it can easily retrieve it. If everything has to be in the index then the index can get very large, it seems reasonable to try to cut down the size of the index by grouping things together under one index entry. One method for doing this is by using an array, the twenty people in the set could all be found by the one reference to NAME which pointed to the location of the array. There are two major problems with arrays. The first is that if the size of the set grows because a new student joins there is no room in the array to store the information. This is because the array size has to be predetermined. The array could be made much bigger, say size 50, so that we are sure it will never be too small, but this leads to the second problem, that most of this space will never be used, consequently wasting valuable memory. These problems can be overcome by using a linked list. A linked list of data items tells the computer to store the data in any location and to link it to the previous data item by giving the previous data item the address of the new one. That sounds very complex, the idea is simple if we look at it in diagram form. Imagine the list of names used in the example for the array. Address Data Next address Start address 60 Penny 75 61 61 Jane 86 62 63 Zaid XX - - 85 86 John 60 - The diagram is intended to show a portion of memory. Each of the locations that can hold data are given an address, and the address of the first data in the list is stored separately. This tells the computer where the first piece of data is. If the first data item is not the one that is wanted, the computer goes to the next piece of data by using the address that is stored with the data. These addresses are called pointers, the last one of which has to be one that cannot possibly exist so that the computer knows it has reached the end of the list. Notice that the two problems associated with arrays no longer exist. Linked lists are often drawn showing the data in boxes and the linking addresses as pointers. This is the better method for describing a list, just don’t lose sight of the fact that the arrows are really address references. Start Jane Penny John Zaid XX Note: The jagged line signifies that there are a number of others which would fit in there, but they are not shown. To initialise a list, all that needs to be done is to create a new start pointer for this list and add it to the index of start pointers for all the other lists. To search through the list for a particular piece of data follow these rules 1. Find the correct list in the index of lists 2. Follow the pointer to the next item 3. If the item is the one being searched for, report that it is found and end. 4. If the pointer shows that the end of the list has been reached, report that the item is not there and end. 5. Go to step 2. To remove a value from a list, simply change the pointer that points to it into one that points to the next value after it. E.g. to remove John from the example Start Jane Penny John Zaid XX Note that John’s data is still there, its just that there is no way of getting to it so it might just as well not be. 1.4.8 Stacks and Queues Queues. Information arrives at a computer in a particular order, it may not be numeric, or alphabetic, but there is an order dependent on the time that it arrives. Imagine Zaid, Jane, John and Penny send jobs for printing, in that order. When these jobs arrive they are put in a queue awaiting their turn to be dealt with. It is only fair that when a job is called for by the printer that Zaid’s job is sent first because his has been waiting longest. These jobs are held, just like the other data we have been talking about, in an array. The jobs are put in at one end and taken out of the other. All the computer needs is a pointer showing it which one is next to be done (start pointer(SP)) and another pointer showing where the next job to come along will be put (end pointer(EP)) 1. Zaid is in the queue for printing, the end pointer is pointing at where the next job will go. 2. Jane’s job is input and goes as the next in the queue, the end pointer moves to the next available space. 3. Zaid’s job goes for printing so the start pointer moves to the next job, also John’s job has been input so the end pointer has to move. EP EP John EP Jane SP Jane SP Zaid SP Zaid 1. 2. 3. Notes: The array is limited in size, and the effect of this seems to be that the contents of the array are gradually moving up. Sooner or later the queue will reach the end of the array. The queue does not have to be held in an array, it could be stored in a linked list. This would solve the problem of running out of space for the queue, but does not feature in this course until the second year. Stacks. Imagine a queue where the data was taken off the array at the same end that it was put on. This would be a grossly unfair queue because the first one there would be the last one dealt with. This type of unfair queue is called a stack. A stack will only need one pointer because adding things to it and taking things off it are only done at one end 1. Zaid and Jane are in the stack. Notice that the pointer is pointing to the next space. 2. A job has been taken off the stack. It is found by the computer at the space under the pointer (Jane’s job), and the pointer moves down one. 3. John’s job has been placed on the stack in the position signified by the pointer, the pointer then moves up one. This seems to be wrong, but there are reasons for this being appropriate in some circumstances which we will see later in the course. Pointer Pointer Jane Pointer John Zaid Zaid Zaid 1. 2. 3. In a queue, the Last one to come In is the Last one to come Out. This gives the acronym LILO, or FIFO (First in is the first out). In a stack, the Last one In is the First one Out. This gives the acronym LIFO, or FILO (First in is the last out). 1.4.9 Files, Records, Items, Fields. Data stored in computers is normally connected in some way. For example, the data about the 20 students in the set that has been the example over the last three sections has a connection because it all refers to the same set of people. Each person will have their own information stored, but it seems sensible that each person will have the same information stored about them, for instance their name, address, telephone number, exam grades… All the information stored has an identity because it is all about the set of students, this large quantity of data is called a FILE. Each student has their own information stored. This information refers to a particular student, it is called their RECORD of information. A number of records make up a file. Each record of information contains the same type of information, name, address and so on. Each type of information is called a FIELD. A number of fields make up a record and all records from the same file must contain the same fields. The data that goes into each field, for example “Jane Smith”, “3 Canal St.” will be different in most of the records. The data that goes in a field is called an ITEM of data. Note: Some fields may contain the same items of data in more than one record. For example, there may be two people in the set who happen to be called Jane Smith. If Jane Smith’s brother Tom is in this set he will presumably have the same address as Jane. It is important that the computer can identify individual records, and it can only do this if it can be sure that one of the fields will always contain different data in all the records. Because of this quality, that particular field in the record is different from all the others and is known as the KEY FIELD. The key field is unique and is used to identify the record. In our example the records would contain a field called school number which would be different for each student. Note: Jane Smith is 10 characters (1 for the space), Christopher Patterson is 21 characters. It makes it easier for the computer to store things if the same amount of space is allocated to the name field in each record. It might waste some space, but the searching for information can be done more quickly. When each of the records is assigned a certain amount of space the records are said to be FIXED LENGTH. Sometimes a lot of space is wasted and sometimes data has to be abbreviated to make it fit. The alternative is to be able to change the field size in every record, this comes later in the course. 1.4.10 Record Formats. To design a record format, the first thing to do is to decide what information would be sensible to be stored in that situation. e.g. A teacher is taking 50 students on a rock-climbing trip. The students are being charged 20 pounds each and, because of the nature of the exercise, their parents may need to be contacted if there is an accident. The teacher decides to store the information as a file on a computer. Design the record format for the file. Answer. The fields necessary will be Student name, Amount paid, Emergency telephone number, Form (so that contact can be made in school). There are other fields that could be included but we will add just one more, the school number (to act as the key field). For each one of these fields it is necessary to decide what type of data they will be and also to decide how many characters will be allowed for the data in that field, remember these are fixed length records. The easiest way is to write them in a table Student number Integer 1 byte Student name Character 20 bytes Amount paid Integer 1 byte Emergency number Character 12 bytes Form (e.g.3RJ) Character 3 bytes Notes: It would be perfectly reasonable to say that the school number was not a proper number so it should be stored as characters (probably 4 bytes) or in BCD (2 bytes). The student name is quite arbitrary. 15 bytes would be perfectly reasonable, as would 25 bytes, but 5 bytes would not. In other words there is no single right answer, but there are wrong ones. The amount paid is listed as an integer as the teacher will store the number of whole pounds so far paid. This is not the best data type for an amount of money but it is the only one that fits so far. When you have learned about other data types use them if they are more suitable. Many students expect that the emergency number should be an integer, but phone numbers start with a 0, and integers are not allowed to. As most numbers do start with 0, the computer can be programmed to put them in at the start of the rest of the number, but you would have to say this in your answer. As we are not going to do any arithmetic with these numbers, why make life more complicated than necessary. 3 characters were allowed for form. If in doubt, give an example of what you mean by the data, it can’t hurt and it may save you a mark in the question. 1.4.11 Sizing a File We have just designed the record format for a given situation. It may be necessary to calculate how large the file is going to be. Having decided on the size of each field, it is a simple matter of adding up the individual field sizes to get the size of a record, in this case 37 bytes. There are 50 students going on the trip, each of them having their own record, so the size of the data in the file will be 50 * 37 = 1,850 bytes. All files need a few extra pieces of information that the user may not see such as information at the start of the file saying when it was last updated, which file it is, is it protected in any way? These sort of extra pieces of information are known as overheads, and it is usual to add 10% to the size of a file because of the need for overheads. Therefore the size of the file is 1,850 bytes + (10% of 1,850 bytes) =2,035 bytes. The final stage is to ensure that the units are sensible for the size of the file. There are 1024 bytes in 1 Kbyte, so the size of this file is 2,035/1024 = 1.99Kbytes. Note: Don’t worry about dividing by 1024, because, after all, this is only an approximation anyway. If you gave the final answer as 2 (approx) then that is just as acceptable. Just make sure that you write down somewhere that you know that there are 1024 bytes in a Kbyte, otherwise you can’t be given the mark for knowing that. 1.4.12 Access Methods to Data Computers can store large volumes of data. The difficulty is to be able to get it back. In order to be able to retrieve data it must be stored in some sort of order. Imagine the phone book. It contains large volumes of data which can be used, fairly easily, to look up a particular telephone number because they are stored in alphabetic order of the subscriber’s name. Imagine how difficult it would be to find a number if they had just been placed in the book at random. The value of the book is not just that it contains all the data that may be needed, but that it has a structure that makes it accessible. Similarly, the structure of the data in a computer file is just as important as the data that it contains. There are a number of ways of arranging the data that will aid access under different circumstances. Serial access. Data is stored in the computer in the order in which it arrives. This is the simplest form of storage, but the data is effectively unstructured, so finding it again can be very difficult. This sort of data storage is only used when it is unlikely that the data will be needed again, or when the order of the data should be determined by when it is input. A good example of a serial file is what you are reading now. The characters were all typed in, in order, and that is how they should be read. Reading this book would be impossible if all the words were in alphabetic order. Another example of the use of a serial file will be seen in section 1.4.15. Sequential access. In previous sections of this chapter we used the example of a set of students whose data was stored in a computer. The data was stored in alphabetic order of their name. It could have been stored in the order that they came in a Computing exam, or by age with the oldest first. However it is done the data has been arranged so that it is easier to find a particular record. If the data is in alphabetic order of name and the computer is asked for Zaid’s record it won’t start looking at the beginning of the file, but at the end, and consequently it should find the data faster. A file of data that is held in sequence like this is known as a sequential file. Indexed sequential. Imagine a large amount of data, like the names and numbers in a phone book. To look up a particular name will still take a long time even though it is being held in sequence. Perhaps it would be more sensible to have a table at the front of the file listing the first letters of peoples’ names and giving a page reference to where those letters start. So to look up Jones, a J is found in the table which gives the page number 232, the search is then started at page 232 (where all the Js will be stored). This method of access involves looking up the first piece of information in an index which narrows the search to a smaller area, having done this, the data is then searched alphabetically in sequence. This type of data storage is called Index Sequential. Random access. A file that stores data in no order is very useful because it makes adding new data or taking data away very simple. In any form of sequential file an individual item of data is very dependent on other items of data. Jones cannot be placed after Monks because that is the wrong ‘order’. However, it is necessary to have some form of order because otherwise the file cannot be read easily. What would be wonderful is if, by looking at the data that is to be retrieved, the computer can work out where that data is stored. In other words, the user asks for Jones’ record and the computer can go straight to it because the word Jones tells it where it is being stored. How this can be done is explained in section 1.4.13. 1.4.13 Implementation of File Access Methods This section is about how the different access methods to data in files can be put into practice. There will not be a lot of detail, and some questions will remain unanswered, don’t worry because those will appear in further work. Serial access. Serial files have no order, no aids to searching, and no complicated methods for adding new data. The data is simply placed on the end of the existing file and searches for data require a search of the whole file, starting with the first record and ending, either with finding the data being searched for, or getting to the end of the file without finding the data. Sequential access. Because sequential files are held in order, adding a new record is more complex, because it has to be placed in the correct position in the file. To do this, all the records that come after it have to be moved in order to make space for the new one. e.g. A section of a school pupil file might look like this … James, Amanda, 21 Church St, F, 21/03/…….. Jones, David, 134 New Gardens, M, ……… Marsden, Thomas, ……….. Newsome, Claire, ……….. … If a new pupil arrives whose name is Johnson, space must be found between James and Jones. To do this all the other records have to be moved down one place, starting with Newsome, then Marsden, and then Jones. … James, Amanda, 21 Chur………… Jones, David, 134 Ne………. Marsden, Thomas, ……. Newsome, Claire, …….. This leaves a space into which Johnson’s record can be inserted and the order of the records in the file can be maintained. Having to manipulate the file in this way is very time consuming and consequently this type of file structure is only used on files that have a small number of records or files that change very rarely. Larger files might use this principle, but would be split up by using indexing into what amounts to a number of smaller sequential files. e.g. the account numbers for a bank’s customers are used as the key to access the customer accounts. The accounts are held sequentially and there are approximately 1 million accounts. There are 7 digits in an account number. Indexes could be set up which identify the first two digits in an account number. Dependent on the result of this first index search, there is a new index for the next two digits, which then points to all the account numbers, held in order, that have those first four digits. There will be one index at the first level, but each entry in there will have its own index at the second level, so there will be 100 indexes at the second level. Each of these indexes will have 100 options to point to, so there will be 10,000 blocks of data records. But each block of records will only have a maximum of 1000 records in it, so adding a new record in the right place is now manageable which it would not have been if the 1million records were all stored together. 00 01 00 02 01 … 02 99 ... … 00 0102000 DATA 99 01 0102001 02 0102002 One … 0102003 First level index 99 ………. (first two digits 0102999 in account number) 100 10,000 Second level indexes Final index blocks (third and fourth digits each containing up to in account number) 1000 account numbers Random access. To access a random file, the data itself is used to give the address of where it is stored. This is done by carrying out some arithmetic (known as pseudo arithmetic because it doesn’t make much sense) on the data that is being searched for. E.g. imagine that you are searching for Jones’ data. The rules that we shall use are that the alphabetic position of the first and last letters in the name should be multiplied together, this will give the address of the student’s data. So Jones = 10 * 19 = 190. Therefore Jones’ data is being held at address 190 in memory. This algorithm is particularly simplistic, and does not give good results, as we shall soon see, but it illustrates the principle. Any algorithm can be used as long as it remains the same for all the data. This type of algorithm is known as a HASHING algorithm. The problem with this example can be seen if we try to find James’s data. James = 10 * 19 = 190. The data for James cannot be here because Jones’s data is here. This is called a CLASH. When a clash occurs, the simple solution is to work down sequentially until there is a free space. So the computer would inspect address 191, and if that was being used, 192, and so on until a blank space. The algorithm suggested here will result in a lot of clashes which will slow access to the data. A simple change in the algorithm will eliminate all clashes. If the algorithm is to write down the alphabetic position of all the letters in the name as 2 digit numbers and then join them together there could be no clashes unless two people had the same name. e.g. Jones = 10, 15, 14, 05, 19 giving an address 1015140519 James = 10, 01, 13, 05, 19 giving an address 1001130519 The problem of clashes has been solved, but at the expense of using up vast amounts of memory (in fact more memory than the computer will have at its disposal). This is known as REDUNDANCY. Having so much redundancy in the algorithm is obviously not acceptable. The trick in producing a sensible hashing algorithm is to come up with a compromise that minimizes redundancy without producing too many clashes. 1.4.14 Selection of Data Types and Structures Data types. When the computer is expected to store data, it has to be told what type of data it is going to be because different types of data are stored in different areas of memory. In addition to the types of data that we have already described, there are other, more specialised, data types. Most can be covered by calling them characters (or string data, which is just a set of characters one after the other), but there are two others that are useful. Currency data is, as the name suggests, set up to deal with money. It automatically places two digits after the point and the currency symbol in. The other is date, this stores the date in either 6 or 8 bytes dependent on whether it is to use 2 or 4 digits for the year. Care should be taken with the date because different cultures write the three elements of a date in different orders, for example, Americans put the month first and then the day. Data structures Students should be able to justify the use of a particular type of structure for storing data in given circumstances. Questions based on this will be restricted to the particular structures mentioned in 1.4.7 and 1.4.8 and will be non-contentious. E.g. Jobs are sent to a printer from a number of sources on a network. State a suitable data structure for storing the jobs that are waiting to be printed giving a reason for your answer. Answer: A queue, because the next one to be printed should be the one that has been waiting longest. 1.4.15 Backing up and Archiving Data Backing up data. Data stored in files is very valuable. It has taken a long time to input to the system, and often, is irreplaceable. If a bank loses the file of customer accounts because the hard disk crashes, then the bank is out of business. It makes sense to take precautions against a major disaster. The simplest solution is to make a copy of the data in the file, so that if the disk is destroyed, the data can be recovered. This copy is known as a BACK-UP. In most applications the data is so valuable that it makes sense to produce more than one back-up copy of a file, some of these copies will be stored away from the computer system in case of something like a fire which would destroy everything in the building. The first problem with backing up files is how often to do it. There are no right answers, but there are wrong ones. It all depends on the application. An application that involves the file being altered on a regular basis will need to be backed up more often than one that is very rarely changed (what is the point of making another copy if it hasn’t changed since the previous copy was made?). A school pupil file may be backed up once a week, whereas a bank customer file may be backed up hourly. The second problem is that the back-up copy will rarely be the same as the original file because the original file keeps changing. If a back up is made at 9.00am and an alteration is made to the file at 9.05am, if the file now crashes, the back up will not include the change that has been made. It is very nearly the same, but not quite. Because of this, a separate file of all the changes that have been made since the last back up is kept. This file is called the transaction log and it can be used to update the copy if the original is destroyed. This transaction log is very rarely used. Once a new back up is made the old transaction log can be destroyed. Speed of access to the data on the transaction log is not important because it is rarely used, so a transaction log tends to use serial storage of the data and is the best example of a serial file if an examination question asks for one. Archiving data. Data sometimes is no longer being used. A good example would be in a school when pupils leave at the end of the sixth form. All their data is still on the computer file of pupils taking up valuable space. It is not sensible to just delete it, there are all sorts of reasons why the data may still be important, for instance a past pupil may ask for a reference. If all the data has been erased it may make it impossible to write a sensible reference. Data that is no longer needed on the file but may be needed in the future should be copied onto long term storage medium and stored away in case it is needed. This is known as producing an ARCHIVE of the data. (Schools normally archive data for 7 years before destroying it). Note: Archived data is NOT used for retrieving the file if something goes wrong, it is used for storing little used or redundant data in case it is ever needed again, so that space on the hard drive can be freed up. Example Questions 1. a) Express the number 113 (denary) in (i) binary (ii) in BCD using an appropriate number of bytes in each case. (4) b) Using the answer obtained in part (a) show how 113 (denary) can be expressed in (i) octal (ii) hexadecimal. (4) A. a) (i) 128 64 32 16 8 4 2 1 0 1 1 1 0 0 0 1 =01110001. (ii) 1 = 0001 1 = 0001 3 = 0011 Therefore 113 = 0000000100010011 b) (i) 113 = 001 110 001 in binary = 1 6 1 in octal. (ii) 113 = 0111 0001 in binary = 7 1 in hexadecimal. Notice: (i) and (ii) It was necessary to show a method of working out. Many students have calculators that will automatically change from one number system to another, so it is necessary to show that you know what you are doing, even if you use a calculator to check the results. Also, the question stated that the appropriate number of bytes be used, in part (i) this is obviously 8 bits and is an easy mark, but in part (ii) it is necessary to add a set of zeros to the front of the answer to make it a whole number of bytes. (iii) and (iv) The question stated that the first answer had to be used, so one of the two marks is going to be given for showing the relationship between binary and each of these two representations. Notice that for the octal answer it was necessary to add a 0 to the front of the binary number to give 9 bits (3 lots of 3). 2. Explain how the denary number –27 can be represented in binary in (i) sign and magnitude (ii) two’s complement notation, using a single byte for each answer. (4) A. (i) +/- 64 32 16 8 4 2 1 1 0 0 1 1 0 1 1 = 10011011 (ii) –27 = -128 + 101 = -128 + (+64 +32 +4 +1) -128 64 32 16 8 4 2 1 1 1 1 0 0 1 0 1 = 11100101 Notice that the question asked “Explain…”, so just writing the answer down is not acceptable. There will be a mark for showing the column headings, particularly the value of the MSB in each case. 3. Add together the binary equivalents of 34 and 83, using single byte arithmetic, showing you working. (3) A. 34 = 0 0 1 0 0 0 1 0 83 = 0 1 0 1 0 0 1 1 + 0 1 1 1 0 1 0 1 = 117 1 Note that the question asked for the working. The part that shows that you are capable of doing the arithmetic is the carry, don’t miss it out. In a question like this try to ask yourself what evidence the examiner is expecting for the mark. The other marks are for using 8 bits for each value, and for the answer. 4. Describe how characters are stored in a computer. (3) A. -Each character is given a code… -as a binary number -Each character code occupies one byte of data -Typically one byte is eight bits -Most computers use a standard set of characters like the ASCII set. There are lots of marks available here. The question has been left deliberately open so that there are plenty of mark points available. 5.a) Explain the difference between integer and floating point data types. (2) b) State what is meant by Boolean data. (1) A. a)-Integers are whole numbers -Floating point numbers allow for the storage of fractions. b)-Boolean data is data that can exist in two states, e.g. true or false. Note that in part (a) there may be lots of other differences that you know, but they are not really part of this syllabus. If your answer is valid though, it will be given the marks. In part (b), this is another standard definition, don’t try to elaborate too much or you may wander too far from the answer. 6. An array is to be used to store information. State three parameters that need to be given about the array before it can be used, explaining the reason why each is necessary. (6) A. -The size of the array (how many data items it will hold)… -so that this amount of space can be reserved in memory. -The type of data to be stored in the array… -so that it can be set up in the correct area of memory. -The name of the array… -so that it can be identified when it needs to be used. -The number of dimensions of the array… -so that the computer knows what form the data is going to be stored in. Note: these marks go in pairs. Don’t worry if you are not happy with the last point, this is a more advanced feature of arrays – look at the notes on section 1.4.6. 7. A garden centre stores details of each of the types of plant that it has for sale on a computer system. The details of the plants are stored in alphabetical order in a linked list. a)By drawing a diagram show how the plants are arranged in the list. You may use the following plants to illustrate your answer Pansy, Dahlia, Clematis, Sweet pea. (4) b)Describe how a new linked list, of plants that like shaded conditions, can be created from the original one. (4) A. a) Head of list table Dahlia (Start Pointer) Clematis Sweet pea XX Pansy Mark points: -Some start position for list -List in alphabetic order -Use of pointers -Use of null pointer to finish list. b)-Create new linked list by creating a new start point. -Read values from old list in turn -If plant in old linked list likes shade then… -add to new list and put in pointer. -Repeat until null pointer (end of list) -Place null pointer after last entry in new list Note: The first part is fairly simple. Stick to the simple type of diagram and don’t forget how to start and end the list. The second part is more difficult because it means a certain degree of analysis of the problem to be able to solve it. This is why there are rather more mark points available. When dealing with linked lists, if you draw a diagram and remember to start and finish it, you won’t go far wrong. 8. A stack is being held in an array. Items may be read from the stack or added to the stack. a) State a problem that may arise when (i) adding a new value to the stack (ii) reading a value from the stack. (2) b) Explain how the stack pointer can be used by the computer to recognise when such problems may occur. (2) A. a)(i) The array may be full, consequently no new value can be entered. (ii) The array may be empty, there is no value to be read. b)(i) The stack pointer will be pointing outside the array. (ii) The stack pointer will be pointing at the first location in the array. Notes. When discussing stacks and queues it is important to have a picture in your mind of what it looks like. The simplest picture is to imagine them being held in an array. This is not the only way to store them and you may have been shown other methods, but the array is perfectly adequate for this course. 9. A library stores details of the books that are available. a) Apart from title and author, state 3 other fields that it would be sensible for the library to store in this file, giving a reason why each of your chosen fields would be necessary. (6) b) State which field would be used as the key field of the record and explain why a key field is necessary. (2) c) State the size of each of the fields in your record. (2) d) If the library stores approximately 20,000 books, estimate the size the book file. (3) A. a)One mark for each of three sensible fields with an extra mark for an explanation of the need for that field. E.g. ISBN/to identify book Shelf number/ to allow for ease of search for book Fiction or reference or childrens’ (some form of category)/ to decide whereabouts in library it should go b)-Book number (ISBN) -because it is unique to that record and hence can be used as an identifier. c) Title 40 bytes Author 20 bytes ISBN 10 bytes Shelf 3 bytes Category 1 byte Mark points: -Use of bytes -Sensible field sizes. d)Total size of one record = 74 bytes -Size of file = 74 * 20000 = 1480000 bytes -+ Overheads of 10% = 1,480,000 + 148,000 = 1,628,000 bytes. - (Divide by 1024 until sensible units) = 1.55 Mbytes. Notes: In part (a), any sensible fields (can you justify it?) are acceptable. For instance, Publisher/so that more copies may be ordered, would be a sensible field. In part (b) you would be expected to provide the answer ISBN, although an explanation would be accepted if it was clear what was meant. Try to be aware of the standard key fields (account numbers, ISBN, barcodes, school numbers). The sizes of the fields are only given as guidance. If you suggested 100 bytes for the title it would be accepted as being reasonable. On the other hand 1000 bytes would be way over the top and 10 bytes would be unreasonably small. Don’t worry about trying to get the ‘correct’ answer, there isn’t one, just make it sensible. There is a correct answer for the ISBN, it is 10 characters long, but it is not part of your syllabus to learn things like that off by heart, so as long as you don’t say something silly like 30 bytes, which demonstrates that you don’t know what an ISBN is, you will get the mark. In part (d) make sure that you convert your answer to sensible units. Don’t worry about using a calculator, as long as you show that you did know the correct method is to divide by 1024, an approximate answer is all that is necessary. In this case it would be quite acceptable to write “Keep dividing by 1024, this gives approximately 1.6 Mbytes”. When you answer this question your answer is highly unlikely to be the same as this, don’t worry, the examiner will mark your answer and not expect it to be the same as a model answer. 10. a)Explain the difference between a serial file and a sequential file. (2) b)Describe what is meant by a hashing algorithm and explain why such an algorithm can lead to clashes. (3) A. a)-Serial file holds data in the order in which it was received. -Sequential file holds the data according to some order defined on the data. b)-A hashing algorithm is pseudo arithmetic… -carried out on the data… -in order to determine the location of the data in the memory. -A clash occurs when the answer to the pseudo arithmetic is the same despite the data used in the calculation being different. Note: This is a question where it is very easy to try to answer too much. Restrict yourself to what the question is asking. Many candidates faced with part (b) will go into great detail about what to do after clashes have occurred, this is not required and will earn no marks. 11. A library keeps both a book file and a member file. The library does a stock take twice a year and orders new books only once a year. Members can join or cancel their membership at any time. a) Describe how the library can implement a sensible system of backing up their files. (4) b) Explain the part that would be played by archiving in the management of the files. (4) A. a)-Book file needs to backed up twice (or three) times a year… -when the stock take or book purchasing has made the file alter. -Member file needs backing up daily(at least weekly) because of constant changes… -would also need to keep a transaction log for the member file. -Back up copies would be stored away from the building with the computer system in it to ensure that a copy of files survived in case of fire. -Multiple copies of the book file would be made -Member file copies and transaction logs may be kept for a number of back up periods. b)-When books are discovered to be missing, or if a book is replaced by a more up-to-date edition, the old records should be kept… -but they are no longer live so are taken from the hard disk… -after a copy (archive) has been made. -When members leave the library the data should be archived… -also when they have not taken a book out for a long period of time their record can be considered to be dormant. Notes: Backing up and archiving are topics which cause confusion among candidates. There really is no need for such confusion if you remember that archiving has nothing to do with recovery after a disk failure. Note that there are plenty of mark points available in this question. Note: This chapter, or section of the syllabus, is by far the largest portion of module 1, and consequently, candidates should expect a higher proportion of marks on the exam paper to relate to this work than to the other sections. Back |