The processing of responses from World Class Test items
The World Class Tests are the result of a UK Government initiative to provide tests which challenge and help identify gifted and talented students at ages 9 and 13. The project is now in its third year and there have been five test sittings. Two subjects, Problem Solving and Mathematics, are available at each age. Each subject has two components, a paper test and a computer-based test. A computer-based Mathematics test is typically composed of 12 items (usually single screen) while a Problem Solving test is composed of 5 items (usually multiple screen). For the purposes of this paper, an item is a single question.
Schools or test centres register online, enter their students and can then opt to take the computer tests online or offline, in which latter case they download the tests from the Internet and upload the resulting responses for processing.
Very nearly all responses (students’ output) from the range of computer-based Mathematics test items (drag and drop, multiple choice, hot spot etc.) are marked automatically, while only a proportion, or parts, of the computer-based Problem Solving item responses can be marked in this way. Human markers mark any responses that cannot be marked automatically, using an individually customised marking package downloaded from the project’s servers over the Internet. Once marking has taken place, the marks (scores) are uploaded back to the servers. Awarding takes place following standardising, during which the agencies responsible for writing and pre-testing the items contribute data. Students are awarded a World Class Test pass, merit or distinction.
The WCT Assessment system can be described according to the Four Process Architecture proposed by Almond et al. This Architecture is summarised here in diagrammatic form.

Figure 1‑1 The four principal processes in the assessment cycle (reproduced with permission from Enhancing the Design and Delivery of Assessment Systems: A Four Process Architecture. The Journal of Technology, Learning and Assessment, Volume 1, Number 5, October 2002)
The purpose of this paper is to describe briefly the architecture of the World Class Test system, relating it to the Four Processes. Of these Four Processes (Activity Selection Process, Presentation Process, Response Processing and Summary Scoring Process), the World Class Tests’ implementation of Response Processing is perhaps the most innovative.
Developing an innovative system for Response Processing was necessary because many of the computer-based items (tasks or questions) that are used for the World Class Tests produce output that is more complex than the single value obtained from simple multiple-choice questions. To derive a mark (score) from these items reliably and transparently was a challenge.
The most visible parts of the World Class Test system are the items, which are written in Macromedia Flash for ease of deployment on the web and in offline environments. They are designed with a consistent style and are hosted by a shell also written in Flash. Most Mathematics items are single ‘page’ (screen) items. A Problem Solving item is usually made up of more than one ‘page’ or screen; each screen is accessed by a button click from the opening screen.
When a student first starts a test, the test shell (test management software in which the items are ‘embedded’) asks them to enter their Candidate ID so that it can be validated by the test software. The validation file used by the test shell to verify the student’s ID is a small XML file. As in many parts of the World Class Test Assessment System, XML has proved to be the format of choice because of its flexibility, simplicity and verifiable structure.
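As an illustration only, the ID check might look like the following Python sketch. The project's test shell is written in Flash and the layout of the validation file is not given in this paper, so the element and attribute names (`candidates`, `candidate`, `id`), the sample IDs and the normalisation of the typed ID are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical validation file: names and IDs are invented for illustration.
VALIDATION_XML = """<candidates>
  <candidate id="WCT0001"/>
  <candidate id="WCT0002"/>
</candidates>"""


def load_valid_ids(xml_text):
    """Return the set of candidate IDs listed in the validation file."""
    root = ET.fromstring(xml_text)
    return {candidate.get("id") for candidate in root.iter("candidate")}


def is_valid_candidate(typed_id, valid_ids):
    """Check a typed ID (trimmed and upper-cased) against the allowed set."""
    return typed_id.strip().upper() in valid_ids


valid_ids = load_valid_ids(VALIDATION_XML)
print(is_valid_candidate(" wct0002 ", valid_ids))
print(is_valid_candidate("WCT9999", valid_ids))
```

Keeping the list of permitted IDs in a small file of this kind would let the same check run offline, where the shell cannot query a server.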
XML is also the format used for the output from the items. In the offline version of the tests this XML is captured and saved to a local disk for later upload to the WCT web servers. In the online version the XML is posted directly to a server by the user’s test shell. The students’ responses, or ‘work products’ (Almond et al), are therefore small, digitally signed XML files.
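The paper does not say which signing scheme is used. Purely to illustrate how a server could detect a response file altered after capture, the sketch below uses a keyed hash (HMAC-SHA256) from the Python standard library as a stand-in for the digital signature. The key, the payload and the function names are hypothetical.

```python
import hashlib
import hmac

# Hypothetical shared key; a real system would manage keys quite differently.
KEY = b"demo-key-not-real"


def sign_payload(payload, key):
    """Return a hex digest that authenticates the payload under the key."""
    return hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_payload(payload, signature, key):
    """Recompute the digest and compare it in constant time."""
    return hmac.compare_digest(sign_payload(payload, key), signature)


payload = "<answer>[0,1] [2,2] [2,0]</answer>"
signature = sign_payload(payload, KEY)

print(verify_payload(payload, signature, KEY))
print(verify_payload(payload.replace("2,0", "0,0"), signature, KEY))
```

An HMAC relies on a shared secret, whereas a true digital signature uses a private and public key pair. The sketch shows only the verification step that either approach makes possible.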
The processing of these files takes place on the servers using markschemes also authored in XML. Although called markschemes, these electronic documents are more complex than a conventional paper markscheme. They contain the correct answers, the algorithms necessary for the automarking, test cases for the testing of the automarking algorithms, text guidance for human markers and rules for the display of the responses.
A number of servers backed by SQL database servers carry out the functions of web serving and the processing of data received over the Internet. Bespoke and off-the-shelf applications required for processing and display are integrated using a scripting host.
The test designers decided from the outset that they needed to control the selection of items (‘Activity Selection Process’ - Almond et al) and the presentation of the Tests for the 9 and 13 year-old users. The items that make up a test are therefore selected by experts at a Test Review Group meeting, and the chosen items are then compiled into the test shell.
World Class Tests items are not open to rendering by third party software such as a Learning Management System. The items are ‘black boxes’ designed to a set of guidelines, some of which describe the output, but most of which describe the presentation or ‘look and feel’.
The Activity Selection Process (Almond et al) and The Presentation Process (Almond et al) in the World Class Tests are largely pre-determined by the item designers. There is no scope for rendering the items in a way that differs from the original design. There is, though, no barrier to recombining items for purposes other than the specific one of creating a World Class Test. If required, items or tests can be selected and presented in such a way as to create formative assessment materials.

Figure 3‑1 An example of an interface to a World Class Test. Each item is accessed by clicking on an icon. Students can return to items when they want during the test. Copyright © 2001- 2003 Qualifications and Curriculum Authority

Figure 3‑2 An example of a computer-based mathematics item for 9 year-olds. Copyright © 2001- 2003 Qualifications and Curriculum Authority
The processing of the response to an item in the World Class Tests is determined by the markscheme for that item. Thus, a markscheme contains within it all the code and text needed for processing responses in the desired way. This code and text can be displayed and used in different ways according to the context in which the response is being processed (e.g. for a live test or for use in the classroom).
In the Four Process Architecture (Almond et al), “Evidence Rule Data provides specific information about elements that might be perceived in a Work Product that would cause particular Observable Variables to be set to certain values.” In the case of the World Class Tests, the ‘Evidence Rule Data’ is contained within the markschemes while a Work Product is an XML response file.

Figure 4‑1 - The relation of Response Processing to the Task/Evidence composite library which in the World Class Tests is called WorkBench
A response from a computer-based World Class Test item is structured to record various items of data, such as the name and ID of the student. It also includes a digital signature as a token of the file’s authenticity. All of this data is recorded in an XML file which conforms to a simple schema. The resulting response file rarely exceeds 4KB for a test and is usually much smaller (~1KB).
The test shell will have written the student’s responses to an item or a part of an item into an element in this XML response file labelled an ‘answer’ element. Responses stored in ‘answer’ elements are represented in a strict codified form that allows consistent automated processing whilst remaining human readable. The rules for the representation of a response to an item or a part of an item are formally defined as a ‘response grammar’.
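A response file of this kind could be read as in the following Python sketch. The actual schema is not reproduced in this paper, so the element and attribute names other than `answer`, and all the sample values, are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical response file; only the 'answer' element name comes from the text.
RESPONSE_XML = """<response test="maths-9">
  <student id="WCT0001" name="A. Student"/>
  <answer item="turnblock" part="1">[0,1] [2,2] [2,0]</answer>
  <answer item="oddone" part="1">B</answer>
  <signature>PLACEHOLDER</signature>
</response>"""


def read_answers(xml_text):
    """Return the student ID and a dict of raw answer strings keyed by (item, part)."""
    root = ET.fromstring(xml_text)
    student_id = root.find("student").get("id")
    answers = {
        (answer.get("item"), answer.get("part")): (answer.text or "").strip()
        for answer in root.iter("answer")
    }
    return student_id, answers


student_id, answers = read_answers(RESPONSE_XML)
print(student_id, answers[("turnblock", "1")])
```

The raw strings are returned untouched; interpreting them is left to the markscheme for each item.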
The response grammar needs to be strict enough to allow the construction of a generic parser for responses from any item and be flexible enough to allow the capture of any possible output from a World Class Test item.
The example response given below in Figure 4‑2 is the output stored in the answer element of an XML response file from the TurnBlock item shown in Figure 4‑3. It describes the position of the 3 green squares once a student has attempted to place them correctly. In this straightforward example, the coordinates of the three green squares are recorded in the output string according to the conventions established in the ‘response grammar’.
Figure 4‑4 An example of an ‘answer’ element from a World Class Test response file
In the example given it would have been possible for the item to have code incorporated which would have scored the response as the student finished the item or when the test was finished. ‘1’ would be returned to the server if the correct coordinates had been detected and ‘0’ if they had not. The World Class Test item designers, programmers and examiners, however, took the view that this was highly undesirable for two reasons:
A. The re-use of the item for formative assessment would be more limited as there would be a reduced possibility of displaying the actual response as part of feedback to student or teacher. As Almond et al make clear in their paper, “This ability to separate scoring from presentation and decision-making allows us to re-use tasks in different contexts and to meet the requirements of different assessment purposes”.
B. There would be very limited data with which to audit the behaviour of the response capture. Checking that an automatic algorithm is behaving as desired is difficult. Any problems with the algorithm which become apparent after a live test could not be corrected if the marking (scoring) was incorporated into the item.
For these reasons, programmers coding the items to generate the responses are asked to ensure that, within reason, the data that is output to a response file from the item is as ‘raw’ as possible.
The definition of and adherence to a response grammar is a key feature of response processing in the World Class Test project. It would be perfectly feasible to store the response from the example item above in many different ways (for example, the string “[0,1] [2,2] [2,0]” would equally well convey the same information to both automated and human recipients), but by standardising the form of responses across all items, the processing pipeline was streamlined and both costs and interpretation errors were reduced.
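To show why a strict form matters to a generic parser, the following Python sketch parses the alternative representation quoted above, “[0,1] [2,2] [2,0]”. It does not implement the project's actual response grammar, which is not specified here, and the function name is hypothetical.

```python
import re

# One coordinate pair such as "[2,0]"; whitespace inside the brackets is tolerated.
_PAIR = re.compile(r"\[\s*(-?\d+)\s*,\s*(-?\d+)\s*\]")


def parse_coordinates(raw):
    """Parse a string of bracketed pairs into a list of (x, y) tuples.

    Raises ValueError if any part of the string does not fit the pattern,
    so malformed output is rejected rather than silently mis-marked.
    """
    text = raw.strip()
    pairs = []
    pos = 0
    while pos < len(text):
        match = _PAIR.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected text at position {pos}: {text[pos:]!r}")
        pairs.append((int(match.group(1)), int(match.group(2))))
        pos = match.end()
        # Skip the whitespace that separates one pair from the next.
        while pos < len(text) and text[pos].isspace():
            pos += 1
    return pairs


print(parse_coordinates("[0,1] [2,2] [2,0]"))  # [(0, 1), (2, 2), (2, 0)]
```

Because every item emits responses in one agreed form, a single routine of this kind can serve all items instead of one parser per item.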
Markschemes originally devised for the World Class Tests looked very much like conventional paper-based examination markschemes with text giving the correct answer, guidance to the marker and the number of marks to be allocated for each mark point. These markschemes had to be codified and structured so that they could be used with the automarking system and, at the same time, be presentable to a human marker working online.
The requirement for structure, for the incorporation of code for automarking and for a variety of different output display formats meant that XML was an ideal format to use. The structure of the marking points, which matches the structure of the captured responses, is enforced by a schema; code can be extracted using the DOM (Document Object Model); and the text can be displayed in many different ways (in HTML or as Adobe Acrobat files) using XSLT (Extensible Stylesheet Language Transformations).
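Extraction through the DOM can be sketched in Python with the standard `xml.dom.minidom` module. The markscheme fragment here is invented for the purpose: the element names `guidance` and `automark`, the guidance text and the expression held in the CDATA section are all assumptions, not the project's real markscheme format.

```python
from xml.dom import Node, minidom

# Hypothetical markscheme fragment, made up for this illustration.
MARKSCHEME_XML = """<markscheme item="turnblock">
  <guidance>Award 1 mark if all three squares are correctly placed.</guidance>
  <automark><![CDATA[placed == expected]]></automark>
</markscheme>"""


def extract_text(xml_text, tag_name):
    """Return the text content of every element called tag_name."""
    document = minidom.parseString(xml_text)
    results = []
    for element in document.getElementsByTagName(tag_name):
        parts = [
            child.data
            for child in element.childNodes
            if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
        ]
        results.append("".join(parts).strip())
    return results


print(extract_text(MARKSCHEME_XML, "automark"))  # ['placed == expected']
print(extract_text(MARKSCHEME_XML, "guidance"))
```

The same document thus yields code for the automarker and text for a human marker, depending on which elements are read.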
The fragment of XML given below (Figure 4‑5) is from a markscheme for the item Turnblock and shows two elements used for containing the automarking code. Provided that the response output conforms to the grammar, the code used here will allocate 1 mark for a correct answer and zero for an incorrect one. The marking algorithms used can be much more complex if the interactions with an item are more complex. The complexity of the algorithms also depends on any requirements included in the markschemes for the allocation of partial marks (marks for a response that is partly correct).
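The all-or-nothing rule described here can be sketched in Python as follows. This is not the markscheme code itself. The correct positions are invented, and treating the three squares as interchangeable (so that the order of the pairs does not matter) is an assumption.

```python
import re

# Hypothetical correct answer; the real coordinates are not given in the text.
CORRECT_POSITIONS = [(0, 1), (2, 2), (2, 0)]


def automark(raw_response, correct_positions, marks_available=1):
    """Award full marks if the placed squares match the correct positions.

    Sorted lists are compared so that the order in which squares were
    recorded is ignored but duplicated positions are not.
    """
    found = re.findall(r"\[(-?\d+),(-?\d+)\]", raw_response)
    placed = sorted((int(x), int(y)) for x, y in found)
    return marks_available if placed == sorted(correct_positions) else 0


print(automark("[2,0] [0,1] [2,2]", CORRECT_POSITIONS))  # 1
print(automark("[0,0] [2,2] [2,0]", CORRECT_POSITIONS))  # 0
```

A scheme awarding partial marks would replace the single comparison with a count of correctly placed squares, which is one way the algorithms grow with the markscheme's requirements.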

Figure 4‑6 Fragment of a markscheme showing the algorithm used to automark the Turnblock item shown in Figure 3
Markschemes will need to be displayed in various ways depending on requirements. For example, the display for a human marker checking the functioning of an automarking algorithm will be different from that required by a teacher investigating their students’ answers to an item. To accommodate these different needs, the text and graphics can be manipulated using XSL stylesheets. The example below in Figure 7 shows the markscheme for Turnblock in HTML form. This is used by markers who visually check the functioning of the automarking algorithms during production and after a live test session.

Figure 4‑7 Markscheme for Turnblock displayed as an HTML file
Another example below shows a fragment from the complete test markscheme in Adobe Acrobat (.pdf) format. This document, which is provided to the examiners in paper form, is compiled on the fly from the markschemes for each item selected for a test.

Figure 4‑8 Markscheme displayed as Adobe Acrobat (.pdf) file for printing
The marking of the World Class Tests is supervised by the UK’s biggest Examination Board (AQA). They employ Principal Examiners to supervise marking, standard setting and awarding while examiners (markers) mark the responses which require human marking. Very little of the examiners’ time goes to marking the computer-based mathematics tests as they are marked automatically. Some items or parts of items in the computer-based problem-solving tests are marked automatically, but the students taking these tests are also required to submit a paper workbook with their responses for many of the tasks. These require human marking. The paper components of each test are marked conventionally. The scores for any paper response are uploaded to a server using a marking program.
The marking programme is downloaded from the servers by an examiner. The package they receive contains the marking programme itself, the responses that have been allocated to that examiner and the relevant markscheme. When run on the examiner’s computer, the marking programme displays the students’ responses and the relevant markschemes.
A portion of the marking programme is shown below in Figure 5‑1.

Figure 5‑2 Modified screenshot of part of the marking programme used by examiners
The screenshot shows responses listed in the left-hand window while the markscheme is displayed on the right. There is a facility to save the file locally or to upload it to a server (buttons on the tool bar at top of Figure 5‑3) where the scores can be compiled with scores for other components to give an overall score for a student’s World Class Test.
A student’s file of responses to all questions on the test can be accessed or, if the examiner wishes, a list of all the responses to a particular item can be displayed, suitably sorted (drop-down boxes on the tool bar at top of Figure 5‑4).
Since the item shown here (Turnblock) is a Mathematics item and has been automarked by the server, the scores are already present when the examiner downloads the package. The examiner can ‘over-ride’ these marks by entering a mark into the mark box, but any mark entered is validated against the value in the “mark-available” element of the markscheme XML file (see above). The marking programme for the paper-based components and the Problem Solving components looks very similar but, of course, the entry of marks into the mark boxes is a manual process.
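The validation of an entered mark might be sketched in Python as below. Only the element name `mark-available` comes from the text. The surrounding structure (`markscheme`, `markpoint`, `id`) and the rule that a valid mark is a whole number between zero and the maximum are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical markscheme fragment built around the 'mark-available' element.
MARKSCHEME_XML = """<markscheme item="turnblock">
  <markpoint id="1">
    <mark-available>1</mark-available>
  </markpoint>
</markscheme>"""


def validate_mark(markscheme_xml, markpoint_id, entered):
    """Return True if the entered mark lies between 0 and the marks available."""
    root = ET.fromstring(markscheme_xml)
    for markpoint in root.iter("markpoint"):
        if markpoint.get("id") == markpoint_id:
            maximum = int(markpoint.findtext("mark-available"))
            return isinstance(entered, int) and 0 <= entered <= maximum
    raise KeyError(f"no mark point with id {markpoint_id!r}")


print(validate_mark(MARKSCHEME_XML, "1", 1))  # True
print(validate_mark(MARKSCHEME_XML, "1", 2))  # False
```

Reading the limit from the markscheme rather than from the marking programme keeps a single source for the marks available for each mark point.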
The responses displayed to the examiner are not necessarily in their most raw state. The unprocessed strings contained in the ‘answer’ elements of the response files can be very difficult to interpret and so the responses are ‘formatted’ to display them in a more human-readable form. Again the code for how this is to be done for each item is incorporated into elements in the markscheme.
In the particular example given here, the raw responses, which are strings of co-ordinates (see Figure 5‑5), are rendered by the response formatting code into a series of graphics which show what the student’s response actually looked like. This technology can, of course, be used most effectively in a formative assessment context in which a student or teacher needs to review what a student has done.
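The idea of response formatting can be conveyed by a small Python sketch that turns a list of co-ordinates into a text picture. The project renders graphics rather than text, and the 3 by 3 grid, the origin at the top left and the sample positions are all assumptions.

```python
def render_grid(positions, width=3, height=3):
    """Draw a text grid with '#' for each occupied cell and '.' elsewhere.

    Rows are printed from y = 0 at the top; x runs left to right.
    """
    occupied = set(positions)
    rows = []
    for y in range(height):
        rows.append("".join("#" if (x, y) in occupied else "." for x in range(width)))
    return "\n".join(rows)


# Hypothetical response: three squares at (0,1), (2,2) and (2,0).
print(render_grid([(0, 1), (2, 2), (2, 0)]))
# ..#
# #..
# ..#
```

Because the raw co-ordinates are kept, the same response can be drawn one way for an examiner and another way for a teacher or student.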
At the centre of the World Class Tests production and delivery systems there is a store of both paper and computer items, markschemes, ancillary data and graphics. This is much more than an item bank and corresponds to the task/evidence composite library shown in the centre of the diagram depicting the Four Process Architecture in Figure 6‑1. However, in the World Class Tests, the task/evidence composite library integrates a sophisticated ‘production line’ for testing items and their markschemes, recording change requests, assembling tests and linking in data taken from live tests on the behaviour of items (equivalent to ‘Task Level Feedback’ in Figure 6‑2). The project has named the system that carries out these functions ‘WorkBench’.
One of the requirements for this system was that it should allow distributed working as the teams of item designers, item programmers and examiners are all located in different parts of the UK and Europe. For this reason the system is hosted on the World Class Test servers and is accessed over the Internet.
A vital part of the system deals with the integration of the response processing for items. If response processing is to be kept separate from the items themselves, then there has to be a high level of coordination between the item designers who broadly specify the desired output from an item and the markscheme for that item, the programmers who code the output routines, and the examiners who must have the final say in the interpretation of the ‘rules of evidence’ in the markscheme. This tight co-ordination is facilitated by the technology of WorkBench and business rules for its use and for item design. So, for example, a test item is not signed-off as complete unless the item designers and examiners have carried out tests on it and its associated response processing and have indicated they are satisfied by checking an online form.
The item can be tested online for its ‘look and feel’, but the response processing and the behaviour of the markscheme also need to be thoroughly tested. This is done in two ways:
a) An item can be accessed and attempted. A press of a button displays the resulting output string and shows the score that the automarking algorithm from the XML markscheme generates. The algorithm can be changed online if a problem is detected.
b) Test cases can be incorporated into the item’s markscheme and then tested against the markscheme’s automarking algorithm.
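The second approach, running stored test cases against an automarking algorithm, can be sketched in Python as follows. The test cases, the expected scores and the marking function are invented for illustration; in the project they would be held in the item's XML markscheme.

```python
def mark_turnblock(response):
    """Hypothetical automarking rule: 1 if the three expected pairs are present."""
    expected = sorted("[0,1] [2,2] [2,0]".split())
    return 1 if sorted(response.split()) == expected else 0


# Each test case pairs a raw response string with the score it should receive.
TEST_CASES = [
    ("[0,1] [2,2] [2,0]", 1),
    ("[2,0] [0,1] [2,2]", 1),
    ("[0,0] [2,2] [2,0]", 0),
    ("", 0),
]


def run_test_cases(mark_fn, test_cases):
    """Return a list of (response, expected, actual) for every failing case."""
    failures = []
    for response, expected in test_cases:
        actual = mark_fn(response)
        if actual != expected:
            failures.append((response, expected, actual))
    return failures


print(run_test_cases(mark_turnblock, TEST_CASES))  # []
```

An empty list means the algorithm agrees with every stored case. If the algorithm is later edited online, re-running the same cases gives a quick check that the change has not broken earlier behaviour.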
The World Class Test Assessment System conforms to the Four Process Architecture quite closely although the terminology used on the project is different from that used by Almond et al.
There are, though, a few points worth emphasising from the experience gained in the World Class Test project which relate to the Four Process Architecture.
a) The ‘Activity Selection Process’ (Almond et al) and the presentation interface for the World Class Tests are deliberately inflexible. What is presented and how it is presented are hallmarks of the World Class Tests, though whole items may be re-presented as formative assessment tasks.
b) Response processing needs to be quite separate from scoring to allow auditing, testing and re-use or re-purposing. The World Class Test Assessment systems vindicate the importance of the separation of response and response processing. This is not only to allow re-use of items in different contexts: the separation also allows control and monitoring of the behaviour of automarking. In very rare cases a problem with an algorithm has not been discovered until after a live test. This has not posed a problem, as the algorithm can be changed and the tests instantly re-marked. Had the scoring been incorporated into the test item, there would have been no possibility of adjustment.
c) Response data produced by test items needs to be as ‘raw’ as possible to allow thorough auditing. If the data being captured has been too highly processed within the item and there is a problem with the design of the data capture then there is a risk that responses cannot be marked successfully.
d) Separating response processing from scoring for items (the output from which can be much more complex than a single value) requires close coordination between designers, programmers and examiners. In the World Class Tests this has been done by integrating production, testing and item storage into the Task/Evidence composite library.
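Point (b) can be made concrete with a short Python sketch of re-marking. Because the raw responses are stored, a corrected algorithm can simply be applied to them again. The stored responses, the student IDs and both marking functions are hypothetical, and the flaw shown (an order-sensitive comparison) is an invented example.

```python
# Hypothetical store of raw responses captured during a live test.
STORED_RESPONSES = {
    "WCT0001": "[0,1] [2,2] [2,0]",
    "WCT0002": "[2,0] [0,1] [2,2]",
    "WCT0003": "[0,0] [2,2] [2,0]",
}


def flawed_mark(raw):
    """Original rule: wrongly insists that the pairs appear in one fixed order."""
    return 1 if raw == "[0,1] [2,2] [2,0]" else 0


def corrected_mark(raw):
    """Corrected rule: the same three pairs in any order earn the mark."""
    return 1 if sorted(raw.split()) == sorted("[0,1] [2,2] [2,0]".split()) else 0


def remark_all(stored_responses, mark_fn):
    """Apply a marking function to every stored raw response."""
    return {student: mark_fn(raw) for student, raw in stored_responses.items()}


print(remark_all(STORED_RESPONSES, flawed_mark))
# {'WCT0001': 1, 'WCT0002': 0, 'WCT0003': 0}
print(remark_all(STORED_RESPONSES, corrected_mark))
# {'WCT0001': 1, 'WCT0002': 1, 'WCT0003': 0}
```

Had the items returned only ‘1’ or ‘0’, the second student's response could not have been recovered and re-marked.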