A test bank needs to comprise multiple, equivalent versions of the test so that each version represents the test instrument in the same way.

Key Issues

Considerable effort and resources may go into designing and developing a robust and effective test instrument. The tasks may be valid and reliable, and the test instrument may address all aspects of the criteria listed in these guidelines. Such a test instrument could well be a highly effective and valid means of assessing the language proficiency of pilots or air traffic controllers. However, if only one version of this ideal test were ever produced, the testing system would soon lose its effectiveness once in use. With repeated administrations, security issues would begin to emerge: the more test-takers who need to take the test and the more frequently it is used, the less effective it becomes, because repeated exposure results in the content of the test becoming known.

In order for the results of a testing system to represent each test-taker’s ability effectively across a population of pilots or air traffic controllers, multiple versions of the test need to be developed and available for administration. The versions also need to be equivalent in the range and complexity of language assessed. To achieve this, each version needs comparable content, i.e. equivalent but different samples of the language-use situations that pilots or air traffic controllers could face in real-world operations.

Considerations

Each test version needs to be generated from the test specifications so that all versions share the same structure, sections, task types and number of items, and therefore represent the same set of constructs in each part of the test. Each part also needs to be parallel in its level of difficulty. It is important for the reliability of the testing system that each version measures language proficiency in the same way, so that there is confidence in the results irrespective of which test version is used.
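As a minimal sketch of this idea (the section names and item counts below are invented for illustration, not taken from any real LPR test), a specification can be expressed as a simple schema that each candidate version is checked against before it enters the test bank:

```python
# Illustrative sketch only: the test specification treated as a schema
# that every generated version must match. Section names and item counts
# are invented placeholders.
SPEC = {"listening": 20, "interview": 6, "picture_description": 2}

def conforms_to_spec(version):
    """True if the version has exactly the spec's sections and item counts."""
    return {section: len(items) for section, items in version.items()} == SPEC

version_1 = {
    "listening": [f"L1-{i}" for i in range(20)],
    "interview": [f"I1-{i}" for i in range(6)],
    "picture_description": ["P1-1", "P1-2"],
}
print(conforms_to_spec(version_1))  # True: same structure as the specification
```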

Different versions of the test may contain similar or overlapping topics, but the language and the tasks or items need to be sufficiently different to ensure that each version of the test is unique.

What does this mean for test design?

Testing systems need to comprise a test bank in which each version of the test has aspects unique to that version. Test developers need to ensure that each version is written to a set of specifications so that all versions are parallel and more or less equivalent in their level of difficulty and in the range of language and communicative contexts that they assess.

The larger the test-taker population and the more often they need to be tested, the larger the test bank needs to be.

ICAO Statements & Remarks

The following statements from ICAO Document 9835 (2nd Edition, 2010) are related to this issue.

6.3.5.8. A full description of security measures required to ensure the integrity of the testing process should be documented and available to all decision-makers.

— What it means. Test security refers to the ability of the testing organization to protect the integrity of the testing process. Testing organizations should ensure that people do not have access to specific test content or questions before the test event.

— Why it is important. The ongoing reliability, validity and confidentiality of a language proficiency testing system will depend heavily on the test security measures that are in place.

6.3.5.9. In the case of semi-direct test prompts (which are pre-scripted and pre-recorded), there should be adequate versions to meet the needs of the population to be tested with respect to its size and diversity.

— What it means. Tests with specific pre-recorded or pre-scripted questions or prompts require multiple versions. Decision-makers need to know that there are adequate versions of the test to ensure security for their particular testing needs.

— Why it is important. Once test items have been used, there is the possibility that people may repeat or share the prompts with other test-takers; this would violate the security and validity of the test.

— Additional information. It is not practical to prescribe the number of versions or test prompts required for any specific test situation. The determination of what is adequate in any situation is dependent on specific circumstances. Examples of variables that impact adequacy are:

a) The number of test-takers.

b) The geographic and organizational proximity of the test-takers. The closer the individuals are within the test-taking population, the more likely it is that they will share their testing experience with each other. If people share test information and that same information is used in another test, test-takers have the opportunity to prepare a response for a known test prompt. This is an example of negative test washback described in 6.2.4.3.

c) The variability inherent in the test design. A test that contains very little variability in prompts (in other words, all test-takers are asked the same questions or very similar questions) will require more frequent version changes than a test in which the interlocutor can, for a particular item, ask the test-taker a variety of questions.

It is common in large testing initiatives for a testing service to use a version of a test only once before retiring it. In other cases, a testing service develops a number of versions, then recycles them randomly. Test-takers may then generally know the sorts of questions and prompts they will encounter during a test, but will be unable to predict the specific questions and prompts they will encounter during a particular testing interaction. One security measure that testing organizations may take is to always include at least one completely new prompt or question in every version. A pattern of test-takers achieving high scores on most or all test prompts or questions, but failing the new prompt, may indicate a breach in test security.
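A hedged sketch of the heuristic just described (the scores and thresholds are invented; a real system would use the test’s own rating scale, and a flag would prompt investigation rather than automatic action):

```python
# Illustrative sketch only: flag test-takers who score highly on recycled
# prompts but fail the newly introduced prompt - the pattern described
# above as a possible indicator of a security breach. Thresholds invented.
def possible_breach(recycled_scores, new_prompt_score, high=4, low=2):
    """True if all recycled prompts score high but the new prompt scores low."""
    return min(recycled_scores) >= high and new_prompt_score <= low

print(possible_breach([5, 4, 5, 4], new_prompt_score=1))  # True: investigate
print(possible_breach([3, 4, 2, 4], new_prompt_score=3))  # False
```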

Special Guidance on ICAO Doc 9835 (2010), 6.3.5.9.

Although 6.3.5.9. refers to the need for adequate versions in the case of semi-direct test prompts, it should be noted that this need for a bank of test versions equally applies to test instruments that utilise direct test models for speaking (where an interlocutor uses prescribed prompts or questions to elicit responses from test-takers). Note, however, that there also needs to be an adequate number of versions of listening tests.

Why this issue is important

In the event that a test bank is too small relative to the size of the test-taker population, there is a greater possibility that prospective test-takers will become aware of specific test content through pooling or sharing of content and task or item answers. As a result, test-takers may prepare model responses or memorise answers before taking the test. In such cases test-takers’ results are unlikely to be an accurate representation of their true language proficiency; instead, the results reflect their memorisation skills and language knowledge related to a very limited sample of the language domain. If this occurs, the test results are no longer valid and the security of the testing system has been compromised.

To minimize the risk of specific test content becoming known to the target test-taker population, it is essential that a sufficient number of test versions is developed. Testing systems which comprise a sufficient number of versions remain more viable and robust over time for two reasons. Firstly, test-takers in the target population come to recognise that they are unlikely to be exposed to content or items from previously administered versions, so they are less inclined to prepare model answers. Secondly, test-takers who have already taken the test are aware that multiple versions are in use, so they see less value in pooling or sharing information. In general, testing systems which comprise multiple test versions are more secure and more respected by stakeholders. They are also less likely to produce negative washback effects in which prospective test-takers focus only on learning isolated areas of language instead of the wider scope of domain-relevant language that develops proficiency for real-world operational communication.

While a testing system needs a test bank comprising multiple versions to safeguard test security, it is equally important that those versions are equivalent. Equivalence among versions makes the testing system reliable: different versions measure the same things and produce the same results. In other words, it does not matter which version a test-taker sits; even though each version draws on different content, tasks and questions, each measures the same underlying aspects of language proficiency and the ability to communicate over the radio in the real-world aviation contexts that pilots and controllers may encounter. Equivalence among versions therefore improves confidence in the meaningfulness of each version’s results and of the overall testing system.

Best Practice Options

Test developers need to consider the following points when designing test instruments.

There are few firm academic principles concerning how many test versions a test bank should comprise. However, by taking practical precautions to minimize the over-exposure of each version and ensuring the test bank contains a sufficient number of versions, test developers can help safeguard the overall security of the testing system, making it longer-lasting, more robust and therefore more respected.

In mainstream general language testing, test developers may produce a new test version, administer it once on a single test day, and then retire it so that it is never used again. Such tests are developed for wide-scale testing of very large populations; the volume and revenue they generate justify the investment in developing single-use versions. In LPR testing, however, the test-taker population in any given area or at any given time is much smaller, so it is neither realistic nor practical to expect test developers to produce new, equivalent and valid versions of the test for each administration period.

It is recommended therefore that LPR test providers develop multiple versions of the test and rotate their usage so as to minimize opportunities for the test-taker population to become familiar with the content in each test version.

The following variables affect how many versions an LPR test bank needs to comprise:

  1. The size of the test-taker population (larger populations of pilots or controllers require the test bank to contain more versions);
  2. The frequency of test administration sessions (more frequent testing requires more test versions);
  3. The length of the test administration session period at any given time and the number of test-takers taking the test during that period (e.g. whether all test-takers sit the test at exactly the same time or whether different test-takers are able to sit the test over an extended period because the test administration session lasts for a few days or weeks).
  4. The expected number of attempts made by test-takers, both during a testing period and in subsequent administration periods (e.g. every three years for ICAO Level 4). Test-takers who do not achieve their desired ICAO level on the first attempt may be required to resit the test, and clearly they should not sit versions they have already attempted. The more test-takers who fall into this category, the more versions the test bank needs to make available.

In practice, the following variables and real-world practices influence the number of versions an LPR test bank needs to contain.

  1. A test-taker population may range from around 50 in a small ANSP or airline, to a few thousand in a large international airline, to many thousands where the test caters for a test-taker population scattered across a number of countries. Clearly, the larger the test-taker population, the more versions the test bank should contain.
  2. Test administration may follow three-year cycles (in line with ICAO’s recommendation for re-testing at Level 4), producing periods of intensive testing followed by quieter periods, or testing may be carried out on an ongoing, less intensive basis. More intensive testing periods require more versions to be available in order to minimize over-exposure; with more test-takers active at once, there are more opportunities and incentives to pool resources and share test content experiences.
  3. Testing cycles may last longer to accommodate a test centre’s capacity (examiner, rater or invigilator resources, for example) or test-taker availability. LPR test centres may therefore conduct testing less frequently but over an extended period. With the increased interval between sittings, prospective test-takers have more time to become familiar with specific content from known versions, so the test bank needs to contain more versions, reducing the perception that test content can be predicted and therefore rehearsed.

Careful attention should be given to ensure the test bank contains a sufficient number of versions and that their exposure and usage is carefully managed to take account of the above possibilities.

In other language testing contexts, many test-takers often sit the test at the same time in a one-off administration event, which minimises the need for multiple versions. Due to aviation operational constraints, however, this model is often unrealistic for assessing pilots and controllers, so multiple versions of the test need to be administered over the course of the testing period for security purposes.

A test bank which contains as few as three or four test versions may be viable if content from each of the unique versions is blended with content from other versions. This can assist in creating the perception among test users that there is a significant amount of variation and therefore it is unlikely that content can be predicted.

Obviously the larger the test bank, the less vulnerable the test is to test content becoming known to test users, and the more secure the overall testing system is.

Although it is difficult to provide concrete guidelines on how many versions a test bank should contain due to the large number of variables that affect test exposure, the following examples serve as practical guiding principles that test developers can consider when developing and administering an LPR testing service.

  1. In cases where the test population is up to 500 personnel and test administration sessions are scheduled for relatively short periods (e.g. two weeks, to limit the gradual exposure of test content over an extended time period – see point 4), it is recommended that test banks contain an adequate number of versions; three or four versions are usually not adequate, even if they are mixed together to form sub-sets of the original versions. Test content should be mixed between these versions to produce derivative versions. This allows for more variation, reduces exposure of the unique test version content and therefore increases the sustainability and security of the test bank.
  2. In situations where the test-taker population is larger (e.g. a medium-sized airline or ANSP with 600-800 or more pilots or controllers) and test administration sessions need to be available for longer periods (e.g. over the course of one or two months), the test bank needs correspondingly more versions; it is recommended that it contain six or more unique versions. As in point 1, test content should be mixed between these versions to produce derivative versions, allowing for more variation and reducing exposure of the unique version content.
  3. In cases where the test-taker population is extremely large – several thousand pilots or controllers spanning multiple airlines or organisations (e.g. in a large state where all personnel are required to take the same test) – the test bank should contain no fewer than 15 unique versions, again mixed to produce derivative versions for more variation and reduced exposure. In addition, the test developer should update and replace at least twenty percent of the versions each test cycle (e.g. introduce at least three new versions every three years, in line with the retesting requirement for ICAO Level 4). The higher volume of test-takers justifies this additional investment in test development and ongoing maintenance. (The three cases above are summarised in the sketch below.)
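As a rough summary, the three cases can be restated as a lookup. This is a sketch only: the figures are indicative rules of thumb drawn from points 1-3 above, not prescriptions, and the threshold of five versions in the first case is an inference from the statement that three or four are usually not adequate.

```python
# Illustrative sketch only: indicative minimum numbers of unique versions,
# restating points 1-3 above. Real needs also depend on testing frequency,
# session length and resit patterns, not just population size.
def minimum_unique_versions(population_size):
    if population_size <= 500:
        return 5    # point 1: three or four versions are usually not adequate
    if population_size <= 800:
        return 6    # point 2: six or more unique versions
    return 15       # point 3: no fewer than 15 for very large populations

for n in (300, 700, 5000):
    print(f"{n} test-takers -> at least {minimum_unique_versions(n)} versions")
```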

Example model for the production of derivative test versions based on unique versions:
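One possible model is sketched below in Python (the version and section names are invented for illustration): derivative versions are produced by recombining sections drawn from a small set of unique versions, so that a bank of three unique versions with three sections can yield up to 24 distinct derivatives.

```python
# Illustrative sketch only: a hypothetical way to derive additional test
# versions by recombining sections from a small set of unique versions.
# Version and section names are invented placeholders.
import itertools

# Each unique version maps a test section to a content identifier.
unique_versions = {
    "A": {"listening": "L-A", "interview": "I-A", "picture_task": "P-A"},
    "B": {"listening": "L-B", "interview": "I-B", "picture_task": "P-B"},
    "C": {"listening": "L-C", "interview": "I-C", "picture_task": "P-C"},
}

def derive_versions(versions, limit=None):
    """Recombine sections across unique versions to create derivative
    versions; combinations identical to an original version are skipped."""
    sections = list(next(iter(versions.values())).keys())
    originals = {tuple(v[s] for s in sections) for v in versions.values()}
    derived = []
    # Cartesian product: one choice of source version per section.
    for combo in itertools.product(versions, repeat=len(sections)):
        content = tuple(versions[src][sec] for src, sec in zip(combo, sections))
        if content not in originals:
            derived.append(dict(zip(sections, content)))
    return derived[:limit] if limit else derived

# Three unique versions x three sections: 3**3 - 3 = 24 derivative versions.
for version in derive_versions(unique_versions, limit=5):
    print(version)
```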

Test developers should commit resources to maintain the test bank (a simple tracking sketch follows the list below). This includes:

  1. Adding new versions to the test bank; and
  2. Removing versions that have been in use:
    – with a significant proportion of the test-taker population; and/or
    – for a significant amount of time; and/or
    – for over three test cycles (9 years with the typical three-year test cycle for ICAO Level 4 test-takers).
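A minimal sketch of how such maintenance criteria might be tracked (the exposure threshold and all data are invented placeholders; real values depend on the testing programme):

```python
# Illustrative sketch only: tracking version exposure and age so that
# versions can be retired per the maintenance criteria above.
from dataclasses import dataclass
from datetime import date

@dataclass
class TestVersion:
    name: str
    introduced: date
    administrations: int = 0

def should_retire(version, population_size, today,
                  exposure_share=0.5,  # hypothetical: half the population has seen it
                  max_years=9):        # three typical three-year test cycles
    seen_share = version.administrations / population_size
    age_years = (today - version.introduced).days / 365.25
    return seen_share >= exposure_share or age_years >= max_years

bank = [TestVersion("V1", date(2013, 1, 1), administrations=320),
        TestVersion("V2", date(2019, 6, 1), administrations=40)]

for v in bank:
    if should_retire(v, population_size=500, today=date(2022, 1, 1)):
        print(f"Retire {v.name}")  # V1: ~9 years old and seen by >50% of population
```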

As many variables affect the viability of a test based on the size of its test bank, the above guidelines represent guiding principles only.

Effort also needs to be made to ensure the versions in the test bank are equivalent. The level of difficulty should be consistent across the test bank so that, for example, some versions do not produce higher or lower results than others. Each version should be equivalent in the breadth and scope of language test-takers are expected to produce or understand. Finally, each version should be equivalent in how it assesses language representative of the type and variety of language-use situations that pilots and controllers may need to communicate in during real-world operations.
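A minimal sketch of one simple equivalence check, in the spirit of the Alderson, Clapham and Wall quotation in the External References below: equivalent versions should show similar difficulty (means and spread) and correlate highly when the same trial group takes both. The scores here are invented pilot data.

```python
# Illustrative sketch only: comparing two versions on invented trial data
# from a group that sat both versions in a pilot study.
from statistics import mean, stdev, correlation  # correlation: Python 3.10+

version_a = [4.0, 3.5, 5.0, 4.5, 3.0, 4.0, 5.5, 4.5]
version_b = [4.5, 3.5, 5.0, 4.0, 3.0, 4.5, 5.0, 4.5]

print(f"Version A: mean={mean(version_a):.2f}, sd={stdev(version_a):.2f}")
print(f"Version B: mean={mean(version_b):.2f}, sd={stdev(version_b):.2f}")
print(f"Inter-form correlation: {correlation(version_a, version_b):.2f}")
```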

External References

The following references are provided in support of the guidance and best practice options above.

1. Fulcher (2010)

“A test form means that it is generated from a test specification; one reason for having test specifications is to try to ensure that each form looks roughly the same because it is made up of the same item types, with the same number of items, representing the same set of constructs in each section. It is also designed to try to make sure that each form is of the same difficulty” (p. 129).

Fulcher, G. (2010). Practical language testing. London: Hodder Education.

2. Fulcher and Davidson (2007)

“In large-scale testing there may be a requirement for hundreds of forms (versions) of the test each year. Even for institutional testing there may be a need for more than one form of the test, especially if testing is to take place at different times. The most obvious reason why we need multiple forms is test security” (p. 117).

Fulcher, G., & Davidson, F. (2007). Language testing and assessment: An advanced resource book. London and New York: Routledge.

3. Alderson, Clapham and Wall (1995)

“What is important with equivalent tests is that they each measure the same language skills and that they correlate highly with one another. It is to be hoped, of course, that equivalent versions will be of a similar level of difficulty and have a similar spread of scores” (p. 97).

Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge: Cambridge University Press.
