Test instruments need to include tasks and items which allow the assessment to differentiate between proficiency levels.

Key Issues

The main objective of the LPRs is to determine whether pilot or ATCO test-takers have sufficient language proficiency (ICAO Level 4 or above) to maintain safe operations. However, there may also be a need for tests to assess lower and higher levels accurately.

In cases where LPR tests aim to differentiate between, for example, Levels 3, 4 and 5, different tasks or test parts need to be designed to assess each of these levels and its associated competencies, as reflected in the ICAO LPR rating scale.

This means that in a speaking component of a test, the test stimulus or input needs to be able to effectively elicit language that demonstrates proficiency at the required level. Similarly, in a listening component, the test input and associated tasks and questions need to be able to assess comprehension at the specified levels.

Therefore, tests which claim to differentiate between levels, for example between ICAO Levels 3, 4 and 5, need to include test tasks that correspond to each of these levels. In this way, a test-taker who meets the minimum requirements of, say, Level 4 in both the speaking and listening parts of the test, but who cannot demonstrate Level 5 skills in the speaking components or meet a Level 5 standard in the listening comprehension components, can be confidently awarded an ICAO Level 4 result.
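The award decision described above can be sketched in Python. This is a minimal illustration under stated assumptions, not an ICAO procedure: the function name `awarded_level` and the per-level pass/fail dictionaries are hypothetical, and the sketch assumes each test part reports, for every level it targets, whether the test-taker met that level's standard.

```python
def awarded_level(speaking, listening):
    """Return the highest level met in BOTH parts, counting up from the lowest.

    speaking / listening: dicts mapping a targeted level (int) to True/False
    (hypothetical data shape). Returns None if the lowest targeted level
    was not met in both parts.
    """
    # Only levels targeted by both parts can be awarded with confidence.
    levels = sorted(set(speaking) & set(listening))
    result = None
    for level in levels:
        if speaking[level] and listening[level]:
            result = level
        else:
            break  # stop at the first level not met in both parts
    return result


# Level 4 met in both parts; Level 5 met in listening only.
speaking = {3: True, 4: True, 5: False}
listening = {3: True, 4: True, 5: True}
print(awarded_level(speaking, listening))  # 4
```

The sketch only works because both parts contain tasks targeted at Level 5: without them, there would be no Level 5 outcome to consult and the Level 4 result could not be distinguished from a higher one.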

Considerations

When test instruments claim to differentiate between multiple ICAO Levels (for example, just below ICAO Level 4 and at ICAO Level 4), this directly relates to the validity of the test. If a test instrument includes components designed to assess the competencies associated with different levels, we can be more confident that the test is able to effectively differentiate between ICAO levels. In such cases, the test is more likely to have higher validity. If tests claim to assess higher levels (e.g. differentiate between ICAO Levels 4 and 5) but only include content which is ‘too easy or insufficient’ and do not contain tasks or components designed to assess higher levels, it is likely that such tests are awarding test-takers ICAO levels above their true proficiency level.

What does this mean for test design?

Parts and components of tests need to be accessible and achievable for test-takers at different levels, with some parts/tasks/items catering to test-takers with lower proficiency levels (below ICAO Level 4), and other parts for ICAO Level 4 (and above). If the test claims to be able to assess ICAO Level 5 or Level 6, different tasks/items or test parts need to be dedicated to assessing these higher levels and their associated competencies (as reflected in the ICAO LPR Rating Scale).

ICAO Statements & Remarks

ICAO Doc 9835 (2010) does not directly address this issue.

AELTS (the ICAO’s Aviation English Language Test Service) information and advice does, however, note in point 13 that:

“If the test contains components, tasks and items which are intended to target language ability at a specific level, what has been done to determine that these components are able to effectively assess this level and discern between other levels?”

Why this issue is important

The effectiveness of a test in differentiating between levels is a feature of the test design, input stimulus and test tasks. It is important that the speaking components of tests contain tasks and test content that elicit test-taker responses which allow the assessment process (raters’ evaluations) to determine whether test-takers demonstrate language skills up to the level the test claims to evaluate.

Similarly in the listening components of the test, the input needs to be sufficiently complex and varied to allow test-takers’ results to correspond to a range of levels, up to and including the highest level the test claims to assess.

Put simply, if test-takers are able to perform well in all aspects of an LPR test which, by design, does not contain sufficiently complex content or input, or does not require test-takers to produce language associated with higher ICAO Levels, we cannot be confident that the test is able to assess higher levels, even though the test-takers have ‘successfully’ completed every aspect of that test. In such cases, there is a risk that raters or the test provider may award test-takers levels that are higher than their actual proficiency level. This can happen because the tests do not provide opportunities (tasks and content) at a level that allows such higher-level skills and competencies to be evaluated.

We know that for safety reasons the ICAO LPRs require a minimum of Level 4 in all six criteria on the rating scale. Therefore, a simple approach to designing a test instrument would be to make sure it is able to differentiate between ICAO Level 4 and below Level 4. If this were the case, it would only be necessary for the test instrument to assess whether test-takers have the proficiency to fulfil the minimum requirements for each of the six criteria at Level 4 (and not ICAO Level 5 or 6).
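The simple Level 4 or below-Level-4 decision can be sketched as follows. The six criterion names are those of the ICAO rating scale; the function name `meets_operational_minimum` and the dictionary of ratings are hypothetical, and the sketch assumes raters have already assigned one integer rating per criterion.

```python
# The six criteria of the ICAO LPR rating scale.
CRITERIA = (
    "pronunciation",
    "structure",
    "vocabulary",
    "fluency",
    "comprehension",
    "interactions",
)


def meets_operational_minimum(ratings, minimum=4):
    """Return True only if every one of the six criteria is rated at or above `minimum`.

    ratings: dict mapping criterion name to an integer rating (hypothetical shape).
    """
    missing = [c for c in CRITERIA if c not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {', '.join(missing)}")
    return all(ratings[c] >= minimum for c in CRITERIA)


ratings = {
    "pronunciation": 4,
    "structure": 4,
    "vocabulary": 5,
    "fluency": 4,
    "comprehension": 3,  # one criterion below Level 4
    "interactions": 4,
}
print(meets_operational_minimum(ratings))  # False
```

A single criterion below Level 4 is enough to fail the check, which is why a test limited to this decision only needs tasks that separate Level 4 from below Level 4 on each criterion.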

However, the ICAO LPRs and rating scale do allow for test developers to award ICAO Levels at higher (and lower) levels. Some ANSPs, airlines, other organisations or states may require tests to assess test-takers at a range of levels for operational, recruitment and safety reasons.

Therefore, in order to be confident that a test instrument is able to assess a range of proficiency levels, test content and task types need to target the proficiency levels and skills associated with each of the levels the test aims to assess, and these need to be incorporated into the test design.

If a test instrument lacks diversity in its task types and input, and in the type, range and complexity of the responses its prompts are designed to elicit (i.e., the test is basically too simple or too easy), yet claims to assess a range of ICAO levels, the validity of the test would be negatively affected.

Imagine a situation where such a test is developed and administered but is only able to assess limited aspects of some of the competencies at the lower ICAO levels. It would therefore be ineffective in assessing the full range of language competencies at Level 4. Imagine also that such a test does not sufficiently assess higher-level skills, requirements and competencies at Level 5, so does not include input or content, or elicit responses, sufficiently beyond test-takers’ language abilities to allow decisions to be made about the point at which test-takers can and cannot adequately use language and respond to or understand input. If such a test claims to assess ICAO Level 5, and issues test-takers with ICAO Level 5 results, there would be serious doubts about the validity of these results, and therefore the validity of the test overall. The problem in such cases is that the test developer, raters, users and administrators may believe their test is effective because they are not aware that the test instrument is deficient in its ability to differentiate between levels, in this case ICAO Levels 4 and 5. Only through careful test design can such issues be avoided.

The interaction between item prompts and the test-taker responses is an extremely complex process and varies from one test-taker to the next. The multiple cognitive and environmental factors involved mean that different test-takers may well respond to the same prompts in different ways. There should therefore be sufficient prompt material to ensure that each test-taker has the maximum opportunity to be able to demonstrate language proficiency at the required levels. Including multiple opportunities for test-takers to demonstrate their language abilities also increases the fairness of the test and therefore test-takers’ perception and attitude towards the test and the ICAO LPRs.

Best Practice Options

Test developers need to consider the following points when designing test instruments.

Test specifications need to identify the levels (e.g. below Level 4 or at Level 4, or Level 2, 3, 4, 5) the test is designed to assess and the rationale for both the prompt material and expected responses in a speaking component of the test. In the comprehension part of the test, the test specifications need to identify the extent to which test-takers need to demonstrate their comprehension (e.g. scores needed to demonstrate equivalence to different levels or a means of attributing anticipated responses to levels in cases where test-taker responses are given orally).

The test input or test stimulus needs to be categorised as being suitable for assessing the levels identified in the test specifications. The test specifications need to outline which components of the test instrument target each of the levels (e.g. Levels 3, 4 and 5), how they do so, and the expected test-taker responses which indicate achievement of the requirement for each level.
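One way to make this mapping checkable is to record, for each test component, the skill it assesses and the levels it targets, and then confirm that every level the test claims to assess is covered in both speaking and listening. The Python sketch below is illustrative only: the function name `uncovered_levels`, the field names, and the example component names are all hypothetical and are not taken from any ICAO document.

```python
def uncovered_levels(claimed_levels, components):
    """Return, per skill, the claimed levels that no component targets.

    components: list of dicts with keys 'name', 'skill' ('speaking' or
    'listening') and 'target_levels' (hypothetical specification format).
    An empty dict means every claimed level is covered in both skills.
    """
    gaps = {}
    for skill in ("speaking", "listening"):
        covered = set()
        for component in components:
            if component["skill"] == skill:
                covered.update(component["target_levels"])
        missing = sorted(set(claimed_levels) - covered)
        if missing:
            gaps[skill] = missing
    return gaps


# Hypothetical specification for a test claiming to assess Levels 3, 4 and 5.
spec = [
    {"name": "Part 1 interview", "skill": "speaking", "target_levels": [3, 4]},
    {"name": "Part 2 role-play", "skill": "speaking", "target_levels": [4, 5]},
    {"name": "Part 3 listening items", "skill": "listening", "target_levels": [3, 4]},
]
print(uncovered_levels([3, 4, 5], spec))  # {'listening': [5]}
```

In this example the listening part contains nothing targeted at Level 5, so a Level 5 result from this test would not be supported by its design. A check of this kind shows only that a level is targeted on paper; whether a component actually discriminates at that level still has to be established through trialling and analysis.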

However, levels can also be attributed to the language a test-taker actually produces, and not just inferred from the fact that a task is designed to elicit an expected response. It is possible that a test-taker may respond to a prompt or use language in a role-play that is indicative of a higher language proficiency level, even though the task or content was not intended to elicit such complex language. In such cases test-takers should be given credit for demonstrating language associated with a higher proficiency level. It would be unfair to award a Level 4, for example, to a test-taker who consistently performed above Level 4 on test tasks which were developed primarily to assess Level 4 competencies. Of course, in such cases, the test-taker would also need to demonstrate proficiency above ICAO Level 4 on test tasks designed to assess Level 5 abilities.

As many opportunities as possible should be provided for the test-taker to demonstrate proficiency skills at the required level. This is normally achieved by including a variety of prompts that assess equivalent skills at a given level (refer to Criterion 4).

Care should be taken in situations where prompts are designed to elicit operational and technical vocabulary. Language in radiotelephony communication may be learned or acquired in a limited context, and the user may not have the language proficiency to extend the use of certain technical words and phrases beyond an operational environment. Such language use may therefore be sufficient for Level 4 but cannot automatically be attributed to Level 5 or Level 6. In other words, operational controllers and pilots may be able to use and understand a range of technical and operational language and vocabulary which is routine and common in operational contexts, but which is not necessarily representative of the language abilities and skills associated with higher levels.

External References

The following references are provided in support of the guidance and best practice options above.

1. Weir (2005)

“Test service providers need to furnish evidence that they [tests] are construct-valid, i.e., that they adequately address context, theory-based and scoring parameters of validity appropriate to the level of language ability under consideration. A framework is required that helps identify the elements of both context and processing and the relationships between these at varying levels of proficiency” (p. 3).

“A framework for testing purposes would need to comprehensively address at different levels of proficiency the (various) components of validity” (p. 4).

“They need […] to operationalize appropriate significant divergent conditions under which tasks are performed that enable us to differentiate between performances of a task at adjacent proficiency levels” (p. 4).

“Exam providers need to clearly specify the extent to which purposes differs from level to level and provide evidence on how this criterion can be used to distinguish between adjacent proficiency levels” (p. 10).

“(the demands of both content and language knowledge) will vary from task to task […] these need to be made explicit in relation to proficiency level. In short, testers need more detail on the components of processing in both receptive and productive modes and how these develop through the different levels of proficiency” (p. 15).

Weir, C. J. (2005). Limitations of the CEFR for developing comparable examinations and tests. Language Testing, 22(3), 1–20.

2. Association of Language Testers in Europe (2011)

“There are two aspects to defining levels: what people can do and how well they can do them. In an exam the ‘what’ is defined through the tasks which are specified. How well these tasks are done is what raters have to judge” (p. 41).

Association of Language Testers in Europe (2011). Manual for Language Test Development and Examining. Council of Europe.

3. Alderson (2009)

“[…] if there is no statistical information available to support the claimed level of the test, the equivalence of different versions from year to year, and the comparability of the results of different tests purporting to measure the same target level of proficiency […] then little or no confidence can be held in the meaningfulness, reliability and validity of […] language tests […] and […] it is highly likely that they […] fail to meet minimal standards of quality” (p. 177).

Alderson, J. C. (2009). Air safety, language assessment policy, and policy implementation: The case of Aviation English. Annual Review of Applied Linguistics, 29, 168–187.
