Criterion 7: Test instruments need to comprise distinct sections with a range of appropriate test task types

Key Issues & Considerations

When designing a language test, five key questions should be asked at the outset.

1. What language skills and language knowledge need to be tested?

This forms the basis for the test construct and affects all issues related to test instrument design.

2. In what real-world communication contexts are this language knowledge and these language skills used?
3. What test content is required so that the test reflects the way communication happens in these real-world contexts?
4. How can this language knowledge and these language skills best be observed through test performance?
5. What types of test tasks can best obtain these types of performances in assessment situations?

Considerations

The types of test tasks and items used to build a test need to be assembled so that each plays a role in how the test instrument functions as a whole. The type, sequence and combination of test tasks need to take account of how effectively each task can measure a specific and defined set of language knowledge and skills (the test construct). The combination of test tasks needs to be balanced so that it properly reflects the range and diversity of the communication contexts in which test-takers are to be assessed.

In ICAO LPR test instrument design, tests should therefore include sections with distinct assessment purposes and a range of test task types representative of the domain of radiotelephony communications, so that the wide range of language knowledge, language skills and communicative competencies required for air-ground communication can be effectively and efficiently assessed.

ICAO Statements & Remarks

ICAO does not directly refer to the need for test instruments to comprise distinct sections. Nevertheless, the following remarks from ICAO Document 9835 (2nd Edition, 2010) are related to this issue, as they refer to the need for tests to assess a wide range of language skills and knowledge, and this is most effectively achieved when different parts of a test focus on specific language skills or knowledge. The test structure is linked to the test instrument design and shapes the scope and range of language skills and knowledge the test assesses.

6.2.5.4. Proficiency tests require test-takers to demonstrate their ability to do something representative of the full spectrum of required knowledge and skills, rather than to simply demonstrate how much of a quantifiable set of curriculum learning objectives they have learned. In an aviation context, proficiency testing should establish the ability of test-takers to effectively use appropriate language in operational conditions.
6.3.2.3. A description and rationale for test construct and how it corresponds to the ICAO Language Proficiency Requirements should be accessible to all decision-makers in plain, layperson language. […] A description of the test structure and an easy-to-understand explanation of reasons for the test structure is one form of evidence that it is an appropriate tool for evaluating language proficiency for the ICAO requirements for a given context.

Why this issue is important

The following seven points outline the importance of having tests with distinct sections and a range of test task types, and the effect this has on the overall quality of a test.

1. Test Validity

Test instruments that contain a variety of test tasks which:

  • serve a required purpose;
  • measure specific knowledge/skills related to real-world communication needs; and
  • play an effective role in how the test operates,

are more likely to be effective in assessing what the test aims to assess (i.e. to have higher validity). A test instrument which does not contain a sufficient range of task types designed to assess specific language knowledge/skills is unlikely to adequately represent the required range and complexity of language for assessment purposes, and will therefore have lower validity.

2. Fairness

Test instruments which contain multiple task types increase the fairness of the test. If, for example, a test does not contain a range of test task types and a test-taker is unable to engage effectively with one of the task types it does contain, this may disproportionately and negatively affect his/her overall test result, producing a final score which does not accurately reflect the test-taker’s language ability.

Providing a range of task types allows test-takers to demonstrate their language knowledge and skills in a wider range of contexts, and therefore gives them more opportunities to demonstrate their abilities. If, for example, a test-taker does not engage as effectively with one task type in a test that contains multiple test sections and task types, this has only a small impact on the test-taker’s overall result, making the final result a fairer reflection of his/her language ability.

3. Range of knowledge and skills

Test developers need to include all important aspects of the range of language knowledge and language skills associated with air-ground communication – that is, elements that are representative of pilots’ and ATCOs’ communication needs where plain English is required in air-ground communication. This can most effectively be achieved in language testing situations by including a range of test tasks – each specifically designed to assess separate elements of language knowledge and language skills.

For example, one task may be specifically designed to assess test-takers’ abilities to use the correct vocabulary in unusual flight situations; another task may require test-takers to demonstrate their abilities to clarify information and resolve communication breakdowns; while a further task might assess test-takers’ abilities to communicate using complex language to give/receive information about a situational complication.

A good example of the need for separate test tasks serving separate purposes is listening comprehension. Listening comprehension needs to be assessed through separate, dedicated test tasks in which speaking performance is not directly assessed. It should not be assessed through tasks specifically developed to assess speaking performance alone (see Criterion 1).

4. Assessment of real-world competence

Only by including a range of different test tasks can a test instrument effectively assess the required types of language knowledge and language skills, the range of proficiency levels, and the variety of communication contexts in which air-ground communication occurs.

As a consequence, the results the test gives can better reflect communicative competence across the wide range of real-world communication situations the test aims to replicate.

5. Characteristics of real-world communication

Test tasks should be designed not only with regard to the relevance of their content to real-world communication, but also with regard to how fully the characteristics of real-world pilot/ATCO communication are incorporated into the test tasks.

6. Avoiding the engagement of other skills

Just as test developers need to make sure that all elements necessary for testing language proficiency in pilot/ATCO communications are included, they should also ensure that the testing process does not include elements that may unnecessarily affect the result in a negative way.

This could mean, for example, that if a listening test requires test-takers to listen and, at the same time, read long texts associated with the test items, the test is inadvertently assessing the ability to read and listen simultaneously – a skill which the test is not aiming to assess.

7. Fostering positive washback

In terms of washback effect, which is broadly understood as the impact of tests on training practices, if important aspects of what needs to be tested are not included in the test (or are under-represented in the test), learners – pilots and ATCOs – may not practice and develop the important skills required for real-life communication (negative washback). This is because training tends to concentrate on what the test assesses, so skills that are absent from or under-represented in the test are neglected.

Tests therefore need to contain a variety of test task types which adequately reflect the language knowledge and language skills associated with the operational needs of pilot/ATCO communication.


Best Practice Options

Test developers should consider the following questions when designing test instruments.

1. What type and complexity of language and skills do we need to assess and in what kind of contexts?

Decide what the test should and should not assess – establish specific boundaries around what the test should assess so that the test construct is clear.

2. How can these areas of language knowledge and language skills be separated into specific and definable areas to help the assessment process work?

Parts of the test that need to serve their own function should not overlap significantly in their aims with other parts of the test. For example, it is important that a test includes explicit speaking and listening comprehension tasks so that speaking ability and listening comprehension can be evaluated independently of each other (see Criterion 1).

3. How can separated language areas and skills be reflected in different language tasks that best reflect the way communication happens in real-world contexts?

Consider types of test tasks that can enable the assessment process to maximize both practicality and authenticity.

4. How will each test task contribute in a useful way to the overall result that the test gives?

Consider the role and overall level of importance of each part of the test, including the contribution each part makes to the overall result and its level of difficulty.

5. How will each of these test task types be put together to build the test instrument?

As test-takers work through the test there should be a logical flow, so that the test functions smoothly.


External References

The following references are provided in support of the guidance and best practice options above.

1. International Language Testing Association (ILTA) Guidelines for Practice

PART 1

A. Basic Considerations for good testing practice in all situations

1. The test developer’s understanding of just what the test, and each sub-part of it, is supposed to measure (its construct) must be clearly stated.

B. Responsibilities of test designers and test writers

2. A test designer must decide on the construct to be measured and state explicitly how that construct is to be operationalised.
3. The specifications of the test and the test tasks should be spelled out in detail.

http://www.iltaonline.com/page/ITLAGuidelinesforPra

2. Messick (1989)

Validity is “an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment.” (p.13)

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: Macmillan.

3. Messick (1996)

“However, it is not sufficient merely to select tasks that are relevant to the construct domain. In addition, the assessment should assemble tasks that are representative of the domain in some sense. The intent is to insure that all important parts of the construct domain are covered, which is usually described as selecting tasks that sample domain processes in terms of their functional importance. Both the content relevance and representativeness of assessment tasks are traditionally appraised by expert professional judgment, documentation of which serves to address the content aspect of construct validity.” (p.10)

“In practice, test makers are mainly concerned about adverse consequences that are traceable to sources of test invalidity such as construct under-representation and construct-irrelevant difficulty. These concerns are especially salient in connection with issues of bias, fairness, and distributive justice, but also potentially with respect to negative washback. For example, if important constructs or aspects of constructs are underrepresented on the test, teachers might come to overemphasize those constructs that are well-represented and downplay those that are not.” (p.14)

Messick, S. (1996). Validity and washback in language testing. ETS Research Report Series, i–18. doi:10.1002/j.2333-8504.1996.tb01695.x

4. Alderson, Clapham and Wall (1995)

“It is important to realize that the method used for testing a language ability may itself affect the student’s score. This is called the method effect, and its influence should be reduced as much as possible.” (p.44)

“…the best advice that can be offered to item writers is: ensure that you use more than one test method for testing any ability. A useful discipline is to devise a test item to cover some desired ability or objective, then to devise another item testing the same ability using a different method or item type. […] In general, the more different methods a test employs, the more confidence we can have that the test is not biased towards one particular method or to one particular sort of learner.” (p.45)

Overall balance of the test – After being drafted, test items/tasks should be assembled into a draft test paper/subtest. It should then be considered by an editing committee (a group of testing experts and SMEs) for the degree of match with the test specifications, likely level of difficulty, wording of instructions, etc. The committee should also consider the test as a whole, with special attention to the overall balance of the subtest or paper. (p.63)

Some components of the test construct may be easier to assess than others, therefore “item writers sometimes find they are unable or unwilling to test the more difficult aspects, and, because of this, the content of some test papers may be unbalanced”. (p. 69)

Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge: Cambridge University Press.

5. Fulcher and Davidson (2007)

“The role of task design in language testing is closely linked to what we argue will constitute evidence for the degree of presence or absence of the kinds of knowledge or abilities (construct) to which we wish to make inferences.” (p.64)

“Each item or task must elicit evidence that will be useful in drawing inferences to constructs. In order to do this, the designer must be clear about the kind of responses that are to be expected from tasks, and be able to define those features of tasks that are critical in eliciting relevant evidence.” (p.67)

In order to combine test items or tasks, test developers should follow test assembly rules, which tell us “how many items of certain types are to be used, what content coverage must be achieved, which sections they are to be placed in and what the intended measurement properties are.” (p.119)

Fulcher, G., & Davidson, F. (2007). Language testing and assessment: An advanced resource book. New York: Routledge.

6. Fulcher and Davidson (2009)

“We now turn our attention to the conceptual elements of test specifications (Mislevy et al., 2003; Mislevy, 2003a). A given specification may or may not label these elements, but in a well-constructed specification each is needed.

The Presentation Model tells us how the items and tasks are presented to the test-takers. An Assembly Model tells the test designers how the tasks and items should be combined to produce a test form. It specifies targets, such as the reliability with which each construct should be measured, and the constraints on the mix of items that need to be included to achieve an adequate representation of the domain of inference. Finally, the Delivery Model explains how the actual test is delivered, including administration, security and timing.

Many specifications are needed to create a test, at least one for each of the different components (sections, tasks, items) from which the whole is constructed. On top of this are the test specifications which additionally include information on presentation, assembly and delivery.” (p.129)

Fulcher, G., & Davidson, F. (2009). Test architecture, test retrofit. Language Testing, 26(1), 123-144.

7. Fulcher (2010)

“The test assembly specification therefore plays a critical role in showing that the number and range of items in any form of the test adequately represent the key features of the criterion situation in the real world.” (p.128)

Fulcher, G. (2010). Practical language testing. London: Hodder Education.

8. Davidson and Lynch (2002)

“Writing a full test is frequently a process of assembling a number of tasks generated by several specs. Specs and their tasks exist in a “one-to-many” relationship; that is, each spec is designed to produce many tasks of the same type. A test will therefore be made up of many tasks of different types, each produced from a different spec.” (p.60)

Davidson, F. & Lynch, B. K. (2002). Testcraft. A teacher’s guide to writing and using language test specifications. New Haven: Yale University Press.

9. Douglas (2000)

Douglas (2000) defines a language for specific purpose test as “the combined set of tasks that will give us as complete a picture as possible of the test-taker’s field specific language ability.” (p.248)

He discusses the importance of a specification document in outlining the necessary information to describe the content of a test and distinguishes two stages in the test development process: “the specifications containing all the information needed to produce the test/tasks and the operationalisation stage referring to the actual production of test materials.” (p.249)

Douglas, D. (2000). Assessing languages for specific purposes. Cambridge: Cambridge University Press.

10. Weir (1993)

“We also have to ensure that the sample of communicative language ability in our tests is as representative as possible. What and how to sample in our tests is a key issue in language testing. If we are to extrapolate from our test data and make statements about communicative language ability in real-life situations, great care needs to be taken with the tasks we employ in our tests. The more closely we can specify what needs to be tested, the more representative our sampling in tests might become.” (p.29)

Weir, C. J. (1993). Understanding and developing language tests. New York, NY: Prentice Hall.

11. Weir (2005)

“We need to ensure that the constructs we are eliciting are precisely those we intend to and that these are not contaminated by other irrelevant variables, such as method effect. If important constructs are under-represented in our tests, this may have an adverse washback effect on the teaching that precedes the test, teachers may simply not teach certain important skills if they are not in the test.” (p.18)

Weir, C. J. (2005). Language testing and validation: An evidence-based approach. Basingstoke: Palgrave Macmillan.

12. Moder and Halleck (2009)

Moder and Halleck investigated the variation in oral proficiency demonstrated by 14 ATCOs across two types of testing tasks: work-related radiotelephony-based tasks and non-specific English tasks on aviation topics (common occurrence and less expected occurrence). The results revealed significant differences in the performance of test-takers across task types with respect to the established minimum required proficiency, Operational Level 4.

“Of greater concern from a public safety perspective is the finding that some controllers performed at Operational level on one of the general description in the aviation context tasks and failed to demonstrate minimum proficiency on the radiotelephony tasks. In such a case, the general aviation task would have inaccurately predicted the controller’s performance level on a critical workplace task.” (p.13)

Moder, C. L. & Halleck, G. B. (2009). Planes, politics and oral proficiency: Testing international air traffic controllers. Australian Review of Applied Linguistics, 32(3), 25.1-25.16. DOI 10.2104/aral0925
