Test instruments need to comprise distinct sections with a range of appropriate test task types.
In language test design, there are five key issues that need to be addressed at the beginning of the design of any test instrument:
- The language skills and language knowledge that need to be tested. This forms the basis for the test construct and affects all issues related to test instrument design.
- The communication contexts in which this language knowledge and these language skills are used in real-world communication situations.
- The test content that is required to allow the communication contexts to relate to the way communication happens in real-world situations.
- The relationship between the test content and the language knowledge and language skills and how this can best be observed through performance in assessment situations.
- The types of test tasks that can best obtain these types of performances in assessment situations.
The test task types and items used to build a test need to be assembled in a way that each plays a role in how the test instrument, as a whole, functions. The type, sequence and combination of test tasks need to take account of how effectively each task can measure a specific and defined set of language knowledge and skills (the test construct). The combination of test tasks needs to be balanced to properly reflect the range and diversity of the communication contexts in which test-takers are to be assessed.
Test instruments need to contain task types that represent an appropriate range of test-takers’ knowledge, skills and competencies to understand and communicate in real-world communication contexts. In ICAO LPR test instrument design, sections with distinct assessment purposes and a range of test task types that are representative of the domain of radiotelephony communications should be included, so that a wide range of the language knowledge, language skills and communicative competencies required for radiotelephony communication can be effectively assessed.
Further, ICAO LPR test tasks and sections should not contain too few items, questions or tasks, or provide insufficient opportunities for test-takers to demonstrate their proficiency. Test reliability is influenced by the number of items or tasks test-takers are required to attempt. In cases where the test instrument contains an insufficient number of items or tasks, this can result in the test being unreliable. When there are fewer items or tasks the result of the test is more susceptible to being influenced by randomness.
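The link between the number of items and reliability can be illustrated with the Spearman-Brown prediction formula from classical test theory. This formula is not named in the guidance above; it is brought in here as an assumption, and it holds only if the added items are comparable in quality to the existing ones. The baseline of 10 items with a reliability of 0.60 is a hypothetical figure chosen for illustration, not data from any LPR test.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predict reliability when a test is lengthened or shortened.

    reliability   -- reliability of the existing test (0.0 to 1.0)
    length_factor -- new length divided by old length (2.0 = twice as many items)
    Assumes the added or removed items behave like the existing ones.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)


# Hypothetical baseline: a 10-item listening section with reliability 0.60.
BASE_ITEMS = 10
BASE_RELIABILITY = 0.60

for n_items in (5, 10, 20, 40):
    predicted = spearman_brown(BASE_RELIABILITY, n_items / BASE_ITEMS)
    print(f"{n_items:>2} items -> predicted reliability {predicted:.2f}")
```

Under these assumptions, halving the section lowers the predicted reliability and doubling it raises it, which is the pattern the paragraph above describes. The gain from each additional item becomes smaller as the test grows.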
In cases where a listening test contains items which are scored, it is important that there are enough items to ensure an adequate distribution of items across the different levels the test aims to assess. Too few items in this case can affect both the reliability and fairness of the test. All too often it is assumed that reliability is determined solely by how consistently and accurately raters rate test-takers’ performances. In fact, reliability is also a feature of how consistent the test instrument is in its ability to measure what it is designed to measure. In the LPRs, this means that listening tests need to contain not just a range of test task types but also enough items within each task type to allow those scores to be meaningful in terms of the test’s ability to measure comprehension skills.
Test instruments that provide more opportunities for test-takers to demonstrate their proficiency are also fairer. If a test-taker is unfamiliar with some aspects of the content or misunderstands the task requirements, this may negatively affect the result, even though the test-taker in fact has a higher language proficiency than that result would indicate. Including more items reduces the chance that these kinds of situations influence test results and therefore improves the reliability of the test. In addition, including more items, content and task types provides more scope for the test instrument to assess how well test-takers can understand and use language in a wider range of contexts associated with the domain.
In the speaking component of LPR test instruments, it is important to elicit a sample of language use from the test-taker that is both large enough and diverse enough in the type of language and discourse to be assessed (language associated with radiotelephony communication, incident debriefs, job discussions, recalling of work-related events, etc.) to allow a reliable assessment to be made. This is a feature of the test instrument design: speaking test tasks need to provide multiple and varied opportunities to elicit language use in a range of communication contexts relevant to the domain of radiotelephony communications.
Therefore, test instruments need to contain task types that represent an appropriate range of test-takers’ knowledge, skills and competencies to understand and communicate in real-world communication contexts (refer to Criterion 7).
What does this mean for test design?
A variety of different task types, items, situations and content needs to be included throughout the test instrument to ensure the domain and range of language proficiency levels are effectively sampled.
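One way a test developer might check this kind of sampling on a draft test form is sketched below. It is an illustration only: the `Item` record, the task type names, the target levels and the minimum of two items per category are all hypothetical and are not taken from ICAO guidance or from any existing test specification.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One scored item in a draft test form (hypothetical structure)."""
    task_type: str
    target_level: int


def coverage_gaps(items, required_task_types, required_levels, min_items):
    """List every required task type or level with fewer than min_items items."""
    by_type = Counter(item.task_type for item in items)
    by_level = Counter(item.target_level for item in items)
    gaps = []
    for task_type in required_task_types:
        if by_type[task_type] < min_items:
            gaps.append(f"task type '{task_type}': {by_type[task_type]} of {min_items} items")
    for level in required_levels:
        if by_level[level] < min_items:
            gaps.append(f"level {level}: {by_level[level]} of {min_items} items")
    return gaps


# Hypothetical draft of a listening section.
draft = [
    Item("short-answer", 3),
    Item("short-answer", 4),
    Item("short-answer", 4),
    Item("information-transfer", 4),
    Item("information-transfer", 5),
]

gaps = coverage_gaps(
    draft,
    required_task_types=["short-answer", "information-transfer", "multiple-choice"],
    required_levels=[3, 4, 5],
    min_items=2,  # illustrative threshold, not a recommended value
)
for gap in gaps:
    print("Under-sampled:", gap)
```

A count like this only shows whether each category is present in sufficient numbers. Whether the items genuinely represent the domain remains a matter for expert judgement.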
ICAO Statements & Remarks
The following statements from ICAO Document 9835 (2nd Edition, 2010) are related to this issue.
|3.2.1.||All uses of a language and all language-learning environments have unique characteristics that are the consequence of the context of communication and the tasks and purposes of the users.|
|3.2.2.||The context of the communication includes features such as:
a) domains (personal, occupational, etc.);
b) situations (physical location, institutional conventions, etc.);
c) conditions and constraints (acoustic interference, relative social status of speakers, time pressures, etc.);
d) mental contexts of the user and of the interlocutor (i.e. filtering of the external context through different perceptual mechanisms);
e) language activities (receptive/productive/interactive/ mediating); and
f) texts (spoken/written).|
|3.2.3.||The tasks and purposes of the users determine:
a) communication themes or topics;
b) dominant speech acts or language functions to be understood or produced;
c) dominant interactive schemata or speech-act sequences and exchange structures;
d) dominant strategies (e.g. interaction: turn-taking, cooperating, communication repair, etc.).|
|188.8.131.52.||Proficiency tests require test-takers to demonstrate their ability to do something representative of the full spectrum of required knowledge and skills, rather than to simply demonstrate how much of a quantifiable set of curriculum learning objectives they have learned. In an aviation context, proficiency testing should establish the ability of test-takers to effectively use appropriate language in operational conditions.|
|184.108.40.206.||A description and rationale for test construct and how it corresponds to the ICAO language proficiency requirements should be accessible to all decision-makers in plain, layperson language. […] A description of the test structure and an easy-to-understand explanation of reasons for the test structure is one form of evidence that it is an appropriate tool for evaluating language proficiency for the ICAO requirements for a given context.|
Why this issue is important
A test instrument that contains a variety of test tasks which:
- serve a required purpose;
- measure specific knowledge and skills related to the real-world communication needs; and
- play an effective role in how the test operates,
is more likely to ensure the test is effective in assessing what it aims to assess (i.e., it is likely to have higher validity). A test instrument which does not contain a sufficient range of task types designed to assess specific language knowledge and skills is likely to be ineffective in its ability to adequately represent the required range and complexity of language for assessment purposes, and will therefore have lower validity (i.e., interpretations made on the basis of its results will be less valid).
This happens in LPR testing when a test instrument covers only limited components of the construct, failing to fully assess proficiency across the domain of radiotelephony communications. An example is tests which overly rely on picture description and therefore overemphasize descriptive language at the expense of other language functions and communicative strategies and competencies that could arguably be more relevant to manage unexpected situations in radiotelephony communications.
Test instruments which contain multiple task types increase the fairness of the test. If a test contains only one task type and a test-taker is unable to engage effectively with it, this may negatively affect his/her overall test result, producing a final score which does not accurately reflect the test-taker’s language ability. Providing a range of task types allows test-takers to demonstrate their language knowledge and skills in a wider range of contexts, and therefore gives them more opportunities to demonstrate their abilities. If a test-taker does not engage effectively with one task type in a test that contains multiple test sections and task types, this has only a small impact on the test-taker’s overall result, which improves the fairness of the test’s results.
Test instruments need to include all important aspects of the range of language knowledge and language skills associated with radiotelephony communication. In other words, test instruments must include elements that are representative of the communication needs of pilots and controllers where plain English in radiotelephony communication is required. This can most effectively be achieved in language testing situations by including a range of test tasks, each specifically designed to assess separate elements of language knowledge and language skills.
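The effect of one poorly handled task on an overall result can be shown with simple arithmetic. The sketch below makes a strong simplifying assumption: the overall result is the plain average of task scores on a 0 to 100 scale. This is not how ICAO LPR ratings are determined, and the scores of 80 and 40 are hypothetical values chosen only to make the pattern visible.

```python
def overall_score(task_scores):
    """Overall result as the plain average of task scores (simplifying assumption)."""
    return sum(task_scores) / len(task_scores)


def score_with_one_weak_task(n_tasks, typical=80.0, weak=40.0):
    """Overall score when one task goes badly and the remaining tasks go as usual."""
    scores = [weak] + [typical] * (n_tasks - 1)
    return overall_score(scores)


# The test-taker would normally score 80 but engages poorly with one task (40).
for n_tasks in (1, 2, 4, 8):
    result = score_with_one_weak_task(n_tasks)
    print(f"{n_tasks} task(s): overall {result:.1f}, shortfall {80.0 - result:.1f}")
```

Under these assumptions the shortfall caused by the single weak task shrinks as the number of tasks grows, so the overall result moves closer to the level the test-taker usually demonstrates.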
For example, one task may be specifically designed to assess the test-takers’ abilities to use the correct vocabulary in unusual flight situations; another task may require test-takers to demonstrate their abilities to clarify information and resolve communication break-downs; while another test task might assess test-takers’ abilities to communicate using complex language to give or receive information about a situational complication.
A good example of the need for separate test tasks serving separate purposes is listening comprehension. Listening comprehension needs to be assessed through separate dedicated test tasks where speaking performance is not directly assessed. It should not be assessed in tasks specifically developed to assess speaking performance only (refer to Criterion 3).
Only through including a range of different test tasks can a test instrument effectively assess the types of language knowledge and language skills, the range of proficiency levels and the variety of communication contexts in which this communication occurs. As a consequence, this allows the results that the test gives to resemble communicative competence in the wide range of real-world communication situations the test aims to replicate.
Test tasks should also be designed with regard not only to the relevance of their content but also to how far the characteristics of real-world radiotelephony communication are incorporated into the test tasks.
In a similar way to making sure that all elements necessary to testing language proficiency in radiotelephony communications are included, test developers should ensure that no parts of the testing process are included that may unnecessarily affect the result in a negative way. This could mean, for example, that if a listening test requires test-takers to listen and read long texts associated with test items at the same time, the test is inadvertently assessing the ability to read and listen at the same time, a skill which the test is not aiming to assess.
In terms of washback effect, which is broadly understood as the impact of tests on teaching practices, if important aspects of what needs to be tested are not included in the test (are underrepresented in the test), learners – pilots and ATCOs – may not learn the important skills required for real-life communication (negative washback). This is because the test partly focusses on those aspects that are not relevant. Tests therefore need to contain a variety of test task types which adequately reflect the language knowledge and language skills associated with the operational needs of pilot and ATCO communication.
Best Practice Options
Test developers need to consider the following points when designing test instruments.
- What type and complexity of language and skills do we need to assess and in what kind of contexts?
Decide on what the test should and should not assess – establish specific boundaries to what the test should assess so that the test construct is clear.
- How can these areas of language knowledge and language skills be separated into specific and definable areas to help the assessment process work?
Separate parts of the test need to serve their own function and should not overlap too significantly in their aims with other parts of the test. For example, it is important that a test includes explicit speaking and listening comprehension tasks to ensure that speaking and listening comprehension skills can be evaluated for assessment purposes independently of each other (refer to Criterion 3).
- How can separated language areas and skills be reflected in different language tasks that best reflect the way communication happens in real-world contexts?
Consider types of test tasks that can enable the assessment process to maximize both practicality and authenticity.
- How will each test task contribute in a useful way to the overall result that the test gives?
Consider the role and overall level of importance of each part of the test, including the contribution each part of the test makes to the overall result and the level of difficulty.
- How will each of these test task types be put together to build the test instrument?
As test-takers work through the test there should be a logical flow, so that the test functions smoothly.
External References
The following references are provided in support of the guidance and best practice options above.
|1.||International Language Testing Association (ILTA) – Guidelines for Practice
A. Basic Considerations for good testing practice in all situations
1. The test developer’s understanding of just what the test, and each sub-part of it, is supposed to measure (its construct) must be clearly stated.
B. Responsibilities of test designers and test writers
2. A test designer must decide on the construct to be measured and state explicitly how that construct is to be operationalized.
3. The specifications of the test and the test tasks should be spelled out in detail.
Validity is “an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment” (p. 13).
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: Macmillan.
“However, it is not sufficient merely to select tasks that are relevant to the construct domain. In addition, the assessment should assemble tasks that are representative of the domain in some sense. The intent is to insure that all important parts of the construct domain are covered, which is usually described as selecting tasks that sample domain processes in terms of their functional importance. Both the content relevance and representativeness of assessment tasks are traditionally appraised by expert professional judgment, documentation of which serves to address the content aspect of construct validity” (p. 10).
Messick (1996) defines construct underrepresentation as a threat to validity in which “the test is too narrow and fails to include important dimensions or facets of focal constructs” and construct irrelevant variance as a threat to validity in which “the assessment is too broad, containing excess reliable variance that is irrelevant to the interpreted construct” (p. 4).
“In practice, test makers are mainly concerned about adverse consequences that are traceable to sources of test invalidity such as construct underrepresentation and construct-irrelevant difficulty. These concerns are especially salient in connection with issues of bias, fairness, and distributive justice, but also potentially with respect to negative washback. For example, if important constructs or aspects of constructs are underrepresented on the test, teachers might come to overemphasize those constructs that are well-represented and downplay those that are not” (p. 14).
Messick, S. (1996). Validity and washback in language testing. ETS Research Report Series, i–18. doi:10.1002/j.2333-8504.1996.tb01695.
|4.||Alderson, Clapham and Wall (1995)
“It is important to realize that the method used for testing a language ability may itself affect the student’s score. This is called the method effect, and its influence should be reduced as much as possible” (p. 44).
“…the best advice that can be offered to item writers is: ensure that you use more than one test method for testing any ability. A useful discipline is to devise a test item to cover some desired ability or objective, then to devise another item testing the same ability using a different method or item type. […] In general, the more different methods a test employs, the more confidence we can have that the test is not biased towards one particular method or to one particular sort of learner” (p. 45).
“After being drafted, test items/tasks should be assembled into a draft test paper/subtest. It should then be considered by an editing committee (a group of testing experts and SMEs) for the degree of match with the test specifications, likely level of difficulty, wording of instructions, etc. The committee should also consider the test as a whole, with special attention to the overall balance of the subtest or paper” (p. 63).
Some components of the test construct may be easier to assess than others; therefore, “item writers sometimes find they are unable or unwilling to test the more difficult aspects, and, because of this, the content of some test papers may be unbalanced” (p. 69).
Alderson, J. C., Wall, D., & Clapham, C. (1995). Language test construction and evaluation. Cambridge: Cambridge University Press.
|5.||Fulcher and Davidson (2007)
“The role of task design in language testing is closely linked to what we argue will constitute evidence for the degree of presence or absence of the kinds of knowledge or abilities (construct) to which we wish to make inferences” (p. 64).
“Each item or task must elicit evidence that will be useful in drawing inferences to constructs. In order to do this, the designer must be clear about the kind of responses that are to be expected from tasks, and be able to define those features of tasks that are critical in eliciting relevant evidence” (p. 67).
In order to combine test items or tasks, test developers should follow test assembly rules, which tell us “how many items of certain types are to be used, what content coverage must be achieved, which sections they are to be placed in and what the intended measurement properties are” (p. 119).
Fulcher, G., & Davidson, F. (2007). Language testing and assessment: An advanced resource book. London and New York: Routledge.
|6.||Fulcher and Davidson (2009)
“Many specifications are needed to create a test, at least one for each of the different components (sections, tasks, items) from which the whole is constructed. On top of this are the test specifications which additionally include information on presentation, assembly and delivery” (p. 129).
Fulcher, G., & Davidson, F. (2009). Test architecture, test retrofit. Language Testing, 26(1), 123-144.
“The test assembly specification therefore plays a critical role in showing that the number and range of items in any form of the test adequately represent the key features of the criterion situation in the real world” (p. 128).
Fulcher, G. (2010). Practical language testing. London: Hodder Education.
|8.||Davidson and Lynch (2002)
“Writing a full test is frequently a process of assembling a number of tasks generated by several specs. Specs and their tasks exist in a “one-to-many” relationship; that is, each spec is designed to produce many tasks of the same type. A test will therefore be made up of many tasks of different types, each produced from a different spec” (p. 60).
Davidson, F. & Lynch, B. K. (2002). Testcraft. A teacher’s guide to writing and using language test specifications. New Haven: Yale University Press.
Douglas (2000) defines a language for specific purpose test as “the combined set of tasks that will give us as complete a picture as possible of the test-taker’s field specific language ability” (p. 248). He discusses the importance of a specification document in outlining the necessary information to describe the content of a test and distinguishes two stages in the test development process: “the specifications containing all the information needed to produce the test/tasks and the operationalization stage referring to the actual production of test materials” (p. 249).
Douglas, D. (2000). Assessing languages for specific purposes. Cambridge: Cambridge University Press.
“We also have to ensure that the sample of communicative language ability in our tests is as representative as possible. What and how to sample in our tests is a key issue in language testing. If we are to extrapolate from our test data and make statements about communicative language ability in real-life situations, great care needs to be taken with the tasks we employ in our tests. The more closely we can specify what needs to be tested, the more representative our sampling in tests might become” (p. 29).
Weir, C. J. (1993). Understanding and developing language tests. New York, N.Y: Prentice Hall.
“We need to ensure that the constructs we are eliciting are precisely those we intend to and that these are not contaminated by other irrelevant variables, such as method effect. If important constructs are under-represented in our tests, this may have an adverse washback effect on the teaching that precedes the test, teachers may simply not teach certain important skills if they are not in the test” (p. 18).
Weir, C. J. (2005). Language testing and validation: An evidence-based approach. Basingstoke: Palgrave Macmillan.
|12.||Empirical Research: Moder and Halleck (2009)
The authors investigated the variation in oral proficiency demonstrated by 14 ATCOs across two types of testing tasks: work-related radiotelephony-based tasks and non-specific English tasks on aviation topics (common occurrence and less expected occurrence). The results demonstrate significant differences in the performance of test-takers across task types with respect to the established minimum required proficiency, Operational Level 4. “Of greater concern from a public safety perspective is the finding that some controllers performed at Operational level 4 on one of the general description tasks in the aviation context tasks and failed to demonstrate minimum proficiency on the radiotelephony tasks. In such a case, the general aviation task would have inaccurately predicted the controller’s performance level on a critical workplace task” (p. 13).
Moder, C. L. & Halleck, G. B. (2009). Planes, politics and oral proficiency: Testing international air traffic controllers. Australian Review of Applied Linguistics, 32(3), 25.1-25.16. DOI 10.2104/aral0925.