Designing language proficiency tests

I.     Introduction

A continuous focus on quality and productivity forces many institutions to assess the competency of potential candidates. Effective communication strengthens proficiency, while ineffective communication weakens it. Language testing can be a powerful tool for measuring proficiency when it follows a detailed and systematic test design process, which increases construct validity and therefore enhances quality (Brown, 2000; ETS, 2009; Hughes, 2003).

This paper looks at testing language proficiency by presenting a literature review of the current and relevant issues related to language proficiency testing. Furthermore, techniques for the listening, reading, speaking, and writing sections are discussed in light of assessment and testing theories. Finally, the strengths of these tests are identified and further improvements are suggested.

II.  Language proficiency tests

2.1.1.      Listening

Real-life activities such as attending meetings, listening to lectures, television and radio programs, or public announcements, and participating in discussions or debates all require proficiency in listening. Efficient listening skills increase one’s effectiveness by providing a better understanding of a task’s objectives and underlying meaning (Shahomy, 1997).

Listening tests assess a candidate’s ability to understand the main idea, follow the key points and arguments, get specific facts, and recognize the implied meaning, among others (Hughes, 1989). Validity is maintained through the use of relevant and authentic monologues, dialogues, and multiple-participant discussions that feature both native and non-native speakers in narrative, instructive, descriptive, expository or argumentative formats (ETS, 2011; Hughes, 2003). Recommended testing techniques include multiple choice, sentence completion, gap filling, and true/false items.

The listening section should contain clear and unambiguous instructions that identify and capture the underlying skills and knowledge. The section directly assesses basic comprehension of main idea and meaning, making inferences, and drawing conclusions, while indirectly assessing essential note-taking skills (ETS, 2009; Hughes, 1989). Extra time should be given to candidates with disabilities, if required.

Content validity is addressed by using authentic academic content that draws on a range of English accents, which reduces bias and non-academic usage while maintaining topic comparability (Sawaki et al., 2009, cited in Anderson, 2009: 623). The lecture dialogues should be delivered at around 140 words per minute for clarity and decipherability (Hughes, 2003: 162).
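The speech-rate guideline above can be checked against a recording's transcript. The following is a minimal sketch in Python, not a procedure taken from the cited sources: the function names and the tolerance of 15 words per minute are illustrative assumptions, and words are counted simply by splitting on whitespace.

```python
def words_per_minute(transcript: str, duration_seconds: float) -> float:
    """Return the average speech rate of a recording in words per minute.

    Words are counted by whitespace splitting, which is a simplifying assumption.
    """
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    word_count = len(transcript.split())
    return word_count * 60.0 / duration_seconds


def within_target_rate(rate_wpm: float,
                       target_wpm: float = 140.0,
                       tolerance_wpm: float = 15.0) -> bool:
    """Check whether a rate is close to the target.

    The 140 wpm target comes from the text; the tolerance is hypothetical.
    """
    return abs(rate_wpm - target_wpm) <= tolerance_wpm


if __name__ == "__main__":
    # A hypothetical 70-word script read aloud in 30 seconds.
    script = " ".join(["word"] * 70)
    rate = words_per_minute(script, 30)
    print(rate, within_target_rate(rate))
```

A test writer could run such a check on each draft script before recording and re-time any passage that falls outside the chosen tolerance.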

The multiple-choice method is very common. From an administrative perspective, multiple choice items offer an efficient and objective way to assess candidates, allow a greater range of material and a wider range of difficulty, and render positive backwash of test and candidate’s difficulty (Hughes, 1989: 59-62). From an educational perspective, multiple choice items test many levels of learning as well as a candidate’s ability to integrate information, and they give the candidate constructive feedback about the correctness of the right answers and the incorrectness of the distractors.

Using five options instead of four, and avoiding response options such as “all of the above”, “both A & B” or similar, makes it harder for candidates to guess the correct answer with only partial knowledge. Furthermore, this approach allows for a more cost-effective and valid discrimination between the stronger and weaker candidates (Hughes, 1989: 59-60).
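The effect of the number of options on guessing can be illustrated with a short calculation. The sketch below assumes the simplest case, in which a candidate guesses blindly and independently on every item; it does not model the partial knowledge mentioned above, and the function names and the 40-item test are hypothetical.

```python
from math import comb


def chance_score(num_items: int, num_options: int) -> float:
    """Expected number of correct answers from blind guessing alone."""
    return num_items / num_options


def prob_at_least_by_guessing(num_items: int, num_options: int,
                              min_correct: int) -> float:
    """Probability of reaching min_correct or more by blind guessing.

    Uses the binomial distribution with success probability 1/num_options.
    """
    p = 1.0 / num_options
    return sum(
        comb(num_items, k) * p**k * (1 - p) ** (num_items - k)
        for k in range(min_correct, num_items + 1)
    )


if __name__ == "__main__":
    # Hypothetical 40-item test with a pass mark of 20 correct answers.
    for options in (4, 5):
        print(options,
              chance_score(40, options),
              prob_at_least_by_guessing(40, options, 20))
```

Under these assumptions the expected chance score on 40 items falls from 10 with four options to 8 with five, and the probability of reaching any given pass mark by guessing alone is lower with five options.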

Nevertheless, multiple choice questions have disadvantages. From an administrative perspective, the design of multiple choice items is very time consuming and challenging (Hughes, 1989: 61). From an educational perspective, multiple choice items do not allow test takers to demonstrate knowledge beyond the options provided and may even encourage some guessing, which does not indicate a candidate’s true proficiency (Anderson, 2009).

The listening section may contain a sentence completion or information transfer technique, with two points awarded for each correct answer. This technique requires a higher level of understanding and memory than the multiple choice technique. According to Shizuka (2004), it offers an economical and uniform measure of proficiency while maintaining a high level of reliability, validity, and discrimination. Furthermore, it closely emulates the academic context, where note-taking occurs frequently. However, it is argued that this technique produces an unreliable result for overall listening ability (Hughes, 1989: 66).

The reliability of the listening test is maintained by using reasonably spaced, unambiguous questions with clear and concise scoring criteria (Hughes, 2003). Furthermore, grammar or spelling errors are not penalized in the scoring of this receptive skill (Hughes, 1989: 137-139). Moreover, collecting the responses and marking them centrally provides comparability of scoring and maintains scorer reliability (Anderson, 2009; Uysal, 2010).

2.1.2.      Reading

Developing strong reading skills means actively interacting with the text in order to identify main ideas, ask questions, and draw conclusions. Good comprehension involves effectively decoding and interpreting information using both macro and micro skills. These skills may include: reading for the gist; identifying main ideas or details; understanding inferences and implied meaning; recognising the writer’s opinions; identifying attitudes and purposes; following the development of an argument; skimming; scanning; and guessing the meaning of unfamiliar words (Hughes, 1989: 117).

High content validity can be achieved by using authentic, non-technical and familiar material of a formal or informal type, carefully selected from textbooks, newspaper, journal, or magazine articles, or advertisements. Relevant content may follow narrative, instructive, descriptive or expository formats and include optional graphical content such as: flow charts; tables and graphs; diagrams; and illustrations. The recommended test techniques include multiple choice, short-answer, and fill-in-the-blank items to identify information and views or inferred meanings (Hughes, 2003).

Clear and concise instructions and a clear marking system are recommended, with no penalty given for grammatical, spelling, or punctuation errors (Hughes, 1989: 137-139).

The reading section should contain clear and unambiguous instructions to identify and capture the skills and knowledge. Clearly marked line numbers and an appropriate font improve readability (Hughes, 2003: 140). Candidates with disabilities should be given extra time, if required. The items should be designed to measure detailed comprehension, the identification of inferences, and reading to learn. The texts may be of a descriptive format and use multiple choice and true/false techniques to assess skills. They should contain authentic, familiar and interesting content with some specific details, and assess general reading skills such as reading for gist, main ideas and implied meaning.

As already discussed in the listening section, there are several reasons for using the multiple choice technique. Mainly, it provides an efficient and objective way to assess candidates while allowing for a greater range of material with a wider range of difficulty, renders positive backwash of test and candidate’s difficulty, and gives the candidate constructive feedback about the correctness of the answers and the incorrectness of the distractors (Hughes, 1989).

2.1.3.      Speaking

Various studies have shown that speaking and listening skills are interconnected and include productive and receptive skills (Ekbatani, 2011; Hughes, 2003).  Speaking skills involve: providing information (transactional skills); expressing emotion (interactional skills); and managing the conversation (talk management skills).

To make the scoring of speaking valid, test designers should use rubric-based marking that assesses: fluency and coherence; appropriateness; grammatical range and accuracy; lexical flexibility; contribution; accent; and pronunciation (ETS, 2000a; Hughes, 2003: 127).

Valid material should contain monologue, dialogue, or multi-member conversations in either technical or non-technical contexts of a formal or informal type. The recommended speaking test techniques are interactive interviews that use questions or pictures to elicit short or long responses. The speaking test should last between 10 and 30 minutes, with a sufficient number of different topics printed for the candidate’s selection (Anderson, 2009; ETS, 2009). Candidates with disabilities may request extra time.

The speaking section should contain clear and unambiguous instructions that identify and capture the underlying skills and knowledge about authentic and familiar topics. Candidates should use a full range of speaking skills (transactional, interactional, talk management) during a longer answer. Candidates select a topic of their choice and prepare an answer. This involves comparing, conveying information, explaining ideas and defending opinions clearly, coherently and accurately (Hughes, 2003).

A marking scale which closely reflects the theory of communicative competence is recommended. The marking criteria may include: general description (intelligibility, task fulfilment and coherence); delivery (fluency, clarity, intonation, stress, pronunciation); language use (vocabulary and grammar); and topic development (relevance, relationship and progression of ideas). Responses should be recorded and centrally marked to increase reliability and marker objectivity and to reduce bias (Davis et al., 2003: 580; ETS, 2004a; ETS, 2011: 1; Hughes, 2003: 127).

Using an interlocutor who is separate from the marker allows the interlocutor to focus on administering the test, while candidates are put at ease and their anxiety levels are reduced.

Statistical analysis of the collected results is a vital part of the constant monitoring of validity and reliability, and it provides valid backwash for updating the handbook and retraining markers (Brown, 2000; Hughes, 1989; Shavelson et al., 2002).
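The text does not name the statistics used for this monitoring. As one illustration, an internal-consistency coefficient such as Cronbach's alpha is commonly computed from item-level scores. The sketch below is an assumption about what such monitoring might include, not the procedure of any cited test; the function name and the sample data are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(scores: list[list[float]]) -> float:
    """Cronbach's alpha for a score matrix.

    scores holds one row per candidate and one column per item.
    alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)
    """
    if len(scores) < 2:
        raise ValueError("at least two candidates are required")
    k = len(scores[0])
    if k < 2:
        raise ValueError("at least two items are required")
    if any(len(row) != k for row in scores):
        raise ValueError("every candidate needs a score for every item")

    # Variance of each item across candidates (population variance throughout).
    item_variances = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores do not vary, alpha is undefined")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


if __name__ == "__main__":
    # Hypothetical right/wrong scores for five candidates on four items.
    sample = [
        [1, 1, 1, 0],
        [1, 0, 1, 1],
        [0, 0, 1, 0],
        [1, 1, 1, 1],
        [0, 0, 0, 0],
    ]
    print(round(cronbach_alpha(sample), 3))
```

Values closer to 1 suggest that the items rank candidates consistently. A falling coefficient between administrations would be one possible signal that items or marker training need review.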

2.1.4.      Writing

Many professionals view writing skills as crucial to effective communication. The use of accessible, authentic, challenging, and interesting introductory topics is recommended for candidates to describe, explain, compare, contrast, classify, and argue for or against a viewpoint. The assessed skills involve: developing a topic through knowledge; organizing an essay with cohesion and coherence; and providing arguments with appropriate use of vocabulary, structures and written expressions, punctuation, spelling, and upper and lower case (Hughes, 2003).

With a strong emphasis on authenticity, which is defined as “the degree to which the test tasks represent the tasks that we expect the students to perform in real-life situations” (Ekbatani, 2011: 61), typical questions should encourage responses that are relevant and non-specialized and that contain between 250 and 300 words. Furthermore, responses should be directed towards native or non-native English speakers or university lecturers, using one formal and consistent English dialect.

The writing test can involve various tasks, where candidates are provided with brief but clear instructions that include the type of writing required, the time allowed and the expected minimum number of words. Closed-ended tasks ask candidates to look at a diagram of a process and write a shorter essay that expresses opinions and choices.

Responses may be rated both holistically and analytically. Holistic marking involves assessing the overall impression on a certain scale. This scoring is rapid but less reliable. Analytical marking assesses separate aspects: development; organization; and appropriate and accurate use of grammar and vocabulary. This type of marking is more thorough but less efficient, and the quality of the response as a whole remains more important than that of its separate parts (Hughes, 2003). Responses to both tasks are also scored on the quality of the writing and on the completeness and accuracy of the content (ETS, 2004b).
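The difference between the two approaches can be sketched in code. In the example below, an analytic score is the mean of separate band ratings, while a holistic score is a single band given by the marker. The criteria names follow the text, but the 0 to 5 band scale, the equal weighting, and the function name are illustrative assumptions rather than any published rubric.

```python
ANALYTIC_CRITERIA = ("development", "organization", "grammar", "vocabulary")


def analytic_score(ratings: dict[str, int], max_band: int = 5) -> float:
    """Combine per-criterion band ratings into one analytic score.

    Equal weighting and a 0..max_band scale are assumptions for illustration.
    """
    missing = [c for c in ANALYTIC_CRITERIA if c not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {', '.join(missing)}")
    for criterion in ANALYTIC_CRITERIA:
        if not 0 <= ratings[criterion] <= max_band:
            raise ValueError(f"{criterion} rating is outside 0..{max_band}")
    return sum(ratings[c] for c in ANALYTIC_CRITERIA) / len(ANALYTIC_CRITERIA)


if __name__ == "__main__":
    # Hypothetical ratings from one marker for one essay.
    essay_ratings = {
        "development": 4,
        "organization": 3,
        "grammar": 5,
        "vocabulary": 4,
    }
    holistic_band = 4  # a single overall-impression band from the same marker
    print(analytic_score(essay_ratings), holistic_band)
```

Keeping the per-criterion ratings alongside the combined score lets markers see where an essay is strong or weak, which a single holistic band cannot show.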

Open-ended tasks give candidates one topic on which to write an essay that expresses opinions or choices. Only one topic is given to restrict the candidates, thus increasing validity and reliability (Hughes, 2003: 94). Candidates are usually given 40 minutes to prepare and write a response of 300 words. Candidates with disabilities are given extra time. Note-taking is allowed throughout this type of task.

All writing tasks should enable candidates to express themselves in their own words through an organized and integrated format. Studies indicate that candidates prepare more for essay-type examinations than for multiple-choice tests, thus increasing the validity of the results. Essay assessment forces candidates to use both macro and micro (underlying) skills as they are obliged to focus on broader issues and the interrelationships between them rather than on specific details (McKeachie, 1986). However, as these tasks use only a small number of questions, content validity may be affected. Furthermore, the threat to reliability from subjective or inconsistent marking is reduced by centrally marking the collected responses using a rubric-based scoring system (Enright & Quinlan, 2009; ETS, 2009b).

III.             Conclusion

This paper has analysed relevant literature on the testing of listening, reading, speaking and writing in language tests. Furthermore, it has examined the mechanics of designing international proficiency tests based on recent assessment and testing theories of second language acquisition.

The strengths of proficiency tests relate to the cost-effective, direct, objective, and reliable characteristics of multiple choice techniques. Furthermore, the use of authentic and relevant material, which reflects “the degree to which the test tasks represent the tasks that we expect the students to perform in real-life situations” (Ekbatani, 2011: 61), increases quality. Moreover, standardized administration, such as central marking and number-based candidate identification, increases the objectivity and reliability of these tests (Hughes, 1989: 35, 42).

The constant evaluation of validity and reliability issues and positive backwash can further improve the overall quality of future updates, handbooks, and marker training (Brown, 2000: 385-387; Hughes, 1989: 22, 29; Shavelson et al., 2002: 5-6; Uysal, 2010).

Finally, as with all proficiency tests, the weakness lies in the design itself. Sampling a candidate’s competency in a closed system at one particular time is inauthentic and not a true indication of the communicative competencies that can be assumed externally (Fink, 2005: 53, 78; Valdman, 1988: 125, cited in Brown, 2000: 396). It is further argued that while standardised, large-scale tests are practical and reliable, they face fundamental difficulties in finding effective ways to connect with the communicative abilities of examinees (Brown, 2000).

Copyright, Robert Mijas 2012.

IV.              References

Anderson, C. (2009). Test of English as a Foreign Language: Internet Based Test (TOEFL iBT). Language Testing, Vol. 28, No. 4, 621-631.

Brown, H.D. (2000). Teaching by principles: An interactive approach to language pedagogy. White Plains, NY: Pearson Education.

Carroll, J. B. (1961). Fundamental considerations in testing English language proficiency of foreign students. In H. B. Allen & R. N. Campbell (Eds.), Teaching English as a second language (2nd ed., pp. 313–321). New York: McGraw-Hill.

Davis, A., Hamp-Lyons, L., & Kemp, C. (2003). Whose norms? International proficiency tests in English. World Englishes, Vol. 22, No. 4, 571-584.

Hughes, A. (1989). Testing for language teachers (1st ed.). Cambridge, NY: Cambridge University Press.

Hughes, A. (2003). Testing for language teachers (2nd ed.). Cambridge, New York: Cambridge University Press.

Ekbatani, G. (2011). Measurement and evaluation in post-secondary ESL. New York: Routledge.

Enright, M. K. & Quinlan, T. (2009). Complementing human judgment of essays written by English language learners with e-rater® scoring. Manuscript submitted for publication, Educational Testing Service.

ETS. (2004a). iBT/Next Generation TOEFL Test Independent Speaking Rubrics (Scoring Standards). Retrieved September 10, 2011, from

ETS. (2004b). iBT/Next Generation TOEFL Integrated Writing Rubrics (Scoring Standards). Retrieved October 20, 2011, from

ETS. (2009). TOEFL iBT at a Glance. Retrieved September 15, 2011, from

ETS. (2011). Reliability and comparability of TOEFL iBT scores. TOEFL iBT research, series 1 Vol 3. Retrieved September 05, 2011, from

Fink, A. (2005). Conducting research literature reviews: from internet to paper. California: Sage Publications.

Lado, R. (1961). Language testing. New York: McGraw-Hill.

McKeachie, W. J. (1986).  Teaching Tips. (8th ed.) Lexington, Mass.: Heath.

Sawaki, Y., & Nissan, S. (2009). Criterion-Related Validity of the TOEFL® iBT Listening Section. Princeton, NJ: ETS.

Sawaki, Y., Stricker, L. J., & Oranje, A. H. (2009). Factor structure of the TOEFL Internet-based test. Language Testing, 26(1), 5–30.

Shahomy, E. (1997). Critical language testing and beyond. Paper delivered at the American Association of Applied Linguistics, Orlando, Fl, March.

Shavelson, R.J., Eisner, E.W. & Olkin, L. (2002). ‘In memory of Lee J. Cronbach (1916-2001).’ Educational Measurement: Issues and Practice 21, 2, 171-177.

Shizuka, T. (2004). Reliability and Validity of “Invisible-Gap Filling” Items. JLTA Journal, 6, 108-127.

Uysal, H. (2010). A critical review of the IELTS writing test. ELT Journal Volume 64/3.
