Quality Test Construction


A good classroom test is valid and reliable.

Validity is the degree to which a test measures what it is intended to measure. It is the degree to which evidence, common sense, or theory supports the interpretations or conclusions drawn about a student from his or her test performance. More simply, it is how one knows that a math test measures students' math ability, not their reading ability. Another aspect of test validity of particular importance for classroom teachers is content-related validity: do the items on a test fairly represent the full range of items that could be on the test? Reasonable sources for the items that should be on the test are class objectives, key concepts covered in lectures, main ideas, and so on. Classroom teachers who want to make sure that they have a valid test from a content standpoint often construct a table of specifications, which lists what was taught and how many items on the test will cover each topic. The table can even be shared with students to guide them in studying for the test and to serve as an outline of what was most important in a unit or topic.
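
To make the table of specifications concrete, the short Python sketch below builds one. The topics, instructional emphases, and 40-item test length are all hypothetical; the point is simply that items are allocated to each topic in proportion to the emphasis it received in class.

```python
# A minimal sketch of a table of specifications, assuming hypothetical
# topics, instructional emphasis (share of class time), and a 40-item test.
# Items are allocated to each topic in proportion to its emphasis.

topics = {                    # topic -> share of instructional time (hypothetical)
    "Fractions": 0.40,
    "Decimals": 0.30,
    "Percents": 0.20,
    "Word problems": 0.10,
}
TOTAL_ITEMS = 40

print(f"{'Topic':<15}{'Emphasis':>10}{'Items':>7}")
for topic, weight in topics.items():
    n_items = round(weight * TOTAL_ITEMS)   # rounding may need a small manual adjustment
    print(f"{topic:<15}{weight:>10.0%}{n_items:>7}")
```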
 
Reliability is the degree to which a test produces scores that are not affected by chance. Students sometimes randomly miss a question they really knew the answer to or get an answer correct just by guessing; teachers can sometimes make a scoring error or score inconsistently on subjectively scored tests. These are problems of low reliability. Classroom teachers can address low reliability in some simple ways. First, a test with many items will usually be more reliable than a shorter test, because whatever random fluctuations in performance occur over the course of the test will tend to cancel one another out across many items. By the same token, a class grade will itself be more reliable if it reflects many different assignments or components. Second, the more objective a test is, the fewer random errors there will be in scoring, so teachers concerned about reliability are often drawn to objectively scored tests. Even when using a subjective format, such as supply items, teachers often use a detailed scoring rubric to make the scoring as objective, and therefore as reliable, as possible.
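
The effect of test length on reliability can be quantified with the classical Spearman-Brown prophecy formula. The sketch below applies it to a hypothetical test; the reliability value of .60 and the 20-item starting length are invented for illustration.

```python
# The Spearman-Brown prophecy formula predicts the reliability of a test
# whose length is changed by a factor n, given its current reliability r:
#     predicted_r = (n * r) / (1 + (n - 1) * r)

def spearman_brown(r: float, n: float) -> float:
    """Predicted reliability after multiplying test length by n."""
    return (n * r) / (1 + (n - 1) * r)

# Hypothetical example: a 20-item test with reliability .60, doubled to
# 40 comparable items (n = 2).
print(round(spearman_brown(0.60, 2), 2))  # 0.75
```

Note that the gain assumes the added items are comparable in quality to the originals; padding a test with poor items does not buy reliability.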
 
Classroom tests can also be categorized based on what they are intended to measure. Traditional paper-and-pencil classroom tests (e.g., multiple-choice, matching, true-false) are best used to measure knowledge. They are typically objectively scored (a computer with an answer key could score them). Performance-based tests, sometimes called authentic or alternative tests, are best used to assess student skill or ability. They are typically subjectively scored (a teacher must apply some degree of judgment in evaluating the quality of a response). Performance-based tests are discussed in a separate area on this website.
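
The parenthetical point about an answer key can be made literal. Here is a minimal sketch of objective scoring, with a hypothetical key and a hypothetical set of student responses; any two scorers (or a computer) running this procedure will produce the same score.

```python
# A minimal sketch of objective scoring: a selection-item test scored
# mechanically against an answer key. Key and responses are hypothetical.

ANSWER_KEY = {1: "B", 2: "D", 3: "A", 4: "C"}

def score(responses: dict) -> int:
    """Count the items where the student's response matches the key."""
    return sum(1 for item, correct in ANSWER_KEY.items()
               if responses.get(item) == correct)

student = {1: "B", 2: "D", 3: "C", 4: "C"}
print(f"Score: {score(student)} / {len(ANSWER_KEY)}")  # Score: 3 / 4
```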
 
Tests designed to measure knowledge are usually made up of a set of individual questions. Questions can be of two types: a) selection (or select) items, which ask students to select the correct answer from a list of possible answers (e.g., multiple-choice, matching), and b) supply items, which require students to supply the correct answer (e.g., fill-in-the-blank, short answer). Scoring selection items is usually quicker and more objective; scoring supply items tends to take more time and is usually more subjective. Sometimes teachers decide to use selection items when they are interested in measuring basic, lower levels of understanding (the knowledge or comprehension levels of Bloom's taxonomy; Bloom et al., 1956) and supply items when they are interested in higher levels of understanding, but a well-written selection item can still get at higher levels of understanding.
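
For supply items, the detailed scoring rubric mentioned earlier is the usual way to rein in subjectivity. The sketch below shows the idea with a hypothetical rubric; the criteria and point values are invented, and the gain comes from spelling out the criteria before scoring begins.

```python
# A minimal sketch of rubric-based scoring for a supply item. The criteria
# and point values are hypothetical; fixing them in advance makes
# subjective scoring more consistent from response to response.

RUBRIC = [
    ("States the main idea", 2),
    ("Gives a supporting example", 2),
    ("Uses correct terminology", 1),
]

def score_response(criteria_met: set) -> int:
    """Total the points for every rubric criterion the response satisfies."""
    return sum(points for criterion, points in RUBRIC
               if criterion in criteria_met)

print(score_response({"States the main idea", "Uses correct terminology"}))  # 3
```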
 
Teacher-made tests can also be distinguished by when they are given and how the results are used. Tests given at the end of a unit or semester, after learning has occurred, are called summative tests. Their purpose is to assess learning and performance, and their results usually affect a student's class grade. Tests given while learning is still occurring are called formative tests. Their purpose is to provide feedback so that students can adjust how they are learning or teachers can adjust how they are teaching. Usually these tests do not affect student grades.
 
Classroom assessment is an integral part of teaching (Chase, 1999; Popham, 2002; Trice, 2000; Ward & Murray-Ward, 1999) and may take more than one-third of a teacher's professional time (Stiggins, 1991). Most classroom assessment involves tests that teachers have constructed themselves. It is estimated that 54 teacher-made tests are used in a typical classroom per year (Marso & Pigge, 1988), which results in perhaps billions of unique assessments yearly worldwide (Worthen, Borg, & White, 1993). Regardless of the exact frequency, teachers regularly use tests they have constructed themselves (Boothroyd, McMorris, & Pruzek, 1992; Marso & Pigge, 1988; Williams, 1991). Further, teachers place more weight on their own tests in determining grades and student progress than they do on assessments designed by others or on other data sources (Boothroyd et al., 1992; Fennessey, 1982; Stiggins & Bridgeford, 1985; Williams, 1991).
 
Most teachers believe that they need strong measurement skills (Wise, Lukin, & Roos, 1991). While some report that they are confident in their ability to produce valid and reliable tests (Oescher & Kirby, 1990; Wise et al., 1991), others report discomfort with the quality of their own tests (Stiggins & Bridgeford, 1985) or believe that their training was inadequate (Wise et al., 1991). Indeed, most state certification systems and half of all teacher education programs have no assessment course requirement, or even an explicit requirement that teachers receive training in assessment (Boothroyd et al., 1992; Stiggins, 1991; Trice, 2000; Wise et al., 1991). In addition, teachers have historically received little or no training or support after certification (Herman & Dorr-Bremme, 1984). The formal assessment training teachers do receive often focuses on large-scale test administration and standardized test score interpretation rather than on the test construction strategies or item-writing rules that teachers need (Stiggins, 1991; Stiggins & Bridgeford, 1985).
 
A quality teacher-made test should follow valid item-writing rules. However, empirical studies establishing the validity of item-writing rules are in short supply and often inconclusive; "item-writing rules are based primarily on common sense and the conventional wisdom of test experts" (Millman & Greene, 1993, p. 353). Even after half a century of psychometric theory and research, Cronbach (1970) bemoaned the almost complete lack of scholarly attention paid to achievement test items. Twenty years after Cronbach's warning, Haladyna and Downing (1989) reasserted this claim, stating that the body of knowledge about multiple-choice item writing, for example, was still quite limited; when revisiting the issue a decade later, they added that "item writing is still largely a creative act" (Haladyna, Downing, & Rodriguez, 2002, p. 329).
 
The current empirical research literature on item-writing rules of thumb consists mainly of studies that examine the relationship between a given item format and either test performance or the psychometric properties associated with that format choice. A few guidelines are supported by experimental or quasi-experimental designs, but the foundation of best practice in this area remains, essentially, the recommendations of experts. Common sense, along with an understanding of the two characteristics of all quality tests (validity and reliability), provides the framework teachers can use to make the best choices when designing student assessments.
 
Developed by: Bruce B. Frey, Ph.D., University of Kansas