
Arbeiten aus Anglistik und Amerikanistik (AAA), Band 43 (2018), Heft 1
ISSN 0171-5410 / 2941-0762
Gunter Narr Verlag Tübingen

This is an open-access article published under the terms of the CC BY 4.0 license (http://creativecommons.org/licenses/by/4.0/).

Conceptualization of Validity in Educational Testing — Historical Discussion and Contemporary Consensus

Nikola Dobrić
There are few notions in testing and assessment that have been written about as much as validity. Seen as the central psychometric issue, it has had a long history of theoretical and practical development and has stirred up considerable controversy within academic and non-academic ranks over time. The present paper traces this development within educational (and psychological) testing and presents the current cutting edge.
1. Introduction

When it comes to any scientific endeavor, validity is, next to reliability, replicability, and generalizability (Sigott 2004: 43), one of the most important components of its academic and logical viability. Focusing on educational (and psychological) testing, it is also arguably the core component of psychometrics (Sigott 1994: 287). However, despite its apparent centrality, the concept of validity is hotly debated in terms of what it should entail and how it can be achieved and demonstrated when it comes to tests. The fact that, after some seven decades of intensive development in the field of test validation, we still cannot offer a clear, commonly agreed-upon account of the validity of educational tests creates distrust and apprehension when it comes to the application of validation research outside of the expert community. Hence, it is important to unpack what precisely validity in educational (and psychological) testing entails and to define the scientific steps that have led us to the accounts we consider as cutting-edge today.

To start off with a general definition, validity in educational testing is essentially meant to indicate that the results obtained from a certain measurement procedure do in fact objectively reflect the phenomenon it is intended to measure (and are not due to any measurement-irrelevant variables or chance). Whereas reliability concerns itself with the question of how much variance is due to measurement error and other unrelated factors, validity focuses on the question of which specific abilities being measured account for the attested reliable variance of measurement (Bachman 1990: 239). Seen this way, validity then depends on the extent to which quantified measurements of presumptive behavior or ability are clearly distinguishable (Sigott 2004: 44). In other words, reliability can be understood as the quality of the data collected, while validity is the quality of the inferences (and decisions) we make subsequent to the measurement (Chan 2014: 9). In this sense validity also depends on the degree to which the aspects of the said behavior or ability the test is supposed to measure are covered by the given test (Sigott 2004: 44). Validation is the process in which we gather and evaluate evidence to support the said appropriateness, meaningfulness, and usefulness of the inferences and decisions we make based on measurement scores (Zumbo 2007; 2009). Adapting this rather abstract and general idea of scientific evaluation of measurement procedures to the area of educational (and closely related psychological) testing has required a great effort on the part of the academic community, has represented a winding developmental journey, and is still, to a certain extent, fraught with disagreement and controversy.
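The distinction between the two notions can be made concrete with a small, purely illustrative sketch. The scores below are invented and the computation is nothing more than Pearson correlation (Python 3.10 or later for statistics.correlation); it is not a procedure proposed in the literature discussed here, only a shorthand for the kind of evidence each question calls for: correlating two administrations of the same test speaks to reliability, while correlating test scores with an external criterion speaks to validity.

    # Purely illustrative sketch with invented scores for ten test takers.
    # Reliability-type question: do two administrations of the test rank
    # people consistently?  Validity-type question: do the scores relate to
    # an external criterion measure of the ability (e.g. teacher ratings)?
    from statistics import correlation  # Pearson's r; requires Python 3.10+

    form_a    = [52, 61, 45, 70, 66, 58, 49, 75, 63, 55]            # first administration
    form_b    = [50, 63, 47, 68, 64, 60, 51, 73, 61, 57]            # parallel form / retest
    criterion = [3.0, 3.8, 2.9, 4.4, 4.1, 3.5, 3.1, 4.6, 3.9, 3.3]  # external criterion ratings

    print("reliability-type evidence, r(form A, form B):", round(correlation(form_a, form_b), 2))
    print("validity-type evidence, r(form A, criterion):", round(correlation(form_a, criterion), 2))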
Historically, or rather prehistorically (prescientifically), we can see the roots of the contemporary outline of validity within psychological testing in the concept of criterion correlation with test scores (Scott 1917; Spolsky 1977; Spolsky 1985; Thorndike 1918; von Mayrhauser 1992), whereby the criterion was understood as the intended domain performance for which the test was just a proxy (Shepard 1993: 409). Validity was in essence seen as this single correlation coefficient, and the argumentation was to the effect that the test is valid for everything it correlates with (Guilford 1946: 429). The biggest problem for the applicability of this kind of approach was that the validity of the criterion seen in this way had to be taken for granted, as there was nothing available to validate it against. While this is less of an issue for the natural sciences, when it comes to the social sciences and humanities it is not difficult to realize that there is a serious lack of well-defined criteria in this sense (Kane 2001: 320). In educational settings, we mostly - if at all - only have partly completed formulations - consider the issues behind defining language competence, for example, and all they entail.

The way out of this conundrum was seen in employing a criterion measure which would reflect some kind of desired behavior or performance. This marked the addition of one more component to validity - namely content (Ruch 1929; Tyler 1934). Content-based validity can be traced to achievement tests and their inherent emphasis on actual knowledge and/or skills. In essence, it relates to the representativeness of the content of the test in relation to the content of the domain of reference (Bachman 1990; Canale and Swain 1980; Lado 1961). The domain of reference (often terminologically interchangeable with "universe") represents the "real world" performance or behavior that is ultimately the object of the measurement. The representativeness of the content is most commonly observed through the prism of expert judgments (Alderson et al. 1995: 174; Angoff 1988: 22). As a validation procedure, it comes with a strong confirmatory bias, as it is usually conducted by the test developers themselves (Guion 1977: 2).

Hence, up to the 1950s, what we had was a veritable toolbox of validation efforts, from which test developers and publishers could freely pick and choose - the criterion-based model was normally employed for the validation of selection and placement tests, while the content-based model was most commonly employed to justify the validity of achievement tests (Kane 2001: 322). The realization soon came that both of these models could only be considered puzzle pieces, albeit important ones, making up the overall validity of the test. In terms of organized efforts to resolve this, the American Psychological Association (APA) and their contracting of Paul Meehl and Robert Challman in the early 1950s (Cronbach 1989) are particularly noteworthy.

2. Standards for Educational and Psychological Testing of 1955 and 1966

True to their times, Meehl and Challman quite naturally adopted the then-popular but highly academically ambitious hypothetico-deductive (HD) model (Betti 1990; Bowers 1936; Hanson 1958; Popper 1935; Suppe 1977) of accounting for scientific results. The model described the functioning of a scientific theory by referring to a complex axiomatic system (Kane 2001: 321).
This nomological network at its core has a set of implicitly defined terms which are connected by a set of axiomatic claims, a collection of which in effect represents the construct or the core of a theory. The implicitly defined terms are based on observable variables and as such defined using systems of corresponding rules (Messick 1987: 7). The axioms can then be used to make predictions relating to the observable variables and the relationships among them, which are then accounted for empirically and explained using the theoretical construct (Hempel 1965). The validity of any proposed interpretations of the scores is then evaluated by means of how well the scores satisfy the theory (Kane 2001: 321). Cronbach and Meehl summed this up and indicated a new kind of validity - construct validity - originally intended as an alternative to the more practically-orientated criterion-related validity and content validity, and in fact intended to be used in cases where a measurement of an attribute or quality is not operationally defined, though potentially applicable to any test (1955: 282). Seen as leaning too much towards the theoretical side, it was recommended as a path to take when no other, more solid evidence (such as score-criterion correlations) was available. It was originally imagined as the application of scientific rigor to the process of interpretation of test scores by focusing both on rational argument and empirical evidence (Cronbach 1949; Cronbach and Meehl 1955). Despite actually being seen as the weakest form of the validity argument at the time, this "type" of validity was to take center stage in the decades to follow.

Building on these developments, the APA and the National Council on Measurement in Education issued their notable Standards for Educational and Psychological Testing (henceforth, the Standards) in 1955, and in it we can find the first properly comprehensive and scientifically grounded account of test validity related to educational settings. The section dealing with validity was originally meant to be drawn from the contribution received from Edward Cureton who, following from the established tradition, operationalized validity as the relationship between test scores and criterion scores which represented the actual task (1951: 622). Luckily, the 1955 Standards took it a bit further and included the previously commissioned work done by Challman, Cronbach, and Meehl. The Standards of 1955 put forward a system of four types of validity corresponding to, as they saw it, four different general aims of testing, namely content validity (related to the representativeness of the test content in respect to the target test-external domain), concurrent validity (the relationship between the scores and other existing related measurements), predictive validity (the ability of the scores to predict future measurements), and the then novel feature of construct validity (which related to the question of how the test as a measurement relates to the theoretical conceptualization of the target domain it is a representation of). The second edition of the Standards, published in 1966, brought limited novelty to the conception of validity, as it only joined concurrent and predictive validity under the umbrella of criterion-related validity, arguing that they were both operationally and logically related. Construct validity was not acknowledged as more significant until the third edition in 1974.
Prompted by Cronbach (1970; 1971), the 1974 Standards for the first time elaborated on the conceptual interrelatedness of the three established types of validity. Cronbach in essence pointed out that, due to the lack of any well-defined criterion, any kind of measurement referring to the person's internal processes by definition required construct validation (1971: 451). In addition to that, the problem of a veritable medley of validation methods, and the opportunistic approach to validating tests this caused, was discussed (Guion 1977). Finally, as early as 1957 Loevinger had suggested that since predictive, concurrent, and content validity were all ad hoc, construct validity indeed represented validity as a whole from an academic point of view (1957: 636). These realizations paved the way towards the unification of the concept of validity, most strongly voiced at the time by Samuel Messick (1975; 1980; 1987; 1989).

3. Standards for Educational and Psychological Testing of 1974 and 1985

By the 1980s three principles of construct validity had been specified (Cronbach 1980; Embretson 1983) and had come to be understood as the three general principles of any validation effort (Kane 2001: 323):
1. the extended effort of including a theoretical background meant that the validation process ended up being more comprehensive and thorough;
2. more emphasis was put on the need to define the proposed interpretation prior to test implementation and subsequent evaluation; and
3. additional prominence was given to exploring possible competing theories and reasons behind alternative interpretations.

Based on these tenets, Messick outlined a new unified view of validity as a combination of all the experimental, statistical, and philosophical means normally found in the evaluation of scientific theories and hypotheses, applied to the framework of testing (1989: 14). In his view, validity integrated the judgment of the extent to which empirical evidence and theoretical rationale support the adequacy of subsequent actions based on test scores (Moss et al. 2006: 115). Messick further argued that since content and criterion-related validity do not have a clear link either to the construct of a given test or to its scores (including their subsequent interpretation), they should not be seen as types of validity. Instead, the argumentation is that there cannot be any types of validity; rather, it is a holistic feature of a test (and its score). We can only have different aspects of the concept, or different dimensions, or different steps of test validation. This is based on the claim that the methodological principles of construct validation extend to all validation efforts (the need for an explicit statement of the proposed interpretation, the need for theoretically extended analysis within the process of validation, and the obligation to consider alternative hypotheses). At long last the concept of validity became unified, though under the somewhat terminologically problematic umbrella of "construct validity". The issue was (and still somewhat is) that the unified approach imagined as such transcends the theory-based validation which the term originally marked upon its conception in the 1950s (Kane 2001: 324).
In addition to that, Messick (1989: 13) raises the issue of consequential validity, which basically refers to the social consequences of the inferences we make based on test scores and the actions we take subsequently, and which should by extension likewise be taken under the umbrella of unified validity. The assertion of this unification was also that rarely, if ever, can we consider evidence related to the content or the criterion as sufficient and removed from other sources of validity evidence (Sigott 2004: 44). They all contribute to the overall validity argument. This kind of unified view of validity was adopted by the Standards published in 1985 and was to become the dominant paradigm within the following decade or more. Nevertheless, Messick (1988: 42) went on to criticize the 1985 Standards, because they continued to refer to "types of evidence" for validity as well, for all intents and purposes harkening back to the separatist modeling of the previous decades. Indeed, we will see that this continuing tendency of constantly referring to subdivisions within validity, despite the relatively general consensus regarding its unified nature (at least conceptually), actually results from the need to have an account which is as adaptable and applicable as every-day validation practice demands.

True to that statement, this unitary model of validity was already subjected to criticism in the late 1980s and early 1990s. The criticism was, in essence, again targeted at the academic opaqueness and the lack of applicability, especially in educational settings. For example, Shepard argued that viewing test validation as a never-ending process of gathering evidence and arguing only the highest possible degree of (construct) validity left many test makers fearful of their ability to meet such high academic standards, particularly among the non-academic (educational) practitioners (1993: 428). Not only that, but the open-endedness of the procedure itself left the validation process vulnerable to "give what you have" approaches, which revolve around the idea that any amount of evidence towards validation is enough since it can never be entirely provided scientifically (Shepard 1993: 429). Reminiscent of the circularity one often finds in academic development in general, an argument was now made for a more practically orientated account of validity, one which would provide the testing community with a concrete set of tools to run successful and resource-friendly validation procedures, essentially asking for segmentation of the only just unified validity. The case was further strengthened by the fact that the nomological network and the hypothetico-deductive modeling put too much strain on validation efforts within the social sciences in general, as they include few, if any, tightly connected axioms in their constructs and work at best with crude and half-explicit designs (Kane 2001: 325). The community started to realize that there was room for both a theoretical and a practical account of validity. Hence, Cronbach (1988: 14) proposed a distinction between a strong program of validation (as the one outlined by Cronbach and Meehl (1955)) and a weak program of construct validation (as described essentially in Meehl and Golden (1982)).
The weak program implied exploratory empiricism, where any correlation is welcome, while the strong program included an outline of the theoretical background and a series of deliberate validation challenges (Kane 2001: 326). The problem with this kind of division was that the weak program slid too much into the academic opportunism the unified theory in fact aimed to avoid. Since validation efforts are usually conducted by the test developers themselves, there is a strong and quite expected confirmation bias, which is another reason why the proposed weak program of construct validity can be problematic (Kane 2001: 327). Any entirely unified view of validity comes with additional serious downsides: eliminating any type of subdivision makes the whole process much more opaque. Also, since each validation effort is in many ways unique due to the context dependency of each testing situation, it is hard to suggest that "unified" should also mean "uniform" in terms of validation processes. Constructing a system in which all variables depend on the theory is likewise problematic because it must suppose the validity of the theory as a given. Again, these are the reasons why terms such as "construct validity" or "content validity" still persevere - certain patterns of evidence do contribute to certain distinctive aspects of overall validity and are extremely useful in practice (Kane 2001: 332).

The emergence of this kind of discussion, a push and pull towards and away from a unified (construct) validity conceptualization, indicated that even after some 30 years of formal deliberation on the nature of the validity of tests, no clear criteria or solid guiding principles commonly agreed on yet existed. In the 1990s and early 2000s this resulted in several new, mostly practically-orientated, accounts of validity, most notably the ones by Bachman (1990), Kane (1990; 1992; 1994; 2006; 2012), and Weir (1993; 2005). All three approaches generally view validity as a unitary concept, though two of them shy away from terming it construct validation. They all also find merit in dividing the largely academically grounded unitary representation into smaller constituents in order to facilitate easier practical implementation, which was to further influence the developments within the Standards of 1999 and 2014 and the contemporary consensus regarding validity and validation.

4. Bachman's Model of Validity — Evidential and Consequential Validity

The earliest validity accounts within language testing as an important dimension of educational assessment came from authors such as Lado (1961) and Davies (1968), who espoused validity as focusing on the face values of the test (validity by assumption), on its content, on control of extraneous factors, on conditions required to answer test items, and on empirical insight (D'Este 2012: 63). Then, Campbell and Fiske (1959) saw validity as having a convergent dimension (measures that should be related are found to be related) and a discriminant dimension (measures that should not be related are found not to be related). Campbell and Stanley (1966) referred to the concepts of internal and external validity.
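Campbell and Fiske's convergent/discriminant logic can likewise be illustrated with a minimal sketch. The measures and scores below are invented and the computation is again plain Pearson correlation (Python 3.10+); the point is only the expected pattern: measures of the same ability should correlate substantially, while a measure of a nominally unrelated attribute should not.

    # Purely illustrative sketch of the convergent vs. discriminant pattern.
    from statistics import correlation  # requires Python 3.10+

    essay_task  = [52, 61, 45, 70, 66, 58, 49, 75, 63, 55]  # writing measure 1 (invented)
    letter_task = [48, 64, 43, 72, 63, 55, 50, 78, 60, 57]  # writing measure 2, same ability (invented)
    typing_wpm  = [40, 38, 55, 42, 60, 35, 58, 41, 47, 52]  # nominally unrelated attribute (invented)

    # Convergent evidence: measures that should be related are in fact related.
    print("convergent r   =", round(correlation(essay_task, letter_task), 2))
    # Discriminant evidence: measures that should not be related are not.
    print("discriminant r =", round(correlation(essay_task, typing_wpm), 2))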
Bachman embraced the unified theory of validation put forward by Messick and considered the validity of a given use of a test as an outcome of a complex process that must include several crucial features (1990: 237): evidence supporting a particular interpretation and use, the ethical values which are the basis of the given interpretation, and the test taker's performance. In addition to that, Bachman also argued for a stronger inclusion of the concept of reliability within validity considerations, since when it comes to language testing it is not easy to distinguish between the effects of different test methods, nor between traits and test method (1990: 239). In all other ways, his account entirely follows the focus on construct validity prevalent in the late 1980s, together with an emphasis on considering the value implications of score interpretation and the consequences of their use (D'Este 2012: 63). Specifically, Bachman thinks that two general types of evidence need to be collected in any serious validation effort, supporting two major bases of (construct) validity: evidential and consequential.

The evidential basis of validity, according to Bachman (1990: 248), is grounded in evidence that supports the relationship between test scores and their interpretations and subsequent use. There are three tributaries that feed into this evidential basis, all supporting overall validity (D'Este 2012: 67):
‣ content relevance and content coverage (what is known as content validity);
‣ criterion relatedness (i.e. criterion validity); and
‣ meaningfulness of construct (construct validity).

Content validity refers to the domain specifications which underlie the test, criterion relatedness refers to a meaningful relationship between test scores and other indicative criteria, while construct validity means the extent to which performance on the test is consistent with the predictions we make on the basis of a theory of abilities (Bachman 1990: 246-269). In addition, the consequential (or ethical) basis of validity refers to the fact that tests have not been designed to be used in an academic vacuum but rather have real-life applications and are influenced by society as a whole. Following Cronbach (1984: 5) and the claim that tests are supposed to be an impartial way of performing the political function of determining who gets what, a lot of emphasis needs to be put on the consequential dimension of language tests (Bachman 1990: 280).

4.1 Evidential Validity

Looking at Bachman's view of the evidential basis of validity in more detail, the first aspect, content relevance, is already problematic to validate because domains of language ability are difficult to define in a finite and clear-cut way (even if we focus on the ones we see as inherently simple, such as, for example, vocabulary). Evidence of content coverage and relevance is further problematic as it tends to focus on what a test taker can do rather than on what he cannot do, because evidence of inability implies the consideration of competing hypotheses (Bachman 1990: 246). Criterion validity is described by Bachman in terms of the traditional division into concurrent and predictive relatedness. Validation of concurrent criterion relatedness involves one of two commonly employed procedures: examining differences in test performance among individuals at different levels of proficiency and/or examining correlations among different measurements of the given ability (Bachman 1990: 248).
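As a rough illustration of the first of these two procedures, the sketch below compares invented test scores across groups whose proficiency levels are assumed to have been assigned independently of the test; group means that order themselves as expected constitute concurrent evidence, while overlapping or reversed means would count against the intended interpretation.

    # Purely illustrative sketch: do test scores separate groups already known
    # to differ in proficiency (e.g. independently assigned course levels)?
    from statistics import mean, stdev

    scores_by_level = {
        "beginner":     [38, 42, 45, 40, 47],
        "intermediate": [55, 58, 61, 52, 60],
        "advanced":     [72, 68, 75, 70, 66],
    }

    for level, scores in scores_by_level.items():
        print(f"{level:12s} mean = {mean(scores):5.1f}  sd = {stdev(scores):4.1f}")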
Predictive criterion relatedness focuses on demonstrating a link between test scores and some future performance, where the test scores predict the criterion behavior of interest - for example, having a writing test serve as a predictor of placement in writing courses of different levels. Finally, construct validation is related to the basic question of what is the nature of that something that an individual possesses or displays which is the object of our measurement (Messick 1975: 957). Constructs can then be seen as definitions of people's attributes assumed to be reflected in their performance (Cronbach and Meehl 1955: 283). By postulating constructs as a way of classifying behavior, we can argue that construct validity then must incorporate all of the evidence supporting the validity of a test, including its content and its degree of integrity (Bachman 1990: 256). Understood in this general sense, construct validation resembles a standard procedure of verifying or falsifying a scientific theory and as such is seen as divisible into two parts: logical analysis and empirical investigation (D'Este 2012: 68). The logical part of the procedure involves the process of defining the construct theoretically and operationally, which provides us with the means of linking scores to the actual ability of interest and with the means of postulating the hypotheses we subsequently want to test. The empirical analysis focuses on finding empirical evidence (in terms of correlations and experimentation) in order to confirm or disprove a particular interpretation of the obtained test scores (Bachman 1990: 258-266).

4.2 Consequential Validity

Bachman talks about the ethical basis of validity as incorporating considerations which are neither scientific nor technical and which focus on the influence of a particular (educational) system on the interpretation of a test, as well as on the washback effect that test use has on that particular system in reverse (1990: 279). The issues behind this commonly termed consequential validity aspect are complex and involve looking into the rights of the test takers (secrecy, confidentiality, privacy), the values inherent in test developers and raters, the values inherent in the particular social system, background knowledge, cognitive characteristics of the test taker, and the influence on teaching and learning (Bachman 1990: 280-284).

5. Kane's Model of Validity — the Interpretative Argument

Kane's (1992) conception of validation stems basically from the proposed view of the process as a series of arguments (Cronbach 1980; House 1980), which House termed "the logic of evaluation argument" (1977: 47) and Cronbach referred to as an "evaluative argument" (1988: 4). Kane called it an "interpretative argument" (1992: 529) and saw it as stemming first from a clear and sound statement regarding the claims included in the interpretation of a test. In this way we would end up with a network of inferences which would in the end present a credible account of the validity of a particular test (Chapelle 2012: 19). The whole interpretative argument revolves around a series of inferences starting with the universe of generalization and ending with a decision taken based on generalized scores. The model as such involves five general interpretative/argumentation steps (Kane 1992):
1. eliciting a student performance (in essence task design);
2. scoring the performance;
3. attesting the typicality of the score (basically looking at the reliability and generalizability of the score);
4. interpreting the score; and
5. using the score to make decisions.

By following these five steps and the four suggested inferences linking them, one is supposed to make a rationale for regarding observed performances, and the conclusions and decisions stemming from them, as being plausible to the relevant stakeholders (Kane 2012: 13).

5.1 Universe of Generalization and the Sample of Observations — Step 1

Found at every stage of the historical discussion of validity, the universe of generalization (UG) is parallel to the target (criterion) domain, though within this framework it is meant to represent a small subset of the overall domain. The given subset seen as the UG is to be more precisely defined than the overall domain and is intended as the basis for sampling for subsequent test construction. It is meant to be a more practically-orientated response to the problems of properly defining the domain in order to make it more transparent for succeeding content-related inferences (Chapelle 2012: 22). The procedure also encroaches on the standard process of modelling the construct as comprising the assumed features of the nontest performance (Messick 1989: 55). Ultimately, it is equivalent to any practical test design we can commonly find (Bachman 1990; Bachman 2002).

5.2 Scoring and Observed Score — Step 2

The next set of inferences the model focuses on is scoring, following from the test created from the obtained sample of observations representative of the UG. In this step we are supposed to ascertain the degree to which the observed score is representative of the observed (and sampled) behavior or performance. As such, it again follows the pattern of all the previously presented models, falling in essence within both criterion-related and, ultimately, construct-related validity, or within the scoring validity seen later in Weir's (2005) modeling.

5.3 Generalization and Expected Universe Score — Step 3

Step three of the whole process basically involves making the second relevant inference, namely comparing the obtained score to what Kane (2006: 38) calls the universe score. Essentially, the expectation is to provide evidence that the scores obtained are consistent across tasks, raters, and particular testing instances. This inferencing procedure is mainly used in reliability studies (whether through generalizability theory, item response modeling, or other methods).

5.4 Expected Score, Theory-Based Interpretation, and the Construct — Step 4

Step four, again echoing construct validation, is the third inference that needs to be made. It basically focuses on linking the now-established reliable scores to the nontest behavior via a theoretical model. The procedure involves gathering all available evidence that can relate the said reliable scores to the UG. The only discernible difference from commonly viewed construct validation is the fact that Kane (1992) does not view constructs as an a priori existing formulation, but as an interface between prior work, conceptual possibilities, and pragmatic needs (Chapelle 2012: 24).
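Before moving on to the final step, a minimal sketch of the kind of consistency evidence the generalization inference (step 3, Section 5.3) relies on may be useful. The task scores below are invented, and Cronbach's alpha is used here only as a simple classical stand-in for the generalizability or item response analyses named above.

    # Purely illustrative sketch: consistency of invented scores across four
    # tasks, summarised with Cronbach's alpha as a simple reliability estimate.
    from statistics import pvariance

    # rows = test takers, columns = four tasks scored on the same scale
    task_scores = [
        [6, 7, 6, 7],
        [4, 5, 4, 4],
        [8, 8, 9, 8],
        [5, 5, 6, 5],
        [7, 6, 7, 8],
        [3, 4, 3, 4],
    ]

    k = len(task_scores[0])                              # number of tasks
    task_columns = list(zip(*task_scores))               # scores grouped per task
    totals = [sum(person) for person in task_scores]     # composite score per person

    alpha = (k / (k - 1)) * (1 - sum(pvariance(col) for col in task_columns) / pvariance(totals))
    print(f"Cronbach's alpha across tasks = {alpha:.2f}")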
5.5 Implications and the Decision — Step 5

The final inference and step five of the whole argumentation procedure has to do with what is known in various models (Messick 1989; Weir 2005) as consequential validity, or the assessment use argument (Bachman and Palmer 2010). As expected, the emphasis is on the consequences the test scores have in terms of the final decisions based on them and their social implications (Bachman 2005; Chapelle 2012; Shohamy 2001).

The major advantage of conceiving a validation effort as a set of arguments in this way is the guidance it provides to practitioners. It offers a clear inventory of the evidence that needs to be gathered in order to satisfy each of the steps in the argumentation structure (Kane 2001: 331). The overall approach of segmenting the validation procedure and structuring it in terms of argued interpretations is currently very influential (and was also generally adopted as the methodology by the Standards published in 1999 and in 2014).

6. Weir's Model of Validity — Evidence-Based Test Validation

Given the extensive discussion on the nature and underpinnings of validity, it is hardly surprising that another model - elaborated fully by Weir in 2005 and also aimed predominantly at educational (language) testing - is by definition in part a reformulation. The model itself presents overall test validity as an interplay between five different types: jointly influenced by test-taker characteristics as a separate piece of the validity puzzle, we have cognitive validity, context validity, scoring validity, consequential validity, and criterion validity. The two former types are mostly imagined as being attested a priori to the test implementation (though an a posteriori validation of cognitive validity is possible as well) and the latter three types a posteriori, with context validity straddling both and being related to pre- and post-evaluation.

6.1 Test-Taker Characteristics

As a novel concept which cannot formally be found within any of the outlined (previous or subsequent) models of validity, this component is significant because, while other models consider this aspect of the testing process as given, Weir strongly emphasizes the need to actively consider the test-taker. He distinguishes between three classes of test-taker characteristics (Shaw and Weir 2007: 5):
‣ physical/physiological characteristics, which include any special needs on the side of the test-taker, such as ones stemming from dyslexia or eyesight impairment;
‣ psychological characteristics, including test-taker motivation, personality type, learning styles, and more; and
‣ experiential characteristics, incorporating for example the degree of test-taker familiarity with the test format or content.

These individual characteristics can then be viewed as systematic, if they affect a test-taker's performance consistently (such as dyslexia or personality traits), or unsystematic, when they have a random, perhaps one-off effect (for example motivation or test format familiarity).

6.2 Cognitive Validity

This proposed type of validity focuses on the determination of the cognitive processes to be used as models for designing test items and represents another partially novel concept, put forward by Weir (2005) and first introduced by Bachman (1990). Validation in these terms is meant to be mostly a priori, where the focus is on piloting and trialing work in order to ascertain the relevant cognitive representations which need to be exemplified in the test content. Post-test validation is also imagined and should function as a correlational analysis that can be used to attest the link between scores and "real life" performance.
For instance, in terms of writing assessment, Shaw and Weir (2007: 34) recognize six different aspects of cognition behind writing ability:
1. macro-planning: gathering ideas and identifying major constraints such as genre, readership, and goals;
2. organization: ordering the ideas and identifying relationships between them;
3. micro-planning: focusing on individual parts of the text and considering issues such as the goal of the paragraph in question, including both its alignment with the rest of the text and the ideas and the sentence and content structure within the paragraph itself;
4. translation: the content previously held in a propositional form is transferred into text;
5. monitoring: focusing on the surface linguistic representation of the text, the content and the argumentation presented in it, and its alignment with the planned intentions and ideas; and
6. revising: results from the monitoring activity and involves fixing the issues found to be unsatisfactory.

6.3 Context Validity

Context validity is to a large extent parallel to the better known concept of content validity, where the idea is that, in terms of language testing for example, we move away from the sole focus on the linguistic representation and also include the social and cultural dimension within which the writing has been produced (Weir 2005), adding a layer of novelty as well. Seen this way, context validity is then split into:
1. setting: this consists of the task (response format, purpose, knowledge of criteria, weighting, text length, time constraints, and writer-reader relationship) and the administration (physical conditions, uniformity of administration, and security); and
2. linguistic demands: these include lexical resources, structural resources, discourse mode, functional mode, and content knowledge.

It is clear that the suggested model significantly builds on the previously espoused and conceptually related content validity, both in theoretical and in practical terms.

6.4 Scoring Validity

Focusing on the interplay between validity and reliability and the argument that the two are in fact complementary aspects of identifying, estimating, and interpreting different sources of variance in test scores (Bachman 1990: 239), Weir (2005) goes on to propose a formal aspect of validity which addresses this issue even more directly. Generally, scoring validity is concerned with all aspects of the testing process that can have an impact on the scores, or more precisely an impact on score attribution. This includes (Shaw and Weir 2007: 146):
‣ criteria and type of the rating scale;
‣ rater characteristics and variability (incorporating both rater-candidate and rater-item interactions);
‣ rating process;
‣ rating conditions;
‣ rater training;
‣ post-exam adjustments; and
‣ grading and awarding.

In conceptual terms, the importance of scoring validity has often been emphasized as almost central, as it both represents the culmination of other (mostly a priori) validation and the only firm link to all of the inferences and ensuing decisions (Messick 1989). In this way it is conceptually linked both to criterion-related validity and, ultimately, to construct validity as conventionally depicted.
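A very basic instance of the evidence gathered under the "rater characteristics and variability" aspect listed above can be sketched as follows. The ratings are invented and the two figures are simple agreement percentages, not the more elaborate indices used in operational rating studies.

    # Purely illustrative sketch: exact and adjacent agreement between two
    # raters scoring the same ten scripts on a 0-5 band scale (invented data).
    rater_1 = [3, 4, 2, 5, 3, 4, 1, 3, 4, 2]
    rater_2 = [3, 4, 3, 5, 2, 4, 1, 4, 4, 2]

    n = len(rater_1)
    exact    = sum(a == b for a, b in zip(rater_1, rater_2))
    adjacent = sum(abs(a - b) <= 1 for a, b in zip(rater_1, rater_2))

    print(f"exact agreement: {exact}/{n} = {exact / n:.0%}")
    print(f"within one band: {adjacent}/{n} = {adjacent / n:.0%}")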
More specifically, the presented organization of this type of validity stems, among others, from Milanovic and Saville (1994), Pollitt and Murray (1996), O'Sullivan (2000), Shaw (2002), Weir and Milanovic (2003), Falvey and Shaw (2005), and more, as listed in Shaw and Weir (2007: 144).

6.5 Consequential Validity

First raised by Messick (1987; 1989), this type conceptually deals with the social impact a test and its inferences can have. More precisely, Weir (2005) outlines consequential validity as engaged with:
‣ washback (or backwash) on individuals in, for example, a classroom or workplace;
‣ impact on institutions and society; and
‣ avoidance of test bias, which essentially includes making sure that the test use and interpretations serve the intended testing purpose (Cole and Moss 1989).

Apart from being conceptually linked to prior validity developments, the division stems from a wide set of publications including Burrows (1998), Hamp-Lyons (2000), McNamara (2000), Hughes (2003), Shohamy (2001), Hawkey (2006), and others.

6.6 Criterion-Related Validity

Here, Weir (2005) again reimagines - for practical purposes and while focusing in particular on assessing language - one traditional concept, namely that of criterion-related validity. The focus is the same: comparisons to external measurements such as different test scores, ratings coming from teachers or other expert informants, or even self-assessment (all measurements usually being expressed in correlation coefficients). However, recognizing the issues behind test comparability, Weir keeps the standard division into predictive and concurrent aspects of criterion-related validity, more concretely instantiated as:
‣ cross-test comparability: looking at ways of correlating different existing tests designed to measure the same ability;
‣ comparison with different versions of the test: having different versions of one test applied to the same candidates under the same conditions; and
‣ comparison with external standards: comparison with different established frameworks, such as the Common European Framework of Reference for Languages (2003).

Additionally, even though Weir, just like Kane, only reluctantly uses the term "construct validity", he does recognize that it can be considered to conceptually relate cognitive, context, and scoring validity. Furthermore, in Weir's model (similarly to Kane's, 1992) we can see a conception of validity both in terms of a subdivision and in terms of consecutive steps of validation. Instead of calling them arguments, Weir, however, opts for more traditional labels. We can see context validity as parallel to the first step of sampling the observations in Kane (1992); scoring validity as equivalent to the scoring and generalization steps; criterion-related validity as parallel to extrapolation; and consequential validity as similar to Kane's (1992) dual step of implications resulting in a decision. A novel focus introduced, in principle, by Weir is the insistence on looking into test-taker characteristics and into the cognitive processes involved (cognitive validity), which was strongly implied by Bachman in 1990. This is also considered in the contemporary view of validity, to be further discussed below.
7. Standards for Educational and Psychological Testing of 1999 and 2014 — The Current Consensus

The first thing that has to be discussed is the terminological question of whether we should refer to unified validity as "construct validity" or simply as "validity". We have seen that authors such as Messick (1987; 1989), Cronbach (1971; 1980; 1988), and others (e.g. Loevinger 1957; Fitzpatrick 1983) have presented an extremely solid case for why validity should be conceptually considered as a single notion, a single plausibility value that the use of a test may have. However, while that may be indisputable from a conceptual point of view, we could also see a strong argument for why "construct validation" is not a suitable term to represent unified validity. The term was coined in the 1950s and was meant to refer strictly to the then-adopted hypothetico-deductive modeling and the nomological network approach to theory construction. Not only has this been shown to be essentially inapplicable to the social sciences, but "construct validity" has moved away from this constrained reference to theory to include referencing the test takers, cognitive processes, content and context, criterion, and consequences. As such, unified validity has outgrown the term "construct validity" in its original meaning, which now misrepresents what validity actually entails. In addition to that, referring to a theoretical background as the sole underpinning of a test's validity makes the whole concept very difficult to explain to non-academic practitioners, with terms such as "content validity" being much more convenient (Sireci 2007: 478). Validity can easily be unified without referencing the term "construct" itself, and many of the more contemporary views of validation do just that.

The second important point that needs to be addressed is whether there should be any subdivision within the unified conceptualization of validity, either in terms of "types" of validity, "aspects" of validity, or "types of evidence" for validity. If we review the previous discussion, it is not difficult to reach the conclusion that even though in purely theoretical terms validity is indeed best seen as unified, in practical terms it is much better to consider it segmented. Having subcategories or steps allows us to have a more manageable tool (or more manageable tools) to perform validation, allows us to more easily define the types of evidence required to complete the whole picture, and makes the whole concept more approachable in general.
Emerging from the seven decades of scientific discussion of validity, we can see several fundaments (Chan 2014: 10; Kane 2001: 328; Sireci 2007: 477):
‣ validity is not a property of a test, but refers to the use of a test for a particular purpose - it is the interpretation, not the test, that is being evaluated;
‣ evaluating the appropriateness of the intended test use requires us to pursue multiple sources of evidence;
‣ validity can never be fully achieved, so we need to consider sufficient evidence to defend every intended purpose of a test;
‣ the validation procedure must be implemented both during test design (a priori) and after the test (a posteriori);
‣ the validation effort should include an extended analysis involving the relevant theory and a consideration of possible competing interpretations;
‣ it is extremely important to include the consequential dimension of each test in the evaluation;
‣ evaluation of validity is not a one-time static activity, but rather an ongoing process; and
‣ validity is a unified evaluation of the test's interpretation as a whole, but is in practice more manageable when parsed out into pattern-defined parts.

Hence, the contemporary view of validity and validation (as propounded by the Standards published in 1999 and in 2014, and supported by most of the psychometric community today) avoids much terminological reference and focuses on the aspects of validity that should be covered, the sources of evidence that we can tap, and the steps to be taken within any serious validation effort. The Standards (1999: 9) define validity first as "the degree to which evidence and theory support the interpretation of test scores entailed by the proposed uses of a test." Further, the Standards adopt Kane's (1992) focus on the validation argument as a methodological framework to be completed in steps, incorporate Bachman's (1990) differentiation of evidential and consequential validity, and incorporate several of Weir's (2005) developments as well. We can, in fact, see five relevant sources of validity evidence listed:
‣ test content: imagined as comprising items, format and wording, response options, and administration and scoring. The evidence is meant to be obtained by examining the relationship between the content of an instrument and the construct being measured (Chan 2014: 12);
‣ response process: gathering evidence involves looking into the cognitive processes involved when people take tests and should derive from procedures such as think-aloud protocols and qualitative interviews;
‣ internal structure: evidence is based on the statistical relationship between the test items, their organization, and the construct in terms of representativeness (Chan 2014: 13);
‣ relation to other variables: dictates comparing instrument scores with external variables; and
‣ consequences of use: evidence derives from looking into issues such as washback effects.

If we look at the summary of the cutting edge in terms of understanding the history of validity research within the setting of educational tests, as suggested by the presented discussion and given in the latest editions of the Standards (1999 and 2014), we can see that while the field has indeed managed to make certain big steps in terms of development, it has not, in many ways, considerably moved away from certain seminal advances made between the 1950s and the 1980s.
What we can observe as the major development is that the field has generally moved away from the terminological issues which often caused problems in terms of controversy, disagreement, and opaqueness. While it is a fact that most researchers refer to concepts such as "content validity" or "construct validity" by default, because of tradition and because of their usefulness in describing certain aspects of validity as a whole, most of them also generally agree with post-methodological divisions such as Kane's (1992), which basically outline the steps to be taken in validation efforts and the kind of evidence one would need to collect in order to provide a plausible argument as to the validity of a particular use of a test. However, in this way validation research has basically only managed to throw off the shackles of the strict theoretical and terminological tenets of the 1950s. It has not brought about much novelty in terms of the actual substance of a serious validation effort. We can easily see that the traditional concepts of "content validity", "criterion validity", and overall "construct validity" are still relevant. "Consequential validity", brought about by Messick (1989), is also still there. Weir's (2005) "cognitive validity" is basically the only novel conceptual aspect included from beyond the 1980s (and, as mentioned, this aspect was also influenced by Bachman's (1990) considerations).

Finally, regardless of terminological or theoretical preferences, validity in a contemporary and practically-orientated sense must ultimately be seen as a holistic value of a particular use of a test and is to be evaluated via a series of steps, involving stating certain presumptions and gathering evidence to defend them (Kane 2001: 330):
1. first we need to state the planned use and interpretation of the test as clearly as possible;
2. then we develop the entire validation argument - this includes defining the target domain, the representative content, the scoring procedure, how the scores relate to the outside criterion, how the scores relate to any underlying theory, and how the scores will be interpreted and with which consequences, together with the criteria and evidence supporting the plausibility of these assumptions;
3. next, we evaluate both logically and empirically all of the previously proposed assumptions, focusing on the most problematic ones, and adjust any interpretations or the procedure accordingly; and
4. at the end we restate the whole interpretative and validity argument and repeat step 3 until the entire argument is either plausible in its validity or rejected.

Presented in this way, validation can be practically undertaken by all test practitioners, regardless of their academic astuteness or theoretical preferences. Furthermore, as each validation effort is unique and entirely context-specific, it allows test developers to break the procedure down into steps and define the type and amount of evidence necessary in order to satisfy the different aspects comprising validity as a whole.

References

Alderson, Charles, Caroline Clapham & Dianne Wall (1995). Language Test Construction and Evaluation. Cambridge: CUP.
Angoff, William (1988). "Validity: An Evolving Concept." In Howard Wainer & Henry Braun (eds.). Test Validity. Hillsdale, NJ: Lawrence Erlbaum. 19-32.
American Psychological Association (1955). Standards for Educational and Psychological Testing and Manuals. Washington, DC: Author.
American Psychological Association (1966). Standards for Educational and Psychological Testing and Manuals. Washington, DC: Author.
American Psychological Association (1974). Standards for Educational and Psychological Testing. Washington, DC: Author.
American Education Research Association, American Psychological Association & National Council on Measurement in Education (1985). Standards for Educational and Psychological Testing. Washington, DC: Authors.
American Education Research Association, American Psychological Association & National Council on Measurement in Education (1999). Standards for Educational and Psychological Testing. Washington, DC: Authors.
American Education Research Association, American Psychological Association & National Council on Measurement in Education (2014). Standards for Educational and Psychological Testing. Washington, DC: Authors.
Bachman, Lyle (1990). Fundamental Considerations in Language Testing. Oxford: OUP.
Bachman, Lyle (2002). "Some Reflections on Task-Based Language Performance Assessment." Language Testing 19: 453-476.
Bachman, Lyle (2005). "Building and Supporting a Case for Test Use." Language Assessment Quarterly 2 (1): 1-34.
Bachman, Lyle & Adrian Palmer (2010). Language Assessment in Practice: Developing Language Assessments and Justifying Their Use in the Real World. Oxford: OUP.
Betti, Emilio (1990). "Hermeneutics as the General Methodology of the Geisteswissenschaften." In Gayle Ormiston & Alan Schrift (eds.). The Hermeneutic Tradition: From Ast to Ricoeur. Albany: SUNY Press. 159-197.
Bowers, Raymond (1936). "Discussion of a 'Critical Study of the Criterion of Internal Consistency in Personality Scale Construction': An Analysis of the Problem of Validity." American Sociological Review 1: 69-74.
Burrows, Catherine (1998). Searching for Washback: An Investigation into the Impact on Teachers of the Implementation into the Adult Migrant English Program of the Certificate in Spoken and Written English. Unpublished doctoral dissertation, Macquarie University, Sydney, Australia.
Canale, Michael & Merrill Swain (1980). "Theoretical Bases of Communicative Approaches to Second Language Teaching and Testing." Applied Linguistics 1: 1-47.
Chan, Eric (2014). "Standards and Guidelines for Validation Practices: Development and Evaluation of Measurement Instruments." In Bruno Zumbo & Eric Chan (eds.). Validity and Validation in Social, Behavioral, and Health Sciences. New York: Springer. 9-23.
Chapelle, Carol (2012). "Validity Argument for Language Assessment: The Framework is Simple…" Language Testing 29 (1): 19-27.
Cole, Nancy & Pamela Moss (1989). "Bias in Test Use." In Robert Linn (ed.). Educational Measurement. Washington, DC: American Council on Education and National Council on Measurement in Education. 201-219.
Cronbach, Lee (1949). Essentials of Psychological Testing. First Edition. New York: Harper & Brothers.
Cronbach, Lee (1970). Essentials of Psychological Testing. Third Edition. New York: Harper & Row.
Cronbach, Lee (1971). "Test Validation." In Robert Thorndike (ed.). Educational Measurement. Washington, DC: American Council on Education. 443-507.
Cronbach, Lee (1980). "Validity on Parole: How Can We Go Straight? New Directions for Testing and Measurement: Measuring Achievement over a Decade." In William Schrader (ed.). Proceedings of the 1979 ETS Invitational Conference. San Francisco: Jossey-Bass. 99-108.
Cronbach, Lee (1984). Essentials of Psychological Testing. New York: Joanna Cotler Books.
Cronbach, Lee (1988). "Five Perspectives on Validity Argument." In Howard Wainer & Henry Braun (eds.). Test Validity. Hillsdale, NJ: Lawrence Erlbaum. 3-17.
Cronbach, Lee (1989). "Construct Validation after Thirty Years." In Robert Linn (ed.). Intelligence: Measurement, Theory and Public Policy. Urbana, IL: University of Illinois Press. 147-171.
Cronbach, Lee & Paul Meehl (1955). "Construct Validity in Psychological Tests." Psychological Bulletin 52: 281-302.
Cureton, Edward (1951). "Validity." In Edward Lindquist (ed.). Educational Measurement. Washington, DC: American Council on Education. 621-694.
Davies, Alan (ed.) (1968). Language Testing Symposium. London: Oxford University Press.
D'Este, Claudia (2012). "New Views of Validity in Language Testing." Educazione Linguistica Language Education 1 (1): 61-76.
Embretson, Susan (1983). "Construct Validity: Construct Representation Versus Nomothetic Span." Psychological Bulletin 93 (1): 179-197.
Fiske, John (1989). "America's Test Mania." New York Times: EDUC16-EDUC20.
Fitzpatrick, Anne (1983). "The Meaning of Content Validity." Applied Psychological Measurement 7 (1): 3-13.
Guilford, Joy (1946). "New Standards for Test Evaluation." Educational and Psychological Measurement 6: 427-439.
Guion, Robert (1977). "Content Validity: The Source of my Discontent." Applied Psychological Measurement 1: 1-10.
Hamp-Lyons, Liz & William Condon (2000). Assessing the Portfolio: Principles for Practice, Theory, and Research. Cresskill, NJ: Hampton Press.
Hanson, Russell (1958). Patterns of Discovery. Cambridge: CUP.
Hawkey, Roger (2006). "Teacher and Learner Perceptions of Language Learning Activity." ELT Journal 60 (3): 242-252.
Hempel, Carl Gustav (1965). Aspects of Scientific Explanation: And Other Essays in the Philosophy of Science. New York: Free Press.
House, Ernest (1977). The Logic of Evaluative Argument. Los Angeles: University of California.
House, Ernest (1980). Evaluating with Validity. Beverly Hills, CA: Sage.
Hughes, Arthur (2003). Testing for Language Teachers. Cambridge: CUP.
Kane, Michael (1990). An Argument-Based Approach to Validation. Iowa City, IA: The American College Testing Program.
Kane, Michael (1992). "An Argument-Based Approach to Validity." Psychological Bulletin 112: 527-535.
Kane, Michael (1994). "Validating Interpretive Arguments for Licensure and Certification Examinations." Evaluation and the Health Professions 17: 133-159.
Kane, Michael (2001). "Current Concerns in Validity Theory." Journal of Educational Measurement 38 (4): 319-342.
Kane, Michael (2006). "Validation." In Robert Brennan (ed.). Educational Measurement. Westport, CT: American Council on Education/Praeger Publishers. 17-64.
Kane, Michael (2012). "All Validity Is Construct Validity. Or Is It?" Measurement: Interdisciplinary Research and Perspectives 10 (1-2): 66-70.
Lado, Robert (1961). Language Testing: The Construction and Use of Foreign Language Tests. A Teacher's Book. New York: McGraw-Hill.
Loevinger, Jane (1957). Objective Tests as Instruments of Psychological Theory. Missoula: Southern Universities Press.
McNamara, Tim (2000). Language Testing. Oxford: OUP.
Meehl, Paul & Robert Golden (1982). "Taxometric Methods." In Philip Kendall & James Butcher (eds.). Handbook of Research Methods in Clinical Psychology. New York: Wiley. 127-181.
Messick, Samuel (1975). "The Standard Problem: Meaning and Values in Measurement and Evaluation." American Psychologist 30 (10): 955-966.
Messick, Samuel (1980). "Test Validity and the Ethics of Assessment." American Psychologist 35 (11): 1012-1027.
Messick, Samuel (1987). Validity. Research Report. Princeton, NJ: Educational Testing Service.
Messick, Samuel (1988). "Meaning and Values in Test Validation: The Science and Ethics of Assessment." ETS Research Report Series 1988: i-28.
Messick, Samuel (1989). "Validity." In Robert Linn (ed.). Educational Measurement. Washington, DC: American Council on Education and National Council on Measurement in Education. 13-103.
Milanovic, Michael & Nick Saville (1994). An Investigation of Marking Strategies Using Verbal Protocols. Cambridge: University of Cambridge Local Examinations Syndicate.
Moss, Pamela, Brian Girard & Laura Haniford (2006). "Validity in Educational Assessment." Review of Research in Education 30: 109-162.
O'Sullivan, Barry (2000). Towards a Model of Performance in Oral Language Testing. Unpublished PhD dissertation, University of Reading.
Pollitt, Alastair & Neil Murray (1996). "What Raters Really Pay Attention To." In Michael Milanovic & Nick Saville (eds.). Performance Testing, Cognition and Assessment. Cambridge: University of Cambridge Local Examinations Syndicate and Cambridge University Press. 74-91.
Popper, Karl (1935). Logik der Forschung. Zur Erkenntnistheorie der modernen Naturwissenschaft. Wien: Verlag von Julius Springer.
Ruch, Giles (1929). The Objective or New-Type Examination: An Introduction to Educational Measurement. Chicago: Scott, Foresman and Co.
Scott, Walter (1917). "A Fourth Method of Checking Results in Vocational Selection." Journal of Applied Psychology 1: 61-66.
Shaw, Stuart (2002). "The Effect of Training and Standardisation on Rater Judgement and Inter-Rater Reliability." Research Notes 8: 13-17.
Shaw, Stuart & Cyril Weir (2007). Examining Writing: Research and Practice in Assessing Second Language Writing. Cambridge: CUP.
Shepard, Lorrie (1993). "Evaluating Test Validity." Review of Research in Education 19: 405-450.
Shohamy, Elana (2001). "Democratic Assessment as an Alternative." Language Testing 18 (4): 373-391.
Sigott, Günther (1994). "Language Test Validity: An Overview and Appraisal." Arbeiten aus Anglistik und Amerikanistik 19 (2): 287-294.
Sigott, Günther (2004). Towards Identifying the C-Test Construct. Frankfurt am Main: Peter Lang.
Sireci, Stephen (2007). "On Validity Theory and Test Validation." Educational Researcher 36 (8): 477-481.
Spolsky, Bernard (1977). "Language Testing: Art or Science." In Gerhardt Nickel (ed.). Proceedings of the Fourth International Congress of Applied Linguistics. Stuttgart: Hochschulverlag. 7-28.
Spolsky, Bernard (1985). Formulating a Theory of Second Language Learning. Cambridge: CUP.
Suppe, Frederick (1977). "Introduction." In Frederick Suppe (ed.). The Structure of Scientific Theories. Urbana, IL: University of Illinois Press. 3-241.
Thorndike, Edward (1918). "The Nature, Purposes and General Methods of Measurements of Educational Products." In Guy Whipple (ed.). The Measurement of Educational Products. National Society for the Study of Education Yearbook. Chicago: National Society for the Study of Education. 16-24.
Tyler, Ralph (1934). Constructing Achievement Tests. Columbus: Bureau of Educational Research, Ohio State University.
von Mayrhauser, Richard (1992). "The Mental Testing Community and Validity: A Prehistory." American Psychologist 47 (2): 244-253.
Weir, Cyril (1993). Understanding and Developing Language Tests. New York: Prentice Hall.
Weir, Cyril & Michael Milanovic (2003). Continuity and Innovation: Revising the Cambridge Proficiency in English Examination 1913-2002. Cambridge: CUP.
Weir, Cyril (2005). Language Testing and Validation: An Evidence-Based Approach. Houndmills, Basingstoke, Hampshire: Palgrave Macmillan.
Zumbo, Bruno (2007). "Validity: Foundational Issues and Statistical Methodology." In Calyampudi Radhakrishna Rao & Sandip Sinharay (eds.). Handbook of Statistics, Vol. 26: Psychometrics. Amsterdam: Elsevier Science. 45-79.
Zumbo, Bruno (2009). "Validity as Contextualized and Pragmatic Explanation, and its Implications for Validation Practice." In Robert Lissitz (ed.). The Concept of Validity: Revisions, New Directions and Applications. Charlotte, NC: IAP - Information Age Publishing. 65-82.

Nikola Dobrić
English Department
Alpen-Adria-University Klagenfurt