eJournals Arbeiten aus Anglistik und Amerikanistik / Agenda: Advancing Anglophone Studies50/2

Arbeiten aus Anglistik und Amerikanistik / Agenda: Advancing Anglophone Studies
aaa
0171-5410
2941-0762
Narr Verlag Tübingen
10.24053/AAA-2025-0010
aaa502/aaa502.pdf0216
2026
502 Kettemann

Oral communication exams in the EFL classroom

0216
2026
Philipp Siepmann
Oral communication exams (OCEs) have become an integral part of classroom-based summative assessment in the English as a Foreign Language (EFL) classroom in many federal states of Germany. However, the official guidelines issued by the Ministries of Education invariably lack a theoretical and empirical foundation and are limited to formal and practical aspects of administering oral examinations. Arguing strongly in favour of an evidence-based assessment practice, this article develops a conceptual framework of the OCE which extends several existing models of oral language performance assessment while considering the (limiting) conditions of local, classroom-based assessment. It includes elements relevant to the development of an OCE such as test construct, assessment tasks, criteria, and procedures. For each of these elements, an in-depth literature review is conducted and the implications from empirical research are discussed. The key points are summarised in four preliminary design principles (DP) to provide guidance to teachers and school policymakers in developing an OCE. The design principles relate to constructive alignment (DP1), assessment tasks (DP2), assessment criteria (DP3), and optimising examination conditions and procedures (DP4). The conceptual framework as well as the design principles formed the theoretical basis of a design-based research study on OCEs, which has been conducted in close collaboration with teachers at a partner school in North Rhine-Westphalia (see Siepmann 2024a, 2025, forthc.).
aaa5020137
Oral communication exams in the EFL classroom A conceptual framework and design principles for evidence-based assessment policy and practice Philipp Siepmann Oral communication exams (OCEs) have become an integral part of classroom-based summative assessment in the English as a Foreign Language (EFL) classroom in many federal states of Germany. However, the official guidelines issued by the Ministries of Education invariably lack a theoretical and empirical foundation and are limited to formal and practical aspects of administering oral examinations. Arguing strongly in favour of an evidence-based assessment practice, this article develops a conceptual framework of the OCE which extends several existing models of oral language performance assessment while considering the (limiting) conditions of local, classroom-based assessment. It includes elements relevant to the development of an OCE such as test construct, assessment tasks, criteria, and procedures. For each of these elements, an in-depth literature review is conducted and the implications from empirical research are discussed. The key points are summarised in four preliminary design principles (DP) to provide guidance to teachers and school policymakers in developing an OCE. The design principles relate to constructive alignment (DP1), assessment tasks (DP2), assessment criteria (DP3), and optimising examination conditions and procedures (DP4). The conceptual framework as well as the design principles formed the theoretical basis of a design-based research study on OCEs, which has been conducted in close collaboration with teachers at a partner school in North Rhine-Westphalia (see Siepmann 2024a, 2025, forthc.). 1. Introduction Over the past decade, many German federal states have introduced new mandatory oral examinations for classroom-based and high-stakes summative assessment in English as a Foreign Language (EFL) classrooms. AAA - Arbeiten aus Anglistik und Amerikanistik Agenda: Advancing Anglophone Studies Band 50 · Heft 2 Gunter Narr Verlag Tübingen DOI 10.24053/ AAA-2025-0010 Philipp Siepmann 138 Although these assessments are known by different names, including Mündliche Prüfung (oral exam), Kommunikationsprüfung (communication exam) or Sprechprüfung (speaking exam), they have in common the fact that they are intended to test students’ oral communicative and interactional competences in the target language. This article will refer to these new assessments collectively as oral communication exams (German: Mündliche Kommunikationsprüfung; hereafter: OCEs) to distinguish them clearly from more content-based oral examinations in EFL classrooms, such as oral Abitur examinations. The introduction of OCEs is part of a wider assessment reform in the wake of outcomes orientation and standardisation, as reflected in the National Educational Standards for Modern Languages (Bildungsstandards), which provide for speaking tests as part of the written Abitur examination or as a substitute for a written examination during the qualifying period (KMK 2012: 25). OCEs were formally enacted through amendments to examination regulations and curriculum guidelines. In order to facilitate practical implementation, some states have issued handouts with recommendations on the design and administration of examinations as well as sample tasks or entire examination designs (e.g. Hessisches Kultusministerium 2020; MSW NRW 2014; Niedersächsisches Kultusministerium 2014). 
However, it is striking that none of these official documents is (explicitly) based on empirical evidence. This article takes a critical stance towards such a practice and argues for a theoretically grounded and evidence-based OCE policy and practice. Therefore, it proposes evidence-based design principles for OCEs to support policy makers and teachers based on a review of the empirical research literature and textbooks on classroom-based language assessment. The article is divided into three parts: First, it outlines a conceptual framework for an OCE and its main design elements. Second, it discusses the implications of relevant empirical studies for the design of the elements of an OCE. Third, the article derives four theory-based design principles (DP) from the theoretical discussion. 2. Conceptual framework for the oral communication exam The conceptual framework for an OCE proposed in this article emerged from a design-based research project (Siepmann 2024a, forthc.; Siepmann & Bruns forthc.). It was carried out between 2019 and 2023 in close collaboration with teachers and students from a partner school in the federal state of North Rhine-Westphalia. The term conceptual framework is used according to a definition by Miles et al. (2018: 15): Oral communication exams in the EFL classroom 139 A conceptual framework explains, graphically and/ or in narrative form, the main things to be studied - for example, the key factors, variables, phenomena, concepts, participants - and the presumed interrelationships among them - as a network. The decision to use a visual framework was made at an early stage of the project to focus the attention of participating teachers on the key factors influencing the quality of an OCE and their interrelationship. In the collaborative research process, this framework helped to identify key issues with the current examination practice and set priorities for the design process while maintaining a focus on theory development. The present design of the conceptual framework for an OCE (Fig. 1) encompasses four levels, which are arranged to reflect the hierarchical structure of the German education system 1 : 1. the ‘outer frames’, which encompass the target language use (TLU) domain (i.e. the real-world language use modelled both in the classroom and the OCE; Bachman & Palmer 1996); several levels of legal frameworks, examination regulations, as well as curriculum guidelines; 2. the ‘inner frames’ or ‘jurisdiction of change’ (as demarcated by the dashed line), indicating the levels within a local school context where teachers can actively design instruction and assessment. This is the domain is pivotal for local curriculum and classroom development (see Siepmann 2024b, forthc. for illustration). local conditions which may impose limits on the preparation and conduction of an OCE such as the available time, staff, materials, as well as the support from the school community; the level of teaching and learning in the foreign language classroom, which, ideally, is closely connected to 3. the level of assessment and evaluation (centre), which comprises: the assessment task (see Sect. 4.2), assessment criteria/ scales (see Sect. 4.3), and the assessment procedures (‘scoring & grading’, see Sect. 4.3). the ‘human factor’, that is, teachers/ raters and students/ testtakers as the main stakeholders of the OCE (not discussed in detail in this article). 4. The test construct, which spans all levels of the framework. 
It ensures that all elements of the OCE fit together (see Sect. 4.1). 1 While the framework is modelled on the German education system it should, with minor adjustments, be applicable to most education systems and local contexts. Philipp Siepmann 140 Fig. 1: Design of the conceptual framework for an OCE. Oral communication exams in the EFL classroom 141 The framework integrates the key components of previous theoretical models of oral language performance assessment (Fulcher 2003; McNamara 1996; Milanovic & Saville 1996; Skehan 1998; see also Zhao 2013 for a detailed comparison of these models): test construct, raters, test-takers, assessment tasks, assessment criteria/ scales, and procedures. Since these models were not primarily designed with classroom-based assessment in mind, elements were borrowed from other models: Ishii & Baba’s (2003) model of locally developed oral skills evaluation foregrounds the alignment of teaching, learning, and assessment (see ‘teaching and learning in the classroom’, Fig. 1). Another notable extension found in the present version is the emphasis on the societal and institutional embeddedness of classroom-based assessment (see ‘outer frames’ in Fig. 1). The outer frames symbolise different levels of school policy and administration, which determine the legal and formal guidelines for educational assessments. McNamara (2001: 340f.) defines two sets of external demands that often compete with the needs of teachers and students in classroom-based assessment: validity demands (e.g., aspects of reliability and validity, as well as the consequences of assessment) and managerial demands (e.g., reporting and accountability). These demands concern different levels, from the school itself to local/ national school authorities, depending on the structure of the respective educational system. The outer frames are also a reminder that educational assessment always reflects social and cultural values (McNamara 2001; Messick 1994). Finally, the ‘jurisdiction of change’, encompassing the inner circles, is an original aspect of the framework at hand. It delineates the area where individual teachers or a school’s collective staff can take effective action. Within the jurisdiction of change, some practical and logistical aspects of planning and conducting OCEs are included which became salient through the exchange with teachers in the empirical study, such as the availability of staff, rooms, and time for providing the assessment, as well as the support from the wider school community, including parents. These external conditions may impose tight restrictions on the assessment and may be detrimental to its quality, for instance, if not enough teachers are available to provide for a second examiner or if there is not enough time allocated for the examination. This part of the framework draws on O’Sullivan’s (2021) concept of a Comprehensive Learning System (CLS), which considers the stakeholders and local conditions in what the author calls the delivery system, that is the “process by which the formal curriculum is operationalised in specific learning contexts or domains” (10). It encompasses not only the teacher and classroom instruction but several other factors that shape the implementation of a classroom-based assessment: the physical environment of the school, the school staff, the available materials and (technical) equipment. 
The extent to which the elements described in section 4 can be actively shaped by educators ultimately depends on the conditions provided in the local school context. Philipp Siepmann 142 3. Defining ‘quality of assessment’ The conceptual framework of the OCE is intended to help policymakers and teachers make informed decisions about how to improve the quality of assessment. Therefore, the notion of assessment quality will be discussed before elaborating further on the design elements constituting the OCE. The first question which naturally arises is why quality matters in assessment. As Shohamy (1993: 2), who has written extensively on the impact of tests, emphasises, [t]he power and authority of tests enable policymakers to use them as effective tools for controlling educational systems and prescribing the behavior of those who are affected by their results administrators, teachers, and students. It is therefore imperative for educators to use this power responsibly and to pursue a high-quality standard of quality in classroom-based assessment in EFL, which includes a clear view of the potential impact of test on the individual learner. The second question relates to what constitutes a quality standard for a classroom-based OCE. Most empirical research on (large-scale) language assessment revolves around psychometric criteria such as reliability and validity. When it comes to classroom-based OCEs, however, quality is more difficult to grasp. The complex statistical operations required to determine, for example, interrater reliability, are hardly practicable for most language teachers, who lack both the time and the expertise. However, some of these criteria can be adapted for use in the classroom and provide guidance in approaching a concept of assessment quality. A wellestablished concept which is still cited in many recent textbooks is Bachman & Palmer’s (1996) notion of test usefulness. It consists of six interrelated qualities, namely reliability, validity, authenticity, interactiveness, impact/ washback, and practicality. Reliability refers to the dependability and consistency of test scores, i.e. a test should produce the same results when administered repeatedly (re-test reliability) or when scored several times by the same rater (intrarater reliability) or once by different raters (interrater reliability). Reliability, therefore, depends, among other things, on clearly communicated expectations that are communicated to test-takers and co-raters through transparent assessment criteria/ scales. Construct validity in classroom-based assessment means that the test covers the very skills or competences it claims to test. Authenticity emphasises that the assessment should reflect real-world language use rather than artificial ‘exam talk’, in order to be a useful measure of a learner’s ability to deal with everyday communicative tasks. Interactiveness describes the demand a test makes on the test-taker’s Oral communication exams in the EFL classroom 143 communicative/ interactional competences as well as cognitive, critical thinking and problem-solving abilities. Test impact (or washback) is the extent to which the test influences learning and teaching in the classroom. Ideally, the learner is provided with useful feedback on how to improve on the competences covered in the test. Practicality means that the test can be administered economically under the given personal, spatial, temporal, and material conditions. 
Because the qualities subsumed under the umbrella of test usefulness are interdependent, the design of a classroom-based language assessment such as the OCE must take into account the complex interplay of all its components. One way of approaching the complexity of assessment design is through the concept of constructive alignment (Biggs 1996), which forms the backbone of the conceptual framework for an OCE: From this perspective, a classroom-based assessment is of high quality if it is closely aligned with the learning objectives as well as instruction/ learning activities in the classroom. While this may seem simplistic, it provides a lens through which the quality of an assessment design can be quickly and easily evaluated, as well as a solid starting point for the design process of an OCE. 4. Design elements of the OCE framework Having arrived at a broad understanding of assessment quality in the development of OCEs, this section will review popular textbooks and empirical studies on oral language performance assessment and discuss their specific implications for designing quality tests and providing optimal conditions to test-takers. 2 It will approach the design elements of the OCE as depicted in the conceptual framework (Fig. 1) from the outer frames to the inner frames, starting with the definition of the test construct and specifications, before moving on to the assessment tasks, assessment criteria/ scales, and the assessment procedure. 4.1. Test construct To establish constructive alignment, the competences tested in an OCE must reflect the learning objectives and learning activities in the classroom. The first step in designing an OCE is to define the test construct (see DP 1.2, Sect. 5). As indicated by the bold arrow (see Fig. 1), the test 2 This section builds upon the extensive literature review carried out during the phase of initial problem analysis in the design-based research project (i.e. the first cycle of analysis & exploration as part of the generic model of educational design research; McKenney & Reeves 2019). Philipp Siepmann 144 construct is influenced by a) the target language use domain (Bachman & Palmer 1996), b) by guidelines and regulations at different levels, c) by what is taught and learnt in the classroom, and d) the specific test purpose. The test construct can be thought of as the fabric that weaves together all the elements of the test. Thus, construct definition, “is the process of defining what it is we intend to measure” (Cheng & Fox 2017: 104). The test construct will guide the development of both the assessment task and the assessment criteria. Before delving deeper into the process of construct definition, it should be noted that the term ‘test construct’ has a different meaning in classroom-based assessment than in psychometric language proficiency testing. As Baird et al. (2017: 15) explain, educational attainment constructs set out what students should learn. Unlike psychological constructs, they are goals. Within an education system, they help to generate the very attributes that they assess by making transparent what students should know and be able to do. Applied to the test construct of an OCE, this means that it must conform to the curriculum guidelines, examination regulations, and the learning objectives laid out in the school curriculum (see outer frames in Fig. 1). 
Defining the test construct starts with a reflection on the target language use domain, that is, real-world language use in a particular social and cultural context (Bachman & Palmer 1996). This helps to specify the language skills needed to cope with this communicative situation and to devise assessment criteria. Construct definition takes some theoretical knowledge about the nature of spoken discourse and how it differs from written discourse (see Bygate 1987; Goh & Burns 2012; Luoma 2004 for elaboration). In very broad terms, the complex processing conditions of oral communication (Kormos 2006; Levelt 1989; Vandergrift & Goh 2021) are characterised by time pressure and reciprocity. This explains some of the linguistic features specific to spoken discourse, such as the use of deictic expressions (‘you’/ ’me’; ‘here’/ ’there’) and the high proportion of formulaic expressions (Bygate 1987; Goh & Burns 2012). Language teachers may argue that they lack both the time and the expertise to deal in detail with the theoretical test construct of an OCE. However, there are some accessible concepts of oral communication that can form the basis of an educational test construct. As many curricula and standards in the countries of the European Union (and far beyond) are based on the scales of the CEFR (Common European Framework of Reference; CoE 2001), the subscales provided in the companion volume (CoE 2020) can facilitate the process of construct definition as well as the development of corresponding assessment scales. The CEFR provides specific scales for different task types, such as ‘informal discussion (with friends)’ Oral communication exams in the EFL classroom 145 or ‘goal-oriented collaboration’. As the language of curricula or the CEFR may be too general and abstract for students, the test construct should be described in simpler terms. In the main study mentioned in Sect. 2.1, the Oracy Skills Framework (OSF) introduced by Mercer et al. (2019) and adapted for English language education by Siepmann (2024b) was used as the basis and proved to be very practicable and accessible to students. It encompasses the various resources that students can draw on in oral communication, including physical (e.g., voice and body language), linguistic (e.g., lexis, grammar, genre), cognitive (e.g., organisation of talk, strategies), and social-emotional (e.g., confidence in speaking, interaction, active listening) aspects. The OSF covers elements of communicative and interactional competence but, unlike the CEFR scales, also puts a strong emphasis on non-verbal (e.g. use of voice and body language) and affective (e.g., emotions, motivation, anxiety) components of oral communication. In classroom-based assessments, the test construct often covers non-linguistic elements which are an integral part of the EFL curriculum but usually not covered in oral proficiency tests. In the German context, for example, these include intercultural communicative competence (KMK 2012; 2023) and discourse competence as an overarching goal of foreign language education (Legutke 2010). Discourse competence implies that language learning is an essential prerequisite for students’ social and cultural participation in a globalising world. A controversial aspect of the test construct is whether an oral assessment can be used to measure an individual’s language skills. This question has been raised by proponents of interactional competence (IC). 
IC has received increasing attention in the language testing community in recent years (Plough et al. 2018) and is worth considering in the development of classroom-based OCEs. Since meaning is largely co-constructed in oral interaction, and communicative success depends on all participants of a conversation, it is argued that a paired or group oral should be regarded as a collective performance and therefore be assessed as such. This idea is expressed in the term ‘confluence’, which denotes “the act of making spoken language fluent together with another speaker” (Walsh 2012: 3). Thereby arguing against a ‘monological bias’ of speaking constructs (McCarthy 2005), IC “is concerned with the ways in which interactants construct meanings together, as opposed to looking at features of individual performance which lie at the heart of communicative competence” (Walsh 2012: 3). An important implication of IC is that more attention is given to (active) listening both in language teaching and assessment (Lam 2021, 2024). A shift from communicative competence, which has shaped communicative language teaching and assessment for decades, to IC could go as far as to include a collective component in the overall grade for a learner’s performance in oral assessments (May 2009). Philipp Siepmann 146 A potential positive washback of such a feature could be that it encourages collaboration in interaction inside and outside the classroom. 3 Whatever aspects the test construct includes, if carefully defined, it can help strengthen the links between teaching and learning in the classroom and the OCE. Moreover, it also makes it easier to communicate expectations to students. To further enhance transparency and constructive alignment, textbooks on classroom-based language assessment (e.g., Bachman & Damböck 2018; Bachman & Palmer 1996; Cheng & Fox 2017; Luoma 2004) recommend writing detailed test specifications, for example, in form of a table (see DP 1.3). Test specifications state which competences are tested (test construct) and for which purpose, how the competences are tested (e.g., number of items, test duration, test/ task format, etc.), and how the test will be graded (e.g., criteria, scale). In this way, detailed specifications “will help the developers create a coherent system whose parts fit together” (Luoma 2004: 115). Writing test specifications for an OCE is important not only to meet external accountability demands (compliance with educational standards, examination regulations, curriculum guidelines, etc.), but also to make the goals and subject matter of assessment transparent to students. Ideally, these test specifications are formally agreed by the EFL teaching staff and formalised in the school curriculum (see DP 1.1). 4.2. Assessment task In language assessment, tasks are “the means by which we can elicit a sample of language that can be scored”, and thus “strengthen the inference we can make from scores to construct” (Fulcher 2003: 50). In terms of the overall quality of an OCE, tasks play a crucial role in the type and complexity of communication and interaction they elicit from test-takers. This section will outline key aspects of a ‘genuine’ speaking task, that is, one that is based on authentic genres of oral communication, as opposed to tasks adapted from written assessments. The following factors relevant to task design will be addressed: • task format • task complexity • participant structure • topic • planning time • authenticity • quality assurance. 
3 Exemplary assessment scales for IC at various proficiency levels are provided by Galaczi (2014) and Barth-Weingarten & Freitag-Hild (2021). Oral communication exams in the EFL classroom 147 Task format Tasks operationalise the test construct. A crucial step in task design is therefore to decide which general task format best reflects the test construct, that is, elicits the desired type of communication from the testtakers. According to Fulcher’s (2003) typology, tasks may have an open, guided or closed orientation which determine the range of options that test-takers have in responding to a task. Tasks may be non-interactional (monologic) or interactional (dialogic, multilogic), they can be goaloriented, they can present interactants with convergent (e.g. goaloriented collaboration/ planning) or divergent (e.g. pro/ contra debate) goals, and they relate to specific topics and situations. Goh and Burns (2012) distinguish between monologic, communication-gap, and discussion tasks. Communication-gap tasks are further divided into informationgap tasks where students have different sets of information that they share to achieve a common goal, and context-gap tasks, in which students have the same set of information and are asked to create new content for their audience (e.g., present their opinions). Ultimately, the assessment task should confront students with a communicative problem which they are asked to solve through collaboration or by negotiating diverging viewpoints (see DP 2.4). The choice of monologic or dialogic tasks has implications for the nature and quality of test-takers’ discourse in the examination. Monologic tasks include narrative/ descriptive tasks (e.g., picture description, telling a story based on an image sequence), instruction (e.g., giving directions), retelling (e.g., of a story read during preparation time), explanation/ prediction (e.g., interpreting a diagram) or a prepared speech (Luoma 2004). There are some limitations as to the discourse elicited by monologic tasks: Test-takers tend to respond to monologic tasks in a more structured and predictable way than to dialogic tasks (Fulcher 2003). As they produce less variability in test-takers’ responses, they are more suitable for testing specific linguistic forms or language functions. Another limitation of monologic speaking tests is that they provide less information about a test-taker’s real-world interactional competence than do dialogic formats (Roever & Ikeda 2022). Dialogic or interactive speaking tasks include planning or decision-making (e.g., choosing between different options) and role-plays or simulations (e.g. resolving a conflict with an exchange student) (Luoma 2004). A combination of monologic and dialogic tasks and different degrees of openness allows the students to demonstrate a wide range of their communicative abilities in an OCE (see DP 2.5). Philipp Siepmann 148 Task complexity The assessment task largely determines the interactiveness of an OCE, which is an aspect of test usefulness (see Sect. 2). As Bachman & Palmer (1996: 39) put it, interactiveness refers to “the extent and type of involvement of the test-taker’s language ability [...], topical knowledge, and affective schemata, in accomplishing a test task”. In other words, interactiveness is a function of the cognitive and linguistic complexity of an assessment task; balancing these features is crucial in task design (see DP 2.5). In general, the focus of an OCE task should be on communication rather than content (see DP 4.1). 
Grabowski (2007) demonstrated a ‘writing superiority effect’ in a series of experiments, which means that written assessments have higher content validity than oral examinations in the verbal recall of declarative/ factual knowledge due to the higher cognitive load of speaking compared to writing. As a rule, therefore, the cognitive demand of an oral assessment task should be lower than that of a written assessment task. In order to match the cognitive and linguistic demand to the overall level of the students, Robinson (2001) proposes three dimensions of complexity: (cognitive) task complexity, (interactional) task conditions, and (relative) task difficulty. For example, open tasks involving cognitively demanding processes such as logical reasoning require more complex linguistic forms (such as causal or chronological sequencing) and are therefore more complex than closed-ended question-answer tasks or guided narrative tasks. Cognitively complex task formats that require argumentation and problem-solving such as decision tasks tend to produce more syntactically complex communication but may be detrimental to fluency and may even lead to communication breakdown. Guided, more structured tasks such as descriptive or narrative tasks benefit fluency and accuracy, but tend to result in shorter, less complex responses (Skehan & Foster 1997). The findings on the correlation between task structure and accuracy were confirmed by a study by Tavakoli & Skehan (2005), although it did not confirm a consistent effect of task structure on complexity. At the interactional level, various factors such as the participant structure (see paragraph below), their personality or gender, or their familiarity with the interlocutor affect the overall complexity of a task. Finally, individual learner differences in motivation, anxiety, ability, etc. influence the relative difficulty of a task. Participant structure If the task requires interaction, the question of who the students will be talking to - the teacher or another student/ a group of students - is central. Leaving aside the practical problems that arise when the teacher acts Oral communication exams in the EFL classroom 149 as both interlocutor and examiner, there is evidence that the students’ performance on an OCE improves when they interact with their peers (see DP 4.1). In Nakatsuhara’s (2013) study, test-takers not only performed better in paired peer-to-peer oral examinations than in interview examinations, but also showed more complex patterns of interaction and negotiation of meaning. Such paired assessment formats also tend to elicit more spontaneous, richer speech (Brooks 2009) and “provide opportunities for students to demonstrate ‘real-life’ interactional abilities to relate to each other in spoken interaction” (Gan et al. 2009). Paired assessment tasks should allow test-takers to demonstrate collaborative skills and use communication strategies (Galaczi 2008). Mutuality and equality are key qualities of such a task design: Mutuality means that test-takers coconstruct meaning and build up on each other’s ideas, while equality refers to balanced opportunities for participation (ibid.). Group size has tangible implications for the interaction between the group members. 
Comparing different group sizes (three versus four participants), Nakatsuhara (2011) found that a group size of three provided a more collaborative atmosphere than a group size of four (see DP 4.3), which she attributed to a higher degree of equality and mutuality, suggesting that the test-takers collaborated more successfully in smaller groups. The larger group size also encouraged avoidance behaviour from the more introverted members. Another finding of the study was that larger groups were more prone to mechanical turn-taking, where students make their contributions in a fixed order rather than in a dynamic way as in naturally occurring conversation. Although research suggests that pair and group assessment formats tend to benefit the quality of examination discourse, there is still a risk of asymmetric conversation caused by dominant behaviour by one partner (May 2009). May’s study also revealed that when a collaborative grade (or part grade) is awarded, different interaction styles of test-takers can pose a considerable challenge to the raters. Davis (2009) provides evidence that student proficiency is an obvious criterion when assembling groups of test-takers for an OCE; differences in ability tend to benefit lower-proficiency test-takers while the higher-proficiency test-takers were not disadvantaged if the differences were moderate. Topic Another important variable in task design is the topic of the task and the test-takers’ familiarity with it. In a study by Khabbazbashi (2021) a general topic effect - a significant influence of the topic of an assessment task on the test-taker’s score in the assessment - could not be verified. However, it provides evidence that topic familiarity and the level of abstraction influence the quality of discourse elicited by a task. In a previ- Philipp Siepmann 150 ous study, the author found that the test-taker’s individual background knowledge influences the cognitive demand of an assessment task and thus the linguistic complexity and fluency of the performance (Khabbazbashi 2017). As Jennings et al. (1999: 451) show, giving testtakers a choice of topics to talk about does not have a statistically significant effect on test scores, but qualitative data collected from test-takers suggests that it has favourable psychological effects as it helps “shift the balance of power from the tester to the test-taker” (see DP 2.3). In classroom-based OCEs, care should be taken to ensure that the topics covered in the test reflect the contents of the previous teaching unit in a balanced way (content/ curricular validity). Giving students some choice of topic - for instance, when giving a prepared speech in the OCE - is one way to ensure that the topics discussed in the OCE are meaningful and relevant from the students’ perspective (Rogge 2012, see DP 2.3). This aspect should be considered when beginning to define the target language use domain the test refers to: Where and how - through which media, and in which genres - would the students (or their peers) discuss the topic(s) covered in the OCE? What aspects would matter most to them? If the teacher is not familiar with these discourses, students could be actively involved in this process and thus take part in designing the task. Planning time Depending on the type and complexity of the assessment task, teachers must decide whether to allow the students planning time to prepare for the task. 
While several studies have found no significant impact on overall test scores (Elder & Wigglesworth 2006; Inoue & Lam 2021; Lampropoulou 2023), there is evidence that planning benefits the quality of the language output by increasing fluency, leading to more coherent and persuasive discourse, greater lexical variety, and syntactic complexity (O’Grady 2019; Yuan & Ellis 2003). In Bamanger & Gashan’s (2015) experimental study of pre-task planning conducted on high school students, planners outperformed non-planners in fluency, accuracy, and complexity measures, and, in contrast to the studies cited above, scored higher overall. The authors stress that to make effective use of planning time, teachers should prepare students to use it purposefully. Inoue & Lam (2021: 1) demonstrated that extended planning time in a listening-to-speaking test led to greater cognitive and metacognitive engagement on the part of test-takers, that is, “enhanced cognitive validity of the task”. Pre-task planning time, however, may negatively impact some aspects of performance, for instance, collaboration in interaction (Nitta & Nakatsuhara 2014). It should be emphasised that, in addition to aspects of performance, task complexity and authenticity should be considered when deciding about (the amount of) pre-task planning time. O’Grady (2019), who found in his study that planning time has little influence on test Oral communication exams in the EFL classroom 151 scores, recommends that it should only be provided where it reflects realworld language use and should not be offered where spontaneous speech is desired (see DP 4.4). In any case, it should be avoided that students make lengthy notes that they may be tempted to read out instead of speaking freely. Task authenticity Authenticity is a highly contested concept in language assessment. Task authenticity refers to how closely an assessment task resembles realworld communicative tasks in the target language use (TLU) domain, see Sect. 2.1) and whether it allows students to use language in a meaningful, contextualised way (Bachman & Palmer, 1996). The authors acknowledge that authenticity is a subjective category, “as different test takers may have different perceptions about their TLU domains” (ibid., 24). Therefore, what a test developer may consider an authentic assessment may not necessarily be accepted as such by the test-taker. Moreover, it may be difficult, if not impossible, to precisely define the characteristics of a realworld communicative task and to transfer them to the assessment context (Lewkowicz 2000). It may be added that in classroom-based assessment, the question arises as to whether an ‘authentic’ task will automatically lead to greater student engagement and motivation - after all, the degree of authenticity of speaking tasks in test settings is always limited by the fact that they take place in a test context (Bo 2007). As Stokoe’s (2013) analysis reveals, speakers’ actions in simulated role-plays tend to be more elaborate, interactionally visible and thus more easily ‘assessible’ than in their real-world counterparts. In other words, test-takers tend to give the raters what they are looking for rather than satisfying their own communicative needs, as they would in naturally occurring conversation. Thus, even if the task is (perceived to be) authentic, the communication in the examination need not be. 
For these and other reasons, the concept of authenticity is highly contested in the academic discussion and there is very little guidance on how to practically achieve authenticity in assessment. It might be sufficient for language teachers developing an OCE task to look for inspiration in real-world communication rather than to adapt speaking tasks from their written counterparts (see DP 2.1). Quality assurance in task design As with every element of the OCE framework, quality assurance is key to task design. In order to establish a constructive alignment between instruction and assessment, writing a blueprint of the task is recommended (see DP 1.4). Bachman & Damböck (2017) point out that a blueprint “provides a link between the objectives of the language classroom and the content to be assessed, and between the contents of the instructional task and the assessment task” (139). This is essential to ensuring the as- Philipp Siepmann 152 sessment covers the instructional content and that meaningful and generalisable interpretations can be made (cf. ibid.). The process of assessment task design does not end with the first administration of an OCE. As Biggs (1996: 356) points out, the crucial point in designing assessment tasks is “to judge the extent to which they embody the target performances of understanding, and how well they lend themselves to evaluating individual student performances”. This implies that any task should be carefully piloted, evaluated by teachers and students, and refined to ensure that it allows students to demonstrate the full range of their abilities in spoken communication and allows valid assessment. A final step in task design before implementing a new task is to carefully (peer-)review the task. It should clearly communicate the communicative context, situation, purpose and goals of communication, speaker roles and addressee(s) (see DP 2.5). 4.3. Assessment criteria/ scales Like assessment tasks, assessment criteria and scales have a strong influence on the overall quality of an OCE. If no mandatory assessment scales are provided, teachers can adapt existing scales or develop their own from scratch. This is a challenging task which requires expert knowledge of test constructs and test development. Insights from empirical studies on the assessment scales may provide orientation to policymakers and teachers. Another important decision concerns the assessment procedure, that is, how the test is scored and graded, especially if there is more than one rater involved. Scale types The term assessment scale or, as often used synonymously, rubrics (see Kuiken & Vedder 2020) refers broadly to “any tool that allows raters to distinguish among a number of levels of language performance” (Baker & Turner 2022: 1). Assessment scales can be divided into two main categories, holistic and analytic. Holistic scales integrate all aspects of assessment in one scale, which allows a quick and intuitive assessment, but they lack reliability and validity (Fulcher 2003). Moreover, they are of limited use to students to make inferences about their individual strengths and weaknesses e.g. in fluency or accuracy. Analytic scales allow different aspects of performance (such as fluency, coherence, pronunciation, etc.) to be scored separately. They thus allow for a more differentiated assessment with balanced reference to different criteria and provide for individual feedback on the individual criteria to the students (Knoch 2009). 
Although the use of analytic scales is more cognitively demanding for raters, as it requires them to attend to different aspects of Oral communication exams in the EFL classroom 153 the test construct while observing and evaluating the test performance, analytic scales permit a more reliable and valid assessment (Hamp-Lyons, 1991); they are thus recommended for use in an OCE (see DP 3.1) A third category, the part-marking model, is mentioned by Khabbazbashi & Galaczi (2020), where individual scores are assigned to different parts (e.g., interview, monologue, dialogue) of a test. Their study found that part marking models provide greater differentiation and precision in assessments of speaking abilities (see DP 3.1). Part-marking analytic scales which differentiate between monologic and dialogic (interview/ peerpeer) parts are recommended in the official guidelines for OCEs of most federal states of Germany, with some exceptions such as Lower Saxony where a combination of holistic and analytic scales is suggested (Niedersächsisches Kultusministerium 2014: 124 ff.). For all three models, a further distinction can be made between task-independent scales (which can be used for similar tasks) and task-dependent scales (which only apply to one specific task) (Kuiken & Vedder 2020). In classroom-based OCEs, it may help students to describe fulfilment criteria in relation to the specific demands of a task. Defining assessment criteria Assessment criteria should reflect the learning objectives of classroom instruction to ensure constructive alignment and content validity. In particular, assessment criteria for an OCE should consider the characteristics of spoken discourse (see DP 3.2). As (synchronous, face-to-face) oral communication occurs under time pressure, speakers have very little opportunities to plan or edit what is said. This may partly be compensated for by the co-constructed, reciprocal nature of oral communication, as meaning is negotiated in discourse (Bygate 1987). These characteristics need to be reflected in the assessment criteria devised for an OCE, which is not always the case in practice: Matz et al. (2018), for example, criticise the separation of content and language, as well as of grammar and lexis in the assessment scales for OCEs issued by the Ministry of Education of North Rhine-Westphalia, Germany (MSW NRW 2014). The separation of content and language, they argue, ignores the fact that the quality of content in oral communication depends largely on whether it is communicated coherently and appropriately to a particular audience. As for the separation of grammar and lexis, they point out that oral discourse relies heavily on the use of lexical chunks. In practice, it may be difficult to distinguish between lexical and grammatical errors when these chunks are used incorrectly. The authors add that an individual scale for grammar focuses the examiners’ attention on grammatical accuracy and thus on the students’ deficits than their communicative success. Philipp Siepmann 154 Transparency of assessment criteria Assessment criteria should be communicated transparently to the students so they can draw conclusions about their learning from their scores in each category. This means that rather than using generic criteria and descriptors to assess students’ performance, task-specific criteria should be developed as they “increase discussion about the different components of a specific assignment” (Rosenow 2014). 
To further increase their usability for students, criteria and level descriptors should therefore be written in a student-friendly, non-technical way to help students set learning goals and monitor their progress (Andrade et al. 2021; see DP 3.3). Students can even be actively involved in the development of assessment criteria and scales (see DP 3.4). While teachers may be reluctant to give up control, this is a crucial step towards democratic assessment as well as shared power and responsibility (Shohamy 2001). The co-construction (or co-creation) of assessment criteria has been found to increase learner autonomy as it improves self-regulation, strategy use, and overall performance (Carless 2009; Fraile et al. 2017; Panadero & Romero 2014). Moreover, it benefits their capability of evaluative judgment as well as the overall quality of peer feedback (Yan 2024). A study by Zhao & Zhao (2020) on co-constructing writing assessment criteria established that coconstruction not only strengthens constructive alignment of learning objectives (or educational standards), learning, and assessment, but also promotes students’ metacognitive competences. In their case study, criteria based on the CEFR scales were provided by the instructor and refined by the students. This approach of guided co-construction seems more promising in school contexts than developing criteria from scratch. One way to establish and refine assessment criteria is to work with exemplars, that is, “carefully chosen samples of student work used to illustrate dimensions of quality” (Carless et al. 2018: 108). It should be noted that most studies cited above were conducted in higher education and on writing tasks. However, it is reasonable to assume that co-construction of criteria has similar benefits in secondary education and in the context of preparing for an OCE. Rogge (2018), for example, outlines a method for co-constructing assessment criteria for OCEs in the EFL classroom which further strengthens the constructive alignment between teaching and testing. Establishing agreement between raters The grading process is a potential source of variability in test scores, and particularly so in oral assessments. In an ‘ideal’ setting - with fully objective examiners and optimal conditions for examinees - it would be possible to identify a learner’s ‘true score’ in an OCE. While complete objectivity is rarely achieved in language assessments - except, maybe, in Oral communication exams in the EFL classroom 155 multiple-choice tests where an answer is either right or wrong -, oral assessment is particularly prone to subjectivity (e.g., harshness) and rater bias (Lumley & McNamara 1995). This can lead to low reliability of test scores. There are steps that can be taken to improve the quality of assessment procedures. The presence of a second (or even third) rater, ideally someone unfamiliar with the students and their previous performance, is recommended where practically feasible. If two or more raters are available, there seems to be a broad consensus in the research literature that scores assigned by the raters should at least be adjacent (e.g., ‘3’ and ‘4’ on the German six-point grading scale; see Penny & Johnson, 2011 for an overview). Penny and Johnson (2011) list five models to resolve non-adjacent scores (in writing assessments), two of which seem practical for classroom-based oral assessments such as OCEs. The first model is based on the rater mean, i.e. 
the test score is calculated by combining (summing up or averaging) the scores awarded by the individual raters. The second model, discussion, requires raters to review their ratings and agree on a consensus score. Since the rater mean model is much less time-consuming than the discussion model, it should be preferred for reasons of efficiency (see DP 4.4). The discussion model is also vulnerable to power imbalances, especially if the raters have different professional status or experience. However, discussion of divergent scores can help to calibrate raters’ standards and can be used when two raters consistently disagree. 5. Preliminary design principles for OCEs Based on the research findings on the design elements of the OCE, this section will sum up the implications of the available evidence for classroom-based oral assessments in the EFL classroom in four design principles (DP). In design-based research, which typically pursues a dual goal of solving practical pedagogical problems while deepening theoretical understanding about these problems, design principles act as a “bridge between scientific knowledge production and practice design” (Euler 2017: 1). They often take the form of “prescriptive statements [which form the] the basis for designing practical action concepts to achieve the defined practice goals”. (ibid.: 2). The DP presented in this article are preliminary or ‘alpha’ versions. That is, they have not yet been fieldtested, evaluated, and further refined to empirically based and more advanced ‘beta’ and ‘gamma’ versions 4 (see McKenney & Reeves 2019: Ch. 6). Each of the four principles focuses on improving different aspects of 4 To learn more about their evolution over the course of the collaborative design-based research project, see Siepmann (2024a; forthc.; Siepmann & Bruns forthc.). Philipp Siepmann 156 the quality of an OCE, such as validity or authenticity. The first design principle concentrates on the steps to be taken in the early planning stages to achieve constructive alignment (DP1). The second and third design principle provide guidelines for the development of assessment tasks (DP2) and assessment criteria/ scales (DP3). The fourth principle (DP4) focuses on enhancing conditions and procedures to improve the practicality of the OCE. Preliminary design principle 1 (DP1; alpha): constructive alignment Quality focus: constructive alignment, validity To systematically foster students’ oral competences, learning objectives, learning activities in the classroom, and assessment tasks and criteria of the oral communication exam (OCE) should be constructively aligned. 1.1. When introducing a new OCE, the school curriculum should be thoroughly revisited and re-designed to focus on oral communication and oral competences. 1.2. The test construct should be defined and described in detail before planning the teaching unit and the OCE. It should be reflected in the learning objectives, the learning tasks as well as in the assessment tasks and criteria for the OCE. 1.3. Writing test specifications ensures compliance with curriculum guidelines and examination regulations and helps to operationalise the test construct. Moreover, test specifications facilitate documentation and reporting (e.g., to students, parents, school principal, and school authorities) to meet external demands to accountability. 1.4. Design features of assessment tasks should be laid out in a blueprint which makes transparent the operationalisation of the test construct in the task. 
Preliminary design principle 2 (DP2; alpha): assessment task Quality focus: constructive alignment, interactiveness, validity, authenticity, washback To enable the students to demonstrate a range of their oral communicative competences in the oral communication exam, the assessment task should be conceptualised as a genuine speaking task. 2.1. The task should generate an authentic communicative context modelled on authentic language use and genres of oral communication. 2.2. The task should refer to real-world topics/ discourses that are relevant and meaningful from the students’ perspective. The test-taker’s individual perception of relevance and meaningfulness of the task topic may be enhanced by providing (limited) choice of topic. 2.3. The task should require solving a communicative problem, e.g., bridge a communication/ information gap, by collaborating to reach a common goal or by presenting and negotiating diverging viewpoints. Oral communication exams in the EFL classroom 157 2.4. The cognitive and linguistic complexity of the task should allow students to demonstrate a range of their (monologic and interactional) competences in oral communication at different levels of cognitive processing. 2.5. Task instructions should describe in sufficient detail the communicative context (situation, genre, speaker roles) and clearly state the communicative goals. Preliminary design principle 3 (DP3; alpha): assessment criteria/ scales Quality focus: constructive alignment, reliability, validity, impact/ washback, practicality/ economy The assessment criteria/ scale for an oral communication exam should allow for a reliable and valid assessment of the students’ competences in oral communication/ interaction. 3.1. When designing an assessment scale, an analytic scale type is to be preferred to a holistic scale to allow for a differentiated evaluation of a learner’s performance and diagnostic use of the scores for individual feedback to students. If the OCE consists of more than one part (e.g., monologic and dialogic), a part-marking model is recommended, where each part is scored individually. 3.2. The assessment criteria should consider the linguistic features of oral communication, for example by combining vocabulary and grammar into one category and specifying a higher degree of tolerance for deviations from linguistic correctness. 3.3. The criteria should clearly and transparently define the conditions for task fulfilment and are communicated to the students in concrete, comprehensible terms. 3.4. [optional] In order to foster learner autonomy, the assessment criteria may be co-constructed with the students, provided the overall level of the learning group allows it. In most contexts, an approach of guided coconstruction is to be preferred where criteria introduced by the teacher are refined by the students. Preliminary design principle 4 (DP4; alpha): optimising examination conditions and procedures Quality focus: practicality/ economy, validity, reliability To provide optimal test-taking conditions for students in an oral communication exam (OCE) and to enhance test economy and fairness, the following measures are advised: 4.1. An oral communication exam should focus on learners’ communicative abilities in the target language. For any other focus - such as declarative knowledge or the analysis of written texts - written assessments are to be preferred. 4.2. 
4.2. Paired or group assessments with peer-peer interaction should be preferred to interview formats, in which the teacher acts as both rater and interlocutor.
4.3. When using group assessment, a group size of three has been shown to benefit collaboration and interaction between test-takers.
4.4. (Pre-task) planning time may benefit the learners' performance. It should be withheld only if the test construct requires spontaneous interaction or if planning time would be unusual in the target language use domain.
4.5. When assembling pairs or groups for the OCE, the language proficiency of the students should be considered. Differences in proficiency may benefit lower-proficiency students; however, the difference should be moderate to avoid disadvantaging higher-proficiency students.
4.6. To enhance scoring reliability and efficiency, calculating the rater mean (i.e. the sum or average of all ratings) is recommended. If the raters' scores are non-adjacent (i.e. differ by more than one grade/level), a consensus score should be agreed upon by reviewing and discussing the ratings (i.e. raters explain their reasoning), which also helps to align their standards.
4.7. Interrater reliability can be increased through regular, targeted rater training which raises raters' awareness of potential biases and calibrates rater standards. This should also entail drawing attention to the differences between spoken and written discourse features in assessment.

6. Conclusion

This article has introduced a conceptual framework to guide the design of classroom-based oral communication exams (OCEs) in the EFL classroom. The framework integrates elements of previous models of oral performance assessment but also contains some original features. For the individual design elements, including the test construct, assessment tasks, as well as assessment criteria and procedures, evidence-based recommendations were made and summarised in four preliminary, theory-based design principles. There are, of course, important factors that contribute to the quality of an OCE but were not discussed here as they go beyond the scope of this article, such as the 'human factor': differences in, for example, rater personality or harshness are a source of variation in test scores (Lumley & McNamara 1995). In addition, test-takers' performance is affected by factors such as their personality (e.g., extraversion, Nakatsuhara 2011) or anxiety (Horwitz et al. 1986). While these factors are hard to control, it is nevertheless important to raise awareness of their influence and to promote the language assessment literacy of teachers in pre-service and in-service teacher training programmes (Inbar-Lourie 2017; Vogt & Tsagari 2014). Another element which was not addressed in this article, but which was emphasised by the teachers participating in the empirical study (see Sect. 2), is the set of limiting external conditions, such as the lack of examination rooms, time, or staff. There is currently very little empirical research on such aspects of local, classroom-based assessment, and they therefore deserve further investigation. In addition, the current version of the conceptual framework needs elaboration regarding aspects of digitality (e.g. the use of digital media and artificial intelligence in learning and for assessment) and learner diversity (e.g. special needs education/inclusive education).
Finally, it should be mentioned that the framework may not be applicable to some national education systems or specific educational contexts. Despite these limitations, this conceptual framework will provide a useful basis for educational policy regarding the implementation of new oral assessment formats (currently underway in several federal states, including Saxony-Anhalt and Thuringia), for curriculum development, and for the improvement of local assessment practices. Furthermore, it can be utilised to foster the assessment literacy of language teachers at all stages of their professional development.

References

Andrade, Heidi L., Brookhart, Susan M. & Yu, Elie C. (2021). Classroom assessment as co-regulated learning: a systematic review. Frontiers in Education 6: 1-18. https://doi.org/10.3389/feduc.2021.751168
Bachman, Lyle F. & Palmer, Adrian S. (1996). Language testing in practice: designing and developing useful language tests. Oxford: University Press.
Bachman, Lyle F. & Damböck, Barbara (2017). Language Assessment for Classroom Teachers. Oxford: University Press.
Baird, Jo-Anne, Andrich, David, Hopfenbeck, Therese N. & Stobart, Gordon (2017). Assessment and learning: fields apart? Assessment in Education: Principles, Policy & Practice 24 (3): 317-350. https://doi.org/10.1080/0969594X.2017.1319337
Baker, Beverly A. & Turner, Carolyn E. (2022). Rating scales and rubrics in language assessment. In: Carol A. Chapelle (Ed.). The Encyclopedia of Applied Linguistics. Hoboken: Wiley. https://doi.org/10.1002/9781405198431.wbeal1045.pub2
Bamanger, Ebrahim M. & Khalid Gashan, Amani (2015). The effect of planning time on the fluency, accuracy, and complexity of EFL students' oral production. Journal of Educational Sciences 27 (1): 1-15.
Barth-Weingarten, Dagmar & Freitag-Hild, Britta (2021). Assessing interactional competence in secondary schools: issues of turn-taking. In: M. Rafael Salaberry & Alfred Rue Burch (Eds.). Second Language Acquisition: Vol. 149. Assessing Speaking in Context: Expanding the Construct and its Applications. Bristol: Multilingual Matters. 237-264. https://doi.org/10.21832/9781788923828
Biggs, John (1996). Enhancing teaching through constructive alignment. Higher Education 32 (3): 347-364. https://doi.org/10.1007/BF00138871
Bo, Jian-Lan (2007). An analysis of authenticity in CET-4 and TEM-8. Sino-US English Teaching 4 (2): 28-33.
Brooks, Lindsay (2009). Interacting in pairs in a test of oral proficiency: co-constructing a better performance. Language Testing 26 (3): 341-366. https://doi.org/10.1177/0265532209104666
Bygate, Michael (1987). Speaking. Oxford: University Press.
Carless, David, Chan, Kennedy K. H., To, Jessica, Lo, Margaret & Barrett, Elizabeth (2018). Developing students' capacities for evaluative judgement through analysing exemplars. In: David Boud, Rola Ajjawi, Phillip Dawson & Joanna Tai (Eds.). Developing Evaluative Judgement in Higher Education: Assessment for Knowing and Producing Quality Work. London: Routledge.
Cheng, Liying & Fox, Janna (2017). Assessment in the Language Classroom. London: Palgrave.
Council of Europe (2001). Common European Framework of Reference for Languages: Learning, teaching, assessment (CEFR). https://rm.coe.int/1680459f97 [Mar. 2025].
Council of Europe (2020). Common European Framework of Reference for Languages: Learning, teaching, assessment (CEFR). Companion Volume. https://rm.coe.int/common-european-framework-of-reference-for-languages-learning-teaching/16809ea0d4 [Mar. 2025].
Davis, Larry (2009). The influence of interlocutor proficiency in a paired oral assessment. Language Testing 26 (3): 367-396. https://doi.org/10.1177/0265532209104667
Elder, Catherine & Wigglesworth, Gillian (2006). An investigation of the effectiveness and validity of planning time in Part 2 of the IELTS speaking test. IELTS Research Reports 6.
Euler, Dieter (2017). Design principles as bridge between scientific knowledge production and practice design. EDeR - Educational Design Research 1 (1): 1-15. http://dx.doi.org/10.15460/eder.1.1.1024
Fraile, Juan, Panadero, Ernesto & Pardo, Rodrigo (2017). Co-creating scales: the effects on self-regulated learning, self-efficacy and performance of establishing assessment criteria with students. Studies in Educational Evaluation 53 (June): 69-76. http://doi.org/10.1016/j.stueduc.2017.03.003
Fulcher, Glenn (2003). Testing Second Language Speaking. London: Longman/Pearson Education.
Galaczi, Evelina D. (2008). Peer-peer interaction in a speaking test: the case of the First Certificate in English examination. Language Assessment Quarterly 5 (2): 89-119. https://doi.org/10.1080/15434300801934702
Galaczi, Evelina D. (2014). Interactional competence across proficiency levels: how do students manage interaction in paired speaking tests? Applied Linguistics 35 (5): 553-574. https://doi.org/10.1093/applin/amt017
Goh, Christine C. M. & Burns, Anne (2012). Teaching Speaking: A Holistic Approach. Cambridge: University Press.
Grabowski, Joachim (2007). The writing superiority effect in the verbal recall of knowledge: sources and determinants. In: Mark Torrance, Luuk van Waes & David Galbraith (Eds.). Writing and Cognition: Research and Applications. Amsterdam: Elsevier. 165-179. http://dx.doi.org/10.1108/S1572-6304(2007)0000020012
Hamp-Lyons, Liz (1991). Scoring procedures for ESL contexts. In: Liz Hamp-Lyons (Ed.). Assessing Second Language Writing in Academic Contexts. Ablex. 241-277.
Hessisches Kultusministerium (2020). Handreichung für die mündliche Kommunikationsprüfung in den modernen Fremdsprachen der gymnasialen Oberstufe. https://kultusministerium.hessen.de/infomaterial/Handreichung-fuer-die-muendliche-Kommunikationspruefung-in-den-modernen-Fremdsprachen-der [Mar. 2025].
Horwitz, Elaine K., Horwitz, Michael B. & Cope, Joann (1986). Foreign language classroom anxiety. The Modern Language Journal 70 (2): 125-132. https://doi.org/10.1111/j.1540-4781.1986.tb05256.x
Inbar-Lourie, Ofra (2017). Language assessment literacy. In: Elana Shohamy, Iair G. Or & Stephen May (Eds.). Language Testing and Assessment. Encyclopedia of Language and Education. Cham: Springer. 257-270. https://doi.org/10.1007/978-3-319-02261-1_19
Inoue, Chihiro & Lam, Daniel M. K. (2021). The effects of extended planning time on candidates' performance, processes, and strategy use in the lecture listening-into-speaking tasks of the TOEFL iBT® test. ETS Research Report Series 1: 1-32. https://doi.org/10.1002/ets2.12322
Ishii, David N. & Baba, Kyoko (2011). Locally developed oral skills evaluation in ESL/EFL education: a checklist for developing meaningful assessment procedures. TESL Canada Journal 21 (1): 79-96.
Jennings, Martha, Fox, Janna, Graves, Barbara & Shohamy, Elana (1999). The test-takers' choice: an investigation of the effect of topic on language-test performance. Language Testing 16 (4): 426-456. https://doi.org/10.1177/026553229901600402
Khabbazbashi, Nahal (2017). Topic and background knowledge effects on performance in speaking assessment. Language Testing 34 (1): 23-48. https://doi.org/10.1177/0265532215595666
Khabbazbashi, Nahal (2021). On topic validity in speaking tests. Studies in Language Testing 54. Cambridge: University Press.
Khabbazbashi, Nahal & Galaczi, Evelina D. (2020). A comparison of holistic, analytic, and part marking models in speaking assessment. Language Testing 37 (3): 333-360. https://doi.org/10.1177/0265532219898635
Knoch, Uta (2009). Diagnostic assessment of writing: a comparison of two rating scales. Language Testing 26 (2): 275-304. https://doi.org/10.1177/0265532208101008
Kormos, Judit (2006). Speech Production and Second Language Acquisition. Lawrence Erlbaum Associates Publishers.
Kuiken, Folkert & Vedder, Ineke (2020). Scoring approaches: scales/rubrics. In: Paula Winke & Tineke Brunfaut (Eds.). The Routledge Handbook of Second Language Acquisition and Language Testing. New York/London [etc.]: Routledge. 125-134.
Kultusministerkonferenz (=KMK) (2012). Bildungsstandards für die fortgeführte Fremdsprache (Englisch/Französisch). Beschluss der Kultusministerkonferenz. https://www.kmk.org/fileadmin/Dateien/veroeffentlichungen_beschluesse/2012/2012_10_18-Bildungsstandards-Fortgef-FS-Abi.pdf [Mar. 2025].
Kultusministerkonferenz (=KMK) (2023). Bildungsstandards für die erste Fremdsprache (Englisch/Französisch) für den Ersten Schulabschluss und den Mittleren Schulabschluss. https://www.kmk.org/aktuelles/artikelansicht/kmk-entwickelt-bildungsstandards-fuer-erste-fremdsprache-weiter.html [Mar. 2025].
Lam, Daniel M. K. (2021). Don't turn a deaf ear: a case for assessing interactive listening. Applied Linguistics 42 (4): 740-764. https://doi.org/10.1093/applin/amaa064
Lam, Daniel M. K. (2024). Listening and interactional competence. In: Elvis Wagner, Aaron O. Batty & Evelina Galaczi (Eds.). The Routledge Handbook of Second Language Acquisition and Listening. New York/London [etc.]: Routledge. 319-331.
Lampropoulou, Leda (2023). The use and impact of pre-task planning time in the monologic task of LanguageCert speaking tests. Language Education & Assessment 6 (1): 1-18.
Legutke, Michael K. (2010). Kommunikative Kompetenz und Diskursfähigkeit. In: Wolfgang Hallet & Frank G. Königs (Eds.). Handbuch Fremdsprachendidaktik. Seelze: Klett/Kallmeyer. 70-75.
Levelt, Willem (1989). Speaking: From Intention to Articulation. Cambridge: MIT Press.
Lewkowicz, Jo (2000). Authenticity in language testing. Language Testing 17 (1): 43-64. https://doi.org/10.1177/026553220001700102
Lumley, Tom & McNamara, Tim F. (1995). Rater characteristics and rater bias: implications for training. Language Testing 12 (1): 54-71.
Luoma, Sari (2004). Assessing Speaking. Cambridge: University Press.
Matz, Frauke, Rogge, Michael & Rumlich, Dominik (2018). What makes a good speaker of English? Sprechkompetenz mit mündlichen Prüfungen erfassen. Der Fremdsprachliche Unterricht Englisch 153: 2-7.
May, Lyn (2009). Co-constructed interaction in a paired speaking test: the rater's perspective. Language Testing 26 (3): 397-421. https://doi.org/10.1177/0265532209104668
McCarthy, Michael J. (2005). Fluency and confluence: what fluent speakers do. The Language Teacher 29 (6): 26-28.
McKenney, Susan & Reeves, Thomas C. ([2012] 2019). Conducting Educational Design Research. New York/London [etc.]: Routledge.
McNamara, Tim F. (1996). Measuring Second Language Performance. London: Longman.
McNamara, Tim F. (2001). Language assessment as social practice: challenges for research. Language Testing 18 (4): 333-349. https://doi.org/10.1177/026553220101800402
Mercer, Neil, Mannion, James & Warwick, Paul (2019). Oracy education. The development of young people's spoken language skills. In: Neil Mercer, Rupert Wegerif & Louis Major (Eds.). The Routledge International Handbook of Research on Dialogic Education. London: Routledge. 292-305. https://doi.org/10.4324/9780429441677
Messick, Sam (1994). Validity of Psychological Assessment: Validation of Inferences from Persons' Responses and Performances as Scientific Inquiry into Score Meaning.
Milanovic, Michael & Saville, Nick (1996). Introduction. Performance testing, cognition and assessment. Studies in Language Testing 3. Cambridge: University of Cambridge Local Examinations Syndicate. 1-17.
Miles, Matthew B., Huberman, A. Michael & Saldaña, Johnny (2018). Qualitative Data Analysis. A Methods Sourcebook. Sage.
Ministerium für Schule und Weiterbildung des Landes Nordrhein-Westfalen [MSW NRW] (2014a). Mündliche Prüfungen in den modernen Fremdsprachen in der gymnasialen Oberstufe. Handreichung September 2014. https://www.standardsicherung.schulministerium.nrw.de/cms/upload/angebote/muendliche_kompetenzen/docs/1503_Handreichung_Muendliche_Pruefungen.pdf [Mar. 2025].
Nakatsuhara, Fumiyo (2011). Effects of test-taker characteristics and the number of participants in group oral tests. Language Testing 28 (4): 483-508. https://doi.org/10.1177/0265532211398110
Nakatsuhara, Fumiyo (2013). The Co-Construction of Conversation in Group Oral Tests. Peter Lang.
Niedersächsisches Kultusministerium (2014). Materialien für den kompetenzorientierten Unterricht in der Gymnasialen Oberstufe. Sprechprüfungen. http://www.nibis.de/nibis.phtml?menid=2182 [Mar. 2025].
Nitta, Ryo & Nakatsuhara, Fumiyo (2014). A multi-faceted approach to investigating pre-task planning effects on paired oral test performance. Language Testing 31 (4): 147-175. https://doi.org/10.1177/0265532213514401
O'Grady, Stefan (2019). The impact of pre-task planning on speaking test performance for English-medium university admission. Language Testing 36 (4): 505-526. https://doi.org/10.1177/0265532219826604
O'Sullivan, Barry (2021). The Comprehensive Learning System. British Council Perspectives on English Language Policy and Education. www.britishcouncil.org/exam/aptis/research/publications/british-council-perspectives-english-language-policy-and-education [Mar. 2025].
Panadero, Ernesto & Romero, Margarida (2014). To rubric or not to rubric? The effects of self-assessment on self-regulation, performance and self-efficacy. Assessment in Education: Principles, Policy & Practice 21 (2): 133-148. https://doi.org/10.1080/0969594X.2013.877872
Penny, James A. & Johnson, Robert L. (2011). The accuracy of performance task scores after resolution of rater disagreement: a Monte Carlo study. Assessing Writing 16 (4): 221-236. https://doi.org/10.1016/j.asw.2011.06.001
Plough, India, Banerjee, Jayanti & Iwashita, Noriko (2018). Interactional competence: genie out of the bottle. Language Testing 35 (3): 427-445. https://doi.org/10.1177/0265532218772325
Robinson, Peter (2001). Task complexity, task difficulty, and task production. Applied Linguistics 22 (1): 27-57. https://doi.org/10.1093/applin/22.1.27
Roever, Carsten & Ikeda, Naoki (2022). What scores from monologic speaking tests can(not) tell us about interactional competence. Language Testing 39 (1): 7-29. https://doi.org/10.1177/02655322211003332
Rogge, Michael (2012). Sagen können, was man zu sagen hat. Mündliche Kompetenz mit Sprechaufgaben fördern. Der Fremdsprachliche Unterricht Englisch 116: 2-6.
Rogge, Michael (2018). Beurteilungskriterien im Unterricht erarbeiten. Der Fremdsprachliche Unterricht Englisch 153: 8-10.
Rosenow, De (2014). Collaborative design: building task-specific rubrics in the honors classroom. Journal of the National Collegiate Honors Council 15 (2): 31-34.
Shohamy, Elana (1993). The Power of Tests: The Impact of Language Tests on Teaching and Learning. Washington, DC: The National Foreign Language Center at Johns Hopkins University.
Shohamy, Elana (1998). Critical language testing and beyond. Studies in Educational Evaluation 24 (4): 331-345. https://doi.org/10.1016/S0191-491X(98)00020-0
Shohamy, Elana (2001). The Power of Tests: A Critical Perspective on the Uses of Language Tests. Harlow: Longman.
Shohamy, Elana (2001). Democratic assessment as an alternative. Language Testing 18 (4): 373-391. https://doi.org/10.1177/026553220101800404
Siepmann, Philipp (2024a). Tasks matter! Insights from a design-based research project on oral communication exams in the English language classroom. In: Julia Reckermann, Philipp Siepmann & Frauke Matz (Eds.). Oracy in English Language Education. Insights from Practice-Oriented Research. Cham: Springer. 147-166. https://doi.org/10.1007/978-3-031-59321-5_9
Siepmann, Philipp (2024b). Re-framing oracy in English language education. In: Julia Reckermann, Philipp Siepmann & Frauke Matz (Eds.). Oracy in English Language Education. Insights from Practice-Oriented Research. Cham: Springer. 17-36. https://doi.org/10.1007/978-3-031-59321-5_2
Siepmann, Philipp (forthc.). Designbasierte Forschung als Theorie-Praxis-Partnerschaft. Ein Kooperationsprojekt zur mündlichen Kommunikationsprüfung. In: Georgia Gödecke & Larena Schäfer (Eds.). Entwicklungs- und gestaltungsorientierte Fremdsprachenforschung. Forschungsmethodische Grundlagen und Praxisbeispiele. Trier: WVT.
Siepmann, Philipp & Bruns, Janine (forthc.). Fostering and assessing oracy in the EFL classroom: introducing the Oracycle. In: Frauke Matz & Dominik Rumlich (Eds.). Examinations in Foreign Language Teaching: Theory and Practice. Frankfurt: Peter Lang.
Skehan, Peter (1998). A Cognitive Approach to Language Learning. Oxford: University Press.
Skehan, Peter & Foster, Pauline (1997). Task type and task processing conditions as influences on foreign language performance. Language Teaching Research 1 (3): 185-211. https://doi.org/10.1177/136216889700100302
Stokoe, Elizabeth (2013). The (in)authenticity of simulated talk: comparing role-played and actual interaction and the implications for communication training. Research on Language & Social Interaction 46 (2): 165-185. https://doi.org/10.1080/08351813.2013.780341
Tavakoli, Parvaneh & Skehan, Peter (2005). Strategic planning, task structure, and performance testing. In: Rod Ellis (Ed.). Planning and Task Performance in a Second Language. Amsterdam: John Benjamins. 239-273. https://doi.org/10.1075/lllt.11.15tav
Taylor, Lynda & Wigglesworth, Gillian (2009). Are two heads better than one? Pair work in L2 assessment contexts. Language Testing 26 (3): 325-339. https://doi.org/10.1177/0265532209104665
Vandergrift, Larry & Goh, Christine C. M. ([2012] 2021). Teaching and Learning Second Language Listening. New York/London [etc.]: Routledge.
Vogt, Karin & Tsagari, Dina (2014). Assessment literacy of foreign language teachers: findings of a European study. Language Assessment Quarterly 11 (4): 374-402. https://doi.org/10.1080/15434303.2014.960046
Walsh, Steve (2012). Conceptualising classroom interactional competence. Novitas-ROYAL (Research on Youth and Language) 6 (1): 1-14.
Wigglesworth, Gillian & Frost, Kellie (2017). Task repetition and fluency development in L2 oral production. Language Teaching Research 21 (4): 480-500.
Yan, Da (2024). Rubric co-creation to promote quality, interactivity and uptake of peer feedback. Assessment & Evaluation in Higher Education 49 (8): 1017-1034. https://doi.org/10.1080/02602938.2024.2333005
Yuan, Fangyuan & Ellis, Rod (2003). The effects of pre-task planning and on-line planning on fluency, complexity and accuracy in L2 monologic oral production. Applied Linguistics 24 (1): 1-27. https://doi.org/10.1093/applin/24.1.1
Zhao, Huahui & Zhao, Beibei (2020). Co-constructing the assessment criteria for EFL writing by instructors and students: a participative approach to constructively aligning the CEFR, curricula, teaching and learning. Language Teaching Research 27 (3): 765-793. https://doi.org/10.1177/1362168820948458
Zhao, Zhongbao (2013). An overview of models of speaking performance and its implications for the development of procedural framework for diagnostic speaking tests. International Education Studies 6 (3): 66-75. https://doi.org/10.5539/ies.v6n3p66

Philipp Siepmann
Leibniz University Hannover