Qualitative Corpus Analysis

Journal: The Encyclopedia of Applied Linguistics
Manuscript ID: AL-2010-0424.R1
Wiley - Manuscript type: article
Date Submitted by the Author: 13-Sep-2011
Complete List of Authors: Hasko, Victoria; University of Georgia, Language and Literacy Education
Keywords: corpus, Methods, Research Methods in Applied Linguistics
John Wiley & Sons
Qualitative Corpus Analysis
Victoria Hasko
University of Georgia
vhasko@uga.edu
Word count: 2,474
Reference word count: 355
[A]Qualitative Corpus Analysis: Definition and Philosophy
Qualitative Corpus Analysis is a methodology for pursuing in-depth investigations of linguistic phenomena, as grounded in the context of authentic, communicative situations that are digitally stored as language corpora and made available for access, retrieval, and analysis via computer. Researchers using Qualitative Corpus Analysis as the methodological basis for their investigations adopt an exploratory, inductive approach to empirically based study of how the meanings and functions of linguistic forms found in the corpus interact with diverse ecological characteristics of language used for communication (speaker age, gender, level of education, and socio-economic background; place and time of a communicative event; relationship between interlocutors; speech modality, etc.). A common belief shared by all corpus linguists is that it is important to base linguistic investigations on real data, i.e., actual instances of oral or written communication as opposed to contrived or made-up data. Unique goals of Qualitative Corpus Analysis include facilitating simple, quick computer-aided retrieval of authentic examples of the language phenomena under investigation, interpreting these empirical data in depth, and applying the ensuing insights to a broad range of intellectual explorations in language studies. The following entry summarizes the evolution of the field, outlines the methodological foundations and principles of Qualitative Corpus Analysis, and enumerates major areas of application of the methodology.
[A]Evolution of Qualitative and Quantitative Approaches to Corpus Analysis
Although today we tend to associate all branches of corpus linguistics with a machine-readable format, language corpora were assembled and analyzed B.C. (before computers) by numerous language luminaries, such as Samuel Johnson, Alexander J. Ellis, Joseph Wright, James Murray, Harold Orton, and Sir Randolph Quirk, in application to grammatical, lexicographical, and
dialectological endeavors (see Francis, 1992 for details). Thus, when working on entries for his Dictionary of the English Language published in 1755, Samuel Johnson collected a large, handwritten corpus of illustrative literary, philosophical, and scientific examples (ca. 150,000 for ca. 40,000 entries) with the goal of capturing and recording the contextual riches of the collected texts in addition to their linguistic significance. Later, field and structural linguists, such as Franz Boas, Edward Sapir, Leonard Bloomfield, and Kenneth Lee Pike, adhered to the principle of basing studies on samples of attested observable data (McEnery, Xiao, & Tono, 2006). Although early corpora consisted of slips of paper, the work of the scholars collecting them was conducted in what we can identify today as the spirit and philosophy of Qualitative Corpus Analysis.
The emergence and firm establishment of corpus linguistics as a methodology, in its modern, computer-based form, are associated with the ground-breaking efforts of W. Nelson Francis and Henry Kučera in developing the Brown Corpus of English in the 1960s. Their compilation of a one-million-word corpus (an impressive size at the time of its first release in 1964) paved the path for quantitative corpus research but also undeniably created the foundation for methods underlying the modern approach to Qualitative Corpus Analysis. Thus, the Brown Corpus was not compiled as a bank of unconnected words or sentences but rather was designed to include 2,000-word samples of meaningful, cohesive discourse, which were chosen based on their representation of a wide range of diverse styles and varieties of prose. Although primarily syntactic, the tagging system taxonomy applied to the corpus was a step forward from raw corpora in that it illustrated the possibilities of carrying out more refined linguistic and meta-linguistic analyses through corpus encoding. In terms of corpus retrieval options, the Brown Corpus employed a variable-length record format, allowing researchers to use word + context as criteria to retrieve not only isolated word forms but also full-sentence citations. These characteristics reflect the most basic utility requirements for a properly compiled corpus suitable for qualitative analysis.
Since the 1980s, the field of corpus linguistics has been dominated by quantitative approaches fueled by the development and propagation of super-corpora, such as the British National
Corpus (BNC), 100 million words; the Collins Birmingham University International Language Database (COBUILD), 525 million words and growing; and the Cambridge International Corpus (CIC), 900 million words. At the same time, smaller, parallel second-wave corpora have been springing up in the form of specialized databases designed by qualitative corpus linguists for meticulous, in-depth exploration of select linguistic phenomena or language varieties. Such corpora range from published collections of 1 to 3 million words (Helsinki Corpus of Historical English; Michigan Corpus of Academic Spoken English (MICASE); International Corpus of Learner English (ICLE)) to smaller and highly specialized corpora, such as the Manuscript-Based Diachronic Corpus of Scottish Correspondence or the Corpus of Written Creole (see Beal, Corrigan, & Moisl, 2007a, 2007b). A review of the most recent qualitative research studies reveals a growing trend of conducting in-depth analyses on small, scrupulously collected samples of language in a specific context, which enable researchers to concentrate on the fine details of linguistic interaction and then to relate these details to the specifics of the speech event and to the larger community of which they are a part (Waugh et al., 2007). The prospects for conducting thorough and multifaceted qualitative corpus investigations are looking even brighter with the emergence of multimodal corpora, which allow for automatic analysis of not only speech but also such communication modalities as hand gesture, facial expression, body posture, etc. (Kipp et al., 2009).
[A]Methodological Foundations
Qualitative Corpus Analysis is dually informed by the tradition of qualitative linguistic research as well as research methods specific to corpus linguistics. The methods and principles for conducting Qualitative Corpus Analysis are summarized below.
[B]Corpus Design
A corpus suitable for fine-grained linguistic analysis should consist of complete, naturally-occurring texts (oral or written) whose origins and provenance are well-documented (see Sinclair, 1996). The requirement for naturally-occurring speech acknowledges the importance of analyzing actual, authentic, and attested data, as opposed to invented examples. The focus on complete texts and the
importance of social and communicative context in analyzing language is central to Qualitative Corpus Analysis and to the qualitative research paradigm in general, as "the context plays a part in determining what we say; and what we say plays a part in determining the context" (Halliday, 1978, p. 3). Therefore, texts are expected to be carefully selected to offer a satisfactory representation of the modalities, genres, discourse communities, settings, and/or other language varieties that the corpus is designed to reflect.
[B]Corpus Mark-Up and Annotation
Corpus mark-up represents an important method of documenting the aforementioned descriptive meta-data pertaining to the collected linguistic samples. Corpus mark-up is carried out by inserting standardized codes or tags in each of the documents of a raw corpus, with the codes kept separate from the corpus data per se. A number of mark-up schemes have been developed for encoding such specifications as document-wide information (document length, distributor, date, etc.), structural elements mark-up (e.g., chapter, titles, headings), and sub-paragraph structure (e.g., quotations, abbreviations, terms).
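One way to picture codes that are "kept separate from the corpus data" is a stand-off arrangement, sketched below in Python under stated assumptions. The class names, field names, label set, and sample text are all hypothetical and do not correspond to any particular published mark-up scheme; the point is only that document-wide information sits in a header and structural tags are stored as offsets, so the raw text is never altered.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """A stand-off tag: a label plus character offsets into the raw text."""
    label: str  # e.g. "heading" or "quotation" (hypothetical label set)
    start: int  # inclusive character offset
    end: int    # exclusive character offset


@dataclass
class Document:
    text: str                                   # raw corpus data, never modified
    header: dict = field(default_factory=dict)  # document-wide information
    spans: list = field(default_factory=list)   # structural and sub-paragraph tags

    def tag(self, label: str, substring: str) -> Span:
        """Tag the first occurrence of `substring` without altering the text."""
        start = self.text.index(substring)
        span = Span(label, start, start + len(substring))
        self.spans.append(span)
        return span

    def extract(self, label: str) -> list:
        """Return the text covered by every span carrying `label`."""
        return [self.text[s.start:s.end] for s in self.spans if s.label == label]


# Hypothetical two-line sample document.
raw = 'Chapter One\nShe said, "We meet again," and left.'
doc = Document(
    text=raw,
    header={"distributor": "hypothetical archive", "length_words": len(raw.split())},
)
doc.tag("heading", "Chapter One")
doc.tag("quotation", '"We meet again,"')

print(doc.extract("quotation"))
```

Because the tags only point into the text, the same raw document can carry several independent layers of mark-up without any of them interfering with the others.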
Annotation is a corpus-encoding method similar to mark-up, except that the former involves the encoding of linguistic information (for in-depth discussion of corpus annotation, see Garside, Leech, & McEnery, 1997). Corpus annotation can be characterized by varying degrees of specificity and may address different areas of linguistic analysis, such as part-of-speech tagging (the most common type); lemmatization; parsing; morphological, phonetic, prosodic, pragmatic, semantic, discourse, metaphorical annotation; and error tagging. The latter types of annotation and error tagging are more commonly associated with Qualitative Corpus Analysis research, due to the complex manifestations of the linguistic phenomena covered by these types of annotation. For example, differences between teasing and verbal duelling, discursive construction of hip-hop identity, or comparisons of television dialogue vs. natural conversation strategies (Ädel & Reppen, 2008; Partington, 2006) are contextualized phenomena not immediately observable on the surface or available for automatic retrieval from raw texts, because their linguistic instantiations are realized across morphemes or even
thousands of words of running text. A variety of part-of-speech taggers are available for automatic annotation of large quantities of data, but when the encoding of finer linguistic categories characteristic of qualitative analysis requires a greater capacity for subtle judgment and the drawing of inferences, which are often cornerstones of rich and thorough qualitative investigations, manual annotation must be implemented by a human encoder. Accuracy and consistency are crucial factors for ensuring the reliability of the largely interpretive process of corpus annotation. Such solutions as development of a coding manual as a reference, thorough descriptions of the codes, instructions for applying these codes, and flowchart-style decision algorithm trees to assist coders are being utilized to minimize any arbitrary and individual variation in the manual annotation of complex phenomena.
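The consistency of manual annotation can be checked numerically once two coders have labelled the same items. The sketch below is an illustration only: the entry does not name an agreement measure, and Cohen's kappa is used here simply as one common choice. The function name, the label set ("tease", "duel"), and the sample codings are hypothetical.

```python
from collections import Counter


def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who labelled the same items in the same order."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)
    # Proportion of items on which the two coders chose the same label.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each coder's own label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codings of six utterances by two annotators.
coder_1 = ["tease", "tease", "duel", "duel", "tease", "duel"]
coder_2 = ["tease", "duel", "duel", "duel", "tease", "duel"]
kappa = cohen_kappa(coder_1, coder_2)
print(round(kappa, 3))
```

A low value on a pilot sample would suggest that the coding manual or decision tree needs revision before the full corpus is annotated.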
[B]Data Retrieval
The drudgery of painstaking corpus mark-up and annotation during Qualitative Corpus Analysis starts paying off at the stage of data retrieval. The more complex the nature of the analyzed phenomena, the more corpus utility depends on the proper compilation and annotation of the corpus prior to carrying out the query; for this reason, corpora compiled for Qualitative Corpus Analysis are typically conceptualized and designed with fairly specific research goals in mind. A variety of retrieval software applications are available for processing corpora, i.e., for searching through them for particular words or surface structures, for displaying parts of them, or for analyzing specific linguistic features they contain (e.g., WordSmith Tools, MonoConc Pro, and XAIRA). While such programs can operate on plain text files, they take full advantage of annotated corpora, allowing for retrieval of all instances of surface structures containing the target tag(s) in the sampled corpus within a few seconds. Data searches can be as sophisticated as the annotation scheme applied to the corpus data.
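Tag-based retrieval of this kind can be sketched in a few lines of Python. This is a minimal illustration, not a description of how WordSmith Tools, MonoConc Pro, or XAIRA work internally; the `word_TAG` token format, the tag names, the sample sentence, and the function name are all assumptions made for the example.

```python
# Hypothetical annotated sample: each token is written as word_TAG.
TAGGED = (
    "you_PRON always_ADV burn_VERB the_DET toast_NOUN "
    "oh_INTJ very_ADV funny_ADJ said_VERB the_DET chef_NOUN"
)


def concordance(tagged_text, target_tag, window=2):
    """Return (left context, hit, right context) for every token with target_tag."""
    # Split each token on its last underscore into a [word, tag] pair.
    tokens = [tok.rsplit("_", 1) for tok in tagged_text.split()]
    words = [word for word, _ in tokens]
    hits = []
    for i, (word, tag) in enumerate(tokens):
        if tag == target_tag:
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            hits.append((left, word, right))
    return hits


# Display every adverb with two words of context on either side.
for left, hit, right in concordance(TAGGED, "ADV"):
    print(f"{left:>20} [{hit}] {right}")
```

The same function retrieves any annotated category by changing `target_tag`, which is the sense in which a search can only be as sophisticated as the annotation it runs over.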
The capabilities for automatic data retrieval during Qualitative Corpus Analysis enable the scholarly community to replicate searches with the purpose of reproducing and verifying outcomes of linguistic investigations when corpora are publicly available and corpus mark-up, annotation, and problem-oriented tagging schemes are made available along with the published corpus. This is a significant
benefit for a qualitative methodology, as qualitative approaches are often criticized for the difficulty of scientific reproduction and verification of their analyses.
[B]Data Interpretation
Qualitative Corpus Analysis is a relatively new research enterprise with unique methods; yet at the same time, it draws on a variety of previously established methods of linguistic enquiry. Text selection, corpus size, mark-up, and annotation schemes are typically pre-determined in Qualitative Corpus Analysis before corpus compilation and are informed by such qualitative methodologies as narrative inquiry, genre studies, (critical) discourse and conversation analysis, ethnography of communication, contrastive analysis, and semantic and pragmatic analysis. The methodological merger between Qualitative Corpus Analysis and traditional qualitatively-oriented approaches is organic in that they are oriented towards telling the story by accounting for the richness of the contextual factors that situate quantitative findings in the ecology of human communication. The merger is also mutually beneficial. On the one hand, without the insights accumulated by qualitative linguistic methodologies from various interdisciplinary areas over the last century, corpus methodology alone would lack the sophistication of the methodological apparatus, as well as the subject matter base for describing and interpreting the complex nature of human communication. For these reasons, in qualitative corpus research, corpus evidence is often supplemented with further experimental data (interviews, self-reports, elicitation, etc.), introspection, and inventions (see Chafe's principles, 1992). On the other hand, non-corpus-based qualitative approaches are typically constrained by the rather limited size of their texts and by a painstaking approach of accounting for each individual example of analyzed phenomena every time data analysis is conducted, whereas Qualitative Corpus Analysis methodology creates affordances for (collaborative) compilation of larger datasets, computer-aided storage, annotation, and automatic retrieval, as well as replication and sharing of empirical evidence.
Although Qualitative Corpus Analysis is often construed as not being concerned with frequencies and statistical classification of linguistic features identified in the data, the value of mixing qualitative and quantitative approaches to corpus research is incontestable. Rich insights that stem from Qualitative Corpus Analysis can serve as a precursor for quantitative analysis, which allows for quantification and
classification of the linguistic forms, i.e., for generalizing the findings of the qualitative analysis of a sample corpus to a larger population (Schmied, 1993). Or, vice versa, results of Quantitative Corpus Analysis can be explicated, explained, and illustrated through the interpretive power of qualitative methodology beyond the bare statistics of occurrence. In practice, Qualitative Corpus Analysis is almost invariably used alongside quantitative approaches.
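As a small illustration of how qualitatively assigned categories can feed a quantitative step, the sketch below counts hypothetical annotation labels and normalizes them to a rate per 1,000 words so that samples of different sizes can be compared. The labels, the corpus size, and the choice of a per-1,000-word rate are assumptions made for the example, not a procedure prescribed by the entry.

```python
from collections import Counter


def per_thousand(count, corpus_size):
    """Normalize a raw count to a rate per 1,000 words of running text."""
    return 1000 * count / corpus_size


# Hypothetical labels assigned by a human coder to four retrieved examples,
# drawn from a hypothetical 2,000-word sample.
labels = ["tease", "duel", "tease", "tease"]
corpus_size = 2000

rates = {label: per_thousand(n, corpus_size) for label, n in Counter(labels).items()}
print(rates)
```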
[A]Practical Applications
A number of subfields of linguistics have benefited significantly from the application of insights and findings that stemmed from Qualitative Corpus Analysis investigations. Corpora designed and executed in accordance with the aforementioned principles of Qualitative Corpus Analysis methodology have had a major impact in the areas of sociolinguistics, discourse, pragmatics, semantics, and forensic linguistics. The field of lexical studies is a particularly robust example: modern dictionaries are able to offer significantly more precise, comprehensive, and up-to-date lexical entries because lexicographers have gained access to vast yet methodically annotated corpora, which enable them to tease apart and illustrate usage differences attributed to the wealth of contextual and interpersonal variables. Similarly, grammarians routinely rely on corpora both to verify probabilities of occurrence of grammatical elements in quantitative terms and to qualitatively hone, illustrate, and fine-tune their claims. Today, major publishing houses support corpus development and the publication of corpus-based dictionaries, grammars, and reference books (e.g., see Collins COBUILD's extensive catalogue of text-based and electronic dictionaries and reference materials).
Qualitative Corpus Analysis has revolutionized the field of language education by providing an empirical basis for intuitions about what authentic interactions look like and which language structures, strategies, and patterns should be highlighted in language courses and analyzed for effective pedagogical treatment (O'Keeffe, McCarthy, & Carter, 2007). Corpora composed of cross-sectional and longitudinal samples of learner speech have allowed L2 language researchers to investigate the dynamics of L2 development at different proficiency levels and with regard to such variables as L1-L2 distance and transfer effects, instructional context, and age of acquisition;
accomplishments in this area are exemplified in the prolific work of Sylviane Granger and her associates (2002 and elsewhere).
[A]Conclusions
Qualitative Corpus Linguistics is a methodology that has made a significant contribution to language studies by enabling researchers to access, highlight, and methodically explore attested linguistic phenomena that range from frequent to rare, simple to complex, and easily discernible to stretched over thousands of words. Fueled by technological innovation, informed by the breadth of multimethod approaches, and built upon the successes of various subfields of linguistic research, Qualitative Corpus Linguistics offers a unique and promising path to the continued discovery of the complexities of human communication with richness, precision, and appreciation for its multifaceted ecology. In the future, its influence and growth are likely to be spurred both by the social turn in many subfields of linguistics, which favors in-depth, contextually grounded methodological approaches to data analysis, and by breakthroughs in computer technologies that alleviate the laboriousness of manual corpus compilation and retrieval.
SEE ALSO: Applied Corpus Linguistics; Corpora in the Language-Teaching Classroom; Corpus Software for Applied Linguistics; Corpus Linguistics: Quantitative Methods; Multimodal Corpus-Based Approaches.
References
Ädel, A., & Reppen, R. (Eds.). (2008). Corpora and discourse: The challenges of different settings. Amsterdam: John Benjamins.
Beal, J., Corrigan, K., & Moisl, H. (Eds.). (2007a). Creating and digitizing language corpora: Vol. 2. Diachronic databases. New York: Palgrave Macmillan.
Beal, J., Corrigan, K., & Moisl, H. (Eds.). (2007b). Creating and digitizing language corpora: Vol. 1. Synchronic databases. New York: Palgrave Macmillan.
Chafe, W. (1992). The importance of corpus linguistics to understanding the nature of language. In J. Svartvik (Ed.), Directions in corpus linguistics. Proceedings of the Nobel Symposium 82, Stockholm, 4-8 August 1991 (pp. 79-97). Berlin: Mouton de Gruyter.
Francis, W. N. (1992). Language corpora B.C. In J. Svartvik (Ed.), Directions in corpus linguistics. Proceedings of the Nobel Symposium 82, Stockholm, 4-8 August 1991 (pp. 17-34). Berlin: Mouton de Gruyter.
Garside, R., Leech, G. N., & McEnery, T. (1997). Corpus annotation: Linguistic information from computer text corpora. New York: Longman.
Granger, S. (2002). A bird's-eye view of computer learner corpus research. In S. Granger, J. Hung, & S. Petch-Tyson (Eds.), Computer learner corpora, second language acquisition and foreign language teaching. Amsterdam: John Benjamins.
Halliday, M. A. K. (1978). Language as social semiotic: The social interpretation of language and meaning. Baltimore: University Park Press.
Kipp, M., Martin, J.-C., Paggio, P., & Heylen, D. (Eds.). (2009). Multimodal corpora: From models of natural interaction to systems and applications. New York: Springer.
McEnery, A., Xiao, Z., & Tono, Y. (2006). Corpus-based language studies: An advanced resource book. London: Routledge.
O'Keeffe, A., McCarthy, M., & Carter, R. (2007). From corpus to classroom. Cambridge: Cambridge University Press.
Partington, A. (2006). The linguistics of laughter. London: Routledge.
Schmied, J. (1993). Qualitative and quantitative research approaches to English relative constructions. In C. Souter & E. Atwell (Eds.), Corpus-based computational linguistics (pp. 85-96). Amsterdam: Rodopi.
Sinclair, J. M. (1996). EAGLES. Preliminary Recommendations on Corpus Typology. Available from http://www.ilc.cnr.it/EAGLES/corpustyp/corpustyp.html
Waugh, L. R., Fonseca-Greber, B., Vickers, C., & Eröz, B. (2007). Multiple empirical approaches to a complex analysis of discourse. In M. Gonzalez-Marquez, I. Mittelberg, S. Coulson, & M. J. Spivey (Eds.), Methods in cognitive linguistics (pp. 120-148). Amsterdam: John Benjamins.
Suggested Readings:
Adolphs, S. (2008). Corpus and context: Investigating pragmatic functions in spoken discourse. Amsterdam: John Benjamins.
Aston, G., Bernardini, S., & Stewart, D. (Eds.). (2004). Corpora and language learners.
Biber, D., Connor, U., & Upton, T. A. (2007). Discourse on the move: Using corpus analysis to describe discourse structure. Amsterdam: John Benjamins.
Fitzpatrick (Ed.). (2004). Corpus linguistics beyond the word: Corpus research from phrase to discourse. New York: Rodopi.
Hoey, M., Mahlberg, M., Stubbs, M., & Teubert, W. (2007). Text, discourse and corpora: Theory and analysis. London: Continuum.
Leistyna, P., & Meyer, C. F. (Eds.). (2003). Corpus analysis: Language structure and language use. Amsterdam: Rodopi.
Meyer, C. F. (2002). English corpus linguistics: An introduction. Cambridge: Cambridge University Press.
Stubbs, M. (1996). Text and corpus analysis. Malden, MA: Blackwell.
Wilson, A., Archer, D., & Rayson, P. (Eds.). (2006). Corpus linguistics around the world. Amsterdam: Rodopi.