


A thesis submitted to the University of Manchester Institute of Science and
Technology (UMIST) for the degree of Doctor of Philosophy


Abdel-Hamid Elewa

Centre for Computational Linguistics

No portion of the work referred to in the thesis has been submitted in support of
an application for another degree or qualification of this or any other university,
or other institution of learning

First and foremost, I thank God Almighty, Who teaches man what he does not know.
Then, I would like to express my gratitude to my supervisor Dr. Paul Bennett who
throughout the years I have spent doing my research showed me an unequivocal
perseverance, gave me so much time and enriched my work with his invaluable

I am deeply grateful to Mona Baker, Professor of Translation Studies and Director of
CTIS, University of Manchester, who provided me with the first drops of genuine

I am also indebted to Paul Johnston, Department of Computation, University of
Manchester, for all the technical support he gave me and also for the statistical
programs he wrote specifically for my research.

During this work I have collaborated with many colleagues for whom I have great
regard, and I wish to extend my warmest thanks to all those who have helped me
with my work in the Department of Language Engineering, particularly Sattar Izwaini
and Amin Almuhanna; through our discussions and commentary on the Arabic
language we managed together to raise many interesting points.

I would also like to thank my examiners, Prof. Harold Somers, Dept. of Informatics,
University of Manchester, and Dr. James Dickens, Dept. of Middle Eastern Studies,
University of Durham, for their criticism and helpful comments, which gave my thesis
its academic form.

My thanks are also due to my wife, Iman Refaey, who helped me to assemble the
electronic corpus used in this research.

This work was funded by the Egyptian Government.

Table of Contents

TABLE OF CONTENTS

NOTES ON TRANSLITERATION

ARABIC TRANSLITERATION CHART

CHAPTER ONE: INTRODUCTION

1.1 THE RATIONALE BEHIND THE STUDY
1.2 GOALS
1.3 CORPUS-DRIVEN OR CORPUS-BASED
1.4 LEXICAL COLLOCATION
1.5 SYNONYMY
2.1 INTRODUCTION
2.2 THE STATUS OF ARABIC
2.3 FACTORS IN THE SURVIVAL OF THE CLASSICAL ARABIC
2.4 THE DEVELOPMENT OF ARABIC LINGUISTICS
2.4.1 Recent Contributions to Arabic Linguistics
2.5 SOME FEATURES OF ARABIC GRAMMAR
CHAPTER THREE: CORPUS LINGUISTICS
3.1 INTRODUCTION
3.2 INTUITION VS. EMPIRICISM
3.3 HISTORICAL SURVEY
3.3.1 Pre-computational Corpus Linguistics
3.3.2 Computational Corpus linguistics
3.4 CORPUS DESIGN
3.4.1 The purpose of the corpus
3.4.2 Text Sampling
3.4.3 Text Typology
3.5 TECHNICAL REQUIREMENTS
3.6 CORPUS PROCESSING
4.1 INTRODUCTION
4.2 ARABIC FOR COMPUTATIONAL ANALYSIS
4.2.1 Progress in machine-readable Arabic language
4.2.2 Arabic Language resources
Available Arabic Corpora
Arabic Online Texts

4.2.3 Tagging Arabic Texts
4.2.4 Tools for Processing Arabic
4.3 DESCRIPTION OF THE CORPUS
4.3.1 The rationale behind this selection
4.3.2 Why these texts?
4.4 Conclusion
CHAPTER FIVE: LEXICAL COLLOCATION
5.1 INTRODUCTION
5.2 DEFINITION OF COLLOCATION
5.3 COLLOCATION AND COLLIGATION
5.4 TYPES OF COLLOCATION
5.5 SPANS
5.6 SEMANTIC PROSODY
5.7 EXTRACTION OF COLLOCATION
5.7.1 Using statistics in collocation extraction
Lemmatisation
Concordances
Frequency
T-test: a measure of difference
CHAPTER SIX: SYNONYMY: AN OVERVIEW
6.1 INTRODUCTION
6.2 DEFINITION
6.2.1 Synonymy - Four Approaches
6.2.2 Degrees of Synonymy
Absolute synonymy
Propositional synonymy
Near-synonymy
6.3 SYNONYMY IN ARABIC
6.4 THE REPETITION OF SYNONYMS IN ARABIC
6.5 CONCLUSION
7.1 INTRODUCTION
7.2 DATA CHOICE
7.3 DATA ANALYSIS
7.4 A CASE STUDY: THE WORD PAIR JAA’A AND ATA ‘COME’
7.4.1 Summary
7.5 A CASE STUDY: THE WORD PAIR ITHM AND DHANB ‘SIN’
7.5.1 A Few Remarks
7.5.2 Summary
7.6 A CASE STUDY: THE WORD PAIR H}ASIBA AND Z}ANNA ‘THINK’
7.6.1 Summary
7.7 A CASE STUDY: THE WORD PAIR H}BB AND WDD ‘LOVE’
7.8 CONCLUSION

CHAPTER EIGHT: CONCLUSION

APPENDIX 1: COPYRIGHTS
APPENDIX 2:
APPENDIX 3:
GENRES AND TEXTS INCLUDED IN CAC
APPENDIX 4:
APPENDIX 5:

This study applies corpus linguistics methodology, which concentrates on investigating
language use, to Classical Arabic. I do not wish to undermine what has been achieved on the
basis of intuition, but the time is now opportune to use modern tools to discover new facets
of linguistic behaviour in Classical Arabic and to demonstrate the potential impact of
computational methods on Arabic linguistic studies.

One of the main aims is to demonstrate the usefulness of the corpus methodology in
describing Classical Arabic by examining lexical collocations. To do this, I have assembled a
Classical Arabic corpus covering the early period of Islam, because the available Arabic
corpora are limited to the Classical Arabic of today, which is called Modern Standard Arabic.

This study is also an attempt to explain some issues in semantic relations, particularly
synonymy, that can be accounted for in terms of collocation. It uses a computerized
concordancer, which enables large quantities of text to be searched for all occurrences of a
particular lexical item. Through lexical collocational analysis I can compare and contrast the
characteristic uses of semantically related words such as synonyms. According to Cruse
(1986), two lexical units would be absolute synonyms (i.e. would have identical meanings) if
and only if all their contextual relations were identical. Through corpus analysis we can show
whether two items are indeed absolute synonyms by checking their relations in all
available contexts.

With this technique, it is possible to compare seemingly synonymous words and find out
whether they are real synonyms. I will argue that, in terms of their collocational patterns,
absolute synonyms do not exist. Through collocation we can also distinguish one sense of a
word from another. Collocation is, therefore, a device by which a particular sense of a word
is activated.

To show that subtle differences can be brought out by collocation, the collocates of a list of
seemingly synonymous Arabic word pairs are analysed. The analysis aims to show that many
synonyms are partial or incomplete, and that none can be called true (absolute) synonyms.

Notes on Transliteration
There are two common ways to represent the Arabic script in the Roman script: transliteration
and transcription. The former is based on graphemic mapping and the latter is phonemic.
Some Arabic consonants and vowels have equivalent letters in the Roman alphabet. These are
easy to transliterate or transcribe; which of the two is chosen depends on the purpose of the
rendering.

For the letters which have no Roman equivalent, linguists or Arabic users sometimes adopt a
set of symbols which are mainly transcriptions. Such a process yields a mixed system of
transliteration and transcription. ‘This leaves plenty of scope for scholarly debate, with the
result that there are now many supposedly international standards’ (Whitaker, 2002).

Among the most common systems are the one adopted by the International Convention of
Orientalist Scholars in 1936, the British Standard, BS 4280, the US Library of Congress and
the American Library Association. The latter have issued “Romanisation tables” for more
than 150 non-Roman written languages and dialects including Arabic (ibid).

One of the reasons given by Whitaker (ibid) for the inefficiency of these Romanisation
systems is that they are not easy to type, because of the elaborate marks they use, such as
dots, lines and other diacritics.

For practical reasons, I have used a transliteration system which makes the utmost use of
the English alphabet. It is based to a great extent on the one adopted by the US
Library of Congress, with some modifications as shown below:

Arabic Transliteration Chart
Name of letter Arabic letter shape Symbol in Transliteration
hamza ‫ء‬ ‘
ba: ‫ب‬ b
ta: ‫ت‬ t
θa: ‫ث‬ th
ji:m ‫ج‬ j
ha: ‫ح‬ h{
xa: ‫خ‬ kh
da: ‫د‬ d
dha:l ذ dh
ra: ر r
za: ز z
si:n س s
shi:n ش sh
sa:d ص s}
da:d ض d}
ta: ط t}
z{a: ظ z{
cayn ع c
ġayn غ gh
fa: ‫ف‬ f
qa:f ‫ق‬ q
ka:f ‫ك‬ k
la:m ‫ل‬ l
mi:m ‫م‬ m
nu:n ‫ن‬ n
ha: ‫ه‬ h
wa:w ‫و‬ w
ya: ‫ي‬ y

Such a chart is easy to use because it is familiar to both Arabic and English speakers. For
Arabic consonants that do not have equivalents in English we used the most common system.
This applies to two types of sounds: emphatic and pharyngeal. For the former we put a dot
under the symbol to show emphasis and for the latter we used two symbols (c and ‘). Because
some consonants are written with two letters, it is difficult to represent the doubling of
consonants such as dhdh or khkh. We would rather ignore doubling with such consonants,
which is much easier than struggling with new symbols.
The Arabic definite article al ‘the’, which sometimes takes another form when assimilated
to the following sound, is represented as is, without showing any assimilation. Long vowels
are marked by doubling the short vowel to avoid putting more marks on the symbols, except
in proper nouns which are commonly used among Arabs and Arabists.
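
As a minimal sketch of how a graphemic mapping of this kind could be applied by program, the Python fragment below looks up each Arabic letter in a table. It is an illustration only, not the procedure used in this thesis: the names CHART and transliterate are hypothetical, the table covers only a subset of the consonants in the chart (the emphatic and pharyngeal symbols are omitted), and short vowels, which are not written in unvocalised Arabic script, are not supplied.

```python
# Hypothetical subset of the transliteration chart: Arabic letter -> Roman symbol.
# Emphatic and pharyngeal letters are deliberately left out of this sketch.
CHART = {
    "ب": "b", "ت": "t", "ث": "th", "ج": "j", "خ": "kh",
    "د": "d", "ذ": "dh", "ر": "r", "ز": "z", "س": "s",
    "ش": "sh", "ع": "c", "غ": "gh", "ف": "f", "ق": "q",
    "ك": "k", "ل": "l", "م": "m", "ن": "n", "ه": "h",
    "و": "w", "ي": "y",
}


def transliterate(word: str) -> str:
    """Replace each Arabic letter by its chart symbol.

    Characters that are not in CHART are passed through unchanged.
    """
    return "".join(CHART.get(ch, ch) for ch in word)


# Unvocalised input yields a consonant skeleton only.
print(transliterate("كتب"))  # the three letters ka:f, ta:, ba:
```

A purely graphemic lookup of this kind cannot add vowels or mark long vowels by doubling; those steps would need vocalised input or manual editing.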

Chapter One: Introduction
1.1 The Rationale Behind the Study
A general motivation for many recent linguistic studies has been the desire to automate some
descriptive processes and to employ scientific observation in the study of language.
Linguistic studies in Arabic were first introduced and established in the late 8th century by
Al-Khalil, the first lexicographer to impose a lexical order on the entries of his dictionary
(cf. Haywood 1965), and by his outstanding pupil, Sibawayh. What Al-Khalil and
Sibawayh did was to investigate language use to formulate rules and describe linguistic

Although Arab lexicographers were the first to integrate corpus analysis into the
dictionary-making process, with Al-Khalil’s manual corpus (discussed in Chapter Two), a
corpus-based approach is not used in contemporary lexicography in the Arab world.
Mainstream lexicography there remains intuition-based.

Employing modern technology in investigating language use should enable us to research
more aspects of linguistic behaviour, in more detail. We can investigate how people exploit
the resources of their language and how they use it to achieve their communicative goals.

1.2 Goals
The current study will provide the resources for accurate descriptions of the way words
co-occur in Classical Arabic. For that purpose, the major activity of the study has been the
assembly and analysis of a corpus comprising samples of different types of written Arabic:
biography, religion, poetry, etc.

With this in mind, I decided to work toward the compilation of a comprehensive corpus of
written Classical Arabic in order to facilitate research in a range of disciplines concerned with
Arabic and with the general methodology of Corpus Linguistics. I would like to emphasise
that the Classical Arabic Corpus will be available to any potential user.

1.3 Corpus-driven or Corpus-based
Two approaches can be taken when working with corpora: corpus-based and corpus-driven
(Tognini Bonelli, 2000). The approach is corpus-driven when a linguist describing a language
observes a phenomenon without prior commitment to a particular theory, i.e. when the
corpus reveals something he or she did not expect. For instance, the subtle differences
between synonymous pairs, and the semantic features that distinguish each word from the
other (as shown in Chapter Seven), are neither obvious from casual observation nor available
in the literature I have examined.

On the other hand, when we use corpus linguistics methodology to support or invalidate an
existing hypothesis or a theory, then it is called corpus-based. For example, in Chapter Five
we test a collocation assumed to be fixed and find out that it is not a collocation at all.

1.4 Lexical Collocation

Lexical collocation has become a popular topic in linguistic research. The phenomenon
gained currency after computational corpus-based methodology was adopted as an accurate
and effective way of analysing text.1

Collocation was recognised early by Arab linguists, but the phenomenon was mentioned only
in passing and was not studied extensively. Al-Sakkaki, for example, in Miftah al-Ulum
defined it as ‘likull kalimah maca s}aah}ibatiha maqaam’ (every word has with its
companion a position [lit. trans.]). This roughly means that every word has a different sense
with a different adjacent word. Emery (1988: 51) regards this quotation as equivalent to
Firth’s (1957: 179) definition of collocation, which is the company that a word keeps. He
also considers the classification in Thacalibi’s lexicon, Fiqh Al-Lughah2, as showing
Thacalibi’s awareness of how significant collocational relations are.

Linguistic units can be combined with each other phonologically, morphologically,
syntactically, lexically, or semantically. We are only concerned with combinations on the

1 Corpus-based methodology has been widely used in other linguistic fields (Biber, 1998; Meyer, 2002).
2 This lexicon, which was written ten centuries ago, classifies the types of actions with their specific doers and
the types of words with their specific predicates. So, it can be considered like Benson’s (1997) work on
collocation, The BBI Dictionary of English Word Combinations.

lexical level. This is what is traditionally called collocation3. In this sense, ‘collocation is
restricted to idiosyncratic relationships between words’ (Wouden, 1997: 24).

1.5 Synonymy
One of the main goals of this study is to check the synonymy or non-synonymy of a given
pair of items. Corpus-based analysis and computer technology make it easy to identify the
relative frequency of words, whether throughout the whole corpus or in a particular genre.
We can then explore the collocates of words and isolate the various meanings, or senses, that
a word has. This is especially interesting for words which are considered synonyms, since an
investigation may reveal differences in syntactic and/or stylistic distribution. Such research
might show that near-synonymous words or structures are used in different ways.
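
The two operations just described, computing the relative frequency of a word in a genre and listing its occurrences in context, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the software used in this study: the function names are hypothetical, the corpus is assumed to be already tokenised, and the two short token lists are invented for the example rather than taken from the corpus.

```python
from collections import Counter


def relative_frequency(tokens, word):
    """Return the number of occurrences of `word` per 1,000 tokens."""
    if not tokens:
        return 0.0
    return 1000 * Counter(tokens)[word] / len(tokens)


def concordance(tokens, node, span=2):
    """Return (left context, node, right context) for each occurrence of `node`.

    `span` is the number of tokens kept on each side of the node word.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            lines.append((left, tok, right))
    return lines


# Invented toy data: genre name -> list of transliterated tokens.
corpus = {
    "poetry": "jaa'a al-rajul thumma jaa'a al-walad".split(),
    "biography": "ata al-rajul ila al-madiinah".split(),
}

for genre, tokens in corpus.items():
    print(genre, relative_frequency(tokens, "jaa'a"))

for left, node, right in concordance(corpus["poetry"], "jaa'a", span=1):
    print(" ".join(left), "[" + node + "]", " ".join(right))
```

Normalising to occurrences per 1,000 tokens is one common convention; any fixed base serves, as long as the same base is used when genres of different sizes are compared.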

Synonymy is understood as a cline along which we may locate different degrees of
synonymy: near, cognitive and absolute. There is a widely held opinion among
semanticists that strict or absolute synonymy is rare in human languages (see Cruse, 1986).
This study goes a step further and seeks to demonstrate that absolute synonymy does not
exist in Arabic. It will argue that Arabic never has two words that mean nearly the same
thing and are used in the same range of grammatical and lexical patterns.
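
One rough way to make this claim checkable is to collect the collocates of each member of a pair within a fixed span and measure how far the two sets coincide. The Python sketch below does this with a Jaccard overlap. The overlap measure, the function names and the placeholder tokens are all assumptions made for illustration; they are not the statistical procedure of this study. Identical collocate sets would be a necessary sign of absolute synonymy, though not a sufficient one, since the sketch ignores frequencies and grammatical patterns.

```python
from collections import Counter


def collocates(tokens, node, span=2):
    """Count the words found within `span` tokens on either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts


def collocate_overlap(tokens, a, b, span=2):
    """Jaccard overlap of the collocate sets of `a` and `b`.

    1.0 means the two words share all their collocates; 0.0 means none.
    """
    set_a = set(collocates(tokens, a, span))
    set_b = set(collocates(tokens, b, span))
    # A word of the pair occurring next to the other is not counted as a collocate.
    set_a.discard(b)
    set_b.discard(a)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Placeholder tokens: w1 and w2 stand for a seemingly synonymous pair.
tokens = "w1 x y w2 x z".split()
print(collocate_overlap(tokens, "w1", "w2", span=1))
```

On real data the result depends heavily on the span and on whether word forms are lemmatised first, so a figure of this kind is a prompt for inspecting the concordance lines rather than a verdict.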

Chapter Two discusses the scope of Arabic linguistics and pinpoints some technical problems
in digitising Arabic. Chapter Three gives a brief account of the methodology of corpus
linguistics and surveys its historical background. Chapter Four describes the corpus compiled
especially for this study and gives an account of the tools used for analysis. Chapter Five
discusses lexical collocations with a particular emphasis on Arabic. Chapter Six addresses the
concept of synonymy in English and Arabic. Chapter Seven looks for differences between
seemingly synonymous word pairs by studying their collocations, and suggests that applying
corpus linguistics methodology to Arabic can make us more aware of lexical matters.
Chapter Eight is dedicated to findings and conclusions.

3 Extensive definitions and explanation of collocation will be given in Chapter Five.

Chapter Two: Some Aspects of the Arabic Language
2.1 Introduction
The Arabic language originated in Arabia in pre-Islamic times and spread rapidly across the
Middle East. Today it is spoken as an official language by almost 200 million people,
Muslims and Christians, in more than twenty-two countries, from Morocco in Africa to Iraq
in Asia, and as far south as Somalia and Sudan. As the language of the Qur’an, the Holy Book
of Islam, it is to some extent familiar throughout the Muslim world, rather as Latin was in the
lands of the Roman church. It is taught as a first language in all Arab countries and as a
second language in non-Arab Muslim states. It is the liturgical language of about one billion
Muslims. In addition, Modern Standard Arabic is the lingua franca used and respected by
educated Muslims throughout the entire world.

2.2 The Status of Arabic

Arabic4 is the oldest language which is still used for communication and culture in the Arab
world. There are many varieties of Arabic: Classical Arabic (CA), Modern Standard Arabic
(MSA), and colloquial Arabic, which differs from country to country. Classical written
Arabic, however old, has changed little over the centuries. Classical Arabic is still employed
today as the written language, but it is restricted to formal usage as a spoken tongue. It differs
considerably from its descendant, the modern colloquial Arabic that is the medium of general
conversation. Modern Standard Arabic is the variety of Arabic which is essentially a
continuation of Classical Arabic as it was passed down to us throughout the ages and which is
partly a modernised form of expression of contemporary ideas, concepts, science and

Although it is widely used throughout the Arab world, with different vernaculars, in everyday
language, language of communication and entertainment, the Modern Standard Arabic is still
4 The term ‘Arabic’ is applied to a number of speech-forms which, in spite of many and sometimes substantial
differences, are reckoned as dialectal varieties of a single language. The term Classical Arabic is sometimes used
as a synonym of Standard Arabic. However, I will use the former to refer to the early Classical Arabic which
extends over the first four centuries of Islam, i.e. until the early eleventh century, whereas the latter is used to
refer to the modern Classical Arabic. These two varieties are sometimes interchangeable; they can be used in
formal situations such as schools, universities, textbooks, lectures (whether religious or academic), mass-media
and personal writing as in letters and autobiography.

adopted as the formal language of the press, writing and speeches. Because the Qur’an was
revealed in Arabic, most Arabs think that this language must be perpetuated and kept alive
(Haeri, 2003). They always emphasise that Classical Arabic, as a living language, should be
used in formal written and spoken language. Bakalla (1983) argued that a ‘living’ language is
by definition the language acquired by children at an early age, and that this is not the case
with Classical Arabic. However, the general desire among educated Arabs is to write and
read literary works, Islamic and general books in an elegant language, and nothing can be
more beautiful than Classical Arabic. ‘In that sense Classical Arabic is [a] ‘living’ language,
but it is not a ‘living’ in the sense of colloquial’ (Bakalla 1983: xvii).

2.3 Factors in the Survival of the Classical Arabic

One of the main characteristics of language is change. If a language does not change through
time, it is likely to become obsolete, or extinct in terms of its usage. This could make one
wonder how Classical Arabic has been preserved over so many centuries. The obvious
connection between the Holy Qur’an and the language in which it was revealed to Prophet
Muhammad explains the preservation of this language. Below we will give three reasons that
made the Classical Arabic language survive throughout the past centuries.

1. Belief in its divinity

Most Arab grammarians and theologians regarded Arabic as a divine language. Explaining
Allah’s saying, “And He taught Adam all the names (of everything)” (Qur’an: Sura 2, 31,
trans. by Mohamed Khan), Ibn Abbas [a well-known exegete of the Qur’an] said, ‘Allah
taught him all common names [i.e. all generic nouns] such as animal, earth, valley, mountain,
donkey etc.’ (Ibn Faris, s}ah}ibi:33).

This question matters for linguistic study because, if we believe that Arabic is God-given,
we will adhere to the Qur’anic language and the expressions used by the ancient Arabs
and the early Muslims. Ibn Faris (s}ah}ibi, p.17) said, ‘We are not entitled to-day to
innovate, to use expressions which they did not use, or to develop analogies which they did
not know; for this would mean corrupting the language and annihilating its essence.’

Unlike the literature on English and other languages, Arabic literature contains no detailed
discussion of the origin of speech. Arab linguists did not concern themselves with this
question because, on the basis of the aforementioned Qur’anic verse, they thought that
Arabic was revealed by Allah. The question was considered theological rather than linguistic.
Even those who thought that Arabic was not revealed by Allah gave up investigating it, since
there was no conclusive evidence for either position. Most grammarians, however, regarded
Arabic as a God-given language. Arabs therefore had to adhere to the usage of their
predecessors, to whom the Qur’an was revealed. All they could do was describe this usage
for Arabs and non-Arabs so that both could adhere to genuine Arabic, the language of the
Qur’an.

This shows how Islam influenced the study of language. Before the advent of Islam, Arabic
was not used by a large number of people. The introduction of Arabic grammar was
motivated by the Islamic incentive to protect the language from being corrupted by converts.

2. Belief in its Supremacy

Because they regard it as God-given, Arabs believe that Arabic is the most perfect, the
noblest, the clearest and the richest language. In the introduction of his Lisan Al-Arab, Ibn
Manzur says,
“Allah made the Arabic language superior to all other languages and enhanced it further by
revealing the Qur’an through it and by making it the language of the people of Paradise. The
Prophet was reported to have said, ‘I am an Arab; the Qur’an is Arabic; and the language of
the people of Paradise is Arabic.’” This is why Arabs believe in the supremacy of Arabic as a
God-given language.

Arabic is of great importance both to all Muslims and to those interested in the study of the
Orient: for the former it is the religious language of the Qur’an, the Prophetic traditions and
the early Muslim works; for the latter it is the medium of Arabic culture.

Ibn Faris (Sahibi p. 17) noted that Arabic is the most eloquent language. If we attempt to
translate the word sayf (sword) into Persian, for example, we find only one equivalent word.
In Arabic, there are many words for ‘sayf’, each with a specific connotation.

To most Arabs, Arabic has a magical effect on their souls. Hitti (1958: 90) said,

No people in the world, perhaps, manifest such enthusiastic
admiration for literary expression and are so moved by the word,
spoken or written, as the Arabs. Hardly any language seems capable of
exercising over the minds of its users such irresistible influence as
Arabic. Modern audiences in Baghdad, Damascus and Cairo can be
stirred to the highest degree by the recital of poems, only vaguely
comprehended, and by the delivery of orations in the classical tongue,
though it be only partially understood. The rhythm, the rhyme, the
music produce on them the effect of what they call ‘lawful magic’
(sih{r h{alaal).

3. It has a long-standing and genuine linguistic heritage

After the expansion of the Muslim Empire and the increase in the number of foreign people
who embraced Islam, Arabic became corrupted in the course of being used by the new
converts. Those new converts made mistakes when reading the Qur’an. Muslim scholars
began to fear lest the language become completely corrupted. They had to put an end to such
a situation to protect the Holy Qur’an. On the one hand, they wanted to preserve their
language from the distortion and the solecism introduced by non-Arabic speakers and, on the
other hand, to teach those converts Arabic to help them perform their Islamic rituals properly,
since prayers can only be performed in Arabic. Thus, the main motivation for the introduction
of Arabic descriptive models was to preserve the knowledge of Classical Arabic.

There is no consensus among Arab or foreign linguists as to who founded
Arabic grammar. Some argue that Ali (the fourth Caliph) is the true founder of Arabic
grammar as a science, since he gave the first glimpse of it by dividing words into three
classes: ‘noun’, ‘verb’ and ‘particle’. Others say that Abu Al-Aswad Ad-Du’ali was the first to
write a treatise on Arabic grammar, on the basis of what Ali or Ziyad Ibn Abihi, who was the
governor of Iraq at the time, supposedly told him.

Although people differ as to who introduced Arabic grammar, they are unanimous in
asserting that it was introduced to preserve the language of the Qur’an. Al-Anbari (Nuzhat:
11) concluded that the first founder of grammar was Ali ibn Abi Talib, because all stories
referred to him and Abu al-Aswad referred to Ali ibn Abi Talib. Abu al-Aswad himself
admitted that he learned grammar from Ali ibn Abi Talib.

The first written treatises in Arabic grammar appeared at the end of the eighth century when
Al-Khalil ibn Ahmad and his outstanding pupil Sibawayh wrote their influential and
pioneering books describing the Arabic language. The former wrote his dictionary of Arabic
Al-cAyn and the latter wrote his grammatical description of Arabic.

The science introduced by Abu al-Aswad dealt with all branches of modern linguistics as a
whole. There was no separation among the different fields of linguistics as there is in modern
times. Many of the early Arab scholars were able to write in all branches of linguistics.
For example, Sibawayh’s Kitab dealt with phonetics, syntax, morphology and phonology.
Moreover, Al-Zamakhshari produced outstanding works in the fields of syntax and lexicography, in
addition to his pioneering work in the exegesis of the Qur’an.

2.4 The Development of Arabic Linguistics

It is well known that Arabic linguistics emerged in the seventh century out of a religious
motivation: to preserve the language of the Holy Qur’an from the mistakes made by the new
foreign converts. Some modern linguists assumed that the beginning of Arabic linguistics was
influenced by Indian or Greek linguistics, but there is no concrete evidence for such a theory.
The science was founded before the beginning of the great movement of translation from
other languages into Arabic in the Umayyad and Abbasid eras. Therefore, Arabic linguistics
was introduced by Arabs since Ali Ibn Abi Talib, the true founder of Arabic linguistics, had
no contact with Indian or Greek culture at that time.

The golden age of Arabic linguistics was between the eighth and the eleventh century. Chejne
(1969: 170) notes that “in the 12th and 13th centuries Arabic was looked upon with
admiration by the West, in the same manner the Arab of today looks at the more developed
Western languages.”

Owens (1998, ch. 9) argued that Arabic linguistics reached its highest methodology and its
most sophisticated level with Jurjani (d. 1078). There are many contributions made by later
linguists until the end of the eleventh century, but they were mainly interested in reworking
what had been done by their predecessors.

Little has been contributed in the past millennium. Linguists throughout this period
merely remodelled, or made relatively slight changes to, what had been done in the early
ages of Islam. Moreover, this limited contribution, based on the same corpus used by their
predecessors, remained within the general framework introduced by the early linguists, as
‘...the major preoccupation of grammarians… (after 1077)… was to find ever new ways of
saying the same thing’ (Carter 1985a: p. 270, quoted by Owens, 1988: p. 8). In other words,
‘Sibawayh had, in fact, laid down the basic rules and methods of grammar, while the later
grammarians’ contribution consisted only in expounding his theory in a more explicit and
systematic form, or in finding new applications for it’ (Bohas, Guillaume and Kouloughli:
1990, p.5). They were mainly concerned with codifying and preserving the literature of their

2.4.1 Recent Contributions to Arabic Linguistics

There is still something to be done in the study of the Arabic language, especially with the
introduction of scientific approaches and modern technology in the field of linguistic
investigation. The early Arab linguists felt that their contribution was not enough. Al-Khalil
ibn Ahmad for example said, “If someone has in mind another cause for grammar than the
one I mentioned, let him come forth with it!” (Al-Iid{aah{, p. 66 quoted in Versteegh:
1997: 74).

In the early 20th century the prevailing trend was to rely totally on what had been formulated
during the early period of Arabic linguistics. On the one hand, this approach was interested
in verifying and editing the grammatical manuscripts left by the Arab grammarians. On the
other hand, it tried to explain and interpret such work in modern linguistic terms.

During the last four decades the study of the Arabic language has increased dramatically. The
current tendency has been to enrich Arabic with modern theories of linguistics through
comparative or applied linguistic studies. There are two main features which characterise
modern Arabic linguistics of the last decades. First, the tendency towards the application of
linguistic theories and methodologies, especially to the teaching of Arabic as a first language.
Secondly, the use of modern techniques in linguistic research, as in computational linguistics
and corpus linguistics.

Much of the work in this field was done in thesis or dissertation form, both in the universities
of the Arab world and abroad. Very few of these studies have been published. Straley (1989)
listed the dissertations completed in American universities in the field of Arabic linguistics
from 1967 to 1987 in an annotated bibliography. He noticed that these dissertations, in
general, cover a wide variety of topics: phonology, grammar, comparative linguistics,
language planning, sociolinguistics and pedagogy. Bakalla (1983: p. xxxvii) pointed out that
much of the work on Arabic linguistics ‘has been influenced by developments within
linguistic theory and that many studies have been formed in, and reflect, contemporaneous

The same interest in engaging with developments in linguistic theory, a dominant
paradigm in all branches of science, is also indicated by the
establishment of Arabic teaching centres in the Arab world and abroad and by the
appearance of periodicals and journals devoted to Arabic linguistics, such as the Journal
of Arabic and Islamic Studies (JAIS), Journal of Arabic Linguistics (in Germany), Arabica
and Al-cArabiyya (Arabic). Moreover, a number of major universities all over the world are
now engaged in organising conferences, workshops and seminars devoted to Arabic
linguistics for scientific, commercial and other purposes.

With the introduction of computational techniques into the field of linguistics in the USA and
Europe, a corresponding interest in the use of computers to investigate the Arabic language
grew, as was also the case for theoretical linguistics. Academic centres, companies and
conferences specialising in Natural Language Processing flourished in the Arab countries and
abroad5. Research in this domain is currently under development.

2.5 Some Features of Arabic Grammar

So far I have briefly outlined some aspects of the status and development of Arabic, Classical
Arabic in particular, in order to acquaint the reader with the variety I am going to use in this
study. Next, I will illustrate the main features of Arabic grammar to help
those who are to construct a computational system for Arabic know what kind of
complexities they may face. More importantly, this section serves as an introduction to the
problems encountered when attempting to search Arabic texts by lemma. Below are
some of these features:
1. Unlike English, Arabic is written from right to left.
2. Arabic script has twenty-eight letters representing the consonants in addition to three long
vowels; the shape of each letter depends on what position it occurs in a word: initial, middle,
or final.
3. Arabic short vowels are written as diacritics, above or below the preceding
consonant. ‘For technical reasons the diacritisation is impossible when using the computer.
This results in compound cases of morphological-lexical and morphological-syntactical
ambiguities’ (Khalid et al. 1974: 29). This has been resolved recently by programs that can
handle all diacritics in Arabic (cf. 4.2.1).
4. Arabic, like Latin, is a synthetic (inflectional) language. English, on the other hand, is non-
synthetic. Arabic has three cases: nominative, accusative and genitive. The use of cases in
Arabic is complicated by the fact that they are mainly represented by short vowels and the
Arabic script only allows the writer to show consonants and long vowels. Diacritics which
are traditionally used for case endings are computationally problematic.

5 The Institute for the Languages & Cultures of the Middle East, University of Nijmegen, focuses nowadays on
Arabic Natural Language Processing. It managed lately to produce an Arabic/Dutch dictionary based on a large
Arabic corpus. Also, some companies like Sakhr (based in Egypt) are involved with developing solutions for
Arabic computationally, and there are also conferences which are specialised in Arabic worldwide.

5. Arabic words are formed from roots, based on fixed morphological patterns, where vowels,
suffixes, prefixes, or infixes can be added to form new words. Once we know these patterns,
it is easy to form any possible word without making mistakes. More interestingly, we can add
to the base form other linguistic units such as person, tense, mood, participles, case, and
verbal noun. English words, on the other hand, are generated from stems. Therefore, the key
word for searching the traditional lexicon in Arabic is the root6, whereas in English it is the
stem (the basic word form).
6. As Arabic is a synthetic language, it allows pronouns to combine with words forming one
single word. Such personal pronouns can be suffixed to nouns, verbs or particles. We may
form an Arabic word representing a whole sentence. Consider the following word in (1):
(1) ضربوك d{arabuuka (they hit you).

This property raises another problem for the computational analysis of Arabic. When searching
for a word in an electronic text, we have to search for every possible form of this word. This
is because, if we look for the stem of the word, as in English, we will retrieve a huge number
of unwanted results. In Arabic, adding characters to a string can yield words from different
roots. For example, a search for cam (year) will also match camer (populated), nacam (ostrich)
and camel (worker), which are derived from different roots. A simple word search program
which is not trained on Arabic idiosyncrasies will return all of these occurrences, and its
output will need laborious hand-editing.
7. Word order in Arabic is more flexible than in English. There are two types of word order in
Arabic: VSO and SVO.
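
The search problem described in feature 6, together with the diacritics mentioned in feature 3, can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the Arabic spellings are assumed unvocalised renderings of the transliterated examples (cam, camer, nacam, camel), and the function names are hypothetical rather than part of any program used in this study.

```python
import unicodedata

def strip_diacritics(text):
    """Drop combining marks, which include the Arabic short-vowel signs."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))

def substring_hits(query, tokens):
    """Naive search: return every token that merely contains the query."""
    return [t for t in tokens if query in t]

def exact_hits(query, tokens):
    """Whole-word search: return only tokens identical to the query."""
    return [t for t in tokens if t == query]

# Assumed spellings of: cam (year), camer (populated),
# nacam (ostrich), camel (worker).
TOKENS = ["عام", "عامر", "نعام", "عامل"]
QUERY = "عام"

# The naive search matches all four words, although only one is wanted.
print(len(substring_hits(QUERY, TOKENS)))   # 4
print(len(exact_hits(QUERY, TOKENS)))       # 1

# d{arabuuka written with short vowels (escaped to keep the marks visible)
# and the unvocalised form shown in example (1).
VOCALISED = "\u0636\u064e\u0631\u064e\u0628\u064f\u0648\u0643\u064e"
PLAIN = "\u0636\u0631\u0628\u0648\u0643"
print(strip_diacritics(VOCALISED) == PLAIN)  # True
```

Whole-word matching removes the false hits but would in turn miss inflected and cliticised forms of the target word, which is why searching by lemma requires some morphological knowledge.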

6 By the word ‘root’ I mean the three or four nuclear consonantal letters from which we can generate all
possible word forms in Arabic by adding suffixes, prefixes or infixes.
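
The way word forms are generated from a root and a morphological pattern can be sketched as follows. This is a simplified illustration, not a morphological generator: the root k-t-b and the four patterns are standard textbook examples rather than data from this study, the notation (C for a root consonant slot, doubled vowels for length) is an assumption made for the sketch, and apply_pattern is a hypothetical helper.

```python
def apply_pattern(root, pattern):
    """Fill each 'C' slot of the pattern with the next root consonant."""
    consonants = root.split("-")
    result = []
    position = 0
    for ch in pattern:
        if ch == "C":
            result.append(consonants[position])
            position += 1
        else:
            # Vowels and affixes in the pattern are copied unchanged.
            result.append(ch)
    return "".join(result)

ROOT = "k-t-b"  # the root associated with writing
PATTERNS = ["CaCaCa", "CaaCiC", "maCCuuC", "CiCaaC"]

for pattern in PATTERNS:
    print(pattern, "->", apply_pattern(ROOT, pattern))
# CaCaCa -> kataba
# CaaCiC -> kaatib
# maCCuuC -> maktuub
# CiCaaC -> kitaab
```

Reversing this process, that is, recovering the root from a surface form, is the harder direction and is what a lemma-based search needs.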

Chapter Three: Corpus Linguistics
3.1 Introduction
Corpus is a Latin word which means ‘body’; hence any collection of texts, linguistic or
non-linguistic, can be called a corpus, such as the Corpus Juris Civilis, which was a collection of
early Roman laws and legal principles compiled in the sixth century, and the corpus Manuscript of
Chaucer (1400), which included Chaucer’s works. In 1731 Alexander Cruden used the Bible
(King James Version) as a corpus to show that the Bible is consistent (Kennedy 1998: 14). In
modern linguistic terms, a corpus is a designed collection of written data, spoken data or a
mixture of the two which can be used for linguistic investigation. In this sense, not every
collection of texts can be called a corpus, since there is a big difference between a corpus and
a text database; the former has to be ‘a systematic, planned, and structured compilation of
text’ (ibid: 4).

Linguists throughout the history of linguistic research used to rely on textual resources as a
source of evidence, at least, to prove the correctness of their theories about language. ‘It is
obvious that if someone sets about writing a grammar of English, he must have a suitable
body of material from which he is to elicit his rules, whether they be purely descriptive, or, as
is more common, prescriptive or even pedagogical. These bodies of material may be
considered corpora, with some extension of the term’ (Francis 1992: 28).

The study of language in general, whether in the context of modern linguistics or in the
context of earlier linguistic studies has also been largely based on empirical research. This
empirical approach to language is basically dominated by the observation of naturally
occurring data, as linguists tended to gather evidence for the grammaticality of a given word
or a sentence. This is partly what corpus linguistics deals with. However, corpus linguistics
goes beyond the use of corpora as a source of evidence in linguistic description. ‘Corpus
linguistics, like all linguistics, is concerned primarily with the description and explanation of
the nature, structure and use of language and languages and with particular matters such as
language acquisition, variation and change’ (Kennedy 1998: 8).

Nowadays, two main objectives can be met via corpus collection: linguistic investigation and
language processing. As Souter and Atwell (1993: i-ii) explained,

Two primary research applications of corpora can be identified. On the
one hand, linguists hope to exploit computer technology to explore
linguistic data for the purpose of identifying linguistic trends and
developing new theories. On the other, computer scientists and
practitioners of artificial intelligence hope to use the linguistic
information (including frequencies) present in and derivable from
machine-readable corpora to develop software tools and systems for
the automatic analysis, understanding and generation of natural
languages like English. In some cases, of course, they will also
employ the frameworks developed by the linguists, but this is by no
means always the case.

3.2 Intuition vs. Empiricism

A general motivation for many linguistic studies before the 1950s was the desire to ground
linguistics in a positivist and behaviourist view of science. Linguists
like Harris and Hill regarded the corpus as the ‘primary explicandum of linguistics’. For such
linguists, the corpus was a sufficient source of data, whereas intuition could, if need be, be
used as a second source (Leech 1991: 8).

With the advent of Chomskyan theories in the 1950s, less emphasis was placed on empirical
observations. With the authority of his works, Chomsky directed linguistics away from
empiricism and the study of language use towards rationalism for many years. Following de
Saussure, he made a distinction between two approaches to looking at language: a theory of
the language system and a theory of language use. He termed these (1965)
competence and performance.7 Chomsky, rejecting the corpus linguistics approach, argued:

Any natural corpus will be skewed. Some sentences won’t occur
because they are obvious, others because they are false, still others
because they are implicit. The corpus, if natural, will be so wildly
skewed that the description [of language] would be no more than a
mere list.
(Chomsky, 1962, quoted in Leech 1991: 8)

7 Competence can be defined as ‘the speaker-hearer’s knowledge of his language’ whereas performance is ‘the
actual use of language in concrete situations’ (Chomsky: 1965: 4). Competence both explains and characterises
one’s internalised knowledge of a language. The only way to investigate competence is through introspection.

In the course of invalidating corpus-based studies, he gave a lecture at the Linguistic
Society of America Summer Institute in 1964, in which he rejected any kind of quantitative
(statistical) data. To support his argument, he gave the examples in (1a) and (1b) below:
1a. I live in New York.
1b. I live in Dayton, Ohio.
Sentence (1a) is likely to occur more frequently than (1b), simply for demographic reasons!

Following Chomsky, Horrocks (1987: 13-14) argues that although performance is the only
available evidence to the linguist, it is not a transparent reflection of competence. He (ibid:
16) expounded that an observationally adequate grammar cannot simply list all the well-
formed sentences of a given language. This is because our mind has a finite storage capacity
and the choices of language we produce are infinite. Only by positing competence can we
account for a finite system with the capacity to define the membership of an infinite set.
Therefore, Chomsky suggested that ‘the corpus could never be a useful tool for the linguist,
as the linguist must seek to model language competence rather than performance’ (McEnery
and Wilson, 1996: 5).

Horrocks (1987: 16-17) further argued that relying on a corpus to derive grammatical rules
will lead to some sort of rules which have a predictive power which can generate strings not
available in the corpus itself. However, we can only test the validity of such strings through
referring to the intuition of a native speaker.

In fact, the approaches based on Chomsky’s theories, which were considered mainstream in
linguistics, do not cope with vast areas of language study, most notably register variation,
where probability plays a major role in selecting certain combinations of meaning with
certain frequencies. However, the bitter criticism of corpus data arising from the tradition
which Chomsky established has led corpus linguists to remedy the drawbacks of corpus data
in areas such as balance and representativeness. I would suggest, following
Francis (1992), that if someone sets about writing a grammar of a given language, he must have a
corpus from which he is to derive his rules. Hence, the grammatical rules are derived by
analysis and generalisation of a corpus.

Makkai (1987) considers total reliance on intuition a serious disease affecting modern
linguistics, which he calls textphobia and which needs radical surgery. A useful cure for this
disease, he proposes, is reading Malinowski, Firth and Halliday.

It is worth stressing that eliminating observation from the study of language was fervently
criticised by linguists even before Chomsky. Criticising de Saussure’s approach, Malinowski
in 1936 suggested overlooking the question of langue and parole and paying more attention to
the living speech in a context of situation, which is the main object of linguistic study
(Roulet, 1975: 78).

Firth (1957) also discredited the introspection of the native speaker as a reliable source of
data. He observed that the language we produce is governed to a large extent by particular
conventions (social, situational, etc.).

Sinclair (1991) also criticised the reliance on intuitive data, especially in the field of word
meaning, lexis. He argued that ‘we may see formal patterns being used overtly as criteria for
analysing meaning, which is a more secure and less eccentric position for a discipline which
aspires to scientific seriousness’ (Sinclair, 1991: 6-7).

Instead of treating corpus-based and intuition-based linguistics as two contradictory
disciplines, we would rather make use of both of them in a more interactive way. Fillmore
(1992) argued that the two approaches can interface with and complement each other, since a
corpus, however large, is inadequate to cover all aspects of language. On the other hand, a
corpus, however small, can pinpoint interesting facts. He emphasised the role of the native
speaker’s introspective judgement as a subsequent step.

In conclusion, studying corpora of naturally occurring data is a very useful way to test a
theoretical model put forward through intuition or to investigate a language with an emphasis
on what is typical in this language, or what are called norms of use.

3.3 Historical Survey

We have to bear in mind that the manual collection of textual resources was the regular
means before the invention of computers. With the introduction of the computer into the field,
the interest in corpora has grown and continues to increase. This is because it is quite hard
to manipulate large corpora accurately without the use of computer techniques. The
computer made the process easier and more reliable. Thus we can distinguish between two
stages of corpus collection: pre-computational and computational corpus linguistics.

3.3.1 Pre-computational Corpus Linguistics

The definition of corpus as a designed collection of texts for linguistic investigation
subsumes all early corpora compiled in this respect. However, most studies of corpus
linguistics are mainly focused on English, although corpora in this sense are deeply rooted in
the history of linguistics as most of the great civilizations have long traditions of the study of
language. For instance, Panini’s grammar of Sanskrit, Thrax’s grammar of Greek and early
Arab linguistics were definitely based on textual resources. However, apart from Arabic, we
do not know exactly what form of corpus they used, since none of them has left an account of
the methodology used.

The early Arab linguists relied mainly on three sources of linguistic data to describe their
language: the Holy Qur’an, poetry and nomad proverbs. This is obvious in their use of
quotations from these sources as linguistic evidence. Such quotations were certainly taken
from a corpus they designed for their inquiry about language. They have postulated certain
selection criteria for designing such a corpus. Versteegh explained, ‘on the one hand, the
corpus used by the grammarians was closed, being limited to the text of the Qur’an and the
pre-Islamic poetry, but on the other hand, the grammarians upheld the fiction of native
speakers whose judgement could be trusted’ (1997: 42).

They made it as representative as possible. Ditters (1990: 130) described this corpus as
consisting of specific media, registers, genres, styles and varied topics including poetry and
prose. He (ibid: 133) pointed out the way early Arab grammarians employed the corpus they
Originally corpus-information constituted the basis for a grammar of
the Arabic language, but instead of the grammar being tested out again
and again on corpus-data in a cyclic process as is the case in modern
corpus linguistics, this grammar became the norm for language use.

As for English language corpora, Francis (1992) gives a full description of English pre-computer
corpora. He divided these corpora into three types: lexicographical, dialectological and
grammatical, and pinpointed two drawbacks of these collections. (1) The editors of
lexicographical collections, the Oxford English Dictionary and Webster’s Dictionary in
particular, encountered a big problem, as they did not have enough citations for function
words and simple words such as prepositions, articles and pronouns. (2) The major difficulty with
collections assembled for grammatical investigation is that ‘they are inevitably skewed in the
direction of the unusual and interesting constructions that the readers encounter, at the
expense of the normal core of the language’ (Francis 1992: 28). Commenting on this,
Johansson (1995) suggested, ‘the natural solution to this problem is to collect texts in a
systematic manner and subject them to the principle of “total accountability”’ (Johansson,
1995: 244).

Quirk, in an attempt to avoid the shortcomings of the other corpora, collected a more
representative corpus (spoken and written), taken from a wide range of genres, as a basis for
describing English grammar. Therefore, his Survey of English Usage is considered a
landmark in corpus-based grammatical description in the 20th century. It is important to note
that ‘the spoken part of SEU corpus was, however, later computerised yielding the London-
Lund Corpus’ (Svartvik, 1990 quoted in Kenny 1999: 32). Therefore, Kennedy (1990: 17)
pointed out that the SEU corpus, which was initially manually assembled is considered a
transitional point between a non-computerised corpus and modern corpus linguistics.

Undoubtedly, working on such large corpora was tedious and exhausting. This is because
processing corpora without the assistance of computer techniques is time-consuming, banal,
error-prone, boring and very expensive (McEnery and Wilson, 1996: 10). It now takes a
matter of minutes to process such corpora accurately by computer.

As a point of departure we can conclude that the methodology of corpus linguistics, however
unrepresentative of the actual use of language, was widespread in linguistics for a long time.
Corpora remained a source of data for linguistic research, in spite of the difficulties raised
above, until the 1950s, when the use of corpora in linguistic research suffered a severe blow at
the hands of Chomsky, who rejected it as a reliable methodology (see 3.2).

3.3.2 Computational Corpus Linguistics

With the introduction of computers to the field of corpus linguistics, much attention has been
given to this methodology. The electronic corpus became widely recognised and
exploited after Francis and Kucera launched their pioneering corpus (the Brown Corpus) in
1961. Linguists then began to realise that electronic corpora can offer new insights and a
reliable methodology for natural language processing, as they found that computers
made possible the collection, storage and processing of very large and varied texts. Unlike
manual corpora, computerised corpora can be well designed and representative,
and can be processed in a few minutes. This can reveal unexpected features of
language. More importantly, ‘the ability to examine large text corpora in a systematic manner
allows access to a quality of evidence that has not been available before.’ (Sinclair, 1991a: 4)

Computerised English Corpora

Today, there are many electronic corpora available on either punched cards or CD-ROMs in
various languages, such as the Lancaster/Oslo-Bergen Corpus (LOB), the London-Lund Corpus,
the Lancaster/IBM Spoken English Corpus (SEC), the Longman/Lancaster English
Language Corpus, and the British National Corpus (BNC).

Below I give a brief account of two major English corpora: the Brown Corpus as the
first computerised corpus and the Birmingham Collection as the first major computerised corpus
used for dictionary-making based on a thorough study of language use.

Brown Corpus
This was, undoubtedly, a pioneering corpus, not only because it was the first computerised
corpus of English, but also because it went against the mainstream, which was
intuition-oriented. The corpus consisted of about one million words of written English printed in
the US in 1961, comprising 500 text samples of about 2000 words each. The samples were taken
from a variety of genres excluding verse and drama. The project started in 1961 and only
after three years (in 1964) was the corpus ready for distribution on magnetic tape.
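
The arithmetic behind these figures, and the general idea of taking fixed-length samples from longer texts, can be sketched in Python. This is only an illustration: how the Brown Corpus samples were actually selected is not described here, so the random choice of a starting point, the toy text and the helper draw_sample are all assumptions made for the example.

```python
import random

def draw_sample(words, sample_len, rng):
    """Return a contiguous run of sample_len words from one text."""
    if len(words) <= sample_len:
        # A text shorter than the target length is taken whole.
        return list(words)
    start = rng.randrange(len(words) - sample_len + 1)
    return words[start:start + sample_len]

N_SAMPLES = 500     # number of text samples reported for the corpus
SAMPLE_LEN = 2000   # approximate number of words per sample

# 500 samples of about 2000 words give about one million words.
print(N_SAMPLES * SAMPLE_LEN)  # 1000000

# A toy "text" of 5000 placeholder tokens stands in for a real document.
rng = random.Random(0)
toy_text = ["w%d" % i for i in range(5000)]
sample = draw_sample(toy_text, SAMPLE_LEN, rng)
print(len(sample))  # 2000
```

Keeping every sample the same length stops any single long text from dominating the word counts of the corpus.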

Birmingham Collection
The starting point of this corpus goes back to the 1960s in the form of research carried out at
Birmingham University, where Sinclair (1969) issued his early computational British corpus,
the OSTI project (135,000 running words of informal conversation transcribed and
computerised). The collection undertaken at Birmingham University is made up of written
texts and transcribed speech. It was intended to provide raw language data for a variety of
purposes, relevant to the needs of the learners and teachers, lexicographic in particular
(Renouf, 1984: 4-5). Since 1980 Cobuild, which is a joint venture between Collins and the
School of English at Birmingham University, has been collecting a corpus for dictionary
compilation and language study, making use of the Birmingham collection.
In October 2000 the latest release of the corpus amounted to 415 million words and it
continues to grow with the constant addition of new material. Research at COBUILD over
the last fifteen years has shown that very large samples of text are necessary for good
linguistic study, since the vocabulary of English is so large (well over half a million different
words) and there is such variety in current usage. In order to draw statistically valid
conclusions from computerised analysis of a corpus, researchers need to have adequate data
samples at their disposal (

In addition to the corpora mentioned above, there are ‘a number of initiatives that have aimed
at collecting and disseminating textual material amongst the international research
community’ (Kenny 1999: 34). Below are examples of these initiatives: The ACL/DCI (the
Association for Computational Linguistics’ Data Collection Initiative) which produced a CD-
ROM containing just plain orthographic text. It consists of the Collins English Dictionary;
selections from the Wall Street Journal; the Penn Treebank of skeleton-parsed data compiled
by Mitch Marcus and his team at the University of Pennsylvania; and a database of scientific
abstracts. There are also some other initiatives like ECI (European Corpora Initiative), LDC
(The Linguistic Data Consortium) and ELRA (The European Language Resources Association).

3.4 Corpus Design

The corpora we have mentioned above are not assembled haphazardly, since a corpus is
defined as a designed collection of texts. Prior to the process of collecting a corpus there
should be theoretical research to specify what type, time period, language variety or state,
size and design method a corpus involves (Sinclair 1987; Atkins et al. 1992; Biber 1993;
McEnery & Wilson 1996; Kennedy 1998, Meyer 2002).

3.4.1 The purpose of the corpus

From the many corpora we have discussed above we can conclude that corpora can be
designed for several purposes: as a basis for a dictionary; to create a word frequency list; to
study some linguistic phenomenon; to study the language of a particular author or time
period; to study language change; to train an NLP system; as a teaching resource for non-
native speakers; to study language acquisition. Due to the diversity of corpora purposes, there
is no consensus among corpus linguists as to the procedures or the selection criteria to be
followed in corpus design. For example, the selection criteria for Cobuild excluded poetry,
drama and technical language (Renouf, 1984: 6). In addition to excluding poetry and drama,
the Brown Corpus is designed to be a synchronic corpus: it contains written texts of
American English published in 1961. If the purpose of the corpus is to highlight the features
of a language over a period of time, we will definitely need a criterion that allows that
purpose to be met. Moreover, specialist corpora may introduce different criteria to study a
certain aspect of the language.
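
One of the purposes listed above, creating a word frequency list, can be illustrated with a short Python sketch. The sample sentence, the very simple tokenisation rule (runs of lowercase Latin letters and apostrophes) and the function name frequency_list are assumptions made for the illustration; a real corpus would need tokenisation suited to its language and script.

```python
import re
from collections import Counter

def frequency_list(text):
    """Return (word, count) pairs, most frequent first."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common()

TOY_TEXT = "The corpus is a body of text. The text is the evidence."

freq = frequency_list(TOY_TEXT)
print(freq[0])               # ('the', 3)
print(dict(freq)["text"])    # 2
```

Even in this toy example the most frequent item is a function word, which is the kind of item that manually collected citation files tended to under-represent.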

One of the first considerations in constructing a corpus is to specify for whom and for what
the corpus is designed: for personal research, or to serve as a general resource. Kennedy
(1998: 70) argued, ‘the optimal design of a corpus is highly dependent on the purpose for
which it is intended to be used.’ Nevertheless, Atkins et al. (1992) and Meyer (2002) drew up the
principal features of corpus design for whatever purpose. They discussed the practical stages
in building a corpus: selection of sources, text annotation, copyright permission, in addition
to some extra-linguistic variables.

3.4.2 Text Sampling
The next step after deciding the type, purpose and content of a corpus is to select and sample
the actual texts which will make up the corpus. Biber (1993: 243) pointed out that any
selection of texts is considered a sample, irrespective of being representative or not, but he
noted that ‘a corpus must be ‘representative’ in order to be appropriately used as the basis for
generalisations concerning a language as a whole.’ However, we have to bear in mind, in the
first place, that there may be a corpus that is designed to represent not the language as a
whole but one particular genre or the complete works of an author, for example. Secondly, it is
feasible to assemble the complete Old English corpus or the complete Early Middle
English corpus, but a complete 20th-century British or American English corpus is not feasible.
This is because it is too difficult to access all the publications in a given language, let alone

There are two ways of sampling a language: language reception and language production, i.e.
whether to sample the language that is heard and read or the language that is spoken and
written (Atkins et al., 1992: 5). We can hardly achieve a representative sample of total
language production because of the vast demographic and contextual variation among people. In
addition, a corpus, however big, is small when compared with the entire population of the
language under investigation.

Moreover, ‘the value of a corpus as a research tool cannot be measured in terms of brute size.
The diversity of the corpus, in terms of the variety of registers or text types it represents, can
be an equally important (or even more important) criterion’ (Garside, Leech and McEnery,
1997: 2).

With this in mind, Garside, Leech and Sampson (1987: 6) noted that Sinclair (1982) defined
the problem of corpus compilation as a problem of selecting the right sample from the
existing massive quantities of machine-readable texts. The main challenge in sampling the
population8 of a given language lies in representing all the relevant genres, topics or
registers while keeping the corpus at a manageable size. Therefore, sampling has to be
conducted according to statistical measures, so that the sample is qualitatively and
quantitatively representative of the entire publication and population.

More importantly, in order to achieve accurate representativeness of the samples in general
corpora, we have to ensure the diversity of the selected data. A diverse corpus avoids the
pervasiveness of a certain genre or of the style of one author. Sampling from various genres
can reduce the possibility of the corpus being dominated by the stylistic idiosyncrasies of a
particular author (Atkins et al. 1992: 2).

Sampling all data randomly, where all texts have a chance to be represented, can also reduce
the effect of the stylistic idiosyncrasies of authors. However, Biber (1993: 244) argued that
the process of random sampling is mostly used within each subgenre to ensure a
representative selection of texts.
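Random sampling within each subgenre can be illustrated with a short sketch. This is only an illustration under stated assumptions: the catalogue of texts, the genre labels and the function name `stratified_sample` are hypothetical and are not taken from any of the corpora discussed here.

```python
import random

def stratified_sample(texts_by_genre, per_genre, seed=0):
    """Draw up to `per_genre` texts at random from each genre (stratum)."""
    rng = random.Random(seed)  # fixed seed so that the draw is repeatable
    sample = {}
    for genre, texts in texts_by_genre.items():
        k = min(per_genre, len(texts))  # a small genre contributes all it has
        sample[genre] = rng.sample(texts, k)
    return sample

# Hypothetical catalogue: genre label -> identifiers of candidate texts.
catalogue = {
    "poetry": ["p1", "p2", "p3", "p4"],
    "theology": ["t1", "t2", "t3"],
    "medicine": ["m1"],
}
chosen = stratified_sample(catalogue, per_genre=2)
```

Because the draw is made separately inside each genre, no single genre can crowd out the others, whereas a purely random draw over the whole catalogue would tend to favour whichever genre has the most texts.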

Sinclair (1995: 27-28) made a distinction between a ‘whole text’ corpus and a ‘sample
corpus’. He noted that ‘samples are small, in relation to texts such as newspapers, books,
radio programmes, and of a constant size, hence not qualifying as texts.’ Unlike many corpus
linguists like Francis and Kucera in their pioneering corpus (Brown Corpus) in 1961, he
thinks that ‘whole text corpus’ should be a default value for anyone building a corpus. To
him, ‘the use of small samples is just a remnant of the early restraints on corpus building’
(ibid). Stubbs (1993: 11) also argues in favour of whole texts being the unit of study. He also
quoted Sinclair saying that ‘few linguistic features of a text are distributed evenly
throughout’, which could be overlooked with use of sample texts.

8 To statisticians, this word does not necessarily refer to human beings as commonly used. We may have a
population of anything to be counted such as people, animals, trees, companies, books, cars, etc. (Stuart, 1968:

3.4.3 Text Typology

Atkins et al. (1992) distinguished between two kinds of criteria for constructing a corpus:
external (non-linguistic) and internal (linguistic). The former criteria are the first to
consider when compiling a corpus, whereas the latter cannot be applied until the corpus
becomes available for analysis (ibid: 5).

In sampling written texts, the designer of the corpus has to take into account some important
information about both the author and the reader who differ in regard to certain author-related
and work-related criteria. Such considerations, in addition to contextual criteria, are also
required when sampling spoken data. These criteria are by definition non-linguistic.

Atkins et al. (1992) have given a full systematic account of non-linguistic characteristics in
corpus design. Work-related criteria include, among other things, mode (written, spoken,
written to be read, written to be spoken), text origin, preparedness, participants, genre, style,
setting, factuality, topic, date of publication. Author-related criteria are those associated with
authors. These criteria are mainly demographic: geographical, ethnic, socioeconomic, and social
(age, education, sex, profession, nationality, age and size of intended audience or readership,
etc.). Contextual criteria refer to situationally-defined varieties such as conversation
(face-to-face vs. telephone; informal), monologue vs. dialogue, and personal vs. impersonal.

3.5 Technical Requirements

In addition to the criteria mentioned above, there are also some considerations one has to
keep in mind when designing a corpus such as getting permission, data capturing, marking-

Before starting the process of creating a corpus, the designer may have to get permission
from the publishers of his selected works, national or international, to use the text in an
electronic form for language research. Having got permission, he needs to capture the data.

Written corpora are easy to capture by keyboarding, scanning or downloading from the
Internet. However, proofreading is still needed to make sure of the reliability of the data.
Spoken material, on the other hand, is difficult to capture. Spoken materials need to be
recorded and then transcribed before processing. To have a reliable transcribed text is,
undoubtedly, time-consuming, expensive and error-prone. This is because people’s perception
of speech may differ in respect of prosodic features, situations, homophonous words, etc.

Once a text, written or spoken, is captured electronically, some information can be added,
electronically, to indicate some text features such as titles, chapters, paragraphs, sentence
boundaries, headings, various types of hyphenation, etc. This process is called marking-up.
Other information can also be added to the text to show the part of speech of each word (as
in tagged corpora), or the sentence structure and the function of each word in the sentence
(as in parsed corpora).

3.6 Corpus Processing

Once a corpus is available in electronic form it needs to be processed by computer
for use in linguistic research. Since most corpora are very large, it is impractical to search
a corpus without the help of software that can find what we are looking for quickly and
accurately. Hence, we need tools to turn the electronic texts into databases which can be
searched. Many tools have been designed for this purpose.

Barnbrook (1996), Meyer (2002) and Kenny (2001) gave an overview of how to process such
a corpus. The first thing the computer techniques can do with texts is to provide word
frequency lists for the whole contents of the texts.

Frequency Lists
These lists can be made by identifying every word form in the text, counting identical forms
and classifying them according to a particular order: alphabetical, or according to their
frequency. This can be done in descending or ascending order. Listing words according to
their frequencies can show how often every single word form occurs in the text. Therefore,
‘by examining a list, one can get an idea of what further information would be worth
acquiring: or one can make guesses about the structure of the text, and so focus an
investigation’ (Sinclair, 1991: 31).
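The procedure described above (identify every word form, count identical forms, then order the list) can be sketched in a few lines. This is a minimal illustration rather than the method of any particular program: the tokenisation rule (runs of word characters, lower-cased) and the English sample sentence are assumptions made for the example.

```python
import re
from collections import Counter

def frequency_list(text):
    """Count every word form; return (form, count) pairs, most frequent first."""
    forms = re.findall(r"\w+", text.lower())  # simplistic tokenisation (assumption)
    counts = Counter(forms)
    # Descending frequency; alphabetical order breaks ties.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

sample = "The cat sat on the mat. The mat was flat."
by_frequency = frequency_list(sample)  # frequency order
alphabetical = sorted(by_frequency)    # alphabetical order of word forms
```

Note that this tokenisation rule would split voweled Arabic words at each diacritic, so it is only suitable for unvoweled text or for text from which the diacritics have been removed.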

A concordance can be defined as listing all occurrences of search-words in the text with a
short section of the context that precedes and follows each word. Unlike word frequency lists,
the search-word is represented within its contextual environment; this can give more
information about the nature and behaviour of words. This process is also called KWIC (key
word in context). The search-word can be highlighted by putting it in the centre of each line,
with a space on each side. The arrangement of each key word is alphabetical according to the
left-hand or the right-hand context. Barnbrook (1996) describes the main features of
concordance programs in detail.
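A KWIC display of this kind can be sketched as follows. The function names, the context width and the English sample are assumptions chosen for illustration; real concordance programs offer many more options.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, key word, right context) for each occurrence."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

def format_kwic(lines, column=30):
    # Right-justify the left context so that the key word lines up in one column.
    return [f"{left:>{column}}  {key}  {right}" for left, key, right in lines]

tokens = "the boy saw the dog and the dog saw the boy".split()
hits = kwic(tokens, "dog", width=2)
by_left_context = sorted(hits, key=lambda line: line[0])
by_right_context = sorted(hits, key=lambda line: line[2])
```

Sorting on the left-hand or the right-hand context corresponds to the two alphabetical arrangements mentioned above.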

In addition to KWIC and word frequency lists, most programs also offer the possibility of
searching for word combinations within a specified range of words. Furthermore, if the
program is a bit more sophisticated, it might also provide its user with lists of collocates
based on some statistical tests. Collocation is discussed in detail in Chapter Five.
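Searching for word combinations within a specified range can be illustrated by counting the words that fall inside a window around a node word. The sketch below returns raw co-occurrence counts only; it does not implement any of the statistical tests just mentioned, and the function name and sample phrase are assumptions.

```python
from collections import Counter

def collocates(tokens, node, span=4):
    """Count the words occurring within `span` tokens on either side of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        counts.update(window)
    return counts

tokens = "strong tea and strong coffee but weak tea".split()
near_tea = collocates(tokens, "tea", span=1)
```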

3.7 Summary
This chapter has given a brief account of the methodology of corpus linguistics and has
surveyed its historical background. We have examined some aspects of corpus linguistics
to acquaint the reader with the state of the art. These aspects include the methodological
issues in creating a corpus, such as representativeness, size and sampling, the types
of corpora, and the technical requirements for utilising corpora.

Chapter Four: Description of the Corpus and Tools of Analysis
4.1 Introduction
Based on the information given in the previous chapter we embarked on building a
computerised Arabic corpus to use in our linguistic study on lexical collocations and
synonymy in Arabic, taking into consideration the state of the art of Arabic which we will
discuss below. We attempted to meet all the design criteria for corpora compilation in order
that we can conduct a methodical study based on it and to make it available as a resource for
other researchers to use in the future.

4.2 Arabic for Computational analysis

Work in Arabic computing did not start as early as work on European languages. Attempts have
been made, but due to some technical problems with Arabic script (orthography) and grammar
there has been far less development than for English and other languages written with the
Roman alphabet. This is because ‘the native Arabic grammar [which is produced by early Arab linguists],
although one of the most sophisticated systems of linguistic analysis ever devised, was
developed by scholars who lacked the concepts of consonant, vowel, and syllable’ (Koenraad
et al, 1999: 162-63). This raises some problems in digitising Arabic which require laborious
computational work. For instance, the absence of vowels in Arabic9 makes the process of
tagging or any morphological analysis quite hard and sometimes ambiguous. Consider for
example the three-letter word ورد wrd, which can be lexicalised as a verb وَرَدَ warada ‘come,
be mentioned’, a noun وَرْدٌ ward ‘flower’, or a noun وِرْدٌ wird ‘watering place’. For more details
about the difficulties of analysing Arabic computationally see Goweder and Roeck (2001),
Khoja, Garside and Knowles (2001), Van Mol (2002).
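The ambiguity of the wrd example can be made concrete with a small sketch. It assumes that vowel marks are encoded as the Unicode combining characters U+064B to U+0652; the function name `strip_diacritics` is hypothetical.

```python
import re

# Arabic vowel and related marks, fathatan (U+064B) through sukun (U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")

def strip_diacritics(word):
    """Remove vowel marks, leaving the bare consonantal skeleton."""
    return DIACRITICS.sub("", word)

warada = "\u0648\u064E\u0631\u064E\u062F\u064E"  # verb 'come, be mentioned'
ward = "\u0648\u064E\u0631\u0652\u062F\u064C"    # noun 'flower'
wird = "\u0648\u0650\u0631\u0652\u062F\u064C"    # noun 'watering place'

# All three voweled forms collapse onto one unvoweled string.
skeletons = {strip_diacritics(w) for w in (warada, ward, wird)}
```

Since the three readings share a single unvoweled skeleton, a program that sees only the written form has no orthographic basis for choosing between them.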

4.2.1 Progress in machine-readable Arabic language

The Sakhr Company has been working on digitising Arabic since 1985. Two years later they
managed to produce the first Arabic morphological analyser. Not until 2001 did they manage
to launch an Arabic OCR that can handle Arabic efficiently, including the problem of diacritics.

9 A few written Arabic texts contain vowels; the most famous one is the Qur’an, with a fully-detailed vowel system.
Then we can find some old Arabic poems and some primary schoolbooks with only vowels that mark the words

With the latest OCR techniques for handling Arabic, a lot of attention has been given to
rendering Arabic texts, especially religious material, into machine-readable form; many such
texts are now available on the web. However, these texts cannot be considered corpora
because they lack systematicity, representativeness and proper planning. Nevertheless, there
was some work predating the widespread use of personal computers capable of handling
Arabic script at European universities, although this work used transliterated versions of the
Arabic texts. One of the pioneering projects, done on a mainframe computer, was the corpus
of early Arabic poetry assembled by Alan Jones at Oxford University. This corpus was
considered one of the major computerised sources of Arabic literary material before the
personal computer could handle Arabic.

Al-Jabouri and Knowles (1988) compiled a transcribed corpus of Arabic to investigate the
quantitative properties of cohesion in Arabic. This corpus is also transliterated. They noted
some difficulties that they encountered in the process of digitising Arabic that had not
previously been tackled. For instance, identifying orthographic words in Arabic is more
complicated than in English, because many Arabic words can be attached to the following
string of characters; wa ‘and’ and fa ‘then’, for example, are always attached in writing to the
following word.

Izwaini (2000) attempted to use corpus-based analysis with respect to Arabic, but he ended up
using a manual Arabic corpus: that is, he worked from a hard copy of the texts he had selected
rather than from an electronic version. He studied the impact of translation on collocations in
Arabic using two corpora: English and Arabic. The English corpus, which is part of TEC
(Translational English Corpus)10, is electronic, consisting of English text translated from
Arabic. The Arabic corpus, which was analysed with the naked eye11, consists of Arabic text
translated from Swedish. Later on, after the remarkable development in the field of
computational Arabic, Izwaini in his Ph.D. thesis (in progress) used another corpus covering
these three languages (Arabic, English and Swedish) electronically.

10 This corpus consists of works translated into English; it was first suggested by M. Baker (1995), CTIS,
Manchester University.
11 He analysed the English corpus electronically using the Wordsmith tool, but for the Arabic corpus he could not find
at the time an efficient tool (OCR) to convert the text into an electronic form nor a tool to process it (a
concordancer). So he read through the novels he had selected to find interesting patternings.

4.2.2 Arabic Language resources

With the development of tools that can handle machine-readable Arabic,
a lot of links to Arabic linguistic and Arabic and Islamic cultural sources now exist on the web.
Moreover, there is a strong tendency among Arabic newspaper publishers to post their articles
on the Internet, but most of them do so in the form of images, which are useless for
assembling computational corpora. For the most part, textual material in digital format can
be obtained from Arab publishing houses or companies interested in building Arabic
databases for commercial purposes.

Available Arabic Corpora

Compilation of a large corpus of MSA (Modern Standard Arabic), consisting of several
million words, was completed in 2003 at the University of Nijmegen for lexicographical use
( This corpus is a raw corpus (not tagged or
lemmatized) containing a variety of genres: newspapers, books, novels, reports, etc. After a
long search, the compilers decided to use the MonoConc concordance program, which they found
sufficient for their needs. They eventually published a Dutch-Arabic dictionary (2003)
based on that corpus.

ELRA (European Language Resources Association) provides two Arabic corpora: the An-Nahar
newspaper corpus, containing around 140 million words, and the Al-Hayat newspaper corpus,
containing 18 million words. The former is just a raw corpus, whereas the latter has only
mark-up notation, i.e. information relating to the original layout of the texts, including
sentence and paragraph boundaries, headings, deletions, and typographic features.
LDC (Linguistic Data Consortium) also has two Arabic corpora: a corpus of Arabic
newswire text, containing 76 million words, and a corpus of Egyptian Arabic speech,
consisting of 60 unscripted telephone conversations, lasting between 5 and 30 minutes.

The Sakhr Software Company in Egypt is a pioneering company using the latest techniques
to fulfil the needs of the Arabic market and the Arabic-speaking population in the field of
Arabic processing. It provides a large number of text collections and databases, which have
recently become available on its web site:

Arabic Online Texts

There are many sites on the internet which provide Arabic books in digital format for free,
but the material is mostly religious, like and
The non-religious texts are mainly journalistic. Al-Hayat newspaper, published in Arabic,
produced a CD-ROM of all its recent issues but it is in Macintosh format. Likewise, E.J. Brill
in Leiden is going to release a CD-ROM version of the Encyclopedia of Islam (in Arabic).
There are also some other Arabic newspapers posted on the internet as text (not images), such
as:
Al-Ahram Newspaper (an Egyptian daily newspaper),
Al-Akhbar Newspaper (an Egyptian daily newspaper),
Al-Wafd Newspaper (an Egyptian daily newspaper),
Al Bayan Newspaper (a daily newspaper from the United Arab Emirates),
Albaath Newspaper (a Syrian daily newspaper).
Many others can be found on:

4.2.3 Tagging Arabic Texts

A tagged corpus is a corpus annotated with codes that indicate additional information
such as part of speech, tense and/or aspect of verbs, etc. The process of tagging an Arabic
corpus is in itself tedious and time-consuming. To tag an Arabic text, the text must be
segmented into its component lexemes (Freeman 2001). This is because a single Arabic word
may represent a whole sentence.

Although tagged corpora are now available for many languages written in the Roman alphabet,
corpora of this sort are lagging behind for Arabic. There is an Arabic tagged corpus at
Lancaster, assembled by Shereen Khoja on the basis of an Arabic morphosyntactic tagset, along
with an Arabic part-of-speech tagger, in the Computing Department, University of Lancaster
(Khoja, et al 2001). But it is manually tagged and very small; it consists of only 1700 words,
tagged with the Arabic parts of speech (N, V and Particle) plus some syntactic information
(e.g. singular, masculine, definite common noun). She also has a tagged corpus of 50,000
words of Arabic newspaper text with the basic tags (N, V, Particle). Not until 2003 was Khoja
able to produce a tagger for Arabic in fulfilment of her Ph.D. thesis (Khoja 2002). Indeed, the
Arabic language is relatively difficult to tag due to most of the problems raised in section 2.4.

The Institute of Modern Languages of the Catholic University of Leuven started with the
manual annotating of a 4-million-word Arabic corpus. They are still working hard to
elaborate this corpus which will be used in the future as a basis for a semi-automatic tagging
of raw Arabic corpora (Van Mol, 2002).

Apart from Khoja’s corpus, which is very small and manually tagged, and the
Leuven one, which is still in progress, we do not know, at the time of writing, of any other
tagged corpus except for LDC’s and Sakhr’s. The most recent of these is the one produced by
the Linguistic Data Consortium (LDC) in 2003. They produced an Arabic Treebank: Part 1 v
2.0 consisting of 140,265 words (168,123 tokens after clitic segmentation). This is published
as part one of a one-million-word Modern Standard Arabic corpus. As for Sakhr’s, the Sakhr
Company in Egypt often claims that it owns a tagged corpus, but the company said it is for its
own purposes; it did not want to share it even for academic research.

Although these years witnessed great strides in the development of machine-readable tools that
can handle Arabic, we can barely find a public-domain tagged corpus12 or a POS tagger that
can work on Arabic to disambiguate unvoweled written Arabic texts, which is a very daunting
task. Almuhanna (2003), for example, had to romanise the Arabic alphabet (transliteration)
following Buckwalter13 in an attempt to tag his Arabic corpus. He followed this process: (1)
compiling a raw corpus, (2) transliteration, (3) segmentation, (4) tagging, (5) re-transliteration
into Arabic. He used the language-independent Brill tagger to automatically tag his
transliterated and segmented text after training it on a training corpus of 100,000 words,
which had already been tagged manually using Freeman’s tagset (2001). With Brill’s tagger
Almuhanna achieved 93% accuracy on his corpus of one million words.

12 The tagged LDC corpus was not personally assessed; in addition we could not find in the literature a proper
description of how much computation was involved in tagging that corpus other than what is mentioned above
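The transliteration, segmentation and re-transliteration steps of such a pipeline can be sketched as below. This is not Almuhanna's implementation: the table lists only six letters of a Buckwalter-style scheme, the segmentation rule is a toy assumption that splits off nothing but the definite article, and the tagging step is omitted because it requires a trained tagger.

```python
# Partial Buckwalter-style table: Arabic letter -> Roman character.
# Only a handful of letters are listed; a full scheme covers the whole alphabet.
ARABIC_TO_ROMAN = {
    "\u0627": "A",  # alif
    "\u0628": "b",  # baa'
    "\u062F": "d",  # dal
    "\u0631": "r",  # raa'
    "\u0644": "l",  # lam
    "\u0648": "w",  # waw
}
ROMAN_TO_ARABIC = {roman: arabic for arabic, roman in ARABIC_TO_ROMAN.items()}

def transliterate(text, table):
    # Characters missing from the table are passed through unchanged.
    return "".join(table.get(ch, ch) for ch in text)

def segment(roman_word):
    """Toy segmentation (assumption): split off a leading definite article 'Al'."""
    if roman_word.startswith("Al") and len(roman_word) > 3:
        return ["Al", roman_word[2:]]
    return [roman_word]

word = "\u0627\u0644\u0648\u0644\u062F"                     # al-walad 'the boy'
roman = transliterate(word, ARABIC_TO_ROMAN)                # step 2: transliteration
pieces = segment(roman)                                     # step 3: segmentation
restored = transliterate("".join(pieces), ROMAN_TO_ARABIC)  # step 5: back to Arabic
```

Because the mapping is one-to-one, the round trip restores the original Arabic string, which is what allows a tagger built for Roman-alphabet text to be used in the intermediate step.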

Khoja’s APT (Arabic Part-of-Speech Tagger)14 skips the above steps needed to tag Arabic
texts and works directly on Arabic script. She used a corpus of 50,000 words to train the
tagger. Her rule-based tagger arrives at word-roots by removing all affixes which are then
used to determine the grammatical position of the attached word. Some words were so
ambiguous that they did not receive any tags. So, she used a probability-based tagger with
which she managed to achieve 90% accuracy after disambiguating ambiguous words.
Nonetheless, there must be a manual tagging for all the lexical items in the training phase

The main difference between the tagset Almuhanna used and Khoja’s is that the former is
based on Latin convention in terms of labelling categories like N, V, Adj, Adv etc., whereas
the latter follows the traditional Arabic classification of the word into N, V and particle; all
other categories are treated as subcategories which can be marked by inheritance, i.e. all the
subcategories of the tripartite division inherit properties from the parent categories.
Secondly, Khoja’s tagger arrives at the item to be tagged directly, without any need for
segmentation or transliteration. However, these two attempts would, in the first place, require
a lot of human intervention to hand-edit the results prior to tagging; secondly, they would fail
to deal with some aspects of Arabic like seeming homonyms unless they have recourse to a more
sophisticated semantically-based analyser.
4.2.4 Tools for Processing Arabic
As for the tools that can be used for processing Arabic corpora, there are a few that can
handle Arabic texts, though not as well as English. They are as follows:

XConcord, developed by Malek Baualem, Mark Leisher and Bill Ogden (1996)
MonoConc, designed by Michael Barlow (1999)
Concordance, by R. J. C. Watt (2001)
aConCorde, by Andy Robert (early release, 2004)

14 Khoja’s tagger is not personally assessed.

XConcord is designed to work on texts in the Unicode Standard. It supports 17 languages,
including Arabic, and allows flexible searching. It displays Arabic correctly and the target
word is well aligned. However, it only works on the Solaris operating system.

MonoConc is a concordance program. This Windows tool is very easy to use and can initiate
concordance searches for words and phrases immediately. MonoConc offers functionality and
flexibility through a variety of configurable options. This program works well for Arabic text
analysis in Arabic Windows, but with one major drawback: the concordance output is
presented on the screen backwards. In other words, the KWIC (key-word-in-context) appears in
the middle as normal, but the context that is supposed to come after the key word
appears before it and that which precedes the key word follows it. However, you can save the
concordance output to a text-only file, and when you open it in a text editor (e.g. MS Arabic
Word), the text appears in the right order. Although this is a serious interface problem, the
program generally gives good results. A sample of the MonoConc search screen
and the search results after saving to a text-only file are given in Appendices 4 and 5.

This program was not designed to deal with Arabic orthography, since Arabic is not among
the languages it is claimed to handle. When I used it with Arabic texts, it turned out that
it can deal with them but with some discrepancies, due to the idiosyncrasies of Arabic
mentioned in section 2.4. The fact that Arabic is written with or without vowels15 requires
laborious extra work to search for all the possible forms of a given word separately. For
example, if you search using MonoConc for a voweled word, it will give you all the exact
occurrences of that word in the corpus, disregarding the other possible forms of that word or
even character variations such as alif with/without hamza and dotted/un-dotted yaa’. This
makes it extra hard to arrive at all possible forms of the words under examination or to work
out the various conventions of Arabic writing. A big problem that needs to be solved is
lemmatization. For example, الولد al-walad ‘the boy’ and ولد walad ‘boy’ get counted as
separate items unless the user lists all possible forms of this word. A lot of Arabic connectives
(conjunctions) are attached to their following constituents, so we need to strip off any affixes
and look for the base form, but this program has no means of defining prefixes to ignore when
sorting. Even so, I found this program useful in dealing with Arabic despite the
interface problem, which is not a major impediment. Without that program I would not have
finished my work.

15 Vowels are diacritics placed above or under the consonants. In modern writing it is left to the intuition of
the reader to guess them.
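One way of reducing the number of separate searches is to map spelling variants onto a single search key before the texts are concordanced. The sketch below is an assumption about how such a preprocessing step might look and is not a feature of MonoConc; the function name `normalise` and the choice of variants are illustrative, and the article stripping is deliberately naive.

```python
import re

DIACRITICS = re.compile("[\u064B-\u0652]")          # vowel marks, tanwin, shadda, sukun
ALIF_VARIANTS = re.compile("[\u0622\u0623\u0625]")  # alif with madda or hamza
ALIF = "\u0627"
ALIF_MAQSURA = "\u0649"                             # un-dotted yaa'
YAA = "\u064A"
DEFINITE_ARTICLE = "\u0627\u0644"                   # al-

def normalise(word, strip_article=False):
    """Map spelling variants of a word onto one search key."""
    word = DIACRITICS.sub("", word)
    word = ALIF_VARIANTS.sub(ALIF, word)
    word = word.replace(ALIF_MAQSURA, YAA)
    # Naive prefix stripping: it also cuts words that merely begin with alif-lam.
    if strip_article and word.startswith(DEFINITE_ARTICLE) and len(word) > 3:
        word = word[len(DEFINITE_ARTICLE):]
    return word

walad = "\u0648\u0644\u062F"                 # walad 'boy'
al_walad = "\u0627\u0644\u0648\u0644\u062F"  # al-walad 'the boy'
same_key = normalise(al_walad, strip_article=True) == normalise(walad)
```

A normalisation of this kind conflates forms that a lemmatiser would also keep apart in some cases, so it trades precision for recall and its output still needs to be checked by hand.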

Concordance is a program for Windows NT 4.0, Windows 2000, Windows 95/98, and
Windows ME which makes wordlists, concordances, and Web Concordances from electronic
texts. The program is distributed online; it offers a 30-day free trial, and for further use the user
needs to buy a registration. Watt’s Concordance is not designed to handle Arabic. Text lines
need to be short (no more than 15 words per line), otherwise it is awkward to trace the
full line of the ‘headword’ in the ‘view’ window. To me, the major drawback is that it creates
a large number of temporary files and one huge ‘Concordance data file’. For example, for a
text document of 120 KB, a 1.9 MB concordance data file is created. It would, therefore, be
essential to have a large storage capacity on a machine for text files containing several
million words.

aConCorde was originally developed for native Arabic concordancing and supports right-to-left
languages. This program, which can be downloaded free
( is written in Java and will run on
any platform that has the Java Runtime Environment installed. However, this program was
released early with shortcomings noted by the designer, such as its inability to cope with markup
notation, its ignoring of punctuation and its limited search options; it only accepts one item as a

As mentioned above, MonoConc and Concordance are not designed to handle Arabic texts,
but they happen to do so. The major problem that faces the user is that these two programs
align Arabic text in the same way as languages written in the Roman alphabet,
whereas Arabic is a right-aligned language, i.e. written right-to-left. It is worth mentioning
that Watt is currently working to add Arabic to the language list in his program

The reason why I was able to use MonoConc is that the only text in the corpus that has
diacritics is the Holy Qur’an, which constitutes 1.8% of the corpus, as discussed in 4.3 below. In addition, it is
compatible with Windows 95, 98 and 2000 and thus assures a reasonable degree of user-
friendliness with its graphical user interface. For quick concordances and word frequency
counts MonoConc is a very useful tool, and it is particularly useful for anyone involved in
Arabic lexicography and basic corpus-based linguistics.

4.3 Description of the Corpus

The Classical Arabic Corpus (CAC) is a raw corpus. Because we are concerned with
investigating lexical issues in Arabic we do not see having a tagged corpus as a major
requirement. Lexical investigations can be carried out with the aid of a raw corpus. This
corpus currently has around five million words, and consists entirely of written materials16. A
good percentage of all published material in Classical Arabic now exists in electronic form,
so it is easy to include this in my corpus without it having to be scanned or retyped. Using
electronic texts available through the Internet is useful for the following reasons: it saves time
and cost and it is more accurate.

The works I downloaded are mainly books. I also gathered some short poems written by one
poet into a collection and I treated them as a text. The time span of these writings starts as
early as the advent of Islam and runs up to the end of the eleventh century. As for the question of
copyright, all of these materials, apart from the Holy Qur’an, go back ten centuries or more,
so they do not need copyright permission from their authors. However, I obtained permission
for academic use from the web site designers, in recognition of the effort they have made in
making these books available on the Internet (see appendix 1).

To investigate lexical collocation in Arabic it is important to create a big corpus to inform our
research. Experience has shown that grammatical patterning can be identified and described
on the basis of a relatively small corpus. Lexical patterning, on the other hand, requires the use
of very large corpora. Sinclair (1991: 100) argues, ‘fairly small corpora, of one million
words or even fewer, are adequate for grammatical purposes, since the frequency of
occurrences of so-called grammatical or function words is quite high.’ It is easy to pinpoint
some generalities concerning function words by checking how common a word is. For
example, as Sinclair (ibid) pointed out, the most frequent word in the LOB corpus is ‘the’,
with 68,315 occurrences. Although the corpus is just one million words, we are still
able to make some investigations about the function words and other grammatical issues
which are expected to occur frequently in any corpus.

16 Some of the Classical Arabic texts are originally spoken texts such as the Holy Qur’an which is Allah’s Book
and the Prophet’s traditions; they were transmitted orally for a long time before the early Muslims put them in a
written form. However, I will count them as written texts because they reached us in such a form.

With regard to our corpus, a 5-million-word corpus is not large compared to the available
non-Arabic corpora. Nevertheless, it is the biggest Classical Arabic corpus assembled so far
and I intend to keep expanding it so that it becomes bigger and much more diverse.
In addition, it is worth noting that the Cobuild dictionary was informed at the very
beginning by observations derived from a 7.3-million-word corpus.

Since the corpus is limited to the early period of Islam, it would be possible to include every
text that exists. By doing this, it would definitely be representative (Biber, 1993), but this
may take a long time to do. Moreover, it is enough for the purpose of my corpus to conduct a
principled selection rather than a mere accumulation of texts.

Generally speaking, the first possible dichotomy is into fiction and non-fiction. The
proportion of fiction, which is apparently a part of literature, at 11% is drastically less than
the non-fictional texts. This is because of the considerable lack of fictional materials in the
period under investigation. It is a well-known fact that the novel and drama have only
recently been introduced to Arabic literature. However, there are a variety of stories and
popular legends written to have a moralistic impact on Muslims. Although the majority of
these narratives concern the leading personalities in Islam, many of them are fictional
(Somekh, 1991: 21).

Therefore a further dichotomy for my corpus is needed which can give a close picture of the
major interests of the early Muslim writers. The corpus can be divided into four genres: belief
and thought, literature, linguistics and science as represented in table 4.1. The genres can
further be divided into subgenres as shown in table 4.2. These two tables are represented in
charts as shown in Appendix 2. Under belief and thought we have five subgenres: the Holy
Qur’an, the Prophetic Tradition, theology, biography and philosophy. Literature has also two
subgenres: poetry and fiction. Linguistics is represented in this corpus as having two
subgenres: proverbs and lexicons. Finally, under science we have geography, mathematics,
physics and medicine. For more detail on the texts included in the corpus, see
appendix 3.

Genre                 Size in Words   Percentage
Thought and Belief        2,682,053        53.64
Literature                  648,608        12.97
Linguistics                 766,134        15.32
Science                     903,205        18.06
Table (4.1): The genres of CAC.

Genre                 Subgenre                        Size in Words   Percentage
Thought and Belief    The Holy Qur’an                        88,622          1.8
                      Prophetic Tradition (Hadith)          683,970         13.7
                      Biography                             393,933          7.9
                      Philosophy                            478,141          9.6
                      Theology                            1,037,387         20.7
Literature            Poetry                                 69,385          1.4
                      Fiction                               579,223         11.6
Linguistics           Proverbs                              362,054          7.2
                      Lexicons                              404,080          8.1
Science               Geography                              82,499          1.6
                      Physics                                57,553          1.2
                      Medicine                              736,469         14.7
                      Mathematics                            26,684          0.5
Total                                                     5,000,000         100%
Table (4.2): Subgenres included in CAC17.
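
The genre sizes and proportions can be recomputed from the subgenre word counts in table 4.2. The short Python sketch below is only an illustrative check: the word counts are copied from the table, the grouping of subgenres into genres follows the description given above, and the names SUBGENRES, genre_totals and percentages are my own.

```python
# Word counts per subgenre, copied from table 4.2 and grouped by genre
# as described in the text.
SUBGENRES = {
    "Thought and Belief": {
        "The Holy Qur'an": 88_622,
        "Prophetic Tradition (Hadith)": 683_970,
        "Biography": 393_933,
        "Philosophy": 478_141,
        "Theology": 1_037_387,
    },
    "Literature": {"Poetry": 69_385, "Fiction": 579_223},
    "Linguistics": {"Proverbs": 362_054, "Lexicons": 404_080},
    "Science": {
        "Geography": 82_499,
        "Physics": 57_553,
        "Medicine": 736_469,
        "Mathematics": 26_684,
    },
}


def genre_totals(subgenres):
    """Sum the subgenre word counts within each genre."""
    return {genre: sum(parts.values()) for genre, parts in subgenres.items()}


def percentages(counts, ndigits=2):
    """Express each count as a percentage of the overall total."""
    total = sum(counts.values())
    return {name: round(100 * n / total, ndigits) for name, n in counts.items()}


totals = genre_totals(SUBGENRES)
corpus_size = sum(totals.values())
shares = percentages(totals)

for genre, size in totals.items():
    print(f"{genre:<20}{size:>12,}{shares[genre]:>8.2f}")
```

Running the check makes it easy to confirm that the subgenre sizes add up to the overall corpus size reported in footnote 17.
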
4.3.1 The rationale behind this selection
The selection of texts to be included in a corpus can be done by chance or by choice. The
latter alternative enables the corpus builder to make deliberate selection of the texts to be
included. As Atkins et al (1992: 3) put it, ‘the selection of sources might be based on a
systematic analysis of the target population or on a random selection method’.

17 The overall total of CAC is exactly 5 million words; we arrived at that number after deleting a part from the
theology genre, particularly Al-Tabari’s book on Tafseer (exegesis of the Qur’an), since it is too long to include
in the CAC in full. After that deletion, Al-Tabari’s book constitutes about one-sixth of the corpus.

To select parts of the population as an object of research there has to be consensus among
linguists on the authenticity of the selected parts or the selection has to be based on principled
choices. This can ensure some sort of representativeness and this is what I adopted in this

As mentioned earlier, there are two ways of sampling: ‘whole text’ or ‘word text fragment’.
There are many corpora nowadays based on the approach of whole texts, like the Cobuild
corpus and the Bank of English. Likewise, I prefer whole texts to be the unit of study, as it is
more convenient for investigating Arabic, where we may come across sentences that extend
over a number of lines.

The corpus in hand includes among other things texts from the main branches of knowledge
introduced by the advent of Islam. By doing this, I tried not to skew the corpus too much in
any direction as ‘the stylistic idiosyncrasies of a particular author can be reduced in
significance if texts by many different authors are included’ (Atkins et al, 1992: 2).

4.3.2 Why these texts?

Unlike the early Arab linguists, whose corpus, gathered for linguistic investigation in their period,
comprised first and foremost the Qur’an and the old tribal poetry, in addition to the
nomads’ proverbs and sayings (Versteegh, 1997: 42), I used other genres and subgenres to
obtain a truly representative corpus.
Below is a description of the subgenres included in the corpus:
1. The first text on which early Arab linguists relied is the Holy Qur’an. This is the primary
evidence which Arab linguists relied on to prove the correctness of any linguistic issue. To
Muslims, the Qur’an has the highest position in religion and in language.
The Qur’an consists of 114 chapters (surahs) covering the social, cultural, political, and
religious life of Arabs of the early seventh century with references to some previous peoples.
The Qur’an’s structure is neither poetry nor prose. It is not poetry because it does not observe
the metre and rhyme of poetry, and it is not prose because it is not composed in the same
manner in which prose was customarily composed.

The early Arabs privileged language; they held public fairs for poetry in Mecca, especially at
‘Ukaz, where they used to present valuable prizes for the best poet. Within this specifically
Arab context, the Prophet Muhammad was sent as a Messenger and his major evidence is the

The Qur’an is inimitable; it is unique in style and unexcelled in beauty. God challenged the
Arabs to produce even a verse (a line) like the Qur’an but they could not. This point is
repeatedly emphasised in the Holy Book itself. Thus the Qur’an says:
If the whole of mankind and the jinn were to gather together to produce the like of this
Qur’an, they could not produce the like thereof, even if they backed each other up. (17:88)

2. Next to the Qur’an, poetry has been regarded as a main and authentic source of pure
language. The selection criteria for Cobuild excluded poetry from the Cobuild collection
because, to them, poetry is unrepresentative of mainstream linguistic behaviour (Renouf,
1984). To me, poetry cannot be ignored when looking into Arabic linguistics, especially
Classical Arabic. Poetry was highly valued in Arabic cultures of the Middle Ages. The
importance of poetry as a source of data for linguistic investigation can be shown in
Sibawayh’s reliance on it as the primary type of textual evidence. In his Kitab he referred to
poetry as evidence 1,050 times, to the Qur’an 447 times, to Hadith six times, and to prose 350 times.
Ibn Abbas in his commentary on the Qur’an relied on poetry to explain the meaning of
unclear lexical items in the text of the Qur’an. He said, ‘When you want to learn the meaning
of any weird word in the Qur’an, look for it in poetry.’ Also, we have to bear in mind that the
older the poetry, the more authority it possessed.

3. Hadith (Prophetic Tradition): It is a main source of authentic data. Hadith by definition can
be subsumed under spoken material as it includes all the recorded sayings and actions of the
Prophet Muhammad. It was transmitted orally as was the Holy Qur’an.

To me, the use of Hadith for grammatical investigation in Classical Arabic is of great
importance since the Prophet Muhammad is considered one of the most eloquent speakers of
his community because of his early upbringing among Bedouins who were renowned for
their eloquence. Hadith literature also retains a lot of ancient usage. Some scholars compiled
and classified these hadiths in systematic collections; the most authentic of them are Al-
Bukhari and Muslim. These two collections, which I included in my corpus, are usually
referred to by scholars as Al-S{ah{ih{aan, i.e. the two authentic collections.

4. Proverbs and Bedouin sayings: As already mentioned the third authentic source of data
which early Arab grammarians depended on is the Bedouin proverbs. The language of the
Bedouin has changed less than other varieties because they live away from urban
communities where different people of different dialects and languages live together.

The first person to collect Arabic proverbs was Al-Mufad}d}al ibn Salim (d. 784 AD).
Building on Al-Mufad}d}al’s work, Abu Hilal Al-Askary (d. 1004) and Al-Maydani (d.
1124) compiled more comprehensive collections of proverbs. Al-Maydani’s
Majmac Al-Amthaal (the Collection of Proverbs) contains explanatory notes on poetry.

5. Theology: This type of text flourished very early, as Muslims, encouraged by the caliphs
and motivated by their interest in studying their religion, introduced a number of sciences related to
the Holy Qur’an and Hadith, such as Qur’anic exegesis (Tafseer), jurisprudence (Fiqh), dogmatics,
etc. Qur’anic exegesis deals with the meaning of the verses, the reasons behind their
revelation (i.e. the historical references), and comments on their syntactic and semantic structure.
One of the most famous works of Tafseer is Al-Tabari’s (d. 922), which is considered ‘the
richest repository in this branch of study containing from verse to verse everything he could
gather from earlier literature’ (Goldziher, 1966: 46). Jurisprudence was also introduced to
explain the Islamic rulings that concern all Muslims in worshipping, daily transactions,
political system and relations with other people. These rulings were derived from the Qur’an
and Hadith.

Another branch of theology was the Foundations of Creed, dogmatics. This science of
studying the basics of belief was introduced as a result of defending the Islamic belief against
heretics and other sects. This gave rise to the rational approach of presenting Islam to non-Muslims.
One of the most important works in this field is Al-Ashcari’s (d. 935) Al-Ibaanah fi
‘Us}uul al-Diyaanah (The Explanation of the Roots of Creed). Al-Ashcari was the first to
formulate the orthodox thinking of creed. His book Al-Ibaanah has influenced most writings
on theology even today.

6. Biography: The first coherent biography of the Prophet was written by Ibn Ishaq (d. 768),
whose Siirat Rasuul Allaah (The Life of the Apostle of Allah) was revised and reworked by
Ibn Hisham (d. 833) to produce the oldest and most classical work in this field. Ibn Ishaq was
first entrusted by the Caliph Al-Mansur with the task of writing a book for his son Al-Mahdi
on history from the first man on earth until their own time. Ibn Ishaq’s work was more
comprehensive than Ibn Hisham’s, as the latter dealt only with the biography of the Prophet.
History then became an independent genre with works like Al-Akhbaar Al-T}iwaal (Long
Narratives) by Abu Hanifa al-Dinawri (d. 895) and Taariikh al-rusul wal-umam wa al-muluuk
(The History of Prophets, Nations and Kings) by Al-Tabari (d. 922), which reflected various
historical and cultural aspects of Islamic life.

7. Philosophy: Arabs began to apply speculative methods in order to defend or spread Islam.
Philosophy was just another aspect of religious studies. Al-Farabi, for example, held the
belief that philosophy and Islam are in harmony. One of his most important contributions is
Aara’ Ahl Al-Madiinah Al-Faad}ilah (The Utopia), a significant contribution to
sociology and political science. The shift undertaken by Al-Kindi (d. 872) from writing on
philosophy as a religious tool to writing on pure philosophy is considered the beginning of
the separation of philosophy from dogmatics; he was therefore the first independent writer
on philosophy. Then came Ibn Sina, known in the West as Avicenna, who enriched this
genre with many works, including a book on logic. He combined Greek philosophy and
Muslim theology.

8. Linguistics: Early Arab linguists influenced linguistic investigation universally. For
example, Al-Khalil’s lexicon, Al-cayn, was considered the first systematic and comprehensive
work of its kind. Al-Khalil (d. 786) was the first to give lexical order in the collection of his

In addition to lexicography, philology was also mastered by the early Arab linguists. Fiqh
Al-Lughah (The Code of Language) by Al-Thacalibi (d. 1037) was a truly marvellous
compendium of philology.

It is worth mentioning that there are arguments for not including lexicons and
linguistic works in a corpus. In the first place, they may contain citations from other
works which have their own grammar and style. Secondly, including such works in a
corpus could be misleading in the sense that they may use other languages to prove a
universal phenomenon or to investigate these languages themselves (Paul Bennett, personal
communication). This is not the case with the works I included in CAC, since the works of
linguistics and the lexicons I included are written entirely in Arabic without quoting a single
foreign word. In addition, citations from other works are restricted to certain texts:
the Qur’an, pre-Islamic poetry and nomad proverbs (cf. 3.2.2).

9. Science: The early Arab scientists paved the way for modern scientific observation in
mathematics, medicine, physics and so on. In medicine, Ibn Sina’s book Al-Qaanuun fi Al-T}ibb
(Canon of Medicine) was considered the first comprehensive encyclopaedia of
medicine. When it was translated into Latin,
it became the textbook for medical education in Europe in the 12th century.

Another field of science in which the West was indebted to the Arabs was mathematics. Arabs
are the inventors of the symbol 0 (zero) and this laid the foundation of positional arithmetic.
The first to write an arithmetic was Al-Khawarizmi (d. 849).

Physics was also studied by Arab scholars. Al-Biruni’s (d. 1048) contributions in physics
were pervasive during the first part of the last millennium. He was a pioneer in the study of
metals and precious stones. His book Kitaab al-Jamaahir discusses the properties of various
precious stones. In geography Al-Maqdisi (d. 977) studied most of the Islamic world and
wrote his marvellous book: Ah}san al-taqaasiim fi macrifat al-aqaliim (The best Division in
the knowledge of Climes) that made him a pioneering geographer of his time.

10. Fiction: As mentioned earlier, there is a considerable lack of Arabic fictional works.
However, some early Arabic works can be subsumed under narrative prose, such as
Al-Bukhalaa’ (The Misers) by Al-Jahiz and the Arabian Nights. The Thousand and One Nights
(Arabian Nights) (850) was originally written in Persian. It was translated and reworked
completely, leaving no Persian traces, so as not to contradict Islamic thought during the
Abbasid period. Al-Jahiz (d. 868) contributed to a variety of genres, among them
philology and artistic prose. His book The Misers is a collection of anecdotes that criticises
the social conditions of his time in a comic way.

4.4 Conclusion
To sum up, the 5-million word Classical Arabic corpus (CAC) is considered a pioneering
corpus for the following reasons:
1) It is an electronic corpus; this makes investigating Arabic a more accurate and faster
2) It is balanced; it covers a wide scope of written Arabic texts to be used for more than one
3) It is a monitor corpus; we will keep on maintaining it by adding more texts and genres.
4) More importantly, this corpus is synchronic: it deals with only one variety of Arabic
over a particular period of time, i.e. early Classical Arabic. This can make the study
based on it more consistent and more methodical.

Chapter Five: Lexical Collocation
5.1 Introduction
As the subject matter of this thesis is to look at synonymy in Arabic contextually through
lexical collocation, it would be sensible to give a brief account of the relationship that holds
between synonymy and collocation. Synonyms in their propositional sense can be substituted
for one another, as will be discussed in detail in chapter six, and collocation can only be
observed through repeated usage (Smadja, McKeown, and Hatzivassiloglou, 1996: 5). The two
involve different kinds of relations18: synonymy is a paradigmatic relation and
collocation is syntagmatic. We take the position that both types of sense relations,
paradigmatic and syntagmatic, are complementary to each other because words acquire
meaning from both axes. Through collocation we can distinguish one sense of a word from
another and determine whether seemingly synonymous words, for example, are real
synonyms or not. Collocation is, therefore, a device with which a particular sense of a word is

The relationship that a linguistic element has with other elements inside the sentence is called
syntagmatic. This is mainly a syntactic relation. Let us consider the following example:

(1) The work is interesting.

The word work in (1) above is syntagmatically related to the definite article the, the
copulative verb is to the adjective interesting, and the noun work to the adjective
interesting. Generally speaking, what is the first word that comes into your mind when you
come across a word like work? There are many possible answers, such as is, does, place, etc.
This is called a syntagmatic reply because it provides the phrase or the sentence with a
required syntactic form; it is the next word in the phrase or the sentence. Or the answer could
be words like job or career. This is called a paradigmatic reply because it chooses another
word from a set of semantically related words not mentioned in the sentence. Finally, if the
answer uses the same word but in a different form, this is called derivational.

18 Any two words can have a relation, but some relations may be more significant than others. Sense
relations are divided into three classes: paradigmatic, syntagmatic and derivational. The significance of a relation is
discussed by Cruse (2000: 145-47).

Collocation offers a clear-cut, practical way of looking at word meaning, as opposed to
conceptual analysis. Firth (1957) emphasised that the meaning of a word is
determined by its co-occurrence with other words. He called this phenomenon collocation, as
will be illustrated below. Likewise, Sinclair states, ‘meaning can be associated with a distinct
formal patterning’ (1991: 6). Such a trend could be an interpretation of Wittgenstein’s
statement that ‘the meaning of a word is its use in the language’ (1953: 20). With the
possibility of carrying out linguistic contextual analysis of large quantities of electronic texts,
we become more or less able to account for the interaction between meaning and syntactic
structure in an empirical way. Hanks (2000:1) argues that ‘corpus analysis shows that
differences in meaning (metaphorical and literal alike) are associated with different
phraseological and syntactic contexts. A list of phraseological norms derived from corpus
analysis corresponds to a cognitive profile of the word’s meaning.’ Stubbs also noted that
word meaning can be defined not only by individual words or grammatical structures but
also by collocations (Stubbs 1996: 89). As Fraas (in press, quoted in Stubbs (2001a)) noticed,
collocates provide observable evidence of word meaning.

5.2 Definition of Collocation

As mentioned above, one of the relationships that hold between words on the syntagmatic or
horizontal axis is collocation. Collocation has had significant currency in linguistics since
Firth’s Modes of Meaning (1957). Since then the term has been extensively used by
linguists to explain how words are related to one another and for other purposes.

Collocations can be defined as the co-occurrence of words, as can idioms, compounds and
clichés. Idioms are expressions in which the meaning of the whole cannot be understood from the
meanings of the parts, for instance kick the bucket, break a leg, etc. A cliché is defined as a
‘trite, stereotyped expression; a sentence or phrase, usually expressing a popular or common
thought or idea, that has lost originality, ingenuity, and impact by long overuse’ (the Random
House Dictionary of the English Language, 1967), for example the early bird catches the
worm; life sucks, and then you die; salt of the earth, etc. Compounds are built up of two or
more free morphemes in a single lexical unit. A collocation, on the other hand, is a group of
words that occur together more often than by chance.
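
The idea of occurring together ‘more often than by chance’ can be made concrete by comparing the observed count of a word pair with the count expected if the two words were distributed independently. The Python sketch below is only one simple way of doing so and is not the measure adopted in this thesis; the function name, the independence baseline and the toy text are illustrative assumptions.

```python
from collections import Counter


def observed_and_expected(tokens, first, second):
    """Return (observed, expected) counts of `first` directly followed by `second`."""
    n = len(tokens)
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    observed = pair_counts[(first, second)]
    # Chance baseline (an approximation): if the two words were independent,
    # the adjacent pair would be expected about f(first) * f(second) / n times.
    expected = word_counts[first] * word_counts[second] / n
    return observed, expected


# Invented toy text, for illustration only.
tokens = (
    "strong tea and strong coffee but weak tea and strong tea "
    "with cold milk and strong tea again"
).split()

observed, expected = observed_and_expected(tokens, "strong", "tea")
print(observed, round(expected, 2))
```

When the observed count clearly exceeds the expected one, the pair is a candidate collocation; with real data a proper significance measure and a much larger corpus would be needed.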

Mitchell uses the term ‘composite element’, under which collocation, idioms and compounds
can be subsumed (1971: 57). Following Mitchell, Cowie (1981: 224) refers to them as
composite units. He then distinguishes collocations from idioms in terms of the
substitutability of items. He points out that a collocation permits the substitution of at least
one of its constituent elements. An idiom, on the other hand, cannot undergo any
transformational process of substitution, transposition, expansion, etc. (ibid: 224). So, the
main distinction between collocation and idiom is that, unlike idioms, the meanings of
collocations can be predicted or deduced from the meanings of their parts. The degree of
fixedness varies from idiom to idiom, as some idioms are more frozen than others. For example,
kick the bucket is more fixed in terms of transformational or substitutional processes than idioms
like spill the beans, which can undergo passivisation, as in the beans have
been spilt. Specifically speaking, ‘idioms contain frozen parts that do not allow any sort of
substitutions’ (Gross, 1990: 16). Let us consider the following examples:

(2) John took the bull by the horns.

(3) John took Bill for a ride.
(4) John crossed swords with Bill.
(5) John cut the ground from under Bill’s feet.
(6) The game is not worth the candle.

The above sentences are idiomatic in the sense that we cannot understand the meaning of the
whole by understanding the meanings of its parts. In (2) above we can change only one part of
the sentence (i.e. the subject) for anything equivalent without losing the idiomatic
sense. In (3), both the subject and the object can be changed. In (4) and (5), only
John and Bill can be changed. In (6), only the tense
is free.
Nelson (2000) calls such a phenomenon of word packaging Multi-Word Items. He gives an
interesting brief account of the definitions and types of such multi-word items since 1864.

Below we present some definitions and examples of collocations, as well as
methods for their extraction and classification. Let us first look at some of the
definitions of collocation.

There have been many diverse definitions of collocations. For example:

1. Firth, who was the first to introduce the idea, defined collocation as the company that a
word keeps (Firth 1957:179). He illustrates his point by the following example: ‘One of the
meanings of ass is its habitual collocation with an immediately preceding you silly’ (1957:

2. ‘sequences of lexical items which habitually co-occur, but which are nonetheless fully
transparent in the sense that each lexical constituent is also a semantic constituent’ (Cruse,
1986: 40).

3. ‘Collocation is the occurrence of two or more words within a short space of each other’
(Sinclair, 1991: 170).

4. ‘The habitual co-occurrence of words’ (Stubbs, 1995b: 245).

5. ‘A sequence of words that occurs more than once in identical form.... and which is
grammatically well structured’ (Kjellmer, 1987:133).

6. ‘A recurrent co-occurrence of words’ (Clear, 1993:277).

7. ‘The co-occurrence of two or more lexical items as realisations of structural elements
within a given syntactic pattern’ (Cowie, 1978: 132).

8. ‘A collocation is an arbitrary and recurrent word combination’ (Benson, 1990).

9. ‘Two words co-occur if they are in the same sentence and are separated by no more
than five words’ (Smadja, 1993:151).

10. ‘a sequence of two or more consecutive words, that has characteristics of a syntactic and
semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived
directly from the meaning or connotation of its components’ (Choueka, 1988 quoted in
Manning 1999: 172).
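
Several of these definitions rest on a span of words around the word under study. As a minimal sketch, assuming pre-tokenised sentences and reading definition 9 as ‘at most five intervening words within the same sentence’, co-occurrence within a span could be counted as follows in Python. The function name cooccurrences, the parameter max_gap and the sample sentences are my own illustrative choices.

```python
from collections import Counter


def cooccurrences(sentences, node, max_gap=5):
    """Count words in the same sentence as `node` with at most
    `max_gap` words between them and the node."""
    counts = Counter()
    for sentence in sentences:
        for i, word in enumerate(sentence):
            if word != node:
                continue
            # A gap of max_gap words means a distance of max_gap + 1 positions.
            start = max(0, i - max_gap - 1)
            end = min(len(sentence), i + max_gap + 2)
            for j in range(start, end):
                if j != i:
                    counts[sentence[j]] += 1
    return counts


# Invented sentences, for illustration only.
sentences = [
    ["the", "judge", "gave", "a", "very", "long", "and", "careful", "ruling"],
    ["the", "ruling", "was", "final"],
]

within_five = cooccurrences(sentences, "judge", max_gap=5)
within_six = cooccurrences(sentences, "judge", max_gap=6)
print(within_five["ruling"], within_six["ruling"])
```

Because six words separate judge from ruling in the first sentence, the pair falls outside a span of five but inside a span of six, and the second sentence contributes nothing because the sentence boundary is respected.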

These definitions seem to share three main characteristics: the co-occurrence of at least two
words, the frequency of this co-occurrence and the fact that the whole chunk should occur
within a given span of words. First, however, these definitions do not mention how frequent a given
combination must be or whether a single occurrence in a corpus should be eliminated or not
(see 5.8.4 for more detail). Secondly, Choueka’s definition deals only with collocation of
adjacent words, which, as far as I know, is contrary to all linguistic definitions of collocation.
Thirdly, apart from Kjellmer, Cowie, Choueka, and Smadja, there is no syntactic condition
given. For Kjellmer and Cowie, the grammatical structure must be considered; the other two,
Choueka and Smadja, put it more specifically within the boundaries of the sentence. To me,
collocation in its general sense, without any sort of syntactic restriction, is more in
conformity with the commonly asserted claim that all words and expressions, regardless of
their syntactic position, are restricted in their distribution (van der Wouden 1997: 45). This
also enables us to investigate interrupted phrases of interesting distribution which we might
not be able to account for with a restricted definition of collocation. To pursue the premise I
shall, after addressing the types of collocations, discuss in more detail the questions of spans
and frequency to arrive at a definition which could be closer to the purpose of the present

5.3 Collocation and Colligation

The company that a word keeps could be lexical or grammatical. For instance, a word may
collocate freely with another lexical item or with a particular grammatical class. The former
is called collocation and the latter colligation. The term colligation was introduced by Firth,
who states,

The statement of meaning at the grammatical level is in terms of word
and sentence classes or of similar categories and of the inter-relation
of those categories in colligation. Grammatical relations should not be
regarded as relations between words as such – between ‘watched’ and
‘him’ in ‘I watched him’ – but between a personal pronoun, first
person singular nominative, the past tense of a transitive verb and the
third person singular in the oblique or objective form.
Firth (1957: 13)

Halliday makes a distinction between the collocational and grammatical levels, or lexis and
grammar, but notes that they are still interrelated. He uses the term lexical as a substitute
for collocational (1966: 152). He argues that ‘collocation is outside grammar: it has no
connection with the classes of the words. It is the lexical item, without reference to grammar,
that enters into collocation’ (1966: 20). For example, open in open the window, an open
window, or the opening of the window collocates with window in the same way irrespective of
its grammatical position.

To differentiate between grammatical and lexical levels, Halliday et al (1964: 32-33) note that
where there is a choice between different classes of language items at a place in structure we
have the grammatical level. For example, when we choose between this which is singular and
which is not that and between items like who, whose, what, which, we can account for
differences between such items grammatically. Other language items, on the other hand,
cannot be described in this way, as grammar cannot fully distinguish between items like table
and chair. Such items do not belong to grammar but to lexis.

Following Halliday, both McIntosh and Sinclair view grammar and lexis as separate.
McIntosh considers the distinction between grammar and lexis necessary ‘if the patternings
are to be economically stated or defined’ (1966: 183). He also states,
We can only preserve the simplicity of our grammatical description if
we are prepared from the start to let it be understood that there are
lexical factors, factors of collocational eligibility, which tend to rule
out of actual use a large number of ‘sentences’ (and smaller units)
even though they seem to conform to all the rules of grammatical
(McIntosh and Halliday, 1966: 183-84)

Sinclair (1987b: 322) emphasises that lexical and grammatical collocations are
just tendencies and choices. More recently, he has defined colligation as ‘the co-occurrence of words
with grammatical choices’ (2000: 10).

Mitchell (1971: 53) argues in favour of Firth’s approach but with additional implications. To
Mitchell, collocation is different from colligation in the sense that the former uses words and
the latter uses word-classes. Therefore, colligation can be defined as a class of collocations,
for example (‘motive’ verb + ‘directional’ particle). So the difference between
collocation and colligation is just a matter of generality.

Recently, Hoey (1998: 8) reintroduced the term colligation. He rejects the Halliday and
Sinclair approach and views colligation as necessary to account for the assemblages that
prefer to appear in a certain structure. He notes that lexical items tend to co-occur with other
lexical items in a certain grammatical position. For example, the different senses of the word
reason (cause, rational faculty, or logic) can be accounted for grammatically.
Counting the frequency of occurrence of each sense shows that reason in its sense of
‘cause’ occurs more frequently with the demonstrative deictics (like this, that, which(ever),
what(ever)) than with the possessive ones (like my, your, John’s, whose) (ibid). He
therefore defines colligation as ‘the grammatical company a word keeps’ (ibid), or, more
specifically, the grammatical item or class that a word tends to co-occur with.
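
Colligation in this sense can be approximated by counting the grammatical class of the word that immediately precedes a node. The Python sketch below is a toy illustration and not Hoey’s procedure: the small class lexicon follows the two groups of deictics listed above, while the function name, the fallback label ‘other’ and the sample text are assumptions made for the example.

```python
from collections import Counter

# Hand-made lexicon of deictic classes, for illustration only.
DEICTIC_CLASS = {
    "this": "demonstrative",
    "that": "demonstrative",
    "which": "demonstrative",
    "what": "demonstrative",
    "my": "possessive",
    "your": "possessive",
    "whose": "possessive",
}


def classes_before(tokens, node, lexicon):
    """Count the class of the word immediately preceding each occurrence of `node`."""
    counts = Counter()
    for previous, word in zip(tokens, tokens[1:]):
        if word == node:
            # Words missing from the lexicon are grouped under "other".
            counts[lexicon.get(previous, "other")] += 1
    return counts


# Invented text, for illustration only.
tokens = (
    "for this reason he left and for that reason she stayed "
    "but my reason was different and the reason is clear"
).split()

profile = classes_before(tokens, "reason", DEICTIC_CLASS)
print(profile)
```

A profile of this kind, computed separately for each sense of a word over a large corpus, is what would allow the senses to be compared in terms of their grammatical company.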

5.4 Types of Collocation

We have mentioned earlier that in the collocation literature distinctions are made between
grammatical and lexical collocation (see 5.3). Further distinctions are made between upward
and downward collocations (Sinclair 1987b). When the collocates around the search term
(node) are more frequently used than the node itself, this is called upward collocation. For
example, the search term back is less frequent than words like at, from, on, he, him, get. On
the other hand, downward collocation is when the collocates around the search term (node)
are less frequently used than the node itself, like back with arrive, bring, climbed, come.
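
The distinction reduces to a comparison of corpus frequencies, as the following minimal Python sketch shows. The frequencies are invented for the sake of the example and are not taken from any corpus; the function name split_collocates is likewise hypothetical, and collocates that are exactly as frequent as the node are left unclassified because the definition above does not cover that case.

```python
def split_collocates(node_freq, collocate_freqs):
    """Split collocates into upward (more frequent than the node)
    and downward (less frequent than the node)."""
    upward = sorted(w for w, f in collocate_freqs.items() if f > node_freq)
    downward = sorted(w for w, f in collocate_freqs.items() if f < node_freq)
    return upward, downward


# Invented frequencies, for illustration only.
node_frequency = 1_000  # hypothetical frequency of the node "back"
collocates = {
    "at": 5_200,
    "from": 4_100,
    "on": 6_800,
    "arrive": 60,
    "bring": 150,
    "climbed": 40,
}

upward, downward = split_collocates(node_frequency, collocates)
print(upward)
print(downward)
```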

Some authors, like Emery (1988), Smadja (1993) and Lewis (1993), make another distinction,
identifying several types of collocation according to the degree of
collocational strength and currency.

Smadja (1993) identifies three types of collocations: 1) rigid noun phrases, 2) predicative
relations, and 3) phrasal templates. The first is the most fixed type of collocation; it is a
sequence of words that cannot be broken or interrupted without losing the meaning of the
phrase, such as stock market and foreign exchange. The second, which is the most flexible
type yet the hardest to identify, is made of ‘two words repeatedly used together in a similar
syntactic relation’ (ibid.: 148); for example, to make a decision can be realised in several ways, such as
made an important decision, decisions to be made, etc. The third type is characterised as long
and domain-driven collocations. ‘Phrasal templates consist of idiomatic phrases containing
one, several, or no empty slots’ (ibid.: 149), such as the often repeated sentence in weather
reports: temperatures indicate previous day’s high and overnight low to 8 a.m.

Emery (1988), following Cowie (1981), sees collocations as a scale at the end of which lie
idioms. He observes three types of collocation: open collocation (which is free word
combination), restricted, and bound (rigid). The last is considered ‘a bridge category between
collocations and idioms’ (Cowie 1983: 228). Open collocations contain elements which can
be used with different words without a big difference, such as bada’at/intahat
alh{arb/almacraka ‘the war/the battle began/ended’ (Emery 1988). Emery does not count
such combinations as collocations because they are unrestricted by usage (ibid.: 27).
Combinations of words that select each other not only in terms of semantics (as in open
collocation) but also by usage are called restricted collocations (Aisenstadt, 1979 quoted in
Emery, 1988). Cowie (1983: xiii quoted in Emery: ibid) puts a condition for such restricted
collocation that one item of the combination should have a figurative sense. Compromising
between Cowie’s and Aisenstadt’s views, Emery says, ‘in a restricted collocation, one (but
not more) of the elements may be either literal or figurative’ (ibid: 27). For example, the verb
in explodes a myth/a belief has a figurative sense whereas that in clench one’s teeth is literal.
Finally, in bound collocation ‘one of the elements is uniquely selective of the other’ (ibid.:
29), such as foot the bill.

In this study I will take the view that the phenomenon of collocation should be understood as
a cline along which we may locate different degrees of collocation, from fixed
(rigid/bound) through semi-fixed (restricted) to free (flexible).

In fixed collocations the replacement of individual words is not allowed, whereas
collocations in which individual words may be replaced by certain other words are called
free collocations. For example, in the free collocation أمر القاضي amara alqaad}i ‘the judge
commanded’,19 أمر amara can be replaced by certain other verbs such as حكم h}akama
‘sentenced’ and قضى qad}a ‘made a judgement’. All three possibilities are collocations and
have the same meaning. But in the fixed collocation تربت يداك taribat yadaak ‘may your
hands become dusty’20 there is no alternative to the noun يداك yadaak ‘your hands’. The sounds
of animals and birds, in Arabic or in English, can be subsumed under fixed collocations. For
example, the words for the sounds made by dogs and donkeys, نباح nubaah} ‘barking’ and نهيق
nahiiq ‘braying’, are strongly bound to such contexts; there is no other word to describe
the dog’s or the donkey’s sound in normal language use, i.e. in non-metaphorical expressions.
Free collocations, on the other hand, are words that are likely to co-occur in
infinitely creative ways (Lewis 1993). The third type is restricted collocation, which
constitutes the majority of Arabic collocations and falls halfway between fixed and free
collocation. It is a combination of two or more words which attract one another syntactically,
semantically and by usage, for example sakaraatu al-mawt ‘death throes’, al-dunya wa
al-‘aakhirah ‘this world and the hereafter’, al-ghiiybah wa al-namiimah ‘backbiting’, etc.
This discussion of collocation is evidently semantically based, which gives intuition a free
hand in identifying collocations. In section 5.8 we will embark on a more methodical way of
identifying collocations using corpus-based methodology.

19 To show the differences between near synonyms I translate them literally.

20 This means that you will have nothing to sleep or sit on and will become poor.

Unlike Arabic, strong (fixed) collocations in English are relatively few (Lewis & Hill 1998,
quoted in Nelson 2001). This is because Classical Arabic has a very rich and varied
vocabulary with highly specific meanings. It is also remarkable for its abundance of near
synonyms. While some languages have a single word to describe one thing, Arabic has
hundreds. For example, there are over 500 words for ‘lion’, 200 for ‘snake’, each with a
specific connotation (Ibn Faris, al-s}aahibi, p. 21). In his investigation of Arabic
collocation, Hoogland (1993) concentrated on restricted collocation because, as he argued, it
constitutes a large and unpredictable category.

Such abundance in vocabulary is a treasure trove that can let words select particular words
without repetition. Therefore, it is expected for any researcher on Arabic collocations to be
swamped by a huge amount of collocations varying from free to fixed.

5.5 Spans
Jones and Sinclair (1974: 21) use the term ‘span’ to refer to the number of lexical items on
each side of the word under investigation (the node). They prefer a span of
four words to the right of the node and four to the left. Later, Sinclair (1991) proposes a short
span of no more than five on each side of the search term. Others, like Martin et al (1983
quoted in Kenny 1999: 70), think that five words to the left and five to the right are enough.
More practically, Berry-Rogghe (1970) carried out experiments to arrive at an optimal span
on a corpus consisting of three works: A Christmas Carol by Charles Dickens, Each in
his own wilderness by Doris Lessing and Everything in the Garden by Giles Cooper.

With a span of three words, the collocates of the word house included words
like: sold, decorate, this, empty, buying, painting, opposite, loves, outside, full, my. When the
span was increased to six words, irrelevant words such as Bernard, God, etc. found their way in
as standard collocates. Berry-Rogghe found intuitively that four words is the optimal span, as it
is long enough to produce an optimal number of relevant counts.
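The effect of span size can be explored with a minimal sketch like the one below, assuming a corpus that has already been tokenised into a list of words. The function name and the default span of four are illustrative choices, not a description of any particular concordancer.

```python
from collections import Counter


def collocates_in_span(tokens, node, span=4):
    """Count the words occurring within `span` tokens on each side of the node."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left = tokens[max(0, i - span):i]        # up to `span` words before the node
        right = tokens[i + 1:i + 1 + span]       # up to `span` words after the node
        counts.update(left + right)
    return counts


# Toy sentence standing in for a corpus (invented for illustration).
tokens = "they sold the empty house opposite my office".split()
print(collocates_in_span(tokens, "house", span=2))
print(collocates_in_span(tokens, "house", span=4))
```

Running the same function with different values of `span` shows how a wider window admits more candidates, which then have to be judged for relevance.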

Limiting the span to a number of orthographic words, as proposed by Sinclair, does not
always work in Arabic. Since Arabic differs from English in its grammatical structure, the
question of span needs careful treatment. As mentioned in Chapter Three, the range of the
Arabic sentence can be somewhat larger than that of the English one. For instance, a verb may
be separated from its subject or complement by a distance extending over a number of lines,
provided the verb contains a referential pronoun, irrespective of how many words intervene
between them.

We can fix the span at two or more according to the mobility of the linguistic items concerned.
For example, prepositions in Arabic tend to precede their objects without any
interruption, so a span of two words on each side of the search term would be enough.
Moreover, some items may be modified by material that extends beyond the
concordance line. Even so, there may still be plenty of occasions where two such items
appear close enough to each other. However, I would like to use the corpus results as
fully as I can because my corpus is not very big and I do not want to miss the co-occurrences of
particular words because of the distance between them. At the same time, I cannot examine all
the occurrences that extend over a concordance line. Therefore, a flexible span, ranging
from two to seven according to the particular item under investigation, would be more realistic.

We can decide the size of the span according to the grammatical position of the category to be
examined. For example, to examine idiomatic verbs, we can easily search for their immediate
constituents to study what particles can follow such verbs. So, a span of five will be
sufficient for such a study. In some cases we might extend the span to include as many items as
we can from the concordance line, as in studying nouns, transitive verbs, etc., which are more
likely to have relations across the text. It all depends on a first reading of the
concordance lines before undertaking any analysis.

One disadvantage of MonoConc is that it cannot capture the frequency of
collocations consisting of more than three words. While in other programs like WordSmith it
is possible to have up to 25 words on each side of the search-term, MonoConc can provide at most
three on either side. Therefore, for a longer span, I have to save the concordance file as a
Microsoft Word document and process it there.

5.6 Semantic Prosody
Just as words collocate with a particular grammatical class (colligation),
they also collocate with a semantic class of words, a phenomenon called semantic
prosody. Louw (1993: 157) defines semantic prosody as a ‘consistent aura of meaning with
which a form is imbued by its collocates.’ It was Louw who gave this phenomenon its name,
although the idea had been known long before he coined the term. Sinclair
(1987b: 322) noted that ‘many uses of words and phrases show a tendency to occur in a
certain semantic environment.’ He further argued that words seem to co-occur in a certain
semantic profile, with either positive or negative connotations. For example, the verb happen
collocates with unpleasant things such as accidents (Sinclair 1991: 112) and the phrasal
verb set in occurs primarily with words which refer to unpleasant states of affairs, such as rot,
decay, malaise, despair, ill-will and decadence (1991: 70ff). Hence, we can conclude that the
study of semantic prosody is a useful way of bringing pragmatic information
into collocational analysis. McIntosh (1966), as mentioned above, proposes that items like
chair, seat, and sofa are all likely to occur, or collocate with, the items sit and comfortable
and so they are all members of the same class, which share the same probability of
occurrence, i.e. which have the same range of collocations.

The collocational range is defined as the full set of collocates of a single node
in a particular text or corpus, i.e. all the collocates that a given search-term has across that
text. On the one hand, one or more of the collocates in this range can be used as
a semantic category label for the others (comfortable, for example). On the other hand,
semantic prosody is the phenomenon for which a common semantic feature among the
collocates provides evidence.

The notion of semantic prosody, later termed discourse prosody (Stubbs, 2001a), is further
enhanced by Stubbs (1995a), where he highlighted a similar tendency towards negative or
positive semantic prosody of collocates. He also noted that collocation can be simply defined
as the semantic feature which stretches over several units (2001b), describing the
phenomenon as the connotations that words have when they occur together (1996: 172).

In Arabic, I studied the lemmas سنه sanah and عام caam by looking at their occurrences in
CAC. It turned out that سنه sanah ‘year’ and عام caam ‘a year’, which are widely regarded as
synonyms, are used in different contexts. The corpus provides many unpleasant collocates for
sanah and only one pleasant one, as in (a) below:

a) punishment, inflation, hardship, drought, infertility, destruction, worse, wars, weakness,


The examples in (a) above are the most frequently recurring left collocates of the word
sanah. These collocations can be summarised as follows:

• To refer to a bad experience that happened during this year, like a drought, a plague
or a common crisis.

On the other hand, negative collocates of caam are very scarce. For the
positive collocates the corpus shows examples such as those in (b) below:

b) goodness, bride, provision, fertile, support

The corpus-based analysis shows us how each word has its own preferred collocates and
a relatively different distribution. There are some neutral collocates, which seem negative21,
shared by sanah and caam. Such collocates are considered neutral as they refer to a certain
historical incident, such as عام الطاعون caam al-t}acuun ‘the year of the Plague’ and عام الحزن
caam al-h{uzn ‘the year of sadness’. Such incidents became milestones in Muslim history, so they
designate a period of time and do not carry any positive or negative sense. The only truly
negative word that collocates with caam is drought, which is also shared by sanah.

One clear piece of evidence does come from the Qur’anic verse that states,

21 Following Stubbs, we will judge collocates as negative or positive intuitively.

(7) وَلَقَدْ أَرْسَلْنَا نُوحًا إِلَىٰ قَوْمِهِ فَلَبِثَ فِيهِمْ أَلْفَ سَنَةٍ إِلَّا خَمْسِينَ عَامًا فَأَخَذَهُمُ الطُّوفَانُ وَهُمْ ظَالِمُونَ

And indeed We sent Nuh (Noah) to his people, and he stayed among
them a thousand (sanah) years less fifty caam (years) [inviting them to
believe in the Oneness of Allah (Monotheism), and discard the false
gods and other deities], and the Deluge overtook them while they
were Z{aalimuun (wrong-doers, polytheists, disbelievers, etc.).
(Qur’an: 29: 14)

where sanah and caam are used together to refer to different stages of the life of the Prophet
Noah22, who suffered a great deal in calling his people to belief in God until God destroyed them
by flood. Hence, the word sanah is used with reference to the first stage of his life, which was
full of hardships, and caam is used for the rest of his life. In Modern Standard Arabic the
frequent use of caam on happy occasions is quite evident. Egyptians are very likely to say
when congratulating one another on a new year: ‘caam saciid’ (happy new year), but much
less likely to say: ‘sanah saciidah’. More interestingly, sanah can encapsulate the meaning
of its collocate, as we can drop that collocate and use sanah alone to give the same meaning.
Consider the following example in (8) below:

(8) أصابت الناس سنه

as}aabat al-naasa sanah
befell the-people year
The people went through an infertility of the soil, drought or a famine.

In (8) sanah is used to describe a hardship that befell people, which could be
infertility of the soil, drought or famine. This use has found its way into the Arabic lexicons,
where sanah is given as an equivalent of infertility of the soil, drought or famine (cf. Lisaan
Al-Arab and Al-Muheet).

22 According to Muslim literature, the Prophet Noah lived among his people for 950 years, working hard to
guide them to Allah. Unfortunately, they did not believe, so Allah destroyed them by flood and saved Noah and
the believers. So Noah lived a tiring life, for 950 years, before the flood. Then Noah and the believers
repopulated the earth in peace and serenity.

5.7 Extraction of Collocation
Collocations can be identified intuitively, semantically, lexically or quantitatively. McIntosh
(1966: 194) says that our experience of the meanings that a given word has in a certain
context sheds light on what words it collocates with and what range of collocations they have.
For example, the lexical items: chair, seat, and sofa are all likely to occur, or collocate with,
the items sit and comfortable and so they are all members of the same class which have the
same range of collocations. This is due to our experience with such items in a variety of
contexts. Firth views such a phenomenon as a relation of mutual expectancy and as an
inseparable part of the native speaker’s knowledge of his own language, i.e. competence
(Emery 1986). However, such an approach cannot establish what is more frequent or typical
in language use, whereas with a corpus we can discover interesting aspects of our language
which could not be arrived at by introspection. In addition, by using advanced technology in the
field of corpus linguistics we can assess the problem more accurately and quickly. I gave a
detailed account of the credibility of intuition vs. empiricism in Chapter Three.

To assess a given collocation, we can resort to semantics. Cruse (1986) makes a distinction
between two types of semantic co-occurrence restrictions: (1) selectional restrictions which
can be defined as ‘semantic co-occurrence restrictions which are logically necessary’ (p.
278), (2) collocational restriction, which is defined as ‘co-occurrence restrictions that are
irrelevant to truth conditions’ (p. 279). For example, the verb die in John died, the tree
leaves died and *the book died needs to be preceded by a (+animate) grammatical subject;
this is called selectional restriction. Further semantic requirements are needed in sentences
like John kicked the bucket, *the cow kicked the bucket and *the tree kicked the bucket. The
lexical item kick the bucket requires in addition to the (+animate) feature another restriction,
which is (+human). Restrictions of this type are called collocational restrictions. In short, the
semantic approach tries to define collocations by the actual meanings they have and by the
usefulness of combinations of words in different contexts.

The lexical approach23 concentrates on the language as a complete unit; it does not make a
distinction between grammar and vocabulary. This approach differs from the semantic one in
that the latter tends to account for all the relations that hold among lexical occurrences ‘in a
semantically motivated way’ (as in Cruse’s collocational restriction) (Emery, 1988: ch.1.2.3).
The lexical approach, on the other hand, looks at collocation as a matter of a
combinatorial process, without giving any explanation. It does not explain why a given lexical
item collocates with another lexical item (Lehrer, 1974: 176). Therefore, in this approach we
can easily make use of computer analysis of large corpora to focus on high frequency
language and to highlight typical patterns of language use.

23 The lexical approach deals not only with individual words, as might be understood, but also with larger units,
i.e. the word combinations that we store in our minds.

However, Lehrer (1974: 173) criticised both approaches: the lexical approach does not give
an explanation for the co-occurrence of lexical items whereas the semantic approach cannot
account for the combinations that are arbitrarily restricted. Therefore, she argued for an
eclectic view that combines aspects from both approaches.

It is not our goal to discuss in detail the various methods of extracting collocations, i.e.
intuitively, semantically or lexically. I am rather more concerned with applying the most
commonly used methodology, statistics, in extracting collocations from Arabic corpora. This
approach tries to define collocations by the frequency of certain word combinations in a text.

5.7.1 Using statistics in collocation extraction

Since the main goal of my study is to use a corpus to investigate language use in Arabic and
to demonstrate the potential impact of computational methods on Arabic linguistic studies, it
is not feasible to study all the texts manually. The whole corpus is too large to deal with in its
entirety. This is because ‘the unaided human mind simply cannot discover all the significant
patterns, let alone group them and rank them in order of importance’ (Church et al 1990: 26-

In Chapter Three we discussed the concordance as a means of processing corpora. The
available concordancing programs offer many functions: frequency lists, word
associations, etc. (cf. Barnbrook 1996 for more details). However, human intervention is
needed to run, edit and analyse such concordances. Concordances can only help us find the
words under examination in their environments, as shown in figure (5.2) below.

Lemmatisation
When examining a word, it is often useful to consider its different forms
together. In doing so, I faced some problems when searching for words as base-forms
(lemmas). In English we can, to a great extent, search for a word irrespective of
grammatical changes such as tense or plurality by means of a wild card search, which can
provide all possible forms of a given word. For example, if you search for the lemma ‘play’
using a wild card, the output will include words like ‘plays’, ‘played’, ‘playing’ and so on. On
the other hand, using wild cards with Arabic to get all related word classes (verbs, nouns,
adverbs, etc.) produces output that needs exhausting hand-editing before proceeding to any
assessment. It would be difficult to search for the lemmas without some sort of human
intervention such as editing our automatic counts. This is because Arabic is an inflected
(synthetic) language where affixes have a different function from those in non-synthetic
languages like English. In Arabic a lemma is actually the stem of a set of forms (hundreds or
thousands of forms in each set) that share the same morphological, syntactic or semantic
features (Dichy, 2001 and Kamir, 2002). For example, if we search for the word سنة sanah ‘a
year’ we will find many forms such as سنوات sanawaat and سنين siniin (fem. & masc. pl.
‘years’), سني saniy (pl. ‘years’ in genitive case), السنة al-sanah ‘the year’, سنته sanatahu ‘his
year’, سنتها sanataha ‘her year’, سنينه siniinahu ‘his years’, سنينها siniinaha ‘her years’, سنينهم
siniinahum (masc. ‘their years’), سنينهن siniinahun (fem. ‘their years’), سنتان sanataan (dual in
nominative case ‘two years’), سنتين sanatayn (dual in accusative and genitive case ‘two years’).
Although this seems a very simple search, at the time of writing this thesis we could not find a
program which combines the features of a concordancer and an Arabic stemmer.
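The contrast between the two languages can be illustrated with a minimal sketch of a wild card search, assuming that `*` stands for any run of letters and that the corpus is a list of unvowelled tokens. The helper names are hypothetical, and the way wild cards are placed between the root letters of ktb is only one plausible way of formulating such a query; it is not a description of how any particular concordancer works.

```python
import re


def wildcard_to_regex(pattern):
    """Turn a wild card pattern ('*' = any run of letters) into a regular expression."""
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(r"\w*".join(parts))


def wildcard_search(tokens, pattern):
    """Return the tokens that match the wild card pattern as whole words."""
    regex = wildcard_to_regex(pattern)
    return [tok for tok in tokens if regex.fullmatch(tok)]


# English: a trailing wild card retrieves the inflected forms of 'play'.
print(wildcard_search(["play", "plays", "played", "playing", "display"], "play*"))

# Arabic: wild cards around the root letters k-t-b also retrieve yaktasib,
# a word from another root that happens to contain the same letters in order.
print(wildcard_search(["كتب", "كتابه", "يكتسب"], "*ك*ت*ب*"))
```

The second query returns all three tokens, including the irrelevant يكتسب yaktasib, which is the kind of hit that has to be removed by hand.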

Apart from Xerox’s morphological analyser24 and Buckwalter’s25, we do not have at the time of
writing a public domain lemmatiser which can work on Arabic, because of Arabic’s
non-concatenative word formation system and the other idiosyncrasies mentioned above. Xerox’s
morphological analyser was first made for the company’s research purposes in 1997. Only in
2002 were they able to produce an improved commercial version for teaching purposes and as a
component in larger natural language processing systems. This program, which supports Arabic
script, works on a two-level morphological analysis: (1) roots and patterns; (2) affixes,
enclitics and function words which are normally attached to words as prefixes. The program
uses a very limited Arabic dictionary of 4930 roots. It can deal with all words with or without
diacritics. However, it analyses words separately from their contexts, which might produce
some ambiguous forms. For example, if you search for a word like ktb, it shows all the meanings
of the root ktb without a specific reference to the word in context (

24 (

Tim Buckwalter also wrote, in Perl, a morphological analyser for Arabic. Buckwalter’s Arabic
Morphological Analyser was created for POS-tagging Arabic text. The analyser consists
primarily of three Arabic-English lexicon files: prefixes, suffixes, and stems. This program is
now available through LDC. However, the program only deals with Modern Standard Arabic.
Secondly, contrary to mainstream Arabic writing practice, which ignores diacritics, Buckwalter
included diacritics in his lexicons. In his view, ignoring diacritics could lead to
misinterpretation and misanalysis of Arabic lexemes.

The wild card search is sometimes useful with Arabic, when the search-term is not polysemous
or the base word has limited potential for word building. For example, using a wild card
search with z{anna is not problematic because, first, this word is not polysemous.
Secondly, there are only a few irrelevant instances that contain the same root letters but
do not belong to the search-term, such as lafaz{ani, hafaz{ani, haz{una, ayqaz{ani.
In these examples the second letter, n, is not originally a root letter; it is
rather a suffix added for a morphological reason. Otherwise, most Arabic words need
extensive hand-editing because the absence of vowels in Arabic writing makes these
forms morphologically identical. This makes the process of singling out the search-term quite
complicated and sometimes ambiguous.

Biber et al. (1998: 91) propose a statistical way of editing such data26. Their procedure is
meant to remedy tagged corpora, where we may meet irrelevant grammatical categories
(i.e., categories which are not under examination). However, I think that, with slight
refinement, this method is also useful for work on raw corpora, to exclude the counts which are
inaccurate morphologically27. The steps, which involve a lot of hand-editing based on intuition,
are as follows:
1. select a random sample of the counts of the word under investigation,
2. edit it by hand,
3. compute the proportional use of the irrelevant counts in the sample,
4. multiply the total number of your counts, in the corpus, by the proportion computed in
step 3.
To guarantee the accuracy of this procedure Biber et al. (ibid: 91) suggest ‘more than one
random sample should be taken from each category in order to make sure that the proportions
are similar across samples’.

Let us consider the following example. In CAC I found 10447 instances of the base-form
ktb. Let us select a random sample28 of 2402 hits, including relevant and irrelevant ones. Having
edited the sample, I found that 290 hits are irrelevant; some of these are derived from the
same root while others do not belong to the root (see table 5.1 below). The proportion of
irrelevant examples can be calculated as follows:

290 x 100 / 2402 = 12%

Thus the estimated total number of irrelevant forms of ktb is:
12% x 10447 = 1254
On the other hand, the estimated total number of relevant forms is 88% x 10447 = 9193
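The proportional calculation can be sketched in a few lines, assuming only the three figures given in the example (total hits, sample size, and irrelevant hits in the sample). The function name is hypothetical. Note that the sketch keeps the unrounded proportion, so its estimates differ slightly from those obtained by first rounding the proportion to a whole percentage.

```python
def estimate_irrelevant(total_hits, sample_size, irrelevant_in_sample):
    """Scale the share of irrelevant hits found in a hand-edited sample up to the whole corpus."""
    proportion = irrelevant_in_sample / sample_size
    irrelevant = round(total_hits * proportion)
    relevant = total_hits - irrelevant
    return proportion, irrelevant, relevant


proportion, irrelevant, relevant = estimate_irrelevant(10447, 2402, 290)
print(f"{proportion:.1%} of the sample is irrelevant")
print(f"estimated irrelevant hits: {irrelevant}, estimated relevant hits: {relevant}")
```

Taking more than one random sample and comparing the resulting proportions, as Biber et al. suggest, would amount to calling the function once per sample.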


Category                                   Example               Count   Proportion
Relevant counts                            Kataba, yaktubu, ….   2112    88%
Irrelevant counts based on the same root   kateebah              78      3%
Irrelevant counts from other roots         Yaktasib              212     9%

Table (5.1): This table shows how misleading it can be to search on an Arabic raw corpus
without hand-editing.

26 By data I mean the total number of occurrences of the word under investigation.
27 This procedure is useful to know the frequency of a given word without editing the whole corpus, by
identifying the wrong and correct forms through a small sample.
28 I counted just the hits with maximum frequency of 100 and minimum frequency of 5.

We can see that the irrelevant counts can be estimated proportionally. Using the above
statistical methodology in editing our data saves time and effort in hand-editing. In other
words, instead of going through the whole corpus to eliminate the irrelevant forms of a given
search-term, we can select a sample, edit it manually and run a proportional calculation
as shown above.

Concordances
With KWIC (key word in context) we can search the whole corpus in a way that saves
time and effort compared with looking up each occurrence of the word under investigation.
Figure 5.2 below gives a list of occurrences of the word ktb obtained by a wild card search.
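A KWIC listing of this kind can be sketched as follows, assuming a tokenised text and an exact-match keyword. The function name, the context width and the double square brackets used to mark the keyword are illustrative choices; a real concordancer would also support wild cards and would usually align the keyword in a central column.

```python
def kwic(tokens, keyword, width=5):
    """Return one line per occurrence of the keyword, with `width` words of context on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [[{tok}]] {right}".strip())
    return lines


# Toy sentence standing in for a corpus (invented for illustration).
tokens = "the judge wrote a letter to the king and the king wrote back".split()
for line in kwic(tokens, "wrote", width=3):
    print(line)
```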

‫ د بن داود الدينوري رحمه الله وجدت فيما [[كتب]] أهل العلم بالخبار الولى أن آدم‬.....1
.… ‫عليه‬

‫‪ ... .2‬من عاد كما قد قصه الله تبارك وتعالى في [[كتابه]] وهو أصدق الحديث‪ .‬قال‪ :‬ونشأ‬
‫في ذلك الدهر ‪...‬‬
‫‪ ... .3‬لم يرعووا فأهلكهم الله عز وجل كما نص في [[كتابه]] وهو أصدق الحديث‪ .‬ويقال‪ :‬إنه‬
‫كان بين مه ‪...‬‬
‫‪ ... .4‬إياه بتكليمه ورسالته ما قد قصه علينا في [[كتابه]] وانصرف إلى شعيب ورد أهله إليه‬
‫ومضى حتى ‪...‬‬
‫‪ ... .5‬عيب إلى قومه فكان منهم ما حكاه الله في [[كتابه]]‪ .‬أبرهة قالوا‪ :‬ثم ملك أرض‬
‫اليمن أبرهة ب ‪...‬‬
‫‪ ... .6‬هي بلقيس ما قد قصه الله تبارك وتعالى في [[كتابه]] إلى أن تزوجها وبنى بأرض‬
‫اليمن ثلثة حص ‪...‬‬
‫‪ ... .7‬خزائن من خزائنه وإن عبد الملك بن مروان [[كتب]] إلى عامله في بلد المغرب‬
‫موسى بن نصير ‪... -‬‬
‫‪ ... .8‬بن دارا تجبر واستكبر وطغى‪ .‬وكانت نسخة [[كتبه]] إلى عماله‪ :‬من دارا بن دارا‬
‫المضيء لهل ‪...‬‬
‫‪ ... .9‬وغربها ليعامل الناس على قدر فلما انتهى [[كتابه]] إلى دارا بن دارا غضب من ذلك‬
‫غضباً شديدا ‪...‬‬
‫‪ ... .10‬يزل يؤديها إلينا أيان حياته فإذا أتاك [[كتابي]] هذا فل أعلمن ما بطأت بها فأذيقك‬
‫وبال أ ‪...‬‬
‫‪ ... .11‬عذرك والسلم‪ .‬دارا والسكندر فلما ورد [[كتابه]] على السكندر جمع إليه جنوده‬
‫وخرج متوجه ‪...‬‬
‫‪ ... .12‬د فعلت‪ .‬ثم أمر بهما فرجما حتى ماتا‪ .‬ثم [[كتب]] إلى أم دارا وامرأته بالتعزية وهما‬
‫بمدي ‪...‬‬
‫‪ ... .13‬لك سرت فكتبت إليه‪ :‬إن الذي حملك على ما [[كتبت]] به فرط بغيك وعجبك بنفسك‬
‫فإذا شئت أن تسي ‪...‬‬
‫‪ ... .14‬ما ذقت من غيري والسلم‪ .‬فلما رجع جواب [[كتابه]] أرسل إليها بملك مصر وكان‬
‫في طاعته ليدعو ‪...‬‬
‫‪ ... .15‬قصته وبنائه الردم ما قد أخبر الله به في [[كتابه]] فسألهم عن أجناس تلك المم‬
‫فقالوا‪ :‬نحن ‪...‬‬
‫‪ ... .16‬ى بئر الملك فكان من قصته ما هو مشهور قد [[كتبناه]] في غير هذا الموضع‪ .‬قالوا‪:‬‬
‫ولما ابتعث ا ‪...‬‬
‫‪ ... .17‬إليه أردشير بالدخول في طاعته فلما أتاه [[كتابه]] امتل غيظاً وقال لرسله‪ :‬لقد‬
‫ارتقى ابن س ‪...‬‬
‫‪ ... .18‬ابي وفيرك الذي تدعى مرتبته مهران وجودرز [[كاتب]] الجند وجشنساذربيش كاتب‬
‫الخراج وفناخسرو‬
‫‪ ... .19‬بته مهران وجودرز كاتب الجند وجشنساذربيش [[كاتب]] الخراج وفناخسرو صاحب‬
‫صدقات الملكة‬

‫‪ ... .20‬نصارى الهواز يقال له يزدفنا‪ .‬وأن قيصر [[كتب]] إلى كسرى يسأله الصلح ورد ما‬
‫احتوى عليه ‪...‬‬
‫‪ ... .21‬الموادعة فأجابه قيصر إلى ذلك فانصرف ثم [[كتب]] إلى عماله بأرمينية وأذربيجان‬
‫فاجتمعوا و ‪...‬‬
‫‪ ... .22‬أذن لعظماء أصحابه فدخلوا عليه ثم أقرأهم [[كتاب]] الملك إليه فلما سمع أصحابه‬
‫ذلك يئسوا م ‪...‬‬
‫‪ ... .23‬ر بمدينة همذان ارتاب بابن عمه ذلك وكتب [[كتاباً ]] إلى الملك يعلمه‪ :‬أنه قد رده‬
‫إليه ليأمر ‪...‬‬
‫‪ ... .24‬لى محبسه فإنه فاجر فتاك وقال له‪ :‬إني قد [[كتبت]] إلى الملك كتاباً في بعض‬
‫المور فأغذ ال ‪...‬‬
‫‪ ... .25‬اجر فتاك وقال له‪ :‬إني قد كتبت إلى الملك [[كتاباً ]] في بعض المور فأغذ السير به‬
‫حتى تدفعه ‪...‬‬
‫‪ ... .26‬منه الساعة حين أخبرت بإدمانه النظر في [[كتاب]] كليلة ودمنة لن كتاب كليلة‬
‫ودمنة يفتح ‪...‬‬
‫‪ ... .27‬ت بإدمانه النظر في كتاب كليلة ودمنة لن [[كتاب]] كليلة ودمنة يفتح للمرء رأياً‬
‫أفضل من ر ‪...‬‬
‫‪ ... .28‬ابزين والنخارجان وسابور بن أبركان ويزدك [[كاتب]] الجند وباد بن فيروز وشروين‬
‫بن كامجار و ‪...‬‬
‫‪ ... .29‬ر هرمزد جرابزين حتى دخل على خاقان ومعه [[كتاب]] كسرى وأوصل إليه هدايا‬
‫كسرى وألطافه فقبل‬
‫‪ ... .30‬م في بلدهم فأجابوهم إليه وكتبوا بينهم [[كتاباً ]]‪ :‬أل يتأذى أحد بأحد فأقاموا آمنين‬
‫واتخ ‪...‬‬

Figure (5.2): A sample of the concordance of the base-form (lemma) ktb in CAC.

Such a facility is useful enough when there are only a few lines to look into. But with
thousands of lines, the human mind can be overwhelmed by the volume of data. Statistical
techniques can help us go deeper and reveal what we might not have observed with the naked
eye.

Today, some software for analysing concordance lines statistically is available. One of the
early attempts to use statistics to analyse corpora automatically was Choueka et al (1983).
They proposed an algorithm to retrieve collocations automatically from texts. However, their
work can only deal with a particular type of collocation: uninterrupted bigrams.

Church and Hanks (1990) proposed a measure to estimate collocations directly from
electronic corpora. This measure, which is called the association ratio, is based on the
Mutual Information statistic. It can retrieve interrupted word pairs but is
limited to collocations that contain no more than two words.
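The mutual information statistic underlying such a measure compares the observed probability of two words occurring together with the probability expected if they were independent. The sketch below is a minimal illustration, assuming that the pair count, the two word counts and the corpus size are already known; the function and parameter names are hypothetical, and details such as the window within which pairs are counted are left out.

```python
import math


def association_ratio(pair_count, x_count, y_count, corpus_size):
    """log2 of P(x, y) / (P(x) * P(y)), with probabilities estimated from raw counts."""
    p_xy = pair_count / corpus_size
    p_x = x_count / corpus_size
    p_y = y_count / corpus_size
    return math.log2(p_xy / (p_x * p_y))


# Invented counts for illustration: the pair occurs 8 times as often as chance predicts.
print(association_ratio(pair_count=8, x_count=100, y_count=100, corpus_size=10000))
```

A value well above zero suggests that the two words attract each other, a value near zero that they co-occur about as often as chance would predict.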

To remedy such drawbacks, Smadja (1991) designed a program called Xtract that can make
statistical observations in collocation extraction. He used statistical methods such as z-score
to identify relevant pairs of words. ‘Xtract retrieves interrupted as well as uninterrupted
sequences of words and deals with collocations of arbitrary length’ (Smadja, 1993: 150).

Statistical programs such as Collocate and Typical (Sinclair et al: 1998) can also analyse the
lexical context of words under examination. Collocate is designed to assess the significance
of collocations in a concordance file as it calculates the actual frequency of a given
collocation and normalises it with its expected frequency (ibid: 229-230). Typical is designed
to find the most typical citations for a given word in a line by assessing the significance of
co-occurring words in a line and then evaluating the whole line (ibid: 232).

In addition, there are some concordance programs which combine a number of tools in one
package, such as wordlist, concordance and key words. Such programs also make wide use of
statistics. For example, WordSmith, which was designed by Scott (1996), uses
the chi-square measure, while CobuildDirect uses mutual information and t-score (Oakes,
1998: 193).
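The idea of comparing the actual frequency of a collocation with its expected frequency can be sketched as follows. The formulas are simplified textbook forms of the t-score and z-score and are given as an assumption for illustration only; they are not claimed to be what any of the programs named above implement, and some formulations of the z-score include a further correction factor. All function and parameter names are hypothetical.

```python
import math


def expected_frequency(node_freq, collocate_freq, corpus_size, span=4):
    """Co-occurrences expected by chance in a window of `span` words on each side of the node."""
    return node_freq * collocate_freq * 2 * span / corpus_size


def t_score(observed, expected):
    """Simplified t-score: the difference scaled by the square root of the observed count."""
    return (observed - expected) / math.sqrt(observed)


def z_score(observed, expected):
    """Simplified z-score: the difference scaled by the square root of the expected count."""
    return (observed - expected) / math.sqrt(expected)


# Invented counts for illustration only.
expected = expected_frequency(node_freq=100, collocate_freq=50, corpus_size=100000, span=4)
print(expected, t_score(16, expected), z_score(16, expected))
```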

These programs are designed, in the first place, to work on languages written in the Roman
alphabet. Secondly, problems of polysemy (words with more than one meaning) cannot be sorted
out automatically except in opportunistic (specialised) corpora (Smadja 1991); otherwise they
need some sort of hand-editing. For instance, a word like bank, which means either a
financial body or one side of a river, could be disambiguated if we have, for example, a corpus
of economic texts.

Homonyms are relatively uncommon in Arabic. However, Arabic is rather full of homographs
which are distinguished in pronunciation. Some learners of Arabic think that most Arabic
words are homonymous, which is not the case. This impression is due to the absence of vowels in
modern orthography; the vowels are rather predicted.29 Arabic is a language in which vowels
are represented in diacritic form. Changing one vowel to another yields a different base-form,
and ignoring these vowels produces such homographs. This phenomenon can cause
problems both for human learners of Arabic as a foreign language and for electronic processing.
It can be sorted out by inserting the diacritics when keying in the corpus, though this is
obviously tedious and time-consuming. Alternatively, one can use a tool to diacritise Arabic
text, providing the case endings according to their position in the sentence. This tool, which is
called the Diacritiser and is produced by Sakhr Company, helps disambiguate the seemingly
homonymous words. Automatic diacritisation does not work all the time, however, since the
program is expected to make a choice from a big list of probable words. For example, in
ورد على الرجل wrd ala alrajul, ورد wrd can be diacritised in the following diverse ways:

وَرْدٌ wardun ‘flowers’, وِرْدٌ wirdun ‘portion’, وَرَدَ warada ‘came’, وَرَّدَ warrada ‘flowerise’,
وَرَدَّ wa radda ‘and replied’ and وَرُدَّ wa rudda ‘and was replied’

All of the above choices can fit in the text on the syntactic level. To solve such a dilemma we
need to disambiguate these senses semantically. Moreover, we may change the positions of
the words inside the sentence for rhetorical reasons without breaching its meaning. Let us
consider example (9):

(9)ُ‫ابتلَى ابراهيمَ َرّبه‬

ibtala ibraahiima rabbuhu

tested Abraham-Acc. his Lord-Nom.
The Lord tested Abraham.

29 The vowels in Arabic are predicted according to personal intuition. In other words, an Arabic reader would
predict a certain vowel to occur in a certain position according to his own mental lexicon.
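
The way undiacritised spelling collapses distinct words into a single written form, as with wrd above, can be illustrated with a minimal Python sketch. It assumes that removing Unicode combining marks (category Mn) is an adequate model of omitting the diacritics; the function name and the code-point spellings of the four forms are illustrative assumptions, not part of the corpus tools described here.

```python
import unicodedata

def strip_diacritics(word):
    """Remove combining marks (Unicode category Mn), such as Arabic short-vowel signs."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")

# Four of the diacritised readings of wrd, written as escapes so that the
# order of letters and marks is explicit (assumed spellings).
forms = {
    "wardun":  "\u0648\u064e\u0631\u0652\u062f\u064c",
    "wirdun":  "\u0648\u0650\u0631\u0652\u062f\u064c",
    "warada":  "\u0648\u064e\u0631\u064e\u062f\u064e",
    "warrada": "\u0648\u064e\u0631\u0651\u064e\u062f\u064e",
}

# Every reading reduces to the same three bare letters once the marks are dropped.
bare = {strip_diacritics(form) for form in forms.values()}
print(len(bare))
```

Going in the opposite direction, from the bare form to the intended reading, is the hard part, since nothing in the string itself selects one candidate over the others.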

In (9) the nominative رَبُّهُ rabbuhu ‘his Lord’, which would normally occur next to the verb ابتلَى ibtala ‘tested’,
is separated from it by the accusative ابراهيمَ ibraahiima ‘Abraham’ to give precedence to Allah’s
name, because he is the Most High; this is a rhetorical device. In such sentences, a normal
morphological parser will confuse the accusative with the nominative because of the absence
of the diacritics that distinguish them. Therefore, proofreading and
hand-editing are necessary to eliminate such discrepancies before doing any sort of statistics
automatically. This is apparently tedious and time-consuming as well. The tool has not been
personally assessed (see

Frequency
When we have large masses of electronic data to analyse, we have to find a way to sort and
simplify them so that they are easy to examine and manipulate. Statistics is
a good way of simplifying the data and highlighting what is of interest:
for instance, some combinations of words tend to occur relatively often, while others
are rare or impossible.

Statistics has been a useful technique in all branches of language studies (cf. Miller, 1963 &
Fasold, 1984). For corpus linguistics, it is particularly important (Allen, 1995, Charniak,
1993, Krenn and Samuelsson, 1997 and Oaks, 1998). Corpus-based statistical study, which is
an extension of traditional descriptive linguistics, can shed light on some aspects of language
which we might not be able to discern otherwise.

The starting point for analysing our corpus quantitatively to find collocations is counting. The
more frequently the word under examination (the node) occurs with another word (or words), the surer
we are that this combination forms a significant pattern. Analysing our corpus in this way
does not work all the time, since much of the output may not be very interesting, as
shown in figure (5.3) below. The figure shows the frequency of the top 10 trigrams with ktb
(write, book), i.e. the most frequently occurring three-word phrases, in CAC.

R1 (search-term) L1      R1 (search-term) L1      Frequency
in … Allah               الله ... في               110
from … Allah             الله ... من               53
what … Allah             الله ... ما               35
and in … Allah           الله ... وفي              32
from … in                من ... في                 32
on … like                على ... كما               28
to … Allah               الله ... إلى              25
from … this              هذا ... من                24
harm … and not           ولا ... يضار              24
what … to them           لهن ... ما                23
Figure (5.3) the top 10 co-occurring trigrams of the base-form (lemma) ktb in CAC.

In figure 5.3 above, most of the patterns have no special reason to occur together.
Five of the ten trigrams with ktb co-occur with God’s name
(Allah), referring to the Qur’an, whereas four are flanked by function words.
Frequency alone does not tell us very much and may be misleading, because ‘frequency-based
search works well for fixed phrases. But many collocations consist of two words that stand in
a more flexible relationship to one another’ (Manning & Schütze, 1999:147).
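
Counting the words that flank the node, as in figure 5.3, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the token list is an invented English toy example, matching is done on surface forms rather than on the lemma, and `flanking_pairs` is a hypothetical name rather than a tool used for CAC.

```python
from collections import Counter

def flanking_pairs(tokens, node):
    """Count (preceding word, following word) pairs around each occurrence of node."""
    counts = Counter()
    # Start at 1 and stop before the last token so both neighbours always exist.
    for i in range(1, len(tokens) - 1):
        if tokens[i] == node:
            counts[(tokens[i - 1], tokens[i + 1])] += 1
    return counts

# Invented toy data: the node "book" with its immediate neighbours.
tokens = "in book Allah from book Allah in book Allah what book said".split()
counts = flanking_pairs(tokens, "book")

# Rank the frames by raw frequency, most frequent first.
for (before, after), freq in counts.most_common():
    print(before, "...", after, freq)
```

A ranking of this kind is dominated by function words, which is why the raw counts need to be assessed with an association measure.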

By using statistical tests we are more likely to get reliable results and test how likely two
words are to occur near each other. There are some interesting and useful statistics that one
can use to assess and enhance such counts. The most prominent ones are z-score (Berry-
Rogghe: 1970), mutual information (Church & Hanks, 1990) and t-score (Church, Hanks and
Hindle, 1991).

To extract collocations statistically we need to examine how probable it is that a certain
combination will occur. Mutual Information can help us identify interesting patterns. For
example, if one or more words show up in our corpus a number of times around our search
term, we can examine how interesting such a pattern is by comparing their joint
probability with chance, i.e. by comparing the number of occurrences of the combination with
the number of occurrences of each word independently. Words with large mutual
information scores are likely to be more interesting (Church et al, 1991).

The formula as introduced by Church et al. for two given words reads:
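
As a sketch of that formula, assuming the standard form of the association ratio in Church and Hanks (1990), and assuming that the probabilities are estimated from the frequencies f(x), f(y) and f(x, y) and the corpus size n used in the tables of this chapter:

```latex
I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}
        \approx \log_2 \frac{f(x, y)\; n}{f(x)\; f(y)}
```

The second form follows from substituting P(x, y) = f(x, y)/n, P(x) = f(x)/n and P(y) = f(y)/n.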

Mutual Information compares the probability of x and y occurring together with the probabilities
of x and y occurring independently. Church and Hanks (1990) argue:

If p(x, y) is bigger than p(x) p(y), then it is evidence that there is more
likely a genuine association.
If p(x, y) equals or is less than p(x) p(y), then we can predict no
interesting association.

Paul Johnston30 has designed a program, available on his web site, that can do the calculation
automatically provided that one has the value of each variable. To use this formula to find
collocations in CAC, let us take the word الدنيا al-dunya (the world) as our search-term and
then carry out the calculations for the word as shown in table (5.4). The word under
investigation, al-dunya, is given the value (x), whereas the corpus size is represented as (n).

f(x) = 1350, n = 5000000
(x, y)                                           f(x, y)   f(y)     MI
the world perishable           الدنيا الفانية    6         11       10.98
the world and its adornment    الدنيا وزينتها    7         17       10.57
the world and the hereafter    الدنيا والآخرة    79        571      9.00
the world good deed            الدنيا حسنة       14        398      7.2
the world and torture          الدنيا وعذاب      5         338      5.77
the world little               الدنيا قليل       6         673      5.04
the world house                الدنيا دار        6         551      5.03
the world and certifies        الدنيا ويشهد      9         1864     4.16
the world what                 الدنيا ما         24        6939     3.67
the world without              الدنيا دون        6         2150     3.36
the world means                الدنيا يعني       8         4462     2.73
the world except               الدنيا إلا        18        11157    2.57
the world mentioning           الدنيا ذكر        5         3853     2.26
the world from                 الدنيا من         45        34830    2.25
the world and for              الدنيا وأما       7         6987     1.89
the world in                   الدنيا في         19        24009    1.55
the world until                الدنيا حتى        10        13165    1.49
the world to                   الدنيا إلى        11        11214    1.34
the world namely               الدنيا أي         8         13356    1.14
the world then                 الدنيا ثم         11        22956    0.82
the world said                 الدنيا قال        15        32835    0.75
the world verily               الدنيا قد         7         24534    0.07
the world on                   الدنيا على        12        47664    -0.10
the world and not              الدنيا ولا        6         32000    -0.52
the world the statement        الدنيا القول      5         32835    -0.82
Table (5.4) the left collocates of the word al-dunya with maximum frequency of 100
and minimum frequency of 5.
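
The MI column of the table can be recomputed from the frequencies with a few lines of Python. This is a minimal sketch, assuming the base-2 frequency form of the measure, log2(f(x, y) · n / (f(x) · f(y))); the function name is hypothetical and this is not the program referred to above.

```python
import math

def mutual_information(f_xy, f_x, f_y, n):
    """MI in bits: observed joint frequency compared with the frequency expected by chance."""
    if min(f_xy, f_x, f_y, n) <= 0:
        raise ValueError("all frequencies and the corpus size must be positive")
    return math.log2((f_xy * n) / (f_x * f_y))

F_X = 1350      # frequency of the node al-dunya, from the table
N = 5_000_000   # corpus size, from the table

# (gloss of the collocate, f(x, y), f(y)) for three rows of the table.
rows = [("perishable", 6, 11), ("and torture", 5, 338), ("on", 12, 47664)]
for gloss, f_xy, f_y in rows:
    print(f"{gloss:12s} {mutual_information(f_xy, F_X, f_y, N):7.3f}")
```

For these three rows the computed values agree with the table to within 0.01. A collocate that is itself very frequent, such as the preposition in the last row, ends up with a score below zero even though the pair occurs twelve times.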
It is the nature of Arabic orthography to attach some particles, personal pronouns in either
the genitive or accusative case, and the definite article to the following or preceding string of
characters. For instance, conjunctions like و wa ‘and’ and فـ fa ‘and, consequently, after’, and the
definite article al (the), are treated in writing as parts of the words that follow. To
decompose such combinatory units automatically may lead to a serious problem in
identifying what a word is. This is because such units can be integral parts of base-forms in
Arabic. For example, و wa ‘and’ could be a conjunction, but it could also be the initial
letter of hundreds of Arabic words like وجد wajada ‘he found’, واحد waah{id ‘one’, وليد
waliid ‘newborn’, etc.

To me, it would be more realistic to stipulate from the very beginning what a word is. The
word has been defined as 1) a sequence of characters with spaces on either side; 2) a minimal permutable
unit; 3) a maximally uninterruptible unit (Cruse 2000). I will consider the word to be what lies between
spaces, as this is easier and more practical. Similarly, in English, the mainstream practice is to count
words like within, insofar and themselves as three words, one each, irrespective of how many units they
contain. To be more practical, I will treat the particle as a part of the word, like an affix.
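
The whitespace-based definition, and the risk of stripping particles automatically, can be illustrated with a minimal Python sketch. The function names and the naive prefix-stripping rule are hypothetical and are introduced only for illustration; they are not the procedure used to build CAC.

```python
def tokenize(text):
    """Treat a word as whatever lies between spaces; attached particles stay inside it."""
    return text.split()

def naive_strip_wa(word):
    """Hypothetical rule: drop an initial waw as if it were always the conjunction 'and'."""
    if len(word) > 1 and word.startswith("\u0648"):
        return word[1:]
    return word

# al-dunya wa-ziinatahaa: two orthographic words; the conjunction is attached to the second.
tokens = tokenize("\u0627\u0644\u062f\u0646\u064a\u0627 \u0648\u0632\u064a\u0646\u062a\u0647\u0627")

# wajada 'he found' begins with waw as part of the base-form, so the rule damages the word.
damaged = naive_strip_wa("\u0648\u062c\u062f")

print(len(tokens), len(damaged))
```

Keeping the particle inside the word avoids this kind of error, at the cost of treating a form with and without the conjunction as two different types.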

In table 5.4 the search term الدنيا al-dunyaa ‘world’ occurs 1350 times in CAC. The most
significant left collocates, i.e. the pairs with the highest MI scores, are الآخرة al-‘aakhirah
‘hereafter’ with an MI score of 9.00, زينتها ziinatahaa ‘adornment’ at 10.57 and الفانية
al-faaniyah ‘perishable’ at 10.98. These collocations, reiterated by Muslims in religious
contexts, describe the reality of this world from the Muslim perspective. So, Muslims
view the world as an adornment which will inevitably perish, whereas the genuine life will be
in the Hereafter.

The pair الدنيا حسنة al-dunyaa h{asanatan ‘the world a good deed’ appears as a strong
collocation. It is part of an often-quoted prayer (supplication), given in (10) below:

(10) اللهم آتنا في الدنيا حسنة وفي الآخرة حسنة وقنا عذاب النار

allahumaa aatina fi al-dunyaa h{asanatan wa fi al-‘aakhirati h{asanatan wa
qinaa cadhaaba al-naar.
(trans.) ‘O Allah give us a good deed in this world and the next and protect us from the
torment of the Fire.’

The collocate h{asanatan does not modify al-dunyaa in the first place; it is rather the direct
object of the verb aatinaa (give us). Moreover, each repeated citation of this prayer is not an
independent occurrence of this collocation. Accordingly, we need a mechanism to single out
real collocations from apparent ones. The t-score, which is a general statistical measure
that can compare two probabilities, is a useful statistic for assessing the relative strength of a
collocation. This will be discussed in detail in the next chapter.

In addition to the usefulness of MI in finding a given collocation without any prior
knowledge of its plausibility, it can also detect whether a given combination is really a
collocation or not. Let us consider the following example.

f(x) = 11805, n = 5000000

(x, y)                                  f(x, y)   f(y)     MI
Messenger of Allah       رسول الله      11598     19246    7.99
Messenger from           رسول من        24        34830    -1.77
Messenger verily         رسول قد        14        24534    -2.04
Messenger truthful       رسول مصدق      9         67       5.83
Messenger prayed         رسول صلى       9         18843    -2.30
Messenger sent           رسول مرسل      6         34       6.22
Messenger with what      رسول بما       4         4566     -1.43
Messenger of the king    رسول الملك     4         5046     -1.57
Messenger except         رسول إلا       4         12261    -3.23
Table (5.5) the left collocates of the word رسول rasuul with minimum frequency of 4.

Table (5.5) shows that the search term رسول rasuul ‘Messenger’ collocates with الله ‘Allah’,
مرسل mursal ‘sent’ and مصدق mos}addaq ‘truthful’, as their high MI scores show. Of the three
collocations, the first has the strongest bond with our node, with an MI score of 7.99. This draws our
attention to the strong bond between the pair رسول الله rasuul Allah ‘Allah’s Messenger’, in
addition to the main traits of this Messenger, which are ‘truthful’ and ‘sent by Allah’. Because of the
very strong bond between رسول rasuul ‘Messenger’ and الله Allah, the word الرسول al-rasuul
‘the Messenger’, with the definite article, can replace the whole pair.

The major problem with MI is that it does not work very well when there is not much data,
i.e. the sparse data problem, which affects all statistical tests. We cannot calculate the
probability of a given pair if one of the variables has the value zero, and with very low
frequencies the measure does not work very well either. Manning and Schütze (1999: 169)
calculated the MI scores of ten bigrams that occurred once to show the unreliability of MI with
sparse data. They found that ‘a large proportion of bigrams are not well characterised by
corpus data (even for larger corpora) and that mutual information is particularly sensitive to
estimates that are inaccurate due to sparseness’.

Obviously such a problem can be superficially avoided by using only words with a frequency of at
least three or four. Sinclair (1996) proposes a primitive test to measure the significance of a
given pattern by looking at patterns with a minimum frequency of two. He noted that, despite
its insufficiency, such a condition could guarantee that a pattern is not accidental.

In practice, for language, unlike many other areas of research, only
events that recur are worth assessing the significance of; no matter
how unusual, a single occurrence is unremarkable in the first instance.
(ibid: 81)

This is in conformity with corpus linguistics methodology, i.e. investigating how typical
a given pattern is.

Allen (1995: 194-5) proposes a more practical solution: adding a small amount to each
count to guarantee that there will be no zero probabilities. This technique is called the expected
likelihood estimator (ELE). For example, if one term of the formula happens not to occur
in our corpus, i.e. it equals (0), the Mutual Information statistic will be inapplicable, giving
no result. ‘The ELE, however, gives an equally likely probability to each possible word class’
(ibid: 195).
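
The effect of adding a small amount to each count can be sketched in Python. This is a minimal sketch, assuming an increment of 0.5 as the default and an invented set of counts; the function name is hypothetical and the code is not taken from Allen (1995).

```python
def ele_probabilities(counts, delta=0.5):
    """Add delta to every count before normalising, so that no outcome gets probability zero."""
    if not counts:
        raise ValueError("at least one outcome is required")
    total = sum(counts.values()) + delta * len(counts)
    return {outcome: (count + delta) / total for outcome, count in counts.items()}

# Invented counts for three outcomes, one of which was never observed.
observed = ele_probabilities({"a": 3, "b": 0, "c": 1})

# With no observations at all, every outcome receives the same probability.
unseen = ele_probabilities({"a": 0, "b": 0})

print(observed["b"] > 0, unseen["a"] == unseen["b"])
```

The estimates still sum to one, but frequent outcomes give up a little probability to the unseen ones, so the result should be read as a smoothed estimate rather than an observed relative frequency.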

Let us now consider the following example to see how useful MI is in extracting collocations.
MI can reveal what is unexpected or often overlooked among the obvious typical patterns. When
discussing the collocations of body parts in Classical Arabic, Emery (1988) argued that the
adjectives in حرب ضروس h{arb d}aruus ‘fierce war’ and جيش جرار jayshun jarraar ‘huge
army’ uniquely collocate with their preceding nouns. Having analysed CAC, I found that
neither of them co-occurs with such nouns even once. The adjective ضروس d{aruus ‘fierce’
rather co-occurs with مطر mat}arun ‘rain’. As for the other adjective, جرار jarraar ‘huge’, it
collocates with عسكر ‘askarun ‘soldiers’. On the other hand, he was successful in ascertaining
that the verb أطرق at{raqa ‘bowed’ uniquely collocates with a particular body part: رأس ra’s
‘head’. However, Table (5.6) below shows that there are more categories, other than body
parts, which أطرق can collocate with, such as أطرق حياء at}raqa h{ayaa’an ‘bowed out of
shyness’ and أطرق كرا at}raqa kara ‘Kara bowed’.

f(x) = 57, n = 5000000

(x, y)                                  f(x, y)   f(y)     MI
bowed his head           أطرق رأسه      35        1106     11.43
bowed Kara               أطرق كرا       3         3        16.42
bowed so not             أطرق فلم       1         15175    2.45
bowed and thought        أطرق وفكر      1         261      8.39
bowed but                أطرق وإنما     1         5937     3.88
bowed then               أطرق فإذا      1         24107    1.86
bowed namely             أطرق أي        1         13356    2.71
bowed the young man      أطرق الشاب     1         370      7.88
bowed to                 أطرق إلى       1         11214    2.45
bowed out of shyness     أطرق حياء      1         32       11.42
bowed Hasan              أطرق حسن       1         2111     5.37
Table (5.6) the problem of sparse data. The left collocates of the word at}raqa ‘bowed’.

Nevertheless, أطرق at}raqa ‘bowed’ in table 5.6 strongly collocates with رأسه ra’sahu ‘his
head’, as its high MI score shows. The second combination in the table appears
to be a strong collocation despite its low frequency, because the word it collocates with is rare:
all three occurrences of this word occur only with أطرق at}raqa. This indicates
that such a combination is more likely to be a cliché, an idiom or some other stereotyped phrase.
In fact, all of them belong to a certain context or domain, namely proverbs; this leaves us in no doubt
that the combination under investigation is part of a proverb.31

MI is useful only for testing similarities, which is good for finding collocations, as it can
estimate whether two words occur together very often in a text. But we
cannot use it to test the differences between words, which is necessary for assessing
seemingly synonymous or collocated words. It can give evidence for closely related
words: if you find x, you are more or less likely to find y. For testing differences, I need a
different statistic, which is the t-score.

31 Using corpus-based analysis to assess Emery’s results is useful in supporting or invalidating his hypothesis.

T-test: a measure of difference

Differences between items, particularly synonymous words, are not easy to identify on
traditional syntactic or semantic grounds. Thesauri, which are produced for practical or
pedagogical reasons (cf. Ch. Six), are sometimes misleading, and through frequent use we
may come to accept all the entries given as synonyms of a word as absolute synonyms.

In Chapter Five we introduced the Mutual Information statistic, which is useful for detecting
similarity between items. We will now use another statistical technique, the t-test, which is useful
in assessing the significant differences between two groups of patterns, typically pairs of near
synonyms. Its main aim, as suggested by Church et al (1991), is to identify the
words that are significantly more likely to appear with each item of the synonymous pair. The
example of strong and powerful, given by Church et al, shows the importance of this
test. Unlike Mutual Information, which can only make positive statements32 about what is
more likely to occur after a given item, the t-test can work the other way round, i.e. highlight
what is less likely to occur after that item. In other words, the difference between powerful
and strong in powerful support and strong support can be brought out by comparing the most
significant right collocates of both of them. By analysing the significant collocates, Church et
al (1991) managed to abstract an attribute that differentiates the two words, namely
intrinsic vs. extrinsic.

The t-test simply calculates the difference between two probabilities. The formula as given in
Church et al (1991) for the pair of words ‘strong’ and ‘powerful’ is represented as:
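
As a sketch of that formula, assuming the form usually quoted from Church et al (1991), in which the difference between the two conditional probabilities is divided by the square root of the sum of their variances (the order of the two words only determines the sign of the score):

```latex
t = \frac{P(w \mid \text{strong}) - P(w \mid \text{powerful})}
         {\sqrt{\sigma^{2}\bigl(P(w \mid \text{strong})\bigr)
              + \sigma^{2}\bigl(P(w \mid \text{powerful})\bigr)}}
```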

where w stands for the collocate and σ for the standard deviation.
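
In practice the comparison is often computed directly from frequencies. The Python sketch below assumes the common approximation in which each variance is estimated by the corresponding count, so that t ≈ (f(x w) − f(y w)) / √(f(x w) + f(y w)); the function name and the counts are invented for illustration.

```python
import math

def t_score_difference(f_xw, f_yw):
    """Approximate t-score for collocate w after word x versus after word y.

    Positive values mean w is more typical of x, negative values more typical of y.
    """
    if f_xw < 0 or f_yw < 0:
        raise ValueError("frequencies cannot be negative")
    if f_xw + f_yw == 0:
        raise ValueError("the collocate must occur with at least one of the two words")
    return (f_xw - f_yw) / math.sqrt(f_xw + f_yw)

# Invented counts: w follows x ten times and never follows y.
typical_of_x = t_score_difference(10, 0)
# The same collocate occurs equally often after both words.
no_difference = t_score_difference(5, 5)

print(round(typical_of_x, 2), no_difference)
```

Unlike MI, the score is signed, so a collocate that avoids one member of the pair shows up as clearly as one that favours it.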

32 By positive statement I mean that in MI we can find the words which are more likely to co-occur after X but
we cannot account for the items which are more significant with Y or did not occur at all with either. T-test can
make a negative statement by looking at items which are less likely to co-occur with either X or Y altogether.

Finally, we have to bear in mind that statistical calculation is not an end in itself in
linguistic analysis. As Sinclair (1996: 80-81) puts it:

The use of numerical methods is normally only the first stage of a
linguistic investigation, and this kind of work should be distinguished
sharply from the heavy reliance on statistical methods in some styles
of linguistic-analytical operations such as parsing or translation.

In conclusion, we can give a definition of collocation which is largely an amalgam, with
some refinement, of the above definitions. Collocation is the significantly frequent33
co-occurrence of two or more words.

33 “Significantly frequent” here means the statistically significant combination of words.

Chapter Six: Synonymy: An overview
6.1 Introduction
As discussed earlier, synonymy is a paradigmatic relation that holds between words on the
vertical axis (cf. Ch. 5). This type of sense relation simply means sameness or similarity
of meaning as defined in dictionaries. When asked about the meaning of a given
word, people will intuitively offer more than one word as alternatives. Many
dictionaries have been compiled to serve that purpose, such as Roget’s Thesaurus, Webster’s Synonym
Dictionary and Crabb’s English Synonyms. Such dictionaries provide for every entry a list of
words that are close in meaning or that describe the concept in detail. But is that closeness of
meaning to be considered synonymy?

On the other hand, some linguists like Bloomfield (1935: 145) deny the existence of
synonyms in natural languages. Therefore we need from the very beginning to explain what
exactly synonymy is and give a systematic survey of the phenomenon to put forward a more
convincing explanatory hypothesis that will be statistically applicable.

As the aim of this thesis is to look into synonymy in a different way and to
examine empirical issues that have interesting theoretical consequences, I will try in this
chapter to review the phenomenon from the corpus linguistic perspective. Through corpus
analysis we can show whether two items are indeed absolute synonyms or not by checking
their relations in all the available contexts.

6.2 Definition
Synonymy is defined as two or more expressions which are different in form but not in
meaning (Harris, 1973: 6). Having two different phonological words with the same meaning
raises the question of how much sameness they must share. Should it
be complete similarity, strong similarity or even a slight shade of similarity?
Below, in section (6.2.1), we explain how similar in meaning one word must be to another
to be called a synonym. Let us first discuss the expressions involved.

It is important from the very beginning to distinguish two notions of semantic similarity: a)
similarity between single words and b) paraphrase. For instance commence and start are
synonymous verbs, whereas human male is a paraphrase of man. We are concerned with the
first type of similarity. Synonymy in this sense is defined as a relation of similarity in
meaning between lexical items.

Synonymy can also be defined as sameness of intension or extension34 (Jones, 1986: 66). The
extension of a word (its denotation) is everything referred to by the word. For example, the
extension of the word dog is the class of dogs. The intension, on the other hand, is the set of
properties the word expresses (or, to put it in Lyons’ words, ‘the set of attributes which characterise any entity to
which the term is correctly applied’ (Lyons 1968:454)). For example, dog entails animal.

Palmer (1981: 88) defines synonymy as ‘symmetric hyponymy’. For instance, if we take car
and automobile as synonyms, then they have to be mutual hyponyms of each other, i.e. all
cars are automobiles and all automobiles are cars. In this respect, synonymy is considered a
type of hyponymy: ‘If X is a hyponym of Y and if Y is also a hyponym of X, then X and Y are
synonymous’ (Hurford & Heasley, 1983: 107).
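
Symmetric hyponymy can be modelled with sets if, as an assumption, each word is represented by its extension. In the minimal Python sketch below the extensions are invented toy sets and the function names are hypothetical.

```python
def is_hyponym(x_extension, y_extension):
    """X is a hyponym of Y if everything X refers to is also referred to by Y."""
    return x_extension <= y_extension

def are_synonymous(x_extension, y_extension):
    """Symmetric hyponymy: X is a hyponym of Y and Y is a hyponym of X."""
    return is_hyponym(x_extension, y_extension) and is_hyponym(y_extension, x_extension)

# Invented toy extensions.
car = {"vehicle_1", "vehicle_2"}
automobile = {"vehicle_1", "vehicle_2"}
dog = {"rex", "fido"}
animal = {"rex", "fido", "tom_the_cat"}

print(are_synonymous(car, automobile), are_synonymous(dog, animal))
```

On this model dog is a hyponym of animal but not the reverse, so the pair fails the symmetry requirement, whereas car and automobile pass it.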

A more restrictive definition of synonymy was put forward by Quine, for whom
‘two forms are synonymous if their interchange leaves their contexts synonymous’ (Jones,
1986: 70). This definition requires that the two forms under investigation be interchangeable
in every possible context. The synonymy relation between a given pair of words can be ruled
out if we spot any change in the context. Tests have been introduced to check the credibility
of any seemingly synonymous pairs, as will be discussed in section 6.2.2 below.

6.2.1 Synonymy - Four Approaches

According to the definitions given above, we are left with four attitudes towards the treatment of
synonymy: one denies the existence of synonymy, and the other three differ as to how much
similarity is required for synonymy.

34 The extensional approach is the only way to give all sorts of information. However, the intensional approach
to meaning is more general than the extensional approach since there are some words which do not have an
extension like unicorn. Unicorn entails animal; however it does not refer to anything extensionally. Therefore, it
is relatively easier to study words intensionally than extensionally. Moreover, hyponymy will not come up if we
do not take that approach (e.g. dog is a hyponym of animal).

First Approach: Denial of the existence of synonymy

We stated in the introduction of this chapter that some linguists deny the existence of
synonymy. This approach arises from the question of why a natural language should have two
words which mean the same thing and are used in the same range of grammatical and lexical
patterns. The principle of economy eliminates one of these two terms as redundant.
Bloomfield (1935: 145) argues, ‘each linguistic form has a constant and specific meaning. If
the forms are phonemically different, we suppose that their meanings are also different… We
suppose, in short, that there are no actual synonyms’. This approach was also adopted by
Palmer, who claimed that there are no real synonyms (1981: 89). Such an approach can be
entertained or discredited in our theoretical treatment of synonymy according to which
definition of synonymy we adopt. The definition of synonymy as complete interchangeability
is partly in conformity with this approach; as argued by Ullmann, complete
interchangeability without any alteration in meaning is rare in natural
language (1962: 142). The three other approaches acknowledge synonymy but
treat it differently.

Second Approach: Strict definition of synonymy

Quine, Ullmann and Haas view synonymy as perfect interchangeability between the items
under investigation in all possible contexts. It is enough to prove that a given pair of words is
non-synonymous if any shade of meaning (increased or decreased) alters with the change of
the context. As Jackson (1998: 65) put it, ‘two words are synonymous if they can be used
interchangeably in all sentence contexts.’ This is also the requirement given by Ullmann
(1962) when he talks about synonymy in that strict sense. Synonymy in this sense is called
absolute synonymy as will be explained in section 6.2.2. Ullmann introduced a test for ruling
out seemingly synonymous pairs called the substitution test. He (ibid: 143) states, ‘The best
method for the delimitation of synonyms is the substitution test … which is (considered) one
of the fundamental procedures of modern linguistics, and in the case of synonyms it reveals at
once whether, and how far, they are interchangeable’. He gave a few examples like broad
and wide, which are used synonymously in broad sense and wide sense but fail to keep that
synonymy in five foot wide. He did not go into more detail. He also stated that ‘one can also
distinguish between synonyms by finding their opposites (antonyms).’ For instance, decline
and reject are not synonymous if opposed to rise and accept.
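
A corpus-based version of the substitution test can be sketched in Python. This is a minimal sketch under a strong assumption: a phrase counts as acceptable only if it is attested in a given set of phrases, which is at best indirect evidence, since absence from a corpus does not prove unacceptability. The function name is hypothetical and the phrase set contains only the three examples quoted above.

```python
def failed_substitutions(word_a, word_b, attested_phrases):
    """Return the attested phrases containing word_a whose version with word_b is unattested."""
    failures = []
    for phrase in sorted(attested_phrases):
        tokens = phrase.split()
        if word_a in tokens:
            swapped = " ".join(word_b if token == word_a else token for token in tokens)
            if swapped not in attested_phrases:
                failures.append(phrase)
    return failures

# The three phrases mentioned in the text, treated as the whole of the evidence.
attested = {"broad sense", "wide sense", "five foot wide"}

wide_to_broad = failed_substitutions("wide", "broad", attested)
broad_to_wide = failed_substitutions("broad", "wide", attested)

print(wide_to_broad, broad_to_wide)
```

A single failed substitution is enough to rule out synonymy in the strict sense, which is why the test has to be run in both directions and over as many contexts as the corpus offers.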

Haas (quoted in Cruse, 2000) comes up with a different test, i.e. the normality profile test.
The meaning of a word is its normality profile across its grammatical occurrences. He argued
that there is a normality profile for all possible words and sentences in a language. One single
occurrence where we find one item of a pair more or less normal than the other can
undermine the synonymy relation between them. ‘Every difference of meaning between two
expressions will show up as a difference of normality in some context’ (Cruse, 2000: 12). For
example: illness and disease are not synonymous as in (1a&b).

1.a. During his illness (normal)

b. *During his disease (abnormal)

We can distinguish between words through the grammatical aspects of meanings, as every
word can be more or less normal than the other. Haas used the notion of normality as a
primitive intuition.

Third Approach: a more lenient approach

This looks at synonymy in a broader sense and is adopted to accommodate as many
synonyms as possible for each item. This approach is most likely adopted for practical
purposes, such as in the pedagogical field and in the production of dictionaries. A closer look
at dictionaries of synonyms reveals that the criterion applied in the definition of
synonymy is that two items be similar in at least some context. For example,
Roget’s Thesaurus is based on concepts: for every concept it gives a list or lists of terms
which describe that concept. These terms are grouped according to the degree of their
synonymy relation. Unlike Roget’s Thesaurus, Webster’s Synonymy Dictionary and Crabb’s
English Synonymy are word-based; they take each word and list its synonyms followed by

Fourth Approach: halfway between the two extremes
The definitions given by Lyons (1968 and 1981) and Cruse (1986 and 2000) represent this
attitude. They do not see interchangeability in all contexts as a requirement for recognising
synonymy. Rather, they draw a distinction between different categories of synonymy.

Lyons (1968: 450) defines synonymy as follows:

If one sentence, S1, implies another sentence, S2, and if the
converse also holds, S1 and S2 are equivalent; … If now the
two equivalent sentences have the same syntactic structure and
differ from one another only in that where one has lexical item
x, the other has y, then x and y are synonymous.

Later on, Lyons (1981: 50-51) drew a distinction between three types of synonymy:
1. synonyms are fully synonymous if, and only if, all their meanings are identical;
2. synonyms are totally synonymous if, and only if, they are synonymous in all contexts;
3. synonyms are completely synonymous if, and only if, they are identical on all (relevant)
dimensions of meaning.

Absolute synonymy combines all three of these categories. Lyons (ibid: 51) states, ‘absolute
synonyms are expressions that are fully, totally and completely synonymous’. If one of the
above criteria were not met, synonymy would be partial (ibid). Lyons made a further
distinction between partial synonymy and near-synonymy. The latter is defined as
‘expressions that are more or less similar, but not identical, in meaning’ (ibid: 50). Cruse
(1986: 292) viewed Lyons’ partial synonymy and near-synonymy as one: ‘By his (Lyons’)
definition near-synonyms qualify as incomplete synonyms, and therefore as partial synonyms
(though, of course, they represent only one variety)’. But later on he regarded them as
different degrees of similarity, as shown below.

Cruse (2000:156) proposes a different classification of synonymy. He defines synonyms as
‘words whose semantic similarities are more salient than their differences’. He distinguishes
three types of synonyms according to the degree of similarity that holds between items:
absolute, propositional, and near-synonymy. These types can be located on a scale, at one end
of which falls absolute synonymy.

6.2.2 Degrees of Synonymy

Absolute synonymy:

Cruse (1986: 268) states that ‘two lexical units would be absolute synonyms … if and only if
all their contextual relations … were identical’. This is a very strict definition; the two words
have to have exactly the same normality in all cases. A single context in which they differ is enough
to rule out the synonymy relation between two items. The over-quoted
example is caecitis and typhlitis (which mean inflammation of the blind gut) (Ullmann
(1963) and Lyons (1981b)). This sort of complete similarity is not motivated in
natural languages, since one of the items would be redundant and would accordingly undergo a shift of
meaning or fall out of use. The only possible reason for having absolute synonyms is to avoid
repetition of forms, for more profound and elegant discourse. Accordingly, examples
that satisfy this strict definition are rare. To show that two items are not absolutely
synonymous, Cruse (1986 and 2000) used the Normality Test, introduced by Haas as
mentioned in the previous section, where any difference in meaning will be reflected in a
difference in contextual relations, i.e. collocational restriction. Let us now have a look at the
following examples (taken from Cruse 2000: 157):

2.a. Little Billy was so brave at the dentists’ this morning.

b. ? Little Billy was so courageous at the dentists’ this morning.
3.a. He is a big baby, isn’t he?
b. ?He is a large baby, isn’t he?
4.a. Apparently he died in considerable pain.
b. ? Apparently he kicked the bucket in considerable pain.

In the above examples the (a) sentences are more normal than their (b) counterparts. It should
be noted that when doing the test we have to stick to one meaning of the word under
investigation, especially with words that differ only subtly. For example, sad and unhappy, in
5 and 6 below, can modify either animate or inanimate objects.

5.a. ?It is a sad baby.

b. It is an unhappy baby.
6.a. It is a sad story.
b. It is an unhappy story.

The Haasian test is semantically based and does not work otherwise. Cruse (1986: 281) gave
some examples with no semantic explanation. For instance, one’s record can be spotless,
unblemished or impeccable, but not flawless, whereas one’s credentials cannot be but the last.
These are called idiosyncratic collocational restrictions. This is an important point that gives our
premise more credence when we talk about the treatment of synonymy through collocations.

Propositional synonymy

Propositional synonymy (commonly called cognitive synonymy) is widely regarded as
synonymy. So, Lyons’ definition quoted in 6.2.1 can fit here as an appropriate definition for
this type of synonymy. To put it in Cruse’s words:

X is a cognitive synonym of Y if (i) X and Y are syntactically
identical, and (ii) any grammatical declarative sentence S
containing X has equivalent truth-conditions to another
sentence S1, which is identical to S except that X is replaced
by Y.
(Cruse 1986: 88)

For instance, fiddle and violin are propositional synonyms, because the two sentences: He
plays the violin very well and He plays the fiddle very well, i) have the same syntactic
structure, ii) have the same truth-conditional properties as they entail one another.
Accordingly, Lyons’ definition of synonymy discussed in section 6.1.1 above comes out as

The key point in defining propositional synonymy is substitutability with the truth-condition
preserved, as Lyons put it ‘substitutability salva veritate’, so it is less strict than absolute
synonymy which requires, in addition to keeping the truth-condition of the substituted words,
the same contextual environment. The way to prove that two words are propositional
synonyms is to find a situation where one is more or less typical than the other, while
preserving the truth-value. Such a type of synonymy is more common than absolute
synonymy, for instance, begin: commence, car: automobile, die: pass away and brave:

Propositional synonymy allows some differences of non-propositional meaning to occur
between synonymous pairs (Cruse, 2000: 158). Arguing that ‘there are no real synonyms’,
Palmer (1981: 89) mentioned four facets that render differences between synonymous pairs
as shown below.

a. Dialectal variations, e.g. autumn and fall (the latter is used in American English).
b. Stylistic variations, e.g. begin and commence (the latter is more formal).
c. Emotive variations, e.g. politician and statesman (each show approval and
d. Collocational variations, e.g. rancid bacon or butter and addled eggs or brains.

More precisely, Cruse (1986 & 2000) discussed these differences as follows:
•Differences in expressive meaning: a sentence can be expressively neutral, positive or
negative as shown in the examples below:

7.The old man died.

8. The old man passed away.
9. The old man kicked the bucket.

The sentence (7) above is neutral, whereas (8) has an additional meaning of respect and (9)
has a sense of disrespect.

•Differences of evoked meaning
Dialect: different lexical items that are used in different dialects in the same range of
references. Geographical: autumn: fall, corn: wheat: oats, etc.; temporal: wireless: radio,
swimming baths: swimming pool; social: sofa: settee, lavatory: toilet.
Register: the change of situation, the audience or the speaker’s intention may bring up
different lexical items with same range of reference. Field: marriage: matrimony, dead:
deceased; mode: re: concerning: about; style: money: bread: dough: dosh: filthy lucre.

•Differences of presupposed field of discourse

Two propositional synonyms can differ in respect of presupposed meaning. Presuppositions
can be either logical, as in (10) below, or arbitrary as in (11).

10.a) x died.

In (10.a) above, we have two components of meaning:

i)x is an organism
ii) x became not-alive
If you negate the sentence in (10.a) as in (10.b), then you leave one meaning intact.
10.b) x did not die.

i)x is an organism.
ii) not-(x became not-alive).
Therefore, (i) is a logical presupposition of (10.a).

11.a) x passed away.

In (11.a), we have three components of meaning:

i)x is an organism.
ii) x is human.
iii) x became not-alive.

In negation, one or two of the meaning components in (11.a) are left intact.

11.b) x did not pass away.

i)x is an organism
ii) x is human
iii) not-(x became not-alive)

In (11) above, pass away is not a special way of dying; it just means ‘die’ when speaking
respectfully of humans. So two synonyms can differ in respect of what is highlighted, i.e. one
item can highlight one aspect and the other another. Therefore (i) and (ii) are
presuppositions (the latter is arbitrary because it depends on usage and collocation; we do not
use pass away with animals).

The above variations give rise to the significance of using synonyms in sensitive areas, like
taboo areas, such as when talking about sex, urination, defecation, etc. In other emotionally
sensitive areas like death and money one can also make use of these variations to choose
what is regarded as euphemistic (Cruse, 2000: 158).

Near-synonymy

Near-synonymy (near-synonyms are called plesionyms in Cruse, 1986) is the type commonly
adopted by dictionary-makers, so it can be called dictionary synonymy. The difference between it and
propositional synonymy is that near-synonymy is not propositionally equivalent. Near
synonyms must share central aspects of meaning but are allowed to differ in peripheral
aspects. In more detail, when analysing sentences componentially, we can divide word
meanings into components or atoms. Thus, by central aspects we mean the capital
components, i.e. the heads, whereas peripheral means subordinate ones or the modifiers. For
example, pretty can be analysed as [GOOD LOOKING] [FEMALE], the former is considered
the head component and the latter is the subordinate one. In other words, the head is the first
sense that comes to one’s mind about a given word. So, pretty, handsome and beautiful are
considered near synonyms because they share the same capital component.

Near-synonymy can be easily tested by expressions like or rather, and more exactly (Cruse
1986: 287) with which we can signal the minor differences between near synonyms. Let us
consider the following example,

12.a) This is a lake, or rather a pond.

b) ?This is a lake, or rather a tree.

In (12.b) above lake and tree are not near-synonyms because of the great differences between them.

In conclusion, in this study I will take the view that the phenomenon of synonymy should be
understood as a gradual cline along which we may locate different degrees of synonymy:
absolute synonymy, propositional synonymy and near synonymy. This view is consistent with
the widely held opinion among semanticists that strict or absolute synonymy is rare in human
languages (see Cruse: 1986). A further step is taken here in this study to demonstrate that
absolute synonymy does not exist in Arabic. The study will argue that Arabic never has two
words that mean nearly the same thing and are used in the same range of grammatical and
lexical patterns. To test this hypothesis, we will apply a corpus-based
analysis methodology to a list of selected Arabic word pairs which are presumed by some
Arabic linguists to be absolute synonyms, to see how credible that presumption is.

6.3 Synonymy in Arabic

Synonymy was recognised early by Arab linguists, for instance in connection with rhetoric
(balaaghah) though it did not get extensive study. The main contribution of Arab linguists was
the collection of what is called lexicons nowadays. Some works on a large scale were based
on the collection of all names, or rather descriptions, that a given word has such as
Khalawayh’s The Names of the Lion, The Names of the Snake and The Wine’s Names. More
interestingly, Al-Fayrouzabadi produced a dictionary-like book called al-rawd}u al-masluuf
fi-maa lahu ismaan ila uluuf (The Best Garden of Words (or Expressions) That Have Two to
a Thousand Names). Al-Iskafi’s mabaadi’ al-lughah (Principles of Language) is considered a
classical work on Arabic Synonymy. It was arranged according to topics like stars,

constellations, time, clothes, food, weapons, etc. However, the best known of these classical
thesauri is Thacalibi. This was an Arabic dictionary based on a concept classification.
Haywood (1965: 113) described it as follows:

It is a vast storehouse of vocabulary which sometimes gives

synonyms, and at other times distinguishes between the finer shades
of meaning of words which are roughly synonymous.

Similar works were produced by later writers; examples are al-alfaaz{ al-kitaabiyyah
(Idiomatic Expressions) and nujcat al-rraa’id wa shircat al-waarid fi-l-mutaraadif wa-l-mutaawarid
(The Spring of the Seeker in Synonyms and Associations), in which Arabic
words, including synonyms, were arranged under such headings as physical descriptions,
senses, good and bad manners, human behaviour, etc. They are all primarily concerned with
distinguishing apparent synonyms. Al-Askari’s Al-furuuq (The Differences) is another work
on synonymy, where the author tried to pursue the finer shades of differences that hold
between the seemingly synonymous words.

Such attempts were unsystematic by modern standards and cannot be regarded as equivalent to
modern thesauri, since they were not arranged alphabetically and lacked comprehensiveness.

Generally speaking, synonymy was frequently discussed, from a theoretical point of view, by
early Arab linguists. Some linguists like Sibawayhi, Al-Mubarrad and Al-Siyuti stressed that
synonymy is widespread in Arabic. On the other hand, Ibn Faris denied the existence of
synonyms because this would contradict the wisdom of Arabs, who always used words for a
reason. He argued that every word should have a specific meaning. Furthermore, Thaclab
argued that there is a difference of meaning between any given pairs of synonyms. For
example, investigating the contexts of qacada and jalasa ‘sit’ which are commonly taken as
synonyms will show that they have different meaning from each other (Versteegh et al, 1983,
p.174). Perhaps the idea of denying the existence of synonymy was introduced by Ibn Al-
Arabi (d. 802) whose apprentice Thaclab reported him saying, ‘any two forms used
synonymously by Arabs, everyone of them has a specific meaning which is missing in its

counterpart’ (Al-Anbari, Al-Addad: 7). Thaclab investigated the differences between lemmas
like qacada and jalasa manually and we can surely offer a more accurate analysis if we
investigate the phenomena computationally. However, we will not investigate this very pair,
because it has already been discussed by Thaclab. So, we will pay more attention to pairs
which are still considered as absolute synonyms.

6.4 The Repetition of Synonyms in Arabic

A general look at prose in Modern Standard Arabic shows that Arabs tend to mention two
synonyms following each other in most cases to give more rhetorical force to their
expressions. It is customarily used in situations where the speaker’s fluency is needed for
convincing the addressees especially in religious and political contexts. The speaker,
therefore, tends to use adjacent terms which share some of the semantic properties for
stylistic reasons. Ullmann (1963: 193) called this phenomenon quasi-synonymy. For example,
safety and security in for the safety and security of this state.
The repetition of synonyms in this fashion is widely used in Modern Standard Arabic. Let us
consider the following examples.

13) sharah}tu al-darsa wa fas}s}altuh

I explained and elaborated the lesson.
14) takallama wa qaala
He spoke and said.
15) yujaahidu wa yuh}aaribu fi sabiili-llaah
He fights and battles for Allah’s cause.

In the above examples it is obvious to Arabic speakers that the two different verbs in every
sentence can be rendered by a single verb in English. Dickins, Hervey and Higgins (2002:
59) noted that all major parts of speech (N, V, Adj and Adv) can undergo such a phenomenon.
They also stated that the repetition of synonyms can be ‘syndetic’, when a connective is used,
particularly with the use of adjectives or ‘asyndetic’ without using connectives. This
conjunction between seemingly synonymous words is not only acceptable in Modern
Standard Arabic but is used frequently in the everyday language as well.

We may also find this phenomenon often used in Late Classical Arabic. Let us consider the
following examples from Al-Hamadhani’s Maqamat quoted by Tamas Ivanyi (1993: 52-53):

16) taraktuhu wa ins}araft

I left him and departed.
17) fit}na wa dhakaa’
intelligence and cleverness.
18) h}iss wa shucuur
Perception and consciousness.
19) ghumuud{ wa ibhaam
Obscureness and ambiguity.

Ivanyi offered an explanation for how such pairs of conjoining seemingly synonymous words
exist in Arabic. He argued that the synonymity of such pairs could be discredited by the
virtue of semantic attributes like static and dynamic. In this way, each item of the pairs in
examples (16-19) can be either static or dynamic. However, this is not an inclusive condition
since we may have pairs in which one item can be regarded as more general than the other.
In addition, in some cases the two terms of the pair could both be dynamic or both static, so a
refinement of Ivanyi’s proposition is needed. In my view, that proposition can be restated as follows: the
meaning of one of the two items of the pair may be more general than the other.

Accordingly, in (16-19) above, one term of the pair tends to have more action than the other
in the sense that one expects the addressee to understand the repetitive synonymous term, i.e.
one of the two synonymous words as emphasis. This process is used merely for subtle
discourse as Ivanyi (1993: 53) put it:

The extended use of these and similar pairs of expressions in the

classical and Modern Literary Arabic (and not only in the literature,
but in everyday usage, too) indicates that this device may be more
than simply a rhetoric device and also points to the basically linguistic

(and not stylistic) roots of the phenomenon we called here semantic

6.5 Conclusion
This chapter discusses the various approaches and types of synonymy; this is very important
for our research orientation to instigate the analysis of our data in the following chapter based
on a detailed theoretical stance. With respect to absolute synonymy, the notion of
substitutability in all contexts can easily be grounded on corpus evidence, by comparing the
concordances of claimed synonymous items in order to point out all possible contextual
overlaps or disparities.

Chapter Seven: Collocational Treatment of
Synonymy in Arabic
7.1 Introduction
This chapter will discuss the semantic relation of synonymy, how synonyms behave in all
contexts, in order to highlight the subtle differences that might occur between them, to extract
semantic features which can make distinctions between them and to explore the possibility of
distinguishing such differences using statistical analysis of corpora.

I will argue that collocation is very useful to describe word meaning and is a mechanism by
which we account for seemingly synonymous pairs. According to Lyons (1995), the
collocational range of an expression can reveal the differences between apparent synonyms.
So collocation is one of the conditions he gives to consider a pair of words absolute synonyms.

Following Lyons, I propose that employing collocation in the analysis of synonyms can help
distinguish their meanings and reveal the similarity and/or dissimilarity that hold between
them. By this technique, it is possible to compare seemingly synonymous words to find out
whether they are real synonyms or not. As mentioned in the previous chapter, absolute
synonyms can be ruled out if we come across one context in which one of the synonymous
pair carries more meaning, has a different distribution or is used in a different register. I will
argue that absolute synonyms do not exist in terms of their collocational patterns. Through
collocation we can distinguish one sense of a word from another and know whether a
seemingly synonymous pair are real synonyms or not. Collocation is, therefore, a device with
which words of multiple senses can be accounted for precisely. In order to prove that these
subtle differences can be brought out by collocation, I will analyse the collocates for a list of
synonymous pairs.

7.2 Data Choice

The present study is restricted to a list of some selected lexemes, as shown in table (7.1)
below, frequently used by Arab linguists when discussing synonymy; the most recent of them

is Ghali (1998). The items in this list are also used in Al-Askari (non-dated), Al-Hamadhani
(1991), Al-Yaziji (1970), and Leceibi’s (1980). The meanings of these items were examined
first in four Arabic dictionaries in order to arrive at the most seemingly synonymous pairs
which are presented in Table (7.1) below. These dictionaries are: Al-Fayruzabadi’s Qaamuus
al-Muh}iit} ‘Al- Muh}iit} Lexicon’, Al-Bustani’s Muh}iit} Al- Muh}iit}, Majmac
allughah al-carabiyyah’s Al-Wasiit}}, and Ibn Manz}ur’s Lisaan Al-cArab.

This list is selected to be general words rather than genre-specific words whose usage and
meanings may differ from one domain to another. For example, the word ‘elements’ in
physics could mean ‘the four natural elements’ and in literary texts ‘factors’ or ‘principles’.

Set   POS   Synonyms
1     V     أتى ata / جاء jaa’a (come)
2     N     ذنب dhanb / إثم ithm (sin)
3     V     حسب h}asiba / ظن z}anna (think)
4     N     حب h}ubb / ود wudd (love, affection)

Table (7.1): The sets of randomly selected synonyms for our analysis.

7.3 Data Analysis

The data for the study are taken from the CAC: all forms of the words to be examined were
extracted and all irrelevant hits were eliminated manually. The words under investigation were
then categorised syntactically and according to their frequency.

Following Barnbrook (1996: 90), I will consider words that occur at least three times within
the span to be relevant for collocational analysis. This is because words that occur just once
or twice can give spuriously high significance scores.
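The frequency threshold can be sketched in code. The snippet below is only an illustration, not the procedure actually run on the CAC: the function name `span_collocates`, the toy token list and the choice of counting the word-forms that follow the node are all assumptions made for the example.

```python
from collections import Counter


def span_collocates(tokens, node_forms, span=4, min_freq=3):
    """Count word-forms found within `span` positions after any node form.

    `tokens` is a list of word-forms, `node_forms` the set of inflected forms
    of the search term. Which side of the node the window covers is an
    assumption of this sketch. Collocates seen fewer than `min_freq` times
    are discarded (the threshold of three).
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok in node_forms:
            counts.update(tokens[i + 1:i + 1 + span])
    return {word: freq for word, freq in counts.items() if freq >= min_freq}


# Hypothetical toy data: "jaa" stands in for the node, single letters for collocates.
toy_tokens = ["jaa", "a", "b", "x", "jaa", "a", "c", "jaa", "a", "b", "jaa", "b"]
kept = span_collocates(toy_tokens, {"jaa"}, span=2, min_freq=3)
print(kept)  # {'a': 3, 'b': 3}
```

In this toy run the collocate "c" occurs only once inside the window and is therefore dropped before any significance score would be computed.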

For practical reasons I suggest using the distribution of the word under investigation, as
represented in its collocates, rather than the whole concordance line. This helps us
decide from the very beginning what to look for in the concordances. The Mutual
Information statistic can help us observe which patterns are most distinct. This is a good step
in recognising what we are going to analyse, as it summarises all the concordance lines and
enables us to compare and contrast the items and so bring out the subtle differences between
seemingly synonymous items by examining their collocation.

Collocation as defined in Chapter Five does not necessarily work on adjacent words; we may
have collocation between interrupted words. Collocation can include items that habitually
collocate with other items from a definable semantic set, i.e. semantic prosody.

A semantic feature is then identified, based on their collocational distribution, to show the
difference between both items. The semantic feature that would distinguish the meaning of a
given synonym can be discovered by dividing the collocations of each item into a distinct list
according to their frequency. Then the word senses of both items are probed through their
collocation to find out the semantic attribute that makes one item different from the other, in
terms of collocation.
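The comparison of the two collocate lists can be sketched as follows. This is a hedged illustration only: the function `compare_collocates` is hypothetical, and the small dictionaries use a handful of English glosses and co-occurrence counts of the kind reported for ata and jaa’a rather than the full lists.

```python
def compare_collocates(colls_a, colls_b):
    """Split two collocate->frequency maps into shared and exclusive lists.

    Each list is ordered by descending frequency (ties broken alphabetically),
    so the most frequent distinguishing collocates come first.
    """
    shared = sorted(set(colls_a) & set(colls_b),
                    key=lambda w: (-(colls_a[w] + colls_b[w]), w))
    only_a = sorted(set(colls_a) - set(colls_b), key=lambda w: (-colls_a[w], w))
    only_b = sorted(set(colls_b) - set(colls_a), key=lambda w: (-colls_b[w], w))
    return shared, only_a, only_b


# Illustrative subsets only (English glosses stand in for the Arabic forms).
ata = {"mischief": 48, "torment": 29, "man": 9, "the night": 4}
jaaa = {"man": 81, "time": 27, "the night": 13, "victory": 10}

shared, only_ata, only_jaaa = compare_collocates(ata, jaaa)
print(shared)     # ['man', 'the night']
print(only_ata)   # ['mischief', 'torment']
print(only_jaaa)  # ['time', 'victory']
```

The exclusive lists are where a distinguishing semantic feature would be looked for; the shared list shows where the two items overlap.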

If the difference between a given pair of words is not brought out by a simple scrutiny of the
MI results, we will use the t-test statistic. An independent t-test compares the averages of two
samples that are selected independently of each other (the words in the two groups are not the same).
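As a rough sketch of what an independent t-test computes, the snippet below implements the pooled-variance (Student) form with the Python standard library. What exactly goes into the two samples is not specified at this point in the text, so the numbers are invented purely for illustration, and the function name `independent_t` is an assumption.

```python
import math
from statistics import mean, variance


def independent_t(sample_a, sample_b):
    """Return (t, degrees of freedom) for two independent samples.

    Uses the pooled-variance form, which assumes the two samples have
    roughly equal variances; this choice is an assumption of the sketch.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(sample_a)
                  + (n_b - 1) * variance(sample_b)) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / std_err, df


# Made-up scores for two groups of words (not taken from the corpus).
t_value, dof = independent_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(round(t_value, 2), dof)  # -1.0 8
```

The resulting t value is then compared with the critical value for the given degrees of freedom to decide whether the two averages differ significantly.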

For more explanation, we will apply the substitution test of one word for the other (Ullmann,
1962: 143) to see if any change happens in the meaning of the sentence based on intuition. If
we can exchange one word for the other in all contexts without changing the meaning of the
sentence to any extent, these two words are definitely eligible to be called absolute synonyms.

The remaining part of this chapter will examine the four case studies presented in table (7.1)

7.4 A case study: The word pair jaa’a and ata ‘come’
To prove the credibility of our methodology let us take the first synonymous pair: jaa’a and
ata, which are widely regarded as absolute synonyms, and then have a look at their
contextual distribution. But before that we give the definitions of jaa’a and ata as provided in
the most authentic Arabic dictionaries.
Table (7.2) Definitions of jaa’a and ata in Arabic dictionaries

The dictionaries above distinguish three main meanings for jaa’a: (1) ‘come’, (2) ‘arrive’, (3)
‘do’. The other meaning (4) ‘bring’ comes up because of the preposition bi ‘with’; however, it
is more or less closely related to the meaning (1). As for ata, it has the meanings (1), (2), (3),
(4) in addition to (5) ‘have sex’ which is euphemistically related to the meaning in (3). The
remaining meanings are mentioned because of the following prepositions: bi ‘with’ and cala
‘on’. Al-Wasiit} gives one more sense for ata: ‘approach’ which is also related to the
previous meanings. In table (7.2) the words are defined in terms of each other.
In order to analyse significant collocations, we took a number of preliminary decisions. First,
we discarded all combinations with a frequency lower than three as indicated in 7.3.
Secondly, we will see how frequent every item of the pair is in the corpus as a whole before
doing further analysis. Then we can compare that to the frequency of the words used with
them. n will stand for the total size of our corpus, x for our search term and y for the
collocate. Thirdly, because of the inapplicability of wild-card search with Arabic texts, as
mentioned in section 5.4.2, hits are calculated first to include all possible syntactic forms of
the pair under investigation. The significant collocations for ata are shown in table 7.3 below
whereas the collocations of jaa’a are represented in table 7.4.
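The counts n, f(x), f(y) and f(x, y) are the ingredients of a Mutual Information score. A minimal sketch follows, assuming the standard pointwise formulation MI = log2(f(x, y) · n / (f(x) · f(y))); the definition adopted in this study is given in Chapter Five and is not restated here, so the formula and the function name `mutual_information` are assumptions. With n = 5,000,000 and f(x) = 2219 for ata, this formulation yields 9.60 and 9.05 for the first two collocates listed in table (7.3).

```python
import math


def mutual_information(f_xy, f_x, f_y, n):
    """Pointwise MI of node x and collocate y from raw corpus frequencies.

    f_xy: co-occurrences of x and y within the span
    f_x, f_y: corpus frequencies of x and y
    n: corpus size in word-forms
    """
    return math.log2((f_xy * n) / (f_x * f_y))


N = 5_000_000   # corpus size as reported
F_ATA = 2219    # frequency of ata as reported

mi_first = round(mutual_information(48, F_ATA, 139, N), 2)
mi_second = round(mutual_information(4, F_ATA, 17, N), 2)
print(mi_first, mi_second)  # 9.6 9.05
```

A score near zero means the two items co-occur about as often as chance would predict; negative scores indicate that they co-occur less often than expected.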

n = 5000000, f(x) = 2219

Collocate (y)       Arabic     f(x, y)   f(y)     MI
mischief            الفاحشة    48        139      9.60
with sin            بذنب       4         17       9.05
unbelief            الكفر      4         34       8.05
torment             عذاب       29        590      6.79
soothsayer          كاهن       3         64       6.72
falsehood           الباطل     4         94       6.58
Allah               الله       7         168      6.55
the prophet         النبي      169       6777     5.81
no                  ما         6         6939     5.07
man                 رجل        9         8646     5.06
Jibreel             جبريل      5         386      4.86
Syria               الشام      4         520      4.11
calamity            بأس        3         404      4.06
the good            الخير      15        2269     3.89
the mosque          المسجد     5         924      3.60
heaven              السماء     4         814      3.46
to                  إلى        51        11214    3.35
messenger           رسول       46        11805    3.13
command             أمر        10        3002     2.90
Makkah              مكة        3         914      2.88
the night           الليل      4         1352     2.73
Moses               موسى       5         1985     2.50
owner               ذا         5         2023     2.47
the truth           الحق       3         1171     2.31
on                  على        74        36416    2.19
women               النساء     4         2035     2.14
his family          أهله       8         537      1.83
king                ملك        8         5046     1.78
Umar                عمر        4         3180     1.50
with it             به         35        28912    1.44
his wife            امرأته     6         401      1.22
people              بني        7         4564     0.96
from                من         24        34830    0.63
with him            ومعه       5         8738     0.36
day                 يوم        3         5243     0.36
his tribe-men       قومه       4         7901     -0.01
sin                 معصية      7         19246    -0.50
son                 ابن        5         16734    -0.57
in                  في         6         24009    -0.82
father              أبا        4         17618    -0.96
he                  هو         3         15055    -1.15
that                ذلك        3         17455    -1.36

Table (7.3) The immediate left collocates of ata in a span of four word-forms with minimum frequency of 3.

n = 5000000, f(x) = 2566

Collocate (y)       Arabic     f(x, y)   f(y)     MI
with fertility      بالخصب     3         3        10.92
with good deed      بالحسنة    8         11       9.97
empty               فارغا      5         11       9.79
visiting            زائرا      3         7        9.70
clear proofs        البينات    24        137      8.41
with lies           بالكذب     3         29       7.65
nomad               أعراب      10        244      6.31
second              ثانيا      4         111      6.13
dragging            يجر        3         87       6.07
victory             نصر        10        422      5.52
time                وقت        27        1726     4.92
the truth           الحق       14        1171     4.54
knowledge           العلم      15        1334     4.45
Ramadan             رمضان      4         385      4.33
the night           الليل      13        1352     4.22
man                 رجل        81        8646     4.19
to                  إلى        96        11214    4.06
Islam               الإسلام    12        1493     3.96
one                 أحد        22        2775     3.94
Jibreel             جبريل      3         386      3.92
owner               صاحب       7         1011     3.75
somebody            فلان       6         1005     3.54
The Qur’an          القرآن     6         1117     3.38
in                  في         117       24009    3.24
the day-time        النهار     3         706      3.04
wants               يريد       4         961      3.01
information         الخبر      6         1486     2.97
the boy             الولد      6         828      2.81
the Prophet         النبي      18        6130     2.51
wealth              مال        4         1479     2.39
from                من         67        34830    2.08
Moses               موسى       4         1985     1.97
Umar                عمر        6         3181     1.87
explanation         تأويل      6         3347     1.80
the country-men     القوم      4         2537     1.61
women               النساء     3         2035     1.52
other               آخر        6         4132     1.50
Messenger           رسول       17        11805    1.48
Abraham             ابراهيم    3         2154     1.44
on                  على        49        36416    1.39
command             أمر        4         3002     1.37
day                 يوم        7         5243     1.37
hitting             يضرب       3         3007     0.95
after               بعد        7         7383     0.88
before              قبل        3         3825     0.61
Allah               الله       15        19246    0.60
people              ناس        4         5882     0.40
already             وقد        13        24534    0.04
with                مع         4         7901     -0.01
father              أبو        8         17618    -0.17
about               عن         9         21153    -0.27
except              إلا        5         12261    -0.33
said                فقال       12        32835    -0.48
that                ذلك        4         17455    -1.16
until               حتى        3         13165    -1.17
this                هذا        3         15501    -1.40

Table (7.4) The immediate left collocates of jaa’a in a span of four with minimum frequency of 3.

Analysing the concordances of ‘ata and jaa’a shows that there is a wide range of overlap
between them; as we can see in the examples below, there are several instances where both
appear with words denoting place, person, time or abstract object. But jaa’a tends to be more
frequently used with time and, unlike ata, is always followed by the preposition ila ‘to’
before places as can be seen in the tables. For the sake of brevity, we have only given
translation and transliteration for the information which is relevant to our discussion.

فسار يمشي ويتتبع آثار الطريق حتى جاء إلى باب المدينة

fasaara yamshii wa yatatabbac aathaara al-t}ariiq h}atta jaa’a ila baab al-madiinah.

He kept going, following the road signs until he arrived at the entrance of the city.

فسار حتى أتى الشام فقتل أهلها

fasaara h}atta ata al-Shaam faqatala ahlahaa.

He went until he arrived at Syria, then he killed its people.

ثم جاء النبي صلى الله عليه وسلم يمشي في الصفوف.

thumma jaa’a al-nabiyyu …yamshii fii al-s}ufuuf.

Then came the Prophet …walking between rows.

أتى النبي صلى الله عليه وسلم بيت فاطمة فلم يدخل.

ata al-nabiyyu … bayta faat}imah falam yadkhul.

The Prophet came to Fatimah’s house but he did not enter.

فلما جاء الليل نام

falammaa jaa’a allaylu naama.

When the night came he slept.

ولما أتى الليل طلبته أمه فلم تجده

wa lamma ata allaylu t}alabatuh ummuhu falam tajidhu.

When the night came, his mother looked for him but she could not find him.

وَقُلْ جَاءَ الْحَقُّ وَزَهَقَ الْبَاطِلُ

wa qul jaa’a al-h}aqqu wa zahaqa al-baat}ilu.

And say: ‘Truth has come and falsehood has vanished’.

وهل يأتي الخير بالشر؟

wa hal ya’ti al-khayru bi-l-sharri?

Does the good bring evil?
Let us now study the statistics given in tables (7.3) and (7.4) above to see how similar or
dissimilar the collocations of the word pair under examination are. The first obvious point
is that in table (7.4) the most statistically significant collocate of jaa’a, i.e. the
collocate with the highest MI score, is bi-l-khis}b ‘with fertility’, with an MI score of 10.92. As
for ata, table (7.3) shows that al-faah}ishah ‘mischievous deed’ is the strongest collocate,
with an MI score of 9.60. Can we say then that the semantic feature which distinguishes between
jaa’a and ata is positivity vs. negativity?

Actually, we cannot come up with an exclusive distinction between jaa’a and ata by making
such a simple analysis, simply because we should be aware of the fact that words could have
multiple senses and different syntactic forms could entail different senses. So we need to
make a more precise analysis before coming to a conclusion.
It is important to mention that jaa’a and ata followed by the preposition bi (with) are
frequently used in the CAC with the meaning ‘to bring’, but our pre-theoretical approach to what
a word is does not count prepositions or conjunctions that are attached to the root word. ata,
in particular, has several meanings in different contexts. For example, ata followed by the
preposition cala means ‘to finish off or destroy something’: ata cala al-t}acaam (he has
finished all the food), ata cala al-akhd}ar wa-l-yaabis (he destroyed everything; literally: he
destroyed the cultivated and non-cultivated land). It can also be used metaphorically to refer
to having sex, for example ata imra’tahu/ahlahu (to have sex with his wife). Using ata in
this sense is a euphemism of a kind widely used in the Qur’an. However, we will restrict
ourselves to analysing only one sense of ata, namely ‘come’, to set it off against jaa’a, which
is mainly used in this sense. For example, we manually eliminated instances where ata means
‘commit’, which constitute about 3% of all occurrences of ata. To look at the sense
‘come’ only, we have to proofread our counts manually and exclude all the instances
which have other meanings. This particular use of ata and jaa’a is interesting to analyse
because their meanings are so similar that native speakers of Arabic tend to use them
interchangeably. This gives another dimension to the use of both verbs, in addition to the
previous differences brought out between them.

A closer look at the words reveals that the two words are not synonymous all the time. We

cannot always use the two words interchangeably. I examined all the concordances of ata and
jaa’a throughout CAC which enabled me to come up with the following three major
distinctions between them. I made a further analysis of the concordances of jaa’a and ata
with a minimum frequency of three. The result of this further analysis will be tested later on
by t-test statistic as shown in table (7.5) below. Now let us have a look at the following uses
of both of them:

i. When ata is followed by a place it means that place is not a destination point.

حَتَّى إِذَا أَتَوْا عَلَى وَادِي النَّمْلِ قَالَتْ نَمْلَةٌ يَا أَيُّهَا النَّمْلُ ادْخُلُوا مَسَاكِنَكُمْ لَا يَحْطِمَنَّكُمْ سُلَيْمَانُ وَجُنُودُهُ وَهُمْ لَا يَشْعُرُونَ

h}atta idhaa ataw cala waadi al-namli qaalat namlatun ya ayyuha al-namlu
udkhulu masaakinakum…
When they came to a valley of ants, one of the ants said: ‘O you ants, get into your
habitations’ (Qur’an, Al-Naml: 18).

فَانطَلَقَا حَتَّى إِذَا أَتَيَا أَهْلَ قَرْيَةٍ اسْتَطْعَمَا أَهْلَهَا فَأَبَوْا أَن يُضَيِّفُوهُمَا

fa-nt}alaqaa h}atta idha atayaa ahla qaryatin…

Then-proceeded-they-(dual) till when came-they-(dual) people town, asked-food-
they-(dual) people-it, but-refused-(they) to entertain-them
Then they [Moses and Al-Khidr] both proceeded, till, when they came to the people
of a town, they asked them for food, but they refused to entertain them. (Qur’an, Al-
Kahf: 77)

A full translation35 of the example in (5.a) can make the meaning clearer.

When they [Solomon’s army] came to a valley of ants, one of the ants said: “O you ants, get
into your habitations, lest Solomon and his troops crush you without knowing it.” (An-Naml: 18)

35 The translations of Qur’anic verses are taken from Al-Hilali and Khan’s The Noble Qur’an,
slightly amended to omit information which is irrelevant to the main discussion, including the
exegetical glosses which the translation marks by inverted commas or brackets. We focus only on
the phrases which contain the words under investigation, so we deleted the transliterated
glosses and all extra explanatory comments added by the translators for elucidation.

The ants’ colony was not meant to be the destination point for Sulayman and his army, nor
did they stay there for a long time. The whole army was only passing by the colony when
Sulayman heard the ant warning the rest of the colony of imminent destruction by
Sulayman and his army.

Example (5.b) is part of the story of Moses and Al-Khidr, in which Moses set out on a journey
in search of that knowledgeable person. After Moses had found him, Al-Khidr started teaching
him a series of practical lessons. They then passed by a town, which was not their final
destination, where they became hungry, so they asked its people for food, but the people of
that town refused to host them.

Conversely, the place that follows jaa’a is meant to be a destination point where one can stay
for a longer time or for ever, so it gives a sense of stability.
.‫ ث نام‬،‫ فصلى أربع ركعات‬،‫ ث جاء إل منله‬،‫صلى النب صلى ال عليه وسلم العشاء‬

5.c s}alla al-nabiyyu… al-cishaa’, thumma jaa’a ila manzilihi,… thumma naama.

‘The Prophet performed the evening prayer, then he came to his house, where he prayed four
units of prayer (rakcaat), and then he slept.’

‫حتْ أَْبوَاُبهَا‬
َ ‫َوسِيقَ اّلذِينَ اّت َقوْا رَّبهُمْ ِإلَى اْلجَنّةِ زُمَرًا حَتّى إِذَا جَاؤُوهَا َوفُِت‬

5.d …h}atta idha jaa’uha wa futih}at abwaabuha

‘And those who were pious to their Lord will be led to Paradise in groups, till, when
they reach it and its gates will be opened’ (Qur’an, al-Zumar: 73).

In (5.c) the Prophet returned to his house to sleep after performing his prayers. His house is
therefore an end point, as he did not mean to carry on to any other place. In (5.d)
Paradise is the final abode of the pious people, so when they come to it they will live therein.

ii. When jaa’a is followed by an event, it means that the event has been awaited or expected.

ُ‫إِذَا جَاء َنصْرُ اللّهِ وَاْلفَتْح‬

6.a idhaa jaa’a nas}ru Allahi wa al-fath}u.

‘When the victory of Allah and the conquest (of Makkah) come’ (Qur’an, al-Nas}r:

ْ ‫فإِذَا جَاء وَ ْعدُ الخِرَةِ ليسوءوا وُجُو َه ُكمْ َولِيَدْ ُخلُواْ اْل َم‬

6.b fa-idhaa jaa’a wacdu al-‘aakhirati…

‘Then, when the second promise comes, (they will make your faces sorrowful and
enter the mosque (of Jerusalem))’ (Qur’an, al-Isra: 7).

In (6.a) the conquest of Makkah and the victory over the disbelievers of Makkah were
something which the Prophet and all Muslims were longing for. They had been expelled from
their own hometown, Makkah, without a just cause and had left everything behind. In addition,
since the advent of Islam, they had been prevented from performing their pilgrimage to the Holy
House to fulfil the duty which Allah had imposed upon them. Likewise, the example in (6.b)
is mentioned in the context of the conflict between Muslims and the Jews, where Allah
promises the Muslims that they will return to their mosque and defeat the Jews in the end.
Indeed, freeing Jerusalem and the Al-Aqsa mosque is the dream of all Muslims; they are all
waiting for Allah’s promise to come.

On the other hand, ata associates with things that happen unexpectedly. For example,

َ‫قُلْ َأرَأَيُْتكُم ِإنْ أَتَا ُكمْ َعذَابُ اللّهِ َأوْ أَتَْت ُكمُ السّاعَةُ أَغَيْرَ اللّهِ َتدْعون‬

6.c qul ara’ytakum in ataakum cadhaabu Allah…

‘Say: “Tell me, if Allah’s Torment comes upon you, or the Hour comes upon you,
would you then call upon anyone other than Allah?”’ (Qur’an, Al-Ancam: 40).

ْ‫حَّتىَ إِذَا أَ َخذَتِ ا َلرْضُ زُخْ ُرَفهَا وَازّيَّنتْ َوظَنّ أَ ْهُلهَا أَّن ُهمْ قَا ِدرُونَ َعلَْيهَا َعلَْيهَا أَتَاهَا َأمْرُنَا لَيْلً َأو‬

َ ‫جعَلْنَاهَا َحصِيدًا كَأَن ّلمْ َتغْنَ بِا‬
َ ‫َنهَارًا َف‬

6.d h}atta idhaa akhadhati al-ard}u zukhrufahaa… atahaa amrunaa…

‘When the earth is clad with its adornments and is beautified, and its people think
that they have all the powers of disposal over it, Our Command reaches it by night
or by day and We make it like a clean-mown harvest, as if it had not flourished
yesterday’ (Qur’an, Yunus: 24).

In (6.c & d) the events are not expected because Allah keeps such things hidden so that every
person is rewarded for what he does, and the people are not aware of what is hidden for them.

iii. jaa’a means ‘arrive’ as shown in (5.c & d) above, whereas ata has a sense of
approaching a place or a time.

ِ ‫أَتَى َأمْرُ اللّهِ َفلَ َتسَْت ْع‬

ata amru Allahi fala tastacjiluuh

‘(Inevitable) cometh (to pass) the Command of Allah: seek ye not then to hasten it’
(Qur’an, Al-Nah}l: 1).

Allah’s command here is the Last Day (the Day of Judgement). Read as ‘has come’, the verb
apparently contradicts the situation, since that Day has not yet arrived; ata here rather
means ‘approached’.

The evidence mentioned in (5.a) above can also be used here.

ُ‫َحتّى إِذَا أََتوْا َعلَى وَادِي الّنمْلِ قَاَلتْ َن ْملَةٌ يَا أَّيهَا الّنمْلُ ادْ ُخلُوا مَسَاكَِن ُكمْ لَا َيحْ ِطمَّن ُكمْ ُسلَْيمَانُ وَجُنُودُه‬

ْ ‫وَ ُهمْ لَا َي‬

h}atta idha ataw cala waadi al-namli qaalat namlatun ya ayyuha al-namlu udkhulu
‘When they came to a valley of ants, one of the ants said: “O you ants, get into your
habitations”’ (Qur’an, Al-Naml: 18).

The English translation by Dr. Muhsin Khan and Dr. Muhammad Al-Hilali given below
renders ata as ‘at length … came’, which is a close interpretation of the meaning of ‘come’
in this context. Indeed, Sulayman and his army had not reached the ants’ colony yet; they
were still at its outskirts, because one of the ants asked the rest of the ants to go inside
their habitations.

Khan and Al-Hilali’s translation of the above verse:

At length, when they came to a valley of ants, one of the ants said: ‘O ye ants, get into your
habitations, lest Solomon and his hosts crush you (under foot) without knowing it’ (Qur’an,
An-Naml: 18).

More interestingly, the slight change in the contextual use between ata and jaa’a in the
following three verses can bring out the subtle difference between them. Transliteration is
provided for the underlined Arabic words. We also marked the similar parts throughout the
following three examples with square brackets.

‫ َفَلمّا أَتَاهَا‬.َ‫قَالَ لِأَ ْهلِهِ ا ْمكُثُوا إِنّي آَنسْتُ نَارًا لّ َعلّي آتِيكُم مّْنهَا ِبخَبَرٍ َأوْ َج ْذوَةٍ مِنَ النّارِ لَ َعّل ُكمْ َتصْ َطلُون‬

.ُ‫شجَرَةِ أَن يَا مُوسَى إِنّي أَنَا اللّه‬

ّ ‫نُودِي مِن شَا ِطئِ اْلوَادِي الْأَْيمَنِ فِي الُْب ْقعَةِ اْلمُبَارَكَةِ مِنَ ال‬

8.a qaala li-ahlihi imkuthuu innii aanastu naaran lacallii aatiikum minhaa
bi-khabarin… falammaa atahaa nuudiya…
‘[(i) He said to his family: “Tarry you;] [(ii) I perceive a fire;] [(iii) perhaps I can
bring you from there some information, or a burning firebrand, that you may warm
yourselves.”] [(iv) But when he came to the (fire)], [(v) he was called] from the right
bank of the valley, from a tree in hallowed ground: “O Moses! Verily I am Allah”’
(Qur’an, Al-Qasas: 29-30).

‫ستُ نَارًا ّلعَلّي آتِيكُم مّْنهَا ِبقَبَسٍ َأوْ أَ ِجدُ َعلَى النّارِ ُهدًى َفَلمّا أَتَاهَا نُودِي يَا‬
ْ ‫َفقَالَ لِأَ ْهلِهِ ا ْمكُثُوا إِنّي آَن‬

‫مُوسَى إِنّي أَنَا رَبّكَ فَا ْخَلعْ َنعْلَيْكَ إِنّكَ بِاْلوَادِ اْل ُم َقدّسِ ُطوًى‬

8.b faqaala liahlihi imkuthu innii aanastu naaran lacallii aatiikum minha bi-qabasin…
falamma ‘aataaha nuudiya…
‘So [(i) he said to his family, “Tarry you;] [(ii) I perceive a fire;] [(iii) perhaps I can
bring you some burning brand therefrom, or find some guidance at the fire.”] [(iv)
But when he came to the fire], [(v) he was called:] “O Moses! Verily I am thy Lord!
therefore put off thy shoes: thou art in the sacred valley Tuwaa”’
(Qur’an, Taha: 10-11).

‫شهَابٍ قَبَسٍ لّ َعّل ُكمْ َتصْطَلُونَ َفَلمّا‬
ِ ‫قَالَ مُوسَى لِأَ ْهلِهِ إِنّي آَنسْتُ نَارًا سَآتِيكُم مّْنهَا ِبخَبَرٍ َأوْ آتِيكُم ِب‬

َ‫جَاءهَا نُودِيَ أَن بُورِكَ مَن فِي النّارِ َومَنْ َح ْوَلهَا َوسُْبحَانَ اللّهِ رَبّ اْلعَاَل ِمي‬

8.c qaala muusa liahlihi innii aanastu naaran sa’aatiikum minhaa bi-khabarin…
falamma jaa’aha nuudiya…
‘[(i) Moses said to his family:] [(ii) “I perceive a fire;] [(iii) soon will I bring you
from there some information, or I will bring you a burning brand, that you may
warm yourselves.”] [(iv) But when he came to it], [(v) a voice was heard:] “Blessed
are those in the fire and those around: and Glory to Allah, the Lord of the Worlds”’
(Qur’an, Al-Naml: 7-8).

As shown above, we have three verses from different surahs (chapters) relating the story of
Moses when he saw the fire where Allah talked to him. The story is worded differently in
these three surahs, because every verse tells one aspect of the story. We can notice that
there are similar parts in each verse (as marked in i-v). The remaining parts make the
meanings of the three verses different from one another. For example, in (8.a) and (8.b)
Moses asked his family to wait until he went and saw the fire. The verb ata ‘bring’, marked
[iii] in both verses, is used in the subjunctive form to express a wish whose fulfilment is
uncertain. In (8.c) Moses does not ask his family to wait, and the verb ata ‘bring’ is used
in the near future, which expresses certainty. The fireplace was so remote that he had to
promise his family not to give up. He intends to do his best to get some information from the
people around the fire, or to get a burning brand from it so that they can warm themselves.
This is to reassure his family even if the fire is far away or he takes a long time.
Therefore he used the verb without a modal of uncertainty.

Most importantly, in (8.a.) and (8.b.) ata is used to indicate that Moses is still far from the
actual fireplace, because the call following ata in (8.a.) comes from the bank of the valley,
and (8.b.) mentions that he is in the sacred valley and has not arrived at the fireplace. But in
(8.c.) the call implies that Moses arrived at the fireplace because Allah says ‘blessed are those
in the fire and those around’; ‘those in the fire’ refers to Moses and ‘those around’ are angels
as al-Razi said. In Arabic one can use the preposition fi ‘in’ to mean absolute closeness.

So, jaa’a ‘come’, marked [iv] in (8.c), is used to relate the final part of the story after
Moses’ arrival at the fireplace, where he talked to Allah. It is interesting to note that the
two verses (8.a & 8.b) employ the verb ata ‘come’, marked [iv] in both, to refer to a degree of
nearness to the fireplace, whereas their equivalent in (8.c) uses jaa’a to describe a state of
absolute closeness.

One more piece of evidence that supports the above argument is that the word Allah occurred
in object position with ata 7 times and did not occur at all with jaa’a. Let us consider the
following example,

ٍ‫َيوْمَ لَا يَن َفعُ مَالٌ وَلَا َبنُونَ إِلّا َمنْ َأتَى اللّهَ ِب َقلْبٍ سَليم‬

yawma la yanfacu maalun wa la banuun illa man ata allaaha biqalbin saliim.
The Day whereon neither wealth nor sons will avail except him who came to Allah
with clean heart

ata is used in the above example, because Allah is not limited to a place nor can vision grasp
Him, so no one can come to a point of closeness to Allah’s entity like a physical object.

Let us now use the t-test to show what sort of differences holds between jaa’a and ata as
shown in table (7.5) below. To do the test it would be better to stick to one sense of the words
under investigation.36 We will analyse the most significant left collocates, i.e. items with the
highest MI scores. As mentioned earlier, we will be restricted to analysing only one sense of
the pair, namely ‘come’.

w                        f(w)    f(jaa’a/w)   f(ata/w)   t       significance

time                     1726    27           1          4.91    P < 0.0001
clear proofs (البينات)    137     24           0          4.89    P < 0.0001
knowledge                1334    15           0          3.87    P < 0.0001
nomad                    244     13           1          3.20    P < 0.01
victory                  422     10           1          2.71    P < 0.01
the truth                1171    14           3          2.66    P < 0.01
empty                    11      5            0          2.23    P < 0.05
visiting                 7       3            0          1.73    P < 0.20
dragging                 87      3            0          1.73    P < 0.10
second                   111     4            1          1.34    P < 0.20

the Prophet              6777    18           169        11.04   P < 0.0001
torment                  401     1            29         5.11    P < 0.0001
Allah                    168     0            7          2.64    P < 0.01
disbelief                406     0            4          2.00    P < 0.05
Syria                    520     0            4          2.00    P < 0.05
calamity                 404     0            3          1.73    P < 0.10
soothsayer               64      0            3          1.73    P < 0.10
command                  537     4            10         1.60    P < 0.20
falsehood                94      1            4          1.34    P < 0.20
Gabriel (جبريل)           386     5            5          0       Not sig.

Table (7.5) the most significant ten left collocates37 with jaa’a
(the top ten words) and ata (the last ten words).

36 Customarily, the test can be done generally without restricting it to one sense of the words under
investigation. To me, it would be easier if we chose to do the calculation inside a closed set for short cut and
quick results, i.e. I will search the items whose MI scores are significant, which co-occurred with ja’a and ata in
a particular sense.

The t-scores in table (7.5) show the differences between jaa’a and ata: the former has a
strong tendency to occur in positive contexts, whereas the latter has a negative sense. The
bigger the t-score, the more the two verbs differ in their use with the collocate under
examination. jaa’a gets its highest scores with the following positive items: al-cilm
‘knowledge’, al-h}aq ‘the truth’, and al-bayyinat ‘clear proofs’. On the other hand, ata
frequently co-occurs in negative contexts: cadhaab ‘torment’, al-kufr ‘disbelief’, ba’s
‘calamity’, kaahin ‘soothsayer’38, ‘amr ‘command’ (meaning difficulty or torment), and baat}il
‘falsehood’. The collocates with the highest scores in the table are those for which jaa’a and
ata are most likely to differ from each other.39 Therefore, ata and jaa’a, as shown in table
(7.5) above, are not synonymous because they are used in a different range of contexts.
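
The t-scores in table (7.5) can be reproduced, up to rounding, from the two co-occurrence
frequencies alone. The following is a minimal Python sketch, assuming the common approximation
t = |f(jaa’a/w) - f(ata/w)| / sqrt(f(jaa’a/w) + f(ata/w)). This formula and the function name
t_score are assumptions made for illustration; they are not the statistical programs actually
written for this research.

```python
import math


def t_score(freq_a: int, freq_b: int) -> float:
    """Approximate t-score for the difference between two co-occurrence counts.

    freq_a and freq_b are the frequencies with which one collocate occurs
    with each member of a near-synonym pair (an assumed simplification).
    """
    total = freq_a + freq_b
    if total == 0:
        # No co-occurrences at all: no evidence of any difference.
        return 0.0
    return abs(freq_a - freq_b) / math.sqrt(total)


# A few (f(jaa'a/w), f(ata/w)) pairs copied from table (7.5).
rows = {
    "time": (27, 1),
    "the Prophet": (18, 169),
    "torment": (1, 29),
    "Gabriel": (5, 5),
}

for word, (f_jaaa, f_ata) in rows.items():
    print(f"{word:12s} {t_score(f_jaaa, f_ata):6.2f}")
```

For waqt ‘time’ (27 vs. 1) the sketch gives about 4.91, and for Gabriel (5 vs. 5) it gives 0,
in line with the table. Note that the score only measures how different the two counts are; it
says nothing about whether the collocate itself is positive, negative or neutral.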

Two points might seem to contradict the above conclusion. In the first place, the positive
use of ata in table (7.3), as in ya’ti ‘comes’ followed by khayr ‘good’ or h}aqq ‘truth’, is
not considered strong evidence because these items are only used with ata in its present tense
form. We think there might be a morphological reason why ata in its present tense form is used
for both negative and positive senses: jaa’a in its present tense form, i.e. yajii’, is not as
easy to pronounce as ata. jaa’a in its present tense form occurs 181 times whereas the
corresponding form of ata occurs 980 times, so jaa’a is about five times less common in CAC
than ata in the present tense. Secondly, the high t-scores in table (7.5) with al-nabiyy ‘the
Prophet’ (11.04) and waqt ‘time’ (4.91) do not bear on the positive/negative distinction
because both collocates are neutral, so they fall in the area of overlap between ata and jaa’a,
as indicated in figure (7.6) below.

37 These collocates are identified by MI statistic.
38 Soothsaying is forbidden in Islamic religion and is classified as a major sin.
39 We ignore all hits which have neutral senses.

Analysing the concordances of jaa’a and ata with a minimum frequency of 1 can show their
tendency to occur in negative or positive contexts. Further examples from CAC show that
ata is overwhelmingly used in unpleasant contexts. The main collocates concern committing
sins, trouble, and falsehood. Figure (7.6) below shows the contextual preference of both of
them.

Figure (7.6) the collocational differences of jaa’a and ata with minimum frequency 1.

The native speakers of Arabic are themselves unaware of these collocational differences
between jaa’a and ata. The only difference brought out by Al-Askary, who belongs to the
Classical period, in Al-Furuuq is that ata requires a complement. For example,

9.a jaa’a alrajulu nafsuhu.
came the-man self-him
The man arrived himself.

9.b *ata alrajulu nafsahu.
came the-man self-him
The man came himself.

Otherwise they can replace each other without any loss of meaning. This is not consistent
with Al-Askary’s proposition that difference in form must produce difference in meaning but
that difference was abandoned as time passed (Al-Askary, Al-Furuq: p. 9).

To me, the use of jaa’a in (9.a) above is consistent with our approach that jaa’a is always
followed by the preposition ila ‘to’ before places. Therefore, the missing preposition in (9.a)
eliminates the possibility of a following category that refers to a place. So, the use of jaa’a in
(9.a) involves some sort of directional motion which implies an action not toward a place but
rather toward the speaker. On the other hand, the multiplicity of the senses40 with ata makes
leaving the complement position empty as in (9.b) above, ambiguous.

7.4.1 Summary
To sum up, the analysis of the seemingly synonymous pair jaa’a and ata was carried out in
three stages in order to highlight the subtle differences between them. The first stage
consisted of a lexical search for all occurrences in CAC of the tokens jaa’a and ata. The
second stage involved the categorisation of the tokens syntactically and according to their
frequency; this included manual elimination of all irrelevant hits. In the third stage, we used
MI to highlight the collocations of both, and then brought out some distinctions between the
two items by analysing their contexts. We finally used the t-test to capture the subtle
differences between the pair by extracting a semantic feature which can differentiate
between them, i.e. negativity vs. positivity.

40 The senses with ja’a are all related to a directional motion, whereas the senses with ata are diverse and some
of them are metaphorical or euphemistic.

7.5 A case study: the word pair ithm and dhanb ‘sin’
ithm and dhanb ‘sin’ are commonly treated as synonymous as shown in the dictionary
definitions in (7.7) below. A casual account of the two words reveals that the two Arabic
nouns ithm and dhanb, which have a similar semantic and syntactic form and also a broadly
similar frequency (645 vs. 917 word forms) have been used in CAC to mean ‘committing a
bad deed’ in general.


Table (7.7) Definitions of ithm & dhanb

In the last section we used a range of two words on either side of the node to get an
understanding of the contextual distribution of a given pair. In Arabic, as mentioned in
chapter five, some differences might be overlooked within that short range due to the
syntactic structure of the language. Indeed, a small window does not seem effective for
languages with many non-adjacent complements, which result in non-adjacent collocations, as
shown in table (7.8) below. We searched for both ithm and dhanb in a span of 3:3 and the result
was inconclusive for both. So, let us consider the top ten collocates of the pair under
investigation to see how insufficient a span of 3:3 is, as indicated in the following table.

3L (Frq) | 2L (Frq) | 1L (Frq) | Search | 1R (Frq) | 2R (Frq) | 3R (Frq)

تأخر delay (41) | قال say (67) | عليه on him (169) | ithm | فلا no (181) | تأخر stay (60) | في in (62)
نفعهما their usefulness (20) | من from (46) | وإثمك and your sin (31) | ithm | أو or (54) | يومين two days (47) | ومن and from (60)
من from (18) | ومن and from (43) | كبير much (30) | ithm | تبوء carry (44) | أن that (46) | موص the testator (30)
للناس for people (14) | في in (26) | من from (26) | ithm | من from (44) | جنفا unjust (34) | قال said (27)
فلا so not (13) | يقول says (21) | على on (25) | ithm | بإثمي with my sin (31) | تبوء carry (28) | والميسر and gambling (14)
غفر was forgiven (12) | بينهم between them (16) | قال said (24) | ithm | فيهما in them (24) | غير without (28) | فأصلح so he reconciled (13)
في in (10) | يعني means (15) | يقول says (23) | ithm | باب chapter (19) | قل say (23) | ما no (13)
برئ innocent (9) | ومنافع and benefits (15) | أكبر bigger (23) | ithm | لا no (19) | قال said (19) | من from (11)
إثم sin (7) | أي namely (12) | في in (21) | ithm | متجانف deliberately (17) | قوله his saying (18) | أن that (8)
حرج wrong (7) | لا no (12) | عليه on him (18) | ithm | فإنما verily (15) | بينهم between (15) | كبير much (8)

Table (7.8): The top ten collocates of ithm in a span of 3:3.

In the table above we could not find any statistically significant collocation for ithm except
for one item: al-maysir ‘gambling’, which will be analysed in table (7.10) below. The span of
3:3 mostly returned fragments of non-adjacent complements. To take an example, the underlined
words in the table above are part of a Qur’anic verse about performing pilgrimage, which reads:
ِ‫جلَ فِي َيوْ َمْينِ َفلَ ِإْثمَ َعَليْ ِه وَمَن َتأَخّرَ فَل ِإثْمَ عََلْيه‬
ّ ‫فمَن تَ َع‬

faman tacajjala fi yawmayni fala ithm calayhi wa man ta’akhar fala ithm calayh.
so-who haste-becomes in days-two so-no sin on-him and who late-becomes so-no
sin on-him.
But whosoever hastens to leave in two days, there is no sin on him, and whosoever
stays on, there is no sin on him. (Qur’an, al-Baqarah: 203)

The verb tacajjala ‘hastens’ does not appear in the table as a collocation for ithm because it
comes fourth to the right, so the span of 3:3 failed to capture it.
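
The effect of the span size on this verse can be illustrated with a small Python sketch. It
is only an illustration under stated assumptions: the function collocates_in_span is
hypothetical (it is not part of the Monoconc tool), the tokens are the whitespace-separated
words of the transliteration above rather than the Arabic word forms, and a negative offset
means that the token precedes the node in reading order, which in Arabic script is the position
to its right.

```python
def collocates_in_span(tokens, node, left, right):
    """Collect (offset, token) pairs within `left`/`right` positions of each hit of `node`.

    Negative offsets precede the node in reading order, positive offsets follow it.
    """
    found = []
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - left)
        end = min(len(tokens), i + right + 1)
        for j in range(start, end):
            if j != i:
                found.append((j - i, tokens[j]))
    return found


# Transliteration of al-Baqarah: 203 as given above, split on whitespace.
verse = ("faman tacajjala fi yawmayni fala ithm calayhi "
         "wa man ta'akhar fala ithm calayh").split()

narrow = collocates_in_span(verse, "ithm", 3, 3)  # span 3:3
wide = collocates_in_span(verse, "ithm", 7, 7)    # span 7:7

print("tacajjala" in [token for _, token in narrow])
print((-4, "tacajjala") in wide)
```

With a span of 3:3 the verb tacajjala is not among the collocates of either occurrence of ithm,
whereas a span of 7:7 captures it four positions away from the first occurrence.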

As mentioned in section 5.2 we chose to work on flexible spans since there might be some
expressions in Arabic that stretch over the average span: 4:4. In addition, it is hard to capture
in that span the semantic features that stretch over several units not included in our span41.

Further analysis of the pair under investigation by using MI statistic and in a bigger span
(7:7) shows more interesting collocations which are statistically significant. Therefore the
span size for this study is set to 7:7 i.e. seven word forms to the left and to the right.42 For
example, some verbs co-occur more often with one item than the other as in table (7.9).

(y) (dhanb/w)                   f(x, y)   F(y)    MI

ask for forgiveness (استغفر)     85        161     11.49
sin (يذنب)                       17        33      11.45
confess (اعترف)                  5         49      9.11
precede (تقدم)                   45        540     9.09
repent (يتوب)                    13        185     8.58
follow (تأخر)                    7         149     8.00
hit (يصيب)                       3         755     4.43
approach (يأت)                   8         2219    4.29
do (فعل)                         4         1764    3.62
fall (يقع)                       3         1995    3.03

(y) (ithm/w)                    f(x, y)   F(y)    MI

bear (تبوء)                      44        59      12.49
intend (يتعمد)                   11        24      11.79
allege (يفتري)                   4         32      9.92
gain (يكسب)                      16        194     9.32
avoid (يجتنب)                    5         61      9.31
increase (يزداد)                 12        200     8.86
earn (يكتسب)                     5         179     7.75
carry (احتمل)                    15        542     7.74
record (يكتب)                    5         928     5.83
incur (يوجب)                     4         615     5.65

Table (7.9) the top ten verb collocates of ithm & dhanb in a span of 7 words.

41 In fact, taking a span of 3 or 4 words can reveal some interesting differences between the collocates of ithm
and dhanb. However, in the first place, if we stick to such a span in this case, we will overlook many interesting
collocations. Secondly, we would like to make use of all possible collocations in our relatively small corpus.
42 The maximum span I can handle automatically using the Monoconc tool is 3:3. So, I use Microsoft Word
to capture the nodes that extend over that span. I run the concordance first and save the result into a text-only
file, then I use Microsoft Word to count the hits which I see as relevant to my search term, such as an adjective
modifying it which is located far apart in the line.

The most significant V+N collocation of dhanb, i.e. hits with high MI scores as shown in the
table above, are astaghfiru ‘I ask for forgiveness’, yudhnib ‘sin’, taqaddama ‘precede’,
ictarafa ‘confess’ and yatuub ‘repent’. As for ithm, tabuu’ ‘bear’, yatacammad ‘intend’,
yaftari ‘allege’, yaksib ‘gain’, and yajtanib ‘avoid’ are the strongest collocates.
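
The MI scores in table (7.9) are consistent, up to rounding, with the standard mutual
information formula MI = log2(f(x, y) * n / (f(x) * F(y))). The following Python sketch assumes
that formula, together with the corpus size n = 5000000 and the node frequencies 645 (ithm) and
917 (dhanb) stated with table (7.10). The function and constant names are hypothetical and are
used for illustration only.

```python
import math

CORPUS_SIZE = 5_000_000                  # n, as stated with table (7.10)
NODE_FREQ = {"ithm": 645, "dhanb": 917}  # f(x) for the two nodes


def mutual_information(f_xy: int, f_x: int, f_y: int, n: int = CORPUS_SIZE) -> float:
    """Mutual information of node x and collocate y (assumed standard formula).

    f_xy: co-occurrence frequency of x and y within the span
    f_x:  corpus frequency of the node
    f_y:  corpus frequency of the collocate
    """
    return math.log2(f_xy * n / (f_x * f_y))


# astaghfiru 'ask for forgiveness' with dhanb: f(x, y) = 85, F(y) = 161
mi_forgive = mutual_information(85, NODE_FREQ["dhanb"], 161)
# tabuu' 'bear' with ithm: f(x, y) = 44, F(y) = 59
mi_bear = mutual_information(44, NODE_FREQ["ithm"], 59)

print(round(mi_forgive, 2), round(mi_bear, 2))
```

For astaghfiru with dhanb this gives about 11.49, which matches the first row of the table. A
high score arises when the collocate is rare in the corpus as a whole but frequent near the
node, which is why low-frequency verbs such as tabuu’ rank at the top.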

According to our approach, there must be a difference in meaning between these two items
because of their contextual differences. The first question which can be raised, then, is why
such collocates appear more frequently with either item. For example, ‘ask for forgiveness’
and ‘repent’ often co-occur with dhanb. In the Islamic creed sins can be forgiven, whether
venial or deadly, except blasphemy against Allah. But how can one attain forgiveness?

Sins can be forgiven by doing good deeds and/or by repentance. There are sins which can only
be forgiven through repentance, namely when a wrong is done against people. Then there must be
an extra action: compensating those who have been wronged or obtaining their forgiveness.
Therefore, we can say that every repented sin can be forgiven: venial sins by the act of inner
repentance alone (by asking for forgiveness, or practically by doing good deeds and refraining
from bad deeds), and mortal sins by repentance expressed through compensation of or
reconciliation with those whom one has wronged. On the other hand, the collocations of ithm do
not reveal how sins are expiated. They rather refer to a state of accumulation of such sins.
Let us now examine all other types of collocations, as shown in (7.10) below, to support or
eliminate that distinguishing feature.

n = 5000000, x(ithm/w) = 645, x(dhanb/w) = 917

(y) (ithm/w)             f(x, y)   F(y)    MI

changing (تبديل)          15        21      12.43
staying on (التأخر)       99        178     12.07
walking (المرور)          3         6       11.92
charity (الزكاة)          6         6       11.78
gambling (الميسر)         28        105     11.01
lying (الكذب)             4         19      10.67
property (أموال)          5         35      10.11
eating (الميتة)           8         88      9.46
drinking (الخمر)          28        542     8.64
usury (الربا)             3         240     6.59

(y) (dhanb/w)            f(x, y)   F(y)    MI

fornication (الزنا)43     11        232     8.01
unbelief (الكفر)          11        34      10.78
treaties (الميثاق)        3         10      10.67
apostasy (الردة)          3         16      9.99
orphans (اليتامى)         3         35      8.86
theft (السرقة)            3         45      8.50
major sins (الكبائر)      5         93      8.19
oppression (الظلم)        9         193     7.99
murder (القتل)            38        2643    6.29

Table (7.10) the most significant noun collocates with ithm (the top ten words) and dhanb
(the last nine words) with minimum frequency 3.

The underlined words in the table above are parts of multi-word religious concepts which are
frequently used in CAC. They read as follows: تبديل الوصية ‘changing the will’,
التأخر أو التعجل في الحج ‘staying on or hastening to leave in pilgrimage’,
المرور بين يدي المصلي ‘walking in the prostration position of someone praying’,
الكذب على الله ‘lying to Allah’, منع الزكاة ‘not paying charity’, and نقض الميثاق
‘breaking treaties’.

The table reveals that ithm is mainly used for sins that are personal or do not entail a
punishment in this world, like missing some obligatory acts of worship or doing a bad deed
that rebounds on oneself, like drinking, gambling, lying to Allah, etc. On the other hand,
dhanb is used for sins that entail punishment in this world or the next, e.g. killing, theft,
adultery, etc.

43 ‘Adultery/ fornication’ are referred to as faah}isha and zinaa in CAC.

The uses can be summarised as follows:

ithm:
1) Missing an obligatory act of worship.
2) Doing an act that causes harm to one’s religion.
3) Behaving or acting in ways which are considered morally wrong.
4) Causing harm to one’s own self.
5) Secret bad deeds done to others.

dhanb:
1) Doing an act that causes harm to others.
2) Doing an act which is considered illegal.
3) Committing a major sin.
4) Doing an act that might entail punishment in this world.

The items of the pair are not absolute synonyms though they share the same range of
application (both refer to committing a bad deed in general). However, a subtle difference
will emerge by using the t-score statistic.

T-score can tell us how much difference exists between ithm and dhanb by comparing the
frequency of the co-occurrence of either word of the pair and its collocates with the other.
This will help us find out each word’s preferential usage. Then we will be able to abstract out
of the differences that will come up the main attributes that distinguish both of them.

(y)                      f(ithm/w)   f(dhanb/w)   t       significance

changing (تبديل)          15          0            3.87    P < 0.01
staying on (التأخر)       99          0            9.94    P < 0.0001
walking (المرور)          3           0            1.73    Not sig.
charity (الزكاة)          6           0            2.44    P < 0.10
gambling (الميسر)         28          0            5.29    P < 0.0001
lying (الكذب)             4           1            1.34    Not sig.
property (أموال)          5           0            2.23    P < 0.20
eating (الميتة)           8           0            2.82    P < 0.05
drinking (الخمر)          28          1            5.01    P < 0.0001
usury (الربا)             3           0            1.73    Not sig.

fornication (الزنا)44     1           11           2.88    P < 0.02
unbelief (الكفر)          6           11           1.21    Not sig.
treaties (الميثاق)        0           3            1.73    Not sig.
apostasy (الردة)          0           3            1.73    Not sig.
orphans (اليتامى)         1           3            1.00    Not sig.
theft (السرقة)            0           3            1.73    Not sig.
major sins (الكبائر)      2           5            1.13    Not sig.
oppression (الظلم)        6           9            0.77    Not sig.
murder (القتل)            11          38           3.85    P < 0.0001

Table (7.11) the t-test scores of the most significant collocates of ithm and dhanb.

As we can see in table (7.11), the high t-scores go with ithm when describing one’s own
actions that bring harm to oneself, such as gambling, drinking wine, eating dead or
unslaughtered animals or birds, or missing an obligatory act of worship, as in staying on or
hastening to leave in pilgrimage. In addition, it describes one’s actions whose results will
only affect one’s abode in the hereafter, such as missing prayers, not paying charity, etc. On
the other hand, dhanb gets the highest t-scores45 with actions that bring harm to other people,
like murder, theft, etc. So, we can say that ithm is intrinsic whereas dhanb is extrinsic.

44 ‘Adultery/ fornication’ are referred to as faah}isha and zinaa in CAC.

7.5.1 A Few Remarks

If both words are used to describe the same bad action, it can be noticed that either one is
often the preferred choice when a particular semantic attribute is involved. For example,
murder is a major sin that entails punishment in this world and the next. It always collocates
with dhanb as mentioned in the Qur’an,

‫ولم علي ذنب فأخاف أن يقتلون‬

wa lahum calayyi dhanb fa’akhaafu an yaqtluun

And they have a sin (a charge of crime) against me, and I fear they will kill me.
(Al-Shu’ara: 14)

This sin is explained in another verse as a charge of murder.

‫قال رب إن قتلت منهم نفسا فأخاف أن يقتلون‬

He said: My Lord! I have killed a man among them, and I fear that they will kill me.
(Al-Qasas: 33)

However, it is mentioned twice with ithm in the following contexts:

1) when it refers to the murder committed by Cain, the son of Adam; this is logical because
nobody was there to prosecute Cain for that murder.

2) or with committing suicide, such as ‘the ithm ‘sin’ that recurs when some people starved
themselves until they died’.

Al-khamr ‘drinking wine’ is considered a major sin in Islamic belief. However, it collocates
with ithm, not dhanb. An interpretation given by Al-Asfhani in his Mucjam (Lexicon) explains
why it is considered an ithm. He said that drinking might prevent the drinker from performing
the obligatory acts he is required to perform in order to enter Paradise; so he is harming
himself when missing such acts, which are accordingly described as ithms.

45 Although the scores are significant with only a few examples of dhanb, we can still draw a conclusion. This is
the most we can get out of our corpus. Hopefully, by increasing the corpus in the future, more examples may

Zinaa ‘fornication/adultery’ co-occurs more often with dhanb than with ithm. This indicates
that it is not a personal action which does not affect others, as some people might think.
Indeed, the consequences of fornication are grievous and can harm the whole society through
unwanted pregnancy or abstention from marriage, the right soil for procreation.

Kufr ‘unbelief’ is described at one time as ithm and at another as dhanb. It is, in the first
place, something between a man and his God; it is something that rests in one’s heart. But if
this behaviour is publicised, i.e. it contradicts the mainstream of the society, it will spoil
the unity of this society and will then be called riddah ‘apostasy’. It therefore becomes a
public menace and should be controlled.

7.5.2 Summary
The MI tests conducted in this section for ithm and dhanb show that these two words
significantly collocate with negative actions. They both describe one’s bad deeds in religious
terms. The most significant collocations of ithm refer to sins which involve harming oneself,
such as drinking intoxicants, missing some obligatory prayers, etc. On the other hand, dhanb
significantly collocates with sins which involve harming others, such as unjustly taking
people’s property, killing, etc.

The T-test highlighted an interesting difference between the two words by comparing all
occurrences of both words with high MI scores. The T-test tables affirmed that the semantic
feature that distinguishes ithm from dhanb is intrinsic vs. extrinsic.

7.6 A case study: The word pair h}asiba and z}anna ‘think’
The seemingly synonymous pair, h}asiba and z}anna, which mean ‘think’, will be
examined below to extract other semantic features. The dictionary meaning is given in Table
(7.12) before doing our corpus-based analysis.
Table (7.12) definitions of h}asiba and z}anna in four Arabic dictionaries

In the first place, the dictionary meanings of the pair as shown above give the denotation of
the words under investigation, which simply refers to (1) uncertainty or probability, (2)
suspicion and (3) certainty. Meanings (1) and (3) seem contradictory because it would be
confusing to have a word meaning something and its opposite. In the second place, the
dictionaries presume that the pair is synonymous by not giving a definition of h}asiba but
rather treating it solely as a synonym of z}anna. The near-synonymous pair h}asiba and z}anna
are used to define each other in many dictionaries. In addition to the similarities in meaning,
these two verbs are seemingly syntactically parallel: both are ditransitive verbs, i.e. they can
have two direct objects (like give), and both may have an intransitive usage as well. They can
also be nominal modifiers and undergo nominalisation, and both can occur with subordinate
clauses. Let us consider the following examples in (10a-12b) below.

‫ظنّ عمرو بكرا خالدا‬

z}anna camrun Bakran Khaalidan.

Amr thought Bakr to be Khalid.

ومن يغترب يحسب عدوا صديقه

wa man yaghtarib yah}sibu caduwan s}adiiqahu .

Whoever goes abroad will think an enemy a friend.

وأكبر ظني أنه قال: “شهادة الزور”....

wa akbaru z}anni annahu qaala shahaadata al-zuur.

I think it most likely that he said, “false witness”.

... ما أعد من عقابه لأهل معصيته بحسبانهم أنهم فيما أتوا من معاصي الله مصلحون ...

ma acadda min ciqaabihi li’ahli macs}iyatihi bi-h}usbaanihim annahum fiimaa ataw min
macaas}ii allaahi mus}lih}uun.

What (Allah) has prepared of torment for the sinners who think they are righteous
despite the sins they committed.

... ظن قوم أن سم الأصلة خاصة بارد ..

z}anna qawmun anna summa al-’as}lah khaas}s}atan baarid

Some people think that the python’s poison is cold.

يَحْسَبُ أَنَّ مَالَهُ أَخْلَدَهُ ..

yah}sabu anna maalahu akhladahu

He thinks that his wealth will make him eternal.

In Arabic grammar books, h}asiba and z}anna are called ‘afcaal al-quluub’ (heart verbs)
and they function as nawaasikh46. They are both used to mean certainty and probability but
probability is more likely to be the dominant case (Mubarak, 1982: 180).

As shown above, it seems that z}anna and h}asiba can be used interchangeably. Since corpus
data revealed some important and subtle differences between the previous pairs of synonyms
that are hard to recognise solely by intuition, we need to apply the same methodology to
examine whether the pair under investigation are synonyms or not. This can help us extract
the contrasts or subtle differences that pertain to their collocational distribution.

46 nawaasikh are verbs that assign case endings to the first two nouns that follow. Some verbs like kaana
assign nominative case to the first noun, the subject, and accusative case to the second, the complement.
z}anna assigns accusative case to the two objects that follow, apart from the subject, which is always in
the nominative case.

The statistics47 shown by the corpus demonstrate that for z}anna the most frequent word
form is the nominal form z}ann. There are 1254 instances of that particular form, which is
53.59% (1254/2340) of the total. The second most frequent is the verb z}anna, with 1086
hits, which is 46.41% (1086/2340). Both shares are substantial and suggest that z}ann as a
noun and z}anna as a verb are both central items to learn. h}asiba, by contrast, is used almost
exclusively as a verb, with only one occurrence in nominal form.

We will examine the first left and right collocates of z}ann as nominal modifiers below,
particularly those collocates that function as adjectives or genitives. In order to analyse the
significant collocations of this word, we, in the first place, searched all collocations with
minimum frequency 3. Secondly, we discarded all insignificant collocations, i.e.
combinations with MI scores lower than 1. Thirdly, we manually eliminated collocations
other than adjectives and genitives. Table 7.13 below shows all adjective and genitive
collocates of z}ann with minimum frequency 3.
z}ann (left collocates)

Gloss                  (x)         F(x,y)   MI
true                   صادقاً        10       7.75
false                  كاذب         10       8.25
invalid                فاسد         7        8.56
the era of ignorance   الجاهلية      6        5.03
suspicious             سيئ          6        5.75

z}ann (right collocates)

Gloss                  (x)         F(x,y)   MI
bad                    سوء          40       7.92
good                   حسن          31       6.51
more likely            أغلب         18       10.85
much                   كثير         6        3.05
certain                مؤكد         4        10.25
suspicious             سيئ          3        7.79

Table (7.13) The first left collocates of nominal z}anna (the top five words) and the right
collocates (the last six words): adjectives and genitives with minimum frequency 3.

47 We carried out the statistics after singling out all possible forms of the search-term. This can easily be done
automatically if we have a tagged corpus. Producing a tagged corpus would require a lot of time before
conducting such lengthy research, so we searched the corpus for what we know as verbs and nouns separately.

Examining the most significant collocates of z}ann (in nominal form) as represented in
table (7.13), we find that z}ann occurred more frequently with words of negative sense.
For the negative collocates the table shows the following examples, which occurred altogether
72 times in CAC: invalid ‘faasid’, false ‘kaadhib’, the era of ignorance ‘aljaahiliyyah’,
suspicious ‘sayyi’’, bad ‘suu’’. For the positive collocates, which occurred 41 times, the table
shows the following: good ‘h}asan’, true ‘s}aadiqa’. A minority of examples, only 28, are
neutral: more likely ‘aghlab’, much ‘kathiir’ and certain.
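
The three totals quoted above (72 negative, 41 positive, 28 neutral) can be recomputed from the frequencies in Table (7.13). The short sketch below does this; the polarity label attached to each row follows the grouping given in the paragraph, and the variable names are illustrative only.

```python
from collections import defaultdict

# (gloss, polarity, F(x,y)) for every collocate in Table (7.13)
ROWS = [
    ("true", "positive", 10),
    ("false", "negative", 10),
    ("invalid", "negative", 7),
    ("the era of ignorance", "negative", 6),
    ("suspicious (left)", "negative", 6),
    ("bad", "negative", 40),
    ("good", "positive", 31),
    ("more likely", "neutral", 18),
    ("much", "neutral", 6),
    ("certain", "neutral", 4),
    ("suspicious (right)", "negative", 3),
]

totals = defaultdict(int)
for _gloss, polarity, freq in ROWS:
    totals[polarity] += freq

for polarity in ("negative", "positive", "neutral"):
    print(polarity, totals[polarity])
```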

However, we cannot draw any conclusive description of z}anna before studying the other
form (verb). Let us now examine the left collocates48 of z}anna (verb) with the same
procedure taken above in table (7.13) as shown in table (7.14) below.

Gloss        (x)            F(x,y)   MI
suspicions   الظنون          18       10.22
untrue       غير الحق 49     14       10.10
false        كاذب            5        6.61
bad          سوء             3        2.70
death        الموت           3        2.02

Table (7.14) The immediate left collocates of z}anna (verb).

48 In Classical Arabic, the canonical structure of a sentence is VSO. The alternative basic order, SVO, is
also possible provided that there is a good reason for it, such as emphasis. We may also have a fronted object, as in
iyyaaka nacbudu ‘You-(alone) we-worship’ (surah Al-Fatihah: 5). According to Arabic grammar one can
say nacbuduka ‘worship-You’, but here a pronoun referring to Allah precedes the verb to exclude anyone else from
the act of worship. So we examined the left collocates of z}anna (V) because (1) analysing the right
collocates could mislead us by counting items relating to other verbs and (2) most of the items which modify the
Arabic verb fall on the left-hand side.

As shown above z}anna is mainly used negatively. As for h}asiba, (which is mainly used
as a verb in CAC) we have collocates like the following:

Gloss                      (x)                        F(x,y)   MI
eternity for the martyrs   قتلوا في سبيل الله أموات     18       12.96
entering Paradise          تدخلوا الجنة                9        7.92
safety from torment        مفازة من العذاب             6        13.43
good                                                  4        4.22
truth                                                 3        3.23

Table (7.15) The right collocates of h}asiba with minimum frequency 3.

There are also many positive instances of h}asiba in CAC which occurred once or twice.
For the sake of brevity I will give two examples only.

وَيَطُوفُ عَلَيْهِمْ وِلْدَانٌ مُخَلَّدُونَ إِذَا رَأَيْتَهُمْ حَسِبْتَهُمْ لُؤْلُؤًا مَنْثُورًا

wa yat}uufu calayhim wildaanun mukhalladuun idha ra’aytahum h}asibtahum lu’lu’an manthuura.

And round about them (will serve) boys of everlasting youth. If you see them, you
would think them scattered pearls. (Qur’an, 76: 19)

49 The construction ghayr al-h}aq ‘not truth’ is examined as a whole.

‫وهم يحسبون أنهم بفعلهم ذلك مصلحون‬

wa hum yah}sabuuna annahum bi-ficlihim dhaalika mus}lihuun.

They think that in doing so they are righteous.

We can further support the hypothesis of the negativity of z}anna and the positivity of
h}asiba by examining their remaining occurrences in CAC, i.e. collocations with a frequency
lower than 3. For the sake of brevity we will not show all these occurrences; we will rather
add their number to the total number of negative and positive occurrences of z}anna and
h}asiba with minimum frequency 3. Motivated by the collocational analysis above, we are
now able to look at all the occurrences of z}anna and h}asiba (without designating a
particular threshold) to draw some differences between them in terms of their semantic
features. The result is given in table (7.16).

z}anna & h}asiba (N)

Nominal     Positive   Negative   Neutral   Total
z}ann       41         83         1130      1254
h}usban     1          0          0         1

z}anna & h}asiba (V)

Verbal      Positive   Negative   Neutral   Total
z}anna      20         68         998       1086
h}asiba     61         25         301       387

Table (7.16) z}anna and h}asiba in terms of negativity and positivity.
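
The counts in Table (7.16) are easier to compare as proportions. The sketch below computes, for each form, the share of neutral occurrences and the share of positive occurrences among the non-neutral ones. The function and key names are illustrative assumptions, and the single nominal occurrence of h}asiba is left out because one token gives no meaningful proportion.

```python
# Counts copied from Table (7.16)
COUNTS = {
    "z}ann (N)": {"positive": 41, "negative": 83, "neutral": 1130},
    "z}anna (V)": {"positive": 20, "negative": 68, "neutral": 998},
    "h}asiba (V)": {"positive": 61, "negative": 25, "neutral": 301},
}


def polarity_profile(counts):
    """Return the total, the neutral share and the positive share of evaluative uses."""
    total = sum(counts.values())
    evaluative = counts["positive"] + counts["negative"]
    return {
        "total": total,
        "neutral_share": counts["neutral"] / total,
        "positive_share_of_evaluative": counts["positive"] / evaluative,
    }


profiles = {form: polarity_profile(c) for form, c in COUNTS.items()}
for form, p in profiles.items():
    print(
        f"{form:12s} total={p['total']:5d} "
        f"neutral={p['neutral_share']:.1%} "
        f"positive among evaluative={p['positive_share_of_evaluative']:.1%}"
    )
```

On these counts the neutral share is the largest for every form, while the positive share among evaluative uses is above one half only for h}asiba.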

We can notice that the neutral sense50 of z}anna and h}asiba is dominant. However, we
can still say that z}anna is used more negatively than positively, whereas h}asiba shows a
tendency to be more positive. Most of the negative occurrences of h}asiba, unlike those of
z}anna, are negated51. For example:

50 Examples 10a and 10b mentioned earlier are good examples of the neutral sense of z}anna & h}asiba.
51 The negative forms are underlined in the Arabic text along with the transliteration.

ولا تحسبن الله غافلاً عما يعمل الظالمون

wa la tah}sabanna Allaaha ghaafilan camma yacmalu al-z}aalimuun.

And not think Allah unaware of-what do-they the-wrongdoers.
Think not that Allah is unaware of that which wrongdoers do. (14: 42)

فلا تحسبن الله مخلف وعده رسله

fala tah}sabanna Allaaha mukhlifa wacdihi rusulah.

So-not think Allah breaking promise-His Messengers-His.
So think not that Allah will fail to keep His Promise to His Messengers. (14:47)

One more item of evidence is that one of the dictionary meanings of z}anna, as mentioned
earlier, is ‘suspicion’, and this sense is mainly negative in Arabic. h}asiba, on the other
hand, is used in the context of praise, as in the following hadith:

إذا كان أحدكم مادحا أحدا فليقل أحسب فلانا هكذا.

Idha kaana ah}adukum maadih}an akhahu fa-l-yaqul: ah}sibu fulaanan

Whoever was one-(of)-you praising brother-his so-say think-I somebody like-this.
Whoever amongst you has to praise his brother should say, ‘I think that he is so and
so’ (Sahih Muslim: Volume 3, ch. 48, no. 830).

We may also wonder why the difference in frequency between z}anna and h}asiba is so
great. Two explanations can be given here. The first is that the near-absence of h}asiba in
the nominal form could be due to morphological or phonological reasons, whereas z}anna
can be used as a verb and as a noun alike. To change h}asiba into a noun, one
morphological process (step 1 below) and two phonological processes (steps 2 and 3)
have to take place: 1) add a suffix, which is (aan) in this case, 2) change the first vowel
into u and 3) delete the second vowel. The output form is thus h}usbaan. The second
explanation is that the less frequent word (h}asiba) is used in a more restricted sense,
perhaps only in particular contexts. To see whether this is the case, we can start by looking
at the distribution of the pair across the Qur’an subcorpus.
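
The three-step derivation of h}usbaan from h}asiba described above can be traced mechanically on the transliterated string. The sketch below is purely illustrative: the function names are hypothetical, and dropping the final -a of the citation form before suffixation is an assumption made so that the three stated steps yield the stated output.

```python
VOWELS = "aiu"  # short vowels in the transliteration used here


def vowel_positions(form: str) -> list[int]:
    """Indices of all vowel letters in a transliterated form."""
    return [i for i, ch in enumerate(form) if ch in VOWELS]


def nominalise(verb: str) -> str:
    # Assumption: the citation form ends in the perfective ending -a,
    # which is dropped before the suffix is attached.
    stem = verb[:-1] if verb.endswith("a") else verb

    # Step 1 (morphological): add the suffix -aan.
    form = stem + "aan"

    # Step 2 (phonological): change the first vowel into u.
    first = vowel_positions(form)[0]
    form = form[:first] + "u" + form[first + 1:]

    # Step 3 (phonological): delete the second vowel.
    second = vowel_positions(form)[1]
    form = form[:second] + form[second + 1:]
    return form


print(nominalise("h}asiba"))  # prints h}usbaan
```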

z}anna is mentioned in the Qur’an 55 times, of which 6 are in nominal form, whereas
h}asiba occurs 43 times, always as a verb. This search in the Qur’an subcorpus shows how
similar z}anna and h}asiba are in terms of frequency, with 49 occurrences of z}anna as a
verb and 43 of h}asiba. Such similarity in frequency in the Qur’an subcorpus provides
comparable amounts of data for analysis.

Sometimes z}anna is used in the Qur’an to mean ‘certainty’ and at other times ‘doubt’. So it
refers to two contradictory senses: the thing and its opposite. All commentators on the Qur’an
give two contradictory meanings to z}ann. They treat z}ann as a polyseme that has two
different meanings, where ‘different’ here means opposite. One commentator, Al-Tabari,
gives more examples from Arabic to strengthen his point of view; he mentions al-sudfa, which
means both darkness and light, and al-s}ariikh, which means both the rescued and the rescuer.
He is not the only one in favour of this approach. Ibn Al-Anbari compiled a book called
Al-Ad}daad ‘The Opposites’ in which he collected all homophones of opposite meanings, the
top word of which was z}ann.

On the other hand, some linguists denied the existence of this phenomenon in Arabic52, like
Ibn Durustwayh, who compiled a book called Ibt}aal Al-Ad}daad (Refuting the Book of
Opposites) in which he rejected that approach because it contradicts the wisdom of the Arabs
(Al-Suyuti, Al-Muzhir: 400).

52 Some other linguists give two explanations for the existence of such a phenomenon:
1. Broadening of meaning, such as al-s}areem, which literally means ‘the separated’ and is used to mean the night
because it is separated from the day; the same applies to the day, which is separated from the night. Al-sudfa,
which means both light and darkness, can be explained in the same way: al-sudfa originally means ‘to hide’,
so when darkness comes it hides the light of the day and when light comes it hides the darkness of the night.
2. Dialectal variations: for instance, al-jawn means black in Tamim’s dialect and white in Qays’.
One more reason can be added to the above explanations which is not mentioned in Al-Muzhir: narrowing. For
example, al-ma’tam, which originally meant a gathering of men and women for a sad or a merry occasion, was
later limited to the sad occasion. Therefore oppositeness can no longer hold between homophonous words.

Now let us come back to the subject matter of this chapter by looking at z}ann, which is
often regarded as a polyseme that has two opposite meanings, ‘doubt’ and ‘certainty’. Some
commentators, like Mujahid, say that whenever z}ann is mentioned in the Qur’an it means
certainty, yet interpret z}ann in some verses as meaning doubt. The selection of
meaning, they presume, depends entirely on context. For example, z}ann in the following
two verses means certainty in (16.a) and doubt in (16.b).

واستعينوا بالصبر والصلاة وإنها لكبيرة إلا على الخاشعين. الذين يظنون أنهم ملاقو ربهم وأنهم إليه راجعون.

wa istaciinuu bi-l-s}abri wa al-s}alaati wa innahaa lakbiiratun illa cala al-khaashciin.
alladhiin yaz}unnuuna annahum mulaaquu rabbihim wa annahum ilayhi rajicuun.

And seek-help-you with-the-patience and the-prayer and verily-it very-big except on
the humble who think that-they meeting Lord-their and that-they to-Him returning.
And seek help in patience and prayer; and truly it is hard except for the humble-
minded. They are those who are certain that they are going to meet their Lord and
that unto Him they are going to return.53 (The Noble Qur’an, trans by M. Khan, 2: 46)

Allah described the true believers as those who have z}ann that they will meet Allah and
they will return to Him. This is a matter of belief. If they were doubtful they would not be
called believers.

ومنهم أميون لا يعلمون الكتاب إلا أماني وإن هم إلا يظنون.

wa minhum ummiyuuna la yaclamuun al-kitaaba illa amaaniyya wa in hum illaa yaz}unnuun.

And from-them unlettered not know-they the-book except wishes and but they think.
And there are among them unlettered people, who know not the Book, but they trust
upon false desires and they but guess. (The Noble Qur’an, 2: 78)

53 The translation is slightly modified (cf. fn. 26).

z}anna is translated above as guess. The verse in (2) talks about some Jews who are
illiterate and do not know the reality of their book; however, they follow their scholars
blindly and believe them. This is a different category from those who know the truth and
falsify it, mentioned in the verse preceding it (2: 77). So if we interpret z}anna as doubt or
guess, as commentators say, we presume that the second category of Jews, who do not know
the reality of their book, do not believe in it. But this is not the case, since this category is
blindly following their scholars, and this is a type of belief. We would rather say that there
are some Jews who only know the false version of the Bible and they are certain about what
they believe even if it is false.

The inconsistency of the interpreters of Qur’an and the translators later on created a big
confusion when assessing the following verse.

(17) وذا النون إذ ذهب مغاضبا فظن أن لن نقدر عليه فنادى في الظلمات أن لا إله إلا أنت سبحانك

wa Dha An-Nuun idh dhahaba mughaad}iban fa- z}anna an lan naqdira

And Dhun-Nuun when he went off in anger, and imagined that we shall not punish
him! But he cried through the darkness: none has the right to be worshipped but You,
Glorified are You. (The Noble Qur’an, 21: 87)

The first dictionary meaning of naqdira is ‘be able’. Ibn Katheer and Al-Qurtubi interpreted
naqdira as ‘to narrow’ or ‘constrict’ as in (18):

(18) ومن قدر عليه رزقه فلينفق مما آتاه الله

wa man qudira calayhi rizquhu fa-l-yunfiq mimma ‘aataahu Allaah.

And who was-restricted on-him livelihood-his so-spend-(he) of-what gives-him Allah.
And the man whose resources are restricted, let him spend according to what Allah
has given him (Qur’an, 65:7).

The meaning of the verse as presented by Ibn Katheer is ‘So Jonah (Dhul-Nuun) thought that
Allah might not constrict him in the belly of the fish’.

Commentators on the Qur’an eliminated the possibility that the Prophet Jonah doubted
Allah’s ability to reach him by explaining the meaning of qadara as ‘to constrict’. But the
question still arises of how the Prophet Jonah, who is infallible according to the Islamic creed,
could think that Allah might not constrict him in the belly of the fish when he had gone off in
anger, fleeing from his people without permission from Allah. If we interpret z}anna here as
‘be certain’, the whole problem is resolved. The meaning is then that Jonah was certain that if
he prayed to Allah he would be saved. The use of fa with the following verb naada clarifies
this point, as fa introduces a result. So we can say that Jonah was certain he would not be
constricted in the belly of the fish if he prayed to Allah.

In short, there is a subtle difference between z}anna and h}asiba because of the
contextual variation that occurs with them. We still need to work out, in a more methodical
way and by means of corpus-based analysis, which semantic features make them different.
Let us study the whole environment of z}anna and h}asiba, particularly the first and
second left collocates, to see what preferential distribution they appear in.

Gloss             (y)           F(x,y)   F(y)     MI
suspicions        الظنون         22       25       10.87
most              غالب           18       22       10.77
meeting Allah54   ملاقو الله      11       14       10.71
certainly         كل الظن        4        6        10.47
certainly         مؤكد           4        7        10.25
very              ظناً            17       64       9.14
that              أن             579      15537    6.31
in                بـ             252      34845    3.94
much              كثير           6        1547     3.05
in                في             48       24009    2.09

Table (7.17) The 1st & 2nd left collocates of z}anna with minimum frequency 3.

Gloss             (y)           F(x,y)   F(y)     MI
that              أن             102      15537    6.40
in                في             5        24009    1.42
at                بـ             3        34845    0.15
Allah             الله           3        19246    1.00

Table (7.18) The 1st & 2nd left collocates of h}asiba with minimum frequency 3.
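
The MI column of Tables (7.17) and (7.18) can be related to the two frequency columns with the standard pointwise mutual information formula, MI = log2(F(x,y) * N / (F(x) * F(y))). The sketch below assumes that this is the formula behind the scores. It also assumes a corpus size N of 5,000,000 tokens and node frequencies of 2340 (z}anna) and 388 (h}asiba): the corpus size is not stated in this section, and these values are used here only because they approximately reproduce the tabulated scores.

```python
import math

N = 5_000_000  # assumed corpus size, not stated in this section
NODE_FREQ = {"z}anna": 2340, "h}asiba": 388}  # assumed F(x): all forms of each node


def mutual_information(f_xy: int, f_x: int, f_y: int, n: int = N) -> float:
    """Pointwise MI: log2 of observed over expected co-occurrence."""
    return math.log2(f_xy * n / (f_x * f_y))


# (node, collocate gloss, F(x,y), F(y), MI as printed in the tables)
ROWS = [
    ("z}anna", "suspicions", 22, 25, 10.87),
    ("z}anna", "that", 579, 15537, 6.31),
    ("h}asiba", "that", 102, 15537, 6.40),
    ("h}asiba", "at", 3, 34845, 0.15),
]

recomputed = [
    (node, gloss, mutual_information(f_xy, NODE_FREQ[node], f_y), printed)
    for node, gloss, f_xy, f_y, printed in ROWS
]
for node, gloss, mi, printed in recomputed:
    print(f"{node:8s} {gloss:11s} recomputed={mi:6.2f}  table={printed:6.2f}")
```

Because F(y) is in the denominator, a very frequent function word needs a large number of co-occurrences to reach a high score, which is why the rare collocates head Table (7.17).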

The first thing to notice from the above tables is the high frequency of an ‘that’ (an introducer
of a subordinate clause), which occurred 579 times with z}anna, whereas it occurred just
102 times with h}asiba. However, an ‘that’ accounts for almost the same percentage of the
occurrences of both items: h}asiba (102/388 = 26.29%) and z}anna (579/2340 = 24.74%).
So they both have roughly the same proportion of occurrences with subordination.
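
The two proportions can be recomputed from the counts in Tables (7.17) and (7.18), taking the frequency of an ‘that’ as the numerator and the total occurrences of each item (388 and 2340) as the denominator. The helper name below is illustrative only.

```python
def share(part: int, whole: int) -> float:
    """Percentage of `whole` accounted for by `part`."""
    return 100 * part / whole


an_with_hasiba = share(102, 388)   # an 'that' with h}asiba
an_with_zanna = share(579, 2340)   # an 'that' with z}anna

print(f"h}}asiba: {an_with_hasiba:.2f}%")
print(f"z}}anna:  {an_with_zanna:.2f}%")
print(f"difference: {abs(an_with_hasiba - an_with_zanna):.2f} percentage points")
```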

54 We searched this item plus the following one because they constitute one concept which is resurrection.

We can also see that z}anna collocates with the full range of intensifiers such as ‘certainly,
much, most, very’, whereas none of these intensifiers occurs with h}asiba, even after a
further assessment of all possible occurrences of both items. Therefore, we can say that
z}anna is something that can increase or become more certain. It can increase to reach a
level of conviction, as in example (16.a) above (Qur’an, 2: 46).

We then see that ‘z}anna mulaaquu Allaah’ (they believe they will meet Allah) has a high
MI score at 10.71. We can say then that z}anna collocates with a word denoting belief in
resurrection and this involves certainty. In fact, the dominating sense for z}anna so far, on
the basis of the evidence given throughout, is to denote belief. However, there are some
occurrences of z}anna which are assumed to denote probability or doubt, as mentioned
earlier. For practical reasons, we can fit all these senses on an epistemic scale.

doubt ----- possibility ----- probability ----- necessity ----- prediction ----- factuality

‘Factuality’ in the above scale represents the highest degree of certainty, whereas ‘possibility’
and ‘doubt’ are the lowest. So we can easily include all senses of z}anna (probability, belief
and certainty) and reconcile all lexicographers by sticking to just one sense, which
resides halfway between ‘doubt’ and ‘certainty’ or between ‘doubt’ and ‘certainty not’, i.e. a
state of strong or weak possibility, as represented in the following scale.

certainty not ----- z}anna ------- ------ z}anna ------- factuality


In fact, the use of z}anna to mean ‘believe’ reflects a faith-related commitment. Therefore,
this sense eliminates its use in relation to the Prophet in the following verses:

ولا تحسبن الله غافلاً عما يعمل الظالمون

wa la tah}sabanna Allaaha ghaafilan camma yacmalu al-z}aalimuun.

And not think Allah unaware of-what do-they the-wrongdoers.
Consider not that Allah is unaware of that which wrongdoers do. (14: 42)

فلا تحسبن الله مخلف وعده رسله

fala tah}sabanna Allaaha mukhlifa wacdihi rusulah.

So-not think Allah breaking promise-His Messengers-His.
So think not that Allah will fail to keep His Promise to His Messengers. (14:47)

Two explanations are given in Tabari’s Tafseer (Commentary on the Qur’an) for h}asiba in
this particular context:
(1) To highlight the Prophet’s belief that he does not consider Allah unaware of what the
wrongdoers do. Similar phrasing can be found in the Qur’an in more than one place. For
instance, Allah says, ‘O ye who believe, believe’ (4: 136).
(2) To draw attention to the fact that Allah is aware of the wrongdoers’ actions and He will
punish them accordingly.

If the addressee is not the Prophet, the literal meaning will not be infringed.
In Al-Qurtubi’s explanation, he said, ‘this is to relieve the Prophet (Muhammad) after relating
to him this sad story about the people of Abraham and how impudent they are in discrediting
his religion.’

Knowing, in the first place, that the addressee in the following verse (14: 44) is the Prophet
shows that the addressee in the previous verse has to be him as well.
However, the addressee in the above verses can include all categories of participants in
the speech-act: the speaker, the listener/reader and the audience, because the verses are put
in an admonishing style. This is in conformity with the basic idea of prophethood and
revelation, which is for the good of the whole people, not just for the Prophet. Accordingly,
h}asiba should have another meaning that suits all potential addressees, different from
z}anna, which means something in between certainty and doubt. We had better define it
as a verb that refers to the inclination of one’s heart to think. Secondly, it is obvious that the
two verses (19a & b) are imperative and negative at the same time. This use is typical only
of h}asiba. The negated imperative occurs 37 times with h}asiba (i.e. 9.56%) and only
three times with z}anna (0.12%). In this context I examined z}anna and h}asiba in their
verbal forms, and it turned out that all their occurrences in the negative imperative are
followed by clausal complements (subordinate clauses), and these clauses can function as
subject, object or complement. This is quite significant in drawing up the differences between
z}anna and h}asiba.

We have seen that h}asiba occurs as negative imperative more than z}anna. Let us now
look at some possible explanations as to why this is so.

Basically, the pronouns used with h}asiba in the imperative must be second person,
singular or plural, feminine or masculine. The personal pronoun you is used in ‘a direct
address language’ (Leech, 1966: 34). The language of direct address is an appropriate vehicle
for effective communication, where the addresser seems to be holding a conversation and
talking to the addressee directly. In this case, the person receiving the message, the addressee,
is the passive part of the speech-act. Is that not proof that h}asiba is a passive word? No,
we cannot make that claim before we assess the other part of the description, namely the
imperative mood.

First of all, the literal meaning of the imperative mood is direct instruction and admonition.
Imperatives can be positive, meaning direct exhortations, or negative when connoting
prohibitive warnings (ibid.: 110-111). All occurrences of h}asiba and z}anna in the imperative
mood are accompanied by the negative form. Therefore, they are used as prohibitive
warnings. This sense, prohibitive warning, coupled with the language of direct address, is
significant in religious discourse, where the speaker tries to remedy the defects of the
listeners/hearers without any sort of sophisticated locution. The speaker only aims to touch
the souls of his/her audience in a simple and direct way.

Secondly, the use of h}asiba in this way implies that the message to be delivered is enough
to treat a superficial problem that has not found its way to the heart. It can thus reduce the
complexity of the discourse by expressing in just one or two sentences (as in examples
(19 a & b) above) what would otherwise have been expressed in a lengthy address with
z}anna. As for z}anna, in (Qur’an 2:154-171) Allah gives an account, using z}anna, of the
behaviour of some Muslims on the battlefield and the remedy for it. He therefore gave a
lengthy treatment of such a problem, which is cowardice or fear of death, after it had found
its way to their hearts. The main distinction between h}asiba and z}anna is therefore that the
latter is used for deeply held belief or conviction whereas the former is used for superficial
belief (i.e. belief about relatively unimportant issues). For example, z}anna is used throughout
the Qur’an subcorpus to mean a state of belief or disbelief that leads either to heaven or to
Hell-fire. Let us consider the following examples of z}anna:

واستعينوا بالصبر والصلاة وإنها لكبيرة إلا على الخاشعين. الذين يظنون أنهم ملاقو ربهم وأنهم إليه راجعون.

wa istaciinuu bi-l-s}abri wa al-s}alaati wa innahaa lakbiiratun illa cala al-khaashciin.
alladhiin yaz}unnuuna annahum mulaaquu rabbihim wa annahum ilayhi rajicuun.
And seek-help-you with-the-patience and the-prayer and verily-it very-big except on
the humble who think that-they meeting Lord-their and that-they to-Him returning.
And seek help in patience and prayer; and truly it is hard except for the humble-
minded. They are those who are certain that they are going to meet their Lord and
that unto Him they are going to return. (The Noble Qur’an, trans by M. Khan, 2: 46)

وأما من أوتى كتابه بيمينه فيقول هاؤم اقرؤوا كتابيه إني ظننت أني ملاق حسابيه.

wa ammaa man utiya kitaabahu bi-yamiinihi fa-yaquulu … inni z}anantu anni mulaaqin
And as-for who given-(him) book-his in-right-hand-(his) so-says … verily-I thought
that-I meeting reckoning-me.
Then as for him who will be given his Record in his right hand, he will say… Surely, I
did believe that I shall meet my Account. (Qur’an, 69: 19-22)


وأما من أوتى كتابه وراء ظهره فسوف يدعو ثبورا ويصلى سعيرا إنه كان في أهله مسرورا إنه ظن أن لن يحور.

wa amma man uutiya kitaabahu waraa’a z}ahrihi fa-sawfa yadcu… innahu z}anna
an lan yahuura.
And as for who given-(him) book-his behind back-his so-will invoke-he
destruction… Verily-he thought that not return.
But whosoever is given his Record behind his back, He will invoke (his) destruction.
… Verily, he thought that he would never come back (to Us)! (Qur’an, 84:10-14)

Now let us have a look at the following examples of h}asiba.

قيل لها ادخلي الصرح فلما رأته حسبته لجة وكشفت عن ساقيها

qiila lahaa udkhuli al-s}arh}a fa-lammaa ra’athu h}asibathu lujjah.

Said to-her enter-you the-building so-when saw-it thought-it pool.
It was said to her: “Enter the palace”, but when she saw it, she thought it was a pool,
and she (tucked up her clothes) uncovering her legs. (Qur’an, 27: 44)

وَيَطُوفُ عَلَيْهِمْ وِلْدَانٌ مُخَلَّدُونَ إِذَا رَأَيْتَهُمْ حَسِبْتَهُمْ لُؤْلُؤًا مَنْثُورًا

wa yat}uufu calayhim wildaanun mukhalladuun idhaa ra’aytahum h}asibtahum lu’lu’an manthuuraa.

And round about-them boys everlasting if see-you-them think-you-them pearls scattered.
And round about them (will serve) boys of everlasting youth. If you see them, you
would think them scattered pearls. (Qur’an, 76: 19)

يحسبهم الجاهل أغنياء من التعفف

yah}sabuhum al-jaahilu aghniyaa’a mina al-tacaffuf.

Thinks-them the-not-knower rich from modesty.
The one who knows them not, thinks that they are rich because of their modesty.
(Qur’an, 2: 273)

We can notice in the above examples that h}asiba is used to describe one’s own impression
of a particular situation. In (21a), when Queen Belqees visited Solomon, the latter wanted to
impress her in a way that would make her believe in Allah. He asked her to enter a glass
palace built on water. She had never seen such an edifice before, so she thought nothing was
there and tucked up her clothes. She came to her decision by mere sight. So the use of
h}asiba here refers to a state of roughly-held perspectives based on a non-methodical
conception conveyed to one’s mind or heart through mere sight, as in (21a-b), or through
hearing or prediction, as in (21c).

We can finally say that h}asiba and z}anna are verbs whose meanings imply a
personal element, which is described by Badawi (2000) as an introducer for the relationship
that holds between subject and predicate on the basis of one’s own point of view. But h}asiba
describes a personal state attained via feelings or mere senses rather than through facts and
knowledge. z}anna, as discussed above, is based on personal perspectives residing in one’s
own mind, through which one can believe in the validity or invalidity of a given concept.
These perspectives can be true for one person and false for another according to how
accurate or inaccurate his perception of something is. So the semantic feature which can be
deduced from these differences between h}asiba and z}anna is that the former is an
immediate reaction (based on one’s feelings or mere senses) whereas the latter is a considered
reaction (based on one’s own ideas, obtained after long contemplation).

In conclusion, we have probed two different semantic features that distinguish between
h}asiba and z}anna: positive vs. negative and immediate reaction vs. considered reaction.
The two features, although apparently unrelated, complement each other. The discourse
characterised by h}asiba tends to be an immediate reaction which is mainly positive in the
sense that it represents only what is the case, without deep thinking. z}anna, by
contrast, gives the impression of a considered reaction which is mainly a negative report of
the events, i.e. it expresses one’s personal evaluation of the situation or state of affairs
referred to.

Therefore, it turns out that the synonymy relation can no longer hold between h}asiba and
z}anna. One more piece of evidence is that if we assume that h}asiba and z}anna are
synonymous, we should be able to exchange one word for the other without changing the
meaning of the sentence to any great extent. If we try that with the examples above, we get,

ولا تحسبن الله غافلاً عما يعمل الظالمون

wa la tah}sabanna Allaaha ghaafilan cammaa yacmalu al-z}aalimuun.

And not think Allah unaware of-what do-they the-wrongdoers.
Think not that Allah is unaware of that which wrongdoers do. (14: 42)

أحسبك رجلاً عاقلاً

ah}sabuka rajulan caaqila.

Think-you man rational
I think you are a rational man.

*ولا تظنن الله غافلاً عما يعمل الظالمون

wa la taz}unnanna Allaaha ghaafilan cammaa yacmalu al-z}aalimuun.

And not think Allah unaware of-what do-they the-wrongdoers.
Believe not that Allah is unaware of that which wrongdoers do. (14: 42)

أظنك رجلاً عاقلاً

az}unnuka rajulan caaqila.

believe-you man rational
I believe you are a rational man.

The replacement seems to work for the second sentence but not for the first.
This is because the addressee in (22a) is the Prophet, who basically believes in Allah’s
ultimate power and has no doubt that Allah is aware of everything. So z}anna, which
means something based on facts, does not fit here, simply because he is a prophet.
h}asiba can only fit with its meaning ‘Do not let the phenomenal situation of Allah’s
wisdom (in postponing the punishment of the tyrants and the wrongdoers and giving them the
upper hand) be introduced into your mind or heart through mere observation of the situation’.

7.6.1 Summary
In this section we carried out some preliminary analysis prior to the statistical tests:
1) singling out the most central forms of the pair; 2) discussing the
grammatical and semantic position of both words; 3) refuting the polysemous nature of
z}anna as having two opposite senses. We then identified interesting differences
between the pair of words by probing the semantic features of both: negative vs. positive and
considered vs. immediate reaction.

Statistically, we have only used MI to highlight how significant the collocations of both
words are. We found out that the T-test is not useful with this pair of words, because lists of
collocations with both are different. So, the difference between them can be brought about by

7.7 A case study: The word pair h}bb and wdd ‘love’
The synonymous pair h}bb and wdd, commonly taken to mean ‘love’, will be
examined below to see whether they are absolute synonyms. Let us first look at the
dictionary meanings in table (7.19) below.
Table (7.19) definitions of h}bb and wdd in four dictionaries

A search for the exact match returned dramatically fewer hits than a search using a wild
card, although the results include all word classes of the above lexeme. Searching every
category separately takes a lot of time but is more accurate.
So we will search all the word-forms that occur in CAC. This could leave some word-forms
unanalysed because they did not occur in our corpus. The total number of
occurrences of the base-word h}bb in CAC is 1972, and the search result is presented
in the following table.

Lexical Items POS Frequency

1 يحب he loves V 386
2 المحبة the love N 213
3 الحب the love N 209
4 أحب I love V 144
5 أحب he loved V 126
6 الحبيب the lover (masc.) N 104
7 تحب you (sing.) love V 72
8 المحب the lover N 61
9 تحبون you (pl.) love V 58
10 حبه his love N 51
11 محبته his love N 47
12 الأحباب the beloved persons N 46
13 حبيبي my love (masc.) N 45
14 يحبه he loves him V 42
15 يحبون they love V 38
16 حبها her love N 33
17 المحبوب the beloved N 26
18 محبتها her love N 26
19 يحبه he loves him V 24
20 يحبها he loves her V 24
21 محبتي my love N 17
22 محبتك your love N 16
23 الأحبة the lovers (pl.) N 14
24 حبيبك your lover (masc.) N 13
25 حبي my love N 12
26 أحبها he loved her V 12
27 حبيبتي my lover (fem.) N 9
28 محب the lovers (acc.) N 9
29 يحبونهم they love them V 9
30 التحاب love N 8
31 حبيبا a beloved (acc.) N 7
32 الحبيبة the beloved person (fem.) N 7
33 محبتكم your love (pl.) N 6
34 تحابا they loved each other (dual) V 5
35 تحبوا you (pl., jussive/acc.) love V 5
36 محبتكم your love N 4
37 محبتهم their love N 4
38 الأحباء the lovers (pl.) N 4
39 أحبهم he loved them V 4
40 المحبون the lovers (nom.) N 4
41 تحبها you love her V 4
42 تحابوا love one another (pl.) V 4
43 المتحابان the (dual) lovers N 4
44 المتحابين the lovers N 3
45 يحبهم he loves them V 2
46 يتحابون they love one another V 2
47 يحبونها they love her V 2
48 يحبان they (dual) love V 2
49 حبيبه his beloved (masc.) N 2
50 تحبان you (dual) love V 1
51 يحبوا they (pl., jussive/acc.) love V 1
52 يتحابوا they love one another (pl.) V 1
53 تحاببتم you loved one another (pl.) V 1
54 حبيبته his beloved person (nom., fem.) N 1
55 حبيبتها her beloved person (acc., fem.) N 1
56 محبتين two lovers (acc., fem.) N 1
57 محبتهن their love (acc., fem.) N 1
Total: 1972
Table (7.20): Lexical Frequency of h}bb in CAC
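
The contrast between an exact-match search and a wild-card search can be sketched as follows. This is only an illustration under stated assumptions: the token list is invented, and the function names are hypothetical and do not describe the search tool actually used on CAC.

```python
from collections import Counter
from fnmatch import fnmatchcase

# Toy token list standing in for a corpus (invented data, not taken from CAC).
tokens = ["يحب", "الحب", "يحب", "المحبة", "حبه", "كتاب", "الحب", "صاحب"]

def exact_match(tokens, form):
    """Count only the tokens identical to one word-form."""
    return Counter(t for t in tokens if t == form)

def wildcard_match(tokens, stem):
    """Count every token containing the stem, like a *stem* wild-card query."""
    return Counter(t for t in tokens if fnmatchcase(t, "*" + stem + "*"))

exact = exact_match(tokens, "الحب")
wild = wildcard_match(tokens, "حب")

# Frequency list of all forms caught by the wild card, most frequent first.
for form, freq in wild.most_common():
    print(form, freq)
```

In this toy list the wild card also catches صاحب, a word that merely contains the same letter sequence, which suggests why wild-card output would still need to be checked form by form.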

The lexical items in the table above are all derived from the same root, h}bb. However,
discussing them all would be tedious and time-consuming. Instead, based on corpus
linguistics techniques, we can choose the most frequent items from the above list to analyse
and see whether we can get a significant understanding of the whole scope.

Nominal Verbal Total

Frequency 638 786 1424
Percentage 45% 55%
Table (7.21) Statistics of the top ten words of the base-word h}bb in CAC

Table (7.21) above shows that these top ten word forms account for most of the overall
occurrences of the base-word h}bb. Together they form 72% of the total frequency, which
is enough to work on for a realistic result. We can also notice that there is no big difference in
frequency between verbal and nominal forms.
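
The figures in table (7.21) and the 72% coverage can be re-derived from the first ten rows of table (7.20). The short sketch below does only that arithmetic; the variable names are illustrative.

```python
# (POS, frequency) of the ten most frequent forms, copied from table (7.20).
top_ten = [
    ("V", 386), ("N", 213), ("N", 209), ("V", 144), ("V", 126),
    ("N", 104), ("V", 72), ("N", 61), ("V", 58), ("N", 51),
]
TOTAL_OCCURRENCES = 1972  # total reported for the base-word in CAC

nominal = sum(freq for pos, freq in top_ten if pos == "N")
verbal = sum(freq for pos, freq in top_ten if pos == "V")
top_total = nominal + verbal
coverage = top_total / TOTAL_OCCURRENCES

print(nominal, verbal, top_total)
print(f"nominal share: {nominal / top_total:.0%}")
print(f"verbal share: {verbal / top_total:.0%}")
print(f"coverage of all occurrences: {coverage:.0%}")
```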

Initial observation of the base-word h}bb, unlike wdd, shows that it is more likely to be
frequent in love stories and fiction. To be consistent, we need either to avoid analysing this
item or to treat it as exceptional, since we have been working from the very beginning on
general words only, as mentioned in 7.2. So let us see how often this word occurs in different
kinds of texts. This can help us find out whether it is a general word or register-specific,
related to love stories and fiction. The table below shows the distribution of the base-word
h}bb (of the above top ten items) across the different subcorpora of the CAC.
Subcorpus          Text size    V     N     Total   Percentage55
The Holy Qur’an    88,622       43    7     50      .056
Biography          393,933      105   28    133
Fiction            579,223      107   301   408
Hadith             683,970      196   82    278     .040
Lexicons           404,080      9     8     17      .004
Philosophy         478,141      23    49    72
Poetry             69,385       10    11    21      .030
Proverbs           362,054      36    70    106     .029
Science            903,205      26    18    44
Theology           1,037,387    231   64    295
Total              5,000,000    786   638   1424    .028
Table (7.22): Subcorpus frequencies of the top ten forms of the base-word h}bb in CAC.
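
The percentage column appears to be the number of hits divided by the size of the subcorpus, expressed as a percentage; this reading is an assumption, but it is consistent with the overall figure of .028% quoted in the text. A minimal sketch that recomputes it from the sizes and totals in table (7.22):

```python
# (text size, total hits of the top ten forms), copied from table (7.22).
subcorpora = {
    "The Holy Qur'an": (88_622, 50),
    "Biography": (393_933, 133),
    "Fiction": (579_223, 408),
    "Hadith": (683_970, 278),
    "Lexicons": (404_080, 17),
    "Philosophy": (478_141, 72),
    "Poetry": (69_385, 21),
    "Proverbs": (362_054, 106),
    "Science": (903_205, 44),
    "Theology": (1_037_387, 295),
}

def percentage(hits, size):
    """Hits as a percentage of the text size (assumed meaning of the column)."""
    return 100 * hits / size

relative = {name: percentage(hits, size) for name, (size, hits) in subcorpora.items()}

corpus_size = sum(size for size, _ in subcorpora.values())
corpus_hits = sum(hits for _, hits in subcorpora.values())
overall = percentage(corpus_hits, corpus_size)

# Subcorpora ordered from the highest to the lowest relative frequency.
ranked = sorted(relative, key=relative.get, reverse=True)
for name in ranked:
    print(f"{name:<16} {relative[name]:.3f}%")
print(f"{'Whole corpus':<16} {overall:.3f}%")
```

Normalising by text size is what makes subcorpora of very different lengths, such as Poetry and Theology, comparable.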

The list above shows that the base-word h}bb is more frequently used in Fiction, the Holy
Qur’an and Hadith than in any other text type. The relative frequency of the words under
examination in these sub-corpora exceeds their overall relative frequency in the whole
corpus, which is .028%. They are least frequent in the texts that are considered technical, like
linguistics, science and philosophy. In the first place, this could be an indication that the more
general the text is, the more likely the word love is to occur. Secondly, if a word is frequently
used in a specific text, it is probably important in that text, but if it is frequently used in all
texts, it is not important in any of them. Therefore, h}bb is a general word because it occurred
in all texts and is used frequently in general texts.

55 By percentage I mean the ratio of the item under examination ‘hbb’ per subcorpus.

Let us now search our corpus for the words in question to see which dictionary meaning
mentioned in table (7.19) is the most common and what semantic feature/s are associated
with it. This can be done by concordances, which are able to detect patterns of usage in
different contexts. This can enable us to examine their collocation easily and discover what
words they group with.

Analysing the co-occurrences of h}bb shows that this word occurs in a pattern. Table
(7.23) below lists the immediate left and right collocates of the forms yuh}ibbu,
[al]h}ubb and [al]mah}abbah in a window of 2 items on either side of the search-term,
with a minimum frequency of 3.

Freq.  Right1

105    who من
77     Allah الله
10     who الذي
9      people الناس
6      he وهو
6      man رجل
6      the son الولد
5      the father الوالد
5      the lover الحبيب
5      the soul النفس
4      sins الذنوب
4      abandoning فراق
4      their hearts قلوبهم
4      his Messenger رسوله
3      repentant التوابين
3      the king الملك
3      the world الدنيا
3      faith الإيمان
3      Zabeedah زبيدة
3      dog الكلب

Freq.  Left1

57     Allah الله
47     who من
26     righteous المحسنين
15     sexual intercourse الجماع
12     repentant التوابين
12     corruption الفساد
12     Messenger رسول
11     for himself لنفسه
11     transgressors المعتدين
11     the pious المتقين
11     the perseverant الصابرين
11     the purified المتطهرين
9      Muhammad محمد
8      Ansar الأنصار
7      those who trust المتوكلين
7      oppressors الظالمين
6      one of you أحدكم
5      disbelievers الكافرين
4      his action عمله
4      the poor الفقراء
4      who الذين
4      man المرء
4      the soul النفس
4      woman امرأة
3      praise المدح
3      optimism التيمن
3      fun اللهو
3      sleep النوم
3      traitors الخائنين
3      food الطعام
3      people الناس
3      life الحياة
Table (7.23): The base-word h}bb in a window of two items on either side.
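
A collocate list of this kind, with a window of two items on either side and a minimum frequency of 3, could be produced along the following lines. This is a sketch rather than the concordancer actually used: the function name and the toy transliterated text are invented, and the two result sets are called "before" and "after" in reading order, leaving open how they map onto the left and right columns of the table.

```python
from collections import Counter

def window_collocates(tokens, node_forms, span=2, min_freq=3):
    """Count the words found within `span` tokens before and after each node."""
    before, after = Counter(), Counter()
    for i, token in enumerate(tokens):
        if token not in node_forms:
            continue
        before.update(tokens[max(0, i - span):i])
        after.update(tokens[i + 1:i + 1 + span])

    def keep(counts):
        # Drop collocates below the minimum frequency threshold.
        return {word: freq for word, freq in counts.items() if freq >= min_freq}

    return keep(before), keep(after)

# Toy transliterated text (invented, for illustration only).
tokens = ["man", "yuhibbu", "allah", "wa"] * 3
before, after = window_collocates(tokens, {"yuhibbu"})
print(before)
print(after)
```

Passing a set of node forms rather than a single string allows several forms of the same base-word to be pooled in one count, as is done for the three forms listed above.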

Not all hits are represented in this table or discussed below, simply because we filtered the
results by removing adjunct56 examples. Those examples, although they contained the desired
lexemes (i.e. left or right collocates), did not constitute subjects or objects for the verbs or
complement of the noun phrase.

Studying the right and left collocates of h}bb (verbal and nominal) can reveal potential
subjects. It emerged that all of the right collocates which can stand for subjects are animate.
The concordances show that the most frequent subject in the list is the relative pronoun ‘who’
(105 times57), followed by the word Allah (77 times), people (9), man (6), son (6), father (5),
soul (5) and dog (3). So, the base-word h}bb reflects one’s inner feeling of liking something.

We can also examine the objects and then ask, “What does X love?” Most of the objects
listed in the table above are either good or bad qualities; however, the most frequent left
collocate in the list is the word Allah. We also have objects like Messenger, man, woman,
food, fun and sleep. We can then conclude that the base-word h}bb can describe someone’s
strong feeling of liking towards something. The thing which is loved can be either animate,
such as people or man, or inanimate, such as food, fun or sleep. The collocates of h}bb can be
summarised in terms of frequency in the following domains: (1) religious experience, (2)
friendship, (3) sexuality, (4) family, and (5) non-human objects.

56 Non-nuclear elements in the sentence like adverbs, adjectives etc.

57 In Arabic relative pronouns are of three types: +human (e.g. man ‘who’), -human (e.g. ma ‘which’), general
(such as alladhi ‘who’ for masculine and allati ‘who’ for feminine).

Let us now have a look at the other item of the pair, wdd, which is more problematic
because there is no consistency in explaining its meaning in the Arabic Qur’anic exegeses
or in translating it afterwards. It is sometimes translated as affection, kindness or friendship,
as will be discussed below.

Lexical Items POS Frequency

1 ‫المودة‬ the love N 78
2 ‫الود‬ the love N 37
3 ‫يود‬ he/they love V 32
4 ‫ود‬ he/they loved V 23
5 ‫تود‬ you love V 13
6 ‫أود‬ I love V 11
7 ‫مودته‬ his love (mas.) N 10
8 ‫الودود‬ the lover N 8
9 ‫يودون‬ they love V 7
10 ‫ودي‬ I like N 7
11 ‫وده‬ his love (fem.) N 5
12 ‫مودتي‬ my love N 6
13 ‫ود‬ love (acc.) N 5
14 ‫مودتك‬ your love (sing., acc.) N 3
15 ‫مودتها‬ her love N 3
16 ‫مودتكم‬ your love (mas., pl.) N 3
17 ‫ودت‬ she loved V 3
18 ‫تواد‬ love V 3
19 ‫ودادهم‬ their love (mas.) N 2
20 ‫ودهم‬ their love N 2
21 ‫وداده‬ his love N 2
22 ‫توده‬ you love him V 2
23 ‫الودود‬ the lover N 2

24 ‫مودتنا‬ our love N 1
25 ‫وداد‬ love N 1
26 ‫ودادي‬ my love N 1
27 ‫مودتهم‬ their love (mas.) N 1
28 ‫ودادكم‬ your love (mas.) N 1
29 ‫ودادتي‬ my love N 1
30 ‫أوده‬ love him V 1
31 ‫التواد‬ the love N 1
32 ‫يودوا‬ they love V 1
33 ‫يوادونهم‬ they love them V 1
34 ودها her love N 1
35 ‫يودك‬ he loves you V 1
36 ‫يوده‬ he loves him V 1
Total: 280
Table (7.24): Lexical Frequency of wdd in CAC

We can notice that the overall frequency of wdd in the whole corpus is far lower than that of
h}bb; wdd constitutes only .005% (280 occurrences) of the whole corpus whereas
h}bb constitutes .039% (1972 occurrences).

Freq.  Right1        Freq.  Left1
15     who من        12     one of them58 أحدهم
11     who الذين
10     many of them كثير
4      somebody فلان
4      the friend الصديق
3      family أهل
Table (7.25): The base-word wdd in a window of two items on either side.

Examining the first node on the left- and right-hand side of wdd, as represented in table (7.25)
above, does not give as much significant information about collocation as it does for h}bb.
The only semantic feature that can be extracted from these examples is that wdd co-occurs with
+human lexemes, whereas h}bb is more general as it can co-occur with +animate lexemes.
This can be represented in the following table.

58 In Arabic, the personal pronoun in plural masculine position, attached or detached, is used to refer to humans

h}bb (L) h}bb (R) wdd (L) wdd (R)
Allah √ √ X X
Messenger √ √ X X
Man √ √ √ √
Woman √ √ √ √
Food √ X X X
Sleep √ X X X
Dog X √ X X
Sexual intercourse √ X X X
Fun √ X X X
Table (7.26) Left and right collocates of h}bb and wdd

Further analysis of the same collocates without applying a filter may reveal other semantic
features invisible to us within a span of two. For example, law (if) followed wdd (verbal) 27
times, whereas law did not co-occur at all with h}bb. We can say then that wdd
behaves like verbs of imagination such as hope and wish, because wdd as a verb can be
followed by if-clauses, just as these verbs can. Let us
have a look at the following examples.

يود أحدهم لو يعمر ألف سنة ...

yawaddu ah}aduhum law yucammaru alfa sanatin

wish one-of-them if live-long thousand years.
Everyone of them wishes that he could be given a life of a thousand years.
(Qur’an, Al-Baqarah: 96)

... ود الذين كفروا لو تغفلون عن أسلحتكم وأمتعتكم فيميلون عليكم ميلة واحدة.

wadda alladhiina kafaruu law taghfuluuna… fa-yamiiluuna calaykum

Wish those disbelieve if neglect-you about arms-your and baggage-your so-attack-you
attack one.
Those who disbelieve wish, if you were negligent of your arms and your baggage, to
attack you in a single rush,
(Qur’an, Al-Nisa’: 102)

.. يوم تجد كل نفس ما عملت من خير محضرا وما عملت من سوء تود لو أن بينها وبينه أمدا بعيدا

yawma tajidu kullu nafsin maa camilat min khayrin muh}d}ara wa ma
camilat min suu’ tawaddu law anna baynaha wa baynahu amadan baciidaa

day find every soul what did of good present and what did-it of evil wish-it if that
between-it and between-it distant time.
On the Day when every person will be confronted with all the good he has done, and
all the evil he has done, he will wish that there were a great distance between him
and his evil.
(Qur’an, Al-Imran: 30)

The use of wdd followed by an if-clause in the above examples sheds light on the possibility
of using this word to mean either love or wish; this is mentioned in the dictionary meanings.
That extra sense (wish) is obviously a good distinction between h}bb and wdd. However, to
find subtle differences between h}bb and wdd we need to keep to one side of the meaning
only: affection. Therefore we will exclude examples containing if-clauses.
This applies only to wdd in verbal forms, because no if-clause occurred after
wdd in nominal form in CAC. wdd in verbal form occurred 112 times in CAC, 55 of which
are followed by law (if). So we will exclude these 55 examples to get a reliable comparison
between h}bb and wdd meaning affection. This leaves 57 examples, and after examining
them we do not get any interesting collocation either, as shown in table (7.27) below.

Right1                 Freq.   Left1          Freq.

and who ومن            5       that أن        5
not ما                 4       many كثير      5
in his saying بقوله    3       wished ود      5
His saying قوله        3       who الذين      2
Table (7.27): Collocates of wdd after excluding the instances followed by if.
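
The exclusion step can be pictured as a simple filter over concordance hits. The sketch below is illustrative only: the function, the `max_gap` setting and the third example are assumptions, while the first two token lists are shortened from the transliterated verses quoted above.

```python
def followed_by(tokens, node_index, marker="law", max_gap=3):
    """True if `marker` appears within `max_gap` tokens after the node."""
    following = tokens[node_index + 1:node_index + 1 + max_gap]
    return marker in following

# Each hit is (tokens, index of the wdd form). The third hit is an invented
# example of the affection sense, added only to exercise the filter.
hits = [
    (["yawaddu", "ah}aduhum", "law", "yucammaru", "alfa", "sanatin"], 0),
    (["tawaddu", "law", "anna", "baynaha", "wa", "baynahu"], 0),
    (["yawaddu", "s}adiiqahu"], 0),
]

wish_sense = [hit for hit in hits if followed_by(*hit)]
affection_sense = [hit for hit in hits if not followed_by(*hit)]
print(len(wish_sense), len(affection_sense))

# The same subtraction applied to the counts reported for CAC.
verbal_total, followed_by_law = 112, 55
remaining = verbal_total - followed_by_law
print(remaining)
```

A gap of a few tokens is allowed because, in the quoted verses, the subject can stand between the verb and law.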

As looking into the concordances of wdd in a small span does not show any significant
collocation, we need to increase the span to see whether we can find any particular
distribution of that word. In a span of five on each side of the search-term we found that
the word tends to occur in a certain semantic profile different from that of h}bb. Below we
will examine the word wdd meaning affection in nominal forms.

Having searched the concordances of wdd and h}bb in that bigger span, we found the
following results:

1) None of the intensifiers or adverbs of degree, such as shadiid ‘very’, kathrat ‘much’ and
zaa’idah ‘exceedingly’, occurred with wdd, whereas h}bb occurred with intensifiers like
shadiid or shiddah ‘very or strong’ (17 times) and kathrah ‘much’ (4), and with adverbs like
fart}u ‘exceedingly’ (6) and zaa’idah ‘excessively’ (5).

2) Some verbs occur more often with wdd than with h}bb. For instance, wdd occurs more
frequently with verbs that mainly describe a concrete or observable action, such as ta’ti
‘come’, tadnu ‘come closer’, tanqatic ‘cut off’, yussir ‘does discreetly’, abana ‘show’, yas’al
‘request’, yabdhul ‘give’, yaquum cala ‘maintain’, ad}aaca ‘waste’, yunaasih} ‘does
sincerely’, incaqad ‘interlink’, jacala ‘make’ and tarjuu ‘wish’. The last verb is the only
example which describes an abstract action.

3) The verbs that co-occur with h}bb mainly describe an abstract or unobservable action:
tahakkama fi ‘control’, sakanat fi ‘rest in’, yaddaci ‘claim’, yarzuq ‘bless’, tashtadd
‘strengthen’, ra’a ‘see’, yuz}hir ‘disclose’, yufrit} ‘exaggerate’, yu’thir ‘prefer’, mazaja
‘establish’, yud}mir ‘hide’, yuksib ‘cause to gain’, zaada fi ‘increase’, yucadhib ‘torture’,
yajlub ‘bring’, waqaca fi ‘fall in’, ‘alqa fi ‘put in’. The last four verbs tend to be concrete.

4) The preposition fi ‘IN’ occurs 37 times with the verbs that precede h}bb, whereas it
occurs twice only with wdd.

We then need to compare the two sets of verbs and determine how likely it is that the
difference between the two sets occurred by chance; this can be done by the t-test59. We only
selected the verbs with a minimum frequency of 3 for the test below.

59 I used to only search the items whose MI scores are significant. We found this test useful in summarising the
whole data which we can use for further analysis. In case of a limited list like the one we have in table 7.29 we
prefer to run the t-test statistic only.

V                 f(h}bb/w)   f(wdd/w)   Gram. Function   T      Significance

fall in يقع في    21          1          O                4.26   P < 0.0001
increase يزداد    10          0          S                3.1    P < 0.02
establish تغلغل   5           0          S                2.3    P < 0.20
bring يجلب        5           0          O                2.3    P < 0.20
keep يحفظ         0           5          O                2.3    P < 0.20
come أتى          0           4          S                2.0    P < 0.20
cut قطع           0           4          O                2.0    P < 0.20
does يسر          0           3          O                1.7    Not sig.
claim يدعي        3           0          O                1.7    Not sig.
request يطلب      0           3          O                1.7    Not sig.
give يبذل         1           3          O                1.0    Not sig.
disclose أظهر     2           3          O                0.4    Not sig.

Table (7.29) T-score of h}bb and wdd (nouns).
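
Most values in the T column appear consistent with a simple t-score for comparing two words in the context of a third: the difference between the two co-occurrence counts divided by the square root of their sum. This is an inference from the figures, not a formula stated here, so the sketch below should be read as one plausible reconstruction. It reproduces, for example, 4.26 for 21 vs. 1 and 1.0 for 1 vs. 3, while a few entries (such as 2.3 for 5 vs. 0, where this formula gives about 2.24) differ slightly.

```python
import math

def t_score(f1, f2):
    """Difference of two co-occurrence counts over the square root of their sum."""
    if f1 + f2 == 0:
        raise ValueError("the verb does not co-occur with either word")
    return (f1 - f2) / math.sqrt(f1 + f2)

# (f(h}bb/w), f(wdd/w)) for a few verbs, copied from table (7.29).
counts = {
    "fall in": (21, 1),
    "increase": (10, 0),
    "come": (0, 4),
    "give": (1, 3),
    "disclose": (2, 3),
}

for verb, (f_hbb, f_wdd) in counts.items():
    t = t_score(f_hbb, f_wdd)
    leaning = "h}bb" if t > 0 else "wdd"
    print(f"{verb:<10} |t| = {abs(t):.2f}  leans towards {leaning}")
```

The sign of the score shows which word the verb favours; the table reports only the magnitude.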

In the table above, the higher the t-score, the more different the pair under examination. We
can notice that h}bb gets the higher t-score in the context of verbs like waqaca ‘fall’, zaada
‘increase’, taghalghala ‘establish’, yajlub ‘bring’ and yaddaci ‘claim’; it co-occurs with verbs
that describe an abstract action. On the other hand, wdd gets the higher t-score when
co-occurring with verbs that refer to a concrete action, such as ‘come, cut, keep, request and
give’. In other words, wdd is used with verbs that express a practical action which affects
somebody else, such as cutting a relation with him, maintaining a relation with him, asking
him, giving him etc. As for h}bb, it expresses an abstract action, as in X falls in love, love
increases, love is established in his heart, brings love, X claims love. By abstract action, I
mean a private action which does not necessarily affect the recipient.

On the basis of the above results (1-4) and table (7.29), we can conclude that wdd (as in result
1) is more emphatic than h}bb, because intensifiers are superfluous items used to amplify
actions, so the absence of intensifiers often indicates more emphasis. Secondly, the frequent
use of motion verbs with wdd (as in results 2 and 3 and table (7.29)) shows that wdd is more
concrete than h}bb. Finally, the preposition IN, which means containment or inclusion, i.e.
locating or limiting the activities of the contained entity, occurs more frequently with h}bb
(as in result 4), which might be an indication that h}bb tends to be contained or to lie in a
particular place. So we can conclude that wdd is +emphatic and +concrete.

Secondly, a further look at the concordances of the pair, in a span of five on both sides,
reveals that qalb ‘heart’ co-occurs 79 times with h}bb and only once with wdd. The fact that
the word heart co-occurs more frequently with h}bb indicates that there is a strong bond
between them and that the heart is traditionally and psychologically connected to feelings like
h}bb. This gives further evidence that h}bb is an abstract feeling.

Thirdly, the following Qur’anic verse can be a piece of evidence in favour of the above
conclusion, as shown in (26):

(26) O ye who believe! Take not my enemies and yours as friends (or protectors), offering
them (your) mawaddah (love), even though they have rejected the Truth that has come to
you, and have (on the contrary) driven out the Messenger and yourselves (from your homes),
(simply) because ye believe in Allah your Lord! (Qur’an, Al-Mumtahinah: 1)

This verse was revealed about a man (Hatib ibn Abi Baltacah) who was in the Muslim army
heading towards Makkah to liberate it from the pagans. He sent a message to the pagans of
Quraysh requesting protection for his children and relatives left behind in Makkah in return
for information about the Muslims’ strategy and the weaponry being prepared to conquer
Makkah. When the man was caught, he declared that he hated the people to whom he had sent
the message, and he was truthful about his feeling. He only intended to do the people of
Makkah a favour by virtue of which his family and property in Makkah might be protected. The
Prophet said he was truthful. This story is recorded in the Qur’an, where Allah described the
favour he did to the people of Makkah as wudd.

Fourthly, one of Allah’s names is al-waduud (the Loving). This is because, in the first place,
h}bb (love) is commonly understood as a bond between two entities and as some kind of need.
Secondly, it is a state of lack of control. These apparently do not fit with Allah’s perfection.
Moreover, wdd is more general than h}bb, i.e. h}bb is devoted to particular persons, which
could be, in Allah’s sight, the pious who are true believers. This is because all occurrences
of h}bb in verbal forms with Allah show that Allah loves particular people who are righteous
and does not like the wrongdoers. So if He named Himself Al-h}abiib, this would be a static
attribute that excludes some people forever from His bounties and blessings, which are
available to all people.

Therefore, based on the above remarks, the semantic feature that can be extracted to
differentiate between h}bb and wdd is abstract vs. concrete.

7.8 Conclusion
First, the semantic features that distinguish the four widely claimed synonymous pairs
discussed above can be summarised as
• intrinsic vs. extrinsic (as between ithm & dhanb)
• closer vs. further (as between ata & jaa’a)
• negative vs. positive (as between ata & jaa’a, h}asiba & z}anna)
• considered vs. immediate reaction (as between z}anna & h}asiba)
• abstract vs. concrete action (as between h}bb & wdd)

Secondly, we used the following methodology to test the synonymy between two items of a
given pair.

• Using a corpus to help get hold of all the occurrences of the pair under investigation
quickly and accurately.
• Identifying the word class of a given item. This is important in looking for collocation
because it enables us to know which word is more significant. For example, the
collocation of a word which is a verb is more likely to be found on the right-hand side
in Arabic. This is done manually.
• Determining the syntactic function of the term under investigation. It is also important
because sometimes we need to look at the complement of an item.
• Analysing collocation.
• Analysing the context to understand how/when the variants are used (semantic
• Applying statistics to find anything interesting about their distribution.
• Identifying a semantic feature of the search term according to its
contextual use.
• Substituting one word for the other to see if any change happens in the meaning of the
Chapter Eight: Conclusion
Arabic corpus linguistics is a very active area; I had to rework what I had done several times
because of the incessant contributions in this field, especially when discussing the available
corpora and tools that work on the Arabic language.

One of the main contributions of this study is a computational corpus of early classical
Arabic. This corpus will be available for research purposes, to be
exploited in NLP applications for Arabic and for more accurate analysis of Arabic linguistic
phenomena. With regard to size, Arabic corpora should be big enough to be reliable for
generalisation, due to the richness of Arabic vocabulary. For example, a given word
is expected to appear less often in an Arabic text than in an English text of the same length
(Goweder and de Roeck (2001)). This is because of the inflectional nature of Arabic and the
abundance of its vocabulary (cf. p. 99).

The corpus-based analysis can be used as a successful methodology for testing what has been
introduced by early linguists on all linguistic levels (morphology, syntax, semantics, etc.).
More than that, it can give new insights and introduce rules and models which have not been
previously discussed. There seems to be no corpus-based research directly analysing
synonymous words in Arabic, classical in particular. I do not claim that my analysis is correct
or privileged, but rather that it is more methodical and systematic than one based on intuition.

Final findings suggest that applying corpus linguistics methodology to Arabic can help us
improve lexical awareness and choice as most Arabic linguists are unaware of the
collocational differences between synonymous pairs, let alone ordinary native speakers of

Corpus-based analysis of items which are often regarded as roughly synonymous in Arabic
can highlight subtle differences in meaning among such items. This can be done by
abstracting semantic features through comparing differences observed in their contextual
idiosyncrasies and examining practical examples of the usage of such items. In this way,
absolute synonymy can be ruled out if we come across one context in which one member of the
synonymous pair carries more meaning, has a different distribution or is used in a different
register. Also, with the aid of statistical techniques we can obtain an accurate account of
whether there are systematic differences in the use of certain types of seemingly synonymous
words by summarising their distribution in the corpus.

The results given throughout my work imply a need for a fresh look at Arabic studies. The
new and unexpected shades of meaning will raise many questions about the credibility of
most old and modern Arab contributions in the following fields:
1) Lexicography
2) Interpretation of the Holy Qur’an
3) Translation of the Qur’an
4) Jurisprudence
5) Prophetic Traditions (Hadith)

In lexicography, for example, had the dictionary-makers been aware of the subtle differences
and uses of seemingly synonymous words they would have made more accurate definitions.
Suppose we use the corpus-based methodology to build up an Arabic lexicon. As mentioned
elsewhere, the macro structure of an Arabic-Dutch dictionary contains 24,000 words. Since
the prevailing view is that the Arabic vocabulary is very extensive, we might ask ourselves if
a dictionary containing 24,000 words will serve the user sufficiently when reading or
listening to Arabic. Although Nijmegen University recently has managed to create that kind
of corpus-based lexicon, it is only restricted to texts written in Modern Standard Arabic.

In the field of Qur’an exegesis lots of work has been done but based on the old perspectives:
non-corpus-based. The outcome was huge, yielding various contributions. Nonetheless, some
verses are left either vague or misinterpreted because of the vagueness of some lexemes as in
verse 2:78
wa minhum ummiyuuna la yaclamuun al-kitaaba illa amaaniyya wa in hum illaa

And from-them unlettered not know-they the-book except wishes and but they think-
And there are among them unlettered people, who know not the Book, but they trust
upon false desires and they but guess. (Qur’an, 2: 78)

The verse above talks about some Jews who are illiterate and do not know the reality of their
book; however, they follow their scholars blindly and believe them. This is a different
category from those who know the truth and falsify it, mentioned in the verse preceding it (2:
75). So if we interpret z}anna as doubt, as commentators like Al-Tabari and ibn Katheer
do, we presume that this second category of Jews, who do not know the reality of their book,
do not believe in it. But this is not the case, since this category is blindly following their
scholars, and this is a type of belief. We would rather say that there are some Jews who only
know the false version of the Bible and are certain about what they know even if it is false.
This meaning cannot be attained by simple study of the word; rather, it requires an
accurate probing of all the senses of the word based on the corpus methodology.

As for the translation of Qur’an, it is basically based on its exegesis. It depends on the same
methodological approach of the author of the exegesis.

In Jurisprudence, many of the arguments between Muslim scholars and schools of thought
arise from their own understanding of the language of the Qur’an and Hadith. One of the
main reasons for such differences is their linguistic disagreement concerning some texts of the
Holy Qur’an and Prophetic traditions on the syntactic or semantic level. This sometimes
leads to differences in understanding and formulating the laws derived from such texts. For
example, s}uurah as used in hadiths is interpreted as ‘picture’. Such an interpretation could
lead to forbidding all types of painted pictures or photographs. This is the opinion of a large
group of Muslims nowadays, called Salafis, who understand s}uurah as a picture. Another
group of Muslim scholars interpret s}uurah as statue because this is the meaning which
was current in the Prophet’s lifetime. They further argue that this ruling is only applicable to
statues which are made to be respected and worshipped.

Corpus-based analysis can distinguish between the different senses of a given word
synchronically or diachronically. With this methodology, a particular sense of a word is

Appendix 1: Copyrights

Muhaddith website
Conditions for copying books from our site:
You may copy books from our site to another, according to the following
• The usage of the book on your site must be for non-commercial
to each book you copy, mentioning that your source is “Al
Muhaddith Project”, and adding a link to our site.
• Give proper notice concerning books that are not permitted to use for
commercial purposes.
As an example, refer to our note concerning Ibn Katheer’s summary by
• Upon completion, informing us and sending us a link to

Alwaraq website
From: “Moutasem Zakkar” <>
To: <>
Subject: Re: I need permission for downloading
Date sent: Sat, 2 Mar 2002 10:30:37 +0400

Dear Sir :

Thank you for your message .

You can get any page you want from any book by pressing the button whose
hint is “Send me this page” ,
and you will have to put your Email address and then you will recive it
immidieatly .
In the future you will be able to Download any book after paying a fee .

Moutasem Zakkar
Technical manager

----- Original Message -----
From: “Abdel-Hamid Elewa” <>
To: <>
Sent: Tuesday, February 12, 2002 5:37 PM
Subject: I need permission for downloading

> as-salamu alaykum,

> thx alot for the big effort you have done in Alwarak
> project. Can I get hold of some books from that project for
> the sake of research as my Ph.D project is on Arabic
> linguistics and I need to work on computerised Arabic
> texts. Just give me permission and I can download some
> pages from the your site.
> yours
> Elewa
> Elewa
> Department of language engineering
> Centre for Computational Linguistics
> Manchester
> UK

Appendix 2:
The contents of the CAC are summarised in the following charts (1) & (2):
[Chart (1): labels include mathematics, geography, Qur'an, Hadith]
[Chart (2): labels include proverbs, philosophy, fiction, theology]

Appendix 3:
Genres and texts included in CAC.
Genre: Thought and Belief

Subgenre                      Texts                                     Text Size   Perc.
The Holy Qur’an               The Holy Qur’an                              88,622   1.8
Prophetic Tradition (Hadith)  1. Sahih Al-Bukhari                         683,970   13.6
                              2. Sahih Muslim
Biography                     1. seerah of Ibn Hisham                     393,933   7.8
                              2. Al-Akhbar Al-Tiwal
Philosophy                    1. Ara’ Ahl Al-Madinah Al-Fadilah by        478,141   9.5
                              2. Logic by Ibn Sina
                              3. Al-Falsafa Al-Ula by Al-Kindi
Theology                      1. Exegesis of the Qur’an (Tafseer):      1,037,387   20.7
                                 by Al-Tabari
                              2. Jurisprudence (fiqh): Al-Risalah
                                 by Al-Shafi’i
                              3. Dogmatics (al-’Aqeedah): Al-Ibanah
                                 by Al-Ash’ari
Poetry                        1. Al-Mucallakt                              69,385   1.3
                              2. Al-Mutanabbi collected
Fiction                       1. Arabian Nights                           579,223   11.5
                              2. The Misers
Proverbs                      1. Majmac Al-Amthaal                        362,054   7.2
                              2. Jamharat Al-Amthaal
Lexicons                      1. Al-’Ayn (Al-Khalil)                      404,080   8.0
                              2. Fiqh al-Lughah (Al-Tha’alibi)
Geography                     Ahsan al-Taqaseem fi ma’rifat al-Aqaleem     82,499   1.46
                              by Al-Maqdisi
Physics                       Al-Jamaher Fi Ma’rifat al-Jawahir by         57,553   1.15
Medicine                      Al-Qanun fi Al-Tib by ibn Sina              736,469   14.72
Mathematics                   Mafatih Al-’Ulum by Al-Khawarizmi            26,684   0.53

Appendix 4:
A sample of concordances as appearing on the Monoconc window.

Appendix 5:
A picture of the concordance lines run by Monoconc and then saved to a text-only file.
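
As a rough illustration of what text-only concordance lines look like, the following Python sketch produces keyword-in-context (KWIC) lines from a plain string. It is not Monoconc's implementation: the function name `kwic`, the context width and the sample sentence are all assumptions made for this example.

```python
import re


def kwic(text, keyword, width=30):
    """Return keyword-in-context lines for every occurrence of `keyword`.

    Hypothetical helper for illustration only; it does not reproduce Monoconc.
    Matching is case-insensitive and restricted to whole words.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        # Take up to `width` characters on each side of the match.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that the keywords form a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


if __name__ == "__main__":
    # Invented sample text, not taken from the CAC.
    sample = (
        "The corpus shows that the word has several senses. "
        "Each sense of the word appears in a different context."
    )
    for line in kwic(sample, "word"):
        print(line)
```

Each output line places the search word in a fixed column with its left and right context on either side, which is the layout a concordancer uses to let the analyst compare the senses of a word across its occurrences.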

Aijmer, K. & Altenberg, B. (1991). English Corpus Linguistics. Longman, London and New York.
Al-Anbari, Abu Barakat, (d. 1207). nuzhat al-alibbaa’ (The Fun of the Men of Wit). (ed.) Abu
al-Fadl Ibrahim, Daar Nahd}at Misr li-l-T}abc wa-l-Nashr, Cairo.
Al-Ashcari, A. (1994). al-ibaanah can us}uul al-diyaanah (Explanation About the Basics of
Belief). (ed.) Abbas Sabbagh, Daar Al-Nafaa’is, Beirut.
Al-Askary, A. (1931). al-furuuq fi al-lughah (Differences in Language). (ed.) Imad al-Barudi,
Daar Al-Kitaab Al-cArabi, Damascus.
Al-Bustani, B. (b. 1819-1883). muh}iit al-muh}iit: ay qaamuus mut}awwal lil-lughah
al-cArabiyyah (The Comprehensive Ocean: i.e. A Detailed Dictionary of the Arabic
Language). Beirut.
Al-Fayruzabadi, M. (1952). al-qamuus al-muhiit (The Comprehensive Lexicon).
Mat}bacat Mustafa Al-Babi Al-Halabi, Cairo.
Al-Hamadhani, A. (1991). al-alfaaz} al-kitaabiyyah (The Literary Words). Dar al-Kutub al-
cIlmiyyah, Beirut.
Al-Jabouri, A. J. R. & Knowles, F. E. (1988). “A computer-assisted study of cohesion based
on English and Arabic corpora: An interim report”. Proc. of 13th International
Conference, University of E. Anglia (Norwich), 1-4 April 1986, Computers in
Literary and Linguistic Research. Paris; Geneva, Champion; Slatkine, pp. 59-77.
Allen, J. (1995). Natural Language Understanding. The Benjamin/Cummings Publishing
Company, Inc. Redwood City, CA.

Almuhanna, A. (2003). Scientific and Technological Term Transfer into Arabic: A Corpus-
Based Study of Arabic Noun + Noun and Noun + Adjective Compounds.
Unpublished Ph.D. thesis, UMIST, Manchester.
Al-Qurtubi M. (1998). al-jaamic li-ahkaam al-qur’aan (The Compendium of
Qur’anic Rulings). Daar Al-Fikr, Lebanon.
Al-Sakkaki, Y. (b. 1066). Miftaah} al-culuum (the Key to Sciences) (1st ed.).
Mat}bacat Mustafa Al-Babi Al-Halabi, Cairo.
Al-Tabari, I. (d. 922). jamic al-bayaan fi ah}kaam al-qur’aan (The
Comprehensive Book in the Rulings of the Qur’an). Daar Al-Fikr, Beirut.
Al-Yaziji, I. (1970). nujcat al-raa’id wa shurcat al-waarid fi al-mutaraadif wa-l-mutaawarid
(The Spring of the Seeker in Synonyms and Associations) (2nd ed.). Maktabat
Lubnaan, Beirut.
Atkins, S., Clear, J. & Ostler, N. (1992). “Corpus Design Criteria”. Literary and Linguistic
Computing, vol. 7, 1: 1-16.
Badawi, E. (2000). “An opinion on the meanings of icrab in Classical Arabic: The state of the
nominal sentence”. In Diversity in Language: Contrastive Studies in English and
Arabic Theoretical and Applied Linguistics. (eds.) Ibrahim, Z., Kassabgy, N.,
Aydelott, S. The American University in Cairo Press, Cairo.
Bakalla, M. H. (1983). Arabic Linguistics: An Introduction and Bibliography. Mansell
Publishing Ltd, London.
Barlow, M. (1999). Monoconc Program, Version 1.0. Athelstan, Houston, USA.
Barnbrook, G. (1996). Language and Computers. Edinburgh University Press, Edinburgh.
Benson, M. (1990). “Collocations and general-purpose dictionaries”. International Journal of
Lexicography, vol. 3, 1: 23-35.
Benson, M., Benson, E. & Ilson, R. (1997). The BBI Dictionary of English Word
Combinations. John Benjamins, Amsterdam/Philadelphia.
Berry-Rogghe, G. L. M. (1970). Collocations: Their Computation and Semantic
Significance. Unpublished Ph.D. thesis, University of Manchester.
Biber, D. (1993). “Representativeness in corpus design”. Literary and Linguistic Computing,
vol. 8, 4: 243-257.
Biber, D., Conrad, S. & Reppen, R. (1994). “Corpus-based approaches to issues in applied
linguistics”. Applied Linguistics, vol. 15, 2: 169-189.
Biber, D., Conrad, S. & Reppen, R. (1998). Corpus Linguistics: Investigating Language
Structure and Language Use. Cambridge University Press, Cambridge.
Bloomfield, L. (1935). Language. Allen & Unwin, London.
Bohas, G., Guillaume, J.-P. & Kouloughli, D. (1990). The Arabic Linguistic Tradition.
Routledge, London.
Brinton, L. and Akimoto, M. (eds.) (1999). Collocational and Idiomatic Aspects of Composite
Predicates in the History of English. John Benjamins Publishing.
Carter, R. (1987). Vocabulary: Applied Linguistic Perspectives. Allen & Unwin, London.
Charniak, E. (1993). Statistical Language Learning. The MIT Press, Cambridge,
Massachusetts.
Chejne, A. (1969). The Arabic Language: Its Role in History. University of Minnesota Press,
Minneapolis.
Chomsky, N. (1965). Aspects of the Theory of Syntax. The MIT Press, Cambridge,
Massachusetts.
Chomsky, N. (1971). Chomsky: Selected Readings. (eds) J.P. Allen & Paul Van Buren.
Oxford University Press, London.
Choueka, Y.; Klein, T. and Neuwitz, E. (1983). “Automatic retrieval of frequent idiomatic
and collocational expressions in a large corpus”, Journal for literary and linguistic
computing, vol. 4: 34-38.
Christopher D. M. and Hinrich S. (1999). Foundations of Statistical Natural Language
Processing. The MIT Press, Cambridge, Massachusetts.

Church, K. and Hanks, P. (1990). “Word association norms, mutual information and
lexicography”, Computational Linguistics, vol. 16: 22-29.
Church, K., Gale, W., Hanks, P. and Hindle, M. (1991). “Using statistics in lexical analysis”, in
Lexical Acquisition: Exploiting On-line Resources to Build a Lexicon, (ed.) Zernik,
U. Hillsdale, NJ: Lawrence Erlbaum Associates, 115-164.
Clear, J. (1993). “From Firth principles, computational tools for the study of collocation”. In
Text and Technology, in Honour of John Sinclair, (eds.) Baker, M., Francis, G. &
Tognini-Bonelli, E. John Benjamins, Philadelphia/Amsterdam, pp. 271-292.
Cowie, A. P. (1978). The Place of Illustrative Material and Collocations in the Design of a
Learner’s Dictionary, in Honour of A.S. Hornby. Oxford University Press, Oxford.
Cowie, A. P. (1981). “The treatment of collocations and idioms in learners’ dictionaries”.
Applied Linguistics vol. 3, 223-235.
Cruse, D. A. (2000). Meaning in Language: An Introduction to Semantics and Pragmatics.
Oxford University Press, Oxford.
Cruse, D. A. (1986). Lexical Semantics. Cambridge University Press, Cambridge.
Crystal, D. (1987). The Cambridge Encyclopaedia of Language. Cambridge University Press,
Cambridge.
Dichy, Joseph (2001) “On lemmatization in Arabic, a formal definition of the Arabic entries
of multilingual lexical databases”. 39th Annual Meeting of the ACL, Workshop on
Arabic Language Processing: Status Prospects, Toulouse.
Dickins, J., Hervey, S. and Higgins, I. (2002). Thinking Arabic Translation, A Course in
Translation Method: Arabic to English. Routledge, London.
Ditters, E. (1990). “Arabic corpus linguistics in past and present”. In Studies in the History of
Arabic Grammar II, (eds.) Versteegh K. & Carter G. John Benjamins Publishing
Company, Amsterdam, pp. 129-141.
Emery, P. (1988). Body-Part Collocations and Idioms in Arabic and English. Unpublished
Ph.D. thesis, University of Manchester.
Fasold, R. (1984). The Sociolinguistics of Society: Introduction to Sociolinguistics vol. 1.
Basil Blackwell Ltd. England.
Fillmore, C. (1992). “‘Corpus linguistics’ or ‘Computer-aided armchair linguistics’”. In
Directions in Corpus Linguistics, (ed.) Svartvik J. Mouton de Gruyter, Berlin, New
York.
Firth, J.R. (1935). Papers in Linguistics, Oxford University Press, London.
Firth, J.R. (1957). “Modes of meaning”. In Papers in Linguistics, Oxford University Press,
London.
Francis, N. (1992). “Language corpora B.C.”. In Directions in Corpus Linguistics, (ed.)
Svartvik J. Mouton de Gruyter, Berlin, New York.
Freeman, A. (2001). “Brill’s POS tagger and a morphology parser for Arabic”. 39th Annual
Meeting of the ACL, Workshop on Arabic Language Processing: Status Prospects,
Toulouse.
Garside, R., Leech, G. and Sampson, G. (1987). The Computational Analysis of English, a
Corpus-Based Approach. Longman, London and New York.
Garside, R., Leech, G. and McEnery, T. (1997). Corpus Annotation. Longman, London & New
York.
Ghali, M. (1997). Synonyms of the Glorious Qur’an. Daar al-nashr lil-Jaamicaat, Cairo.
Ghazala, H. (2001). “Cross-cultural link in translation (English-Arabic)”. Majallat Al-Lisaan
Al-cArabi (The Magazine of the Arabic Language), vol. 50, Al-Ribat, Morocco.

Goldziher, I. (1966). A Short History of Classical Arabic Literature. (trans.) J. DeSomogyi.
Georg Olms, Hildesheim.
Goweder, A. and Roeck, A. (2001). “Assessment of a significant Arabic corpus”. 39th Annual
Meeting of the ACL, Workshop on Arabic Language Processing: Status Prospects,
Toulouse.
Granger, S. (1999). “Use of tenses by advanced EFL learners: Evidence from an Error-tagged
computer corpus”. In Out of Corpora, (eds.) Hasselgard & Signe Oksefjell, Rodopi,
Amsterdam, pp. 191-202.
Granger, S. (ed.) (1998). Learner English on Computer. Longman, London and New York.
Gross, M. (1990). Constructing Lexicon-Grammar. University of Paris, Paris.
Guillaume, A. (1931). The Legacy of Islam. Oxford University Press, Oxford.
Haeri, N. (2003). Sacred Language, Ordinary People: Dilemmas of Culture and Politics in
Egypt. Palgrave Macmillan, New York.
Halliday, M.A.K. (1991). “Corpus studies and probabilistic grammar”. In English Corpus
Linguistics, (eds.) Aijmer, K. & Altenberg B. Longman, London and New York.
Halliday, M.A.K., McIntosh, A. and Strevens, P. (1964). The Linguistic Sciences and
Language Teaching. Longman, London.
Hanks, P. (2000). “Literal and metaphorical word meaning”. Tuscan Word Centre document.
Harris, R. (1973). Synonymy and Linguistic Analysis. University of Toronto Press,
Toronto.
Haywood, J. (1965). Arabic Lexicography, (2nd ed.). Brill, Leiden.
Hitti, P. K. (1958). History of the Arabs. Macmillan, New York.
Hoey, M. (1997). “From concordance to text structure: New uses for computer corpora”. Talk
given at the 1997 Practical Applications of Language Corpora (PALC) conference,
University of Lodz, April 12-14. Later published in Melia, J. & Lewandowska, B.
(eds) Proceedings of PALC 97. Lodz University Press, Lodz.
Hoogland, J. (1993). “Collocation in Arabic (MSA) and the treatment of collocations in
Arabic dictionaries”. The Arabist, Proceedings of the Colloquium on Arabic
Lexicology and Lexicography, Budapest, 1-7 Sept. 1993, (eds.) Devenyi, K., Ivanyi,
T. and Shivtiel, A. Csoma de Koros Soc, Budapest, Hungary.
Horrocks, G. (1987). Generative Grammar. Longman, London & New York.

Hurford, J. & Heasley, B. (1983). Semantics: A Coursebook. Cambridge University
Press, Cambridge.
Ibn Al-Anbari, (1904). al-ad}daad (Antonyms). (ed.) Abu al-Fadl Ibrahim, Al-
Maktabah Al-cAs}riyyah, Lebanon.
Ibn Faris, A. (d. 1105). al-s}aahibi. (ed.) Al-Sayed Sakr. Mat}bacat Isa Al-Babi Al-Halabi
wa-shurakaah, Cairo.
Ibn Katheer, I. (1996). Tafseer al-qur’aan al-caziim (Explanation of the Great Qur’an). Daar
al-macrifah, Lebanon.
Ibn Jinni, A. (d. 1102). al-khasaa’is (The Properties). Mat}bacat Al-Hilal, Cairo.
Ibn Manzur, M. (b.1232-1311 or 12). lisaan al-carab (Arabs’ Language). Daar Bayruut lil-
T}ibacah wa-al-Nashr, Beirut.
Ivanyi, T. (1993). “Dynamic vs. static: a type of lexical parallelism in the maqamat of al-
Hamadhani”, The Arabist, Proceedings of the Colloquium on Arabic Lexicology and
Lexicography, Budapest, 1-7 Sept. 1993, (eds.) Devenyi, K., Ivanyi, T. and Shivtiel,
A. Csoma de Koros Soc, Budapest.
Izwaini, S. (2000). Translating Collocations: Arabic/English/Swedish. Unpublished MA
dissertation, CTIS, UMIST, Manchester.
Izwaini, S. (in progress). Translation and The Language of Information Technology: A
Corpus-Based Study of the Vocabulary of Information Technology and Translation
from English into Arabic and Swedish. Unpublished Ph.D. thesis, UMIST,
Manchester.
Jackson, H. (1988). Words and Their Meaning. Longman, London and New York.
Johansson, S. (1995). “ICAME-Quo Vadis? Reflections on the use of computer corpora in
linguistics”. Computers and the Humanities, vol. 28: 243-252.
Jones, S. (1986). Synonymy and Semantic Classification. Edinburgh University
Press, Edinburgh.
Jones, S. and Sinclair, J.M. (1974). “English lexical collocations”. Cahiers de Lexicologie,
vol. 24: 15-61.

Kamir, D., Soreq, N. and Neeman, Y. (2002). “A Comparative NLP system for Modern Standard
Arabic and Modern Hebrew”. In Rosner, M. & Wintner, S., Proceedings of the
Workshop on Computational Approaches to Semitic Languages. University of

Kennedy, G. (1998). An Introduction to Corpus Linguistics. Longman, London.

Kenny, D. (1999). Norms and Creativity: Lexis in Translated Text. Unpublished Ph.D. thesis,
UMIST, Manchester.
Kenny, D. (2001). Lexis and Creativity in Translation: a Corpus-Based Study. St. Jerome
Publishing, Manchester.
Khalid, J. AlDaimi and Maha A. Abdel-Amir (1994). “The syntactic analysis of Arabic by
machine”. Computers and the Humanities, vol. 18: 29-37.
Khoja, S., Garside, R. and Knowles, G. (2001). “A tagset for the morphosyntactic tagging of
Arabic”. Proc. of the Corpus Linguistics 2001 Conference, Lancaster University, 29
Mar-2 Apr 2001.
Khoja, S. (2003). An Automatic Arabic Part-of-Speech Tagger. Unpublished Ph.D. thesis,
University of Lancaster.
Kjellmer, G. (1987). “Aspects of English collocations”. In Corpus Linguistics and Beyond,
(ed.) Meijs, W. Rodopi, Amsterdam, pp. 133-140.
Knowles, G. (1996). “Corpora, databases and the organisation of linguistic data”. In Using
Corpora for Language Research, (eds.) Thomas J. & Short M. Longman, London
and New York.
Koenraad, d., Hazel, G, Espen, O., Tito, O, Harold, S, Jacques, S. and William V. (eds.)
(1999). Computing in Humanities Education: A European Perspective. The
University of Bergen, Bergen.
Krenn B. and Samuelsson, C. (1997). “The Linguist’s Guide to Statistics”,
Langendoen, T. (1968). The London School of Linguistics. MIT Press, Cambridge,
Massachusetts.
Leceibi, H. (1980). al-taraaduf fi al-lughah (Synonymy in Language). Dar al-Rashiid,
Leech, G. (1991). “The state of the art in corpus linguistics”. In English Corpus Linguistics,
(eds.) Aijmer, K. & Altenberg B. Longman, London, pp. 8-29.
Lehrer, A. (1974). Semantic Fields and Lexical Structure. North-Holland, London.
Lewis, M. (1993). The Lexical Approach. Language Teaching Publications, Hove, England.

Louw, B. (1993). “Irony in the text or insincerity in the writer? The diagnostic
potential of semantic prosodies”. In Text and Technology: In
Honour of John Sinclair, (eds.) Baker, M., Francis, G. and E. Tognini-
Bonelli. John Benjamins, Amsterdam, pp. 157-176.
Lyons, J. (1963). Structural Semantics. Basil Blackwell, Oxford.
Lyons, J. (1969). Introduction to Theoretical Linguistics. Cambridge University
Press, Cambridge.
Lyons, J. (1977). Semantics. Cambridge University Press, Cambridge.
Lyons, J. (1981a). Language, Meaning and Context. Fontana Paperbacks, GB.
Lyons, J. (1981b). Language and Linguistics. Cambridge University Press,
Cambridge.
Lyons, J. (1995). Linguistic Semantics: An Introduction. Cambridge University
Press, Cambridge.
Majmac al-lughah al-carabiyyah (1977). al-wasiit} (The Intermediate). Daar al-macaarif, Cairo.
Makkai, A. (1987). “Major diseases of linguistics”. In Language topics, Essays in honour of
M. Halliday, (eds.) Ross S. & Terry T. John Benjamins Publishing Co., Amsterdam/
Philadelphia, pp. 269-280.
Manning, C. and Schütze, H. (1997). Foundations of Statistical Natural Language
Processing. MIT Press, Cambridge, Massachusetts.
Matthews, P. H. (1993). Grammatical Theory in the United States from Bloomfield to
Chomsky. Cambridge University Press, Cambridge.
McEnery, T. & Wilson, A. (1996). Corpus Linguistics. Edinburgh University Press,
Edinburgh.
McIntosh, A. and Halliday, M.A.K. (1966). Patterns of Language: Papers in General,
Descriptive and Applied Linguistics. Longmans, London.
Meyer, Ch. (2002). English Corpus Linguistics. Cambridge University Press, Cambridge.
Miller, George A. (1963). Language and Communication. McGraw-Hill Book Company, Inc.,
New York, Toronto, London.
Mitchell, T.F. (1971). “Linguistic ‘going on’: Collocations and other lexical matters arising
on the syntagmatic/linguistic record”. ARCHIVUM LINGUISTICUM, 2 (new series,

Mubarak, M. (1982). The Grammar of Arabic. Daar Al-Kitaab Al-Lubnaani, Beirut.
Mujahid, I. (1989). Tafsiir Mujahid (Explanation of the Qur’an by Mujahid), version 1.
verified by M. Abdel-Salam. Daar Al-Fikr al-Islaamiy al-h}adiithah, Cairo.
Nelson, M. (2000). A Corpus-Based Study of Business English and Business English
Teaching Materials. Unpublished Ph.D. thesis, University of Manchester.
Owens, J. (1988). The Foundation of Arabic Grammar. John Benjamins Publishing Company,
Amsterdam.
Palmer, F. R. (1981). Semantics (2nd ed.). Cambridge University Press, Cambridge.
Rene, A. (2000). “Language, concepts and culture: old wine in new
bottles”. Bilingualism: Language & Cognition, vol. 1, issue 1.
Cambridge University Press.
Renouf, A. J. (1984). “Corpus development at Birmingham University”. In Corpus
Linguistics: Recent Developments in the Use of Computer Corpora in English
Language Research, (eds.) Jan Aarts & Willem Meijs. Rodopi, Amsterdam, pp. 3-39.
Robert, A. (2004). aConCorde Program, version 0.4. University of Leeds.
Roulet, E. (1975). Théories grammaticales, Descriptions et Enseignement des Langues
(Applied linguistics and language study). (trans.) Christopher N. Candlin, Longman:
Sinclair, J. (1987a). Looking Up. Collins, London and Glasgow.
Sinclair, J. (1987b). “Collocation: a progress report”. In Language topics, Essays in honour
of M. Halliday, (eds.) Ross S. & Terry T. John Benjamins Publishing Co., Amsterdam/
Philadelphia, pp. 319-332.
Sinclair, J. (1991). Corpus, Concordance and Collocation. Oxford University Press, Oxford.
Sinclair, J. (ed.) (1995). Collins Cobuild English Dictionary (2nd ed.). HarperCollins, London.
Sinclair, J. (1996). “The search for units of meaning”. Reprinted with permission from Textus
IX, pp. 75-106.
Sinclair, J., Mason, O., Ball, J. and Barnbrook, G. (1998). “Language independent statistical
software for corpus exploration”. Computers and the Humanities, 31: 229-255.
Smadja, F. (1991). “Macrocoding the Lexicon with Co-occurrence Knowledge”. In Lexical
Acquisition, (ed.) Zernik, U. Lawrence Erlbaum Associates, NJ., 165-189.
Smadja, F. (1994). “Retrieving collocations from text: Xtract”. In Computational Linguistics.
MIT Press, vol. 19, 1: 143-177.
Smadja, F., McKeown, K. and V. Hatzivassiloglou (1996). “Translating collocations for
bilingual lexicons: A statistical approach”. Computational Linguistics, 22, 1: 1-38.
Somekh, S. and Alexander B. (ed.) (1991). Genre and Language in Modern Arabic literature.
Otto Harrassowitz, Wiesbaden.
Souter, C. and Atwell, E. (eds.) (1993). Corpus-Based Computational Linguistics. Rodopi,
Amsterdam.
Steins, J. M. (1978). The Random House Dictionary of the English Language. Ballantine
Books, New York.
Straley, D. S. (1989). An Annotated Bibliography of American Doctoral Dissertations on
Arabic Language, Literature and Culture, 1967-1986. AATA, Columbus, Ohio.
Stuart, A. (1968). Basic Ideas of Scientific Sampling. Charles Griffin & Company Ltd.,
London.
Stubbs, M. (1993). “British traditions in text analysis from Firth to Sinclair”. In
Text and Technology: In Honour of John Sinclair, (eds.) Baker, M.,
Francis, G. and E. Tognini-Bonelli. Amsterdam, John Benjamins, pp.
Stubbs, M. (1995a). “Collocations and semantic profiles: on the cause of the trouble with
quantitative studies”. Functions of Language, 2, 1: 23-55.
Stubbs, M. (1995b). “Corpus evidence for norms of lexical collocation”. In Principle and
Practice in Applied Linguistics: Studies in honour of H.G.Widdowson, (eds) Cook,
G. & Seidlhofer, B. Oxford University Press, Oxford.
Stubbs, M. (1996). Text and Corpus Analysis. Blackwell, Oxford.
Stubbs, M. (2001a). Words and Phrases, Corpus Studies of Lexical Semantics. Blackwell,
Oxford.
Stubbs, M. (2001b). “Recent Work on Phraseology: The View from Corpora”. A seminar
given at CTIS, UMIST.
Suyuti, A. (d. 1505). al-muzhir fi culuum al-lughah wa-anwacihaa (The Flowery Book in
Linguistics and Types of Languages). Daar al-Jiil, Beirut.

Svartvik, J. (1992). “Corpus Linguistics comes of age”. In Directions in Corpus Linguistics,
(ed.) Svartvik J. Mouton de Gruyter, Berlin, New York, pp. 7-13.
Thacalibi, Abd al-Malik ibn Muhammad, (b. 961 or 2-1037 or 8). fiqh al-lughah wa-sirr al-
carabiyyah (The Philology of the Arabic Language and Its Secrets) (vol. 1).
Maktabat Al-Khanjiy, Cairo.
The Nijmegen Dutch-Arabic/Arabic-Dutch Dictionaries (2003). Bulaaq, Amsterdam.
Tognini Bonelli, E. (2000). “Lexis in contrast”. In Studies in Corpus Linguistics, (eds.)
Sylviane Granger and Bengt Altenberg. Benjamins, Amsterdam and Philadelphia.
Ullmann, S. (1962). Semantics: An Introduction to the Science of Meaning. Basil
Blackwell, Oxford.
Van der Wouden, T. (1997). Negative Contexts: Collocation, Polarity and Multiple Negation.
Routledge, London and New York.
Van Mol, M. (2002). “The semi-automatic tagging of Arabic corpora”. In Arabic Language
Resources and Evaluation-Status and Prospects. A workshop held in LREC, 3rd
international conference on language resources and evaluation. Las Palmas, Spain.
Versteegh, K. (1997). Landmarks in Linguistic Thought, the Arabic Linguistic Tradition.
Routledge, London.
Watt, R. (2001). Concordance Program, Version 3.0, personal product.
Wehr, H. (1980). A Dictionary of Modern Written Arabic. Macdonald and Evans Ltd,
London.
Whitaker, B. (2002). “Lost in translation”. The Guardian (UK), Monday June 10, 2002.
Wittgenstein, L. (1953). Philosophical Investigations. Blackwell, Oxford.
