Telematics and Informatics 22 (2005) 161-180

www.elsevier.com/locate/tele

Multiple determinants of life quality: the roles of Internet activities, use of new media, social support, and leisure activities
Louis Leung *, Paul S.N. Lee
School of Journalism & Communication, The Chinese University of Hong Kong, Shatin, N.T.,
Hong Kong, Hong Kong
Received 18 February 2004; received in revised form 30 March 2004; accepted 13 April 2004

Abstract
The quest for quality of life (QoL) is a growing concern for individuals and communities
seeking to find sustainable life satisfaction in a technologically changing world. Industry,
consumer groups, academics, and policy makers have sought to better understand how the
Internet contributes to or detracts from society. This study examined the effects of Internet
activities, new media use, social support, and leisure activities on perceived quality of life.
Correlational results showed that Internet activities, such as using the Internet for sociability,
fun seeking and information seeking, and new media use, correlate positively with various
dimensions of social support. However, use of the Internet, especially for sociability, and
computer use were inversely linked to QoL. Furthermore, hierarchical regression analysis
revealed that affectionate, positive social interaction, and emotional and informational social
support, received from either online or offline sources, are the strongest determinants of
quality of life. More important, QoL can also be enhanced if suitable amounts of time are
spent on media-related activities, namely, less time on using the Internet for intimate self-disclosure and on playing computer games, and more time on listening to music on CD/MD/MP3.
Finally, participating in community or religious activities for leisure was also a significant predictor of QoL. Implications regarding policy formulation to improve life quality are
discussed.
© 2004 Elsevier Ltd. All rights reserved.
Keywords: Internet; Quality of life; Social support; Leisure activities

The work described in this paper was fully supported by a grant from the Research Grant Council of
the Hong Kong Special Administrative Region (project no. CUHK 4315/01H).
* Corresponding author. Tel.: +852-26097703; fax: +852-26035007.
E-mail address: louisleung@cuhk.edu.hk (L. Leung).
0736-5853/$ - see front matter © 2004 Elsevier Ltd. All rights reserved.
doi:10.1016/j.tele.2004.04.003

L. Leung, P.S.N. Lee / Telematics and Informatics 22 (2005) 161-180

1. Introduction
Over the past decade, the Internet has changed the way people work, play, learn,
and communicate. Today, there is scarcely an aspect of our life that is not being
affected by the torrent of information available on the hundreds of millions of sites
crowding the Internet, not to mention its ability to keep us in constant touch with
each other via electronic mail (Henderson, 2001). The Internet adds a new entry to
the list of older mechanisms such as the telephone, postal mail, TV, radio, and
newspaper, all of which import communication and information into the household.
In fact, many view the growth of the Internet and e-commerce as a global megatrend
along the lines of the printing press, the telephone, the computer, and electricity.
Since these relatively recent developments, technology in much of the world has just
about taken over our lives.
The quest for quality of life is a growing concern for individuals and communities
seeking to find sustainable life satisfaction in a technologically changing world
(Mercer, 1994). Globalization and rapid advances in information technology offer us
vast, unprecedented opportunities to improve life quality. Yet, this opportunity may
also be burdened with undesirable consequences. With the Internet, people, living
in the most plugged-in and mechanized society in history, may be working harder
than ever. Rather than creating time for leisure, our technology is creating ways that
make it possible to undertake more work at home. Cellular phones, palmtops, and
Internet access devices may be making it virtually impossible to escape our jobs.
Technology may diminish our leisure time, not increase it (Anderson and Tracey,
2001). Does using the Internet make people happier or unhappier? Is the Internet
empowering, to which specific groups of people, and under what circumstances?
Does virtual community erode face-to-face community? These are some of the key
questions social scientists are exploring today.
Previous research assessing life quality has included selected attributes such as
access to leisure activities, amount of non-work time, telework, and use of new
media technology (Kernan and Unger, 1987; Leung, 2004; Moller, 1992; Wei and
Leung, 1998), among others. However, little research has been carried out to further
explore the potential relationship between the Internet and QoL. For the time being,
both theoretical and empirical research on the impact of the Internet is still in
its infancy. This study examines the possible influence of the Internet, with particular emphasis on the roles of Internet activities, use of new media, social support,
and leisure activities in quality of life.

2. Theoretical frameworks
2.1. Quality of life
Quality of life, a cognitive judgmental process, is defined as a global assessment
of a person's life satisfaction according to his chosen criteria (Shin and Johnson,
1978). Diener (1984) suggested that the judgment of how satisfied people are with
their present state of affairs is based on a comparison with a standard, which each
individual sets for himself or herself. It is not externally imposed. Although many
people see wealth, health, employment, leisure, personal life, and fame as desirable,
different individuals may place different values on them. As defined by Argyle (1987),
"[the] meaning of happiness is a state of joy or positive emotion; or the satisfaction
with life as a whole, or with work, leisure, and other parts of it." Therefore, quality
of life is a measure of overall life satisfaction, rather than a summation of life satisfaction across specific domains.
In reviewing the quality of life literature, two constructs have been used to explain
the determinants of life satisfaction or quality of life: subjective and objective perspectives (Diener, 1984). The subjective construct hypothesizes that perceived quality
of life is influenced by personality or dispositional factors (e.g., optimism, pessimism,
isolation, self-worth, and neuroticism). On the other hand, the objective construct
proposes that life quality is affected by environmental or situational factors (e.g.,
family, job, leisure, neighborhood, community, and satisfaction with standard of
living). According to the objective determinants of life quality, people's quality of life
tends to be a direct function of their evaluations of important life domains such as
social support, leisure activities, and standard of living of overall life (e.g., Andrew,
1986; Andrews and Withey, 1976; Diener, 1984). Satisfaction or dissatisfaction with
standard of living is likely to spill over to influence subjective well-being. Therefore,
the greater the satisfaction with one's standard of living, the greater the satisfaction
with life and vice versa. Here, standard of living usually means being materially
better off than a typical family (Andrews and Withey, 1976; Diener, 1984; Prenshaw, 1994).
To maintain or to have a high standard of living, technologies and innovations have always played a major role in the past (McPheat, 1996). Household
technologies introduced around the middle of the last century, such as television,
refrigerators, air-conditioners, vacuum cleaners, and clothes dryers, are permanently embedded in society. Even more taken-for-granted are changes in workplace technology such as the use of mobile phones, faxes, and e-mails. The
impact of the Internet on society as a whole has been debated continuously since
its widespread use in the 1990s. Industry, consumer groups, academics, and
policy makers have sought to better understand how the Internet contributes to
or detracts from society. Communications media are so fundamental to society
that new media forms have the capacity to reshape our work, leisure, lifestyle,
social relationships, national and cultural groups and identities in ways that are
difficult but important to predict. As the Internet continues to expand its technological capabilities and global penetration, one of the most pressing questions
is: Does the Internet have a positive or negative effect on life quality? As
shown in Fig. 1, this study examines, from an objective perspective, the impact
of social support, leisure activities, and standard of living (as supported and
maintained by the use of information technologies such as the Internet) on
quality of life.


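[Figure 1: path diagram. Internet Activities, Use of New Media, and Leisure Activities are each linked to Social Support (emotional & informational, affectionate, positive social interaction) through H2.1, H4.1, and H3.1, and to Quality of Life through H2.2, H4.2, and H3.2. Social Support is linked to Quality of Life through H1. Demographics is also included as an influence on Quality of Life.]

Fig. 1. An objective model explaining quality of life.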

2.2. Social support


In a review of social indicators research, Cobb (1976) gave social support its first
definition, as information leading the subject to believe that he or she is cared for
and loved, that he or she is esteemed and valued, and that he or she belongs to a network of
communication and mutual obligation. Other scholars defined social support as
interpersonal transactions involving affect, affirmation, aid, encouragement, and
validation of feelings (Abbey, 1993; Kahn and Antonucci, 1980). House (1986)
gave a third definition: social support involves the flow between people of
emotional concern, instrumental aid, information, or appraisal.
Existing measures of social support are rather varied because of the different
definitions of social support and the lack of a clear conceptualization of the construct
(Cohen and Syme, 1985; Donald and Ware, 1984). However, recent research has
generally attempted to measure the functional components of social support because
functional support is the most important and can take various forms: (1)
emotional support, which involves caring, love, and sympathy; (2) instrumental support, which provides material aid or behavioral assistance and is referred to by many as
tangible support; (3) information support, which offers guidance, advice, information, or
feedback that can provide a solution to a problem; (4) affectionate support, which involves
expressions of love and affection; and (5) social companionship (also called positive
social interaction), which involves spending time with others in leisure and recreational activities (Sherbourne and Stewart, 1991).
Although the Internet has become an important resource for information and
entertainment, little is known about the ways in which individuals use this technology for social support. In what way do communication technologies play a role in
influencing mediated social support and, in turn, relate to a variety of outcomes in
life quality? Recent research found that Internet-based support groups, including
newsgroups, message boards, and listservs for specific medical conditions, have
been successful in improving some intermediate patient outcomes in clinical trials
involving Alzheimer's caregivers (Brennan et al., 1995; Gallienne et al., 1993) and in
patients with AIDS (Brennan et al., 1991). These studies have demonstrated that the
use of a computer-based communication system reduced self-reported isolation in an
AIDS trial and led to greater perceived confidence in the ability to care for family
members in the Alzheimer's caregivers study. Internet-based peer support groups
for depression have also been found to provide information and support:
heavy users of the Internet groups were more likely to have resolution of depression
during follow-up than less frequent users (Houston et al., 2002). Similarly, in
addition to research focused on the impact of the Internet on disabled people, past
research has also investigated social support in the computer-mediated environment for
able-bodied people and found that older adult Internet users reported higher satisfaction with Internet providers of social support, and that greater involvement with an
online community was predictive of lower perceived life stress (Wright, 2000).
It is impossible to consider all the variables from the subjective and objective
perspectives in assessing quality of life for any individual. The list of possible indicators is endless. One solution is to see which of them increases the objective quality
of life within the domain of mediated social impact by information technologies.
Furthermore, building from Putnam's (1995) conceptual links between quality of
life, community involvement, and social capital, further research has demonstrated
that frequent and increasing use of the community computer network and the Internet significantly influences social capital formation (Kavanagh and Patterson,
2001). Therefore, we expect that:
H1: Social support is positively associated with QoL.
H2.1: Internet activities (especially for sociability) are positively associated with
social support.
H2.2: Internet activities (especially for sociability) are positively associated with
QoL.
2.3. Leisure activities
As reviewed earlier, one important objective determinant of life quality is leisure
activities. In studying leisure, scholars like to ask whether place-centered leisure
activities, which take place in urban parks, or sporting and entertainment venues,
contribute more to a person's self-reported quality of life or whether QoL is primarily influenced by people-centered factors such as social interaction, sense of
achievement, and level of satisfaction with one's leisure lifestyle. Social interaction is
a central component of leisure activities (Auld and Case, 1997) and the most positive
experiences people report are usually those with friends (Csikszentmihalyi, 1997). In
a study that examined the relative importance of selected place and people-centered
leisure attributes in predicting quality of life, Lloyd and Auld (2001) found that the
people-centered leisure activities were the best predictor of quality of life. In particular, the domain of social support from family, friends, and marriage has the most
effect on life quality and social leisure activity has the most positive influence on QoL
for a diverse range of social groups (Siegenthaler and Vaughan, 1998). Moreover,
previous research has demonstrated a positive relationship between engaging in
leisure activities such as sports (Wankel and Berger, 1990) and fitness exercises
(Dowall et al., 1988) and improved life quality. Foong (1992) explained that these
significant relationships are due to the salutary consequences of social interaction
with other people resulting from engagement in active leisure. This study uses
both people-centered and place-centered indicators to assess leisure activities. As
a result, we hypothesize that:
H3.1: Leisure activities are positively associated with social support.
H3.2: Leisure activities are positively associated with quality of life.
2.4. Impact of the Internet and new media
Extensive qualitative and quantitative evidence also supports the Internet's
potential: home Internet access enabled the informationally disadvantaged or
low-income families to experience powerful emotional and psychological transformations in identity (self-perception), self-esteem, personal empowerment, a new
sense of confidence, and social standing or development of personal relationships on
the Internet (Anderson and Tracey, 2001; Bier and Gallo, 1997; Henderson, 2001).
The appropriate use of computers, mobile phones, online newspapers, online
forums, and the like can help to promote self-sufficiency, psychological empowerment, lifelong learning, and rehabilitation (Bier and Gallo, 1997; Hu and Leung, 2003;
Wellman and Haythornthwaite, 2002). Wright (2000) found that greater involvement
with the online community was predictive of lower perceived life stress for older
adults. A trend toward decreased loneliness and improved psychological well-being
among older adults was observed when e-mail and Internet access was provided
(White et al., 1999). Based on these findings and the theoretical frameworks reviewed, we propose two additional hypotheses and ask one research question:
H4.1: Use of new media technology is positively associated with social support.
H4.2: Use of new media technology is positively associated with QoL.
RQ: To what extent can Internet activities, use of new media, and traditional media
use affect quality of life when other influences, such as social support, leisure activities, and demographics, are considered simultaneously for Internet users?

3. Method
3.1. Sample and sampling procedures
Data were gathered from a probability sample of 1192 respondents, using a face-to-face structured questionnaire interview during October-December
2002. Respondents were eligible members of randomly generated households from
the Census and Statistics Department in Hong Kong. If there was more than one
eligible respondent living in the household, the person who was between the ages of
15 and 64 and had had the most recent birthday was interviewed. Interviewers were
trained university students. A total of 238 households were discarded when interviewers found them to be vacant, in non-residential use, or ineligible; when there was no response after more than three visits; or when the respondents simply refused.
Of the 954 qualified households, 696 successfully completed the questionnaires, resulting in a 73% response rate.
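The response rate follows directly from the counts reported above. As an illustrative sketch only (the function and argument names are hypothetical and not part of the original study), the arithmetic can be reproduced as follows:

```python
def response_rate(sampled: int, discarded: int, completed: int) -> float:
    """Completed interviews as a fraction of qualified households."""
    qualified = sampled - discarded  # 1192 - 238 = 954 qualified households
    if qualified <= 0:
        raise ValueError("no qualified households remain")
    return completed / qualified


# Counts taken from the sampling description in the text.
rate = response_rate(sampled=1192, discarded=238, completed=696)
print(f"{rate:.0%}")  # prints 73%
```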
The sample consisted of 46.7% males and 53.3% females. The mean age was 36.8
with 30.3% who were in the 35-44 age group, 21.6% in 25-34, 20% in 15-24, 19.7% in
45-54, and 8.5% in 55-64. This age distribution very closely resembled the 2001
population census in Hong Kong. Of the 696 respondents, 41.9% were high school
graduates, 24% college graduates, 19.5% had completed junior high, and 13.4% only
had grade school education. In terms of income, the mean was at the income bracket
of US$2565-$3205 a month, with 16.9% earning less than US$1282 a month, 21%
between US$1282 and $1923, 13.6% between US$1924 and $2564, 12.8% between
US$2565 and $3205, 17.6% between US$3206 and $5128, 9.9% between US$5129
and $7692, and 8.3% more than US$7692 a month. Over 38% were managers,
administrators, professionals, or associate professionals, 19.4% clerks, 14.3% service
or sales workers, 10.8% craft and related workers, 9.8% had elementary occupations,
and about 5% were plant and machine operators and assemblers.
3.2. Measurements
Quality of life. To measure quality of life, the Satisfaction with Life Scale (SWLS)
developed by Diener et al. (1985) was employed. With good internal consistency and
high reliability, SWLS is narrowly focused to assess global life satisfaction and does
not tap related constructs such as positive affect or loneliness. Respondents were
asked about their agreement with five items on a 5-point scale with
1 = "strongly disagree", 2 = "disagree", 3 = "ordinary", 4 = "agree", and 5 =
"strongly agree". The five items include: (1) in most ways my life is close to my ideal;
(2) the conditions of my life are excellent; (3) I am satisfied with my life; (4) so far I
have gotten the important things I want in life; and (5) if I could live my life over, I
would change almost nothing. Reliability alpha was high at 0.83.
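The reliability figure reported here is presumably Cronbach's alpha, the statistic named in Tables 1 and 2. A minimal sketch of the standard formula, alpha = k/(k-1) x (1 - sum of item variances / variance of the summed score), is given below. The response matrix is made-up toy data, not the survey data, and the function name is hypothetical.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for rows of item scores (one row per respondent)."""
    k = len(responses[0])  # number of items in the scale
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Sample variance of each item across respondents.
    item_var_sum = sum(variance(col) for col in zip(*responses))
    # Sample variance of each respondent's summed scale score.
    total_var = variance([sum(row) for row in responses])
    if total_var == 0:
        raise ValueError("summed scores have zero variance")
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Toy 5-item responses on a 1-5 scale (illustrative only).
toy = [
    [4, 4, 5, 3, 4],
    [2, 3, 2, 2, 3],
    [5, 4, 4, 5, 5],
    [3, 3, 3, 2, 3],
]
print(round(cronbach_alpha(toy), 2))
```

Values closer to 1 indicate that the items vary together, which is what a figure such as 0.83 is taken to mean for the five SWLS items.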
Social support. To assess social support, a battery of 19 items within four subscales developed by the Rand and Medical Outcomes Study (MOS) teams was
adopted with slight modification. The five original dimensions of social support were
reduced to four because items from emotional support and informational support were
highly correlated and overlapped considerably; these two dimensions were therefore merged into one. As a result, the four subscales were "tangible",
"affectionate", "positive social interaction", and "emotional or informational"
support. It was recommended that the subscale scores rather than the total score be
used (McDowell and Newell, 1996). Moreover, items from the tangible support
subscale were excluded because tangible support refers mostly to medical or health-related assistance from friends or close relatives rather than to affective or
emotional support. Respondents were asked how often each of the support items,
measured in the remaining three dimensions, is available to them if they need it,
either from the online or offline world. A 5-point scale was used, with
1 = "none of the time", 2 = "a little of the time", 3 = "some of the time", 4 = "most
of the time", and 5 = "all of the time". Principal components factor analysis (Table
1) extracted three factors and explained 71.8% of the variance. The three factors were
"emotional and informational support" (alpha = 0.86), "positive social interaction" (alpha = 0.83), and "affectionate" (alpha = 0.84).

Table 1
Factor analysis of social support
(How often is each of the following kinds of support available to you if you need it?)

Item                                                Mean   SD     F1     F2     F3
Emotional and informational
 1. Someone whose advice you really want            3.58   0.83   0.77
 2. Someone to give you good advice about a crisis  3.54   0.89   0.77
 3. Someone to give you information to help you     3.57   0.86   0.71
    understand a situation
 4. Someone to turn to for suggestions about        3.47   0.95   0.61
    how to deal with a personal problem
Positive social interaction
 5. Someone to get together with for relaxation     3.63   0.84          0.80
 6. Someone to do something enjoyable with          3.56   0.86          0.78
 7. Someone to do things with to help you get       3.35   0.90          0.67
    your mind off things
Affectionate
 8. Someone who shows you love and affection        3.69   0.87                 0.86
 9. Someone to love and make you feel wanted        3.61   0.91                 0.69
10. Someone who comforts you sincerely (hugs you)   3.64   0.90
11. Someone you can count on to listen to you       3.72   0.84
    when you need to talk
Eigenvalue                                                        6.41   0.80   0.69
Variance explained (%)                                            58.27  7.27   6.26
Cronbach's alpha                                                  0.86   0.83   0.84

F1 = emotional and informational; F2 = positive social interaction; F3 = affectionate.
Scale used: 1 = none of the time, 2 = a little of the time, 3 = some of the time, 4 = most of the time, and
5 = all of the time; N = 388.
Five further loadings appear in the source (0.41, 0.44, 0.42, 0.65, and 0.53), covering at least items 10 and 11; their row and column positions are not recoverable.
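The link between the eigenvalues and the percentages of variance explained in Table 1 can be checked with a short calculation. The sketch below assumes the usual convention for principal components extracted from a correlation matrix, in which each of the 11 standardized items contributes one unit of variance, so that a factor's share is its eigenvalue divided by the number of items. The function name is hypothetical.

```python
def variance_explained(eigenvalues, n_items):
    """Percentage of total variance carried by each factor."""
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    return [100.0 * ev / n_items for ev in eigenvalues]


# Eigenvalues of the three social support factors reported in Table 1.
shares = variance_explained([6.41, 0.80, 0.69], n_items=11)
print([round(s, 2) for s in shares])  # [58.27, 7.27, 6.27]
print(round(sum(shares), 1))          # 71.8
```

The third share comes out at 6.27 rather than the reported 6.26, presumably because the published eigenvalues are themselves rounded; the total still matches the 71.8% given in the text.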
Leisure activities. Respondents were asked how often they engage in five popular
people-centered and place-centered leisure activities in Hong Kong including: talking
to family and friends face-to-face for more than 10 min, playing mahjong, participating in community or religious activities, physical exercise, and window shopping.
A 5-point scale was used with 1 = "never", 2 = "seldom", 3 = "sometimes",
4 = "quite often", and 5 = "very often".
Internet activities. Respondents were asked how often they engage in the following Internet activities: learning from the Internet, searching for information, reading news
online, listening to music, playing games, surfing for leisure and entertainment,
purchasing, using services on the Internet (such as paying bills, account transfer,
booking tickets, etc.), communicating with somebody they did not know before,
communicating with somebody they knew before, and talking about aspects of their
inner world to other people. A 5-point Likert scale was used with 1 =
"never", 2 = "seldom", 3 = "sometimes", 4 = "often", and 5 = "very often". After
excluding two items, principal components factor analysis with Varimax rotation
yielded four factors with eigenvalues greater than 1.0, explaining 64.37% of the
variance. As shown in Table 2, these factors are fun seeking, sociability, information
seeking, and e-commerce, with alpha values of 0.71, 0.70, 0.58, and 0.62,
respectively.

Table 2
Factor analysis of Internet activities (Internet users only; N = 387)
(How often do you use the following Internet services?)

Item                                                Mean   SD     F1     F2     F3     F4
Fun seeking
 1. Playing games on the Internet                   2.59   1.26   0.79
 2. Listening to music on the Internet              2.93   1.22   0.77
 3. Surfing for leisure and entertainment           3.36   0.98   0.69
Sociability
 4. Talk about things of your inner world to        2.05   1.02          0.78
    other people on the Internet
 5. Communicate with somebody you knew before       3.42   1.08          0.77
    on the Internet
 6. Communicate with somebody you did not know      2.16   1.09
    before on the Internet
Information seeking
 7. Searching information on the Internet           3.70   0.86                 0.79
 8. Reading news on the Internet                    3.07   0.93                 0.69
 9. Learning from the Internet                      2.82   0.90                 0.67
E-commerce
10. To get service on the Internet                  2.81   0.62                        0.86
    (e.g., paying bills)
11. Purchasing on the Internet                      2.74   0.64                        0.83
Eigenvalue                                                        3.32   1.56   1.17   1.04
Variance explained (%)                                            30.16  14.13  10.66  9.42
Cronbach's alpha                                                  0.71   0.70   0.58   0.62

F1 = fun seeking; F2 = sociability; F3 = information seeking; F4 = e-commerce.
Scale used: 1 = never, 2 = seldom, 3 = sometimes, 4 = often, and 5 = very often.
Two further loadings appear in the source (0.51 and 0.54), probably for item 6; their column positions are not recoverable.
New media use. Respondents were asked how much time they spent on the eight
most popular new media technologies in their leisure time: Internet use,
computer use, ICQ, e-mail, and talking on the phone (in minutes per day), and playing
computer games, listening to CD, MD, or MP3, and watching VCD and DVD (in
minutes per week).
Traditional media use. Four traditional mass media variables were included in the
analyses: printed newspaper reading, TV watching, magazine reading, and radio
listening. Respondents were asked to report the average time spent on these media
on a normal day. Newspaper reading, TV watching, and radio listening were measured
in minutes per day, while magazine reading was measured in minutes per week.
4. Results
4.1. Hypotheses testing
H1 predicted that social support is positively associated with quality of life. As
expected, correlation results in Table 3 showed that the emotional and informational
(r = 0.36, p < 0.001), positive social interaction (r = 0.40, p < 0.001), and affectionate (r = 0.47, p < 0.001) dimensions of social support were significantly correlated with quality of life. This suggests that people who have strong social support
available when they need it, such as affirmation, aid, encouragement, information,
affect, and validation of their feelings, are those who enjoy a high quality of life.
Thus, H1 received strong support from the data.

Table 3
Correlation analyses of all criterion variables and social support and quality of life

                                                Social support                                        Quality of
                                                Emotional and     Positive social   Affectionate      life (QoL)
                                                informational     interaction
Internet activities
  Fun seeking                                   0.11*             0.16**            n.s.              n.s.
  Sociability                                   0.11*             n.s.              n.s.              -0.16**
  Information seeking                           0.11*             n.s.              0.12*             n.s.
  E-commerce                                    n.s.              n.s.              0.11*             n.s.
New media use
  Internet use (min/day)                        n.s.              n.s.              n.s.              -0.16**
  Computer use (min/day)                        n.s.              n.s.              n.s.              -0.13*
  ICQ                                           n.s.              n.s.              n.s.              n.s.
  E-mail                                        n.s.              n.s.              n.s.              n.s.
  Talking on the phone                          0.14***           0.14**            n.s.              n.s.
  Playing computer games                        0.11**            0.14**            n.s.              -0.08#
  Listening to CD/MD/MP3                        0.15***           0.23***           0.21***           0.18**
  Watching VCD/DVD/LD                           n.s.              n.s.              n.s.              n.s.
Social support
  Emotional and informational                                                                         0.36***
  Positive social interaction                                                                         0.40***
  Affectionate                                                                                        0.47***
People-centered leisure activities
  Talking with family or friends face to face   0.24***           0.27***           0.33***           0.25***
  Playing mahjong                               n.s.              0.11*             n.s.              n.s.
  Participating in community or                 0.14***           n.s.              n.s.              0.13**
    religious activities
Place-centered leisure activities
  Physical exercise                             0.10**            n.s.              n.s.              0.11*
  Window shopping                               0.11**            0.22***           0.11*             n.s.

Notes: # p < 0.1; * p < 0.05; ** p < 0.01; *** p < 0.001; n.s. = not significant; N = 388.
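The coefficients reported in this section are bivariate correlations. As a hedged illustration of how such a coefficient is obtained, the sketch below computes a Pearson product-moment correlation from two equal-length lists of scores. It assumes Pearson's r is the statistic behind Table 3, which the text does not state explicitly; the data and the function name are invented for the example.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / sqrt(sxx * syy)


# Hypothetical scores: a social support subscale and life satisfaction.
support = [3.0, 4.5, 2.5, 4.0, 3.5, 5.0]
life_sat = [2.8, 4.0, 3.0, 3.6, 3.2, 4.4]
print(round(pearson_r(support, life_sat), 2))
```

A positive value indicates that respondents scoring higher on one measure tend to score higher on the other, which is the pattern H1 predicts for social support and quality of life.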
H2.1 predicted that Internet activities are positively associated with social support. Among
the four main categories of Internet activities, results in Table 3 indicated that fun
seeking, sociability, and information seeking were significantly related to the emotional and informational dimension of social support (each with r = 0.11, p < 0.05).
This means that people who often receive advice in crises in the real world are those
who are active on the Internet talking about aspects of their inner world with friends
and strangers, relying heavily on the Internet for advice and information to help
them understand their personal problems, playing games, listening to music, and
surfing the web. Secondly, fun seeking was also significantly related to the positive
social interaction dimension of social support (r = 0.16, p < 0.01). This indicates
that people who enjoy a large social network for interaction and relaxation offline
are those who are active game players and fun seekers on the Internet. Thirdly, bivariate relationships between information seeking and e-commerce and the affectionate dimension of social support were also significant (r = 0.12, p < 0.05 and
r = 0.11, p < 0.05, respectively). This suggests that people who have a large circle of
friends providing them with love and affection in the offline world are also those who
are active on the Internet seeking information, advice, and receiving support.
Therefore, H2.1 is largely supported.
Contrary to what H2.2 hypothesized, that Internet activities are positively associated with life quality, results in Table 3 showed that sociability was negatively
linked to quality of life (r = -0.16, p < 0.01). Such a finding indicates that people
who spend a lot of time disclosing their inner world to others on the Internet are
those with a lower assessment of their overall life quality. One possible explanation is that when people spend a lot of time talking about their personal feelings
online, this may take away time from more valuable activities offline, including social
contact, sleep, leisure activities, or reading books. Therefore, H2.2 is not supported.
H3.1 proposed that leisure activities would influence social support. Results in
Table 3 showed that emotional and informational social support is significantly
related to people-centered leisure activities such as talking with family and friends
face-to-face (r = 0.24, p < 0.001) and participating in community or religious
activities (r = 0.14, p < 0.001). This means that, at times of crisis, people tend to
obtain information and advice by engaging in face-to-face conversation with other
people and/or by actively involving themselves in religious or community activities. In addition,
people also find informational and emotional social support through place-centered
leisure activities, e.g., physical exercise (r = 0.10, p < 0.01) and window shopping
(r = 0.11, p < 0.01). Similarly, when people get together for relaxation and fun for
positive social interaction, they tend to talk with family and friends (r = 0.27,
p < 0.001), play mahjong (r = 0.11, p < 0.05), and go window shopping with
friends (r = 0.22, p < 0.001), a popular place-centered leisure activity in Hong
Kong. As expected, people who receive a lot of affection and love for social support
are those who often engage actively in people-centered, face-to-face chat with family
and friends (r = 0.33, p < 0.001) and place-centered window shopping (r = 0.11,
p < 0.05). Thus, H3.1 is largely supported.
H3.2 proposed that leisure activities would influence quality of life. As anticipated,
talking with family and friends face-to-face (r = 0.25, p < 0.001) and participating in
community or religious activities (r = 0.13, p < 0.01), both people-centered leisure
activities, are significantly linked to quality of life. Furthermore, physical exercise or
sports (r = 0.11, p < 0.05), a place-centered leisure activity, was also found to be significantly linked to QoL. Hence, H3.2 received strong support.
H4.1 predicted that use of new media technology is positively associated with
social support. Correlational results in Table 3 showed that, of the eight new media
technologies commonly used in daily life, emotional and informational social support was significantly linked to talking on the phone (r = 0.14, p < 0.001), playing
computer games (r = 0.11, p < 0.01), and listening to CD/MD/MP3 (r = 0.15,
p < 0.001). This shows that people who can receive advice and information about a
crisis when they need it are those who tend to spend a lot of time on the phone
seeking counsel, guidance, or encouragement; others receive emotional and informational social support through computer gaming and listening to or sharing music
with online/offline friends. Similarly, the positive social interaction dimension of social support
was also significantly related to talking on the phone (r = 0.14, p < 0.01), playing
computer games (r = 0.14, p < 0.01), and listening to CD/MD/MP3 (r = 0.23,
p < 0.001). These findings indicate that people who receive social support by getting
together and doing something enjoyable with friends are those who often like to talk
on the phone, play games on computers, and listen to CD/MD/MP3. Finally,
people who receive a lot of love, affection, and hugs in real life are those who also
listen to music online regularly to release social pressure (r = 0.21, p < 0.001). As a
result, H4.1 received partial support.
H4.2 predicted that use of new media technologies is positively associated with
quality of life. Results showed that only three of the eight new technologies
were significantly linked to QoL. Surprisingly, use of the Internet and use of the computer were
negatively related to QoL (r = -0.16, p < 0.01 and r = -0.13, p < 0.05, respectively), while watching VCD/DVD/LD for entertainment and life quality were
positively linked (r = 0.18, p < 0.01). Therefore, H4.2 received little support.
4.2. Predicting quality of life
Finally, to compare the relative influence of Internet activities, use of new media,
and traditional media use on quality of life when other factors, such as social support, leisure activities, and demographics, are considered simultaneously for Internet
users, a hierarchical regression analysis was run. Results in Table 4 show that
sociability was a significant predictor (β = -0.11, p < 0.05) in the Internet
activities block. The negative coefficient indicates that the people who
spend more time communicating their inner thoughts to other people on the Internet
are those who tend to have a lower level of life quality. The first block accounted for
2% of the variance.

Table 4
Stepwise regression of Internet activities, new media use, traditional media use, social support, leisure
activities, and demographics on quality of life (QoL)

Predictor variables                                        β           ΔR²

Block 1: Internet activities                                           0.02
  Fun seeking                                              n.s.
  Sociability                                              -0.11*
  Information seeking                                      n.s.
  E-commerce                                               n.s.

Block 2: New media use                                                 0.08
  Internet use (min/day)                                   n.s.
  Computer use (min/day)                                   n.s.
  Talking on the phone                                     n.s.
  Playing computer game                                    -0.14**
  Listening to CD/MD/MP3                                   0.18**
  Watching VCD/DVD/LD                                      n.s.

Block 3: Traditional media use                                         0.00
  TV watching                                              n.s.
  Newspaper reading                                        n.s.
  Magazine reading                                         n.s.
  Radio listening                                          n.s.

Block 4: Social support                                                0.20
  Emotional and informational                              0.13**
  Positive social interaction                              0.23***
  Affectionate                                             0.35***

Block 5: Leisure activities                                            0.01
  Talking with family or friends face to face              n.s.
  Playing mahjong                                          n.s.
  Participating in community or religious activities       0.08*
  Physical exercise                                        n.s.
  Window shopping                                          n.s.

Block 6: Demographics                                                  0.03
  Gender (female = 1)                                      n.s.
  Age                                                      0.16**
  Education                                                n.s.
  Monthly household income                                 0.11*

R²                                                                     0.36
Final adjusted R²                                                      0.34

Notes: Figures are standardized beta coefficients from the final regression equation with all blocks of variables
included for the entire sample. n.s. = not significant.
* p < 0.05; ** p < 0.01; *** p < 0.001 (a p < 0.1 level is also defined but flags no coefficient here); N = 388.

Use of new media technologies was entered into the equation next. Results
showed that playing computer games (β = -0.14, p < 0.01) and listening to CD/MD/MP3
(β = 0.18, p < 0.001) were the only two significant predictors. The negative link between playing computer games and QoL suggests that the violent nature of
most computer games may have led people to view computer games as a negative force
affecting their self-evaluation of life quality. Quality of life was also predicted by
heavy listening to CD/MD/MP3. This indicates the effect of a wide range of music on
users' well-being. These two variables contributed 8% of the variance. However,
traditional mass media had no significant impact on life quality.
The three dimensions assessing social support were the next entries in the equation. The affectionate (β = 0.35, p < 0.001), positive social interaction (β = 0.23,
p < 0.001), and emotional and informational (β = 0.13, p < 0.01) dimensions contributed significantly to the regression equation and explained a total of 20% of the
variance. Five variables from the leisure activities block were entered next. Participating in community or religious activities (β = 0.08, p < 0.05) was a significant
predictor that accounted for another 1% of the variance.
Demographic predictors were entered last, and it was found that age (β = 0.16,
p < 0.01) and monthly household income (β = 0.11, p < 0.05) were significant. The
equation explained 34% of the variance in total, with the first three blocks of media-related predictors contributing a significant proportion of 10%. This suggests that
while social support dimensions were the strongest predictors, appropriate use of the
Internet and new media technologies does have an impact on quality of life.
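The block-by-block procedure described above can be sketched in a few lines of code. The example below is a minimal illustration, not the authors' analysis: it fits ordinary least squares models on cumulatively larger sets of predictors and reports the increment in R² as each block is entered. The data are synthetic, and the helper names (`solve`, `r_squared`, `hierarchical_r2`) and the two-block setup are assumptions made for this illustration.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(columns, y):
    """R^2 of an OLS fit of y on the given predictor columns plus an intercept."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    xtx = [[sum(row[p] * row[q] for row in rows) for q in range(k)] for p in range(k)]
    xty = [sum(rows[i][p] * y[i] for i in range(n)) for p in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def hierarchical_r2(blocks, y):
    """Enter blocks cumulatively; return (name, cumulative R^2, delta R^2) per step."""
    entered, previous, steps = [], 0.0, []
    for name, cols in blocks:
        entered.extend(cols)
        r2 = r_squared(entered, y)
        steps.append((name, r2, r2 - previous))
        previous = r2
    return steps


# Synthetic data only: a weak negative media effect and a stronger support effect.
random.seed(0)
n = 388  # matches the N under Table 4; every value below is simulated
media = [random.gauss(0, 1) for _ in range(n)]
support = [random.gauss(0, 1) for _ in range(n)]
qol = [-0.1 * m + 0.5 * s + random.gauss(0, 1) for m, s in zip(media, support)]

steps = hierarchical_r2([("media use", [media]), ("social support", [support])], qol)
for name, r2, delta in steps:
    print(f"{name}: cumulative R2 = {r2:.3f}, delta R2 = {delta:.3f}")
```

Because each model nests the previous one, the increments are non-negative and sum to the final R². The size of each increment depends on entry order, so a block entered early is credited with any variance it shares with later blocks.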

5. Discussion
5.1. Social support and QoL
This study has shown that people with strong social support, such as affirmation,
aid, encouragement, and affect, available to them when they need it, either from
the online or offline world, reported a higher quality of life. This finding means that
receiving support from strong ties increases life quality. This is consistent with past
research in which lower levels of perceived social support, satisfaction with social contacts, and participation in social activities were all found to be related to poorer
psychological well-being or life quality (House, 1986). Conversely, when people have
high levels of emotional support, mediated entirely by the perception that they have
someone to call on when they need to, they expect to live longer (Ross and Mirowsky, 2002). Furthermore, hierarchical regression results confirmed that the affectionate, positive social interaction, and emotional and informational dimensions of
social support were all significant predictors of QoL and explained the majority of
the variance.
5.2. Internet activities and social support
Internet activities, such as using the Internet for sociability, fun seeking, and
information seeking, were found to be positively related to various dimensions of
social support. This implies that people who communicate their inner world with
friends and strangers online and rely heavily on the Internet for advice and information to help them understand personal problems are those who often receive
guidance and assistance in times of crisis. This finding is in line with past research
which indicates that individuals who regularly offer advice and information offline
receive more help more quickly when they ask for something in the online world
(Rheingold, 1993; Wellman and Gulia, 1999). Wellman and Haythornthwaite (2002)
also found that those who have more real support receive more Internet support.
Thus the receipt of support happens synergistically online and offline.
5.3. Internet activities and QoL
However, contrary to what was originally hypothesized, using the Internet for
sociability and users' overall assessment of quality of life were inversely linked. There
are several possible explanations. First, many of the social relationships people
maintain online are less substantial and less sustaining than relationships that people
have in their actual lives. Second, more time spent online may take away from more
valuable activities, including social contacts offline, sleeping, or reading books.
Third, online communication is a less adequate medium for close social communication than the telephone or face-to-face interactions it displaces. Fourth, computer-mediated relationships are usually superficial, with easily broken bonds. This finding
is in line with previous research suggesting that relationships maintained over long
distances through the Internet erode personal security and happiness (Kraut et al.,
1998). In the end, the Internet is useful for linking people to information and social
resources which are unavailable in people's closest local groups (e.g., professional
groups), but may be poor for deep feelings of affection and obligation. Thus, the
weak social ties supported by the Internet are likely to be more limited than
friendships supported by physical proximity. As a result, this negative relationship
may lead to a decrease in the assessment of life quality.
5.4. New media use and social support
It is also interesting to note that frequencies of participation in new media
activities, such as talking on the telephone regularly, playing games on the computer,
and listening to music on CD/MD/MP3, showed positive relationships with social
support, especially in the emotional/informational and positive social interaction
dimensions, among Internet users. This means that use of new media technologies,
such as the telephone, computer, and CD/MD/MP3, may serve various needs such
as companionship, entertainment, and relaxation (Wachter and Kelly, 1998). In past
research, companionship has been linked to the direct effects model of social support
(Antonucci, 1990). In other words, people might receive advice, information, suggestions, relaxation, and various types of social support from a wide range
of new media activities.
5.5. New media use and QoL
Interestingly, use of ICQ, e-mail, and talking on the phone did not significantly
influence QoL as expected (see Table 3 for details). In fact, use of the Internet and
computer were negatively linked to QoL. These findings may mean that heavy use of
the Internet and computer, such as playing computer games and using the Internet
for sociability purposes, may actually degrade quality of life if these technologies
are used excessively or for unhealthy reasons. Furthermore, with the Internet,
we are living in the most plugged-in society in history. Rather than creating time for
leisure, the computer and the Internet may have created ways by which we can do more
work while we are away from the office. Similarly, cellular phones, e-mails, and
Internet access devices are making it virtually impossible to escape our jobs. As a
result, technology may be diminishing our leisure time, not increasing it. In a study
of the impact of TV, Brock (2002) also found that excessive or frequent TV viewing
contributes to a number of issues, including fractured family time, poor reading and
academic performance, increased violence, inactive lifestyles, and obesity. However,
TV-free individuals fill their newly discovered free time with a variety of hobbies,
community involvement, conversation, reading, writing, cooking, and playing (Sirgy
et al., 1998). By turning off the TV and taking back their time, they gained more
communication with children and spouses, improved marriages, experienced less
conflict among siblings, and increased community involvement (Brock, 2002; Kubey
and Csikszentmihalyi, 1990).
In sum, the results from the hierarchical regression analysis show that the
extent to which the QoL of the individual can be enhanced does in part hinge on a
suitable amount of time spent on media-related activities, namely, less time on using
the Internet for intimate self-disclosure, less time on playing computer games, and
more time on listening to music on CD/MD/MP3.
5.6. Leisure activities, social support, and QoL
Finally, although past research indicates that the people-centered leisure attribute,
especially leisure satisfaction, was the best predictor of quality of life and that place-centered attributes failed to influence life quality (Lloyd and Auld, 2001), most
bivariate relationships in this study between people-centered and place-centered leisure activities and social support, as well as quality of life, were found to be
significant. These results are consistent with findings by McCormick and McGuire
(1996) that the primary leisure attribute that creates and maintains life quality is not
exclusively person-centered or place-centered leisure activities, but their interaction.
This means that people who engaged in social activities more frequently and who are
more satisfied with the psychological benefits they derive from leisure, whether
people-centered or place-centered, experienced a higher level of perceived quality of
life (Lloyd and Auld, 2001). Despite these results, however, the hierarchical regression analysis revealed that participating in community and religious activities was
the only people-centered leisure activity predictor which contributed significantly to
the objective assessment of living quality when the influences of Internet activities,
new media use, social support, and demographics were controlled.
Furthermore, it is also interesting to note that socioeconomic status, indicated by
age, gender, education, and income variables, contributed only 3% incremental variance to the 34% total explained, while social support accounted for 20%,
media-related activities 10%, and leisure activities 1% of the variance. This suggests
that economic status is not a key determinant in predicting life quality in these data.
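As a quick arithmetic check on the shares quoted above, the incremental variances reported for the six blocks in Table 4 can be added up. Grouping the first three blocks as "media-related" follows the text's own usage.

```latex
\[
\underbrace{0.02 + 0.08 + 0.00}_{\text{media-related blocks}\;=\;0.10}
\;+\; \underbrace{0.20}_{\text{social support}}
\;+\; \underbrace{0.01}_{\text{leisure activities}}
\;+\; \underbrace{0.03}_{\text{demographics}}
\;=\; 0.34
\]
```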

6. Conclusion
To conclude, this study has demonstrated that social relationships and social
support are potent variables that can enhance quality of life. This suggests that
happy people may be those who receive and give love, affection, sympathy, guidance,
advice, information, and social companionship, which involves spending time with
others in leisure and recreational activities. As a result, well-connected people, both
online and offline, with strong socially supportive relationships would gain
greatly in both quality and quantity of life (House, 1986). Furthermore, use of the
Internet and some new media technologies does play an important role in enhancing life
quality, especially music listening from CD/MD/MP3 and non-pathological use of
computer games, ICQ, or chat rooms on the Internet. However, the addictive potential of the Internet, with its harmful consequences, could silently run rampant in our
schools, our universities, and our homes. These are the new societal challenges that
must be addressed through education. Only when parents and teachers recognize
Internet addiction as a true disorder and offer ways to combat it can schools and
parents start regaining the benefits that certain applications of the Internet have unwittingly taken away. This research supports the need for the formulation of problem
deterrence policies to prevent excessive non-productive use of the Internet if a high
and sustainable quality of life is to be maintained.
Furthermore, while many genuinely appreciate the wonder of technology and the
accommodations it continues to provide, many still find it disconcerting that technology may have created an environment for even greater intrusion, expectations,
and stress. For example, many workers today are perhaps concerned that with their
mobile phones, e-mails, and Internet at home, their work may appear to be a 24-hour job intruding into every other aspect of their lives. In the past, it used to take a
day or two for a memo to reach the employee; now we have instant e-mail, which
demands an instant response. In fact, the long-hours culture is seriously undermining the quality of family life.
Where technology takes us from here is an issue that is widely discussed. It is also
an issue that is hotly disputed. While technological change will always occur, there
will always be a section of society that is unable to accept the change comfortably. With changes as widespread and dramatic as those brought by the Internet,
the associated social changes are also very important. Not everybody is included in
the advantages brought by the Internet, and those included may not be included
evenly. Nevertheless, regardless of the positives and negatives, the Internet will
clearly continue to be part of contemporary life. It is hoped that we use it wisely and
remain vigilant about how we should use the Internet to truly bring about a
better quality of life.

7. Limitations of the study


On the whole, each of the theoretical constructs (social support, leisure activities,
together with Internet activities, new media use, and demographics) performed
reasonably well in helping to explain respondents' self-assessment of life quality. However,
several limitations must be noted. First, this study did not directly consider the
original causal relationships among Internet activities, use of new media, social
support, leisure activities, and quality of life. We recognize that the impacts of the
Internet on social support and QoL may be bidirectional. Furthermore, nothing in
the data allows causal conclusions. Therefore, longitudinal studies will be better
equipped to address this cause-and-effect issue. Future studies may address the reverse causal linkages in addition to the impact of the forward direction of causality.
Second, lack of information regarding the purpose of the use of e-mails, telephones,
ICQ, and types of content viewed on the Internet greatly reduces the ability to relate
how these technologies enhance quality of life or how much of the positive psychological effect was due to the use of the Internet and how much was due to
social interactions offline. Third, qualitative, interpretative methods were lacking to
fully explore the diversity of meanings attached to personal assessment of life
quality and social support by different people. Future research should address
whether increasing the number of relationships on the Internet is related to
increased feelings of connection with society. Further, studies on the impact of
media-related technologies on quality of life should focus on gender differences and
cross-national comparisons in future.

References
Abbey, A., 1993. The effect of social support on emotional well-being. Paper presented at the First
International Symposium on Behavioral Health. Nags Head, North Carolina.
Anderson, B., Tracey, K., 2001. Digital living: the impact (or otherwise) of the Internet on everyday life.
American Behavioral Scientist 45 (3), 456–475.
Andrew, F.M., 1986. Research on the quality of life. Survey Research Center, Institute for Social
Research, University of Michigan, MI.
Andrews, F.M., Withey, S.B., 1976. Social Indicators of Well-being: America's Perception of Life Quality.
Plenum, New York.
Antonucci, T.C., 1990. Social support and social relationships. In: Binstock, R.H., George, L.K. (Eds.),
Handbook of Aging and the Social Sciences. Academic Press, San Diego, CA, pp. 205–226.
Argyle, M., 1987. The Psychology of Happiness. Methuen, London.
Auld, C., Case, A., 1997. Social exchange processes in leisure and non-leisure settings: a review and
exploratory investigation. Journal of Leisure Research 29, 183–200.
Bier, M., Gallo, M., 1997. Personal empowerment in the study of home Internet use by low-income
families. Journal of Research on Computing in Education 30 (2), 107–121.
Brennan, P.F., Ripich, S., Moore, S.M., 1991. The use of home-based computers to support persons living
with AIDS/ARC. Journal of Community Health Nursing 8, 3–14.
Brennan, P.F., Moore, S.M., Smyth, K., 1995. The effects of a special computer network on caregivers of
persons with Alzheimer's disease. Nursing Research 44, 166–172.
Brock, B., 2002. Life without TV. Parks & Recreation 37 (11), 68–72.
Cobb, S., 1976. Social support as a moderator of life stress. Psychosomatic Medicine 38, 301–314.
Cohen, S., Syme, L., 1985. Social Support and Health. Academic Press, Orlando, FL.
Csikszentmihalyi, M., 1997. Finding Flow: The Psychology of Engagement in Everyday Life. Basic Books,
New York.
Diener, E., 1984. Subjective well-being. Psychological Bulletin 95 (3), 542–575.

Diener, E., Emmons, R., Larsen, R., Griffin, S., 1985. The satisfaction with life scale. Journal of
Personality Assessment 49, 71–75.
Donald, C.A., Ware, J.E., 1984. The measurement of social support. In: Greenley, R. (Ed.), Research in
Community and Mental Health, vol. 4. JAI Press, Greenwich, CT, pp. 325–370.
Dowall, J., Bolter, C., Flett, R., Kammann, R., 1988. Psychological well-being and its relationship to
fitness and activity levels. Journal of Human Movement Studies 14, 39–45.
Foong, A., 1992. Physical exercise/sports and biopsychosocial well-being. Journal of the Royal Society of
Health 112, 227–230.
Gallienne, R.L., Moore, S.M., Brennan, P.F., 1993. Alzheimer's caregivers: psychosocial support via
computer networks. Journal of Gerontology Nursing 19, 15–22.
Henderson, C., 2001. How the Internet is changing our lives. Futurist 35 (4), 38–45.
House, J.S., 1986. Social support and the quality and quantity of life. In: Andrew, F.M. (Ed.), Research on
the Quality of Life. Survey Research Center, Institute for Social Research, University of Michigan,
Ann Arbor, MI.
Houston, T.K., Cooper, Ford, D.E., 2002. Internet support groups for depression: a 1-year prospective
cohort study. The American Journal of Psychiatry 159 (12), 2062–2068.
Hu, S., Leung, L., 2003. Effects of expectancy-value, attitudes, and use of the Internet on psychological
empowerment experienced by Chinese women at the workplace. Telematics and Informatics 20 (4),
365–382.
Kahn, R.L., Antonucci, T.C., 1980. Convoys over the life course: attachment, roles and social support.
In: Baltes, P.B., Brim, O. (Eds.), Life-Span Development and Behavior, vol. 3. Lexington Press,
Boston.
Kavanagh, A.L., Patterson, S.J., 2001. The impact of community computer networks on social capital and
community involvement. American Behavioral Scientist 45 (3), 496–510.
Kernan, J., Unger, L., 1987. Leisure, quality of life, and marketing. In: Samli, A. (Ed.), Marketing and the
Quality of Life Interface. Quorum Books, New York, pp. 236–252.
Kraut, R., Patterson, M., Lundmark, V., Kiesler, S., Mukopadhyay, T., Scherlis, W., 1998. Internet
paradox: a social technology that reduces social involvement and psychological well-being? American
Psychologist 53, 1017–1031.
Kubey, R., Csikszentmihalyi, M., 1990. Television and the Quality of Life: How Viewing Shapes Everyday
Experience. LEA, Hillsdale, NJ.
Leung, L., 2004. Societal, organizational and individual factors in the adoption of telework. In: Lee, P.,
Leung, L., So, C.Y.K. (Eds.), Impact and Issues in News Media: Toward Intelligent Societies.
Hampton Press, Cresskill, NJ.
Lloyd, K.M., Auld, C.J., 2001. The role of leisure in determining quality of life: issues of content and
measurement. Social Indicators Research 57, 43–71.
McCormick, B., McGuire, F., 1996. Leisure in community life of older rural residents. Leisure Sciences 18,
77–93.
McDowell, I., Newell, C., 1996. Measuring Health: A Guide to Rating Scales and Questionnaires, second
ed. Oxford University Press, New York.
McPheat, D., 1996. Technology and life-quality. Social Indicators Research 38 (1), 29–52.
Mercer, C., 1994. Assessing liveability: from statistical indicators to policy benchmarks. In: Mercer, C.
(Ed.), Urban and Regional Quality of Life Indicators. Institute for Cultural Policy Studies, Griffith
University, Brisbane, pp. 3–12.
Moller, V., 1992. Spare time use and perceived well-being among black South African youth. Social
Indicators Research 26, 309–351.
Prenshaw, P.J., 1994. Good life images and brand name associations: evidence from Asia, America, and
Europe. In: Allen, C., John, D.R. (Eds.), Advances in Consumer Research, vol. 21. Association for
Consumer Research, Provo, UT.
Putnam, R.D., 1995. Bowling Alone: The Collapse and Revival of American Community. Simon and
Schuster, NY.
Rheingold, H., 1993. The Virtual Community: Homesteading on the Electronic Frontier. Addison-Wesley,
Reading, MA.

Ross, C.E., Mirowsky, J., 2002. Family relationships, social support and subjective life expectancy.
Journal of Health and Social Behavior 43 (4), 469–489.
Sherbourne, C.D., Stewart, A., 1991. The MOS social support survey. Social Science & Medicine 32,
705–714.
Shin, C.C., Johnson, D.M., 1978. Avowed happiness as an overall assessment of quality of life. Social
Indicators Research 5, 475–492.
Siegenthaler, K., Vaughan, J., 1998. Older women in retirement communities: perceptions of recreation
and leisure. Leisure Sciences 20, 53–66.
Sirgy, M.J., Lee, D.J., Kosenko, R., Meadom, H.L., 1998. Does television viewership play a role in the
perception of quality of life? Journal of Advertising 27 (1), 125–142.
Wachter, C., Kelly, J., 1998. Exploring VCR use as a leisure activity. Leisure Sciences 20, 213–227.
Wankel, L., Berger, B., 1990. The psychological and social benefits of sport and physical activity. Journal
of Leisure Research 22, 167–182.
Wei, R., Leung, L., 1998. Owning and using new media technology as predictors of quality of life.
Telematics and Informatics 15 (4), 237–251.
Wellman, B., Gulia, M., 1999. Net surfers don't ride alone. In: Wellman, B. (Ed.), Networks in the Global
Village. Westview, Boulder, CO, pp. 331–366.
Wellman, B., Haythornthwaite, C., 2002. The Internet in Everyday Life. Blackwell, Malden, MA.
White, H., McConnell, E., Clipp, E., Bynum, L., 1999. Surfing the Net in later life: a review of the
literature and pilot study of computer use and quality of life. Journal of Applied Gerontology 18 (3),
358–378.
Wright, K., 2000. Computer-mediated social support, older adults, and coping. Journal of Communication 50 (3), 100–118.

Computers in Human Behavior 18 (2002) 437–451

www.elsevier.com/locate/comphumbeh

Relationships among Internet use, personality, and social support
Rhonda J. Swickert*, James B. Hittner, Jamie L. Harris,
Jennifer A. Herring
Department of Psychology, College of Charleston, 66 George Street, Charleston, SC 29424, USA

Abstract
Competing claims have been presented in the literature regarding the impact of Internet use
on social support. Some theorists have suggested that Internet use increases social interaction
and support (Silverman, 1999, American Psychologist 54, 780–781), while others have argued
that it leads to decreased interaction and support (Kiesler & Kraut, 1999, American Psychologist 54, 783–784). This study was designed to address this issue by examining the relationships among Internet use, personality, and perceived social support. Two hundred and six
participants completed questionnaires that assessed Internet use, personality (agreeableness,
conscientiousness, extraversion, neuroticism, openness), and perceived social support. Using
principal components analysis, individual computer activities were combined into three primary factors: Technical, Information Exchange, and Leisure. Correlation and regression
analyses revealed only a marginal relationship between computer use and social support.
Similarly, only modest associations were found between personality and computer use. However, personality did moderate the relationship between computer use and social support.
That is, on two occasions, high computer use coupled with high personality was associated
with decreased perceived social support and on a third occasion this combination resulted in
increased perceived social support. These results help to address some of the inconsistencies
that have been reported in the literature. © 2002 Elsevier Science Ltd. All rights reserved.
Keywords: Internet; Computer use; Social support; Personality

It can be argued that the Internet has opened up a new frontier for human interaction. Like any new frontier there are many unknown factors and challenges associated with its exploration. The Internet is no exception to this rule, especially when
one is attempting to understand the impact of online activity on social interaction.
* Corresponding author. Fax: +1-843-953-7151.
E-mail address: swickertr@cofc.edu (R.J. Swickert).
0747-5632/02/$ - see front matter © 2002 Elsevier Science Ltd. All rights reserved.
PII: S0747-5632(01)00054-1

In particular, the role that the Internet might play in influencing an individual's
social support system is, as of yet, unclear. Some researchers have suggested that
online activity might serve to facilitate an individual's feeling of social support
(Bromberg, 1996; Mickelson, 1997; Parks & Floyd, 1996; Silverman, 1999; Winzelberg, 1997). Others have indicated that Internet use can actually degrade social
relationships and reduce an individual's feeling of support (Jones, 1997; Kiesler &
Kraut, 1999; Kraut, Patterson, Lundmark, Kiesler, Mukopadhyay, & Scherlis,
1998b). This study was designed to test these competing claims by investigating the
relationship between Internet use and perceived social support.
Researchers who argue that Internet use facilitates feelings of social connectedness
and social support cite a variety of factors that appear to contribute to this effect.
One of the most important of these factors concerns the opportunity that the Internet affords individuals to meet and interact with people who have similar interests
(McKenna & Bargh, 2000). Relationships formed online via chat rooms or discussion groups might allow individuals with mutual interests or experiences to obtain
information and encouragement from others who are like-minded. Similarity has
long been known to contribute to friendship formation (Martin & Anderson, 1995;
Newcomb, 1961) and the Internet seems to maximize this effect. Indeed, researchers
have determined that it is common for individuals to form friendships with others
online (Katz & Aspden, 1997; The UCLA Internet Report, 2000) and to consider
those relationships to be as close as face-to-face non-Internet relationships
(McKenna, 1998; Parks & Floyd, 1996). Furthermore, research has demonstrated
that online relationships can be an important source of social support. For instance,
Winzelberg (1997), using an archival analysis approach, analyzed the postings of an
eating disorder discussion group over a 3-month period. Comments posted were
categorized into different types of social interaction. While it was found that the
most common message content involved self-disclosure (31%), requests for information (23%) and the direct provision of emotional support (16%) were also
recorded. These results are consistent with the conclusion that individuals do receive
(and provide) social support through online interaction, and similar research has
supported this finding (King & Moreggi, 1998; Mickelson, 1997). Unfortunately,
though, this work is based primarily on discussion group participants and therefore
may not generalize to other types of online contact (e.g., chat rooms, multiuser
dungeons). In addition, other research has suggested that online interaction
may actually reduce social connections and feelings of social support (Kraut et
al., 1998b).
The Home Net Project (Kraut, Kiesler, Mukopadhyay, Scherlis, & Patterson,
1998a) is the seminal study to date that provides evidence for the negative social
impact of the Internet. In this study, a sample of 169 people in the Pittsburgh,
Pennsylvania area were followed during their first 2 years online. Kraut et al.
(1998a) reported that as participants used the Internet more their social connectedness, as measured by contact with family and friends, was reduced. Participants'
perceptions of their social support were also measured over the 2-year period.
Although a negative relationship was found between Internet use and perceived
support, this relationship failed to meet the traditional level of statistical significance.

R.J. Swickert et al. / Computers in Human Behavior 18 (2002) 437–451

One reason why this effect may have failed to reach significance is that the
measure used to assess social support was an abbreviated version of a larger scale
(the Interpersonal Support Evaluation List) and therefore the range of the scale may
have been restricted, making it difficult to detect a significant effect. Also, because
only part of the scale was used, the measure may not have been psychometrically
reliable or valid. Because of these methodological problems the relationship between
Internet use and perception of social support remains unclear.
Given the conflicting theoretical views, inconsistent research findings, and paucity
of strong empirical evidence, further study is required to clarify the relationship
between Internet use and social support. We were particularly interested in determining the relationship between online activity and a type of support called perceived social support. Measures of perceived support assess whether individuals
perceive that they have others they can turn to for support (Cohen & Hoberman,
1983). We chose to focus on this facet of social support because recent research
suggests that perceived support is more psychologically salient and meaningful than
other types of support (e.g., objective or structural support; Hittner & Swickert,
2002). In addition, perceived support has been shown to be more strongly associated
with effective coping efforts than are other types of social support processes (Lakey
& Drew, 1997; Mankowski & Wyer, 1997). Given the importance of this type of
social support, it appears reasonable to assume that if Internet use does indeed
influence social support, then, by measuring perceived levels of support, one should
be able to assess this putative effect. The question remains, however, as to the nature
of this effect. That is, does Internet use increase the amount of support a person
perceives because he or she now has more people in their support network? Or,
conversely, would it reduce the quality of the Internet user's face-to-face social
contacts and lead to a degraded sense of support? Addressing this issue was one goal
of this study.
In addition to investigating the relationship between online activity and perceived social support, we were also interested in exploring the relationship between
Internet use and personality. One potentially fruitful place to start in addressing
the relationship between personality and online activity is with the Five Factor
Model (FFM) of personality. Extraversion (E) and neuroticism (N) are two of the
proposed "Big Five" personality traits; the other Big Five traits include agreeableness (A), openness (O), and conscientiousness (C) (Costa & McCrae, 1992a). The
Big Five, while not universally accepted (see Block, 1995, for a dissenting opinion),
are generally viewed as the essential traits of personality (McCrae & Costa,
1999), and they have been demonstrated to account for a wide variety of behaviors
from job performance to stress and coping (Barrick & Mount, 1991; O'Brien &
DeLongis, 1996; Watson & Hubbard, 1996). Likewise, there is reason to believe
that some of these personality traits may be predictive of Internet use. For
instance, it could be argued that individuals who are high in openness, with their
curious manner and their tendency toward adventure seeking (McCrae, 1996),
might be very attracted to online activity as an opportunity to explore and seek out
the new and novel. Recent research seems to bear out this prediction (Tuten &
Bosnjak, 2001). Similarly, individuals high in agreeableness are often described as
very nice and easy to get along with (Costa & McCrae, 1992b). Given the
sometimes hostile nature of Internet interactions (Joinson, 1998), this trait might
make them very attractive to others when they go online and make it easier for
them to form friendships online. Likewise, individuals that are high in extraversion
tend to be gregarious and are attracted to stimulating environments (Eysenck,
1967). This tendency may influence the extravert to go online to seek out the new
and exciting. In fact, researchers have documented, at least for males, a positive
association between extraversion and surfing sex web sites (Hamburger & Ben-Artzi, 2000). However, in the same study, a negative correlation was found
between extraversion and traditional social online activities (e.g., visiting chat rooms,
participating in discussion groups). Finally, it has been documented that individuals
that are high in neuroticism report lower levels of Internet usage (Tuten &
Bosnjak, 2001), and, in particular, information based activities (e.g., utilizing
search engines). This tendency may be due to the neurotic's higher level of
anxiety and lowered self-efficacy in this particular domain. While these findings
are suggestive, much of this work is based on single studies employing relatively
small numbers of subjects drawn from psychology courses. We were interested in
replicating these findings in a larger and broader-based sample. Moreover, we
chose to have participants report the specific amount of time they spent engaged
in online activities, rather than asking subjects to approximate their time
online with Likert-scale descriptors (1=not at all; 5=a lot). Because of this
specificity, we believe our approach will yield a more precise assessment of Internet
use.
In addition to addressing the association between Internet use and personality, a
third goal of this study was to determine the potential moderating role that personality might play between Internet use and social support. Personality factors have
been demonstrated to affect both Internet use (Hamburger & Ben-Artzi, 2000;
Kraut et al., 1998b) and perceived social support (Halamandaris & Power, 1999;
Procidano, 1992; Turner, 1999). Because of this, we believe that personality and
Internet use might interact to influence perceived support. To illustrate, individuals
high and low in extraversion (E) might be differentially affected by the same level of
Internet use. Whereas an individual high in E might report no change in perceived
support based on online social experiences, an individual low in E might report
enhanced support. This difference could be due to the fact that low E individuals,
compared to high E individuals, have more to gain from these interactions because
their social support network is typically smaller. Determining the precise nature of
the interaction between personality (A, C, E, N, O) and Internet use was a final goal
of this study.
To summarize, there were three major aims of this study. First, this study examined the relationship between Internet use and perceived social support to determine what effect, if any, Internet use may have on social support. A second aim of
the study was to determine if personality dimensions might influence frequency and
type of Internet use. Finally, a third aim of the study was to explore how personality might moderate the relationship between Internet use and perceived social
support.

1. Method
1.1. Participants
Two hundred and six participants were recruited from computer science, political
science, psychology, and sociology classes at a medium-sized public liberal arts college in the southeastern United States. Sixty-one percent of the participants were
female (39% male) and their ages ranged from 18 to 45 (M=21.34). Finally, 18% of
the participants were African-American, 1% Asian, 78% Caucasian, and 3% classified themselves as "other."
1.2. Materials
1.2.1. Social support
The Interpersonal Support Evaluation List (ISEL; Cohen & Hoberman, 1983) was
used to assess the perceived availability of social support. This 48-item questionnaire
assesses four types of social support and yields an overall support measure. The
Appraisal subscale assesses the perceived availability of someone to talk to about
one's problems; the Belonging subscale, the perceived availability of people to do
things with; the Self-esteem subscale, the perceived availability of a positive comparison when comparing oneself to others; and the Tangible subscale, the perceived
availability of someone to provide material aid. Individuals are asked to indicate
whether statements concerning the availability of social support are probably true or
probably false. Items associated with each subscale are summed together to yield
four subscale totals and all of the items are added together to arrive at a total score.
The subscale scores can range from 0 to 12 and the total score can range from 0 to
48. The higher the total score, the higher the level of perceived support. The internal
reliability of the overall scale is good (alpha=0.77). Internal reliability of the
appraisal, belonging, self-esteem, and tangible subscales is adequate as well
(alpha=0.77, 0.75, 0.68, and 0.71, respectively). In the present study, alpha coefficients for the subscales could not be calculated because only the sum scale scores,
rather than the individual items, were recorded. However, the internal reliability of
the overall scale could be estimated by treating the four subscales as items. The
resulting alpha coefficient for the total ISEL was 0.77. Descriptive statistics, including mean, standard deviation, and range for the ISEL are reported in Table 1.
Information regarding the construct and convergent validity of the ISEL can be
found in Cohen and Hoberman (1983).
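The reliability estimate described above, in which the four subscale totals stand in for items, can be illustrated with a short sketch. This is only a minimal illustration of the standard Cronbach alpha formula; the function name `cronbach_alpha` and the scores below are hypothetical and are not taken from the study's data.

```python
from statistics import pvariance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of item columns (one list of scores per item)."""
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Total score per participant: sum across items, row by row.
    totals = [sum(scores) for scores in zip(*items)]
    sum_item_var = sum(pvariance(column) for column in items)
    total_var = pvariance(totals)
    return (k / (k - 1)) * (1 - sum_item_var / total_var)


# Hypothetical subscale totals (0-12 each) for six participants.
appraisal = [11, 9, 12, 6, 10, 8]
belonging = [9, 8, 11, 5, 9, 7]
self_esteem = [8, 9, 10, 6, 8, 7]
tangible = [11, 10, 12, 8, 11, 9]

alpha = cronbach_alpha([appraisal, belonging, self_esteem, tangible])
print(round(alpha, 2))
```

Because only four "items" enter the formula, an estimate obtained this way is best read as a rough lower-level check rather than an item-level reliability coefficient.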
1.2.2. Internet use
The Computer Use Survey (CUS) was developed to assess Internet use and social
contact through online interactions. The survey requires the participant to record
the amount of time (hours/minutes) in an average week that she or he engages in a
variety of online activities including: search and do research, visit bulletin boards,
visit chat rooms, create/update websites, play games, use email, use instant messaging, visit multiuser dungeons, and access information as a form of entertainment
(e.g., read newspaper, listen to music). Descriptive statistics for these variables are
presented in Table 1.

Table 1
Descriptive statistics for the ISEL, the CUS, and the NEO-FFI

Variable                 Mean      Standard deviation   Range
Social support
  Appraisal              10.16     2.46                 0–12
  Belonging              8.26      2.59                 0–12
  Self-esteem            8.55      2.21                 0–12
  Tangible               10.67     1.90                 0–12
  ISEL total             37.62     7.08                 10–48
Computer use (a)
  Search/research        131.17    204.83               0–2100
  Bulletin board         12.51     51.54                0–600
  Chat room              18.03     56.95                0–390
  Create webpage         8.52      35.34                0–300
  Play games             60.54     204.07               0–2100
  Email                  160.19    244.89               0–2400
  Instant messaging      77.65     176.70               0–1500
  Multiuser dungeon      2.95      23.10                0–300
  Access information     150.16    230.49               0–1200
Personality
  Agreeableness          31.30     6.57                 6–44
  Conscientious          30.74     7.04                 11–46
  Extraversion           30.87     6.40                 3–43
  Neuroticism            21.77     8.77                 2–42
  Openness               28.81     5.72                 15–44

(a) Internet use is reported in minutes.

1.2.3. Personality
The NEO Five-Factor Inventory (NEO-FFI; Costa & McCrae, 1992b) was used
to assess agreeableness (A), conscientiousness (C), extraversion (E), neuroticism (N),
and openness (O). It consists of 60 items that participants respond to using a five-point Likert scale format (strongly disagree to strongly agree). Each factor is made
up of 12 items that collectively yield a score of 0–48. In each case, higher numbers
are associated with higher levels of the personality factor. Internal consistency of
this scale was calculated using coefficient alpha. Coefficients for A, C, E, N, and O
were 0.68, 0.81, 0.77, 0.86, and 0.73, respectively. Construct validity for this test is
reported in the NEO PI-R Manual (Costa & McCrae, 1992b). Descriptive statistics
for the NEO-FFI can be found in Table 1.
1.3. Procedure
Participants were tested while in class, at various sites on campus, typically in a
group of 20–25, during 45-min testing sessions. All testing occurred between the
hours of 9.00 a.m. and 3.00 p.m. Participants were told that the purpose of the study
was to examine factors associated with Internet use. The assessment packets were
then administered to the participants in the following order: the demographic form,
the ISEL, the CUS, and the NEO-FFI. After completing the study, participants
were thanked for their participation and debriefed.

2. Results
Prior to addressing the major aims of the study, a variety of preliminary analyses
were conducted to reduce the data, transform skewed variables, and screen for
multivariate outliers. Regarding the data reduction procedure, the nine Internet use
variables were subjected to a principal components factor analysis with varimax
rotation and Kaiser normalization. Inspection of the scree plot indicated a three-factor solution and the types of Internet use loading on each factor are as follows
(Cronbach alpha values in parentheses): (1) bulletin board use, chat room visitation,
creating web pages, and multiuser dungeon visitation (0.69), (2) search/research,
email use, and accessing information (0.60), and (3) utilizing instant messenger and
playing games (0.75). In light of the low Cronbach alpha coefficient for factor No. 2,
we examined the intercorrelations among the three Internet use variables. Although
the correlation between email and accessing information was moderately large in
magnitude (r=0.54), the correlations between these two variables and search/
research were considerably smaller (rs of 0.23 and 0.19 for email and accessing
information, respectively). Given these results, we excluded search/research from the
second factor and recalculated the Cronbach alpha. The new alpha coefficient was
0.70 and the three principal component factors accounted for 70% of the variance in
the Internet use intercorrelation matrix. Three Internet use factor variables were
then created by summing the individual Internet use variables within each factor.
Upon visual inspection of the component variables, we decided to label the first
factor Technical. This factor was made up of bulletin board use, chat room visitation, creating web pages, and multiuser dungeon visitation. While these activities
appear to be quite diverse, we believe that there is a common underlying theme for
this factor. That is, one must be fairly technologically savvy to be able to engage in
all of these online activities, hence, the label of Technical seemed appropriate. The
second factor is comprised of email use and accessing information. Both of these
Internet activities involve either generating (email) or receiving (accessing) information. Therefore, we chose to label this factor Information Exchange. Finally, the
third factor, Leisure, is made up of utilizing instant messaging and playing games.
We felt that both of these activities were consistent with relaxing and having fun
through playing games and interacting with others online.
While these three factors collectively seemed to effectively capture online computer use, not all of our participants engaged in these online activities to an equal
degree. In fact, upon inspection of the factors it was found that many participants
reported that they did not engage in one or more of these types of online activities.
For instance, out of 206 participants only 70 individuals reported computer use
consistent with the Technical factor, 183 participants reported computer use
consistent with the Information Exchange factor, and 122 participants reported
computer use consistent with the Leisure factor. In an attempt to effectively deal
with this issue we decided to exclude non-users from the analyses. We reasoned that
it would be irrelevant to ask if computer use was influencing social support for these
individuals given the fact that they were not participating in these types of online
activities. Therefore, all subsequent analyses were conducted solely on participants
who reported use consistent with each factor (Technical, 70 participants; Information Exchange, 183 participants; Leisure, 122 participants).
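The construction of the three factor variables and the exclusion of non-users can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary keys, the function names `factor_scores` and `users_of`, and the sample records are hypothetical, and "non-user" is taken here to mean a factor sum of zero minutes.

```python
# Internet use variables grouped by factor, following the factor
# composition reported in the text (search/research is excluded).
FACTORS = {
    "Technical": ["bulletin_board", "chat_room", "create_webpage", "multiuser_dungeon"],
    "Information Exchange": ["email", "access_information"],
    "Leisure": ["instant_messaging", "play_games"],
}


def factor_scores(record):
    """Sum weekly minutes over the variables that make up each factor."""
    return {
        name: sum(record.get(variable, 0) for variable in variables)
        for name, variables in FACTORS.items()
    }


def users_of(records, factor):
    """Keep only participants reporting any use on the given factor (assumed rule)."""
    return [r for r in records if factor_scores(r)[factor] > 0]


# Hypothetical participants (minutes per week).
participants = [
    {"email": 120, "access_information": 60, "play_games": 30},
    {"email": 45, "chat_room": 20},
    {"instant_messaging": 200},
]

for factor in FACTORS:
    print(factor, len(users_of(participants, factor)))
```

Filtering per factor, rather than once for the whole sample, is what gives each analysis its own sample size.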
After completing the principal components analyses and excluding individuals
who reported no computer use by type of online activity, the normality of the distributions for each of the three computer use factors was examined. Due to the significant positive skewness of all three factors, each distribution was logarithmically
transformed to increase normality. We also screened for multivariate outliers before
conducting multiple regression analyses. In particular, all participants who evidenced statistically significant Mahalanobis D² values were excluded from the
regression analyses. Finally, because so little is known about the relationships
among specific types of Internet use, personality and perceived social support, we
felt it was important to not overlook potential associations among these variables.
Therefore, in order to minimize the likelihood of committing Type II errors, we set
our critical P-value for all analyses at a more liberal value of P < 0.10. We report all
of our findings on the basis of this value; however, we label P-values between 0.06
and 0.10 as marginally significant.
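The effect of a logarithmic transformation on a positively skewed usage variable can be illustrated with a small sketch. The base of the logarithm is not stated in the text, so base 10 is an assumption here, as are the function name `skewness` and the sample minutes; because non-users were excluded, every value is positive and no offset is needed.

```python
import math


def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical weekly minutes for users of one factor (all > 0).
minutes = [5, 10, 15, 30, 60, 120, 600, 2100]
logged = [math.log10(m) for m in minutes]  # base 10 is an assumption

print(round(skewness(minutes), 2), round(skewness(logged), 2))
```

For data like these, a few very heavy users dominate the raw distribution, and the transformed values are noticeably closer to symmetric.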
The first aim of the study was to determine if Internet use is related to perceived
social support. Correlational procedures were used to investigate this issue. No significant correlations were found between two of the three types of Internet use and
the ISEL. Specifically, nonsignificant associations were found between Technical
and the ISEL (r=−0.11, P=0.18) and Information Exchange and the ISEL
(r=0.03, P=0.32). However, a marginally significant correlation was found between
Leisure and the ISEL (r=0.13, P=0.08). To further examine this issue a simultaneous multiple regression analysis (SMR) was conducted by entering all online
activities in one block to predict ISEL. No significant effects were found in this
analysis.
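As a reminder of how a correlation coefficient and its test statistic are related, the following sketch computes Pearson's r and the corresponding t value with n − 2 degrees of freedom. The function names and the sample data are hypothetical; the sketch does not reproduce the study's analyses and does not compute P-values.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t value for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical log-transformed usage and ISEL totals for six participants.
log_usage = [1.2, 1.8, 2.1, 2.4, 2.9, 3.1]
isel_total = [35, 33, 40, 38, 41, 39]

r = pearson_r(log_usage, isel_total)
print(round(r, 2), round(t_statistic(r, len(log_usage)), 2))
```

The formula makes clear why a small r can still approach a liberal significance threshold when the sample is reasonably large.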
To explore the second aim of the study, the relationship between personality, as
measured by the NEO-FFI, and Internet use was examined. No significant correlations were found between personality and Technical. However, personality was significantly correlated with Information Exchange and Leisure. Regarding
Information Exchange, both neuroticism (r=−0.11, P=0.07) and agreeableness
(r=−0.10, P=0.09) evidenced marginally significant correlations. A SMR analysis
was conducted by entering all five personality traits in one block to predict Information Exchange. The results from this analysis were somewhat consistent with the
correlational findings in that there was a marginally significant effect for neuroticism
[t(175)=−1.69, P=0.09, β=−0.140]. However, no significant effect was found for
agreeableness. Regarding the computer use factor of Leisure, significant correlations
were found with neuroticism (r=−0.16, P=0.04) and conscientiousness (r=0.15,
P=0.05), and a marginally significant association was found with extraversion
(r=0.13, P=0.08). However, results of a SMR analysis failed to reveal any significant personality predictors of Leisure.
To explore the third aim of the study, hierarchical multiple regression analyses
were conducted to examine the interactive effects of Internet use and personality on
perceived social support. The first set of analyses explored the interactive effects of
Technical and personality in predicting the ISEL. The main effects of personality (A,
C, E, N, and O) and Technical were entered in the first block of the equation and the
interactions between personality and Technical (A × Technical, C × Technical,
E × Technical, N × Technical, O × Technical) were entered in the second block of the
equation. Results of these analyses demonstrated a significant main effect for extraversion [t(60)=4.74, P=0.001, β=0.587] and a marginal main effect for openness
[t(60)=−1.91, P=0.06, β=−0.195]. In addition, a marginally significant interaction
effect was found between neuroticism and Technical [t(55)=−1.75, P=0.09,
β=−0.957]. In order to explore the nature of the interaction effect, we plotted the
four mean ISEL scores that are obtained by factorially crossing the neuroticism and
Technical factor (i.e., Low N, Low T; Low N, High T; High N, Low T; High N,
High T). These group means indicated that individuals who are high in neuroticism
and high in Technical have lower levels of perceived social support than any other
group (Fig. 1).
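The four group means used to probe an interaction can be sketched as below. The text does not say how the low and high groups were formed, so the median split used here is an assumption, as are the function name `cell_means` and the sample scores.

```python
from statistics import mean, median


def cell_means(trait, usage, support):
    """Mean support in each Low/High trait by Low/High usage cell.

    Groups are formed by a median split (an assumed rule): values above
    the median are labeled High, all others Low.
    """
    trait_cut, usage_cut = median(trait), median(usage)
    cells = {}
    for t, u, s in zip(trait, usage, support):
        key = (
            "High" if t > trait_cut else "Low",
            "High" if u > usage_cut else "Low",
        )
        cells.setdefault(key, []).append(s)
    return {key: mean(scores) for key, scores in cells.items()}


# Hypothetical neuroticism scores, log usage, and ISEL totals.
neuroticism = [12, 14, 30, 33, 15, 31, 13, 34]
technical = [1.0, 2.6, 1.1, 2.8, 2.5, 1.2, 0.9, 2.9]
isel = [41, 40, 38, 29, 42, 37, 40, 31]

for (n_level, t_level), value in sorted(cell_means(neuroticism, technical, isel).items()):
    print(f"{n_level} N, {t_level} T: {value:.1f}")
```

Comparing the four cell means in this way shows the shape of an interaction but discards information, so it complements rather than replaces the regression test.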
Fig. 1. Mean perceived social support scores by Technical computer use and Neuroticism.

The second analysis examined the interactive effects of Information Exchange and
personality in predicting the ISEL. The same procedure as mentioned above was
utilized in entering the variables into the equation to predict the ISEL. Significant
main effects were found for neuroticism [t(167)=−2.49, P=0.01, β=−0.181],
extraversion [t(167)=5.58, P=0.001, β=0.425], and openness [t(167)=−2.03,
P=0.04, β=−0.137]. A marginally significant interaction effect was found for neuroticism and Information Exchange [t(167)=−1.82, P=0.07, β=−0.707]. Visual
inspection of the mean ISEL scores for the four groups indicated that individuals
high in neuroticism and high in Information Exchange reported lower levels of perceived social support than any other group (Fig. 2).
The third analysis examined the interactive effects of Leisure and personality in
predicting the ISEL. The same procedure as noted above was utilized in entering the
variables into the equation to predict the ISEL. Once again, significant main effects
were found for neuroticism [t(110)=−2.71, P=0.01, β=−0.246] and extraversion
[t(110)=4.75, P=0.001, β=0.438]. A significant interaction effect was also found for
agreeableness and Leisure [t(105)=2.38, P=0.02, β=1.691]. An examination of the
four group means indicated that individuals high in agreeableness and high in Leisure
reported higher levels of perceived social support than any other group (Fig. 3).
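The two-block procedure used in these analyses, main effects first and product terms second, can be sketched with a small ordinary least squares routine. Everything named here is hypothetical: the helper functions, the single trait and single usage variable, and the data, which are generated from a known model so that the result can be checked. Mean-centering the predictors before forming the product term is an assumption of this sketch; the text does not say whether the variables were centered.

```python
def solve(a, b):
    """Solve the linear system a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def r_squared(predictors, y):
    """R-squared of an OLS fit with intercept; predictors is a list of columns."""
    rows = [[1.0] + list(values) for values in zip(*predictors)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def center(column):
    """Subtract the mean from every value (assumed preprocessing step)."""
    mean = sum(column) / len(column)
    return [v - mean for v in column]


# Hypothetical data generated from a known model with an interaction.
trait = center([10, 14, 18, 22, 26, 30, 34, 38])
usage = center([1.0, 2.5, 1.5, 3.0, 1.2, 2.8, 1.8, 3.2])
product = [t * u for t, u in zip(trait, usage)]
support = [40 - 0.2 * t + 1.0 * u - 0.5 * p for t, u, p in zip(trait, usage, product)]

r2_block1 = r_squared([trait, usage], support)           # main effects only
r2_block2 = r_squared([trait, usage, product], support)  # plus interaction term
print(round(r2_block1, 3), round(r2_block2, 3), round(r2_block2 - r2_block1, 3))
```

The increase in R-squared from block 1 to block 2 is the variance attributable to the interaction term over and above the main effects, which is the quantity a moderation test evaluates.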

Fig. 2. Mean perceived social support scores by Information Exchange computer use and Neuroticism.

Fig. 3. Mean perceived social support scores by Leisure computer use and Agreeableness.

3. Discussion
There were three major aims of this project. First, this study attempted to determine the association between Internet use and perceived availability of social support. Results of this study indicated no strong relationships between these variables.
However, a marginally significant positive correlation was found between Leisure
and the ISEL. The factor of Leisure involves social Internet activities like instant
messaging and playing games with others online. The positive relationship between
these variables indicates that individuals who reported higher Leisure use perceived
greater social support when compared with individuals who reported less Leisure-based online activity. Although this positive correlation is suggestive, it is important
to note that this finding is only marginally significant and SMR analyses did not
replicate this association.
The second aim of this project was to determine the relationship between Internet
use and five basic personality factors. There were no significant relationships found
between Technical Internet use and any of the personality traits. However, personality was marginally related to Information Exchange (email and accessing information) and Leisure (instant messaging and playing games). The personality
dimension of neuroticism seemed to be most consistently related to these types of
online activities. While other personality traits were correlated with Information
Exchange (agreeableness) and Leisure (conscientiousness, extraversion), these correlations were not supported by regression analyses and therefore do not merit further discussion. Regarding the effects of neuroticism, both correlation and
regression analyses revealed marginally significant negative associations between
neuroticism and Information Exchange and neuroticism and Leisure. These findings
indicate that individuals who are high in neuroticism are less likely to utilize these
types of Internet activities. While these findings are consistent with some research
presented in the literature (Tuten & Bosnjak, 2001), these results contradict other
published results. Specifically, although Hamburger and Ben-Artzi (2000) reported a
positive relationship between neuroticism and social-leisure activities, our research
was not supportive of this finding. We found a negative relationship between
neuroticism and Leisure activity. One explanation for the inconsistency between the
present results and past research concerns the degree of measurement specificity that
has been utilized when assessing Internet activity. In particular, whereas previous
studies have generally examined only global measures of activity on the Internet, the
present study measured online activities much more precisely (e.g., reported minutes
online). In addition, whereas most previous work has examined individual Internet
use variables as the unit of analysis, this study utilized principal component factors
in all inferential statistical analyses. This approach is generally regarded by psychometricians as being superior to individual variable analyses because principal components are typically more reliable than individual variables (Tabachnick & Fidell,
1989). Regardless of the merits of this study, the association between neuroticism
and leisure Internet use requires further attention in order to clarify the inconsistencies in the literature.
The third aim of this study was to examine whether personality serves to moderate
the association between Internet use and perceived social support. Both significant
and marginally significant interaction effects were found between personality and
Internet use. Regarding the marginal effects, neuroticism was found to interact with
Technical Internet use (bulletin board, chat room, web page, multiuser dungeon) in
that individuals high in neuroticism and high in Technical use reported lower perceived support than any other group. This same trend was found between neuroticism and Information Exchange. Individuals high in neuroticism and high in
Information Exchange reported lower perceived support compared with the other
groups. These effects imply that highly neurotic individuals who use these types of
Internet activities do seem to be at risk for lowered perception of social support.
However, the causal direction of this effect may not be so clear, and in fact, may
actually operate in a reverse manner. That is, highly neurotic individuals who have
very low levels of perceived support might seek out these types of Internet activities
in an effort to compensate for their lowered sense of support. While this issue is
beyond the purpose and scope of this study, future work in this area should try to
elucidate the nature and causal direction of the associations between neuroticism
and these types of Internet activities.
Finally, a significant interaction effect was found between agreeableness and
Leisure. Participants who reported high levels of agreeableness and high levels of
Leisure Internet use perceived themselves as having higher levels of social support,
compared to the other groups. While the current study does not allow for any definitive explanation of this effect, perhaps it is the case that highly agreeable individuals experience more positive interactions when engaging in instant messaging and
online games, which leads to higher quality social interactions and higher levels of
perceived support. Obviously the merits of this "positive social interaction"
hypothesis cannot be discerned from the current study and hence necessitate further
research. It is also unclear as to why agreeableness was the only personality factor to
interact with Leisure in this manner. This issue would likewise benefit from further
research.
In summary, while this study seems to invite as many questions as it addresses,
one should not be surprised by this, given that we are exploring a new area of
research, a new frontier. However, what can be said about these findings is that
although Internet use alone may not strongly influence perceived social support, it
does seem to interact with personality in an important way to influence perceptions
of support. Furthermore, these findings help to address some of the inconsistencies
that have been reported in the literature. Specifically, research in this area has
alternately indicated that Internet use either facilitates or degrades social relationships and social support (Kraut et al., 1998a, 1998b; Parks & Floyd, 1996). What
this study demonstrates is that both of these effects can occur; it is not simply a
question of one or the other. To illustrate, high levels of neuroticism, when combined with high levels of specific types of Internet use, are associated with reduced
feelings of social support. In contrast, high levels of agreeableness, coupled with
high levels of Internet use, lead to an enhancement of perceived support.
While this study may help to address some of the inconsistencies in the Internet-social support literature, we acknowledge that there are some limitations of the
present study that should be considered. First, while the participants were selected to
represent a wide variety of majors, the sample used in this study is still based on
college students and therefore is somewhat limited in its generalizability. Also, while
this study assessed Internet use more precisely than past studies, the measurement of
this variable was based on a self-report approach, rather than on objective behavioral criteria. Therefore, the assessment of Internet use may be somewhat biased due
to memory errors. In addressing these limitations, future research should attempt to
survey a more representative sample that includes both college students as well as
traditional adults. Furthermore, rather than relying on self-reported Internet use, a
behavioral measure (e.g., a computer program that records time spent online) could
be employed which would perhaps yield a more reliable measurement of Internet
activity.
To conclude, while this study has several limitations that need to be addressed in
future work, it nevertheless makes an important contribution to our understanding
of the effects of Internet use. Specifically, these findings indicate that researchers can
no longer look at bivariate relationships or simple main effects of online activity on
social support and expect to understand the complexity of the association between
these two constructs. Other relevant variables, such as personality factors, should
also be considered as they might exert important moderating effects. Future work
should attempt to understand why certain personality traits are associated with
beneficial effects of Internet use (enhanced support) while other personality factors
seem to be associated with more problematic experiences (degraded support). In
focusing on such issues, a more accurate understanding of the relationship between
Internet use and social support may be possible.

Acknowledgements
The authors would like to thank Andy Abrams, Von Bakanic, and Walter Pharr
for allowing us to recruit participants in their classes.

R.J. Swickert et al. / Computers in Human Behavior 18 (2002) 437–451

References
Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: a
meta-analysis. Personnel Psychology, 44, 126.
Block, J. (1995). A contrarian view of the ve-factor approach to personality description. Psychological
Bulletin, 117, 187215.
Bromberg, H. (1996). Are MUDs communities? Identity, belonging and consciousness in virtual worlds.
In R. Shields (Ed.), Cultures of the Internet: virtual spaces, real histories, living bodies (pp. 143152).
London: Sage.
Cohen, S., & Hoberman, H. M. (1983). Positive events and social supports as buers of life change stress.
Journal of Applied Social Psychology, 13, 99125.
Costa, P. T., & McCrae, R. R. (1992a). Four ways ve factors are basic. Personality and Individual Differences, 13, 653665.
Costa, P. T., & McCrae, R. R. (1992b). NEO PI-R professional manual. Odessa, Florida: Psychological
Assessment Resources.
Eysenck, H. J. (1967). The biological basis of personality. Springeld, Illinois: Charles Thomas.
Halamandaris, K. F., & Power, K. G. (1999). Individual dierences, social support and coping with the
examination stress: a study of the psychosocial and academic adjustment of rst year home students.
Personality and Individual Dierences, 26, 665685.
Hamburger, Y. A., & Ben-Artzi, E. (2000). The relationship between extraversion and neuroticism and
the dierent uses of the Internet. Computers in Human Behavior, 16, 441449.
Hittner, J. B., & Swickert, R. J. (2002). Modeling functional and structural social support via conrmatory factor analysis: evidence for a second-order global support construct. Journal of Social
Behavior and Personality (in press).
Joinson, A. (1998). Causes and implications of disinhibited behavior on the Internet. In J. Gackenbach
(Ed.), Psychology and the Internet: intrapersonal, interpersonal, and transpersonal implications (pp. 43
60). San Diego: Academic Press.
Jones, S. G. (1997). The Internet and its social landscape. In S. G. Jones (Ed.), Virtual culture: identity and
communication in cybersociety (pp. 735). London: Sage Publications.
Katz, J. E., & Aspden, P. (1997). A nation of strangers? Communications of the ACM, 40, 8186.
Kiesler, S., & Kraut, R. (1999). Internet use and ties that bind. American Psychologist, 54, 783784.
King, S. A., & Moreggi, D. (1998). Internet therapy and self-help groupsthe pros and cons. In
J. Gackenbach (Ed.), Psychology and the Internet: intrapersonal, interpersonal, and transpersonal implications (pp. 77109). San Diego: Academic Press.
Kraut, R., Kiesler, S., Mukopadhyay, T., Scherlis, W., & Patterson, M. (1998a). Social impact of the
Internet: what does it mean? Communications of the ACM, 41, 2122.
Kraut, R., Patterson, M., Lundmark, V., Kiesler, S., Mukopadhyay, T., & Scherlis, W. (1998b). Internet
paradox: a social technology that reduces social involvement and psychological well-being? American
Psychologist, 53, 10171031.
Lakey, B., & Drew, J. B. (1997). A social-cognitive perspective on social support. In G. R. Pierce,
B. Lakey, I. Sarason, & B. Sarason (Eds.), Sourcebook of social support and personality (pp. 107140).
New York: Plenum Press.
Mankowski, E. S., & Wyer, R. S. (1997). Cognitive causes and consequences of perceived social support.
In G. R. Pierce, B. Lakey, I. Sarason, & B. Sarason (Eds.), Sourcebook of social support and personality
(pp. 141168). New York: Plenum Press.
Martin, M. M., & Anderson, C. M. (1995). Roommate similarity: are roommates who are similar in their
communication traits more satised? Communication Research Reports, 12, 4652.
McCrae, R. R. (1996). Social consequences of experiential openness. Psychological Bulletin, 120, 323337.
McCrae, R. R., & Costa, P. T. (1999). A ve-factor theory of personality. In L. A. Pervin, & O. P. John
(Eds.), Handbook of personality: theory and research (pp. 139153). New York: Guilford.
McKenna, K. Y. A. (1998). The computers that bind: relationship formation on the Internet. Unpublished
doctoral dissertation, Ohio University.

R.J. Swickert et al. / Computers in Human Behavior 18 (2002) 437451

451

McKenna, K. Y. A., & Bargh, J. A. (2000). Plan 9 from cyberspace: the implications of the Internet for
personality and social psychology. Personality and Social Psychology Review, 4, 5775.
Mickelson, K. D. (1997). Seeking social support: parents in electronic support groups. In S. Kiesler (Ed.),
Culture of the Internet (pp. 157178). Mahwah, New Jersey: Lawrence Erlbaum Associates.
Newcomb, T. M. (1961). The acquaintance process. New York: Holt, Rinehart & Winston.
OBrien, T. B., & DeLongis, A. (1996). The interactional context of problem-, emotion-, and relationshipfocused coping: the role of the big ve personality factors. Journal of Personality, 64, 775813.
Parks, M. R., & Floyd, K. (1996). Making friends in cyberspace. Journal of Communication, 46, 8097.
Procidano, M. E. (1992). The nature of perceived social support: ndings of meta-analytic studies. In
C. D. Spielberger (Ed.), Advances in personality assessment (Vol. 9) (pp. 126). Hillsdale, NJ: Lawrence
Erlbaum Associates.
Silverman, T. (1999). The Internet and relational theory. American Psychologist, 54, 780781.
Tabachnick, B. G., & Fidell, L. S. (1989). Using multivariate statistics (2nd ed.). New York: HarperCollins.
Turner, R. J. (1999). Social support and coping. In A. V. Horwitz, & T. L. Scheid (Eds.), A handbook for
the study of mental health: social contexts, theories, and systems (pp. 198210). New York: Cambridge
University Press.
Tuten, T., & Bosnjak, M. (2001). Understanding dierences in web usage: the role of need for cognition
and the ve factor model of personality. Social Behavior and Personality, 29, 391398.
The UCLA Internet Report (2000). Surveying the digital future. Available: www.ccp.ucla.edu.
Watson, D., & Hubbard, B. (1996). Adaptational style and dispositional structure: coping in the context
of the ve-factor model. Journal of Personality, 64, 737773.
Winzelberg, A. (1997). The analysis of an electronic support group for individuals with eating disorders.
Computers in Human Behavior, 13, 393407.

Computers in Human Behavior 27 (2011) 1857–1861

Journal homepage: www.elsevier.com/locate/comphumbeh

Internet use, happiness, social support and introversion: A more fine grained
analysis of person variables and internet activity
M.E. Mitchell a,*, J.R. Lebow a, R. Uribe a, H. Grathouse a, W. Shoger b
a Illinois Institute of Technology, Chicago, IL, United States
b Private Practice, Oak Brook, IL, United States

Article info
Article history: Available online 19 May 2011
Keywords: Internet use; Personality; Happiness; Introversion; Social support

Abstract
The Internet is no longer an advanced technology accessible to a select few. It has become a ubiquitous
tool for users ranging from professional programmers to casual surfers and young children. The
exponential increase in time online has prompted curiosity and speculation about the interaction between this
technology and individual person variables. While general survey data exist regarding broad patterns
of Internet use, less is known about the relationship between specific usage and individual personality
dimensions, mood variables, or social activity. This study sought to clarify several of these relationships.
One hundred eighty-five undergraduate student volunteers completed two detailed measures of Internet
use across various domains (for example: work/school, tasks/services, entertainment), as well as
measures of happiness, perceived social support, and introversion. Specific types of Internet use, including
gaming and entertainment usage, were found to predict perceived social support, introversion and
happiness. Use of the Internet for mischief-related activities (for example: downloading without payment,
fraud, snooping) was associated with lower levels of happiness and social support. These findings support
the utility of and need for specific rather than general Internet research. Directions for future research
clarifying the role of the Internet in quality of life and interpersonal relations are suggested.
© 2011 Elsevier Ltd. All rights reserved.

* Corresponding author. Address: 3105 S. Dearborn, Illinois Institute of Technology, 252 LS, Chicago, IL 60616, United States. Tel.: +1 312 567 3501; fax: +1 312 567 3493.
E-mail address: mitchelle@iit.edu (M.E. Mitchell).
0747-5632/$ - see front matter © 2011 Elsevier Ltd. All rights reserved.
doi:10.1016/j.chb.2011.04.008

1. Introduction
Members of almost every demographic background use the
Internet to stay better connected with loved ones, to
complete daily tasks and transactions quickly and efficiently, and
to stay abreast of current events. Broad survey
studies confirm that Internet use continues to rise, and that previously cited gaps based on age,
gender, technology access and socioeconomic status are quickly disappearing (cf. Fallows, 2004;
Lenhart, Madden, Macgill, & Smith, 2007; Madden, Fox, Smith, &
Vitak, 2007).
The Pew Internet and American Life Project represents one of
the main efforts to gather large-scale data on Internet use. Using
nationwide telephone surveys, most recently in December 2008
(N = 2253), Pew has been a leader in documenting Internet
activity. Those data verify the rapid continued
expansion of Internet use. While the Pew project has characterized
teens as one of the most wired segments of the American population for the past 10 years, it also
reported that Internet penetration reached 74% for all American adults in 2008 (Jones & Fox,
2009), reflecting a sharp increase from 66% 3 years earlier. The
same 2008 survey found that 93% of 12–17 year olds and 89% of
18–24 year olds regularly go online and, though use decreases with age, 87% of those aged 30–34,
78% of 50–54-year-olds, and 45% of those aged 70–75 use the Internet regularly (Jones & Fox,
2009).
Use of the Internet also bridges previously reported (e.g., Lenhart, 2000) gender and ethnicity
gaps. Responses to the Pew 2008 survey indicate that 75% of adult females and 73% of adult
males use the Internet. In addition, 77% of Caucasian respondents,
64% of African-Americans and 58% of Hispanic participants reported regular online use (Jones &
Fox, 2009). Internet usage is also relatively well represented across most income and education
brackets, though usage increases with annual income and education (Jones & Fox, 2009). Overall,
general survey data indicate that despite slight differences in prevalence, Internet
use is not limited to specific demographic profiles. Further, across
almost every demographic group, online use has increased and
continues to do so rapidly. Within a relatively short period of time
the Internet has penetrated American society both swiftly and
thoroughly.
In addition to this trend research, other, less conclusive data
have been gathered regarding the consequences and benefits of
this popular technology. Not only is Internet use in general on
the rise, the sorts of uses for the Internet are diversifying and
broadening in ways that, only a few years ago, would have seemed
unlikely or impossible. In addition to access to products, information, and transactions, users
increasingly turn to the Internet for social reasons. The Internet enables individuals to find new
relationships, fosters more efficient communication within existing relationships, and offers
multitudes of new ways to develop and maintain friendships and romances. It is unclear, however,
how individual person variables and interpersonal variables
interact with this burgeoning use.
Perceived social support has long been recognized (e.g., Barrera,
1986; Cohen & Wills, 1985; Winemiller, Mitchell, Sutliff, & Cline,
1993) to provide a buffer in times of stress, increase happiness,
and enhance psychological well-being. Internet relationships offer
a new avenue for potential experiences of perceived social support,
in which relationships may exist entirely without any face-to-face
interaction. It is an empirical question whether or not interpersonal relationships developed and
maintained predominantly or even entirely over the Internet increase levels of perceived support
and/or convey the same benefits that social support has been
shown to provide in the past.
Some (e.g., Parks & Floyd, 1996) contend that online interactions are shallow approximations of
quality real life relationships, and that cyberspace creates an easily penetrated illusion of
community. This argument suggests the possibility that time spent online, in lieu of participating
in the face-to-face world, might actually detract from an individual's assessment of perceived
social support. Indeed, preliminary survey data suggest that online relationships may not be
equivalent to their face-to-face counterparts.
Virtual interactions are generally marked by higher levels of self-disclosure than face-to-face
interactions (Underwood & Findlay, 2004). Deception and misrepresentation on the Internet are easy
and frequent, and misinterpretation of specific interactions due,
in part, to the absence of nonverbal cues is also a common concern (Wallace, 1999; Whitley, 1997).
The somewhat limited data are mixed; some studies support better outcomes in face-to-face
interactions, whereas others show evidence that online support
carries unique benefits (Bargh, Katelyn, & McKenna, 2004). Controversy exists as a consequence of
contradictory findings, and much remains unknown regarding the benefits and drawbacks of online
social support.
Initial investigations of the emotional benefits and consequences of the Internet (Kraut et al.,
1998; Shklovski, Kraut, & Cummings, 2006) found high levels of Internet use to be associated
with depression and social isolation. Specifically, increased time
online was associated with declines in individuals' communication
with members of their household, declines in the size of their
face-to-face social circle and increased feelings of loneliness.
Amichai-Hamburger, Fine, and Goldstein (2004) found that Internet use was directly related to
feelings of loneliness. These findings have not been consistently supported by other studies. For
example, Bargh et al. (2004) reviewed the existing Internet research
and disputed the conclusion that Internet use contributes to
depression and loneliness, characterizing those findings as exaggerated media-friendly fallacies.
Their review found that the Internet helped reduce these symptoms, facilitating relationships with
long-distance friends and family members and enhancing feelings
of connectivity and community.
In addition to these studies, several authors (e.g., Amichai-Hamburger et al., 2004; Kraut et al.,
1998) have attempted to identify patterns of Internet use in relation to personality variables. The
notion that individuals may be predisposed to excessive use or
avoidance of Internet use is predicated on the view that individual
characteristics underlie this behavior in much the same way as
these same variables would influence face-to-face behavior. Amichai-Hamburger et al. (2004)
examined introversion, identity, and level of neurotic behavior in relation to establishing group
membership. Noting the anonymity of the Internet, the investigators
sought to examine where individuals locate their true identity. They found that
extroverted individuals reported that the self interacting in real
time was a more accurate representation of their identity than
the individual portrayed in Internet interactions; conversely, introverts reported that the self
portrayed in the virtual world more accurately represented their real self.
Though these results were fairly clear, the relationship between
Internet use and introversion and extroversion does not appear
straightforward. Previous research on these personality variables
would lead one to hypothesize that introverts would have lower
levels of interaction than extroverted individuals (cf.
Diener, Larsen, & Emmons, 1984; Fleeson, Malanos, & Achille,
2002; Laney, 2002). However, the capacity for Internet anonymity
and the fact that such behavior occurs while totally alone but can
virtually include others, or at least the idea of others, radically
shifts the understanding of what it means to be solitary, and opens
a path for introverts to reach out to others without actual, real life
interaction. Hence, behavioral patterns associated with individuals
on the basis of qualities such as extroversion or introversion may
no longer have relevance, as interaction has taken on a more complex meaning. As the Internet
begins to play an increasingly large
role in developing and maintaining interpersonal relationships,
personality dimensions such as introversion and extroversion
may not account for variance in behavior in the same way as in
the past.
This study sought to develop a more specific and nuanced picture of Internet use in relation to
introversion, levels of happiness, and social support by using various domains of Internet use. The
current study identified six specific domains of usage to determine
whether a model could be produced in which type of Internet use predicted levels of happiness,
perceived support, and introversion, and to determine whether there were differences between
groups of individuals by type of Internet use.

2. Materials and methods


One hundred eighty-five undergraduate students at a Midwestern technological university
participated in the study. They ranged in age from 18 to 30 years (M = 20.5 years). The sample
consisted of 124 males and 61 females. The ethnic composition included
110 Caucasian, 44 Asian, 6 African American, 6 Hispanic, and 17
other designated students. Participant volunteers were emailed a
one-time use individualized link to an anonymous web-based
survey site at which demographic and background information
items and nine self-report measures were available for completion.
There were two measures of happiness, two measures of introversion, two measures of social
support, and three measures of Internet use. Consent was obtained electronically at the start of the
survey.
The measures used were as follows:
Bradburn Affect Balance Scale (ABS; Bradburn, 1969). The ABS was designed to assess the
balance between positive and negative affect experienced during
the 4 weeks prior to administration. It consists of 10 "yes" or
"no" items that measure the affective component of subjective
well-being. The authors report that the ABS has adequate validity.
Subjective Happiness Scale (SHS; Lyubomirsky & Lepper,
1999). The SHS is a measure of global happiness and consists of
four items rated on a seven-point Likert scale. The authors reported
excellent internal consistency across age, occupation, language,
and culture (r = .86). They also demonstrated strong test–retest
reliability over time (r = .72). There was convergence with other
published measures of happiness and well-being (r = .62); it did
not correlate with constructs thought to be unrelated to happiness.
Social Support Appraisal Scale (SS-A; Vaux et al., 1986). The SS-A is a 23-item questionnaire
evaluating perceived support of friends, family, and others. It has been shown to have good internal
consistency, and adequate concurrent, divergent, and convergent
validity.
Multidimensional Scale of Perceived Social Support (MSPSS; Zimet, Dahlem, Zimet, & Farley, 1988).
The MSPSS is a 12-item measure designed to examine one's subjective assessment of social
support adequacy from family, friends, and other significant others. The authors reported excellent
total scale reliability (r = .88), as well as excellent test–retest reliability (r = .85).
Myers–Briggs Type Indicator (MBTI; Hirsch & Kummerow,
1989). The MBTI provides a useful measure of personality based
on eight personality preferences. The eight preferences are organized into four bi-polar scales
(extroversion–introversion; sensing–intuiting; thinking–feeling; judging–perceiving). For the
current study, only the extroversion–introversion scale was used.
Self-Assessment for Introverts (SAI; Laney, 2002). This measure assesses an individual's level of
introversion and contains 30 "true" or "false" items. Scores are categorized into three groups:
introvert, middle of the continuum, or extrovert. This measure
is unique because it conceptualizes introversion as distinct from
extroversion. No psychometric data have been reported for this
measure.
Internet Usage Survey (IUQ). This questionnaire was developed
for the current study to measure solitary Internet usage not involving interaction, real or virtual,
with others. The measure consists of
thirty questions across six domains of solitary Internet activity:
purchasing, information-seeking, tasks/services, entertainment,
work/school-related activities and mischief. Each domain contains
five items assessing the frequency of different aspects of use. The
five items within each domain include endorsement (or not) of
participation in that domain at any time, acknowledgment (or denial) of current use, and an item
asking about specific activities within the domain. Examples of domain-specific activities are
illustrated in Table 1. The IUQ yielded a Cronbach's alpha coefficient of .80. Test–retest
reliabilities within domains ranged as follows: information seeking, r = 1 to r = .22; purchasing,
r = .60 to r = .02; entertainment, r = .83 to r = .22; mischief, r = .71 to
r = .23; tasks and services, r = .73 to r = .11; work and school, r = 1
to r = .19. The lowest reliabilities were associated with activities
that logically are low frequency.
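For readers unfamiliar with the statistic, Cronbach's alpha can be computed from item-level responses as k/(k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The Python sketch below illustrates the calculation on made-up responses; the function name and the data are assumptions for this example and are not the IUQ data.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    Uses sample variances: alpha = k/(k-1) * (1 - sum(item variances) / variance(totals)).
    """
    n_items = len(scores[0])
    item_variances = [variance([row[i] for row in scores]) for i in range(n_items)]
    total_variance = variance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses: three respondents answering two items.
toy_scores = [[1, 2], [2, 1], [3, 3]]
toy_alpha = cronbach_alpha(toy_scores)
```

Alpha approaches 1 when items rise and fall together across respondents and drops toward 0 (or below) when they do not.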
Modes of Interaction Questionnaire (MOIQ). Developed for the
current study, this questionnaire assesses the different ways an
individual can interact with others on the Internet, and the frequency and duration of those
activities. It comprises seven
categories of interactive activity: instant messaging,
audio/video conferencing, virtual dating, interactive online
games/activities, emailing, chatrooms, and cybersex. Respondents
are asked to indicate the frequency with which they engage in each
mode of interaction. Within each domain, respondents are asked
first to endorse or deny ever having engaged in activities within
the domain; a second item then requires the respondent to
acknowledge or deny current activities within the domain. For
each item, the respondent is asked whether a real identity, a virtual identity, or a combination of
the two was used when engaging in
the activity, the frequency of the activity within the past 30 days,
and finally the average amount of time spent per session in
the activity.
Table 1
Example activities associated with Internet domains.
Purchasing: Travel related; Entertainment/leisure; Food; Clothing and accessories; Sporting goods; Books; Medical; Electronics; Other
Information seeking: Job hunting/researching employers; School application process; Mental and medical health; Politics; Diet/exercise; Potential purchases; News; Reference; Housing; Transportation; Finance; Personal interests/hobbies; Electronic repairs; Religious/spiritual purposes; Environment; Other
Entertainment: Online games; Music; Movies; Viewing pornography; Gambling; Internet surfing; Literature; Sports; Fantasy sports; Other
Tasks and/or services: Banking; Bills; Investing; Personal information tasks; Electronic repairs; Selling; Other
Mischief: Hacking; Snooping/lurking; Downloading without payment; Stealing; Fraud; Plotting; Other
Work/school related tasks: Online classes; Class assignments; Internet based work; Other

Coefficient alpha of the MOIQ was .89. Spearman reliabilities ranged from r = .81 to .64 for use of
instant messaging; r = 1 to r = .57 on email use; r = .66 to .05 on audio/video conferencing;
r = .70 to r = .21 on use of chat rooms; r = .92 to r = .27 on
virtual dating; r = .87 to .12 on cybersex; and finally r = 1 to .17
for online gaming. The most unstable items were those reporting
levels of current use, suggesting that use varied rapidly within a
brief period of time for video conferencing, cybersex, and virtual
dating. Use of email was reported by 100% of respondents.
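Spearman coefficients of the kind reported above are Pearson correlations computed on ranks, with tied values sharing their average rank. The following Python sketch shows the calculation for a test and retest administration of one item; the function names and the sample responses are hypothetical and are not taken from the MOIQ data.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2.0 + 1.0
        for i in order[pos:end + 1]:
            ranks[i] = shared_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical frequency ratings from five respondents at time 1 and time 2.
time_1 = [1, 2, 3, 4, 5]
time_2 = [2, 4, 5, 9, 20]
retest_rho = spearman(time_1, time_2)
```

Because only the ordering of responses matters, the coefficient is 1 whenever respondents keep the same rank order at retest, even if their raw frequencies change.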
Two additional measures were included to control for clinically
significant levels of interpersonal distress that were thought to
potentially confound the measurement of introversion. These were
the Fear of Negative Evaluation (FNE; Watson & Friend, 1969) and
the Social Avoidance and Distress Scale (SAD; Watson & Friend,
1969). Neither measure was found to be significant, and both were eliminated
from further analysis.

3. Results
Almost all participants used the Internet for purchasing, information seeking, work/school, tasks
and services and entertainment purposes. Online mischief (e.g., theft, illegal downloads, etc.) was a
unique category, with 119 participants reporting that they did not currently
engage in mischief whereas 63 endorsed such activities. Eighty-four participants indicated that they
had engaged in mischief in the past and 99 denied ever engaging in mischief activities.
A series of regressions were computed to determine if Internet
use across the various domains could predict level of social support, happiness, or introversion.
Gaming and mischief predicted total support as measured by the SS-A (R² = .11; F = 10.66;
p < .001). Mischief predicted the level of happiness as measured
by the SHS (R² = .04; F = 6.68; p = .01). As expected, time
spent on solitary tasks predicted introversion as measured by the
SAI (R² = .02; F = 3.92; p < .05). Entertainment predicted introversion
as measured by the MBTI (R² = .04; F = 7.36; p < .007).
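In ordinary least-squares regression, the overall F statistic and R² are linked by F = (R²/k) / ((1 − R²)/(n − k − 1)), where k is the number of predictors and n the number of complete cases. The Python sketch below implements this textbook relation in both directions. It is offered only as an aid to reading regression summaries; the function names are assumptions, and the degrees of freedom for the analyses above are not reported in the text.

```python
def f_from_r2(r2, n_cases, n_predictors):
    """Overall F statistic implied by R^2 for an OLS model with an intercept."""
    df_model = n_predictors
    df_error = n_cases - n_predictors - 1
    return (r2 / df_model) / ((1.0 - r2) / df_error)


def r2_from_f(f_value, n_cases, n_predictors):
    """R^2 implied by the overall F statistic (inverse of f_from_r2)."""
    df_model = n_predictors
    df_error = n_cases - n_predictors - 1
    return (f_value * df_model) / (f_value * df_model + df_error)


# Hypothetical example: 12 cases, one predictor, half the variance explained.
example_f = f_from_r2(0.5, n_cases=12, n_predictors=1)
```

Small R² values can still yield a significant F when the sample is large, because the error degrees of freedom enter the ratio directly.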
A comparison between groups by mischief (yes/no) demonstrated no difference in social support,
happiness or introversion. However, the groups differed in online activity: individuals endorsing
mischief engaged in more online purchasing, chatting and cybersex.
The groups were then divided by the level of mischief as follows:
denial of any mischief, only illegal downloads, and serious mischief
activities, i.e., fraud or visiting bomb web sites. There was a significant difference by group in
happiness as measured by the SHS (F = 4.3; p < .015) and in social support as measured by the MSPSS
(F = 3.55; p < .03). The group engaging in serious mischief had lower levels of happiness and the
highest level of perceived support, whereas the illegal downloads group had the lowest level of
support. Additionally, the group with the highest frequency of mischief engaged in the most
chatting (F = 8.24; p < .001), cybersex (F = 6.90; p < .001), purchasing (F = 4.17; p < .02), and
tasks and services (F = 3.02; p < .05), and the mean level of activity decreased by group
membership on each variable, with those not endorsing mischief
having the lowest levels of activity.
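Group comparisons of this kind are typically tested with a one-way analysis of variance, in which F is the ratio of between-group to within-group mean squares. The sketch below computes that ratio in plain Python for three groups; the function name and the scores are hypothetical, and the code is not a reconstruction of the analysis reported here.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA across a list of groups of scores."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Between-group sum of squares: group size times squared mean deviation.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Within-group sum of squares: squared deviations from each group's mean.
    ss_within = sum(
        (value - mean) ** 2
        for group, mean in zip(groups, group_means)
        for value in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical happiness scores for three mischief groups.
no_mischief = [3, 4, 5]
downloads_only = [2, 3, 4]
serious_mischief = [1, 2, 3]
anova_f = one_way_anova_f([no_mischief, downloads_only, serious_mischief])
```

A significant F indicates only that at least one group mean differs; follow-up comparisons are needed to say which groups differ from which.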
A series of regressions also were computed to determine if the
composite scores for Internet use as measured by the IUQ and
the MOIQ could predict happiness, social support, or introversion.
Total minutes of Internet use as measured by the IUQ and the
MOIQ predicted total happiness (F = 34.7; p < .001; R² = .92) as
measured by the SHS. There were no significant findings for social
support or introversion.

4. Discussion
The penetration of the Internet into both professional and interpersonal life domains has the
potential for far-reaching implications for quality of life and social interactions. This study
confirmed Pew findings (Jones & Fox, 2009) that the Internet has
become ubiquitous for a wide variety of uses, including entertainment, purchasing,
information-seeking, tasks and services and work and school purposes. Participants in this study
varied in the amount of time they spent online, and higher levels of participation in certain
activities significantly predicted lower levels of happiness and social support and higher levels
of introversion. Specifically, these results suggest that heavy Internet use in specific domains
(i.e., gaming and mischief) is associated with a diminution of
an individual's perceived social support, which in turn suggests a
risk of a variety of problems, since the relationship between social support and well-being has
been so robust.
Also, individuals who spent more time online engaged in activities
categorized as entertainment were more introverted. It appears
that merely examining overall use of the Internet in relation to
well-being or happiness may not be as useful as a more fine
grained analysis of Internet activities in relation to specific person
variables. This study was the first step in developing a model that
can be tested to determine if the relationship between types of
Internet use and person variables is of sufficient strength to have
utility in, for example, identifying youth at risk.
Internet use for the purpose of mischief was associated with
lower levels of happiness. Differences also were found in specific
types of Internet use between participants who reported engaging in
mischief and those who did not. Participants endorsing mischief
spent significantly more time online engaged in cybersex, purchasing and chatting. When
participants were further divided into
three groups (less-serious mischief, i.e., downloads only; serious mischief; and no mischief),
additional group differences were found.
Individuals endorsing serious mischief were less happy, and yet
had higher levels of perceived social support than the less serious
mischief and no mischief groups. These results suggest that there
might be a unique profile for individuals who engage in more serious mischief activities. The
unexpected combination of high perceived social support and low happiness warrants further
investigation. A reasonable but perhaps counterintuitive interpretation is that this combination of
lower happiness and high social support indicates the existence of a subpopulation of relatively
well-connected mischief engagers.

The specific combination of activities endorsed by the individuals within the most serious mischief
group (i.e., higher levels of cyber-sexual activity, high levels of talking online and spending) is
consistent with behavior exhibited by individuals frequently described as hypomanic or bipolar. It
is unclear, however, whether this psychological profile would represent an accurate characterization
of the overall adjustment or functioning of these participants, particularly since the sample was
small and homogeneous. This also warrants additional investigation, as such individuals are at risk
for an array of problems, and mischief on the Internet may be a good marker variable for early
identification of such individuals.
Limitations of this study pertain directly to the sample size and
limits of generalizability. Participants were predominantly male
college students attending a science and technology-focused university. It is possible that
Internet use may be somewhat different
in this sample as compared with the general population. Additionally, because the Internet is such
a dynamic and rapidly changing technology, it seems unlikely that the Internet use measures created
for this study were able to tap into all the possible domains
of specific types of use. This study underscores the need for focused
research examining specific aspects of Internet use that takes into
account the dynamic nature of the Internet and how it relates to
individual differences and interpersonal interaction.
Acknowledgments
The authors wish to thank other members of the research team
who contributed to this effort including Frank Connors, Sapna Ram,
Manasa Kasinath, Alexis Kramer, Bethany Grix, Jennifer Marola,
and Morgan Carey, Illinois Institute of Technology, Chicago, IL.
References
Amichai-Hamburger, Y., Fine, A., & Goldstein, A. (2004). The impact of Internet
interactivity and need for closure on consumer preference. Computers in Human
Behavior, 20, 103–117.
Bargh, J., Katelyn, Y., & McKenna, A. (2004). The Internet and social life. Annual
Review of Psychology, 55, 573.
Barrera, M., Jr. (1986). Distinctions between social support concept, measures, and
models. American Journal of Community Psychology, 14, 413–445.
Bradburn, N. (1969). The structure of psychological well being. Chicago, IL: Aldine.
Cohen, S., & Wills, T. A. (1985). Stress, social support, and the buffering hypothesis.
Psychological Bulletin, 98, 310–357.
Diener, E., Larsen, R. J., & Emmons, R. A. (1984). Person × situation interactions:
Choice of situations and congruence response models. Journal of Personality and
Social Psychology, 47, 580–592.
Fallows, D. (2004). The Internet and daily life: Many Americans use the Internet in
everyday activities, but traditional offline habits still dominate. Pew Internet and
American life project. <http://pewInternet.org/pdfs/>.
Fleeson, W., Malanos, A. B., & Achille, N. M. (2002). An intra individual process
approach to the relationship between extraversion and positive affect: Is acting
extraverted as good as being extraverted. Journal of Personality and Social
Psychology, 83(6), 1409–1422.
Hirsch, S., & Kummerow, J. (1989). Life types. New York: Warner Books, Inc.
Jones, S., & Fox, S. (2009). Generations online in 2009. Pew Internet and American
life project. <http://www.pewInternet.org/Reports/2009/Generations-Online-in-2009.aspx>.
Kraut, R., Lundmark, V., Patterson, M., Kiesler, S., Mukopadhyay, T., & Scherlis, W.
(1998). Internet paradox: A social technology that reduces social involvement
and psychological well-being? American Psychologist, 53(9), 1017–1031.
Laney, M. O. (2002). The introvert advantage: How to thrive in an extroverted world.
New York, NY: Workman Publishing Company, Inc.
Lenhart, A. (2000). Who's not online: 57% of those without Internet access say they
do not plan to log on. Pew Internet and American life project. <http://www.
pewInternet.org//media//Files/Reports/2000/Pew Those_Not_Online_ Report.
pdf.pdf>.
Lenhart, A., Madden, M., Macgill, A. R., & Smith, A. (2007). Teens and social media:
The use of social media gains a greater foothold in teen life as they embrace the
conversational nature of interactive online media. Pew Internet and American life
project. <http://www.pewInternet.org/pdfs/PIP_Teens_Social_Media_Final>.
Lyubormirsky, S., & Lepper, H. S. (1999). A measure of subjective happiness:
Preliminary reliability and construct validation. Social Indicators Research, 46(2),
137155.
Madden, M., Fox, S., Smith, A., & Vitak, J. (2007). Digital footprints: Online identity
management and search in the age of transparency. Pew Internet and American
life project. <http://www.pewInternet.org/pdfs/PIP_Digital_Footprints>.

M.E. Mitchell et al. / Computers in Human Behavior 27 (2011) 18571861


Parks, M., & Floyd, K. (1996). Making friends in cyberspace. Journal of
Communication, 46(1), 8098.
Pew Internet and American Life Project. (2008). Demographics of Internet users.
<http://www.pewInternet.org/Data-Tools/Download-Data/~/media/Infographics/
Trend%20Data/January%202009%20updates/Demographics%20of%20Internet%20
Users%201%206%2009.jpg>.
Shklovski, I., Kraut, R., & Cummings, J. (2006). Routine patterns of Internet use and
psychological well-being: Coping with a residential move. In CHI 2006
proceedings, online communities (pp. 969978), Montreal, Quebec, Canada.
Underwood, H., & Findlay, B. (2004). Internet relationships and their impact on
primary relationships. Behavior Change, 21(2).
Vaux, A., Philips, J., Holly, L., Thomson, B., Williams, D., & Steward, D. (1986). The
social support appraisals (SSA) scale: Studies of reliability and validity.
American Journal of Community Psychology, 14, 195219.

1861

Wallace, P. (1999). The Psychology of the Internet. New York, NY: Cambridge
University Press.
Watson, D., & Friend, R. (1969). Measurement of social-evaluative anxiety. Journal of
Consulting and Clinical Psychology, 33, 448457.
Whitley, E. (1997). In cyberspace all they see are your words: A review of the
relationship between body, behavior and identity drawn from the sociology of
knowledge. Information, Technology and People, 10(2), 147163.
Winemiller, D., Mitchell, M. E., Sutliff, J., & Cline, D. (1993). Measurement strategies
in social support. Journal of Clinical Psychology, 49, 638648.
Zimet, G. D., Dahlem, N. W., Zimet, S. G., & Farley, G. K. (1988). The
multidimensional scale of perceived social support. Journal of Personality
Assessment, 52, 3031.

Digital Investigation (2005) 2, 23-30

www.elsevier.com/locate/diin

Trojan defence: A forensic view

Dan Haagman*, Byrne Ghavalas
7Safe Information Security, Ashwell Point, Babraham Road, Sawston, Cambridge CB2 4LJ, United Kingdom

Abstract The Trojan defence; "I didn't do it, someone else did": myth or reality? This two-part article investigates the fascinating area of Trojan & network forensics and puts forward a set of processes to aid forensic practitioners in this complex and difficult area. Part I examines the Trojan defence, how Trojan horses are constructed and considers the collection of volatile data. Part II takes this further by investigating some of the forensic artefacts and evidence that may be found by a forensic practitioner and considers how to piece together the evidence to either accept or refute a Trojan defence.
© 2005 Elsevier Ltd. All rights reserved.

A background to the Trojan defence

This two-part article examines some of the issues surrounding the Trojan defence from the perspective of the forensic practitioner. Before we start, however, here are some comments worth considering:

"A landmark trial recently found that illegal pornography had been placed on an innocent man's computer by a Trojan program ..."1
- BEWARE TROJANS BEARING GIFS, BY NEIL BARRETT, IT WEEK, 03 JUNE 2003

"Julian Green, 45, endured nine months of being branded a paedophile before it was proved that the 172 images were caused by a computer virus."2
- CHILD PORN VIRUS WRECKED MY LIFE, BY RICHARD ALLEN, EVENING STANDARD, 31 JULY 2003

"The acquittal of a teenager accused of carrying out a high-profile hack attack has cast doubts over future computer crime prosecutions, say experts."3
- QUESTIONS CLOUD CYBER CRIME CASES, BBC NEWS UK EDITION, 17 OCTOBER 2003

A forensic analysis of Caffrey's computer revealed no trace of a Trojan. Graham Cluley, senior technology consultant at the security firm Sophos, said "The Caffrey case suggests that even if no evidence of a computer break-in is unearthed on a suspect's PC, they might still be able to successfully claim that they were not responsible for what their computer does, or what is found on its hard drive."4

* Corresponding author. Tel.: +44 1223 830 007; fax: +44 1223 832 007.
E-mail address: dan.haagman@7safe.com (D. Haagman).
1 http://www.itweek.co.uk/comment/1141339.
2 http://www.thisislondon.co.uk/news/articles/6026981?source=evening%20standard.
3 http://news.bbc.co.uk/1/hi/technology/3202116.stm.

1742-2876/$ - see front matter © 2005 Elsevier Ltd. All rights reserved.
doi:10.1016/j.diin.2005.01.010

The Trojan defence places a lot of pressure on
the prosecution, which in turn places pressure on
the forensic investigators to prove, beyond all
reasonable doubt, that the accused is responsible
for the evidence located on the computer.
Mark Rasch of SecurityFocus comments in his article, "The Giant Wooden Horse Did It!"5 that this defence is all the more frightening because it could be true. He asks, "...if you were a hacker, would you want to store your contraband files on your own machine, or, like the cuckoo, would you keep your eggs in another bird's nest?"
Storing files on other systems is a common tactic
for attackers. Individuals who share copyright
protected materials store their contraband on
high-speed servers; hackers store their rootkits
or other tools on compromised systems or other
publicly accessible servers. No doubt many forensic practitioners have seen examples of this;
however, the Honeynet Project6 has several challenges, which show evidence of this practice.
Rasch further points out, In late December
2003, companies around the world began to report
a new kind of cyber-attack that had been apparently going on for about a year. Cyber extortionists
(reportedly from Eastern Europe) threatened to
plant child pornography on their computers and
then call the cops if they didn't agree to pay
a small fee. Unless the recipient pays a nominal
amount ($30), the hacker claims he will either
wipe the hard drive or plant child porn. The
possibility of Trojans and the relative ease with
which they could be used to promulgate such an
attack made the threats credible.
It is clear that the Trojan defence needs to be carefully considered. As forensic practitioners, it is important that whenever an examination is conducted, we keep the Trojan defence possibility at the forefront of our minds. All existing Trojans can be detected provided forensic examiners know how to identify and process the digital traces. The methodologies used to conduct an investigation differ from practitioner to practitioner; however, this two-part article aims to show some steps that should be considered which might substantiate or refute the Trojan defence.

Definitions
First, it is worth looking at the definition of a Trojan and how it relates to backdoors. According to Wikipedia7 a "Trojan horse or Trojan is a malicious program that is disguised as legitimate software ... Trojan horse programs cannot replicate themselves, in contrast to some other types of malware, like viruses or worms. A Trojan horse can be deliberately attached to otherwise useful software by a programmer, or it can be spread by tricking users into believing that it is a useful program."
A Trojan is simply a delivery mechanism. It contains a payload to be delivered elsewhere. The payload may consist of almost anything, such as a piece of spyware, adware, a backdoor, implanted data or simply a routine contained within a batch file. Additional tools such as keyloggers, packet generation tools (for denial-of-service attacks) and sniffers may form part of the payload. Space does not allow us to discuss each of these in turn, so we will instead concentrate on backdoors themselves as part of the overall Trojan debate.
The above properties are important to an
analyst. Finding the original infection vector or
artefacts relating to the Trojan could influence the
timeline and validity of evidence. Locating the
actual Trojan and understanding its payload and
capabilities is exceptionally useful when building
(or defending) a case.
Wikipedia explains, "A backdoor in a computer system (or a cryptosystem, or even in an algorithm) is a method of bypassing normal authentication or obtaining remote access to a computer, while intended to remain hidden to casual inspection. The backdoor may take the form of an installed program (e.g., Back Orifice) or could be a modification to a legitimate program. ... Many computer worms, such as Sobig and Mydoom, install a backdoor on the affected computer (generally a PC on broadband running insecure versions of Microsoft Windows and Microsoft Outlook). Such backdoors appear to be installed so that spammers can send junk email from the machines in question."8

4 http://news.bbc.co.uk/1/hi/technology/3202116.stm.
5 http://www.theregister.co.uk/2004/01/20/the_giant_wooden_horse_did/.
6 http://project.honeynet.org/scans/index.html.
7 http://en.wikipedia.org/wiki/Trojan_horse_%28computing%29.
8 http://en.wikipedia.org/wiki/Backdoor.

Again, the properties of the backdoor can influence the case. For example, many investigators use an anti-virus (AV) tool to process the forensic image. The AV tool will highlight files containing malware, including backdoors. However, the mere existence of the files does not necessarily mean that the backdoor was ever active. Establishing this fact can be crucial and will be revisited in Part II of this article.

Trojan making: binders, wrappers and joiners

Remember that Trojans are delivery vehicles for some form of payload. But how are they made? What tools are available to do this? Many Trojans are created by Trojan-making kits, which are often referred to as "wrappers" because they wrap the functionality of malicious software into other carrier software. The final innocent-looking package is then distributed through whatever means the malware author deems appropriate, be it to a mass audience, to targeted groups or direct to individuals. Typical distribution mechanisms are:
- P2P
- email
- file sharing and removable media
- direct implant through hacking, etc.

The process for Trojan making has been around for a very long time and the kits are widely distributed and vary in quality and complexity. Traditionally, we saw tools that would allow an attacker to take their own preconfigured backdoor and then wrap it with an executable of their choice, as per the diagram shown in Fig. 1.

The kit (be it GUI or command-line driven) will usually give options of how to unpack each piece of software. Take, for example, EliteWrap.9 In Fig. 2, the malware author can take a game called newgame.exe and some malicious payload called malware.exe and bind them together. Over time this process has become widespread, with graphical tools making it extremely simple.

Figure 1 The process of binding a backdoor to a game. (Diagram labels: GAME, BACKDOOR, TROJAN HORSE.)

9 http://www.packetstormsecurity.org/trojans/elitewrap.zip (originally from: http://www.holodeck.f9.co.uk/elitewrap).

Changing shape


The terms "packer" and "compressor" are often used interchangeably to describe utilities that essentially change the binary structure of a file by drawing out or compressing unnecessary space within it. Simple examples are archive compression utilities such as WinZip or a UNIX equivalent such as GZip.

Taking a well-known backdoor that is readily detectable by AV and compressing it using normal archive compression such as WinZip would still result in AV flagging the backdoor (the vendors hold a signature for that level of compression, so a simple match is revealed). However, many other compression algorithms are at the attacker's disposal which, when run on the normally detectable backdoor, create a file with a new binary signature. The result is that the backdoor becomes undetectable to AV until it is decompressed (Fig. 3).
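
The effect of compression on signature matching can be illustrated with a minimal Python sketch. It is not how any particular AV product works: the byte marker, the function names and the choice of zlib as the "known" compression format are all assumptions made for illustration. A naive scanner that looks for a fixed byte pattern misses the pattern once the file is compressed, while a scanner that first unpacks formats it understands finds it again.

```python
import zlib

# Hypothetical signature: a harmless stand-in for the byte pattern a vendor
# might associate with a known sample.
SIGNATURE = b"EXAMPLE-BACKDOOR-MARKER"


def naive_scan(blob: bytes) -> bool:
    """Return True if the signature appears verbatim in the blob."""
    return SIGNATURE in blob


def scan_with_unpacking(blob: bytes) -> bool:
    """Scan the blob as-is, then retry after zlib decompression if possible."""
    if naive_scan(blob):
        return True
    try:
        return naive_scan(zlib.decompress(blob))
    except zlib.error:
        # Not a zlib stream (or an unknown packer): nothing more we can check.
        return False


# Stand-in for a detectable file and its compressed form.
sample = b"\x00" * 64 + SIGNATURE + b"\x00" * 64
packed = zlib.compress(sample)

print("raw file flagged:", naive_scan(sample))
print("packed file flagged by naive scan:", naive_scan(packed))
print("packed file flagged after unpacking:", scan_with_unpacking(packed))
```

The second scanner only helps for formats it knows how to unpack, which mirrors the point above: an unfamiliar compression algorithm leaves the scanner with nothing to match until the payload is decompressed.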
Naturally there are some people who do not
have the knowledge of how to run compression
utilities directly. Instead they download kits (packers)10 many of which are driven by a simple GUI.
They can then test to see whether their payloads
will trigger typical AV engines and the pattern files
within. Decompression would expose the malware
to the AV engine, but this can be overcome by
deploying an AV killer.

Anti-virus and personal-firewall killers

As the names of these tools suggest, they are designed to shut down or disable the protection afforded by traditional AV and personal firewall software on the client/victim machine. They exist in several forms including standalone AV killers, standalone firewall killers or combination tools that address both; for example, kILLer by illwill.11

So how do these tools influence our work as practitioners? The technology is always moving and the vendors are continually developing their AV and personal firewall software. Unfortunately, the hackers are undertaking field tests on how their killers work against the latest AV engine/pattern file, etc.

If a victim receives and inadvertently executes some malicious software that deploys an AV killer before launching the main payload, we are unlikely to see any events or logs alerting us to this incident. Furthermore, if a clean-up operation is subsequently performed and AV reinstated, we may not have enough recoverable evidence to refute some claims. So is AV good enough for us nowadays? We will address this issue in Part II of this article.

10 http://www.programmerstools.org/packers.htm.
11 http://www.illmob.org.

Figure 2 EliteWrap used to wrap two pieces of software together; the original game (game.exe) will unpack in the foreground when executed whilst malware.exe unpacks itself in a stealthy manner.


Piecing evidence together

When considering the Trojan defence and the issue of Trojans and backdoors in their entirety, it is important to understand what artefacts may be found in an investigation. Generally, we classify the classic backdoor/Trojan kit into three components:
1. Server: the backdoor itself, often wrapped up into the overall Trojan; configured with specific options and may also include other helper modules termed plugins.
2. Client: used to control the backdoor from a remote location.
3. Creation tool/kit: used to configure the behaviour of the backdoor before it is released to the intended victim/s.

Of course, if we were to find anything other than the server part of the overall kit on a suspect's machine, then questions would need to be raised as to why a creation or remote control GUI was also present.


Figure 3 Compression used to make a backdoor unrecognisable to AV. (Diagram labels: DETECTABLE, COMPRESSION, UNDETECTABLE.)

Trojan scenarios

So what can the overall Trojan package do? What evidence would be left behind? Let us now take two scenarios, which we will build upon in Part II, as follows.

Scenario 1

In this scenario, the victim has up-to-date AV present on their machine and downloads a game from a Peer-To-Peer network such as KaZaa. The game is in this case a Trojan horse designed to deliver a number of payloads including a backdoor, as shown in Fig. 4.

You will note that the backdoor has been compressed, so any AV engines (including that of the investigator) will not necessarily detect the backdoor payload. The problem for the attacker, however, is that as soon as the backdoor is released and decompresses, it could trigger an AV response. To combat this, the Trojan first delivers its AV killer, designed to disable the AV engine. Once that is complete, the backdoor is deployed and installs itself in stealth mode, allowing the attacker remote access to the victim machine. At the same time, the backdoor notifies its owner (the attacker) of its presence via email, establishing an outbound connection over a port that is likely to be open in the user's personal firewall settings (SMTP, TCP/25). If we were to place a network sniffer between the victim's machine and the Internet, it is likely that the notification output would be captured and could be analysed (provided the backdoor uses no encryption).

Figure 5 A more complex example of the potential payloads a forensic investigator may have to contend with when analysing a Trojan horse (or the traces of one) found on a system. (Diagram labels: TROJAN HORSE comprising GAME.EXE, AV KILLER, FW KILLER, COMPRESSED BACKDOOR, FALSE REGISTRY ENTRIES.)

Scenario 2

If the above scenario was not bad enough, consider the same type of Trojan deployment, but also with an FW killer to disable personal firewall software and a routine that could implant false registry keys into the victim's system. Such keys could ensure stealthy start-up of rogue processes or could even add falsified histories relating to Internet surfing activity. The possibilities are numerous, as shown in Fig. 5.

Whilst all this may seem rather complex and possibly too difficult to achieve, remember that tools have emerged that automate much of the above. We now see all-in-one kits such as OptixPro, which makes the overall Trojan, configures an integral backdoor and has features such as AV killing (Fig. 6).

Considering volatile evidence


To date forensic practitioners have developed various methodologies for dealing with a computer crime scene, which comply with various rules and best practices. One of the primary rules for processing a computer crime scene is "[to] Acquire the evidence without altering the original."12 To this end, many forensic practitioners take the approach of "pulling the plug" on a suspect computer. The rationale is that whilst volatile information such as running processes, network connections and data stored in memory is lost, the evidence on the hard disk should remain intact. Naturally, there are pros and cons to every option; as we know, simply doing nothing still changes data and therefore the evidence.
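
One common way to demonstrate that acquired evidence has not been altered is to record a cryptographic digest at acquisition time and recompute it later. The following Python sketch shows the idea under stated assumptions: the function names are hypothetical, SHA-256 is our choice of algorithm, and the article itself does not prescribe a specific hashing procedure.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat even for multi-gigabyte images.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_unchanged(path, recorded_digest):
    """Return True if the file still matches the digest recorded earlier."""
    return sha256_of(path) == recorded_digest
```

In use, the digest returned by `sha256_of` would be written into the case notes when the image is taken, and `verify_unchanged` called before and after each examination of a working copy.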
12 Kruse, Warren G. and Jay G. Heiser. Computer Forensics: incident response essentials. Indianapolis: Addison-Wesley, 2002.

Figure 4 Trojan horse game containing an AV killing agent together with a compressed backdoor.

Figure 6 Part of the configuration options within OptixPro.

It is our belief that, whenever possible and especially considering a potential Trojan defence, volatile information should be gathered. Very often, this volatile data can be used to help an investigator during the offline investigation. A list of open network ports can help support or refute the presence of an active backdoor; memory often contains useful information such as decrypted applications or passwords; and sometimes malicious code that has not been saved to disk and only runs from memory can be obtained (as in the case of the Code Red worm).
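
As a rough illustration of how a list of open ports might be reviewed, the Python sketch below parses text in the style of Windows `netstat -an` output and reports TCP listeners that are not on an expected baseline. The sample output, the baseline set and the function names are assumptions for illustration only; real output varies between Windows versions, and an unexpected listener is a lead to investigate, not proof of a backdoor.

```python
# Hypothetical baseline of TCP ports this machine is expected to listen on.
EXPECTED_PORTS = {135, 139, 445}

# Illustrative text in the style of `netstat -an` on Windows (not real output).
SAMPLE = """\
  Proto  Local Address          Foreign Address        State
  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING
  TCP    0.0.0.0:445            0.0.0.0:0              LISTENING
  TCP    0.0.0.0:31337          0.0.0.0:0              LISTENING
  TCP    192.168.1.5:1034       203.0.113.7:25         ESTABLISHED
  UDP    0.0.0.0:500            *:*
"""


def listening_tcp_ports(netstat_text):
    """Return the set of local TCP ports shown in the LISTENING state."""
    ports = set()
    for line in netstat_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "TCP" and fields[3] == "LISTENING":
            # The local address is "ip:port"; rsplit also copes with "[::]:port".
            ports.add(int(fields[1].rsplit(":", 1)[1]))
    return ports


def unexpected_listeners(netstat_text, expected=EXPECTED_PORTS):
    """Return listening TCP ports that are not on the expected baseline."""
    return sorted(listening_tcp_ports(netstat_text) - set(expected))


print(unexpected_listeners(SAMPLE))
```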

Network evidence
Having a well-rehearsed plan for acquiring live evidence is critical. Using trusted and forensically sound tools is a must. Before gathering the evidence from the suspect system, it could be worth considering a network forensic approach by sniffing the communication flows to and from the suspect system. Unfortunately, this tends to be easier said than done, both from a legal and a technical perspective.

In some situations, such as a corporate environment or a home-networked environment, it may be possible to intercept communication through the use of the port spanning function of a switch. Plugging in to an existing hub or placing a hub between the suspect system and the network may also be an option. The investigator's machine would then be configured to capture all traffic to and from the suspect machine. It may be preferable to capture the raw packets using Linux and tcpdump, but various options exist, both free and commercial, for Windows and Linux.
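
Once raw packets have been captured, even a small script can give a first overview of what the suspect machine was talking to. The Python sketch below reads a capture in the classic libpcap file format (the format tcpdump writes with its `-w` option) and lists outbound TCP connections from a given address to a given port, for example TCP/25 as in Scenario 1. It is a minimal sketch under several assumptions: Ethernet link type, IPv4 only, microsecond-resolution captures only, and hypothetical function names. A dedicated protocol analyser would normally be used for real casework.

```python
import struct


def iter_pcap_packets(data: bytes):
    """Yield (timestamp_seconds, raw_frame) from a classic libpcap capture."""
    magic = data[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"   # file written on a little-endian machine
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"   # file written on a big-endian machine
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")
    offset = 24  # skip the 24-byte global header
    while offset + 16 <= len(data):
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack_from(
            endian + "IIII", data, offset)
        offset += 16
        frame = data[offset:offset + incl_len]
        offset += incl_len
        yield ts_sec + ts_usec / 1_000_000, frame


def tcp_endpoints(frame: bytes):
    """Return (src_ip, src_port, dst_ip, dst_port) for Ethernet/IPv4/TCP, else None."""
    if len(frame) < 34 or frame[12:14] != b"\x08\x00":   # EtherType IPv4
        return None
    if frame[14] >> 4 != 4 or frame[23] != 6:            # IP version 4, protocol TCP
        return None
    tcp_start = 14 + (frame[14] & 0x0F) * 4              # honour the IP header length
    if len(frame) < tcp_start + 4:
        return None
    src_ip = ".".join(str(b) for b in frame[26:30])
    dst_ip = ".".join(str(b) for b in frame[30:34])
    src_port, dst_port = struct.unpack_from("!HH", frame, tcp_start)
    return src_ip, src_port, dst_ip, dst_port


def outbound_to_port(data: bytes, suspect_ip: str, port: int = 25):
    """List (timestamp, destination_ip) for packets from suspect_ip to the port."""
    hits = []
    for timestamp, frame in iter_pcap_packets(data):
        endpoints = tcp_endpoints(frame)
        if endpoints and endpoints[0] == suspect_ip and endpoints[3] == port:
            hits.append((timestamp, endpoints[2]))
    return hits
```

Calling `outbound_to_port(open("capture.pcap", "rb").read(), "192.168.1.5")`, with a file name and address of your own, would list every packet that host sent towards an SMTP server, which is the kind of trace a backdoor's email notification would leave.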
In other situations, such as a home user with ADSL and a USB modem, it may be necessary to use a proprietary device such as the DSL Phantom by TraceSpan.13 This device is able to extract traffic and dump it via USB to the analysis machine.

Some of these techniques require that the connection be disrupted; in such cases, we usually obtain the volatile information from the computer before obtaining the network communications.

In the UK, the Regulation of Investigatory Powers Act 200014 and The Telecommunications (Lawful Business Practice) (Interception of Communications) Regulations 200015 govern the interception of communications. Similar protections exist in the US, including the Electronic Communications Privacy Act (ECPA). It is recommended that you obtain legal advice regarding the interception of communications. Legally obtained information from a packet capture could significantly influence the investigation. The capture may provide evidence of a backdoor or an active compromise, or it may show ongoing activities that enhance the case.

13 http://www.tracespan.com/2_2LI%20Monitoring.html.
14 http://www.hmso.gov.uk/acts/acts2000/20000023.htm.
15 http://www.hmso.gov.uk/si/si2000/20002699.htm.

Figure 7 A screenshot of WFT.

A next step: volatile information from a live system

After obtaining the network captures, the next step involves gathering volatile information from the system. One tool that should be part of every responder's toolkit is the Windows Forensic Toolchest (WFT).16 This tool, written by Monty McDougal as part of his SANS GCFA17 certification, is designed to provide an automated incident response on a Windows system and to collect security-relevant information from the system in a forensically sound manner. Encase Enterprise Edition has most of these capabilities, apart from processing memory dumps, and works on both Windows and UNIX. Also, the Coroner's Toolkit was developed for UNIX systems for this purpose, including memory acquisitions using memdump. See http://www.porcupine.org/forensics/tct.html.

16 http://www.foolmoon.net/security/wft/.
17 http://www.giac.org/practical/GCFA/Monty_McDougal_GCFA.pdf.

WFT is an excellent incident response tool in that it provides a simplified way of scripting these responses using a sound methodology for data collection. While running individual tools is an option, using a scripted technique helps ensure consistency and eliminate mistakes (Fig. 7).

By default, the tool will dump all sorts of volatile information such as the current time, process listings, service listings, system information, network information, auto-start information, registry information and even a binary copy of memory. Because WFT uses a configuration file that ultimately tells WFT which external programs to call and how to call them, additional information can be obtained, or alternate techniques used to obtain the same information, by tweaking the configuration file.

For example, WFT uses a version of dd18 modified by George Garner for generating a binary copy of the physical memory in the machine, using the following command:

dd if=\\.\PhysicalMemory of=<dest><img name>

This command could easily be replaced if a better or more suitable tool is found.
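
The value of a scripted, configuration-driven collection can be sketched in a few lines of Python. This is not WFT and does not reproduce its configuration format: the plan structure, the record fields and the function name are hypothetical, and the single example command simply invokes the Python interpreter so that the sketch runs anywhere. On a real response, the plan would name trusted binaries run from the responder's own media.

```python
import hashlib
import subprocess
import sys
from datetime import datetime, timezone

# Hypothetical collection plan: (label, command) pairs. Swapping a tool means
# editing one entry here rather than changing the procedure itself.
EXAMPLE_PLAN = [
    ("interpreter_version", [sys.executable, "-c", "import sys; print(sys.version)"]),
]


def run_plan(plan):
    """Run each command in order and record when it ran and a digest of its output."""
    records = []
    for label, command in plan:
        started = datetime.now(timezone.utc).isoformat()
        result = subprocess.run(command, capture_output=True, check=False)
        records.append({
            "label": label,
            "command": command,
            "started_utc": started,
            "exit_code": result.returncode,
            # The digest lets the saved output be verified later.
            "sha256": hashlib.sha256(result.stdout).hexdigest(),
            "output": result.stdout,
        })
    return records


if __name__ == "__main__":
    for record in run_plan(EXAMPLE_PLAN):
        print(record["started_utc"], record["label"], record["sha256"])
```

Because every command runs in the same order with the same arguments and leaves a timestamped, hashed record, two responders following the same plan should produce comparable output, which is the consistency benefit described above.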

Coming up in Part II

In the next article, we will show how the volatile information we have gathered can be used to aid an offline forensic analysis of the computer. We will also discuss the virtues of network analysis and the use of Virtual Machines to aid an investigation. The use and limitations of AV products and their benefit to investigations will also be addressed.

18 http://users.erols.com/gmgarner/forensics/.

Dan Haagman (BSc, CSTP, CFIA) and Byrne Ghavalas (CSTP, CFIA, GCFA) instruct and practice in computer forensics for 7Safe, an independent Information Security practice delivering an innovative portfolio of services including: Forensic Investigation, BS7799 Consulting, Penetration Testing & Information Security Training.

International Journal of Drug Policy xxx (2013) xxx-xxx

journal homepage: www.elsevier.com/locate/drugpo

Commentary

Silk Road, the virtual drug marketplace: A single case study of user experiences

Marie Claire Van Hout a,*, Tim Bingham b
a School of Health Sciences, Waterford Institute of Technology, Waterford, Ireland
b Irish Needle Exchange Forum, Ireland

Article info

Article history:
Received 30 September 2012
Received in revised form 1 January 2013
Accepted 14 January 2013

Keywords: Silk Road; Internet; Online drug forums; New psychoactive substances; Psychonautics; Ethnopharmacy

Abstract
Background: The online promotion of drug shopping and user information networks is of increasing public health and law enforcement concern. An online drug marketplace called Silk Road has been operating on the Deep Web since February 2011 and was designed to revolutionise contemporary drug consumerism.
Methods: A single case study approach explored a Silk Road user's motives for online drug purchasing, experiences of accessing and using the website, drug information sourcing, decision making and purchasing, outcomes and settings for use, and perspectives around security. The participant was recruited following a lengthy relationship building phase on the Silk Road chat forum.
Results: The male participant described his motives, experiences of purchasing processes and drugs used from Silk Road. Consumer experiences on Silk Road were described as euphoric due to the wide choice of drugs available, relatively easy once navigating the Tor Browser (encryption software) and using Bitcoins for transactions, and perceived as safer than negotiating illicit drug markets. Online researching of drug outcomes, particularly for new psychoactive substances, was reported. Relationships between vendors and consumers were described as based on cyber levels of trust and professionalism, and supported by stealth modes, user feedback and resolution modes. The reality of his drug use was described as covert and solitary with psychonautic characteristics, which contrasted with his membership, participation and feelings of safety within the Silk Road community.
Conclusion: Silk Road as an online drug marketplace presents an interesting displacement away from traditional online and street sources of drug supply. Member support and a harm reduction ethos within this virtual community maximise consumer decision-making and positive drug experiences, and minimise potential harms and consumer perceived risks. Future research is necessary to explore experiences and backgrounds of other users.
© 2013 Elsevier B.V. All rights reserved.

Introduction
The Internet is increasingly viewed as the driver of the contemporary drug markets by the promotion of drug shopping in web
based retail outlets and settings for user communication of information (Burillo-Putze, Domnguez-Rodrguez, Abreu-Gonzlez, &
Nogu Xarau, 2011; Califano, 2007; Corazza et al., 2011, 2012;
Davey, Corazza, Schifano, Deluca, & Psychonaut Web Mapping
Group, 2010; Davey, Schifano, Corazza, & Deluca, 2012; Davies,
2012; Eurobarometer, 2011; Forsyth, 2012; Hill & Thomas, 2011;
Jones, 2010; Measham, 2011; Oyemade, 2010; Prosser & Nelson,
2011; Psychonaut Web Mapping Research Group, 2009; Solberg,
2012; Sumnall, Evans-Brown, & McVeigh, 2011; Vardakou, 2011;
Winstock, Marsden, & Mitcheson, 2010). Research has underscored
how the cyber drug market has become increasingly dynamic and
innovative in its capacity to retail drugs, create new compounds

Corresponding author at: School of Health Sciences, Waterford Institute of Technology, Waterford, Ireland. Tel.: +353 51 302166.
E-mail address: mcvanhout@wit.ie (M.C.V. Hout).

and circumvent legislative controls (Brandt, Sumnall, Measham, &


Cole, 2010; EMCDDA, 2011a, 2011b; Grifths, Sedefov, Gallegos,
& Lopez, 2010; Inciardi et al., 2010). Organic responses to drug
product development and availability include the prevalence of
online drug website and chat forums operating to provide user
information on outcomes, experiences, popularity, availability and
sourcing mechanisms, optimum use and harm reduction practices (Davey et al., 2012; Gordon, Forman, & Siatkowski, 2006;
Wax, 2002). It is increasingly apparent that existing and new versions of illicit drugs are traded and discussed among users of
the Deep Web or Invisible Web which represent online content
not searchable by standard search engines such as Google. Novel
psychoactive substances (NPS) are commonly known as designer
drugs, legal highs, research chemicals, synthetic drugs and herbal
highs and are marketed as quality legal or labelled not for human
consumption substitutes for popular street drugs such as ecstasy,
amphetamine, cannabis and cocaine (De Luca et al., 2012). The
shift towards widespread global availability of all drugs is evident
in the recent online presence of drug marketplaces such as Silk
Road, Black Market Reloaded, The Armory and the General Store
(Christin, 2012).

0955-3959/$ see front matter 2013 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.drugpo.2013.01.005

Please cite this article in press as: Hout, M. C. V., & Bingham, T. Silk Road, the virtual drug marketplace: A single case study of user experiences.
International Journal of Drug Policy (2013), http://dx.doi.org/10.1016/j.drugpo.2013.01.005

G Model
DRUPOL-1172; No. of Pages 7
ARTICLE IN PRESS
M.C.V. Hout, T. Bingham / International Journal of Drug Policy xxx (2013) xxxxxx

One such website experiencing heightened user interest is
called Silk Road and has operated anonymously on the Deep Web
since its launch in February 2011 (Chen, 2011; Norrie & Moses,
2011). According to Schumer (2011), Silk Road is a "certifiable one-stop shop for illegal drugs that represents the most brazen attempt
to peddle drugs online that we have ever seen". It has revolutionised
Internet drug sourcing and has been described as an "Ebay for Drugs"
(Barratt, 2012, p. 683). Silk Road provides cyber buyers and vendors with the infrastructure to conduct online transactions, with
over 24,400 drug related products for sale (Christin, 2012). The Silk
Road Sellers Guide (2011) prohibits the sale of goods and services
intended to harm or defraud, with a conspicuous omission relating
to prescription drugs and narcotics, pornography and counterfeit
documents (Christin, 2012). Weapons and ammunition were permitted until March 2012, and have since been relisted on a sister site called
The Armory (Christin, 2012). Silk Road is only accessible to users of Tor
anonymising software, which conceals computer IP addresses (Tor
Project, 2011). At the time of writing in late 2012, Silk Road forum
statistics indicate close to 200,000 registrations and 199,538 forum
posts, up from close to 8000 registrations in 2011 (Silk Road Forums,
2012). Christin (2012) estimated total revenue by all vendors as
approximately $1.9 million per month.
Of note is that the website operates similarly to Ebay (Barratt,
2012), by way of vendor and buyer ratings, and feedback on quality of transactions, speed of dispatch and profile of products. It
operates a professional dispute resolution mechanism and has
a forum dedicated to drug safety and harm reduction practices.
Once the user has registered for free, a wide variety of drugs
are easily located on the website and include cannabis, ecstasy,
psychedelics, opioids, stimulants, benzodiazepines and dissociatives (Barratt, 2012). Vendor and buyer identities are obscured, with
the site recommending that vendors disguise shipments and vacuum seal drugs that could be detected by smell. Buyers and vendors
conduct all transactions in Bitcoins (often used for online gaming), described as a non-government-controlled, anonymous and
untraceable crypto-currency, used as peer-to-peer currency and
indexed to the US dollar to prevent excessive inflation or deflation
(Bitcoin, 2011; Davis, 2011). Transaction anonymity is optimised
by use of tumbler services of dummy and single use intermediaries between buyer and vendor (Christin, 2012). In addition to
public listings, Silk Road supports stealth listings which are used
for custom listings directed at certain consumers and operated
through out-of-band contact between vendor and buyer (Christin,
2012). Vendor authenticity and commitment to providing quality
goods is also controlled by the purchasing of new vendor accounts
through auctions to the highest bidders. The site offers sale campaigns and events, and users may also purchase gift certificates for
friends.
Research to date on Silk Road is scant and at present limited to
investigative site monitoring work by Monica Barratt (2012) and Nicolas
Christin (2012). The Silk Road context and its members' experiences are an under-researched topic. We present here an intensive,
holistic and exploratory single case study analysis (see Baxter &
Jack, 2008; Flyvbjerg, 2011; Thomas, 2011; Stake, 2005; Yin, 2003)
of an active user's experiences within the Silk Road setting in
order to gain a sharpened phenomenological understanding and
rich description of user motives for accessing the site, perceptions
of risk, purchasing processes and drug use outcomes within the
realities of Silk Road as a contemporary virtual drug market. We
recognise that the chosen single case study approach is exploratory
and confined to the experiences of the participant, and therefore
cannot offer grounds for reliability or generalisation of findings
(Baxter & Jack, 2008; Greenhalgh, 1997; Yin, 2003). Despite these
shortcomings, this holistic single case study, which explores the
user's experience of navigating Silk Road as an online drug market, is
unique and merited.

The single case study method


We originally intended to undertake a qualitative study using
interviews with Silk Road members. Ethical approval for the case
study was granted by the School of Health Sciences, Waterford
Institute of Technology, Ireland. Following a two-month period of
site navigation on Silk Road and active participation in the Silk Road
forums, we requested permission from the website administrator
to undertake research on its member experiences and to upload
information and recruitment threads in the forums. Following
best practice protocols recommended in the literature (Barratt &
Lenton, 2010; Mendelson, 2007; Murguía & Tackett-Gibson, 2007;
Sixsmith, Boneham, & Goldring, 2003), we invited members to
partake in the research via a message board recruitment thread.
Recruitment of site users was hampered by negative and suspicious
reactions by forum participants. Information-oriented sampling
(Flyvbjerg, 2006) guided recruitment efforts, with an active Silk
Road member agreeing to be interviewed, following a lengthy
relationship building phase with Author 2 on the Silk Road chat
forum.
We recognised that in order to gain an in-depth understanding
of user motives, experiences and navigation of the Silk Road site,
a single exploratory and holistic case study with a participant who
"fitted the bill" could be deemed appropriate (Greenhalgh, 1997, p.
157). The case itself (in this instance the Silk Road user) was defined
as a phenomenon occurring within a bounded context, namely the
Silk Road site (Miles & Huberman, 1994). The study adhered to
the recommended purpose, approach, processes and quality control methodologies for data analysis of single case studies
(Baxter & Jack, 2008; Flyvbjerg, 2011; Greenhalgh, 1997; Thomas,
2011; Yin, 1984, 2003). Close collaboration between us and the Silk
Road participant enabled the participant to tell his story, iterate his
views on Silk Road and allow us to understand his actions (Baxter &
Jack, 2008; Lather, 1992).
Interview topics and targets set for this single case study were
developed following the review of existing Silk Road literature
and media reporting, and in consultation with our experiences
navigating the site itself (see Darke, Shanks, & Broadbent, 1998).
We focused on the following areas of interest: participant drug
use history, motives for Internet drug sourcing and sites used,
experiences accessing and using Silk Road, drug information sourcing and decision making on Silk Road, Silk Road drugs of choice,
experiences of these drugs and settings for use, interaction with
Silk Road chat forums and the online Silk Road community, and
future intentions for using Silk Road. The case was provided with
information outlining the research aims, and was informed that
his experiences would be documented and subsequently available in the public domain as a journal paper. He was advised of his
right to withdraw if desired. Informed verbal consent was
received prior to commencement of the recorded interview. Complete anonymity was ensured as the case and Author 2 used online
pseudonyms, with the interview conducted via Skype with video deactivated. No names or personal identifiers were requested,
and the participant was advised not to verbalise any potentially
identifying names, places or otherwise. The interview was conducted in an open-ended, unordered conversational style and
lasted 70 min.
The interview was transcribed and read and reread several times
by both researchers. The interview data set contained a rich series of
narratives, which built on the single case study plot in the form of
a "hero's journey" within a sequence of events leading to accessing
Silk Road, subsequent interaction with the site, and experiences of
drugs purchased (Flyvbjerg, 2011). We recognise that the analysis
of narratives in this single case study runs the risk of committing
the so-called "narrative fallacy", by virtue of oversimplification of data
and researchers' preconceived notions (Flyvbjerg, 2011). Extensive
briefing sessions were held between researchers in order to circumvent this and we strove to minimise bias towards verification by
checking for validity and reliability within the collection, analysis
and subsequent presentation of resultant themes. Data credibility
was further improved by employing a focused analysis, consistently exploring the data within the scope of the research questions
and existing literature on Silk Road (Darke et al., 1998; Miles &
Huberman, 1994; Russell, Gregory, Ploeg, DiCenso, & Guyatt, 2005;
Yin, 2003). The focused collection and comparison of these single
case study narratives with extant literature enhanced the quality of
resultant deconstruction and reconstruction of various Silk Road
phenomena by virtue of idea convergence and confirmation of findings (Baxter & Jack, 2008; Knafl & Breitmayer, 1989). Five themes
emerged from the data, and are presented in the following section: "Participant drug use history", "Internet drug sourcing and risk
perceptions", "Preparing to access Silk Road", "Silk Road purchasing
mechanisms" and "Drug use, testing and setting".

The single case study


Participant drug use history
The participant was male, aged 25 years and in professional
employment. He described himself as commencing drug use at age
15 years, and as an experimental, recreational and psychonaut¹-type
user of cannabis, ecstasy, cocaine and hallucinogens (LSD, mushrooms, 2C-I, 2C-B). He described using drugs as a life-enhancing tool
to expand his consciousness within a personal and lifestyle-oriented journey, particularly whilst meditating, playing music in his
bedroom and outdoors to boost connection with nature.
At heart, I really am a psychonaut. I really do love anything that
is trippy but at the same time it's restrained.
He appeared conscious of periods of excessive drug use where
loss of control was evident, and subsequent self-monitoring and
control of his drug consumption occurred.
It has woken me up to ask is what is an acceptable level of use
or is it acceptable to use at all.
Internet drug sourcing and risk perceptions
The case observed increased awareness of the possibilities of
Internet drug sourcing in 2010 via his use of social media, with
Silk Road and some sites selling research chemicals (i.e. the Buy
Research Chemicals (BRC) website) appearing to offer a legitimate,
safe and opportunistic channel for the sourcing of a variety of
drugs. He described some initial concerns that perhaps products
would not arrive. The BRC website was used several times, and
he reported favourable experiences with a tester pack of drugs
purchased, which supported his decision to continue purchasing
his favourite products. Mail order cannabis was the first product
bought.
It was through reviews or through word of mouth that this place
is actually ok and they have some interesting things on offer,
that sparked the interest and I kept on going and things have
gone deeper since.

¹ A person who intelligently experiments with mind-altering chemicals, sometimes to the extent of taking exact measurements and keeping records of
experiences. Also defined as a scientific explorer of inner space (Newcombe, 2008;
Newcombe & Johnson, 1999).

Comments were made around perceived buyer safety from legal
detection by using Silk Road and BRC and the limitation of risks
associated with interaction with street dealers.
The main reason for buying from the sites was because I felt
there was a safety factor from the law, I have never really had
any trouble with the police or the judicial system as a whole and
I have always wanted to limit any risk. I saw it as a safer way of
sourcing.
Perceived safety and intention to use these sites for drug sourcing appeared to hinge on the case's own capacity to research and
survey online user feedback on available channels of cyber drug
retailing, and on levels of trust in social media connections between
users. He described exercising caution and observed that (in general) amongst the recognisable experienced drug user reports, a
host of relevant and useful information on drug retail sites and
drug products was evident. The participant recognised that despite
undertaking extensive online research, products received were
untested, and therefore potentially harmful. He commented on the
role of the site moderators in discouraging potentially harmful dissemination of information.
When I was reviewing the sites, I didn't look at specific people,
I looked at the big picture, if there were ten people saying this
site was good and two people saying that it was bad, the chances
are, it was a good site, you have to be careful at the same time,
what one person experiences isn't necessarily what someone
else is going to experience.
Preparing to access Silk Road
Having decided to use Silk Road, the participant used Google and
existing drug forums on Erowid² and Bluelight³ to discover how
to access the site.
On the open internet you have the details of what the Silk Road
is, they have a link to the Tor browser and they even include it
on Wikipedia, the link to the Silk Road as it is at the moment,
it is even easier now. . .when I joined the link wasn't available.
There is potential for lots more people to look.
In 2011, he reported spending close to one day navigating
the Tor website and gaining access onto Silk Road, which was
described as time consuming, but not difficult as he is an experienced computer user.
Getting onto Tor is quite easy really, you have the open Tor
project which is on a freely available website and you can download the software and all of a sudden you're on the Deep Web, once
you're on there, you only need the link to get onto Silk Road.
The participant observed the need for exercising personal
caution by encrypting computer hard drives and following the
guidelines for Tor encryption software.
All you have to do is to go onto the internet and look at the
Tor websites so if you're on Silk Road and look at the forums,
there are people there telling you how to use the site safely and
properly. As long as you follow the advice, you should never
encounter a problem, you just need to follow common sense
which can be learnt from the experience of others. It's your
responsibility to make it as secure as you can, so the Tor network itself that allows you to get onto the Silk Road helps to
provide anonymity, but the anonymity that you get is only as
good as you make it, if you leave the software left on your computer and you get raided by the police, then they will know that
you have been using this software to go on there.

² Erowid Online is an online library containing information about psychoactive
drugs, plants, and research chemicals.
³ Bluelight is an international message board that educates the public about
responsible drug use by promoting free discussion.

Bitcoins as currency within a peer-to-peer network were
described as useful but not without problems. He had used
Bitcoins for purchases other than drugs but reported increasing
difficulties in bank-to-bank transferral of funds using unnamed payments. The setting up of fraudulent accounts and use of Bitcoin
tumbler systems appeared to guarantee security.
Getting hold of the Bitcoins, that's probably the only hard part.
I have managed to set up a bank account fraudulently. I suppose
probably the worse crime I have ever committed. I can now
deposit money under a false name and get it into this online
account, so I can then get it out and cipher it through Bit wallets of my own that are completely anonymous and not linked
to me any shape or form, and from there transfer them into
my account on Silk Road. It has become a lot more complicated
to do it very securely. But, the nature of Silk Road itself and
the fact that they have a tumbler system that the Bitcoin go
through, the chances of it being linked back to your bank account
would be slim to none really. I have not heard of any users being
arrested for going on Silk Road and if they have, it has not been
publicised.

Silk Road purchasing mechanisms

The participant reported a euphoric, joyful experience once on
Silk Road. A wide variety of drug product listings were visible,
particularly for new psychoactive substances and drugs not easily
sourced within his locality.
I got on there and I was blown away by it really, it really kicked
me for six. There were things on there that I had wanted to try
for a long time, but have never had either the contacts or the
desire to go and source from the street.
He described Silk Road as the only trusted place to get both
information on the available drug products and, in contrast to
street drug purchasing, the opportunity to receive quality products.
Overall quality of consumer experience and assistance in product
and vendor decision-making was supported by visible online vendor reviews, vendor accountability, buyer-vendor negotiations and
resolution modes.
The levels of protection and the quality of what's on there,
the quality of the service, the negotiations if something goes
wrong, you can go into resolution mode, if something doesn't
turn up or if you don't get exactly what you ordered as
described in the article. On the whole because there is a level of
accountability. . .there is a greater safety purchasing from Silk
Road because the level of self reporting. If people get something
they don't like, they will kick up a fuss. The vendors, if they are
reputable vendors, they will give the person a refund and say I
am very sorry don't let this mar our relationship.
For certain drugs (like MDMA), the participant described experiencing varying degrees of product quality via use of pill testing
websites and drug outcome, and as a result focused his MDMA
sourcing on vendors located in countries renowned for producing
quality forms of that drug (e.g. the Netherlands). He appeared to
register some concern for the export of drugs from Asian countries.
Over time, he reported noticing that the same vendors used similar sourcing chains, and described his vendors as becoming trusted
sources through repeat transactions. Vendors can review a buyer's track
history in purchasing and instances of using the resolution centre
before deciding to transact with a new buyer.
Using the forum and doing intelligence investigations and
weighing things up before hand is very important, but the actual
buying process is ridiculously easy. If you're going to purchase
from someone, you need to do your research first, just don't
go blindly and say I want that and buy it because although the
vendor might have a 5 star review, the last five people who
might have bought from them might be turning around and
saying hey this guy is selling funked products and it's not giving me the high it used to or it's not doing anything for me
at all.
Purchasing process and visual layout of the site appeared very
like Ebay or Amazon, with buyers selecting products priced in
Bitcoins or US dollars by the weight and then proceeding to the
checkout. He described the advertising of multiple packages and
special offers. The participant observed a slight increase in the number of visible vendors on Silk Road and described how successful
vendors, on reaching a quota of clientele, would go into private
or stealth mode, and cease advertising openly on Silk Road. This
was observed to increase vendor-buyer transaction anonymity
and cement trade relationships. Street dealers, in contrast to Silk
Road vendors, were described as unaccountable to anyone. He commented that vendor profit margins were similar to street dealing,
but without the cutting down of products. He was aware, through
forum chatting on the site, that some Silk Road vendors had been
arrested for street dealing.
Following review of the vendor and his/her products on Silk
Road, his initial purchase was made by providing his home address
with a false name. The payment exchange was finalised on receipt
of the product, which he attributed to the seller being UK based and
to the small quantity purchased. He reported that all of his Silk
Road transactions arrived safely, despite some reports of this not
occurring for his virtual friends. Product packaging was described as
very professional using bubble wrap envelopes or multiple sheets
of paper, with warning labels and user guides absent.
I sit at home and wait for the postman to turn up and say good
morning to him and shake his hand and thank him for bringing
the post to me. I had a lovely experience when I purchased some
LSD off a German chap, he actually sent me a Christmas card with
a message in there, the LSD was hidden behind one of the glued
pieces on the card, I actually had to contact him to thank him
for the Christmas card, and ask where is the LSD, he told me to
look harder in the card, then I found what I was looking for. . .
Some of the packaging is incredible so not to draw the attention
of the postal service or customs.
Drug use, testing and setting
Silk Road was described as an "online sweetie shop where you
can go and have a pick and mix" and the participant observed this as
promoting ethno-pharmaceutical experimentation. He described
his favourite purchases as good quality cannabis, MDMA and new
psychoactive substances such as 2C-I.
Definitely if it hadn't been for Silk Road there are a lot of chemicals that I would never had the opportunity to try and wouldn't
have known about to try. It has given me access to things that I
wouldn't necessarily has access to before.
Seeking advice around optimum dosage and route of administration for his purchases was described as a two-fold process, where
he searched the online communities in Erowid for user reviews and
dose reports, and administered and re-administered the drug based
on common sense within the drug taking episode. He reported
never having a negative experience with drugs bought on Silk Road.
I have never had a bad trip because I have always had a good
set and setting, good mind frame, I know my surroundings, well
prepared in terms of food, drink and a bowl to be sick in if I need
to. I have always taken the necessary steps to ensure it is a good
trip and nothing will or could go wrong.
Drug use was described as covert and frequently alone. He
observed that his "drug identity is very separate from my non drug
identity". He described fear of too many unknowns when drug taking within a group, and described his participation in such settings
as rare and only in the instance of trusted groups. He was aware
that many users of Silk Road attended mass outdoor events. Simply being part of the online drug using community on Silk Road
and Dope Tribe was described as facilitating feelings of safety and
legitimisation of drug purchasing and use within the network of
trusted virtual friends. When questioned around his perspective on
being part of such a community, the participant emphasised that
his drug use was for a personal journey, and not something shared
with others.

Discussion
Online research methodologies for the recruitment, surveying
of and engagement with drug users are increasingly utilised, given
the emergent importance of the Internet in people's associational
day-to-day lives and the recent explosion of online pharmacies, drug
user forums and sites selling new psychoactive substances (Barratt
& Lenton, 2010; De Luca et al., 2012; Fielding, Lee, & Blank, 2008;
Miller & Sønderlund, 2010). Research to date has focused on the
web mapping of online retailing, marketing and use of drugs, and
equally the potential for Internet-based interventions to reduce
harm (De Luca et al., 2012; Kypri, 2009; Sinadinovic, Wennberg,
& Berman, 2012). This unique exploratory single case study followed protocols advocated by Yin (2003, p. 184) and Flyvbjerg
(2011), and is defined as "an empirical inquiry that investigates
a contemporary phenomenon within its real-life context; when the
boundaries between phenomenon and context are not clearly evident;
and in which multiple sources of evidence are used" (Yin, 1984, p.
23). Resultant findings provide a phenomenological insight into an
expert account of the case's experiences of Silk Road and associated drug taking. The hidden nature of Silk Road on the Deep
Web and its covert operation limit access to its members by
outsiders. This severely hampered recruitment and snowballing
efforts by the research team. Despite these shortcomings, the
case himself was an active participant in the Silk Road forums
and willing to be interviewed. He described his experiences in an intelligent and erudite manner, and illustrated site
characteristics and purchasing mechanisms corroborated by published literature, recent media reporting and law enforcement
statements. We recognise the limitations associated with this
potentially unverifiable single case study, and recommend further
research into other unrelated users' accounts and experiences of
Silk Road.
Accessing Silk Road was described as a joyful "child in a sweet
shop" type experience by virtue of its host of quality products and
vendors, and its capacity to offer anonymous, safe and speedy
transactioning without any of the risks associated with street drug
sourcing. The individual appeared to act as a drug connoisseur by virtue
of conducting extensive online and Silk Road research on product
testimonials and resolution centre outcomes, and developing trusted
social media connections with chosen vendors and other users,
prior to selecting both a product and sourcing route. Silk Road,
by way of its hidden location on the Deep Web, its use of Tor
encryption software, Bitcoins and tumbler systems, and its capacity to create networks of drug vendors with private consumer
bases, appears embedded within a growing cyber culture of anonymous drug consumerism. Barratt (2012, p. 683) commented that
trust in sellers is built on reputation, with vendors reporting preference for selling through this website rather than street dealing,
due to its increased market reach across the globe, and ability to
reduce the risk of street violence. As also noted by Christin (2012), of
interest is the illustration of reciprocal reviewing undertaken by
vendors in the event of potential transactioning, and the subsequent transferal onto stealth mode which allows them to operate
their business by invitation only, once a quota of trusted buyers is reached. The case described personal decisions to invest in
Silk Road as somewhat time intensive and requiring computer
expertise, with creation of virtual friendships protected by use
of pseudonyms and private vendor-buyer relationships. Within
recognisable cyber group reciprocity, accountability and trust, the
notion of safe drug use within the Silk Road community was
described.
It appears that for this individual, simply being part of the
global online drug community served to compartmentalise his
drug consumerism, with risk decisions undertaken within normative collective frameworks of virtual connectivity (see Furlong
& Cartmel, 1997; Miller, 2005) whilst adhering to conventional
societal norms in reality (Peretti-Watel, 2003). Of interest in this
case study is that the individual, despite clearly reaping the benefits of Silk Road community membership, in reality strove to
maintain a non drug using identity, with his drug use occurring
alone, and for personal, psychonautic and transcendental purposes (see also Leary, Metzner, & Alpert, 1964; Newcombe, 2008;
Shulgin & Shulgin, 1992, 1997; Turner, 1994). Anonymity is key,
as Silk Road members do not need to publicly assume a drug
user identity in order to converse freely about their drug use
(Barratt, 2012). Indeed, the existence of a drug user's "parallel
life" has been commented on by Moore and Miles (2004). The
importance of cyber navigation and learning appeared facilitated
and supported by an existing host of experienced drug users.
Cyber communities in this sense appeared to provide a series of
"nested support systems" (Stockdale et al., 2007, p. 1868) which
in turn fuelled information sourcing and exchange, user connectivity, identification of trusted and reliable sourcing routes, and
mutual user supports. Whilst recognising the potential harm in
using unregulated and untested products sourced from Silk Road,
he described his own (subjective) drug taking experiences, personal awareness of optimal settings for drug consumption and
harm reduction practices as informed by prior experience of similar analogues and Silk Road user forums (parallels can be drawn
from similar theorists in underground drug cultures; see Becker,
1963; Jay, 1999; Lilly, 1972; Miller, 2005; Tart, 1971, 1972; Zinberg,
1984). In this sense, this single case study holds some promise
in illustrating Silk Road's capacity to encourage harm reduction
within a very hard-to-reach drug using population (Stimson, 1995),
considering the lack of scientific knowledge around pharmacological properties and toxicity of available substances on the net
(Dixon, 2010; Govier, 2011; Hughes & Winstock, 2012; Karila
& Reynaud, 2011; Rosenbaum, Carreiro, & Babu, 2012; Schepis,
Marlowe, & Forman, 2008; Schifano et al., 2011; Wood et al.,
2010).

References
Barratt, M. (2012). Letters to the editor. Silk Road: Ebay for drugs. Addiction, 107, 683–684.
Barratt, M., & Lenton, S. (2010). Beyond recruitment? Participatory online research with people who use drugs. International Journal of Internet Research Ethics, 3, 69–86.
Baxter, P., & Jack, S. (2008). Qualitative case study methodology: Study design and implementation for novice researchers. The Qualitative Report, 13(4), 544–559.
Becker, H. (1963). Outsiders: Studies in the sociology of deviance. London: Free Press of Glencoe.
Bitcoin. (2011). Bitcoin. Bitcoin P2P digital currency. Retrieved from http://bitcoin.org/ (27.09.12).
Brandt, S. D., Sumnall, H. R., Measham, F., & Cole, J. (2010). Second generation mephedrone. The confusing case of NRG-1. British Medical Journal, 341, 3564.
Burillo-Putze, G., Domínguez-Rodríguez, A., Abreu-González, P., & Nogué Xarau, S. (2011). Khat, mefedrona y dolor torcicom. Medicina Clínica [Medicina Clinica (Barcelona)], 137, 712–713.
Califano, J. A. (2007). Press release: You've Got Drugs! Retrieved from http://www.casacolumbia.org/absolutenm/templates/PressReleases.aspx?articleid=492&zoneid=65 (22.09.12).
Chen, A. (2011). The underground website where you can buy any drug imaginable. Retrieved from http://gawker.com/5805928/the-underground-website-where-you-can-buy-any-drug-imaginable (20.09.12).
Christin, N. (2012). Traveling the Silk Road: A measurement analysis of a large anonymous online marketplace. July, Id: paper.tex 1286 2012-07-30 21:29:14Z nicolasc.
Corazza, O., Schifano, F., Simonato, P., Fergus, S., Assi, S., Stair, J., et al. (2012). Phenomenon of new drugs on the Internet: The case of ketamine derivative methoxetamine. Human Psychopharmacology: Clinical and Experimental, 27, 145–149.
Corazza, O., Schifano, F., Farre, M., Deluca, P., Davey, Z., Drummond, C., et al. (2011). Designer drugs on the internet: A phenomenon out-of-control? The emergence of hallucinogenic drug Bromo-Dragonfly. Current Clinical Pharmacology, 6, 125–129.
Darke, P., Shanks, G., & Broadbent, M. (1998). Successfully completing case study research: Combining rigour, relevance and pragmatism. Information Systems Journal, 8, 273–289.
Davey, Z., Corazza, O., Schifano, F., Deluca, P., & Psychonaut Web Mapping Group. (2010). Mass-information: Mephedrone, myths, and the new generation of legal highs. Drugs and Alcohol Today, 10, 24–28.
Davey, Z., Schifano, F., Corazza, O., & Deluca, P. (2012). e-Psychonauts: Conducting research in online drug forum communities. Journal of Mental Health, 21, 386–394.
Davies, B. (2012). Dangerous drugs online. The Australian Prescriber, 35, 32–33.
Davis, J. (2011). The crypto-currency. The New Yorker. Condé Nast, p. 62.
De Luca, P., Davey, Z., Corazza, O., Di Furia, L., Farre, M., Holmefjord Flesland, L., et al. (2012). Identifying emerging trends in recreational drug use; outcomes from the Psychonaut Web Mapping Project. Progress in Neuro-Psychopharmacology and Biological Psychiatry (Early Online).
Dixon, B. (2010). Worries over legal drugs. Current Biology, 20, 298–299.
EMCDDA. (2011a). The state of the drugs problem in Europe: Annual report. Lisbon, Portugal: European Monitoring Centre for Drugs and Drug Addiction.
EMCDDA. (2011b). Report on the risk assessment of mephedrone in the framework of the Council decision on new psychoactive substances. Luxembourg: Publications Office of the European Union.
Eurobarometer. (2011). Eurobarometer: Youth attitudes on drugs. Analytical report. Retrieved from http://ec.europa.eu/public_opinion/flash/fl_330_en.pdf (20.09.12).
Fielding, N. G., Lee, R. M., & Blank, G. (Eds.). (2008). The handbook of online research methods. London: Sage.
Flyvbjerg, B. (2011). Case study. In N. K. Denzin, & Y. S. Lincoln (Eds.), The Sage handbook of qualitative research (4th ed., pp. 301–316). Thousand Oaks, CA: Sage.
Flyvbjerg, B. (2006). Five misunderstandings about case study research. Qualitative Inquiry, 12(2), 219–245.
Forsyth, A. J. M. (2012). Virtually a drug scare: Mephedrone and the impact of the Internet on drug news transmission. International Journal of Drug Policy, 23, 198–209.
Furlong, A., & Cartmel, F. (1997). Young people and social change: Individualization and risk in late modernity. Buckingham: Open University Press.
Gordon, S. M., Forman, R. F., & Siatkowski, C. (2006). Knowledge and use of the Internet as a source of controlled substances. Journal of Substance Abuse Treatment, 30, 271–274.
Govier, M. (2011). Research chemicals: An approach to filling the information gap. Drugs and Alcohol Today, 11, 71–76.
Greenhalgh, T. (1997). How to read a paper: The basics of evidence based medicine. UK: BMJ Publishing Group, pp. 151–162.
Griffiths, P., Sedefov, R., Gallegos, A., & Lopez, D. (2010). How globalization and market innovation challenge how we think about and respond to drug use: Spice, a case study. Addiction, 105, 951–953.
Hill, S., & Thomas, S. H. (2011). Clinical toxicology of newer recreational drugs. Clinical Toxicology, 49, 705–719.
Hughes, B., & Winstock, A. R. (2012). Controlling new drugs under marketing regulations. Addiction, http://dx.doi.org/10.1111/j.1360-0443.2011.03620.x

Inciardi, J. A., Surratt, H. L., Cicero, T. J., Roseblum, A., Ahwah, C., Bailey, E., et al. (2010). Prescription drugs purchased through the internet: Who are the end users? Drug and Alcohol Dependence, 110, 21–29.
Jay, M. (1999). Artificial paradises: A drugs reader. London: Penguin.
Jones, A. L. (2010). Legal highs available through the Internet: implications and solutions? Quarterly Journal of Medicine, 103, 535–536.
Karila, L., & Reynaud, M. (2011). GHB and synthetic cathinones: Clinical effects and potential consequences. Drug Testing and Analysis, 3, 552–559.
Knafl, K., & Breitmayer, B. J. (1989). Triangulation in qualitative research: Issues of conceptual clarity and purpose. In J. Morse (Ed.), Qualitative nursing research: A contemporary dialogue (pp. 193–203). Rockville, MD: Aspen.
Kypri, K. (2009). New technologies in the prevention and treatment of substance use problems. Drug and Alcohol Review, 28, 1–2.
Lather, P. (1992). Critical frames in educational research: Feminist and poststructural perspectives. Theory into Practice, 31(2), 87–99.
Leary, T., Metzner, R., & Alpert, R. (1964). The psychedelic experience. NY: Citadel Press.
Lilly, J. (1972). The Centre of the Cyclone: An autobiography of inner space. London: Marion Boyars.
Measham, F. (2011). Legal highs: The challenge for government. Criminal Justice Matters, 84, 28–30.
Mendelson, C. (2007). Recruiting participants for research from online communities. Computers, Informatics, Nursing, 25, 317–323.
Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded sourcebook. CA: Sage.
Miller, P. G., & Sønderlund, A. L. (2010). Using the internet to research hidden populations of illicit drug users: A review. Addiction, 105, 1557–1567.
Miller, P. (2005). Scapegoating, self-confidence and risk comparison: The functionality of risk neutralisation and lay epidemiology by injecting drug users. International Journal of Drug Policy, 16, 246–253.
Moore, K., & Miles, S. (2004). Young people, dance and the sub-cultural consumption of drugs. Addiction Research and Theory, 12, 507–523.
Murguía, E., & Tackett-Gibson, M. (2007). The new drugs Internet survey: A portrait of respondents. In E. Murguía, M. Tackett-Gibson, & A. Lessem (Eds.), Real drugs in a virtual world: Drug discourse and community online (pp. 45–58). Lanham, MD: Lexington Books.
Newcombe, R. (2008). Ketamine case study: The phenomenology of a ketamine experience. Addiction Research and Theory, 16, 209–215.
Newcombe, R., & Johnson, M. (1999, November). Psychonautics: A model and method for exploring the subjective effects of psychoactive drugs. Paper presented at Club Health 2000: First International Conference on Nightlife and Substance Use, Royal Tropical Institute, Amsterdam, Netherlands.
Norrie, J., & Moses, A. (2011). Drugs bought with virtual cash. The Sydney Morning Herald. Fairfax Media. Retrieved from http://www.smh.com.au/technology/technology-news/drugs-bought-with-virtual-cash-20110611-1fy0a.html (20.09.12).
Oyemade, A. (2010). Meow Meow or Miaow Miaow: A new drug of concern. Psychiatry, 7, 10.
Peretti-Watel, P. (2003). Neutralization theory and the denial of risk: Some evidence from cannabis use among French adolescents. The British Journal of Sociology, 54, 21–42.
Prosser, J. M., & Nelson, L. S. (2011). The toxicology of bath salts: A review of synthetic cathinones. Journal of Medical Toxicology, 8, 33–42.
Psychonaut Web Mapping Research Group. (2009). Mephedrone report. London, UK: Institute of Psychiatry.
Rosenbaum, C. D., Carreiro, S. P., & Babu, K. M. (2012). Here Today, Gone Tomorrow... and Back Again? A review of herbal marijuana alternatives (K2, Spice), synthetic cathinones (bath salts), kratom, salvia divinorum, methoxetamine, and piperazines. Journal of Medical Toxicology, 8, 15–32.
Russell, C., Gregory, D., Ploeg, J., DiCenso, A., & Guyatt, G. (2005). Qualitative research. In A. DiCenso, G. Guyatt, & D. Ciliska (Eds.), Evidence-based nursing: A guide to clinical practice (pp. 120–135). St. Louis, MO: Elsevier Mosby.
Schepis, T., Marlowe, D. B., & Forman, R. F. (2008). The availability and portrayal of stimulants over the Internet. Journal of Adolescent Health, 42, 458–465.
Schifano, F., Albanese, A., Fergus, F., Stair, J. L., Deluca, P., Corazza, O., et al. (2011). Mephedrone (4-methylmethcathinone; "meow meow"): Chemical, pharmacological and clinical issues. Psychopharmacology, 214, 593–602.
Schumer, C. (2011). Schumer pushes to shut down online drug marketplace. Associated Press (NBC New York). Retrieved from http://www.nbcnewyork.com/news/local/123187958.html (20.09.12).
Shulgin, A., & Shulgin, A. (1992). PIHKAL: A chemical love story. Berkeley, CA: Transform Books.
Shulgin, A., & Shulgin, A. (1997). TIHKAL: The continuation. Berkeley, CA: Transform Books.
Silk Road forums. (2012). Retrieved from http://dkn255hz262ypmii.onion (20.09.12).
Silk Road Sellers Guide. (2011). Restricted items. Sellers guide, Silk Road. Retrieved from http://ianxz6zefk72ulzz.onion/index.php/silkroad/sellers guide (22.09.12).
Sinadinovic, K., Wennberg, P., & Beman, A. H. (2012). Targeting problematic users of illicit drugs with Internet-based screening and brief intervention: A randomized controlled trial. Drug and Alcohol Dependence (Early Online).
Sixsmith, J., Boneham, M., & Goldring, J. E. (2003). Accessing the community: Gaining insider perspectives from the outside. Qualitative Health Research, 13, 578–589.
Solberg, U. (2012). Websites as a source of new drugs/legal highs. Recreational Drugs European Network (RedNet News), 8.

Stake, R. E. (2005). Qualitative case studies. In N. K. Denzin, & Y. S. Lincoln (Eds.), The Sage handbook of qualitative research (3rd ed., pp. 443–466). Thousand Oaks, CA: Sage.
Stimson, G. V. (1995). An environmental approach to reducing drug-related harm. In J. W. T. Dickerson, & G. V. Stimson (Eds.), Health in the inner city: Drugs in the city. London: Royal Society of Health.
Stockdale, S. E., Wells, K. B., Tang, L., Belin, T. R., Zhang, L., & Sherbourne, C. D. (2007). The importance of social context: Neighborhood stressors, stress-buffering mechanisms, and alcohol, drug, and mental health disorders. Social Science and Medicine, 9, 1867–1881.
Sumnall, H., Evans-Brown, M., & McVeigh, J. (2011). Social, policy, and public health perspectives on new psychoactive substances. Drug Testing and Analysis, 3, 515–523.
Tart, C. (1971). On being stoned: A psychological study of marijuana intoxication. Palo Alto, CA: Science & Behaviour.
Tart, C. (Ed.). (1972). Altered states of consciousness. NY: Doubleday.
Thomas, G. (2011). A typology for the case study in social science following a review of definition, discourse and structure. Qualitative Inquiry, 17(6), 511–521.
Tor Project. (2011). Anonymity online. Retrieved from http://www.torproject.org/ (20.09.12).
Turner, D. M. (1994). The essential psychedelic guide. San Francisco: Panther Press.
Vardakou, I. (2011). Drugs for youth via Internet and the example of mephedrone. Toxicology Letters, 201, 191–195.
Wax, P. M. (2002). Just a click away: Recreational drug web sites on the Internet. Pediatrics, 109, e96.
Winstock, A. R., Marsden, J., & Mitcheson, I. (2010). What should be done about mephedrone. British Medical Journal, 340, c1605.
Wood, D. M., Davies, S., Puchnarewicz, M., Button, J., Archer, R., Ovaska, H., et al. (2010). Recreational use of mephedrone (4-methylmethcathinone, 4-MMC) with associated sympathomimetic toxicity. Journal of Medical Toxicology, 6, 327–330.
Yin, R. K. (1984). Case study research: Design and methods. Beverly Hills, CA: Sage.
Yin, R. K. (2003). Case study research: Design and methods (3rd ed.). Thousand Oaks, CA: Sage.
Zinberg, N. (1984). Drug, set and setting: The basis for controlled intoxicant use. New Haven: Yale University Press.


ACCESSING THE DEEP WEB

Attempting to locate and quantify material on the Web that is hidden from typical search techniques.

By BIN HE, MITESH PATEL, ZHEN ZHANG, and KEVIN CHEN-CHUAN CHANG

The Web has been rapidly deepened by massive databases online, and current search engines do not reach most of the data on the Internet [4]. While the surface Web has linked billions of static HTML pages, a far more significant amount of information is believed to be hidden in the deep Web, behind the query forms of searchable databases, as Figure 1(a) conceptually illustrates. Such information may not be accessible through static URL links because it is assembled into Web pages as responses to queries submitted through the query interface of an underlying database. Because current search engines cannot effectively crawl databases, such data remains largely hidden from users (thus often also referred to as the invisible or hidden Web). Using overlap analysis between pairs of search engines, it was estimated in [1] that 43,000–96,000 deep Web sites and an informal estimate of 7,500 terabytes of data exist, 500 times larger than the surface Web.
Illustration by PETER HOEY

May 2007/Vol. 50, No. 5 COMMUNICATIONS OF THE ACM


Figure 1a. The conceptual view of the deep Web. [Figure omitted; it depicts the query forms of example sites such as Cars.com, Amazon.com, Biography.com, Apartments.com, 401carfinder.com, and 411locate.com.]

With its myriad databases and hidden content, this deep Web is an important yet largely unexplored frontier for information search. While we have understood the surface Web relatively well, with various surveys [3, 7], how is the deep Web different? This article reports our survey of the deep Web, studying the scale, subject distribution, search-engine coverage, and other access characteristics of online databases.

We note that, while the study conducted in 2000 [1] established interest in this area, it focused on only the scale aspect, and its result from overlap analysis tends to underestimate (as acknowledged in [1]). In overlap analysis, the number of deep Web sites is estimated by exploiting two search engines. If we find $n_a$ deep Web sites in the first search engine, $n_b$ in the second, and $n_0$ in both, we can estimate the total number as shown in Equation 1 by assuming the two search engines randomly and independently obtain their data. However, as our survey found, search engines are highly correlated in their coverage of deep Web data (see question Q5 in this survey). Therefore, such an independence assumption seems rather unrealistic, in which case the result is significantly underestimated. In fact, the violation of this assumption and its consequence were also discussed in [1].

Equation 1. $\dfrac{n_a \times n_b}{n_0}$

Our survey took the IP sampling approach to collect random server samples for estimating the global scale as well as facilitating subsequent analysis. During April 2004, we acquired and analyzed a random sample of Web servers by IP sampling. We randomly sampled 1,000,000 IPs (from the entire space of 2,230,124,544 valid IP addresses, after removing reserved and unused IP ranges according to [8]). For each IP, we used an HTTP client, the GNU free software wget [5], to make an HTTP connection to it and download HTML pages. We then identified and analyzed Web databases in this sample, in order to extrapolate our estimates of the deep Web.

Our survey distinguishes three related notions for accessing the deep Web: site, database, and interface. A deep Web site is a Web server that provides information maintained in one or more back-end Web databases, each of which is searchable through one or more HTML forms as its query interfaces. For instance, as Figure 1(b) shows, bn.com is a deep Web site, providing several Web databases (a book database, a music database, among others) accessed via multiple query interfaces ("simple search" and "advanced search"). Note that our definition of deep Web site did not account for the virtual hosting case, where multiple Web sites can be hosted on the same physical IP address. Since identifying all the virtual hosts within an IP address is rather difficult to conduct in practice, we do not consider such cases in our survey. Our IP sampling-based estimation is thus accurate modulo the effect of virtual hosting.

Figure 1b. Site, databases, and interface. [Figure omitted; it shows the deep Web site bn.com with its book and music databases and their simple and advanced search interfaces.]

When conducting the survey, we first find the number of query interfaces for each Web site, then the number of Web databases, and finally the number of deep Web sites.

First, as our survey specifically focuses on online databases, we differentiate and exclude non-query HTML forms (which do not access back-end databases) from query interfaces. In particular, HTML forms for login, subscription, registration, polling, and message posting are not query interfaces. Similarly, we also exclude "site search," which many Web sites now provide for searching HTML pages on their sites. These pages are statically linked at the surface of the sites; they are not dynamically assembled from an underlying database. Note that our survey considered only unique interfaces and removed duplicates; many Web pages contain the same query interfaces repeatedly, for example, in bn.com, the simple book search in Figure 1(b) is present in almost all pages.

Second, we survey Web databases and deep Web sites based on the discovered query interfaces. Specifically, we compute the number of Web databases by finding the set of query interfaces (within a site) that refer to the same database. In particular, for any two query interfaces, we randomly choose five objects from one and search for them in the other. We judge that the two interfaces are searching the same database if and only if the objects from one interface can always be found in the other one. Finally, the recognition of a deep Web site is rather simple: A Web site is a deep Web site if it has at least one query interface.

RESULTS
(Q1) Where to find entrances to databases? To access a Web database, we must first find its entrances: the query interfaces. Where in a site is an interface (if any) located, that is, at which depth? For each query interface, we measured the depth as the minimum number of hops from the root page of the site to the interface page.¹ As this study required deep crawling of Web sites, we analyzed one-tenth of our total IP samples: a subset of 100,000 IPs. We tested each IP sample by making HTTP connections and found 281 Web servers. Exhaustively crawling these servers to depth 10, we found 24 of them are deep Web sites, which contained a total of 129 query interfaces representing 34 Web databases.

¹ Such depth information is obtained by a simple revision of the wget software.
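The depth measure used in Q1, the minimum number of hops from a site's root page to the page holding a query interface, is a shortest-path distance, which a breadth-first search computes directly. The sketch below is illustrative only: it assumes the site's link structure has already been collected into an in-memory dictionary, and the names `min_depths` and `links` and the example pages are hypothetical rather than taken from the survey, which relied on a modified wget.

```python
from collections import deque

def min_depths(links, root, max_depth=10):
    """Return {page: minimum number of hops from root}, up to max_depth.

    `links` maps each page to the pages it links to (hypothetical,
    pre-crawled data; the survey itself used a modified wget).
    """
    depths = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        if depths[page] == max_depth:
            continue  # do not expand pages at the crawl limit
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest path
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Toy site: the search form is reachable in 2 hops and also in 3.
links = {
    "/": ["/books", "/about"],
    "/books": ["/books/search"],
    "/about": ["/help"],
    "/help": ["/books/search"],
}
depths = min_depths(links, "/")
print(depths["/books/search"])  # 2
```

Because breadth-first search visits pages in order of increasing distance, the first time a page is reached already gives its minimum depth, so longer alternative paths are ignored.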

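The overlap analysis of [1] discussed earlier can also be illustrated numerically. The following is a minimal sketch of the estimator in Equation 1 under its independence assumption; the function name and all counts are made up for illustration and are not figures from the survey or from [1].

```python
def overlap_estimate(n_a, n_b, n_0):
    """Estimate the total number of sites from two samples and their overlap.

    Assumes the two search engines sample sites independently, so the
    share of engine A's sites also found by B (n_0 / n_a) equals B's
    overall coverage (n_b / N), which gives N = n_a * n_b / n_0.
    """
    if n_0 == 0:
        raise ValueError("no overlap: the estimate is undefined")
    return n_a * n_b / n_0

# Hypothetical counts, not figures from the survey or from [1].
independent = overlap_estimate(n_a=400, n_b=300, n_0=120)
# If both engines favor the same sites, the overlap grows and the
# estimate shrinks, even though the true total is unchanged.
correlated = overlap_estimate(n_a=400, n_b=300, n_0=240)
print(independent, correlated)  # 1000.0 500.0
```

The second call shows the direction of the bias noted in the text: when coverage is positively correlated, the observed overlap is larger than independence would predict, and the estimate falls below the true total.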


25%

Proportion of Web Databases

Proportion of Web Databases

30%
25%
20%
15%
10%
5%
0%
0

20%

15%

10%

5%

0%

10

be

Depth

ci nm en rs he go rg sc ed ah

si

re ot

Subject Categories

We found that query inter- and the number of query inter- Figure 2b. Distribution
databases over
faces
tend to
locatepicas
shallowly in faces as shown in Equation 4 (the of Websubject
category.
2a (5/07)
- 19.5
He
fig
2b
(5/07)
19.5
picas
their sites: none of the 129 query results are rounded to 1,000).
interfaces had depth deeper than The second and third columns of
5. To begin with, 72% (93 out of 129) interfaces Table 1 summarize the sampling and the estimawere found within depth 3. Further, since a Web tion results
respectively. We also compute the con25%
30%
database may be accessed through multiple interfaces, we measured its depth as the minimum depths of all its interfaces: 94% (32 out of 34) Web databases appeared within depth 3; Figure 2(a) reports the depth distribution of the 34 Web databases. Finally, 91.6% (22 out of 24) deep Web sites had their databases within depth 3. (We refer to these ratios as depth-three coverage, which will guide our further larger-scale crawling in Q2.)

Figure 2a. Distribution of Web databases over depth. (x-axis: depth; y-axis: proportion of Web databases.)

(Q2) What is the scale of the deep Web? We then tested and analyzed all of the 1,000,000 IP samples to estimate the scale of the deep Web. As just identified, with the high depth-three coverage, almost all Web databases can be identified within depth 3. We thus crawled to depth 3 for these one million IPs.

The crawling found 2,256 Web servers, among which we identified 126 deep Web sites, which contained a total of 406 query interfaces representing 190 Web databases. Extrapolating from the s = 1,000,000 unique IP samples to the entire IP space of t = 2,230,124,544 IPs, and accounting for the depth-three coverage, we estimate the number of deep Web sites as shown in Equation 2, the number of Web databases as shown in Equation 3, and the number of query interfaces as shown in Equation 4 (the results are rounded to 1,000). The second and third columns of Table 1 summarize the sampling and the estimation results respectively. We also compute the confidence interval of each estimated number at 99% level of confidence, as the 4th column of Table 1 shows, which evidently indicates the scale of the deep Web is well on the order of 10^5 sites. We also observed the multiplicity of access on the deep Web. On average, each deep Web site provides 1.5 databases, and each database supports 2.8 query interfaces.

Equations 2, 3, and 4: estimates of the number of deep Web sites, Web databases, and query interfaces, respectively.

The earlier survey of [1] estimated 43,000 to 96,000 deep Web sites by overlap analysis between pairs of search engines. Although [1] did not explicitly qualify what it measured as a search site, by comparison, it still indicates that our estimation of the scale of the deep Web (on the order of 10^5 sites), is quite accurate. Further, it has been expanding, resulting in a 3-7 times increase in the four years from 2000-2004.

(Q3) How structured is the deep Web? While information on the surface Web is mostly unstructured HTML text (and images), how is the nature of the deep Web data different? We classified Web databases into two types: unstructured databases, which provide data objects as unstructured media (text, images, audio, and video); and structured databases, which provide data objects as structured relational records with attribute-value pairs.
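The extrapolation described under Q2 can be checked with a short calculation. The sketch below is a reconstruction under stated assumptions, not the authors' code: it assumes each estimate is the sample count scaled by t/s and divided by the corresponding depth-three coverage (91.6% for sites, 94% for databases, 72% for interfaces, as printed above), and it assumes the 99% intervals follow from a normal approximation in which the standard error of a sample count is its square root. Neither formula is spelled out in the surviving text, and all function and variable names are illustrative.

```python
import math

S = 1_000_000        # sampled IPs (s in the text)
T = 2_230_124_544    # size of the IP space considered (t in the text)
Z_99 = 2.576         # two-sided 99% normal quantile (assumed interval method)

# (sample count, depth-three coverage) as reported in the text
SAMPLES = {
    "deep Web sites": (126, 0.916),
    "Web databases": (190, 0.94),
    "query interfaces": (406, 0.72),
}


def estimate(count, coverage):
    """Scale a sample count to the whole IP space, correcting for coverage."""
    return count * (T / S) / coverage


def interval_99(count, coverage):
    """Assumed method: normal approximation with std. error sqrt(count)."""
    half_width = Z_99 * math.sqrt(count) * (T / S) / coverage
    centre = estimate(count, coverage)
    return centre - half_width, centre + half_width


def round_1000(x):
    """Round to the nearest thousand, as the article does."""
    return int(round(x / 1000.0)) * 1000


for name, (count, coverage) in SAMPLES.items():
    low, high = interval_99(count, coverage)
    print(name, round_1000(estimate(count, coverage)),
          round_1000(low), round_1000(high))
```

Under these assumptions the point estimates and interval bounds come out within about 1% of the values the article reports (307,000 sites, 450,000 databases, 1,258,000 interfaces); the remaining differences are consistent with rounding of the coverage percentages.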

For instance, cnn.com has an unstructured database of news articles, while amazon.com has a structured database for books, which returns book records (for example, title = "gone with the wind", format = "paperback", price = "$7.99").

By manual querying and inspection of the 190 Web databases sampled, we found 43 unstructured and 147 structured. We similarly estimate their total numbers to be 102,000 and 348,000 respectively, as summarized in Table 1. Thus, the deep Web features mostly structured data sources, with a dominating ratio of 3.4:1 versus unstructured sources.

Table 1. Sampling and estimation of the deep Web scale.

                     Sampling Results   Total Estimate   99% Confidence Interval
Deep Web sites       126                307,000          236,000 - 377,000
Web databases        190                450,000          366,000 - 535,000
  unstructured       43                 102,000          62,000 - 142,000
  structured         147                348,000          275,000 - 423,000
Query interfaces     406                1,258,000        1,097,000 - 1,419,000

(Q4) What is the subject distribution of Web databases? With respect to the top-level categories of the yahoo.com directory as our taxonomy, we manually categorized the sampled 190 Web databases. Figure 2(b) shows the distribution of the 14 categories: Business & Economy (be), Computers & Internet (ci), News & Media (nm), Entertainment (en), Recreation & Sports (rs), Health (he), Government (go), Regional (rg), Society & Culture (sc), Education (ed), Arts & Humanities (ah), Science (si), Reference (re), and Others (ot).

The distribution indicates great subject diversity among Web databases, indicating the emergence and proliferation of Web databases are spanning well across all subject domains. While there seems to be a common perception that the deep Web is driven and dominated by e-commerce (for example, for product search), our survey indicates the contrary. To contrast, we further identify non-commerce categories from Figure 2(b) (he, go, rg, sc, ed, ah, si, re, ot), which together occupy 51% (97 out of 190 databases), leaving only a slight minority of 49% to the rest of commerce sites (broadly defined). In comparison, the subject distribution of the surface Web, as characterized in [7], showed that commerce sites dominated with an 83% share. Thus, the trend of "deepening" emerges not only across all areas, but also relatively more significantly in the non-commerce ones.

(Q5) How do search engines cover the deep Web? Since some deep Web sources also provide "browse" directories with URL links to reach the hidden content, how effective is it to "crawl-and-index" the deep Web as search engines do for the surface Web? We thus investigated how popular search engines index data on the deep Web. In particular, we chose the three largest search engines Google (google.com), Yahoo (yahoo.com), and MSN (msn.com).

We randomly selected 20 Web databases from the 190 in our sampling result. For each database, first, we manually sampled five objects (result pages) as test data, by querying the source with some random words. We then, for each object collected, queried every search engine to test whether the page was indexed by formulating queries specifically matching the object page. (For instance, we used distinctive phrases that occurred in the object page as keywords and limited the search to only the source site.)

Figure 3. Coverage of search engines. (Relative to the entire deep Web at 100%: Google.com 32%, Yahoo.com 32%, MSN.com 11%, all three combined 37%.)

Figure 3 reports our finding: Google and Yahoo both indexed 32% of the deep Web objects, and MSN had the smallest coverage of 11%. However, there was significant overlap in what they covered: the combined coverage of the three largest search engines increased only to 37%, indicating they were indexing almost the same objects. In particular, as Figure 3 illustrates, Yahoo and Google overlapped
on 27% objects of their 32% coverage: an 84% overlap. Moreover, MSN's coverage was entirely a subset of Yahoo, and thus a 100% overlap.

The coverage results reveal some interesting phenomena. On one hand, in contrast to the common perception, the deep Web is probably not inherently hidden or invisible: the major search engines were able to each index one-third (32%) of the data. On the other hand, however, the coverage seems bounded by an intrinsic limit. Combined, these major engines covered only marginally more than they did individually, due to their significant overlap. This phenomenon clearly contrasts with the surface Web where, as [7] reports, the overlap between engines is low, and combining them (or metasearch) can greatly improve coverage. In this case, for the deep Web, the fact that 63% objects were not indexed by any engines indicates certain inherent barriers for crawling and indexing data. Most Web databases remain "invisible," providing no link-based access, and are thus not indexable by current crawling techniques; and even when crawlable, Web databases are rather dynamic, and thus crawling cannot keep up with their updates.

(Q6) What is the coverage of deep Web directories? Besides traditional search engines, several deep Web portal services have emerged online, providing deep Web directories that classify Web databases in some taxonomies. To measure their coverage, we surveyed four popular deep Web directories, as summarized in Table 2. For each directory service, we recorded the number of databases it claimed to have indexed (on their Web sites). As a result, completeplanet.com was the largest such directory, with over 70,000 databases (see footnote 2). As shown in Table 2, compared to our estimate, it covered only 15.6% of the total 450,000 Web databases. However, other directories covered even less, in the limited range of 0.2%-3.1%. We believe this extremely low coverage suggests that, with their apparently manual classification of Web databases, such directory-based indexing services can hardly scale for the deep Web.

Footnote 2. However, we noticed that completeplanet.com also indexed "site search," which we have excluded; thus, its coverage could be overestimated.

Table 2. Coverage of deep-Web directories.

Directory             Number of Web Databases   Coverage
completeplanet.com    70,000                    15.6%
lii.org               14,000                    3.1%
turbo10.com           2,300                     0.5%
invisible-web.net     1,000                     0.2%

CONCLUSION
For further discussion, we summarize the findings of this survey for the deep Web in Table 3 and make the following conclusions. While important for information search, the deep Web remains largely unexplored and is currently neither well supported nor well understood. The poor coverage of both its data (by search engines) and databases (by directory services) suggests that access to the deep Web is not adequately supported. In seeking to better understand the deep Web, we've determined that in some aspects it resembles the surface Web: it is large, fast-growing, and diverse. However, they differ in other aspects: the deep Web is more diversely distributed, is mostly structured, and suffers an inherent limitation of crawling.

Table 3. Summary of findings in our survey.

Aspect: Findings
scale: The deep Web is of a large scale of 307,000 sites, 450,000 databases, and 1,258,000 interfaces. It has been rapidly expanding, with 3-7 times increase between 2000-2004.
diversity: The deep Web is diversely distributed across all subject areas. Although e-commerce is a main driving force, the trend of "deepening" emerges not only across all areas, but also relatively more significantly in the non-commerce ones.
structural complexity: Data sources on the deep Web are mostly structured, with a 3.4 ratio outnumbering unstructured sources, unlike the surface Web.
depth: Web databases tend to locate shallowly in their sites; the vast majority of 94% can be found at the top-3 levels.
search engine coverage: The deep Web is not entirely hidden from crawling: major search engines cover about one-third of the data. However, there seems to be an intrinsic limit of coverage: search engines combined cover roughly the same data, unlike the surface Web.
directory coverage: While some deep-Web directory services have started to index databases on the Web, their coverage is small, ranging from 0.2% to 15.6%.

To support effective access to the deep Web, although the crawl-and-index techniques widely used in popular search engines have been quite successful for the surface Web, such an access model may not be appropriate for the deep Web. Crawling will likely encounter the limit of coverage, which seems intrinsic because of the hidden and dynamic nature of Web databases. Further, indexing the crawled data will likely face the barrier of structural heterogeneity across the wide range of deep Web data. The current keyword-based indexing (which all search engines do), while serving the surface Web pages well, will miss the schematic structure available in most Web databases. This situation is analogous to being limited to searching for flight tickets by keywords only; not destinations, dates, and prices.
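The flight-ticket analogy can be made concrete with a small sketch. Everything in it is a hypothetical illustration rather than part of any system the article describes: the `Flight` records, their field names, and both search functions are invented for the example. `keyword_search` treats each record as a bag of words, roughly as a keyword index over crawled result pages would, while `attribute_search` filters on the structured fields that a Web database's own query interface exposes.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Flight:
    origin: str
    destination: str
    depart: date
    price: float


# Hypothetical sample records.
FLIGHTS = [
    Flight("Chicago", "Boston", date(2007, 5, 14), 189.0),
    Flight("Boston", "Chicago", date(2007, 5, 14), 99.0),
    Flight("Chicago", "Boston", date(2007, 6, 2), 120.0),
]


def keyword_search(flights, words):
    """Match records whose flattened text contains every keyword."""
    hits = []
    for f in flights:
        text = f"{f.origin} {f.destination} {f.depart.isoformat()} {f.price}".lower()
        if all(w.lower() in text for w in words):
            hits.append(f)
    return hits


def attribute_search(flights, destination, depart, max_price):
    """Filter on the structured attributes of each record."""
    return [
        f for f in flights
        if f.destination == destination
        and f.depart == depart
        and f.price <= max_price
    ]


by_keyword = keyword_search(FLIGHTS, ["chicago", "boston"])
by_attribute = attribute_search(FLIGHTS, "Boston", date(2007, 5, 14), 200.0)
print(len(by_keyword), len(by_attribute))
```

The keyword query cannot tell origin from destination or express a price bound, so it matches all three sample records; the attribute query returns only the flight to Boston on the requested date.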
As traditional access techniques may not be
appropriate for the deep Web, it is crucial to develop
more effective techniques. We speculate that the
deep Web will likely be better served with a database-centered, discover-and-forward access model. A
search engine will automatically discover databases
on the Web by crawling and indexing their query
interfaces (and not their data pages). Upon user
querying, the search engine will forward users to the
appropriate databases for the actual search of data.
Querying the databases will use their data-specific
interfaces and thus fully leverage their structures. To
use the previous analogy of searching for flight information, we can now query flights with the desired
attributes. Several recent research projects, including
MetaQuerier [2] and WISE-Integrator [6], are
exploring this exciting direction.
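The discover-and-forward model described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the design of MetaQuerier or WISE-Integrator: the `QueryInterface` record, the `discover` and `forward` functions, and the `.example` addresses are all hypothetical. The sketch assumes that discovery has already extracted the attribute names each query form accepts, and that forwarding simply selects the interfaces that accept every attribute in the user's query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryInterface:
    url: str               # hypothetical address of a query form
    attributes: frozenset  # attribute names the form accepts


def discover(interfaces):
    """Index query interfaces by attribute name (the 'discover' step)."""
    index = {}
    for qi in interfaces:
        for attr in qi.attributes:
            index.setdefault(attr, set()).add(qi)
    return index


def forward(index, query):
    """Return URLs of interfaces that accept every attribute in the query."""
    candidate_sets = [index.get(attr, set()) for attr in query]
    if not candidate_sets:
        return []
    matches = set.intersection(*candidate_sets)
    return sorted(qi.url for qi in matches)


INTERFACES = [
    QueryInterface("flights.example/search",
                   frozenset({"origin", "destination", "date", "price"})),
    QueryInterface("books.example/search",
                   frozenset({"title", "author", "price"})),
]

index = discover(INTERFACES)
targets = forward(index, {"destination": "Boston", "date": "2007-05-14"})
print(targets)
```

Only the interfaces are indexed here, not the data behind them; the actual search would then run on the selected database through its own form, which is the point of forwarding.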
References
1. BrightPlanet.com. The deep Web: Surfacing hidden value; brightplanet.com/resources/details/deepweb.html.
2. Chen-Chuan Chang, K., He, B., and Zhang, Z. Toward large scale integration: Building a metaquerier over databases on the Web. In Proceedings of the 2nd CIDR Conference, 2005.
3. Fetterly, D., Manasse, M., Najork, M., and Wiener, J. A large-scale study of the evolution of Web pages. In Proceedings of the 12th International World Wide Web Conference, 2004, 669-678.
4. Ghanem, T.M. and Aref, W.G. Databases deepen the Web. IEEE Computer 73, 1 (2004), 116-117.
5. GNU. wget; www.gnu.org/software/wget/wget.html.
6. He, H., Meng, W., Yu, C., and Wu, Z. Wise-integrator: An automatic integrator of Web search interfaces for e-commerce. In Proceedings of the 29th VLDB Conference, 2003.
7. Lawrence, S. and Giles, C.L. Accessibility of information on the Web. Nature 400, 6740 (1999), 107-109.
8. O'Neill, E., Lavoie, B., and Bennett, R. Web characterization; wcp.oclc.org.

Bin He (binhe@uiuc.edu) is a research staff member at IBM
Almaden Research Center in San Jose, CA.
Mitesh Patel (miteshp@microsoft.com) is a developer at Microsoft
Corporation.
Zhen Zhang (zhang2@uiuc.edu) is a graduate research assistant in
computer science at the University of Illinois at Urbana-Champaign.
Kevin Chen-Chuan Chang (kcchang@cs.uiuc.edu) is an
assistant professor of computer science at the University of Illinois at
Urbana-Champaign.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for
profit or commercial advantage and that copies bear this notice and the full citation on
the first page. To copy otherwise, to republish, to post on servers or to redistribute to
lists, requires prior specific permission and/or a fee.

© 2007 ACM 0001-0782/07/0500 $5.00

COMMUNICATIONS OF THE ACM May 2007/Vol. 50, No. 5



Eur. Phys. J. B 73, 633-643 (2010)


DOI: 10.1140/epjb/e2010-00039-0

THE EUROPEAN
PHYSICAL JOURNAL B

Regular Article

Dynamics of hate based Internet user networks


P. Sobkowicz (a) and A. Sobkowicz
KEN 94/140, 02-777 Warsaw, Poland
Received 23 July 2009 / Received in final form 6 December 2009
Published online 2 February 2010 - © EDP Sciences, Società Italiana di Fisica, Springer-Verlag 2010
Abstract. We present a study of the properties of a network of political discussions on one of the most popular
Polish Internet forums. This provides the opportunity to study computer-mediated human interactions
in a strongly bipolar environment. The comments of the participants are found to be mostly disagreements,
with a large percentage of invective and provocative ones. Binary exchanges (quarrels) play a significant role
in the network growth and topology. Statistical analysis shows that the growth of the discussions depends
on the degree of controversy of the subject and the intensity of personal conflict between the participants.
This is in contrast to most previously studied social networks, for example networks of scientific citations,
where the nature of the links is much more positive and based on similarity and collaboration rather than
opposition and abuse. The work also discusses the implications of the findings for more general studies of
consensus formation, where our observations of increased conflict contradict the usual assumptions that
interactions between people lead to averaging of opinions and agreement.

1 Introduction
Most studies of social networks concentrate on properties of groups formed due to attraction among participants.
In such situations the links form between actors sharing some similarity, for example a common interest (e.g.
scientific collaboration networks or cross-linked Internet sites) or likeness of views (e.g. political associations).
Differences and conflicts are viewed as limitations and barriers to network formation. A frequent reaction to meeting
someone who holds an opposing view is not an attempt to convince (a phenomenon widely assumed in
consensus formation models), but rather cutting off the connection. In face-to-face encounters this avoidance
limits the growth of networks based on contrariness. Perhaps the best known form of links between communities based
on hate is provided by long-term family or tribal feuds, and these are usually limited in scope. However, the advance
of modern technologies has provided opportunities for indirect contacts, where it is possible to express hate and
aggression without the risk of reciprocal physical injury or personal danger. This "bravery of being out of range"
allows hate based networks to form and flourish. In this paper we present a study of specific communities that grow
thanks to disagreements, without any attempt to find consensus. Despite the fundamental motivational difference, some
properties of the studied groups are similar to the more common "friendship" networks.

e-mail: pawelsobko@gmail.com

The system we study is based on records of linked
user comments related to news items published on the
Internet. Technology provides the necessary ease of access and anonymity. From the research point of view, such
discussions are relatively easy to document and can provide the data needed for meaningful statistical work. We
have chosen discussion forums at one of the most popular
Internet portals and news sites in Poland, http://www.gazeta.pl.
We have limited our research to discussions
spurred by the Politics subsection of the news. The current
situation in Poland makes it an almost ideal ground for
such a study. There is an almost clear bipolar split between
the two main parties (Platforma Obywatelska, PO, and
Prawo i Sprawiedliwość, PiS). The conflict shown at the
highest positions of the state is even more visible in the
group of active readers of Internet portals.
The reason for choosing this particular forum is that
while preserving anonymity of the users, it also provides
relative recognizability of participants. This results from
the fact that only registered users are allowed to post
comments, and can be identified by registered nicknames.
We may assume that participant XYZ in one discussion
thread is the same person as participant XYZ in a different one. Of course this leaves open the possibility of
a single real person using multiple Internet personalities.
However, even with this limit on true identities, it is possible to try to find hubs of communication, both in the
comment writing and in reaction to published comments.
Our goal is to find if the change in motivations for
linking from positive to negative influences general network properties.

2 Methods
The data for the study have been gathered using a dedicated program, written for the purpose of loading and initial analysis of the discussion threads at the selected site.

The European Physical Journal B

The program performed automated tasks of data collection and cleaning and enabled the next step, which consisted of assigning a political stance to discussion participants and classifying the comments. This part of
the analysis, by far the most time consuming, had to be done
by a human, by reading all the comments in a thread.
It should be noted that in almost all cases (with a single exception only), the whole discussions were actually
linked not to the original news article but to the first comment. This is a result of the operational process of the portal,
and the fact that pushing the comment button in typical
situations links the post not to the original source but to
the earliest existing post. To avoid spurious statistics this
phenomenon has been corrected for by the program.
The participants were assigned three possible types
(called nodeclass). For commentators whose viewpoints
were visibly in agreement with one of the political factions
we assigned nodeclasses A and B. The remaining participants, for whom it was impossible to clearly determine
political views, were given nodeclass NA.
The comments were classified according to the following scheme:
Agr - comment agrees with the covered material (either
the original news coverage or the preceding comment
in a thread);
Dis - comment disagrees with the covered material;
Inv - comment is a direct invective and personal abuse of
the previous commentator;
Prv - provocation: comment is aimed at causing dissent,
often only weakly related to the topic of discussion;
Neu - comment is neutral in nature, neither in obvious
agreement nor disagreement;
Jst - "just stupid" comment, which is totally unrelated to
the topic of discussion, but without malicious intent;
Swi - comment signifying a switch in a participant's position leading to agreement between two previously opposing commentators.
Other works on computer mediated discussions in closed
communities used different message classification schemes,
for example Jeong [1,2] has proposed grouping comments
by categories such as Arg, an argument for a given thesis
(corresponding to our Agr category); But, a challenge
(corresponding to our Dis category); Expl, for posts giving explanations; and Evid, for posts giving factual evidence. In our case, the explanations and evidence posts
were rather scarce, due to the political nature of the disputes.
We have therefore opted for a categorization that reflected
the emotional nature of communication, rather than the factual one.
Following the process of categorization we performed
standard analyses typical for network systems. As the literature on the subject is very rich we refer here to the
general overviews, for example [3–6]. It should be noted
that the average size of the network formed by posts related to a single news item was relatively small (from a few
tens to a few hundreds of comments), thus the statistical
spread of results for single discussion threads was rather
significant.

In our analysis we have used the publicly available program GUESS, developed and maintained by Eytan Adar
(http://graphexploration.cond.org/index.html, see
also [7]).
To understand the data we have developed a computer simulation model, which has resulted in quantitatively comparable system characteristics, allowing us to understand the role of the most important factors driving
the growth of comment networks. Details of the model
are given in Section 3.5. The programs and scripts used in
the analysis are available from the authors on request.

3 Results
3.1 General statistics of discussions and temporal dynamics
The statistical properties of discussion threads depend,
obviously, on the visibility of the news stories they relate
to. Some of the news items are featured on the portal opening page, so one would expect that these should receive a
greater number of user comments. Our observations do not
confirm these expectations: the advantage of the front
page news is not significant. The users' activity does not
follow the editors' choices. Within the Politics category the
portal carries between 5 and 20 news items each day. On
a typical graphics display, a visitor sees the 4–6 most recent
news items (although the web page also has a "most commented" section, providing a short cut to older, but popular stories). While the screen space and graphical clues
give no preferences (order of presentation is strictly temporal), the number of comments spurred by each story
varies significantly, depending on its content.
The discussion size distribution shows a definite fat-tail behaviour. In addition to news items that raise no commentary at all, and weakly commented ones (below 50–70 comments), there are quite a few mid-sized discussions (up to
about 200 comments) and occasional extended discussions
(between 200 and 500 posts).
The news related discussions are, by their very nature,
short-lived. While the portal allows users to view and comment on
stories dating back more than 2 weeks, the comment frequency vanishes rapidly with time. Usually, there are very
few comments later than 24 h after publication, and practically none after 48 h. Numerical analysis of the threads
shows for many discussions a reasonably good fit with an exponential decay timeline, with a half-life of between 1 and
4 h. There are some exceptions, for example news which
gain popularity many hours after publication (this happens usually for stories published at night and commented on
during business hours) or stories which get a "second life"
due to a quarrel between a few participants.
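As a rough illustration of what a half-life of 1 to 4 h implies, the sketch below computes the fraction of the initial comment rate remaining after a given time and estimates a half-life from comment delays. It assumes a pure exponential decay; the function names and the sample delays are hypothetical and are not taken from the authors' scripts.

```python
import math

def fraction_remaining(hours, half_life_hours):
    """Fraction of the initial comment rate left after `hours`,
    assuming pure exponential decay with the given half-life."""
    return 0.5 ** (hours / half_life_hours)

def estimate_half_life(delays_hours):
    """Estimate the half-life from comment delays (hours after publication).
    For an exponential distribution the mean delay equals 1/lambda,
    so the half-life is ln(2) times the mean delay."""
    mean_delay = sum(delays_hours) / len(delays_hours)
    return math.log(2) * mean_delay

# Slowest decay quoted in the text (half-life 4 h): share of the initial rate left.
after_24h = fraction_remaining(24, 4)   # 0.5 ** 6
after_48h = fraction_remaining(48, 4)   # 0.5 ** 12

# Hypothetical delays, for illustration only.
sample_delays = [0.5, 1.0, 2.0, 3.5, 8.0]
half_life = estimate_half_life(sample_delays)
```

Even for the slowest decay quoted above, the comment rate after 24 h is 0.5^6, i.e. under 2% of its initial value, which is consistent with the observation that very few comments arrive after the first day.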
It is important to note the difference between the time
scales typical for individual news items and related comments and the time scales of user activities and interactions.
While the comments have "on the spot", non-deliberative
characteristics, the user relationship network has persistence time scales of at least the duration of recorded observations (2 months). The interplay between the short-lived

P. Sobkowicz and A. Sobkowicz: Dynamics of hate based Internet user networks

comment network and long-lived human relationships is
one of the differentiating elements of the studied system.
3.2 User and comment statistics
Our first task was to study basic network properties,
namely the activity of users measured by their indegree
and outdegree statistics. It should be stressed here that
the comment structure grows by addition of unique post-to-post links. Translating this post-to-post network to a
user-to-user one introduces, by necessity, multiple connections between users. For the purpose of calculating the indegree and outdegree we shall treat these connections as
separate. Thus, the outdegree $k_o$ of a user, which corresponds
to the number of comments he or she posts in a discussion (or cumulatively in many discussions), measures
the productivity. We have attempted to fit the outdegree
distributions for mid-sized and large discussions (where
such measurements can be meaningful) with a modified
power law $P(k_o) = A(k_o + c_o)^{-\gamma_o}$. Such a function has been
used previously by Newman et al. [8] in their analysis of
the connectivity of internet sites. For most discussions the
value of the $c_o$ constant is rather small ($|c_o| < 1$).
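The fitting procedure is not described in the text. One simple way to fit such a shifted power law, sketched here under that caveat, is to scan trial values of the shift c and, for each, fit log P against log(k + c) by ordinary least squares. The function names, the grid of shifts and the synthetic data are illustrative assumptions, not the authors' method.

```python
import math

def shifted_power_law(k, A, c, gamma):
    """Modified power law P(k) = A * (k + c) ** (-gamma)."""
    return A * (k + c) ** (-gamma)

def fit_shifted_power_law(ks, ps, c_grid):
    """Return (A, c, gamma) minimising the squared error in log space.

    For each trial shift c, log P = log A - gamma * log(k + c) is a straight
    line, so A and gamma follow from ordinary least squares."""
    ys = [math.log(p) for p in ps]
    n = len(ks)
    mean_y = sum(ys) / n
    best = None
    for c in c_grid:
        xs = [math.log(k + c) for k in ks]   # requires k + c > 0
        mean_x = sum(xs) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, math.exp(intercept), c, -slope)
    return best[1], best[2], best[3]

# Synthetic, noise-free data generated with hypothetical parameters.
ks = list(range(1, 51))
ps = [shifted_power_law(k, 2.0, 0.5, 1.89) for k in ks]
c_grid = [i / 10 for i in range(-9, 31)]   # trial shifts from -0.9 to 3.0
A_fit, c_fit, gamma_fit = fit_shifted_power_law(ks, ps, c_grid)
```

On noise-free data the scan recovers the generating parameters. Real degree histograms contain empty bins and noisy tails, so a practical fit would need to skip zero counts or use a likelihood-based estimator instead.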
The user indegree value $k_i$ measures the response to his
or her posts rather than the author's activity, so it is in
some way related to the interest value, or, as often turns
out, to the amount of controversy they raise. We note
that there are many posts that do not elicit any comment,
i.e. with indegree equal to zero. A reasonably good fit
for $k_i$ is given by the modified power law $P(k_i) = A(k_i + c_i)^{-\gamma_i}$;
however, the value of $c_i$ is much larger than $c_o$. The
exponent $\gamma_i$ for indegree is also significantly larger than
$\gamma_o$ for the outdegree (2.34 vs. 1.89), indicating a much faster
drop of the typical popularity than of productivity. While
power law distributions are typical for social interactions
grown via preferential attachment following the "rich get richer"
principle [9], in our case we observe additionally a different
phenomenon.
In almost all mid-sized and extended Politics discussions we found small groups of participants with unusually
high indegree and outdegree values. These users are responsible for deviations from the predictions of the power law observed for individual discussion threads. The explanation of the origin of the phenomenon is based on the
observation that they result not from the random process
of preferential attachment, but from extended exchanges
of posts between pairs of users. Interestingly, we observed
that the same user names were visible in different comment threads. Most of such exchanges were confrontational, filled with disagreements and abuse. For this reason
we'll use the term "quarrels" to describe them. Such verbal
duels are easily visible in the graphical view of the comments web page. Because of this visibility, they attract
additional comments from supporters of each of the quarrelling participants. Quarrels increase $k_o$ and $k_i$ of their
participants and thus change the general degree distributions. Typically, quarrels longer than 5–7 exchanges take
only between 3 and 7 percent of the total number of comments in a thread, but we have observed discussions where
this ratio was much higher, for example 21% of the
220 posts in one of the discussions resulted from just two
long exchanges.
Recurrence of user nicknames connected with quarrels
in various threads has added plausibility to a hypothesis that they are largely due to the presence of "duellists":
users seeking each other's comments almost regardless
of the topic and creating or joining in the fights. For such
users the growth of $k_i$ and $k_o$ should be correlated. To
test these ideas, we have performed a cumulative statistical analysis for all participants in a set of 58 discussions.
This has been done using the assumption that the identity of
people remains fixed to nicknames within the whole scope
of the portal. The results are quite interesting. Out of almost
2000 users there were only a few with very high values
of indegree (16 with $k_i \geq 50$). Similarly, there were only
23 users with $k_o \geq 50$. Eleven users belonged to both
groups. The average outdegree was $\langle k_o \rangle = 4.62$, while the average indegree (excluding references to the original news sources,
to count only post-to-post links) was $\langle k_i \rangle = 3.01$.
In addition to duellists in the studied discussions we
have found a group of hyperactive users specializing in
abusive comments (known as "trolls") who, while publishing a lot of comments, receive a much smaller number of
replies. For example one user has posted 236 times receiving only 51 replies. Although trolls post highly provocative
comments, they are frequently ignored: most users seem
to know the rule "don't feed the troll".
Figure 1 shows the cumulative network topology for
the 58 analysed discussions. There were 1977 users, 9135
posts, out of which 3194 were linked directly to news
items. In the figures we have removed all such links, leaving only connections between the users. The two views
focus on outdegree (upper panel) and indegree. Multiple
connections between pairs of users are emphasized by link
width. We can clearly see how a few of the users dominate
the whole forum. Figure 2 presents the cumulative distributions $P(k_i)$ and $P(k_o)$, as well as the correlation between
$k_i$ and $k_o$ values. The two quantities are highly correlated,
especially for high values, with an overall correlation coefficient of 0.85.
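The degree bookkeeping described in this section can be sketched as follows. The tuple layout of a post and the toy thread are assumptions made for illustration only. The final lines check that the totals quoted above reproduce the averages reported earlier in this section, under the reading that outdegree counts every post while indegree counts only post-to-post replies.

```python
from collections import Counter

def user_degrees(posts):
    """posts: iterable of (post_id, author, parent_id); parent_id is None
    when the post is attached to the news item itself.

    Outdegree counts every post a user writes; indegree counts only
    replies received from other posts (post-to-post links)."""
    posts = list(posts)
    author_of = {post_id: author for post_id, author, _ in posts}
    k_out, k_in = Counter(), Counter()
    for post_id, author, parent_id in posts:
        k_out[author] += 1
        if parent_id is not None:
            k_in[author_of[parent_id]] += 1
    return k_out, k_in

# Hypothetical toy thread: u1 and u2 exchange replies, u3 comments on the news item.
toy = [(1, "u1", None), (2, "u2", 1), (3, "u1", 2), (4, "u3", None), (5, "u2", 3)]
k_out, k_in = user_degrees(toy)

# Consistency check with the totals quoted in the text.
users, total_posts, posts_to_news = 1977, 9135, 3194
mean_k_out = total_posts / users                    # about 4.62
mean_k_in = (total_posts - posts_to_news) / users   # about 3.01
```

Multiple replies between the same pair of users are counted separately here, matching the convention stated above for indegree and outdegree.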
Additional information about the network properties
may be provided by its correlation coefficient. As the studied network is a directed one, this quantity is sometimes
called transitivity. To focus on relationships between the
users, we first remove all links to the original news sources.
Because of the presence of multiple connections between
the same users there are two options for characterising
the network. In the first option, we simply register the
presence of a link between the users regardless of the actual number of connections. In this option, which we call
unweighted, all links have the same strength. In the second option, weighted, the weight of the link between two
users is naturally given by the number of comments. For
this scenario there are many ways of defining the correlation coefficient. We use the geometric mean
method proposed by Opsahl and Panzarasa [10]. The
calculated value for the unweighted option is $C_i^U(\mathrm{Politics}) = 0.0665$,
while for the weighted option $C_i^W(\mathrm{Politics}) = 0.0866$. The large

Fig. 1. Two views of the topology of the network connecting the users participating in 58 large and mid-size discussions within
one month. Right panel: size of the nodes corresponds to outdegree of the user. Left panel: size of the node corresponds to
indegree. Link width reflects the number of communications between the users; binary exchanges are clearly identifiable. Some
users have been identified by their nicknames. This allows us to identify notorious trolls (such as wrojoz and koloratura1), who
have many posts but relatively few responses, and controversy leaders, such as tuskomatolek and junkier (who have more
responses than posts). Despite the fact that almost 2000 users have participated in the discussions, only a few of them
dominate the exchanges, by their posting activity and by the concentration of responses, such as kralik111. A perfect example
of a user whose participation in discussions is motivated by negation and abuse of a particular opponent is given by rooboy
whose main target is kralik111.

Fig. 2. Cumulative indegree and outdegree distributions for 58 Politics mid-size and large discussions over a period of 30 days.
The third panel shows the correlation between $k_i$ and $k_o$, with two squares indicating the notorious trolls, i.e. individuals posting
a lot of comments but getting only a few answers. The triangles indicate the controversy leaders, receiving significantly more
comments than they post. To be able to show the posts that have not resulted in any comment (with indegree equal to zero) on
the log-log scales we have artificially shifted them to $k_i = 0.1$.

difference reflects the fact that correlations involve exactly
the users with high numbers of binary exchanges. Both
values are much higher than the ones obtained for a random network with the same ratio of nodes to links, where
$C_i^U(\mathrm{Random}) = 0.00164$ and $C_i^W(\mathrm{Random}) = 0.00167$.
From the texts of the comments we know that in many
cases the motivation for posting was to achieve popularity
(or at least notoriety). It is interesting to compare this
notion with the observations of Huberman et al. [11], who
identified the drive to achieve fame and visibility as one
of the distinct factors determining the number of posts
on YouTube. The authors have shown that "the productivity exhibited in crowdsourcing exhibits a strong positive
dependence on attention. Conversely, a lack of attention
leads to a decrease in the number of videos uploaded and
the consequent drop in productivity, which in many cases
asymptotes to no uploads whatsoever". This observation
is supported in our case by the fact that the identities of
the most productive and most commented on participants,
summed over the set of threads, are highly correlated (third

Table 1. Statistics of comment type between various groups of
users (two identified factions A and B and neutral or unidentifiable class NA).

       Intra-faction   Inter-faction   Factions-NA      Intra-NA
       (A-A, B-B)      (A-B)           (A-NA, B-NA)     (NA-NA)
Agr    16.9%           0.6%            2.5%             0.6%
Dis    2.1%            32.8%           11.1%            1.2%
Inv    0.2%            17.1%           3.1%             0.7%
Prv    1.1%            2.7%            1.9%             0.2%
Neu    1.1%            0.8%            1.7%             0.2%
Jst    0.4%            0.0%            0.4%             0.1%
Swi    0.1%            0.0%            0.2%             0.1%

panel in Fig. 2). We observe a crowd of one-comment participants and several popular and prolific ones. Moreover,
the duellists recognize each other and tend to join in the
sub-threads simply to spur new rounds of abuse.
An interesting psychological observation is the existence of impersonators of famous commentators. They
choose a nickname that is at first glance identical to
the original one, for example by adding unobtrusive parts
to the user name, such as changing from "XYZ" to "XYZ.",
which often goes unnoticed. This is the most aggressive
form of trolling. In most cases the views of the original user
and the impostor are radically different. The troll's intention is to create chaos and confusion, as an unsuspecting
reader often finds comments with radically different views
or even exchanges between apparently the same participant, quarrelling with himself.

3.3 Comment classification


In addition to link statistics between posts and users we
wanted to study the content and tone of the comments.
Our goal was to check if the forum provided any chance
of reaching a consensus, or at least a decrease in the level
of disagreement. This part of the analysis required human
evaluation, and the results of assignment of each comment
to a given class are obviously less objective. In some cases
we admit to being unable to classify a post, but for most
comments the task was rather straightforward.
Due to the extremely time consuming nature of the process, we have chosen 20 threads from the whole set
of 58 full discussions. They contained between 20 and
250 posts.
Table 1 presents percentages of various types of comments between the users (i.e. omitting the classification
of comments addressed to the source messages). The reason for this omission was to decouple our analysis from
the individual responses to the news article, which have
served mostly to distinguish the nodeclass value. Aggressive posts (Dis, Prv, Inv) accounted for almost 75% of all
communications between users and we should remember
that provocative posts directed to the source news story
also add to the discussion temperature. Agreements between factions were extremely rare, and are only slightly
more frequent between those users with declared opinion
(nodeclass A or B) and the unidentified ones (nodeclass
NA).
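The "almost 75%" figure can be checked directly from the Table 1 percentages. The short sketch below sums each comment type over the four user-group columns; the dictionary layout and variable names are illustrative, and the values are read row by row from Table 1.

```python
# Percentages from Table 1, read row by row.
# Columns: intra-faction, inter-faction, factions-NA, intra-NA.
table1 = {
    "Agr": [16.9, 0.6, 2.5, 0.6],
    "Dis": [2.1, 32.8, 11.1, 1.2],
    "Inv": [0.2, 17.1, 3.1, 0.7],
    "Prv": [1.1, 2.7, 1.9, 0.2],
    "Neu": [1.1, 0.8, 1.7, 0.2],
    "Jst": [0.4, 0.0, 0.4, 0.1],
    "Swi": [0.1, 0.0, 0.2, 0.1],
}

# Share of each comment type over all user-to-user communications.
row_totals = {kind: round(sum(values), 1) for kind, values in table1.items()}

# Aggressive posts as defined in the text: Dis, Prv and Inv.
aggressive_share = round(sum(row_totals[k] for k in ("Dis", "Prv", "Inv")), 1)

# The rows should add up to roughly 100% (rounding in the table leaves a small gap).
grand_total = round(sum(row_totals.values()), 1)
```

The aggressive categories add up to 74.2% and all entries together to 99.9%, which agrees with the statement above and supports the row-wise reading of the table.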
Analysing individual discussions we found that subthreads related to neutral posts usually died much faster
than those due to confrontational or abusive ones. Long
chains of invective-invective and invective-disagreement
are very frequent, while exchanges between users of the
same group, agreeing with each other, are usually shorter,
rarely extending beyond 3–4 consecutive posts.
3.4 Factual and emotional content considerations
To see if these features are indeed characteristic of the
strongly polarized political forum, we have gathered similar data for two other topics: sport and science (which
are separate sections of the Web page). Common sense
would suggest a comparable level of emotions to be present
for a sport column, and a much lower level for science related news. As we can see from Figure 3, in all three cases
there is a main group of discussions, with size distribution
$P(L)$ falling roughly as a power law, but there are also discussions with a very high number of posts $L$. We are interested
in the possible origins of such large threads.
We note first that, to our surprise, the Sport forum
discussion statistics have shown a much faster fall with increasing thread size $L$, as well as a much higher proportion of news items that do not attract any comment. This
is most likely due to the fact that sports reporting consists of many items covering all disciplines (from soccer
through tennis to NBA) and the interests of readers would
be distributed over the disciplines. During the time we
have been gathering our data a few major sports events
took place (e.g. Australian Open tournament, Handball
World Championships), in which Polish participants were
expected to be successful, and where high emotional reactions were present. Indeed, these were the news stories that resulted in high user participation, equalling in
size those of the political forum. However, the topology
of sport discussion networks of comments was different
from the political ones. For example, the largest discussion, shortly after a dramatic win by the Polish team
in the handball championships, involved 249 participants
and 336 posts. But 203 users have posted only a single
comment, attached to the source message and expressing
their joy. The longest exchange involved just four posts.
As it turns out, the sport forum provides an almost perfect
contrast to the politics: it is not dominated by two factions
and the moods and opinions of participants are usually in
sync.
This is in contrast with the main topic of our study,
political comments with a high degree of conflict. Only there
did we find a large proportion of mid-sized ($L > 100$) discussions. There were significantly fewer stories without any
comment. For news items with a small response the network consisted of weakly connected posts, which related
directly to the news source. But, as the size of discussion grew, the proportion of quarrels and provocative output by trolls increased strongly. What is important, the
proportion of disagreements is very high and extremely

Fig. 3. Top left panel: distribution $P(L)$ of discussion sizes $L$ for three forum topics: politics, sport and science. Filled points
show the number of news items that did not elicit any comments. Lines show power law fits. The large difference of exponent
for the sport forum is due to the much larger number of news items, most of which get no or almost no reaction at all. Bottom left
panel: average outdegree $\langle k_o \rangle$ for the Politics forum as a function of $L$. Right panel: correlation between $\langle k_o \rangle$ and the percentage
of discussion spent on binary exchanges (quarrels). Points show percentages of quarrels longer than 6 posts and longer than
4 posts. The data support the supposition that large $\langle k_o \rangle$ values are due to extended quarrels between a few participants.

aggressive behaviour (abuse and provocations) increased
even more strongly with the size of the discussion. It is
as if the most combative users were seeking each other
on the most active battlefields, neglecting the less active
threads.
Compared to sport, discussions on the Science forum were
different again. Most of them were very short and rather
dull, with practically no network structure and links only
to the source. But, from time to time, a strictly scientific topic
was transformed by the users into a highly loaded,
emotional and political subject. Such was the case of a
description of advances in pre-natal research which was
discussed from a religious/ethical point of view. A story describing details of the last Ice Age was discussed from the
point of view of current global warming. In such cases
we have observed dominance of binary exchanges in the
general topology, stronger even than in Politics. First of
all, the number of participants was usually much smaller
than the number of posts. In one case only 12 people produced 215 posts, with 8 of them responsible for 208 comments. Binary exchanges longer than 8 consecutive posts
took 72% of the discussion. It should be noted that while
the users disagreed with each other, only a few comments
were abusive, and many used evidence, references and
logical arguments. In another, smaller thread, a discussion between two participants consisted of 40 posts (out
of a total of 117), while in yet another, two users have
exchanged 41 posts (out of 178), most of them rather long
and scientific in spirit. Thus, we propose that when the
topic of a science news item is perceived by the readers as related to an important world-view issue, the chance for localized conflict between pairs of users is very high. This is coupled
with a natural barrier of accessibility (many participants
in political discussions simply do not look into the science
section at all), so that the proportion of the users capable
of adding rational arguments to the discussion is higher.
As can be guessed from the above discussion, the correlation coefficients for the cumulative networks for the
sports and science sections are quite different from the politics forum. In the case of sport related comments removing the links to source stories left the network highly
unconnected, so that the statistics become almost meaningless. On the other hand, the same operation for the science forum leaves a very highly connected network with
$C_i^U(\mathrm{Science}) = 0.233$ and $C_i^W(\mathrm{Science}) = 0.325$.
3.5 Computer model and simulations
To get a more detailed insight into the relative importance
of the processes described above we have constructed a
simple computer model of the community and discussion process. The discussion participants are simulated by
agents with the following characteristics: nodeclass (we

Fig. 4. Indegree and outdegree distributions obtained from computer simulations without quarrels. The third panel shows poor
correlation between $k_i$ and $k_o$.

have kept only A and B classes, no unknowns or neutrals,
although they are within the program capabilities) and activity class (high or low level of activity). The model used
2000 agents (similar to the number of users of the portal
in the observation period). We assumed 300 of these to be
"active" agents, and the remaining 1700 to be "passive"
ones.
The simulation process is rather typical: agents are selected randomly, and then choose the post they wish to
comment on, from the set of earlier posts. A certain proportion of the agents look directly to the source message. For others we assume preferential attachment rules
for the probability of picking the target post. Specifically, the
chance of choosing a post is proportional to its total degree (outdegree of a post is always 1, indegree may be
quite high). After choosing the target post, the agent then
decides whether or not to comment on it. In the simple
model used here, agents of the same class would always
agree with each other, while agents of different nodeclass
would always disagree. The probability of posting a comment depends on the activity class and on the comparison of nodeclasses of author and target agents. We have
assumed that the probability of a disagreeing comment
by an active agent is 30%, and by a passive one 10%. For
agreements the values were decreased by a factor of 2,
to simulate lesser motivation to post a comment. Thus
the expected ratio of agreements to disagreements should
be 33/67. The model parameters were aimed at achieving qualitative rather than quantitative agreement with
observations. Proper values should be derived via psychological tools, such as user interviews.
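The 33/67 ratio follows from the halved agreement probability alone. The sketch below makes the arithmetic explicit under one added assumption that the text does not state: a chosen target post is equally likely to come from the agent's own class or from the opposite one. The function and parameter names are hypothetical.

```python
def agreement_share(p_disagree, agreement_factor=0.5, same_class_fraction=0.5):
    """Expected share of agreeing comments among all comments posted.

    p_disagree: probability of commenting on a post by the opposite class.
    agreement_factor: agreements are posted with p_disagree * agreement_factor.
    same_class_fraction: assumed chance that the target post belongs to the
    commenting agent's own class (0.5 is an assumption, not a value from the text).
    """
    p_agree = p_disagree * agreement_factor
    agreements = same_class_fraction * p_agree
    disagreements = (1 - same_class_fraction) * p_disagree
    return agreements / (agreements + disagreements)

share_active = agreement_share(0.30)    # active agents
share_passive = agreement_share(0.10)   # passive agents
```

Both activity classes give an agreement share of one third, because the factor of 2 applies to each of them equally; the share therefore does not depend on the mix of active and passive agents.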
The simulation is run until a preassigned number of posts
are placed, at which time suitable statistics are measured.
Results of such a model are presented in Figure 4. The indegree distribution shows some similarity with observations and good agreement with the power-law distribution
expected for preferential attachment rules. On the other
hand, outdegree shows a significantly faster, exponential decrease rather than a power law. This is not surprising, as it
corresponds to the probabilistic choice of agents posting the
comments inherent in the model. There are no agents with
unusually high values of indegree and outdegree, nor is
there any significant correlation between $k_i$ and $k_o$. We
conclude that the model needs modification to be able to
describe our observations.
The key enhancement of the model is an additional
step in the simulation process. Specifically, after the agent
has posted a comment, the author of the target post is
given a chance to respond. The probability of such a response is assumed higher than for a normal post (to reflect the situation when an agent might feel personally
interested in responding). The probabilities of such a direct
response used in simulations were 90% for active agents
and 45% for passive ones. If the response is placed, the
roles of the two agents are reversed, and again a chance
for counter-response is evaluated. This chain is continued
until one of the agents decides to quit. The exchange
between the two is decoupled from the rest of the simulation and it is possible to derive simple analytical formulae
for the mean length of such an exchange. The values of response probabilities have been chosen by qualitative comparison of the lengths of simulated quarrels with the statistics of observed
ones. The statistical distribution of quarrel lengths is a partially independent measure of agreement between simulation and reality.
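The analytical formulae are not given in the text. One possible derivation, offered here as a sketch rather than as the authors' result, treats the exchange as alternating Bernoulli trials: the first responder replies with probability p1, the opponent with p2, and so on until someone quits. Summing the survival probabilities gives a mean number of responses of p1(1 + p2)/(1 - p1 p2). The function names, seed and sample size below are illustrative.

```python
import random

def mean_quarrel_length(p_first, p_second):
    """Expected number of responses in an alternating exchange.

    Derived by summing P(length >= n): p1, p1*p2, p1*p2*p1, ...
    which is a geometric series in p1*p2."""
    return p_first * (1 + p_second) / (1 - p_first * p_second)

def simulate_quarrel(p_first, p_second, rng):
    """Simulate one exchange; return the number of responses posted."""
    probs = (p_first, p_second)
    length = 0
    # Agents take turns; the chain stops at the first agent who does not reply.
    while rng.random() < probs[length % 2]:
        length += 1
    return length

rng = random.Random(12345)   # fixed seed so the run is repeatable
n_runs = 20000

# Two active agents (90% each), using the response probabilities quoted above.
simulated = sum(simulate_quarrel(0.9, 0.9, rng) for _ in range(n_runs)) / n_runs
expected_active = mean_quarrel_length(0.9, 0.9)
expected_mixed = mean_quarrel_length(0.9, 0.45)     # active replies first, then passive
expected_passive = mean_quarrel_length(0.45, 0.45)
```

With these probabilities two active agents exchange about nine responses on average, while two passive agents exchange fewer than one, which illustrates why long quarrels in such a model are concentrated among a small group of active agents.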
When the quarrels are added into the simulations, the
results become much closer to the observations, as shown
in Figure 5. The similarity goes beyond the indegree and
outdegree distributions and extends to details of the network structure, for example the correlation coefficient
between $k_i$ and $k_o$. Also, the ratio of agreements to disagreements falls to 16/84, resembling the observations for
the Politics forum. Thus, even though radically simplified
(no neutral posts, fully polarized opinions of agents, simple process), the model yields results surprisingly close to
reality.
We note that the computer model described above is
able to reproduce the characteristic results for the three

Fig. 5. Indegree and outdegree distributions obtained from computer simulations including quarrels. The third panel shows
strong correlation between $k_i$ and $k_o$.

mentioned forums, by simple adjustments of the probabilities
of posting and entering into a duel.
One of the reasons for the success of the simulations
using "mindless automation" might be the fact that in a large
part of the analysed discussions people behave like
mindless automata. Many comments are almost automatic, knee-jerk responses to abuse in the form of further
abuse. We observed canned accusations of the supporters
of the opposing political faction of being liars, thieves,
idiots or worse, without any connection to the arguments
raised by the opponent. There are very few explanatory
or evidence posts (unlike in topical forums where many
users aim at helping each other). Even if such evidence appears, it is immediately, almost automatically questioned,
as coming from "the other side". In fact, after reading so
many posts we could with high probability predict the
character of the post by looking up the names of the participants, without reading the discussion.

4 Discussion
4.1 Internet networks and friendly discussion forums
Modern Internet activities contain many examples of social activities based on similarity of interests: music communities, online role playing games, friendship websites.
Several such networks have been studied by Grabowski
et al. [12,13]. It seems interesting to compare the comment
network and the ones formed by interlinked web sites. Here
again we find some resemblance and some differences. The
most important differences are that web sites and their
links are much more stable than comments; usually more
thought is given by the authors when deciding what their
pages should be linked to. Also, these links are usually
driven by common interests and views. One seldom finds
links to web sites showing the opposite viewpoint. The last
difference might be that generally there is less emotion and
more content in traditional web pages. Despite these differences the general properties of both types of networks
are very similar, for example the degree distribution for Web
pages is also well represented by a modified power law, with
exponent close to 2 [3,4].
To move to link structures where there are more
personal influences, we decided to compare the observations of the highly abusive and combative Politics
forum with communities where the main motivation is
mutual help. We chose a web site of computer aficionados (http://peb.pl/), and within it two discussion
boards, one devoted to configuring computer hardware
(hardware forum), the other to solving problems in the
Windows operating system (windows forum). The boards
are moderated and the individual discussions are helpful
and friendly. Perhaps the most irksome posts are those
where some people brag about their configurations. The
hardware forum contains more than 18258 individual discussions, out of which 107 are longer than 40 posts, the
longest one with 207 posts on the day the statistics were
gathered. The windows forum is smaller, with 6859 discussions, out of which 67 were longer than 25 posts. There
were 897 registered users in the hardware forum and 604
in the windows one.
The cumulative statistics of discussions within the two
boards show very similar behaviour of the discussion size
statistics, which for discussions longer than 20–30 posts
show power-law behaviour $P(L) \sim L^{-\alpha}$ with exponent
$\alpha \approx 3$ for both communities, so that the overall decrease
was much faster than in the case of the hate-powered forums. Also, most of the individual discussions exhibited
power-law-like behaviour in user activity, with a few active
users and many more users with only one or two comments. The small number
of long discussions where we found binary exchanges were
quite different from politics quarrels. For example, out of
38 discussions on windows forum, only 4 contained extended question/answer exchanges typical for help desk
conversations between a user in trouble and a helpful guru.
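
The power-law behaviour quoted above can be illustrated with a short sketch. This is not the authors' fitting procedure: it draws synthetic "discussion lengths" from a continuous power law and recovers the exponent with the standard maximum-likelihood estimator for a power-law tail. The function names, the cut-off `x_min = 30` and the sample size are assumptions chosen for illustration, and the exponent is treated here as that of the probability density.

```python
import math
import random


def sample_power_law(alpha, x_min, n, rng):
    """Draw n continuous samples with density proportional to x**(-alpha), x >= x_min."""
    # Inverse-transform sampling: x = x_min * (1 - u)**(-1 / (alpha - 1)), u uniform on [0, 1).
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def estimate_exponent(values, x_min):
    """Maximum-likelihood estimate of alpha using only the tail values >= x_min."""
    tail = [v for v in values if v >= x_min]
    if len(tail) < 2:
        raise ValueError("need at least two observations in the tail")
    log_sum = sum(math.log(v / x_min) for v in tail)
    return 1.0 + len(tail) / log_sum


# Synthetic data only (hypothetical values, not the forum statistics).
rng = random.Random(0)
lengths = sample_power_law(alpha=3.0, x_min=30.0, n=20000, rng=rng)
alpha_hat = estimate_exponent(lengths, x_min=30.0)
print(f"estimated exponent: {alpha_hat:.2f}")
```

With real data, the choice of the cut-off matters as much as the estimator: the text itself restricts the fit to discussions longer than 20–30 posts, which corresponds to the `x_min` argument in this sketch.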
P. Sobkowicz and A. Sobkowicz: Dynamics of hate based Internet user networks

Another observed difference was that most users participated in only a few discussions, despite the similarity of
topics such as "how to configure a PC for under 500$" or
". . . below 550$". Most users followed their direct interests,
without involving themselves in other discussions. For example, only 62 (out of 691) users posted comments on
more than 4 subtopics in the windows board (the ratio
for the hardware forum was 42 out of 1603 users). Despite this focus of interest, individual user activity statistics showed power-law behaviour $P(k) \sim k^{-\gamma}$, where
for the windows forum the exponent $\gamma = 2.04$ and for the
hardware one $\gamma = 1.91$ (we included only participants
in discussions longer than, respectively, 30 or 40 posts in
these statistics, to keep them closer to the hate networks study). The only significant deviation from the power law was the presence (only in the hardware forum) of a
handful of gurus who provided the users with advice on
many topics. The most prolific was one of the moderators,
who participated in 61 discussions, posting 449 comments.
And this was but a tiny part of that person's total activity on
the Web site, totalling more than 13 000 posts (more
than 14 posts a day). There was only one more person
with more than 200 posts on the hardware forum, and
6 persons with more than 100 posts. For the windows
forum the highest value was 73 postings.
Overall, with the exception of a few official or self-appointed experts, the participants of the described forums
were interested in solutions to their problems and there
were practically no instances of one participant seeking
another across individual topics to continue a discussion,
for example to argue about the superiority of AMD over
Intel or Linux over Windows. This is, we expect, thanks to
the strictly enforced rules of a moderated medium, stressing
its cooperative nature. There is a huge difference with the
Politics discussions, where specific topics often serve only
as springboards for quarrels, often wandering far from the
original news story.
4.2 Blogs and hate groups
The news related discussions share a lot of features with
blogs, where the personal and emotional content is dominant. A study of blogging behaviour in the strongly polarized
environment of the US 2004 elections has been published by
Adamic and Glance [14]. It showed a preference for links
between blogs of the same political orientation; links
between opposing blogs were present, but not numerous,
limited to about 15% of the total number. This stands in
contrast to our observations. A plausible explanation relies again on the difference between the purposes of the two
types of communication. Blogs, especially election campaign ones, are written with the main aim of promoting a
particular party or candidate. The conflict with the opponents, if present, is secondary. On the other hand, the
large percentage of abuse and invective in the discussion
posts shows that the main purpose is to vent emotions and possibly to incite hate. Thus the percentage of
cross-group links is much higher.
Another detailed study of blogging behaviour by
Leskovec et al. [15] shows similar power law behaviour for
the indegree and outdegree distributions, albeit with different exponent values. An important difference between
the two systems is the lack of significant correlation between indegree and outdegree for the blogs, with
a $k_i$ and $k_o$ correlation coefficient of only 0.16, much lower
than the 0.85 in the Politics network dominated by bilateral
exchanges. Leskovec et al. propose a cascading model of
blog links and provide data on the relative probability of various patterns of link connections. The binary exchanges of
our approach, which would correspond to a linear topology
of the cascade model, are relatively less probable in the
blog case, where the cascades tend to be wide rather than
deep.
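
To make the role of the indegree/outdegree correlation concrete, the sketch below computes the Pearson correlation coefficient between the two degree sequences from a list of directed comment links. It is only an illustration under stated assumptions: the edge convention `(author, addressee)`, the function name and the two toy networks are hypothetical and are not taken from the data discussed here.

```python
import math
from collections import Counter


def degree_correlation(edges):
    """Pearson correlation between indegree and outdegree over all users in the edge list."""
    out_deg = Counter(src for src, _ in edges)   # comments written by each user
    in_deg = Counter(dst for _, dst in edges)    # comments addressed to each user
    users = sorted(set(out_deg) | set(in_deg))
    k_out = [out_deg[u] for u in users]
    k_in = [in_deg[u] for u in users]
    n = len(users)
    mean_out = sum(k_out) / n
    mean_in = sum(k_in) / n
    cov = sum((o - mean_out) * (i - mean_in) for o, i in zip(k_out, k_in))
    var_out = sum((o - mean_out) ** 2 for o in k_out)
    var_in = sum((i - mean_in) ** 2 for i in k_in)
    if var_out == 0 or var_in == 0:
        raise ValueError("correlation is undefined when a degree sequence is constant")
    return cov / math.sqrt(var_out * var_in)


# Toy network 1: bilateral quarrels, every comment is answered in kind.
quarrels = [("a", "b"), ("b", "a")] * 5 + [("c", "d"), ("d", "c")]
# Toy network 2: a wide cascade, many users link to one hub that never replies.
cascade = [(f"u{n}", "hub") for n in range(5)]

r_quarrels = degree_correlation(quarrels)
r_cascade = degree_correlation(cascade)
print(r_quarrels, r_cascade)
```

In the bilateral toy network each user's indegree equals their outdegree, so the coefficient is at its maximum, while the one-way cascade gives a negative value. This mirrors, in a very reduced form, why a network dominated by two-person exchanges shows a much stronger correlation than wide blog cascades.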
The discussions studied in this work are by no means
the only examples of hate present in the vast space of
the Internet. Chau et al. [16,17] study the network structure of Hate groups. These studies are important for us
for two reasons. First, they focus on bloggers, who enjoy
a lot of freedom to express their opinions and emotions.
Second, the authors use networking methods, similar to
the ones employed here. The network of users is formed
through formal subscriptions between blogs and through
impromptu comments posted to each other. This last aspect corresponds directly to our situation. While the political views studied by Chau and Xu are probably more extreme than the ones of the readers of the www.gazeta.pl
portal, the emotional reactions seem to be as strong. It
is quite interesting that the degree distribution for the
giant component of 273 nodes in the Chau and Xu network
exhibits power-law behaviour, $P(k) \sim k^{-\gamma}$, with an exponent $\gamma$ of
1.38.

4.3 Citation networks


One of the topics where the network approach has a long history of successful use is the analysis of scientific citations
[18–25]. While the differences between scientific collaboration and citations and the systems studied here are obvious, we note some similarities in the resulting statistical
behaviour. The general shapes of the indegree and outdegree
distributions of both types of networks are remarkably close.
The main difference is the much more pronounced role of a few
highly connected users in our case, which we attribute to
quarrelling individuals, a phenomenon absent in scientific
publications, where such exchanges are relatively rare and
procedurally limited to a single remark/response cycle.
An interesting similarity relates to the popularity of
individual threads within a discussion, which can be compared to the popularity of research publications. An analysis of
the first-mover advantage in citation networks has been
done by Newman [26]. He has shown that there is a significant bias promoting citations to early papers in a distinct
field of research, suggesting, tongue-in-cheek, that it is a
better strategy to write the first paper in the field than to
write the best one. This bias is, in a sense, related to publication and citation mechanisms, not to actual content.
At the same time, Newman points out that even where the
first-mover effect is strong, a small number of later papers
attract significant attention in defiance of the advantage of
the earlier ones.

Despite the fact that the motivation for posting is radically different, the same phenomenon is observed in the
network discussions. We expect that the reason is again
technical. We noted that most of the heated exchanges
were related to early posts. This is due to the way the
discussion is visually fragmented into pages containing
100 posts: viewing the later comments requires more effort. Thus late comments linking directly to the original
news story are not immediately visible and are at a disadvantage compared to early ones. Only in rare cases, if there is
an interesting discussion, might some later posts get a high
response rate despite this disadvantage.
4.4 Implications for consensus formation modelling
The last conclusion from our observation relates to a different domain of social modelling. We refer to computer
models of opinion formation (for a recent review see [27]).
Most such models use so-called agent-based societies and
assume that consensus is achieved through a series of exchanges between agents. Some models postulate a form
of averaging of opinions towards a mean value (for example [28,29]); others assume that as a result of an
interaction between two agents one of them changes his
or her opinion to fit the other's [30]. Unfortunately, in large
part the studies concentrate on mathematical formalisms
or Monte Carlo simulations, and not on descriptions of
real-life phenomena. The need to bring simulations and
models closer to reality has been realized and voiced quite
a few times [31–33].
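
As an illustration of the averaging class of models mentioned above, the following is a minimal sketch of a bounded-confidence update in the spirit of Deffuant et al. [28]: two randomly chosen agents move their opinions towards each other only if they already differ by less than a threshold. The parameter names (`epsilon` for the threshold, `mu` for the convergence rate), the population size and the number of steps are assumptions chosen for illustration, not values taken from the cited works.

```python
import random


def deffuant_step(opinions, epsilon, mu, rng):
    """One pairwise encounter: two random agents move closer if they differ by less than epsilon."""
    i, j = rng.sample(range(len(opinions)), 2)
    diff = opinions[j] - opinions[i]
    if abs(diff) < epsilon:
        # Symmetric update: the pair's mean opinion is conserved.
        opinions[i] += mu * diff
        opinions[j] -= mu * diff


def run(n_agents, epsilon, mu, steps, seed):
    """Start from uniformly random opinions in [0, 1) and apply repeated pairwise encounters."""
    rng = random.Random(seed)
    opinions = [rng.random() for _ in range(n_agents)]
    for _ in range(steps):
        deffuant_step(opinions, epsilon, mu, rng)
    return opinions


# Wide tolerance: everyone can influence everyone, opinions collapse to one value.
consensus = run(n_agents=100, epsilon=1.0, mu=0.5, steps=20000, seed=1)
# Narrow tolerance: only near-identical opinions interact, separate clusters persist.
fragmented = run(n_agents=100, epsilon=0.05, mu=0.5, steps=20000, seed=1)
print(max(consensus) - min(consensus), max(fragmented) - min(fragmented))
```

Note that even in the fragmented case the interactions in this sketch can only pull opinions together or leave them unchanged; nothing in the update rule pushes disagreeing agents further apart.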
An interesting result from the present study is that
the exchanges studied here (voicing of opinions in a quasi-anonymous medium) may not lead to consensus formation at all, despite repeated interactions between participants. In certain situations, such interactions lead rather
to an increased rift between the participants. This effect should
be studied in more detail, as it possibly suggests modifications of the models of consensus formation in other
situations.
On the one hand, we could assume that this is a phenomenon specific to computer-mediated interactions, with
their lack of the face-to-face effects of increased responsibility, shyness, induced submissiveness and even sympathy.
Anonymity and lack of fear of retribution might embolden
the participants and also promote additional mischievousness (clearly visible in the presence of provocative posts).
Thus one might assume that the studied form of exchanges
is an exception to the general rule of opinions getting
closer as a result of interactions.
But everyday experience shows that even when people
meet face-to-face, with full use of non-verbal and emotional communication, the conflicting views may remain
stable. Both history and literature are full of examples of
undying feuds, where acts of aggression follow each other,
from Shakespearean Verona families to modern political
or ethnic strife. Observations of the Internet discussions
should therefore be augmented by sociological data on
flesh-and-blood conflicts and arguments, and the dynamics of the opinion shifts. But even before such studies are
done or referred to (which the present authors feel is beyond their competence) the basic assumptions of the sociophysical modelling of consensus formation should be expanded. This is a very interesting task, because ostensibly
we are faced with two incompatible sets of observations:
- Hard data and evidence to support their viewpoint,
  participants in the studied Internet discussions tend
  to hold to their opinions, strengthening their resolve
  with each exchange. Within the analysed subset of the
  discussions the conversion of opinion (even a simple
  agreement to a statement from the opposing side) was
  virtually absent. Interactions do not seem to lead to
  opinion averaging or switching.
- Yet, most of the participants do have well-defined opinions. These must have formed in some way. There are
  studies indicating a genetic/biological base for some of
  the political tendencies [34–37]. So perhaps the participants in our discussions did have a built-in tendency
  to pick one of the sides of the divide, and to stick to it.
  Regardless of genetic considerations, political attitudes are thought to be dependent on fairly stable
  elements, such as childhood environment, which again
  decreases the chances of reaching a consensus. But specific opinions on concrete events or people cannot be
  genetically coded, nor can they be due to general cultural formation: they must be reached individually in each case.
  Where do such influences come from?
Judging by the content of the analysed posts, we suggest in
our case the existence of two mechanisms: fast consensus formation within one's own group (including adoption of
common, stereotyped views and beliefs); and persistence
of differences with other groups. An interesting experimental confirmation of such a phenomenon has been published recently [38]. Knobloch-Westerwick and Meng note
that their "findings demonstrate that media users generally
choose messages that converge with pre-existing views. If
they take a look at the other side, they probably do not
anticipate being swayed in their views. [. . . ] The observed
selective intake may indeed play a large role for increased
polarization in the electorate and reduced mutual acceptance of political views." This finding is in full agreement
with the behaviour we report.
The persistence of differences of opinion exhibited in
the online discussions studied in this work stands in contrast
to the observations of Wu and Huberman [39,40], who measured a strong tendency towards moderate views in the
course of time for book ratings posted on Amazon.com.
However, there are significant differences between book
ratings and the expression of political views. In the first case
the comments are generally favourable and the voiced
opinions are not influenced by personal feuds with other
commentators. Moreover, the spirit of a book review is a
positive one, with the official aim of providing useful information for other users. The helpfulness of each of the
reviews is measured and displayed, which promotes prosociality and good behaviour. In the case of political disputes it is often the reception in one's own community
that counts, the show of force and verbal bashing of the
opponents. The goal of being admired by supporters and
hated by opponents promotes very different actions than
in the cooperative activities. For this reason, there is little
to be gained by a commentator when placing moderate,
well-reasoned posts: neither popularity nor status is
increased.
Our results suggest possible future models of consensus formation that would take into account not only factors leading to convergence of opinions, but also those
that strengthen their divergence. Nonlinear interplay between these tendencies might lead to interesting results,
and decoupling the technical basis of the interactions (in
our case the comment network) from the human perspective of opinions and sympathies is an interesting topic for
further studies.

References
1. A.C. Jeong, The American Journal of Distance Education
17, 25 (2003)
2. A.C. Jeong, Distance Education 26, 367 (2005)
3. S.N. Dorogovtsev, J.F.F. Mendes, Evolution of Networks:
From Biological Nets to the Internet and WWW (Oxford
University Press, 2003)
4. R. Albert, A.L. Barabási, Rev. Mod. Phys. 74, 67 (2002)
5. M.E.J. Newman, J. Stat. Phys. 101, 819 (2000)
6. M.E.J. Newman, D.J. Watts, S.H. Strogatz, Proc. Natl.
Acad. Sci. USA 99, 2566 (2002)
7. E. Adar, in Proceedings of the SIGCHI conference on
Human Factors in computing systems, ACM Press (2006),
pp. 791–800
8. M.E.J. Newman, S.H. Strogatz, D.J. Watts, Phys.
Rev. E 64, 026118 (2001)
9. A.L. Barabási, R. Albert, Science 286, 509 (1999)
10. T. Opsahl, P. Panzarasa, Social networks 31, 155 (2009)
11. B.A. Huberman, D.M. Romero, F. Wu, Arxiv preprint
arXiv:0809.3030, (2008)
12. A. Grabowski, N. Kruszewska, R.A. Kosiński, Eur. Phys.
J. B 66, 107 (2008)
13. A. Grabowski, Eur. Phys. J. B 69, 605 (2009)
14. L.A. Adamic, N. Glance, in Proceedings of the 3rd international workshop on Link discovery (2005), pp. 36–43
15. J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance,
M. Hurst, in SIAM International Conference on Data
Mining (SDM 2007) (2007)
16. M. Chau, J. Xu, International Journal of Human-Computer Studies 65, 57 (2007)
17. M. Chau, H.K. Pokfulam, J. Xu, in Pacific-Asia
Conference on Information Systems, Kuala Lumpur,
Malaysia (2006)
18. M.E.J. Newman, in Complex Networks, edited by E.
Ben-Naim, H. Frauenfelder, Z. Toroczkai (Springer, Berlin,
2004), V. 64, pp. 337–370
19. M.E.J. Newman, Phys. Rev. E 64, 016131 (2001)
20. M.E.J. Newman, Proc. Natl. Acad. Sci. USA 98, 5955
(2001)
21. M.E.J. Newman, Proc. Natl. Acad. Sci. USA 101, 5200
(2004)
22. S. Redner, Eur. Phys. J. B 4, 131 (1998)
23. Z.K. Silagadze, Complex Syst. 11, 487 (1997)
24. H.M. Gupta, J.R. Campanha, R.A.G. Pesce, Brazilian
Journal of Physics 35, 981 (2005)
25. A. Vazquez, Arxiv preprint arXiv: cond-mat/0105031,
(2001)
26. M.E.J. Newman, EPL (Europhys. Lett.) 86, 68001 (2009)
27. C. Castellano, S. Fortunato, V. Loreto, Rev. Mod. Phys.
81, 591 (2009)
28. G. Deffuant, D. Neau, F. Amblard, G. Weisbuch, Adv.
Complex Syst. 3, 87 (2000)
29. R. Hegselmann, U. Krause, Journal of Artificial Societies
and Social Simulation 5, 3 (2002)
30. K. Sznajd-Weron, J. Sznajd, Int. J. Mod. Phys. C 11, 1157
(2000)
31. S. Moss, B. Edmonds, Journal of Artificial Societies and
Social Simulation 8, 13 (2005)
32. J.M. Epstein, Journal of Artificial Societies and Social
Simulation 11, 12 (2008)
33. P. Sobkowicz, Journal of Artificial Societies and Social
Simulation 12, 11 (2009)
34. J.T. Jost, J. Glaser, A.W. Kruglanski, F.J. Sulloway,
Psychological Bulletin 129, 339 (2003)
35. J.R. Alford, C.L. Funk, J.R. Hibbing, American Political
Science Review 99, 153 (2005)
36. D.M. Amodio, J.T. Jost, S.L. Master, C.M. Yee, Nature
Neuroscience 10, 1246 (2007)
37. J. Haidt, J. Graham, Social Justice Research 20, 98 (2007)
38. S. Knobloch-Westerwick, J. Meng, Communication
Research 36, 426 (2009)
39. F. Wu, B.A. Huberman, in Proceedings of the Workshop
on Internet and Network Economics (2008)
40. F. Wu, B.A. Huberman, Arxiv preprint arXiv:0805.3537,
(2008)

Indexing and Access for Digital Libraries and the Internet: Human, Database, ...
Marcia J Bates
Journal of the American Society for Information Science (1986-1998); Nov 1998; 49, 13;
ABI/INFORM Global
pg. 1185

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.


IS HUMANITIES COMPUTING A DISCIPLINE?


Tito Orlandi
Università degli Studi di Roma "La Sapienza"
(Text published in Jahrbuch für Computerphilologie, 2002, no. 4, pp. 51-58)
Abstract
At the beginning, the article mentions the results of an enquiry on how humanities computing is
being introduced into the curricula of the European universities; and the most important topics
recently discussed around the theme of a theory of humanities computing. It appears that most
experts agree on the opinion that humanities computing is an independent discipline, and as
such it should be introduced into the faculties of humanities. The article then explains how the
foundation of the discipline should be understood, on the basis of computing theory and the
methodology of the different humanities disciplines.
I have often wondered how it happens that two scholars who mostly share the same ideas on
humanities computing, like Lou Burnard and myself, are quite in opposition about the question
whether it is an independent discipline or not. This comes possibly from the fact that we do not
see the problem from the same point of view, especially since concepts like "discipline" and
"humanities" (not to speak of "computing") are rather equivocal. They may mean, among other
things, both a field of study and the way it is taught and learnt.
For instance, if we accept the opinion of Robert Proctor [1] that humanities is a view of the
human education calling for the imitation of classical, as opposed to medieval, Latin, and for
the study of Roman, and to a lesser extent Greek, literature, history, and moral philosophy as
guides to individual and collective behavior (p. XXIV), and then degenerated (as he says) into
pure scholarship (p.88 etc.), the computer applications cannot but be limited to a sort of
technological help, if ever needed, extraneous to its real (original) essence. In this case
humanities computing is not a discipline, and what is generally called "computer
alphabetization" is largely sufficient for scholars. But if we accept as progress, and not as
degeneration, the historical and philological methodologies proposed by the great (mainly
German) tradition of the late nineteenth century, then it is possible that computing becomes an
essential element inside those methodologies, thereby acquiring a very different status. If we
also observe that the influence of computing remains more or less the same across the different
kinds of data which are the object of individual humanities disciplines (history, archaeology, arts,
et cetera), then it is reasonable to suggest the birth of an independent discipline.
A few years ago, as part of the task of the committee of the Socrates Program "Advanced
Computing in the Humanities" of the European Community [2], I carried out an enquiry on
how humanities computing is being introduced into the curricula of the European universities [3].
I found many different approaches, which are worth summarizing here. A number of
technically oriented computer literacy courses are given at the majority of humanities faculties,
often linked with individual courses in linguistics, history, literature, et cetera. Typically, there are
courses which either involve discipline-specific techniques, or which deal with the practical use
of a machine: familiarity with operating systems, text processing, databases, Internet access
and programming et cetera. Although the provision of courses in computer literacy seems only
common sense, such courses often deal with the technical side of computing and miss the
special symbiosis created between computing procedures and humanities methodology. A few
universities teach humanities computing in a more systematic way. They have organized a
group of courses focusing on the methodology of computer applications, and put them together
with more technically oriented courses, to form special units (sometimes departments or
schools) of humanities computing within the Faculty of Arts. It is remarkable that some
universities fully equipped with high level facilities and competent personnel in the field have
failed to realize dedicated and coherent programs for humanities computing. Some institutions
with good centres in humanities computing offer ample opportunities for highly motivated
students to acquire advanced competence through individual study or optional courses, but if
there is no integrated and required training in the subject, it is probable that only a minority of
students will take advantage of them, and thus benefit from the full potential. Some universities
have chosen the opposite point of view, by organizing courses dedicated to the students of the
humanities within their computer science departments.
In recent times (but mainly in the Anglo-American environment!) attention has been increasingly
drawn to the essence of humanities computing, investigated from the point of view of teaching
and from the point of view of institutional organization. I note the important
conference on "Is Humanities Computing an Academic Discipline?", held under the auspices of
the Institute for Advanced Technology in the Humanities (IATH) at the University of Virginia [4],
a gathering of prominent individuals in the fields of computing and communications science,
and arts and humanities research [5]; two contributions by Willard McCarty [6]; a lively
discussion inside the Humanist Discussion Group [7], active on the internet; the book produced
by the Socrates European Program, cited above; and an interesting paper by John Lavagnino
[8].
The result of all these contributions is that the essential questions have been identified, along with
most of the right answers. I shall summarize the points that I consider definitely settled, although
they do not completely solve the problem which we are discussing in this paper.
These are the opinions of Willard McCarty.
Just a tool: otherwise intelligent colleagues refer to the computer as "just a tool" or simply "a bunch
of techniques", as if ways of knowing did not have much to do with what is known. Because the
computer is a meta-instrument, a means of constructing virtual instruments or models of knowing,
we need to understand the effects of modelling on the work we do as humanists.
Creative expression and mechanical analysis: What is the relationship between creative expression
and mechanical analysis? What scholarly role can the algorithmic machine play in the life of the
mind as practising scholars live it, and how might this role best be carried out? The effects of
computing may easily be overemphasized, and often are, but we have good reason to suspect that
fundamental changes are afoot.
Mediation of thought by the machine: From the beginning it has been quite clear that humanities
computing is centred on the mediation of thought by the machine and the implications and
consequences of this mediation for scholarship. We are reminded by the cultural sea-change of
which the computer is a most prominent manifestation that our older scholarly technologies,
such as alphabetic writing, the codex, and printing, are technologies, and that they also shape
our thinking.
Methodologies: What jumps immediately into focus is the importance of methodologies. When you
teach humanities computing what immediately becomes obvious is that the only subject you have to
talk about is the methodology.
Computing and the humanities not separated: That computing and the humanities are fundamentally
separate is an illusion caused by a lack of historical perspective and perpetuated in the
discipline-based structure of our institutions.
Philosophical training: In the broad sense, philosophical questions naturally arise out of a machine
that mediates knowledge and whose modelling of cognition reflects back on the question of how we
know what we know. Philosophical training would seem a sine qua non because of its disciplined and
systematic focus on logic and critical thinking skills, as well as a concern with how to interpret
diverse representations of knowledge, including what philosophers and literary critics jointly
refer to as hermeneutics.
Computing not purely utilitarian: The assumption that computing mimics what we already do, that it
is purely utilitarian, would mean that projects were thoughtlessly undertaken, software then written
and put out into the field, but it seems that we can save much grief by prior thought about the
questions we'd want to ask.
The labour-saving myth: We know this myth to be silly; we know that only the dull, unimaginative
scholar would not be inclined to do a better job with the time liberated from mechanical work. We
also know that the computer does not so much save labour as change the nature as well as scope of
what we labour at.
Research methods: We must objectify our research methods before we can compute the artefacts we
study, and in so doing we bring out into the open what has formerly been hidden from view. Part of
the problem has been the attitude in the humanities by which the physical bits and craftsmanship of
research, its technology, are relegated to a lesser status.

Roly Sussex [9], about a new epistemology, observed that what is interesting about
computational methods is that these methods are providing us with both a new methodology
and a new epistemology. The notion of data is undergoing a reworking. Humanists are
learning to interpret statistical reports on what our software says the text is doing. This whole
process is tending to bring some areas of the Humanities closer to questions of methodology in
other disciplines, and indeed to make the Humanities more scientific.
Manfred Thaller [10]: We are dealing with methods, that is, the canon (or set of tools) needed to
increase the knowledge agreed to be proper to a particular academic field. Computer science is
a very wide ranging field. At one extreme, it is almost indistinguishable from mathematics and
logic; at another, it is virtually the same as electrical engineering. This, of course, is a
consequence of the genealogy of the field. Having widely different ancestors in itself, computer
science in turn became parent to a very mixed crowd of offspring. The existence of this wide
variety of disciplines, related to or spun off from computer science in general, implies two things.
First, there must be a core of computer science methods, which can be applied to a variety of
subjects. Second, for the application of this methodological core, a thorough understanding of
the knowledge domain to which it is applied is necessary. The variety of area specific computer
sciences is understandable from the need for specialized expertise in the knowledge domain of
each application. The core of all applied computer sciences is more than the sum of its
intellectual ancestors, which may themselves be inextricably associated with particular
knowledge domains. If we accept the assumption that the successful application of
computational methods strongly depends on the domain of knowledge to which it is applied,
then we also have to accept that applying computational methods without an understanding of
that domain will be disastrous.
We conclude that it is pointless to teach computer science to humanities scholars or students
unless it is directly related to their domain of expertise. We conclude that humanities
computing courses are likely to remain a transient phenomenon, unless they include an
understanding of what computer science is all about.
As I said, I consider the points examined so far as settled, but this does not entirely solve the
problem which we are discussing now. Before illustrating my opinion on it, it seems convenient
to clarify why it is important to discuss the problem, and the limits of the discussion.
In fact, all this would not be worth our time if it had no practical consequences for
academic organization. A discipline exists independently of the will of scholars. It can be
acknowledged or refused, but if it really exists, it cannot be either created or destroyed. Besides,
knowledge is theoretically unitary and interdisciplinary, and the separation of the
disciplines is only valid as a useful means for teaching, and partially for research.
The proposal of a (new) discipline concerns the official academic organization of the different
states. They are at last beginning to acknowledge the importance of teaching computer
applications (in this case, to the humanities) to students, but, as we have seen above, their
approach is far from consistent. We must distinguish between basic computer literacy
("alphabetization"), which may usefully be left to the informaticians, and the teaching of
applications for research. In the latter case it is important to pose the problem of how the
teachers themselves will be trained. The idea of mechanically blending some courses of general
computer science with the normal, traditional humanities courses in a curriculum is dangerous
and, as far as we can judge from present experience, disastrous.
If it is not too late, we must try to persuade the academic organizations that humanities
computing as a discipline in fact exists, and show them how it is shaped. These are my arguments.
I begin by observing that the application of computing is not the same as the application of
computers. The computer, as I see it, is not the type of machine whose tokens are in front of us,
on our desks or laps or palms, but the set of devices (not one device!) described by von
Neumann as the realisation (we add) of the Turing universal machine, along the lines of the
construction of the ENIAC, EDVAC, and the Mark I.
The Turing machine is central in my approach to the problem of humanities computing, because
it is the abstract, logical (I prefer to avoid the term mathematical) model underlying every
realisation of a computer. Only an abstract, logical model can clarify the methodological problems
raised by the meeting of the humanities with computers. In other words, I am separating the
concept of a normal machine, like the book or the typewriter or the calculator, from that of the
universal machine, of the automaton per se.
Such a view may of course be disputed, but if it is accepted, the next step is to realize that the
computer may be used in two different ways: (1) to simulate the behaviour of another machine,
because the computer can simulate any possible machine; (2) in its full capacity as a computing
machine, that is, for the peculiarity which distinguishes the computer from all other machines,
which consists in the possibility of doing computation as developed in the theory of recursive
functions. The first option is that adopted by those who would be content with teaching
basic computer literacy ("alphabetization") courses. The second requires as a matter of course
the institution of an independent discipline.
The distinction is important, because it helps to establish why the application of computers
raises methodological problems, and to what extent it does so. It seems evident that
when the computer is applied in the humanities only so far as it simulates (does the work of) a
traditional machine, no new methodological problems arise, because there is no
substantial difference from the traditional procedures, other than in speed and convenience.
On the contrary, when the computer is applied in its full capacity of running algorithms,
the humanities are confronted with a radically new situation, for which there is no commonly
recognised methodology. Something new happened in the field of epistemology when A. Turing
published his famous paper On Computable Numbers, because after it some of the rules which
help to build our knowledge were changed in a basic way.
The use of computers may require (or sometimes produce) a change in our minds. I would say
that the Turing machine is in fact a way of thinking, the formal way of thinking, which might have
remained restricted to the discipline of mathematics, had it not given birth, as a by-product, to
computers. Although some of the elements of the new methodology were present in many
disciplines before the advent of computers, the systematic use of the Turing scheme, and the
possibility of using computers in the humanities, is fundamentally altering part of all humanities
disciplines.
In order to be used in a proper way, that is, in order that it may give good results, or in any case
the wanted results, the Turing machine dictates some conditions; in particular, it dictates the
formalization of reasoning and the formalization of data. If we accept this, we understand the
importance of teaching a good theory of formalization, and especially one which is valid in the
field of the humanities. As often in such instances, everybody has an intuitive idea of what
formalization is, but only a specialist in humanities computing can teach the right idea.
Computation is introducing new methodological concepts and procedures into the humanities,
especially as regards the formalization of problems and data representation.
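What it means for the Turing machine to dictate the formalization of reasoning and of data can be illustrated with a small sketch. It is not taken from the essay: the machine (a binary incrementer), the state names and the function `run` are all hypothetical choices made for illustration. The transition table plays the part of the formalized reasoning, and the tape of symbols plays the part of the formalized data.

```python
from collections import defaultdict

BLANK = "_"  # assumed symbol for an empty tape cell

# Formalized "reasoning": (state, symbol read) -> (symbol to write, head move, next state).
# This hypothetical table adds 1 to a binary number written on the tape.
INCREMENT = {
    ("seek_end", "0"): ("0", +1, "seek_end"),
    ("seek_end", "1"): ("1", +1, "seek_end"),
    ("seek_end", BLANK): (BLANK, -1, "carry"),
    ("carry", "1"): ("0", -1, "carry"),   # 1 + carry = 0, carry continues leftwards
    ("carry", "0"): ("1", 0, "halt"),     # 0 + carry = 1, done
    ("carry", BLANK): ("1", 0, "halt"),   # ran off the left edge: new leading 1
}


def run(table, tape_text, state="seek_end", max_steps=10_000):
    """Run a one-tape machine until it reaches the 'halt' state; return the tape."""
    # Formalized "data": a tape of discrete symbols, blank wherever nothing is written.
    tape = defaultdict(lambda: BLANK, enumerate(tape_text))
    head = 0
    steps = 0
    while state != "halt":
        if steps >= max_steps:
            raise RuntimeError("step limit reached before halting")
        new_symbol, move, state = table[(state, tape[head])]
        tape[head] = new_symbol
        head += move
        steps += 1
    cells = sorted(tape)
    return "".join(tape[i] for i in range(cells[0], cells[-1] + 1)).strip(BLANK)


print(run(INCREMENT, "1011"))
```

Nothing in the machine knows what a number is. The result is obtained only because both the procedure and the material were written down in fully explicit form beforehand, which is the condition the paragraph above describes.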
On the other hand, it is easy to realise (a) that part of the humanities was computed well
before computers were used, and (b) that even where the computer is used as if it were an
ordinary machine, it imposes some constraints on the form of data, which did not exist before.
The reflection on, and clarification of, all these fundamental issues seems both necessary and
urgent, as is, consequently, the foundation of an independent scientific discipline,
humanities computing, which studies the problems of formalization and models that cross all
humanities disciplines (linguistics, literature, history, archaeology, history of art, history of music),
but which none of them can fully develop by itself.
(17 May 2002)

NOTES

[1] Robert Proctor: Defining the Humanities. Bloomington/Indianapolis: Indiana University Press 1998 (2nd
ed.). Another reason for me to cite this book is the interesting part about humanities curriculum, which should
be carefully considered, although, of course, the idea of introducing humanities computing is far from his view.
[2] See the URL <http://www.hd.uib.no/AcoHum/aco-hum.html> (21.4.2002).
[3] <http://www.hd.uib.no/AcoHum/book/> (21.4.2002), chapter 2: European studies on formal methods in the
humanities. Cf. also the list of academic centres of humanities computing by W. McCarty and M.
Kirschenbaum, in: Humanities computing units and institutional resources,
<http://www.kcl.ac.uk/humanities/cch/wlm/hcu/> (21.4.2002).
[4] Guy Fawkes Day 1999, cf. URL: <http://www.iath.virginia.edu/hcs/> (21.4.2002).
[5] Sponsored by The Computer Science and Telecommunications Board (CSTB) of the National Research
Council, in an attempt to explore the complexities of cross-disciplinary collaboration = American Council of
Learned Societies, Occasional Paper No. 41: Computing and the Humanities, cf. URL:
<http://www.acls.org/op41-toc.htm> (21.4.2002).
[6] W. McCarty: Poem and Algorithm. Humanities Computing in the Life and Place of the Mind. Keynote
speech for: HumanITies. Information Technology in the Arts and Humanities: Present Applications and Future
Perspectives, The Open University Milton Keynes 10 October 1998. W. McCarty: We would know how we
know what we know. In: The Transformation of Science: Research between Printed Information and the
Challenges of Electronic Networks. Max Planck Gesellschaft, Schloss Elmau, 31 May - 2 June 1999, URL:
<http://ilex.cc.kcl.ac.uk/wlm/essays/know/> (21.4.2002).
[7] Vol. 12; Centre for Computing in the Humanities, King's College London Cp. URL:
<http://www.princeton.edu/~mccarty/huanist/> (21.4.2002).
[8] Forms of Theory. Some Models for the Role of Theory in Humanities-Computing Scholarship, abstract in:
International seminars Computers, Literature and Philology (CLiP) 06.-09.12.2001, URL: <http://www.uniduisburg.de/FB3/CLiP2001/abstracts/Lavagnino-en.htm> (21.4.2002).
[9] Centre for Computing in the Humanities, vol. 13, No. 351 (note 7).
[10] Advanced Computing, cap. 2 (note 3).
You can send your comments and suggestions to:
orlandi@rmcisadu.let.uniroma1.it
Bibliographic reference: Orlandi, Tito. Is Humanities computing a discipline? Jahrbuch für
Computerphilologie [online], 2002, no. 4, pp. 51-58. Available at:
<http://computerphilologie.uni-muenchen.de/jg02/orlandi.html>. (Accessed ...).

Journal of Computer-Mediated Communication

The Role of Internet User Characteristics and Motives in Explaining Three Dimensions of
Internet Addiction
Junghyun Kim
Paul M. Haridakis
Kent State University

doi:10.1111/j.1083-6101.2009.01478.x

In a little more than a decade, the Internet has revolutionized mediated communication and communication flow. With the pace of change and the emergence of
new uses of the Internet (e.g., YouTube, MySpace) over this time, researchers have
continued to struggle with explaining various positive and negative effects of Internet
use that have garnered attention. Some have suggested that Internet use can enhance
living conditions by providing access to diverse information (Bauer, Gai, Kim,
Muth, & Wildman, 2002), widen users' social circles (e.g., Hampton & Wellman,
2003; Katz & Aspden, 1997; Rheingold, 1993), and enhance psychological well-being
(Chen, Boase, & Wellman, 2002; Kang, 2007). Others have considered some potential
negative effects of the Internet, arguing that it can be an isolating medium leading
to loneliness, less social interaction with family members and friends (e.g., Kraut,
Patterson, Lundmark, Kiesler, Mukopadhyay, & Scherlis, 1998; Sanders, Field,
Diego, & Kaplan, 2000; Stoll, 1995; Turkle, 1996), and clinical depression (Young &
Rogers, 1998).
One negative effect that has received considerable attention over the last several
years is the extent to which people may become addicted to the Internet. The ongoing
evolution of Internet use and growth in the amount of time people spend using the
Internet has fueled this concern. Researchers have used different terms to describe
very similar types of behavior. These include problematic Internet use (Caplan,
2002; Davis, Besser, & Flett, 2002), pathological Internet use (Morahan-Martin
& Schumacher, 2000), Internet dependency (Anderson, 1998; Scherer, 1997), and
Internet addiction (Beard & Wolf, 2001; Griffiths, 1996; Young, 1996a). In the current
study we use the term Internet addiction for consistency. However, it must be noted
that conceptual confusion surrounding this emotion-laden term has made it difficult
to ascertain the precise psychopathology arguably associated with it (Shaffer, 2004).
For example, whereas terms such as dependency and addiction have a longstanding
history of being used interchangeably in the context of drug and alcohol abuse
(Eisenman, Dantzker, & Ellis, 2004), in media studies such terms have very different
historical meanings. For instance, dependence or reliance on a particular medium or
channel has been viewed as a normal consequence of using a medium to satisfy one's
communication needs (e.g., Ball-Rokeach, 1985; Rubin & Windahl, 1986), even if it
is associated with heavy use and extreme affinity with the medium.
Journal of Computer-Mediated Communication 14 (2009) 988–1015 © 2009 International Communication Association
Such divergent conceptualizations of Internet addiction result in two glaring
problems that researchers must remedy. First, without clarification, we are left to
struggle with distinguishing between use that may reflect mere dependency on a
medium (which media researchers have suggested is a normal consequence of media
use), mere heavy use (which may or may not be healthy), and actual addiction
(a pathological state as understood in contexts such as substance addiction). This,
in turn, leads to a lack of clarity among professionals and policymakers who
need to understand exactly what problems and symptoms, if any, they have to
address. Second, divergent conceptualizations hinder the advancement of theoretical
explanations about when Internet users exhibit characteristics of use that amount
to addiction and the identification of antecedent factors/conditions that may
influence this psychological state.
In the current study, we draw on prior addiction research in an effort to
synthesize prior thinking on the current subject, and attempt to conceptualize
Internet addiction in a manner that is consistent with conceptualizations of addiction
in other contexts, such as substance abuse. This is necessary, because if Internet
addiction is a problematic phenomenon, it should have similar indicators and
psychosocial risk factors as other addictions (Shaffer, 2004). Because certain traits
or background characteristics have been considered to be significant predictors of
addiction in both Internet and other contexts such as alcoholism (Loos, 2002; Medora
& Woodward, 1991) and drug abuse (Rokach & Orzeck, 2003), we also suggest the
need to explore more deliberately users' background characteristics that may make a
user prone to Internet addictive behavior. We also examine whether motives for using
the Internet help explain such behavior. Research conducted within the uses and
gratifications perspective (U&G) over the past 3 decades has shown that background
characteristics and media-use motives can enhance or mitigate media effects (e.g.,
Rubin, 2002). Therefore, we examine how Internet-use motives and background
characteristics work together and help explain Internet addiction.
In this study we focus on the Internet, generally, rather than addiction to specific
content that may be available via the Internet. The possibility that people can be
addicted to general use of the Internet has been investigated in a group of previous
studies (e.g., Caplan, 2002; Davis, 2001; Young, 1996b). It may be that some people
turn to the Internet to fulfill needs for particular content (e.g., violence, pornography)
or behavior (e.g., gambling). However, our goal here is simply to add to prior research
that has suggested that people can be addicted to the Internet itself, but failed to
account for differences in users' background factors and motives that may contribute
to such a consequence.
Addiction

A number of scholars have suggested that addiction does not necessarily have to
involve abuse of a chemical intoxicant or substance (Griffiths, 1999; Young, 2004).
For example, the term addiction has been used to refer to a range of excessive
behaviors, such as gambling (Griffiths, 1990), video game playing (Keepers, 1990),
eating disorders (Lesieur & Blume, 1993), physical exercise (Morgan, 1979), and
media use (e.g., Horvath, 2004; Kubey, Lavin, & Barrows, 2001). Although such
behavioral addictions do not involve a chemical intoxicant or substance, a group
of researchers have proposed that some core indicators of behavioral addiction are
similar to those of chemical or substance addiction, such as loss of control, tolerance,
withdrawal, and negative life consequences (Brown, 1993; Lesieur & Blume, 1993;
Marks, 1990). It also has been suggested that individuals who engage in different
types of addictive behaviors share similar reasons, such as relief of anxiety, boredom,
and depression (Lesieur & Rosenthal, 1991; Zweben, 1987).
The Diagnostic and Statistical Manual of Mental Disorders (DSM) has been one
widely used source for identifying indicators of addiction. The DSM, published by the
American Psychiatric Association (APA), is a handbook that lists a diverse range of
mental disorders, including addiction, and criteria for diagnosing them. The DSM-IV
is the latest major revision, published in 1994. It divides disorders into four categories:
clinical disorders, cognitive disorders, mental retardation, and personality disorders.
Although Internet addiction, specifically, has not been recognized as a disorder by the
APA, the APA did recommend further research on overuse of the Internet and video games
(American Psychiatric Association, 2006). The APA's recommendation suggests the
value of exploring further whether Internet addiction can or should be categorized
as another type of addiction, as promoted by some researchers (Griffiths, 1999;
Young, 2004).
Internet Addiction

Concerns that people can become addicted to a medium pre-date the Internet. For
example, popular books such as The Plug-In Drug (Winn, 1977) referenced addictive
properties of television, and researchers have explored this further in recent years
(Horvath, 2004).
Regardless of medium, using an emotion-laden term such as addiction has
been controversial. This has been the case with the Internet as well. Nonetheless, it has
caught the attention of and spurred debate among the APA (American Psychiatric
Association, 2006), medical professionals, and social scientists. For example, at its
annual conference in June 2007, members of the APA considered a proposal to
include excessive Internet use as an addiction, but decided to table it for further
investigation (Mandell, 2007). Jerald J. Block, M.D., in an editorial published in
The American Journal of Psychiatry, suggested that Internet addiction has become an
increasingly commonplace compulsive-impulsive disorder
that merits inclusion in DSM-V (p. 306). However, other
medical professionals, such as Dr. Stuart Gitlow of the American Society of Addiction
Medicine, have rejected such a suggestion and argued that there is not enough evidence
that Internet addiction is a complex physiological state close to alcoholism or drug
addiction (Martin, 2007).
There also has been a lack of agreement among social scientists. While some
have promoted the notion that Internet addiction can or should be categorized as
another type of addiction (Griffiths, 1999; Young, 2004), others have contended
that we should focus more on other sources of maladjustment that lead people to
unhealthy use of the Internet rather than on the Internet itself (e.g., Walther, 1999).
The lack of agreement among the medical and scholarly communities implies that
a clear definition of this disorder has yet to be developed (Shaffer, Hall, & Vander
Bilt, 2000).
Even with some unresolved issues, a growing body of research has suggested that
DSM-IV may offer the most promise for identifying Internet addiction (Brenner,
1997; Thatcher & Goolam, 2005; Widyanto & McMurran, 2004; Young, 1996a) or
addiction to specific online content (e.g., sexual content) (Bingham & Piotrowski,
1996). Others have used the DSM-IV criteria to conceptualize and operationalize
addiction to other media such as television (e.g., Horvath, 2004; Winn, 1977).
Using the DSM diagnostic criteria of substance addiction (e.g., alcohol, cocaine),¹
Goldberg (1996) specified four criteria for diagnosing Internet addiction: 1)
one needs to increase the amount of time spent online to achieve the same effect
(tolerance), 2) one experiences an unpleasant feeling when he/she is not online
(withdrawal), 3) one needs to access the Internet more often and for longer periods
of time (craving), and 4) one experiences conflicts between Internet use and other
activities (negative life outcomes). Griffiths (1998) added three more criteria: 1) using
the Internet becomes the most important activity in one's life (salience), 2) one uses
the Internet to improve one's mood (mood modification), and 3) one keeps returning
to old patterns of Internet use despite efforts to cut down (relapse).
Young (1997) defined Internet addiction as a type of impulse control disorder. She
created a 20-item Internet addiction scale based on the DSM-IV diagnostic criteria
used for diagnosing substance addiction and pathological gambling. Indeed, Young
(1996b, 1998) found that addictive Internet users exhibited tolerance, withdrawal,
and negative academic and occupational consequences that were consistent with
those exhibited by substance abusers.
In light of this line of research, Kubey et al. (2001) suggested that pathological
users of the Internet were engaged in a much more excessive form of use than mere
reliance or dependence. Whereas many Internet users may spend a great deal of time
online, heavy use or reliance does not necessarily reflect what may be one of the
most important characteristics of Internet addiction: the loss of control. It has been
suggested, for example, that those who struggle with Internet addiction are compelled
to spend significant time involved with various Internet activities even though these
activities cause them to neglect family, work, or school obligations. These intemperate
problems reflect a user's loss of control over Internet use, increasing involvement
with the Internet and an inability to curtail this involvement in spite of adverse
consequences associated with such use (Shaffer, 2004). Such a loss of control is
reflected in DSM-IV criteria for identifying addiction in various contexts.
Unfortunately, although the DSM-IV criteria for diagnosing pathological
gambling and substance addiction have been used for
identifying Internet addiction, most research has not been theoretically grounded.
Therefore, we do not have a good overarching theoretical picture of relationships
among variables that may predict Internet addiction. As Kubey et al. (2001) argued,
there is a need, at a minimum, for theoretical explanations why the Internet may have
a hold on some individuals. In the current study, we use an audience-centered media
effects approach, uses and gratifications (U&G) theory, to study Internet addiction.
U&G focuses specifically on how various media user background characteristics,
motives for using media, and media use patterns work in concert to influence
effects. Thus, it provides a theoretical framework with which we can consider
the relative contribution of social and psychological antecedent factors that have
predicted addiction in other contexts (e.g., substance addiction), and media-use
motive variables that have been linked to addiction to other media (e.g., television)
to Internet addiction.
Uses and Gratifications Theory (U&G)

U&G suggests that an individual's underlying needs drive his/her communication
behavior. Therefore, people are not viewed as being equally or uniformly purposive,
motivated and active in their use of media to satisfy underlying needs. Individual
factors, the nature of use, and expectations toward the media, and their content
mediate outcomes of use, both intended (e.g., the satisfaction of particular needs)
and unintended (e.g., addiction) (Katz, Blumler, & Gurevitch, 1974). A typical U&G
model would suggest that one's social and psychological circumstances influence
one's needs (perceptible in communication motives), which, in turn, influence
selection and use of communication channels (e.g., mediated and interpersonal), and
outcomes. Although background factors and motives influence media effects, U&G
suggests that a more complete picture of the route to media effects involves various
factors (i.e., psychological and social characteristics, media use motives, and media
use) working together. The U&G model guiding the current study is summarized in
Figure 1.
Psychological and Social Characteristics of Users

Pursuant to U&G, ascertaining factors that contribute to a particular outcome of
media use begins with consideration of potentially relevant background characteristics
of media users. To our knowledge, there has not been any research that included
a comprehensive list of characteristics that could contribute to Internet addiction.
In addition, not all psychological and social characteristics potentially relevant to
Internet addiction can be identified or incorporated in a single study. Nonetheless, in
this study we did include several characteristics that consistently have been associated
with both Internet addiction and addiction in other contexts (e.g., substance, alcohol).
Figure 1 Uses and Gratification Model for Internet Addiction.

Shyness

Shyness refers to inhibition of normally expected social behavior as a result of tension,
concern, feelings of awkwardness, and discomfort when one interacts with strangers
or casual acquaintances (Cheek & Buss, 1981). Individuals who are shy tend to feel
uncomfortable and awkward in face-to-face interaction because of their social anxiety
or communication apprehension (Morahan-Martin, 2007). They may feel their social
discomfort is alleviated when interacting with others online because of the Internet's
greater anonymity, and continue to use the Internet instead of meeting people offline
(Morahan-Martin & Schumacher, 2000). A group of studies have suggested that the
higher one's level of shyness, the greater the likelihood one would be addicted to the
Internet (Chak & Leung, 2004; Yuen & Lavin, 2004). In contexts of substance use and
alcohol problems, research has suggested that people who were shy were more likely
to use drugs and alcohol (Ensminger, Juon, & Fothergill, 2001; Santesso, Schmidt,
& Fox, 2004). Thus, we expect a positive relationship between shyness and Internet
addiction.
Sensation-seeking

Sensation-seeking is a personality trait that reflects how willing a person is to seek
novel or arousing stimuli (Perse, 1996). In the context of Internet use, Armstrong,
Phillips, and Saling (2000) found that high sensation-seekers exhibited more addictive
Internet use behaviors than low sensation-seekers. Outside of the Internet context,
sensation-seeking has been linked with the selection of media content, particularly
arousing content such as violent and pornographic fare (e.g., Krcmar & Greene,
1999; Oliver, 2002; Perse, 1996). Sensation-seeking has been linked specifically to
engaging in substance use (Wagner, 2001), alcohol use, and susceptibility to future
alcohol problems (Robbins & Bryan, 2004). Thus prior research suggests a potentially
positive relationship between sensation-seeking and Internet addiction.
Loneliness

According to McKenna and Bargh (2000), individuals who feel lonely because of
their lack of good social skills try to overcome their problems through online
social interactions. As in the case of shy individuals, lonely people may use the
Internet for social compensation when they are not satisfied with their offline
interpersonal relationships (Papacharissi & Rubin, 2000). Reliance on the Internet to
alleviate loneliness may lead to problematic Internet use (Caplan, 2002, 2003; Davis,
2001). Kubey et al. (2001) also suggested a link between loneliness and Internet
addiction, claiming that lonely people feel socially incompetent and tend to feel
more comfortable with online activities. Outside of the Internet context, loneliness
has been linked to drug use (Grunbaum, Tortolero, Weller, & Gingiss, 2000) and
alcoholism (e.g., Akerlind & Hornquist, 1989; Loos, 2002; Medora & Woodward,
1991; Nerviano & Gross, 1976). Based on this prior research, we predicted a positive
association between loneliness and Internet addiction.

Locus of control

Locus of control refers to an individual's belief about the extent to which he/she
is in control of his/her life (i.e., internal locus of control) vis-à-vis the extent to
which he/she believes external forces (e.g., other people or chance) are in control
of his/her life (i.e., external locus of control) (Rotter, 1966). According to Chak
and Leung (2004), individuals who believed that they had control over their lives
were less likely to be addicted to the Internet, because they believed that they could
maintain healthy Internet use behaviors. If that argument has merit, individuals who
believe that external factors control their lives may be more susceptible to Internet
addiction. In other media contexts, Wober and Gunter (1982) found that individuals
who were externally controlled were heavier TV viewers than those who were
internally controlled. External control also has been linked to problematic effects of
television use, such as increased aggression (Haridakis, 2002). Researchers have found
that high external locus of control scores in adolescents predicted heavy substance
use (Bearinger & Blum, 1997) and alcohol use (Steele, Forehand, Armistead, &
Brody, 1995).

Self-esteem

According to Baumeister (1993) and Swann (1996), individuals with low self-esteem
have negative evaluations about themselves and are suspicious of praise. In order to
withdraw or escape from these negative evaluations and stresses, individuals with
low self-esteem tend to engage in addictive behavior such as substance abuse (e.g.,
Craig, 1995; Hirschman, 1992; Marlatt, Baer, Donovan, & Kivlahan, 1988). In the
context of Internet use behavior, Armstrong, Phillips, and Saling (2000) found that
low self-esteem was a significant positive predictor of addictive Internet use. Outside
of the Internet context, Peele (1985) suggested that one of the reasons people may
become addicted to media use is to bolster their self-esteem. Consistent with the
research results of Armstrong et al. (2000) and Peele (1985), we treat self-esteem as a
possible negative predictor of Internet addiction.
Applying these findings from prior research on the relationships between diverse
background characteristics and substance or behavioral addiction, the following
hypotheses are posed:
H1a: Shyness, sensation-seeking, and loneliness will be positively related to Internet addiction.
H1b: Internal locus of control and self-esteem will be negatively related to Internet addiction.

Amount of Use

U&G, the theoretical framework guiding this study, suggests that exposure to a
medium is an important antecedent to media effects. U&G also suggests that media
use can be related to unintended consequences of use, such as Internet addiction.
In fact, Widyanto and McMurran (2004) found that the higher the amount of time
spent online, the greater the extent of symptoms of Internet addiction. Leung (2004)
also suggested that hours spent on the Internet per day was a positive predictor of
Internet addiction. Similarly, Horvath (2004) found that those who measured higher
than their counterparts on a measure of television addiction tended to be heavier
television viewers. The results of these studies indicate that amount of Internet use
and Internet addiction have been treated as distinct but related concepts in prior
Internet addiction research. If, as prior research suggests, heavier users of a medium
are likely to be more prone to be addicted to the medium, the amount of use is an
important variable to consider.
H2: The amount of Internet use will be positively associated with Internet addiction.

Motives for Using the Internet

Kubey et al.'s (2001) claim that addictive Internet users use the Internet to meet
others suggests the importance of examining communication motives for using
the Internet. Peele's (1985) claim that individuals addicted to media use them to
gain a sense of control in their lives and to bolster self-esteem also suggests the
importance of considering the role of motives when exploring predictors of Internet
addiction. U&G has been one of the predominant theoretical frameworks used to
study the influence of media use motives on media effects over the last 30 years or so.
Researchers specifically have suggested that people use the Internet for a variety of
interpersonal (e.g., affection, inclusion, social interaction) and media-related reasons
(e.g., entertainment, information seeking, passing time, escape) (e.g., Charney &
Greenberg, 2002; Ebersole, 2000; Ferguson & Perse, 2000; Kaye & Johnson, 2004;
Papacharissi & Rubin, 2000). Accordingly, we assessed a range of such motives
individuals may have for using the Internet.
Some researchers have considered the influence of motives for using the Internet
on both Internet dependency (authors, in press) and Internet addiction (Chou &
Hsiao, 2000; LaRose, Lin, & Eastin, 2003). However, little research has systematically
explored possible links between the range of motives individuals may have for using
the Internet and Internet addiction. This is a significant gap in the research, because prior media use research
suggests that motives impact effects (see Rubin, 2002 for a review of studies).
Specifically, it has been suggested that more purposive and instrumental use (e.g.,
information seeking, control, caring for others, etc.) may inhibit negative outcomes and
that more habitual use (e.g., habitual entertainment, escape) enhances the likelihood
of unintended and potentially negative outcomes of use (Song, LaRose, Eastin, &
Lin, 2004). We wanted to see if this was the case for Internet addiction.
Integrating the previous research on the effects of user characteristics, media
use motives, and the amount of use on addiction, the following research question is
put forth:
RQ1: How do users' background characteristics, motives, and the amount of Internet use
contribute to Internet addiction?

Research Methods
Sample

The sample included 203 undergraduate students, ranging from freshmen to seniors
from a variety of majors, enrolled in a multisection course required as part of a large
Midwestern U.S. university's liberal education requirement. The sample was 48%
men and 52% women. The mean age was 21.5 years (SD = 5.32). Students were asked
to come into a classroom and complete a pen-and-paper survey. Given the exploratory
nature of this research, we felt the sample was appropriate. College students tend
to use a variety of Internet functions (Morahan-Martin & Schumacher, 2000). In
addition, computers and the Internet were widely available across campus, and all
students were required to use the Internet.
Measures
Internet addiction scale

Internet addiction was measured by asking respondents how often they engaged
in each of 31 indicators of Internet addiction (1 = Never, 5 = Very Often). This
index consisted of 20 items from Young's (1996a) Internet Addiction Test (IAT)
and 11 items from Horvath's (2004) Television Addiction Scale. Both measures are
based on DSM-IV criteria in line with the assumption that media addiction shows
symptoms that are similar to addiction to other devices/substances (e.g., drugs).
The reason we chose Young's scale for this study was that it had been widely used
for measuring Internet addiction in previous research (e.g., Chak & Leung, 2004;
Hur, 2006; Pratarelli, Browne, & Johnson, 1999; Thatcher & Goolam, 2005). We
added items from Horvath's (2004) Television Addiction Scale, since
we felt it was prudent to encompass additional DSM-IV criteria that were not
included in Young's scale. Because the measure we used comprised items
drawn from different media addiction scales, we subjected it to principal components
factor analysis with varimax rotation to uncover any possible underlying component
structure. Factors with an eigenvalue of at least 1.0 were retained, as were items with
primary loadings of at least .50 that did not load substantially on another factor (i.e.,
a difference larger than .20 between primary and secondary loadings). Five factors
emerged when all 31 items were entered into the factor analysis. However, three items
from escaping reality, two items from attachment, all four items from the fourth factor,
and all three items from the fifth factor were removed because they cross-loaded on more
than one factor. The remaining 19 items formed three factors, which explained 62.5% of
the variance after rotation. Responses to the items that loaded on each factor were
summed and averaged to create an index of each Internet addiction dimension.
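The item-retention rule described above can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: the function name `retain_item`, the use of absolute loadings, and the example loadings are hypothetical and are not taken from the study's data or software.

```python
def retain_item(loadings, min_primary=0.50, min_gap=0.20):
    """Return True if an item meets both retention criteria.

    loadings: rotated factor loadings of one item, one value per factor.
    The thresholds mirror the criteria stated in the text; comparing
    absolute values is an assumption made for this sketch.
    """
    ranked = sorted((abs(x) for x in loadings), reverse=True)
    primary = ranked[0]
    secondary = ranked[1] if len(ranked) > 1 else 0.0
    # Keep the item only if it loads strongly on one factor and is
    # separated from its next-highest loading by more than min_gap.
    return primary >= min_primary and (primary - secondary) > min_gap


# Hypothetical loadings, chosen only to exercise each branch of the rule.
clean_item = [0.80, 0.10, 0.20]    # strong primary loading, clear gap
cross_loaded = [0.55, 0.45, 0.10]  # gap of .10 is too small
weak_item = [0.42, 0.10, 0.05]     # primary loading below .50

print(retain_item(clean_item), retain_item(cross_loaded), retain_item(weak_item))
```

Applying a rule of this kind item by item is one plausible way to arrive at a reduced item pool such as the 19 items retained here.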
Factor 1, intrusion (eigenvalue = 8.78), explained 46.2% of the variance after
rotation. Items comprising this factor reflected that using the Internet became
intrusive to participants' everyday life (M = 1.66, SD = 0.77, α = .92) (e.g., "I often
find that I stay online longer than I intended," "I often neglect household chores to
spend more time online"). Factor 2, escaping reality (eigenvalue = 2.06), explained
10.8% of the variance. This factor suggested that the Internet was a tool for escaping
reality (M = 2.63, SD = 0.91, α = .90) (e.g., "I often block out disturbing thoughts
about my life with soothing thoughts of using the Internet," "I often snap, yell,
or act annoyed if someone bothers me while I am online"). Factor 3, attachment
(eigenvalue = 1.04), explained 5.5% of the variance. This factor reflected a strong
attachment or affinity for the Internet (M = 1.93, SD = 0.92, r = .43) (i.e., "I can't
imagine living without the Internet," "When I am unable to use the Internet, I miss
it so much that I feel upset"). The final results of the factor analysis are depicted in
Table 1.
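As a quick check on the figures above, the share of variance attributed to each factor can be recomputed from its eigenvalue. The sketch below assumes that each percentage is the eigenvalue divided by the number of items in the final solution (19), which is the usual convention for components extracted from a correlation matrix; the variable names are illustrative.

```python
# Eigenvalues reported in the text for the three retained factors.
eigenvalues = {"intrusion": 8.78, "escaping reality": 2.06, "attachment": 1.04}
n_items = 19  # items retained in the final solution

# Assumption: percentage explained = eigenvalue / number of analyzed items.
explained = {name: 100 * ev / n_items for name, ev in eigenvalues.items()}

for name, pct in explained.items():
    print(f"{name}: {pct:.1f}%")
print(f"total: {sum(explained.values()):.1f}%")
```

Under that assumption the computed shares agree, after rounding, with the 46.2%, 10.8%, 5.5%, and 62.5% reported in the text.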
Amount of Internet use

General Internet use behavior was measured with two questions asking how much
time participants spent using the Internet yesterday (the day before they participated
in the survey) and how much time they spent using the Internet on a typical day. These
two items have been used in research on other media, such as television,
and have produced reliable estimates (Haridakis, 2002). Answers to the two questions
were summed and averaged (M = 194 minutes, SD = 117.7).
Table 1 Primary Factor Loadings of Internet Addiction Scale

                                                                   Factor Loadings
Items                                                              Intrusion  Escaping reality  Attachment
Intrusion
  Lose track of time when I am online                                    .75               .16         .24
  Stay online longer than I intended                                     .74               .07         .00
  Neglect household chores to spend more time online                     .74               .31        -.01
  Check emails and Instant Messenger before doing other things           .74               .05        -.08
  Would be more productive without going online                          .70               .26         .30
  Would enjoy more hobbies without going online                          .68               .22         .16
  Try to cut down the amount of time spent online but fail               .66               .43         .24
  Lose sleep due to late night logins                                    .63               .44         .23
  Find myself saying "Just a few more minutes" online                    .62               .35         .29
  Compared to others, I spend more time online                           .54               .41         .31
Escaping Reality
  Block out disturbing thoughts with thoughts of going online            .10               .83         .14
  Others complain about the amount of time I spend online                .33               .79         .11
  Form new relationships online                                          .23               .73         .01
  Snap, yell, or act annoyed if others bother me when I am online        .13               .71         .26
  Prefer going online to intimacy with friends and family                .34               .70         .12
  Feel preoccupied with the Internet when offline                        .12               .68         .37
  Find myself anticipating going online                                  .36               .67         .18
Attachment
  Can't imagine living without the Internet                              .25               .12         .78
  If I cannot use Internet, I miss it so much that I am upset            .05               .36         .73
Mean                                                                    1.66              2.63        1.93
Standard Deviation                                                       .77               .91         .92

Motives

Internet-use motives were measured with a 45-item Internet motives scale used
in prior research (Papacharissi & Rubin, 2000). This scale taps several motives
associated with using the Internet, ranging from interpersonal motives (e.g., inclusion,
control, affection) to media-use motives gleaned from prior media research (e.g.,
entertainment, escape, pass time, information seeking). We added four
items taken from Rubin's (1983) television motives scale that were not covered in
the Internet motives scale. These items reflected using the Internet for thrill and
excitement. Respondents were asked how well each of the 49 statements matched
their own reasons for using the Internet (1 = Not at all, 5 = Exactly). All items
were subjected to principal components factor analysis with varimax rotation. To
retain a factor, we used the same criteria used in the factor analysis of the Internet
addiction measure. Eight factors emerged when all 49 items were entered into the
factor analysis, but 20 items were removed. Specifically, five items from habitual
entertainment, one item from seeking information, three items from escapism, four
items from control, all four items from the seventh factor, and all three items from
the eighth factor were removed because of their high cross-loading values. The
remaining 29 items formed six factors, which explained 62.4% of the variance after
rotation. Responses to the items that loaded on each factor were summed and averaged
to create respective indexes of motives.
The first motive, habitual entertainment (eigenvalue = 11.44), explained 34.7%
of the variance after rotation. This factor was composed of items that reflected
both habitual use and using the Internet to be entertained (M = 1.66, SD = 0.77,
α = .93) (e.g., "Because it's fun just to play around and check things out," "Because
it's just a habit, just something to do"). The second motive, caring for others
(eigenvalue = 3.06), explained 9.3% of the variance. This factor reflected using the
Internet to show others affection and care (M = 2.45, SD = 0.83, α = .85) (e.g.,
"To help others," "To let others know I care about their feelings"). Factor 3,
economical information seeking (eigenvalue = 2.24), explained 6.8% of the variance
and contained items reflecting using the Internet to search for and share information
conveniently (M = 4.00, SD = 0.66, α = .78) (e.g., "To get information for free"
and "Because it is cheaper than other ways of sending information to other people").
The fourth factor, excitement (eigenvalue = 1.51), explained 4.6% of the variance.
This three-item factor reflected using the Internet to seek excitement and thrill
(M = 2.72, SD = 1.10, α = .90) (e.g., "Because it is thrilling," "Because it is
exciting"). Factor 5, control (eigenvalue = 1.27), explained 3.8% of the variance.
Items comprising this factor reflected that people used the Internet to affect and
control others' behavior (M = 2.69, SD = 0.67, α = .73) (e.g., "To tell others what
to watch or see," "Because I want someone to do something for me"). The final factor,
escape (eigenvalue = 1.05), explained 3.2% of the variance. This factor included two
items, "So I can get away from what I'm doing" and "So I can forget about school,
work or other things" (M = 2.96, SD = 1.20, r = .67). The final results of the factor
analysis of the motives scale are depicted in Table 2.
Background characteristics

We measured locus of control with Levenson's (1974) 12-item index. These 12
items included powerful others control (e.g., "My life is controlled by powerful
others"), chance control (e.g., "To a great extent, my life is controlled by accidental
happenings"), and internal control (e.g., "My life is determined by my own action").
Both powerful others control and chance control items were reverse-coded so that
higher scores on all 12 items represented stronger internal locus of control (Haridakis,
2002; Rubin, 1993). Responses were summed and averaged (M = 3.64, SD = 0.53,
α = .78).
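The reverse-coding and averaging step can be illustrated with a minimal sketch. It assumes the 1-5 agreement scale used for the background measures; the item names, the three-item example, and the helper functions are hypothetical and do not reproduce the actual 12-item index.

```python
SCALE_MIN, SCALE_MAX = 1, 5  # response scale assumed for the background measures


def reverse_code(score):
    """Flip a response so that 1 becomes 5, 2 becomes 4, and so on."""
    return SCALE_MIN + SCALE_MAX - score


def internal_control_index(responses, reversed_items):
    """Average one respondent's items after reverse-coding the flagged ones.

    responses: dict mapping item name to the raw 1-5 response.
    reversed_items: names of items that measure external control.
    """
    recoded = [
        reverse_code(value) if item in reversed_items else value
        for item, value in responses.items()
    ]
    return sum(recoded) / len(recoded)


# Hypothetical respondent with three illustrative items (the real index has 12).
answers = {"internal_1": 4, "powerful_others_1": 2, "chance_1": 1}
index = internal_control_index(answers, {"powerful_others_1", "chance_1"})
print(round(index, 2))
```

After recoding, low agreement with an external-control statement counts the same as high agreement with an internal-control statement, so a higher average indicates a stronger internal locus of control.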
Rosenberg's Self-Esteem Scale (1965) was used to measure participants' self-esteem.
Responses to the 10 items in the scale were summed and averaged to create
the self-esteem index (M = 3.90, SD = 0.55, α = .90).
For shyness, participants answered a 9-item shyness scale developed by Cheek and
Buss (1981). Responses to these items were summed and averaged to create a shyness
index (M = 2.36, SD = 0.66, α = .81).

Table 2 Primary Factor Loadings of Internet Use Motive Scale

                                                                Internet Use Motives
Items                                                                 HE      CO     EIS  Excite Control  Escape
Habitual Entertainment (HE)
  Because it's just a habit, just something to do                    .79     .11     .11     .06     .11     .20
  Because it is entertaining                                         .76     .11     .25     .16     .05     .03
  Because it's fun just to play around and check things out          .76     .15     .24     .00     .14     .04
  Because it's enjoyable                                             .74     .10     .19     .18     .12     .10
  Because I just like to use it                                      .73     .09     .11     .12     .28     .27
  Because it gives me something to occupy my time                    .72     .16     .27     .15     .06     .26
  When I have nothing better to do                                   .64     .00     .13     .22     .27     .30
  Because it amuses me                                               .63     .04     .14     .29     .21     .26
Caring for Others (CO)
  To let others know I care about their feelings                     .07     .82     .08     .21     .02     .18
  To show others encouragement                                       .15     .81     .02     .07     .12     .26
  To belong to a group with the same interests as mine               .23     .69     .04     .13     .21     .00
  To give my input                                                   .06     .59     .10     .13     .11     .02
  Because I enjoy answering other people's questions                 .08     .58     .09     .13     .39     .14
  To help others                                                     .01     .58     .39     .08     .22     .02
Economical Information Seeking (EIS)
  To search for information                                          .13     .05     .72     .07     .05     .05
  Because it is easier to get information                            .26     .05     .71     .13     .08     .08
  To get information for free                                        .34     .03     .67     .06     .10     .04
  Because it is easier to get information                            .11     .11     .55     .07     .35     .32
  Because people don't have to be there when you send messages       .19     .31     .54     .02     .05     .27
  Because it provides a new and interesting way to do research       .16     .19     .53     .23     .03     .04
Excitement (Excite)
  Because it is exciting                                             .27     .24     .16     .83     .15     .05
  Because it is thrilling                                            .34     .19     .04     .79     .23     .04
  Because it peps me up                                              .19     .20     .08     .75     .26     .26
Control
  Because I want someone to do something for me                      .14     .27     .12     .19     .63     .18
  To get something I don't have                                      .09     .06     .28     .22     .60     .05
  Because it allows me to unwind                                     .34     .19     .09     .23     .60     .26
  To tell others what to watch or see                                .33     .38     .02     .08     .59     .00
Escape
  So I can forget about school, work or other things                 .34     .12     .02     .12     .23     .70
  So I can get away from what I'm doing                              .39     .11     .00     .17     .30     .65

We measured sensation-seeking with Zuckerman's (1979) 40-item index. This has
been a widely used measure of risk-taking behavior in communication research (e.g.,
Krcmar & Greene, 1999). It is comprised of four subscales: thrill seeking, experience
seeking, disinhibition, and boredom susceptibility. Responses were summed and
averaged to form indexes for each dimension: thrill seeking (M = 3.24, SD =
0.81, α = .86), experience seeking (M = 2.98, SD = 0.74, α = .89), disinhibition
(M = 3.04, SD = 0.81, α = .78), and boredom susceptibility (M = 2.74, SD = 0.61,
α = .70).
Finally, a shortened 10-item version of the UCLA loneliness scale (Russell,
1996) was used to measure loneliness in the current study. Responses were averaged
(M = 1.89, SD = 0.66, α = .89). For all the six background characteristics measures,
participants were asked how strongly they agreed/disagreed with each item (1 =
Strongly Disagree, 5 = Strongly Agree).
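The reliability coefficients reported for these indexes are Cronbach's alpha values. As a minimal sketch of how such a coefficient is obtained from raw item responses, the function below applies the standard formula, alpha = k/(k - 1) x (1 - sum of item variances / variance of total scores). The function name and the small data set are hypothetical; they are not the study's data.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents' item scores.

    rows: list of equal-length lists, one list of item scores per respondent.
    """
    k = len(rows[0])                          # number of items in the index
    items = list(zip(*rows))                  # transpose: one tuple per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


# Hypothetical responses from four people to a three-item index.
data = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [5, 4, 4],
]
alpha = cronbach_alpha(data)
print(round(alpha, 2))
```

Alpha rises as items covary more strongly relative to their individual variances, which is why it is read as an indicator of internal consistency.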
Results

Hypotheses 1a and 1b predicted that shyness, sensation-seeking, and loneliness
would be positively related to Internet addiction, while internal locus of control and
self-esteem would relate negatively to Internet addiction. H1a was fully supported.
Shyness, sensation seeking, and loneliness related positively to all three dimensions
of Internet addiction. H1b was also fully supported. Internal locus of control and
self-esteem were negatively related to all three dimensions of Internet addiction (Table 3).
Hypothesis 2 posed that the amount of Internet use would be positively related
to Internet addiction. This hypothesis was supported. The amount of time using the
Internet was positively related to all three dimensions of Internet addiction (r = .33,
p < .01 for intrusion; r = .36, p < .01 for escaping reality; r = .16, p < .05 for
attachment).
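The significance levels attached to these correlations can be checked with the usual t transformation of a Pearson coefficient, t = r x sqrt(n - 2) / sqrt(1 - r^2). The sketch below assumes that all 203 respondents contributed to each correlation and uses approximate two-tailed critical values for roughly 200 degrees of freedom; both are assumptions made for illustration, not details taken from the article.

```python
import math


def t_from_r(r, n):
    """t statistic for a Pearson correlation r based on n paired observations."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


N = 203  # sample size reported in the Sample section
# Approximate two-tailed critical values of t for about 200 degrees of freedom.
T_CRIT_05, T_CRIT_01 = 1.97, 2.60

for label, r in [("intrusion", 0.33), ("escaping reality", 0.36), ("attachment", 0.16)]:
    t = t_from_r(r, N)
    if abs(t) > T_CRIT_01:
        level = "p < .01"
    elif abs(t) > T_CRIT_05:
        level = "p < .05"
    else:
        level = "n.s."
    print(f"{label}: r = {r:.2f}, t({N - 2}) = {t:.2f}, {level}")
```

With these assumptions, the two larger correlations clear the .01 threshold and the attachment correlation clears only the .05 threshold, which matches the pattern reported above.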

Table 3 Bivariate Correlation Analyses Results

                            Dimensions of Internet addiction
                            Intrusion  Escaping reality  Attachment
Shyness                      .19**      .35**             .24**
Loneliness                   .16*       .23**             .30**
Sensation-seeking
  Disinhibition              .30**      .29**             .21**
  Thrill seeking             .20**      .16*              .20**
  Experience seeking         .22**      .18*              .12*
  Boredom susceptibility     .19**      .22**             .24**
Internal control            -.39**     -.41**            -.32**
Self-esteem                 -.23**     -.15*             -.17*

Note: *p < .05, **p < .01

Our research question (RQ1) asked about the contribution of users' background
characteristics, motives, and the amount of Internet use to explaining Internet
addiction. Hierarchical regression analyses were used to examine the contribution
of these antecedent variables to predicting each of the three dimensions of Internet
addiction we identified: intrusion, escaping reality, and attachment. Pursuant to
the assumptions of a contemporary U&G model referenced above, these antecedent
factors were entered into the regression equations in the following order: users'
demographic information and personality traits (Step 1), Internet use motives
(Step 2), and the amount of Internet use (Step 3). None of the predictors' variance
inflation factor (VIF) values exceeded 10, a guideline for serious multicollinearity
(Rawlings, Pantula, & Dickey, 1998), suggesting that multicollinearity did not pose a
significant problem.
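For readers unfamiliar with the index, a predictor's VIF is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all of the other predictors. The sketch below only illustrates how the index behaves; the R^2 values are hypothetical and are not estimates from this study.

```python
def vif(r_squared):
    """Variance inflation factor for one predictor.

    r_squared: R^2 from regressing that predictor on all other predictors.
    """
    return 1.0 / (1.0 - r_squared)


# Hypothetical R^2 values, chosen only to show how the index behaves.
for r2 in (0.20, 0.50, 0.90):
    print(f"R^2 = {r2:.2f} -> VIF = {vif(r2):.2f}")
```

A VIF of 10 therefore corresponds to a predictor that shares 90% of its variance with the remaining predictors, which is why it serves as a threshold for serious multicollinearity.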
The hierarchical multiple regression equation with all the variables entered
accounted for 46% of the variance in intrusion [R = .68, p < .01, F(16, 186) = 9.90,
p < .01]. Variables entered on Step 1 (gender, self-esteem, shyness, locus of control,
loneliness, and four dimensions of sensation-seeking) accounted for 22% of the
variance (R² = .22, p < .01). Entering motives on Step 2 accounted for an additional
20% of the variance (ΔR² = .20, p < .01), and entering the amount of use on Step 3
explained an additional 4.2% of the variance (ΔR² = .042, p < .01). Specifically, locus
of control and loneliness were significant negative predictors of intrusion. Using the
Internet for purposes of caring for others, for excitement, to escape, and amount
of Internet use were significant positive predictors. Using the Internet for habitual
entertainment was a significant negative predictor.
Meanwhile, the multiple regression equation accounted for 54.6% of the variance
in escaping reality [R = .74, p < .01, F(16, 186) = 14.00, p < .01]. Psychological
and social factors entered in Step 1 accounted for 30% of the variance (R² = .30,
p < .01). Entering motives on Step 2 accounted for an additional 22% of the variance
(ΔR² = .22, p < .01), and entering the amount of use on Step 3 explained an additional
3% of the variance (ΔR² = .03, p < .01). Among background characteristics, locus
of control was a significant negative predictor of escaping reality. Gender, shyness,
habitual entertainment motivation, escape motivation, and amount of Internet use
were significant positive contributors to escaping reality.
Finally, the hierarchical multiple regression equation explained 31.1% of the
variance in attachment [R = .56, R² = .31, F(16, 186) = 5.24, p < .01].
Psychological and social factors entered in Step 1 accounted for 24% of the variance
(R² = .24, p < .01). Entering motives on Step 2 accounted for an additional 7% of the
variance (ΔR² = .07, p < .01). However, entering amount of use on Step 3 did not
increase R² significantly (ΔR² = .003, p = .39). Gender, thrill-seeking, and using the
Internet for purposes of caring for others and for excitement were significant positive
contributors to attachment. Locus of control was a significant negative predictor of
attachment. Final results of the hierarchical regression analyses are summarized in
Table 4.
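The F statistics reported for the three equations follow from the explained variance, the number of predictors, and the sample size. The sketch below recomputes them from the figures given in the text, assuming N = 203 and 16 predictors in each final equation (as implied by the reported degrees of freedom); the function names are illustrative. The second function shows the corresponding test for the increment contributed by a single step.

```python
def overall_f(r2, n_predictors, n_cases):
    """F statistic for the full model, with (k, N - k - 1) degrees of freedom."""
    df_resid = n_cases - n_predictors - 1
    return (r2 / n_predictors) / ((1 - r2) / df_resid)


def f_change(delta_r2, r2_full, n_added, n_predictors, n_cases):
    """F statistic for the R^2 increment contributed by one block of predictors."""
    df_resid = n_cases - n_predictors - 1
    return (delta_r2 / n_added) / ((1 - r2_full) / df_resid)


N, K = 203, 16  # sample size and number of predictors in the final equations

# Proportions of explained variance as reported in the text.
for label, r2 in [("intrusion", 0.46), ("escaping reality", 0.546), ("attachment", 0.311)]:
    print(f"{label}: F({K}, {N - K - 1}) = {overall_f(r2, K, N):.2f}")

# Step 3 adds a single predictor (amount of Internet use) to each equation.
print(f"intrusion, Step 3: F change = {f_change(0.042, 0.46, 1, K, N):.2f}")
print(f"attachment, Step 3: F change = {f_change(0.003, 0.311, 1, K, N):.2f}")
```

The recomputed overall values land within rounding error of the reported F statistics (9.90, 14.00, and 5.24), and the very small increment for attachment yields an F change below 1, in line with the nonsignificant Step 3 result.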

Table 4 Summary of the Final Results of Hierarchical Regression Analyses

                                  Final β
                                  Intrusion  Escaping reality  Attachment
Step 1 (Background characteristics)
  Gender                           .02        .12*              .23**
  Self-esteem                      .01        .10               .04
  Shyness                          .09        .22**             .13
  Locus of control                -.22**     -.18**            -.18*
  Loneliness                      -.17*       .04               .11
  Sensation-seeking
    Disinhibition                  .03        .03               .03
    Thrill seeking                 .08        .07               .22**
    Experience seeking             .13        .08               .09
    Boredom susceptibility         .05        .05               .15
Step 2 (Internet use motives)
  Habitual entertainment          -.20*       .23**             .02
  Caring for others                .28**      .08               .18*
  Economical information seeking   .11        .06               .03
  Excitement                       .20**      .02               .18*
  Control                          .04        .02               .08
  Escape                           .18*       .24**             .03
Step 3
  Amount of Internet use           .23**      .18**             .06

Note. All βs are final βs on the last step of the regression. N = 203.
*p < .05, **p < .01

Discussion

In the present study, we conceptualized and operationalized Internet addiction by
considering indicators drawn from DSM-IV criteria that have been relied upon in
prior studies of addiction in substance and media-related contexts. These criteria
include indicators that reflect a loss of control (e.g., a compelling need to use it
though it is having negative consequences on one's life) that some researchers have
alleged distinguishes mere heavy use from addictive behavior (e.g., Young, 1996b).
Our exploratory factor analysis uncovered three specific dimensions of such Internet
use behaviors.
Dimensions of Internet Addiction

The first dimension, intrusion, reflects a manifestation of Internet use in which users
neglect activities in their everyday lives (e.g., chores, etc.) due to their unhealthy
Internet use. Individuals who exhibit this form of use tend to use the Internet for
longer periods than they intend. They seem aware of their problematic Internet
use, but are unable to correct it satisfactorily. The second dimension, which we
term escaping reality, seems to be a more intense manifestation of possible Internet
addiction than either of the other two dimensions. Whereas intrusion reflects a sense
that the Internet is interfering with one's offline life, those whose Internet use behavior
reflects escaping reality see offline activities as interfering with their online lives.
Those exhibiting this form of use experience anger when others hinder their Internet
use, prefer time online over time with friends and family, and are preoccupied
with Internet use even when they aren't online. The final dimension, attachment,
reflects a strong emotional connection to the Internet. Users exhibiting this form
of Internet-use behavior could not imagine living without the Internet. Although they
might get upset or agitated if unable to go online, this attachment to the Internet
does not seem to be as disruptive to users' offline activities as when their Internet
use is manifested in the form of intrusion or escaping reality. It does, however, involve
becoming upset when one cannot use the Internet, which may reflect a more
intense feeling of loss or withdrawal than that experienced by those who simply have
an affinity for or reliance on the medium.
But reaching definitive conclusions from just one study using a convenience
sample would be premature, and any interpretation must be tempered. It is tempting, for example, to
speculate that intrusion and attachment may be less intense forms of Internet use that
may be precursors of the more intense form of use, escaping reality, if not negated
through intervention. This might reflect that there can be a progression in Internet
addiction moving from the milder to the intense level (Charlton & Danforth, 2007).
It would be similar to claims made in the context of substance addiction that the use
of "soft" drugs can lead to the use of "hard" drugs as addiction progresses (Hopper,
1995). It is also possible, though, that intrusion, attachment, and escaping reality
are three distinct forms of Internet use and that one does not necessarily lead to the
other. This would be consistent with claims made by Caplan (2002) that dimensions
of addiction are distinct and not a continuum of progression. Again, though, either
speculation is premature from the results of just one study. One reason we can't
reach definitive conclusions about the related or distinct nature of the different
dimensions of Internet addiction is that no consensus has been reached on the
dimensions or stages of Internet addiction in previous research. Another reason is
that there was not a consistent set of predictors across the three different dimensions
of Internet use behavior in the current study.
In addition, among this convenience sample, the mean values of each dimension
(intrusion M = 1.66, escaping reality M = 2.63, attachment M = 1.93) were low.
Thus, even if our measure is a valid and reliable measure of addictive behavior, on the
whole, this student sample did not seem to exhibit such behaviors to an inordinate degree. Future
research should target particular populations that do measure high on such indicators
to see if the factors identified here (intrusion, escaping reality, attachment) prove to be
stable in confirmatory factor analyses, and valid and reliable when studied with other
variables to which addiction should be linked theoretically.
Items composing intrusion, escaping reality, and attachment are from DSM-IV
criteria that have been used for diagnosing addiction in diverse contexts. However,
when applied to media contexts, we should be cautious in unabashedly interpreting
the three dimensions found in the current study as addiction. Rather, they might
reflect a tendency toward addiction or addictive behaviors. Though much more
research is needed, the fact that antecedents that had been associated with addiction
in different contexts were linked with the three dimensions identified in this study
suggests we should at least consider the possibility that these measures reflect aspects
of addictive behavior or a tendency toward it.
Background Characteristics

Specific characteristics that had been linked with substance or behavioral addiction in
previous research (i.e., shyness, sensation-seeking, and loneliness) were positively
related to all three dimensions of Internet-use behavior in this study. Meanwhile,
internal locus of control and self-esteem, which were negatively related to substance
or behavioral addiction in prior research, also were negatively related to all three
dimensions of Internet-use behavior identified in this study. If the dimensions
identified here do reflect addictive behavior, these results could suggest that Internet
addiction may be explained and conceptualized in accordance with addiction in other
contexts.
Shyness was a positive predictor of the second (escaping reality) dimension of
Internet-use behavior. This supports prior research linking this personality trait with
various forms of substance (e.g., Ensminger et al., 2001; Santesso et al., 2004) and
behavioral (e.g., gaming, gambling) addictions (Murali & George, 2007). It also may
corroborate prior research suggesting links between shyness and Internet addiction,
specifically (e.g., Chak & Leung, 2004). But, again, we must be cautious in reaching
such a conclusion. For example, it has been suggested that shy people find offline
interaction less satisfying and supportive, so they go online to feel more comfortable
interacting with others (e.g., Papacharissi & Rubin, 2000; Yuen & Lavin, 2004). Thus,
those who are shy in face-to-face interaction may use the Internet as an alternative
channel for social interaction. The Internet may provide them with a valuable tool
for impression management, control over their self-presentations, and expressing
greater communication competence than they typically have in direct face-to-face
interaction. However, for some individuals (perhaps the extremely shy) there may be
a darker side to their Internet use. The Internet may provide a means of escape from
uncomfortable everyday offline interactions for extremely shy people and lead them
to an unhealthy preference for online communication activities over their offline
activities. But our results do not necessarily suggest the latter. Before concluding that
shyness correlates with media use in the same way it correlates with alcohol or drug
addiction, we must recognize that the underlying connections may be quite different.
This has to be explored further beyond our exploratory study.
Internal locus of control was a strong negative predictor of all three dimensions
of Internet-use behavior identified here. This may suggest that externally controlled
Internet users may be particularly prone to developing an addiction to the Internet.
In other words, individuals with high external locus of control might not believe
that they can control and moderate their Internet-use behaviors. In studies of
television use, research has suggested that externally controlled viewers are prone to
unintended negative consequences of use such as postviewing aggression (Haridakis,
2002), cultivation effects such as fear (Wober & Gunter, 1982), and concern with
safety (Haridakis & Rubin, 2005). Given its predictive strength, locus of control
should continue to receive greater attention in media addiction studies. As in the case
of shyness, though, results regarding the connection between locus of control and the
Internet-use dimensions here should be interpreted cautiously. For some externals,
the Internet may provide them with an opportunity to attempt to exercise some
control in their lives that they otherwise lack (e.g., Peele, 1985). Future research should
seek to differentiate between such positive effects, and the potentially unhealthy links
between locus of control and Internet use that our results may suggest.
Zero-order correlation analysis suggested that loneliness related positively with
all three dimensions of Internet addition. However, it was a significant negative
predictor of intrusion in the multiple regression analysis. Thus, when a wider array of
variables was considered, the relationship between loneliness and Internet addiction
was not so straightforward. This finding suggests that prior research linking loneliness
to at least some aspects of addictive behavior could be an artifact of the failure to
account for other variables that may mediate that relationship. This possibility may
also explain why self-esteem and sensation seeking were related to dimensions of
addiction, but failed to predict any specific dimension in the regression analysis.
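The sign reversal described above, where loneliness correlates positively with intrusion yet becomes a negative predictor once other variables are considered, can be illustrated with the standard two-predictor formula for standardized regression coefficients. The following is a minimal sketch only: the three correlation values and the variable names are hypothetical and are not taken from this study's data.

```python
def standardized_betas(r_y1, r_y2, r_12):
    """Standardized OLS coefficients for two predictors.

    r_y1, r_y2: zero-order correlations of each predictor with the outcome.
    r_12: correlation between the two predictors.
    """
    denom = 1.0 - r_12 ** 2
    if denom <= 0:
        raise ValueError("predictors must not be perfectly correlated")
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    return beta1, beta2


# Hypothetical values (assumptions for illustration, not the study's results):
r_lonely_intrusion = 0.20   # loneliness with intrusion, positive zero-order r
r_other_intrusion = 0.50    # a second background variable with intrusion
r_lonely_other = 0.60       # overlap between the two predictors

beta_lonely, beta_other = standardized_betas(
    r_lonely_intrusion, r_other_intrusion, r_lonely_other
)
print(f"beta for loneliness: {beta_lonely:.3f}")
print(f"beta for other variable: {beta_other:.3f}")
```

With these assumed inputs the coefficient for loneliness is negative even though its zero-order correlation with the outcome is positive, because most of its shared variance with the outcome is carried by the correlated second predictor.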
This latter point should be stressed. We included specific background factors in
this study because of their links with addiction in prior contexts. While no single
study can include all of the individual differences that may impact media effects, there
are numerous other possible confounding variables that could be important to assess
in future research. For example, pursuant to uses and gratifications theory, various
psychological circumstances (e.g., depression, anxiety) and social circumstances (lack
of mobility, health problems, communication disabilities) could be relevant factors
that could make one more or less prone to Internet addiction or other problematic
use. Accordingly, future research should examine the influence of a wider array of
background factors.
Motives for Using the Internet

The second goal of this study was to ascertain whether certain motives for using the
Internet might predict addiction. Prior studies have not explored systematically the
potential influences of motives for using the Internet on Internet addiction. Here
we found that a number of motives differentially predicted different dimensions
of Internet use. The fact that different sets of motives predicted the three different
dimensions of Internet addiction (as measured here) might provide some hints on
distinguishing different intensity levels of addiction. Using the Internet for purposes
of habitual entertainment and escape predicted what is arguably the most intense dimension
of Internet addiction, escaping reality. This may corroborate U&G research suggesting
that more habitual use was less likely to mitigate, and at times might even enhance,
the likelihood of unintended negative effects of media use (e.g., Haridakis, 2002;
Rubin, 2002). On the other hand, neither of those motives predicted attachment.
Instead, motives that reflected using the Internet to care for others and to seek
excitement were significant predictors of attachment. Intrusion was predicted by
motives to escape, seek excitement, and to care for others. Habitual entertainment
was a negative predictor of intrusion, suggesting that individuals exhibiting this type
of Internet-use behavior tended not to rely on the Internet for habitual entertainment.
Whereas habitual entertainment was a negative predictor of intrusion, it was a
positive predictor of the Internet-use behavior we termed escaping reality. However,
it is not clear whether using the Internet to escape is a form of habitual use among
those exhibiting this form of Internet-use behavior. It may be that these individuals
purposively use the Internet to escape. This would be consistent with research in
substance abuse contexts suggesting that substance abusers use drugs to escape the
problems of their everyday lives (e.g., Dole & Nyswander, 1967). Therefore, our
speculation is that caring for others, seeking excitement, and perhaps escape motives
(in the case of escaping reality) could be characterized as purposive uses of the
Internet when users actively seek to achieve these goals from the Internet. This leads
us to speculate that engaging in behaviors that evidence addiction is sometimes
manifested in purposive goal-directed use of the Internet. Further investigation is
required about the relationship between purposive media use motives and negative
media use outcomes.
Amount of Internet Use

Most media effects research focusing on the role of the media in negative effects (e.g.,
violence) suggests that exposure is a central variable contributing to the effects. Prior
research also linked the amount of use with Internet addiction (e.g., Morahan-Martin
& Schumacher, 2000; Young & Rogers, 1998). Here, we found that amount of Internet
use correlated positively with all three dimensions of Internet-use behaviors. In the
multiple regression analyses, the amount of Internet use was a significant predictor
of both intrusion and escaping reality. In each instance, entering amount of Internet
use into the regression equations resulted in a significant increase in the explained
variance.
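The increase in explained variance when a block is entered at a later step of a hierarchical regression is conventionally evaluated with an F test on the change in R-squared. A minimal sketch, assuming hypothetical values (the R-squared figures, sample size, and predictor counts below are illustrative and are not the values reported in this study):

```python
def f_change(r2_reduced, r2_full, n, p_full, k_added):
    """F statistic for the R-squared change between nested OLS models.

    r2_reduced: R-squared before the new block is entered.
    r2_full:    R-squared after the new block is entered.
    n:          sample size.
    p_full:     number of predictors in the full model.
    k_added:    number of predictors added in the new block.
    Returns (delta_r2, F, (df_numerator, df_denominator)).
    """
    if not 0.0 <= r2_reduced <= r2_full < 1.0:
        raise ValueError("expected 0 <= r2_reduced <= r2_full < 1")
    df_resid = n - p_full - 1
    if k_added <= 0 or df_resid <= 0:
        raise ValueError("degrees of freedom must be positive")
    delta_r2 = r2_full - r2_reduced
    f_stat = (delta_r2 / k_added) / ((1.0 - r2_full) / df_resid)
    return delta_r2, f_stat, (k_added, df_resid)


# Hypothetical example: amount of Internet use entered last as a single
# predictor, raising R-squared from .30 to .35 in a sample of 200 with
# 10 predictors in the final equation.
delta_r2, f_stat, dfs = f_change(0.30, 0.35, n=200, p_full=10, k_added=1)
print(f"delta R2 = {delta_r2:.3f}, F{dfs} = {f_stat:.2f}")
```

The resulting F value would then be compared against the F distribution with the returned degrees of freedom to judge whether the increment is significant.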
As with the other variables in this study, though, the relationship between amount
of use and the dimensions of Internet-use behavior identified here should be explored
further. If Internet use can be addictive, it is logical to assume that those who are
addicted would use it extensively. But, not all heavy use is tantamount to addiction.
If that were the case, all heavy use of media could be deemed addiction. On the
whole, these college students used the Internet a significant amount of time, more
than 3 hours per day (194 minutes). Despite the fact that they used the Internet a
significant amount of time, as referenced above, the low means on the addiction
scale suggest they did not on the whole exhibit a high level of addictive Internet-use
behavior. Accordingly, future research should focus more deliberately on the loss
of control over one's media use that is reflected in DSM-IV criteria that comprised
the factors of Internet use identified here. The loss of control and the disruptive
effects it may have on the user and his/her relationship with others may be a major
distinguishing characteristic between mere heavy use and addiction.
Conclusion, Future Research, and Limitations

In summary, the results of this study suggest that Internet addiction may be
manifested in different ways. Here we identified three possible dimensions: intrusion,
escaping reality, and attachment. If, as the results suggest, some forms of Internet-use
behaviors are more intense and more detrimental than others, then future
research should be directed toward identifying with greater specificity exactly what
background characteristics of Internet users and motives for using the Internet
explain how addiction is manifested and which users are more susceptible to these
different manifestations of addiction.
The results also suggest that motives and background factors are
important potential contributors to addiction that should be included in future
research, particularly research that considers more specifically possible addiction to
particular Internet fare or functions. It may be that some Internet users who exhibit
indicators of addiction may be addicted to the Internet. It may also be that they are
addicted to content the Internet permits them to access, rather than the Internet
itself. Perhaps it is possible to be addicted to both the Internet and to particular
fare. But more research has to focus on distinguishing between potential addiction
to the medium and addiction to content that may be accessed via that medium.
With respect to the former, some researchers have suggested that those who are
addicted to the Internet spend more time with a variety of functions such as browsing
without specific goals (e.g., Caplan, 2002; Davis, 2001). For those who are addicted to
particular content, such as pornography, the Internet may simply be a delivery device
in the same way that a syringe is a delivery device for a substance abuser. In addition,
the Internet may only be one medium among others (e.g., videos, magazines) through
which they obtain that content. Whether future research focuses on the Internet or
particular content, the results here suggest that the inquiry should not ignore the
important influence of motives and background characteristics of users that may
make some more prone to addictive behavior than others. For example, over the
years, media research has suggested that some people use and develop an affinity for
media (such as television) whereas others develop an affinity for particular content
(e.g., see Rubin, 2002). Future Internet addiction research should consider profiles
of these different media-use orientations to see if those evidencing either are more or
less prone to addiction to a medium such as the Internet and/or addiction to content
available via the media.
In addition, the results of the current study also lead to a series of questions related
to Internet users. Especially, how far can we generalize the findings from a study of
college students, who did not exhibit a high level of problematic Internet-use behavior,
to potentially more at-risk groups who may be more prone to addictive media-use
behavior? Can the results here be considered applicable to other populations who do
not have the level of Internet access that college students have? Finally, many of the
scales measuring variables included in the current study were developed in decades
preceding the Internet or adapted from research in the 1990s, when the Internet was
in its infancy. The changing nature of Internet use, functions, and use environments
may require more advanced and up-to-date measures of variables that may be more
amenable to the study of media use and effects in ever changing media environments.
Notes
1 A maladaptive pattern of substance use, leading to clinically significant impairment
or distress, as manifested by three (or more) of the following, occurring at any time
in the same 12-month period.
(1) Tolerance, as defined by either of the following:
(a) a need for markedly increased amounts of the substance to achieve intoxication
or desired effect
(b) markedly diminished effect with continued use of the same amount of the
substance
(2) Withdrawal, as manifested by either of the following:
(a) the characteristic withdrawal syndrome for the substance (refer to Criteria A
and B of the criteria sets for Withdrawal from the specific substances)
(b) the same (or a closely related) substance is taken to relieve or avoid withdrawal
symptoms
(3) The substance is often taken in larger amounts or over a longer period than was
intended.
(4) There is a persistent desire or unsuccessful efforts to cut down or control substance
use.
(5) A great deal of time is spent in activities necessary to obtain the substance (e.g., visiting
multiple doctors or driving long distances), use the substance (e.g., chain-smoking),
or recover from its effects.
(6) Important social, occupational, or recreational activities are given up or reduced
because of substance use.
(7) The substance use is continued despite knowledge of having a persistent or recurrent
physical or psychological problem that is likely to have been caused or exacerbated
by the substance (e.g., current cocaine use despite recognition of cocaine-induced
depression, or continued drinking despite recognition that an ulcer was made worse
by alcohol consumption) (Behavenet.com, 2007).
References
Akerlind, I., & Hornquist, J. O. (1989). Stability and change in feelings of loneliness: A
two-year prospective longitudinal study of advanced alcohol abuse. Scandinavian Journal of
Psychology, 30(2), 102–112.
American Psychiatric Association. (2006). DSM: Diagnostic and statistical manual of mental
disorders (4th ed.). Retrieved January 23, 2008, from
http://www.psych.org/research/dor/dsm/dsmintro81301.cfm.
Anderson, K. (1998). Internet dependency among college students: Should we be concerned?
Presented at the Meeting of the American College Personnel Association, St. Louis, MO.
Armstrong, L., Phillips, J. G., & Saling, L. L. (2000). Potential determinants of heavier internet
usage. International Journal of Human–Computer Studies, 53, 537–550.
Baumeister, R. F. (1993). Self-esteem: The puzzle of low self-regard. New York: Plenum Press.
Bauer, J. M., Gai, P., Kim, J.-H., Muth, T., & Wildman, S. (2002). Broadband: Benefits and
policy challenges. A report prepared for Merit Network, Inc.
Ball-Rokeach, S. J. (1985). The origins of individual media-system dependency:
A sociological framework. Communication Research, 12(4), 485–510.
Bearinger, L. H., & Blum, R. W. (1997). The utility of locus of control for predicting
adolescent substance use. Research in Nursing, 20, 229–249.
Beard, K. W., & Wolf, E. M. (2001). Modification in the proposed diagnostic criteria for
Internet addiction. Cyberpsychology and Behavior, 4, 377–383.
Bingham, J. E., & Piotrowski, C. (1996). On-line sexual addiction: A contemporary enigma.
Psychological Reports, 79, 257–258.
Block, J. (2008). Issues for DSM-IV: Internet addiction. American Journal of Psychiatry, 165,
306–307.
Brenner, V. (1997). Psychology of computer use: XLVII: Parameters of Internet use, abuse
and addiction: The first 90 days of the Internet usage survey. Psychological Reports, 80,
879–882.
Brown, R. I. F. (1993). Some contributions of the study of gambling to the study of other
addictions. In W. R. Eadington & J. A. Cornelius (Eds.), Gambling behavior and problem
gambling (pp. 241–272). Reno: University of Nevada Press.
Caplan, S. E. (2002). Problematic Internet use and psychosocial well-being: Development of
a theory-based cognitive-behavioral measurement instrument. Computers in Human
Behavior, 18, 553–575.
Caplan, S. E. (2003). Preference for online social interaction: A theory of problematic
Internet use and psychosocial well-being. Communication Research, 30(6), 625–648.
Chak, K., & Leung, L. (2004). Shyness and locus of control as predictors of internet addiction
and internet use. Cyberpsychology & behavior: The impact of the Internet, multimedia and
virtual reality on behavior and society, 7(5), 559–570.
Charney, T., & Greenberg, B. S. (2002). Uses and gratification of the Internet:
Communication, technology and science. In C. Lin & D. Atkin (Eds.), Communication,
technology and society: New media adoption and use (pp. 379–407). Cresskill, NJ:
Hampton Press.
Charlton, J. P., & Danforth, I. D. W. (2007). Distinguishing addiction and high engagement
in the context of online game playing. Computers in Human Behavior, 23, 1531–1548.
Cheek, J. M., & Buss, A. H. (1981). Shyness and sociability. Journal of Personality and Social
Psychology, 41(2), 330–339.
Chen, W. J., Boase, J., & Wellman, B. (2002). The Global villagers: Comparing Internet users
and uses around the world. In B. Wellman & C. Haythornthwaite (Eds.), The Internet in
Everyday Life (pp. 74–113). Oxford: Blackwell.
Chou, C., & Hsiao, M.-C. (2000). Internet addiction, usage, gratification, and pleasure
experience: The Taiwan college students' case. Computers and Education, 35(1), 65–80.
Craig, R. J. (1995). The role of personality in understanding substance abuse. Alcoholism
Treatment Quarterly, 13, 17–27.
Davis, R. A. (2001). A cognitive-behavioral model of pathological Internet use. Computers in
Human Behavior, 17, 187–195.
Davis, R. A., Besser, A., & Flett, G. L. (2002). Validation of a new scale for measuring
problematic Internet use: Implications for pre-employment screening. Cyberpsychology &
Behavior, 5(4), 331–345.
Dole, V. P., & Nyswander, M. E. (1967). Heroin addiction: A metabolic disease. Archives of
Internal Medicine, 120(1), 19–24.
Ebersole, S. (2000). Uses and gratifications of the Web among students. Journal of
Computer-Mediated Communication, 6(1). Retrieved February 1, 2008, from
http://www.ascusc.org/jcmc/vol6/issue1/ebersole.html.
Eisenman, R., Dantzker, M. L., & Ellis, L. (2004). Self ratings of dependency/addiction
regarding drugs, sex, love, and food: Male and female college students. Sexual Addiction
& Compulsivity, 11, 115–127.
Ensminger, M. E., Juon, H., & Fothergill, K. E. (2001). Childhood and adolescent
antecedents of substance use in adulthood. Addiction, 97, 833–844.
Ferguson, D., & Perse, E. (2000). The World Wide Web as a functional alternative to
television. Journal of Broadcasting and Electronic Media, 44(2), 155–174.
Goldberg, I. (1996). Internet addiction disorder. Retrieved March 3, 2008, from
http://www.cog.brown.edu/brochures/people/duchon/humor/internet.addiction.html.
Griffiths, M. (1990). The cognitive psychology of gambling. Journal of Gambling Studies, 6,
31–42.
Griffiths, M. (1996). Internet addiction: An issue for clinical psychology? Clinical
Psychology Forum, 97, 32–36.
Griffiths, M. (1998). Internet addiction: Does it really exist? In J. Gackenbach (Ed.),
Psychology and the Internet: Intrapersonal, interpersonal, and transpersonal implications
(pp. 61–75). New York: Academic Press.
Griffiths, M. (1999). Internet addiction: Internet fuels other addictions. Student British
Medical Journal, 7, 428–429.
Grunbaum, J. A., Tortolero, S., Weller, N., & Gingiss, P. (2000). Cultural, social, and
intrapersonal factors associated with substance use among alternative high school
students. Addictive Behaviors, 25(1), 145–151.
Hampton, K. N., & Wellman, B. (2003). Neighboring in netville: How the Internet supports
community and social capital in a wired suburb. City and Community, 2(4), 277–311.
Haridakis, P. M. (2002). Viewer characteristics, exposure to television violence, and
aggression. Media Psychology, 4, 323–352.
Haridakis, P. M., & Rubin, A. M. (2005). Third-person effects in the aftermath of terrorism.
Mass Communication & Society, 8, 39–59.
Hirschman, E. C. (1992). The consciousness of addiction: Toward a general theory of
compulsive consumption. Journal of Consumer Research, 19, 155–179.
Hopper, E. (1995). A psychoanalytical theory of drug addiction: Unconscious fantasies of
homosexuality, compulsions and masturbation within the context of traumatogenic
processes. The International Journal of Psychoanalysis, 76, 1121–1142.
Horvath, C. W. (2004). Measuring television addiction. Journal of Broadcasting & Electronic
Media, 48(3), 378–398.
Hur, M. (2006). Demographic, habitual, and socioeconomic determinants of Internet
addiction disorder: An empirical study of Korean teenagers. Cyberpsychology & Behavior,
9(5), 514–525.
Kang, S. (2007). Disembodiment in online social interaction: Impact of online chat on social
support and psychosocial well-being. Cyberpsychology & Behavior, 10(3), 475–477.
Katz, J. E., & Aspden, P. (1997). A nation of strangers? Communications of the ACM, 40(12),
81–86.
Katz, E., Blumler, J. G., & Gurevitch, M. (1974). Uses and gratifications research. The Public
Opinion Quarterly, 37(4), 509–523.
Kaye, B. K., & Johnson, T. J. (2004). Web for all reasons: Uses and gratifications of Internet
components for political information. Telematics and Informatics, 21(3), 197–223.
Keepers, G. A. (1990). Pathological preoccupation with video games. Journal of the American
Academy of Child and Adolescent Psychiatry, 29(1), 49–50.
Kraut, R., Patterson, M., Lundmark, V., Kiesler, S., Mukopadhyay, T., & Scherlis, W.
(1998). Internet paradox: A social technology that reduces social involvement and
psychological well-being? American Psychologist, 53(9), 1017–1031.
Krcmar, M., & Greene, K. (1999). Predicting exposure to and uses of television violence.
Journal of Communication, 49(3), 24–45.
Kubey, R. W., Lavin, M. J., & Barrows, J. R. (2001). Internet use and collegiate academic
performance decrements: Early findings. Journal of Communication, 51(2), 366–382.
LaRose, R., Lin, C. A., & Eastin, M. S. (2003). Unregulated internet usage: Addiction, habit,
or deficient self-regulation? Media Psychology, 5, 225–253.
Lesieur, H. R., & Blume, S. B. (1993). Pathological gambling, eating disorders, and the
psychoactive substance use disorders. Comorbidity of Addictive and Psychiatric Disorders,
9(1), 89–102.
Lesieur, H., & Rosenthal, R. (1991). Pathological gambling: A review of the literature
(prepared for the American Psychiatric Association Task Force on DSM-IV Committee
on Disorders of Impulse Control not elsewhere classified). Journal of Gambling Studies,
7(1), 5–39.
Leung, L. (2004). Net-generation attributes and seductive properties of the Internet as
predictors of online activities and Internet addiction. Cyberpsychology & Behavior, 7,
333–348.
Levenson, H. (1974). Activism and powerful others: Distinctions within the concept of
internal-external control. Journal of Personality Assessment, 38, 381–382.
Loos, M. D. (2002). The synergy of depravity and loneliness in alcoholism: A new
conceptualization, and old problem. Counseling and Values, 46, 199–212.
Mandell, J. (2007, July 3). Are gadgets, and the Internet, actually addictive? CNN.com.
Retrieved November 1, 2008, from
http://edition.cnn.com/2007/TECH/ptech/07/01/la.tech.addictions/index.html.
Marks, I. (1990). Non-chemical (behavioral) addictions. British Journal of Addiction, 85,
1389–1394.
Marlatt, A. G., Baer, J. S., Donovan, D. M., & Kivlahan, D. R. (1988). Addictive behaviors:
Etiology and treatment. Annual Review of Psychology, 39, 223–252.
Martin, M. (2007). Doctors dismiss video game addiction claim. Retrieved March 12, 2009,
from
http://www.gamesindustry.biz/articles/doctors-dismiss-videogame-addiction-claim.
McKenna, K. Y. A., & Bargh, J. A. (2000). Plan 9 from cyberspace: The implication of the
Internet for personality and social psychology. Personality and Social Psychology Review,
4, 57–75.
Medora, N. P., & Woodward, J. C. (1991). Factors associated with loneliness among
alcoholics in rehabilitation centers. Journal of Social Psychology, 131(6), 769–779.
Morahan-Martin, J. (2007). Internet use and abuse and psychological problems. In
A. Joinson, K. McKenna, T. Postmes, & U.-D. Reips (Eds.), The Oxford handbook of
Internet psychology (pp. 331–345). Oxford: Oxford University Press.
Morahan-Martin, J., & Schumacher, P. (2000). Incidence and correlates of pathological
Internet use among college students. Computers in Human Behavior, 16, 13–29.
Morgan, W. (1979). Negative addiction in runners. Physician and Sports Medicine, 7, 56–69.
Murali, V., & George, S. (2007). Lost online: An overview of Internet addiction. Advances in
Psychiatric Treatment, 13, 24–30.
Nerviano, V. J., & Gross, W. F. (1976). Loneliness and locus of control for alcoholic males:
Validity against Murray need and Cattell trait dimensions. Journal of Clinical Psychology,
32, 479–484.
Oliver, M. B. (2002). Individual differences in media effects. In D. Zillmann & J. Bryant (Eds.),
Media effects: Advances in theory and research (pp. 507–524). Mahwah, NJ: Erlbaum.
Papacharissi, Z., & Rubin, A. M. (2000). Predictors of Internet use. Journal of Broadcasting &
Electronic Media, 44(2), 175–196.
Peele, S. (1985). The Meaning of Addiction. Lexington, MA: Lexington Books.
Perse, E. M. (1996). Sensation seeking and the use of television for arousal. Communication
Reports, 9(1), 37–48.
Pratarelli, M. E., Browne, B., & Johnson, K. (1999). The bits and bytes of computer/Internet
addiction: A factor analytic approach. Behavior Research Methods, Instruments and
Computers, 31, 305–314.
Rawlings, J. O., Pantula, S. G., & Dickey, D. A. (1998). Applied Regression Analysis: A Research
Tool. Springer-Verlag, New Jersey.
Rheingold, H. (1993). The virtual community: Homesteading on the electronic frontier.
Reading, MA: Addison Wesley.
Robbins, R. N., & Bryan, A. (2004). Relationships between future orientation, impulsive
sensation seeking, and risk behavior among adjudicated adolescents. Journal of Adolescent
Research, 19, 428–445.
Rokach, A., & Orzeck, T. (2003). Coping with loneliness and drug use in young adults. Social
Indicators Research, 61(3), 259–283.
Rosenberg, M. (1965). Society and the adolescent self-image. Princeton: Princeton University
Press.
Rotter, J. B. (1966). Generalized expectancies for internal versus external control of
reinforcement. Psychological Monographs, 80, 1–28.
Rubin, A. M. (1983). Television uses and gratifications: The interactions of viewing patterns
and motivations. Journal of Broadcasting, 27, 37–52.
Rubin, A. M. (1993). The effect of locus of control on communication motivation, anxiety,
and satisfaction. Communication Quarterly, 41, 161–171.
Rubin, A. M. (2002). The uses-and-gratifications perspective of media effects. In J. Bryant &
D. Zillmann (Eds.), Media effects: Advances in theory and research (pp. 525–548).
Mahwah, NJ: Erlbaum.
Rubin, A. M., & Windahl, S. (1986). The uses and dependency model of mass
communication. Critical Studies in Mass Communication, 3, 184–199.
Russell, D. (1996). The UCLA Loneliness Scale (Version 3): Reliability, validity, and factor
structure. Journal of Personality Assessment, 66, 20–40.
Sanders, C. E., Field, T. M., Diego, M., & Kaplan, M. (2000). The relationship of Internet use
to depression and social isolation among adolescents. Adolescence, 35, 237–242.
Santesso, D. L., Schmidt, L. A., & Fox, N. A. (2004). Are shyness and sociability still a
dangerous combination for substance use? Evidence from a US and Canadian sample.
Personality and Individual Differences, 37, 5–17.
Shaffer, H. (2004). Internet gambling and addiction position paper. Boston: Harvard Medical
School, Division on Addictions.
Shaffer, H., Hall, M., & Vander Bilt, J. (2000). Computer addiction: A critical consideration.
American Journal of Orthopsychiatry, 70, 162–168.
Scherer, K. (1997). College life on-line: Healthy and unhealthy Internet use. The Journal of
College Student Development, 38, 655–664.
Song, I., LaRose, R., Eastin, M., & Lin, C. (2004). Internet gratifications and Internet
addiction: On the uses and abuses of new media. Cyberpsychology & Behavior, 7(4),
384–394.
Steele, R. G., Forehand, R., Armistead, L., & Brody, G. (1995). Predicting alcohol and drug
use in early adulthood: The role of internalizing and externalizing behavior problems in
early adolescence. American Journal of Orthopsychiatry, 65, 380–388.
Stoll, C. (1995). Silicon snake oil. New York: Doubleday.
Swann, W. B., Jr. (1996). Self-traps: The elusive quest for higher self-esteem. New York:
Freeman.
Thatcher, A., & Goolam, S. (2005). Development and psychometric properties of the
problematic Internet use questionnaire. South Africa Journal of Psychology, 35, 793–805.
Turkle, S. (1996). Virtuality and its discontents: Searching for community in cyberspace. The
American Prospect, 24, 50–57.
Young, K. S. (1996a). Psychology of computer use XI: Addictive use of the Internet: A case
study that breaks the stereotype. Psychological Reports, 79, 899–902.
Young, K. S. (1996b). Internet addiction: The emergence of a new clinical disorder. Paper
presented at the American Psychological Association. Toronto, Canada.
Young, K. S. (1997). What makes online usage stimulating? Potential explanations for
pathological Internet use. Symposia paper presented at the 105th Annual Meeting of the
American Psychological Association, Chicago.
Young, K. S. (1998). Caught in the net. Chichester: Wiley.
Young, K. S. (2004). Internet addiction: A new clinical phenomenon and its consequences.
American Behavioral Scientist, 48(4), 402–415.
Young, K. S., & Rogers, R. C. (1998). The relationship between depression and Internet
addiction. CyberPsychology and Behavior, 1(1), 25–28.
Yuen, N. C., & Lavin, M. J. (2004). Internet dependence in the collegiate population: The
role of shyness. CyberPsychology & Behavior, 7(4), 379–383.
Wagner, M. K. (2001). Behavioral characteristics related to substance abuse and risk-taking,
sensation-seeking, anxiety sensitivity, and self-reinforcement. Addictive Behavior, 26,
115–120.
Walther, J. (1999). Communication addiction disorder: Concern over media, behavior and
effects. Paper presented at the annual meeting of American Psychological Association,
Boston.
Widyanto, L., & McMurran, M. (2004). The psychometric properties of the Internet
Addiction Test. CyberPsychology & Behavior, 7(4), 443–450.
Winn, M. (1977). The plug-in drug: Television, children, and the family. New York: Viking.
Wober, M., & Gunter, B. (1982). Television and personal threat: Fact or artifact? A British
survey. British Journal of Social Psychology, 21, 239–247.
Zuckerman, M. (1979). Sensation-seeking: Beyond the optimal level of arousal. Hillsdale, NJ:
Erlbaum.
Zweben, J. E. (1987). Recovery-oriented psychotherapy: Facilitating the use of 12-step
programs. Journal of Substance Abuse Treatment, 19, 243–251.

About the Authors

Junghyun Kim (Ph.D., Michigan State University) is an Assistant Professor in the
School of Communication Studies at Kent State University. Her research interests
include Psychological and Social Impacts of New Media; Computer-Mediated
Communication.
Address: School of Communication Studies, Kent State University, P.O. Box 5190,
Kent, OH. 44242 Email: jkim23@kent.edu
Paul M. Haridakis (Ph.D., Kent State University) is an Associate Professor in the
School of Communication Studies at Kent State University. His research interests
include Media Uses and Effects; New Communication Technologies; Media Law,
Policy and Regulation; Freedom of Expression and media history.
Address: School of Communication Studies, Kent State University, P.O. Box 5190,
Kent, OH. 44242 Email: pharidak@kent.edu
Journal of Computer-Mediated Communication 14 (2009) 988–1015 © 2009 International Communication Association

INTERNET USAGE AMONG COLLEGE STUDENTS AND ITS
IMPACT ON DEPRESSION, SOCIAL ANXIETY, AND SOCIAL ENGAGEMENT

A Dissertation
Submitted to the School of Graduate Studies and Research
in Partial Fulfillment of the
Requirements for the Degree
Doctor of Psychology

Kimberlee D. DeRushia
Indiana University of Pennsylvania
May 2010

© 2010 by Kimberlee D. DeRushia
All Rights Reserved

Indiana University of Pennsylvania
The School of Graduate Studies and Research
Department of Psychology

We hereby approve the dissertation of

Kimberlee D. DeRushia, M.A.

Candidate for the degree of Doctor of Psychology

April 13, 2010

Signature on file
Kimberely J. Husenits, Psy.D.
Associate Professor of Psychology, Advisor

April 13, 2010

Signature on file
Beverly J. Goodwin, Ph.D.
Professor of Psychology

April 13, 2010

Signature on file
John A. Mills, Ph.D., ABPP
Professor of Psychology

ACCEPTED

Timothy P. Mack, Ph.D.
Dean
The School of Graduate Studies and Research

ABSTRACT
Title: Internet Usage among College Students and its Impact on Depression, Social
Anxiety, and Social Engagement
Author: Kimberlee D. DeRushia, M.A.
Dissertation Chair: Kimberely J. Husenits, Psy.D.
Dissertation Committee Members: Beverly J. Goodwin, Ph.D.
John A. Mills, Ph.D., ABPP

The Internet provides an opportunity for individuals to interact with friends
and family members, to research any topic they can imagine, and to explore the world
while sitting in the comfort of their own home. The popular media suggests that
Internet usage decreases the amount of social interaction individuals have with the
world outside of their computer and may be accompanied by social anxiety,
loneliness, lowered self-esteem, or chronic depression, and the psychological
literature's mixed findings on these topics have not helped clarify the issue. This
study looked at the impact that Internet usage has on an individual's psychological
well-being in an effort to clarify and expand on the previous research.
Participants in this study were undergraduates at a state university in rural
Pennsylvania. Participants were randomly selected through a psychology department
subject pool. They completed several psychological questionnaires and tracked their
Internet usage and social engagement for a seven-day period. Results indicated that
time spent on the Internet was not predictive of depression, social anxiety, or social
engagement in face-to-face relationships or online relationships. The type of activity
engaged in online was also not predictive of depression, social anxiety, or social
engagement in face-to-face relationships or online relationships. However, results
indicated that there was a significant difference in the way that participants responded
to measures of social anxiety when referencing face-to-face relationships as opposed
to online relationships. Limitations included not tracking participants' ethnicity, an
unequal distribution of gender across the sample, and a sample
restricted to undergraduate students in a rural setting. Based on these results, future
research would benefit from exploring differences in individuals' perceptions of
online relationships compared with face-to-face relationships, and from exploring
similar questions in non-college aged, ethnically diverse populations with gender
equally distributed across the sample.

ACKNOWLEDGMENTS
Although the final product of a dissertation has just one name on the front
cover, in my experience at least, it takes an entire village of people to bring one to
fruition. This dissertation was a herculean task that I occasionally considered
abandoning and if it had not been for the encouragement, support, and even once or
twice outright pushing, from Jamie Brass, Steven Behling, Jessica Buckland, Karen
Graves, Hey-Mi Ahn, Marc Palmer and my dad, John DeRushia, I'm not certain that I
would have ever finished. Thank you each for understanding the process and being
there when I needed someone to lean on.
It is also with deep gratitude that I thank the members of my committee,
Kimberely J. Husenits, Beverly J. Goodwin, and John A. Mills for their guidance,
patience and dedication throughout this entire process. Thank you Jennifer
Hambaugh, members of the Indiana University of Pennsylvania Applied Research
Lab and to Dana Reed at Student Voice, and Beverly Obitz in the School of Graduate
Studies and Research for helping with the technical aspects of this dissertation. Thank
you also to Nathaniel Mills, Jed Brubaker, and Daniel Lennen for your help with
analyzing my data and taking the time to remind me that the anxiety that comes from
a dissertation makes it easy to forget or over-think basic statistics. I would be remiss
if I didn't also thank my colleagues at University of the Pacific: Stacie Turks,
Charlene Patterson, Liz Thompson, and Kristina Dulcey-Wang for their never-ending
support and gentle prodding.
Finally, I want to thank my family for their encouragement, not only through
the dissertation process but throughout the totality of graduate school. Thank you to
my husband Jason Clark for your limitless patience, for being there on the worst of
days and the best of days, and for moving not once, but twice across the country in
order to support my dream. Thank you also to my son Sebastian Clark, you were too
young to know this, but on the days that it seemed darkest I would come home, hear
your giggles and see your smile, and be reminded that in the end, the journey was
worth the hardship.

vii

TABLE OF CONTENTS

Chapter 1: INTRODUCTION
    Statement of the Problem
    The Purpose of the Study

Chapter 2: REVIEW OF THE LITERATURE
    The Internet and Related Terms
    Gender Differences and the Internet
    Social Engagement and the Internet
    Social Anxiety and the Internet
    Depression and the Internet
    Hypotheses
        Hypothesis one
        Hypothesis two
        Hypothesis three

Chapter 3: METHOD
    Participants
    Materials
        Demographic questionnaire
        Measure of Internet usage
        Measures of social engagement
        Measure of depression
        Measure of social anxiety
    Procedures
        Selecting participants
        Phase one
        Phase two
        Phase three

Chapter 4: RESULTS
    Descriptive Statistics
    Internal Consistency of the Social Rhythm Metric
    Internet Use as a Predictor of Social Anxiety, Social Engagement, and Loneliness
    Social Activity on the Internet as a Predictor of Social Anxiety, Social Engagement, and Loneliness
    Internet Use and Social Activity on the Internet as a Predictor of Depression
    Gender Effects on Internet Use, Social Activity and Depression
    Perception of Loneliness and Social Anxiety with Face-to-Face and Online Relationships
    Comparison of Estimated Internet Usage with Actual Usage
    Summary of Results

Chapter 5: DISCUSSION
    Gender Differences and the Internet
    Social Engagement and the Internet
    Social Anxiety and the Internet
    Depression and the Internet

Chapter 6: CONCLUSION, LIMITATIONS AND RECOMMENDATIONS

REFERENCES

APPENDICES
    A. Informed Consent
    B. Demographics Questionnaire
    C. Part One: Internet Usage Tracking Chart
       Part Two: Internet Usage Follow-up Questions
    D. Social Rhythmic Metric
    E. UCLA Loneliness Scale (Version 3)
    F. Center for Epidemiologic Studies Depression Scale
    G. Brief Fear of Negative Evaluation, Revised
    H. Debriefing
    I. Campus and Community Resources

LIST OF TABLES

Table 1: Time Spent on the Internet and its Influence on Social Engagement, Social Anxiety, and Loneliness with Face-to-Face Relationships
Table 2: Time Spent on the Internet and its Influence on Social Anxiety, and Loneliness with Online Relationships
Table 3: Social Activity on the Internet and its Influence on Social Engagement, Social Anxiety, and Loneliness with Face-to-Face Relationships
Table 4: Social Activity on the Internet and its Influence on Social Anxiety, and Loneliness with Online Relationships
Table 5: Depression, Internet Use, and Types of Activities Engaged in Online
Table 6: Influence of Gender on Internet Use, Social Activity Online, and Depression
Table 7: Means of Responses for Loneliness and Social Anxiety Measures
Table 8: Paired Samples T-Tests for Social Anxiety and Loneliness Measures
Table 9: Means of Participant Time Spent Online

CHAPTER 1
INTRODUCTION
Statement of the Problem
The Internet has become an integral part of Western society, with
approximately 72.5% of the population of the United States using the Internet on a
regular basis (Internet World Stats, 2008). With only a click of the mouse, the
Internet allows individuals to learn information about almost any topic they care to
research, and to communicate with or learn about future romantic partners,
prospective employees, long-lost friends, or family members (Davis, 2007; Kraut et
al., 2002; Teske, 2002; White, 2007). The present study investigated the effect of
Internet use on social interaction with particular attention to the levels of social
anxiety and depression experienced by college students who engage in frequent, non-academic Internet use.
In 2005, the primary researcher noticed a social pattern reported by college
freshmen and sophomores presenting for therapy at a rural university counseling
center. In particular, these students frequently reported that they were more
comfortable talking to their friends using technology such as the Internet or text
messaging on their cell phones, than using traditional forms of communication such as face-to-face conversations or speaking on the telephone. Anecdotally, a particular client
reported that she frequently froze up and was unable to have an in-person
conversation with her male friends but had no difficulty talking with text via a
computer instant messaging program.

There is a paucity of psychological literature concerning college student use
of Internet social networking, and those studies that are available are
contradictory in nature (Brignall & Van Valey, 2005; Kraut et al., 1998, 2002; Odell,
Korgen, Schumacher & Delucchi, 2000; Ybarra, 2004). The popular media, which is
more consistent on the issue, repeatedly implies that Internet use impairs social
interaction and that increased use may even lead to chronic depression and clinical
levels of social anxiety in traditional social situations (CBS News, 2007; Fox News,
2007; Geldof, 2007; USA Today, 2007). The dissimilarity between these two bodies
of literature and the seeming confusion within empirical investigations of the topic
were in need of clarification. That is, is the increase in Internet use for social contact,
particularly among younger individuals, harmful? This is of particular concern
as a recent study reported that 89% of individuals between the ages of 18 and 24
residing in the United States engage in Internet use daily (Jones & Fox, 2009, p. 2).
However, it is unclear whether accessing one's social world online negatively impacts
one's face-to-face social relationships and mental health.
The Purpose of the Study
The purpose of the present study was to clarify these discrepant portrayals of
Internet use for social communication by exploring the impact of Internet use on
social engagement in a college-aged population with particular attention to symptoms
of social anxiety and depression. This study investigated three primary questions in
addressing the disparate portrayals of the effect of Internet based social interactions:
1) Can the amount of time spent and the level of social interaction for which a person
uses the Internet predict loneliness, level of social interaction, and social anxiety in
offline settings/face-to-face relationships, and loneliness and social anxiety in online
relationships; 2) Can the amount of time spent on the Internet, or the amount of social
interaction engaged in online, predict participants' reported levels of depression; and
3) Does the gender of the participant make any difference in the amount of time
spent on the Internet, their social interaction online, or their reported levels of
depression?

CHAPTER 2
REVIEW OF THE LITERATURE
The Internet and Related Terms
In 1995, the term "Internet" was officially defined as "the global information
system that is logically linked together by a globally unique address space based on
the Internet Protocol (IP), that is able to support communications using Transmission
Control Protocol/Internet Protocol (TCP/IP) and provides, uses or makes accessible,
either publicly or privately, high level services layered on the communications and
related infrastructure" (Federal Networking Council, 1995, p. 1). However, when
individuals talk about the Internet, they are typically referring to more than this
technical definition.
When individuals access the Internet they typically do it via the World Wide
Web (web). The web is actually a collection of electronic documents that are stored
on computers throughout the world (World Wide Web, 2002; Howe, 2007). Through
the use of a web browser these documents can be easily accessed by anyone who
knows what to look for and are frequently identified through the use of search engines
designed to access these documents based on key words (Search Engine, 2009). This
information can then be communicated to others through the use of email or instant
messaging/chat programs. Email is an electronic message that is sent and/or received
over a system that is designed specifically for the transmission of electronically
written messages between computers (Email, 2009; Howe, 2007). Due to its virtually
instantaneous delivery, email is a quick and easy form of communication that
individuals use for professional and personal reasons throughout their day.

Communication also happens on the Internet through instant messaging programs and
Internet Relay Chat (IRC). Instant messaging programs are designed to allow real
time conversation to occur between individuals who access the same service by
means of a program installed on their personal computers (Instant messaging, 2009;
Howe, 2007). Similar to instant messaging, IRC allows real time conversation to occur
between groups of individuals in locations typically referred to as chat rooms
through a worldwide network of computers (IRC, 2009; Howe, 2007). In the last
decade with the advent of social networking sites, a new form of communication has
emerged on the Internet. Social networking sites, such as Facebook or Twitter, are
typically websites designed to allow individuals to publish information about
themselves, with the intention of sharing that information with others in a way that
doesn't require direct conversation (Howe, 2007).
The Internet has expanded in ways that were not foreseeable at its inception.
As the tools that are used to access the Internet increase, so do the number of online
activities and the amount of time spent engaging in online activities. This is
particularly true for younger generations, as represented by the statistics presented in
a recent Pew Internet Survey that reported 83% to 87% of individuals ages 18 to 49
use the Internet compared with 65% of individuals age 50 to 64 and 32% of
individuals age 65+ (Pew Internet Tracking Survey, 2007a). The types of activities
that individuals report engaging in most often online are sending or reading email
(56%), searching for information (41%), getting news (37%), looking for information
on a hobby or other interest (29%), or browsing websites for fun (28%) (Pew Internet
Tracking Survey, 2007b). These statistics are particularly salient for younger
generations who have grown up with the Internet as part of their daily lives and
cannot imagine a time when constant contact with the world via the Internet did not
exist.
Gender Differences and the Internet
Although both genders reported equal use of the Internet in a Pew Internet
Tracking Survey (2007a), the psychological research of Internet usage presents mixed
results when looking at gender differences. An Odell, Korgen, Schumacher &
Delucchi (2000) study measured the responses of 843 students at five public
institutions and three private institutions to compare Internet usage and gender.
Participants were asked basic demographic questions, including major and year in
college, and Internet related questions including amount of access to the Internet
while growing up, how much time they currently spent on the Internet, and why they
accessed the Internet. The study reported that for public institutions, there were no
gender differences in the amount of time spent on the Internet, and that at private
institutions males spent significantly more time online than females (p = 0.019).
However, Odell and colleagues (2000) reported gender differences when examining
the specific activities or services accessed. Females spent significantly more time
checking email (p = 0.015), and conducting research for school (p = 0.002), while
their male counterparts spent significantly more time researching purchases (p =
0.002), visiting sex sites (p < 0.001), reading news (p < 0.001), playing games (p <
0.001), and listening to or downloading music (p < 0.001). A study by Sabrina Neu
(2009) looked at gender and perceptions of boredom, social interaction and social
anxiety among 200 college students ranging in age from 18 to 30 who reported
playing online multiuser games such as World of Warcraft. Participants completed
self-report measures online that measured levels of social interaction, social anxiety,
and boredom (Neu, 2009). Neu reported that in her study males spent significantly
more time playing online games than females (p = 0.05) but were not more likely to
report levels of boredom, social anxiety or decreased social interaction when
compared with females. Another study by Michele Ybarra (2004) reviewed the
information collected by the Youth Internet Safety Survey between September 1999
and February 2000. Of the 1,489 participants, 72% reported
experiencing at least one incident of online harassment, defined as feeling threatened
or embarrassed by others on the Internet (Ybarra, 2004). When Ybarra looked at
gender differences within the survey she found a correlation between depression and
Internet use among males, especially regarding online harassment but found no
correlation between harassment on the Internet and depression for females. Thus, an
increase in Internet use may be associated with gender differences in regard to both
symptoms of depression and the types of activities for which the Internet is used
(Neu, 2009; Ybarra, 2004).
Social Engagement and the Internet
Social engagement, as defined by this study, is the quality and number of
interactions that an individual has with others on a regular basis. These interactions
can be with family members, peers, and members of their social or personal
communities and have the result of forming a cohesive group that makes the
individuals feel a sense of belonging (Canadian Council on Social Development,
2006; Fiske, 2004, p. 460; Thibault & Kelley, 1986, p. 60; Watters, 2003, p. 104).

Over the last six years, social engagement has expanded to include the Internet
through the use of social networking (Sellers, 2006, para. 5). Social networking
online is typically accomplished through sites that allow individuals to search for
others that have the same interests, establish friendships, and reconnect with friends
from their past (Luo, 2007, para. 1). The impact of the Internet on social engagement
is frequently discussed in both popular media and in the psychological literature in
a negative light.
In the psychological literature one meta-analysis has posited that as
individuals become more accustomed to interacting through the Internet there will be
negative consequences on their ability to communicate appropriately in face-to-face
situations (Brignall & Van Valey, 2005). Additionally, a study that focused on
college students asked 649 men and 647 women about their Internet use and found
that the students who reported greater levels of Internet use also reported that, in
addition to a decrease in their amount of daily sleep (p = 0.05) and lower grades
academically (p = 0.05), they also perceived fewer opportunities to interact with
individuals in face-to-face situations (Anderson, 2001). Another study that focused on
adolescent use of the Internet asked 52 female high school seniors and 37 male high
school seniors to complete several self report measures concerning Internet use,
quality of relationships, and depression (Sanders, Field, Diego & Kaplan, 2000).
Sanders and colleagues (2000) found that higher levels of Internet use were
associated with declines in face-to-face relationships with both friends and mothers
when compared with adolescents that used the Internet less than one hour per day (p
= 0.01). Finally, a recent study that asked 300 participants of an online multiplayer
role playing game to complete measures of social engagement and social anxiety
found that individuals were likely to report that as a result of high levels of Internet
use they had missed meals, decreased their amount of sleep, were more likely to
argue with friends and/or family members and perceived that their face-to-face social
life had suffered as a result (Neu, 2009).
In addition to the negative effects of Internet use on social engagement, the
psychological literature on this topic has also found both neutral and positive results
concerning the impact of Internet usage on social engagement. In 1998, a
comprehensive study of the topic occurred at Carnegie Mellon University (Kraut et
al., 1998). These researchers conducted a longitudinal study that gave computers and
Internet access to 93 families (256 individuals) in the Pittsburgh, Pennsylvania area
who had not previously had such access. Participants completed measures of anxiety,
depression, and social activity before they were given Internet access and then again
after they had been given access. The study authors reported that higher amounts of
Internet usage were correlated with declines in communication, and with smaller
social networks (Kraut et al., 1998). However, in contrast to this earlier study, a
follow-up study conducted in 2002 by the same researchers with 208 of the original
participants found that there were no correlations between Internet usage,
communication, and social networks and attributed this change in findings potentially
to maturation in their participants over time or to the Internet changing to
be more socially inclined (Kraut et al., 2002, p. 69). Additionally, a study completed
by Eric Weiser (2000) had 140 males and 295 females from a student population (n =
134) and an online population (n = 301) complete several measures of well-being via
the World Wide Web (Weiser, 2000). Weiser found that when the Internet was used
primarily for social activities there was a decline in psychological well-being of the
individual and when it was used primarily for non-social activities it resulted in an
increase in psychological well-being (Weiser, 2000, p. 257). Conversely, other studies
investigating the effects of Internet use on communication and levels of social
interaction reported that college students who chatted anonymously on the Internet
over a period of four to eight weeks were more likely to report at the end of the study
that their perceptions of social support increased, and that individuals who used chat
rooms on a regular basis scored lower on measures of social fearfulness than non-chat
users (Campbell, Cumming, & Hughes, 2006; Shaw & Gant, 2002). A Madell and
Muncer (2007) study that focused on the use of communication and social
interactions reported that individuals preferred to use email and instant messaging
when communicating emotion-laden concerns in particular. Thus, the relational
consequences of Internet communication may differ by the type of conversation
facilitated.
Articles in the popular media frequently focus on the negative interactions that
are caused by use of the Internet. An example of this was seen on July 15, 2007 when
several articles were written in the popular media about a parenting couple from
Reno, Nevada who had neglected their children in order to play online games (CBS
News, 2007, para. 1; Fox News, 2007, para. 1; USA Today, 2007, para. 1). The
prosecutor in that case stated that the couple was too distracted by online games
to give their children proper care (USA Today, 2007, para. 4). The outcome of the
prosecution of this case has not been determined at this date. Similarly, a recent
editorial begins with "MySpace is ruining my social life" and continues to elucidate
the opening statement by detailing how the author no longer goes out with friends,
preferring instead to stay at home and improve her MySpace page (Geldof, 2007,
para. 1). An article in Time magazine in 2008 stated that the social aspects of the
Internet, namely the ability to comment on articles that are posted, result in
individuals being cruel and loathsome and posits that this is due to the illusion of
anonymity online and a general disregard of cultural restraints (Grossman, 2008, para.
2 and 3). In contrast to these media accounts is an editorial in Primary Psychiatry
which recommended that social networking sites be used to connect professionals in
healthcare fields in order to take advantage of the ways that these sites allow
individuals to interact with their peers and exchange information with ease (Luo,
2007).
The connection between Internet use and social engagement has received
mixed results in both the popular media and the psychological research. Some studies
have found that increased use of the Internet leads to a decrease in social engagement
(Anderson, 2001; Kraut et al., 1998), while others have found that increased use leads
to increased social engagement (Campbell et al., 2006; Kraut et al., 2002; Madell &
Muncer, 2007; Sanders et al., 2000; Shaw & Gant, 2002). This mixture of results may
be due to the relative lack of research literature and the instinctive response that
guides most popular media to suppose that increased use of the Internet would result
in decreased social engagement. Such common sense may not stand up to scrutiny
when compared with stringent research.


Social Anxiety and the Internet


Some writers have suggested that an increase in Internet use is associated with
symptomatology consistent with social anxiety (Amichai-Hamburger, Wainapel, &
Fox, 2002; Caplan, 2007). Social anxiety is characterized by fear of social situations
that could lead to intense social scrutiny if the individual behaves in a manner that is
humiliating or embarrassing (American Psychiatric Association, 2000). Prevalence
rates of social anxiety reportedly range from 3% to 13%, with most individuals
reporting social anxiety in situations that require public speaking or meeting new
people (American Psychiatric Association, 2000; Turk, Heimberg, & Hope, 2001).
The popular media frequently implies that the Internet is useful for individuals
with social anxiety because it gives individuals experiencing social anxiety a place to
practice social skills and increase confidence (Cuncic, 2009, para. 3) and it's a safe
place to form new friendships without the pressure of immediately responding to
social cues (Ayushveda, 2008, para. 6; Sorryforsilence, 2009). A study that
investigated individuals who experience social anxiety symptoms reported that
individuals who attained higher scores of social anxiety were less likely to spend time
online than those with lower levels of social anxiety (Madell & Muncer, 2006).
Similarly, another study reported that the amount of time spent in chat rooms did not
have an impact on the levels of anxiety reported by participants, but that those who
participated tended overall to be less socially anxious than those who did not spend
time in chat rooms on the Internet (Campbell et al., 2006). A similar finding of no
significant effect for social anxiety and Internet use was also seen in a study that
looked at online game playing and the self-reported levels of social anxiety (Neu,
2009). Taken together, these studies suggest that socially anxious individuals do not
use the Internet for interpersonal communication as is assumed in the popular media.
Conversely, a study investigating participants' ability to express their real self in a
social environment reported that high scores on measures of introversion and
neuroticism were associated with a greater comfort being their real self on the
Internet, whereas high scores on extroversion and low scores on neuroticism
were associated with being more comfortable in face-to-face social situations
(Amichai-Hamburger et al., 2002). This finding concerning introversion was also
reported in a study conducted by Scott Caplan (2007) who reported that high social
anxiety was predictive of individual preference for online social interaction over face-to-face social interaction.
Intuitively it makes sense that individuals who experience anxiety in social
situations would be more comfortable on the Internet where the perception of
anonymity allows individuals to present only what they want others to see. However,
given the psychological research, it remains to be seen if this intuitive reaction
concerning social anxiety is something that can be adequately measured.
Depression and the Internet
Depression is one of the most common mental health disorders and is
diagnosed when individuals experience a depressed mood most of the day, show a
diminished interest in pleasurable activities, report changes in appetite, and in levels
of concentration, and have feelings of worthlessness or guilt (American Psychiatric
Association, 2000; Young, Weinberger, & Beck, 2001). The prevalence of Major
Depressive Disorder (which requires the presence of at least one episode of
depression) reportedly ranges from 10% to 25% in females and from 5% to 12% in
males (American Psychiatric Association, 2000).
A study conducted at Carnegie Mellon in 1998 originally reported that
increased Internet use was correlated with an increase in reports of loneliness and
depression; however, the follow-up study conducted 4 years later found that there was
no correlation between Internet use and depression and is consistent with a study that
found no link between adolescent use of the Internet and levels of depression (Kraut
et al., 1998, 2002; Sanders et al., 2000). Furthermore, a study on the relationship
between Internet communication and depression reported that over the course of four
to eight weeks, college students chatting anonymously on the Internet were more
likely to report fewer feelings of loneliness and depression than they had before the
study began (Shaw & Gant, 2002). Based on the scant psychological literature, it
appears that the amount of time spent online does not impact levels of depression but
that there are other aspects of Internet use that may play a role. A Morgan and Cotton
(2003) study found that the type of activity engaged in on the Internet was implicated
in levels of depression among college students, and that when the Internet was
utilized for communication, levels of depressive symptoms decreased, particularly for
male respondents. However, when the Internet was utilized for non-communication
oriented activities such as shopping or research, levels of depressive symptoms
increased (Morgan & Cotton, 2003). Another study reported that, rather than the type
of activity, or the amount of time spent on the Internet, depressive symptoms were
eight times more likely to be reported by males who also reported experiencing
harassment on the Internet (Ybarra, 2004). Finally, a Campbell, Cumming & Hughes
(2006) study indicated that depressive symptoms were associated simply with
frequent Internet use, regardless of the amount of time or activity, suggesting that
those who reported spending time on the Internet were more likely to report
depressive symptoms than those who report not spending time online.
The psychological literature is scant on the topic of depression and Internet
use and, as with social engagement and social anxiety, the literature that does exist is
contradictory in nature. The overall consensus is that an increase in Internet use is not
implicated in an increase in levels of depression. In fact, the source of any increase in
reported depressive symptoms associated with Internet use is currently undetermined, with some
studies implicating gender, others implicating the type of activity engaged in online
and still others stating that it's simply that chronically depressed individuals are more
prone to using the Internet than non-depressed individuals (Campbell et al., 2006;
Morgan & Cotton, 2003; Ybarra, 2004).
Hypotheses
This study investigated three primary questions to address this topic: 1) Can
time spent and amount of social interaction online predict loneliness and social
anxiety in face-to-face settings, loneliness and social anxiety in online settings, and
social interaction in face-to-face settings; 2) Can Internet use, or social interaction
online, predict participants' levels of depression; and 3) Does gender influence the
amount of time spent online, the type of activities accessed online, or participants'
level of depression?


Hypothesis one. To address question one, Hypothesis 1a posits that Internet
use will predict a significant amount of the variance in participants' loneliness and
social anxiety in face-to-face settings, loneliness and social anxiety in online settings,
and social interaction in face-to-face settings. Internet use will be positively related to
loneliness and social anxiety in face-to-face settings; higher levels of Internet use will
be associated with increased social anxiety in face-to-face settings, and increased
loneliness in face-to-face settings. Internet use will be negatively related to loneliness
in online settings, social anxiety in online settings, and
social interaction in face-to-face settings. Hypothesis 1b posits that the
amount of social interaction online will significantly increase the amount of variance explained
in participants' loneliness and social anxiety in face-to-face settings, loneliness and
social anxiety in online settings, and social interaction in face-to-face settings.
Socially oriented Internet activities will be negatively related to social engagement in
face-to-face settings, loneliness in online settings, and social anxiety in online
settings. Socially oriented Internet activities will be positively related to social
anxiety in face-to-face settings, and loneliness in face-to-face settings.
Hypothesis two. To address question two, Hypothesis 2a posits that Internet
use will predict a significant amount of the variance in participants' levels of
depression. Internet use will be positively related to depression; higher levels of
Internet use will predict higher levels of depression. Hypothesis 2b posits that the
amount of social interaction online will predict a significant amount of the variance
explained in participants' levels of depression. The amount of social interaction
online will be negatively related to depression, with higher levels of social interaction
online predictive of lower levels of depression.
Hypothesis three. To address question three, Hypothesis 3a posits that there
will be gender differences in the amount of time individuals spend on the Internet.
Men will spend more time than women on the Internet. Hypothesis 3b posits that
there will be gender differences in the amount of social interaction online. Women
will spend more time engaging in social activities online than men. Hypothesis 3c
posits that there will be gender differences in the level of depression reported by
participants. Men will demonstrate higher levels of depression than women.

CHAPTER 3
METHOD
Participants
Sixty-eight female and 31 male undergraduate students attending a state
university located in rural Pennsylvania served as participants in the current study.
These participants had enrolled in the Psychology Department's subject pool to fulfill
their general psychology course research requirement. All participants were
randomly selected by the subject pool coordinator and were subsequently emailed an
initial request to participate and sent a second email invitation to participate if they
did not respond to the first request. Students who did not respond to either email
request were invited to participate via a subsequent telephone contact. All
participants were informed of the nature of the study and the time commitment
expected when invited to participate. The names of students who declined
participation were returned to the subject pool.
Participants were required to sign an informed consent form (Appendix A) by
which they were again informed of the time commitment and given the opportunity to
opt out of the study. Of the initial 150 students contacted for participation, 138
initially chose to participate in this study and 99 students completed all three phases
of the study.
Materials
Six measures were used in this study: an experimenter-developed
demographic questionnaire (Appendix B), an experimenter-developed self-report
measure of Internet usage (Appendix C), two measures of social engagement: one
that measured day-to-day social interactions (Appendix D) and one that measured
perceptions of loneliness (Appendix E), one questionnaire concerning social anxiety
symptoms (Appendix F), and one questionnaire measuring symptoms of depression
(Appendix G).
Demographic questionnaire. The demographic questionnaire (Appendix B)
consisted of 10 questions that included participants' current academic standing,
gender, family income and parental levels of education (Braveman, Cubbin, Marchi,
Egerter, & Chavez, 2001). This questionnaire also assessed participants' current
ability to access the Internet and the typical locations of their access. Additionally, the
demographic questionnaire asked participants to list the three most important
activities in which they engage on the Internet.
Measure of Internet usage. The Internet Usage Tracking Chart (Appendix C,
parts 1 and 2) consisted of a grid designed to allow participants to quickly check off
the hours they engaged in Internet usage in a 24-hour period. Individuals were
instructed to round off times of use to the nearest hour and enter their responses into
an online computer database they were instructed to access each evening from a
personal computer. After tracking their Internet use for one week, participants were
given a series of questions that required them to estimate the amount of time they
spent studying and using the Internet, and to rank-order 13 potential activities (e.g.,
email, social networking, gambling, etc.) in which they engaged while online. This
rank-order list was then used to determine if the type of Internet activities accessed by
each participant was of a social or solitary nature by assigning each item a social or
non-social value and weighting the value based on the rank assigned by the
participant.
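The exact formula used to weight the social and non-social values by rank is not reported here. As a minimal sketch only, assuming a binary social coding and linear rank weights (both hypothetical and not taken from the original procedure), the scoring could look like the following. The function name, the activities coded as social, and the weighting scheme are illustrative assumptions.

```python
# Hypothetical sketch: the weighting scheme below is an assumption,
# not the procedure used in the study.
def social_activity_score(ranked_activities, social_activities):
    """Return the share of rank weight that falls on social activities (0.0 to 1.0).

    ranked_activities: activity names ordered from most used to least used.
    social_activities: set of activity names coded as social (assumed coding).
    """
    n = len(ranked_activities)
    if n == 0:
        raise ValueError("at least one ranked activity is required")
    # Assumed linear weights: rank 1 receives weight n, the last rank receives 1.
    weights = [n - position for position in range(n)]
    social_weight = sum(
        weight
        for weight, name in zip(weights, ranked_activities)
        if name in social_activities
    )
    return social_weight / sum(weights)


# Illustrative ranking using three of the example activities named in the text.
ranking = ["email", "social networking", "gambling"]
score = social_activity_score(ranking, {"email", "social networking"})
print(round(score, 3))
```

Under these assumptions a score near 1 would indicate predominantly social use and a score near 0 predominantly solitary use; any other monotone weighting could be substituted without changing the overall approach.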
Measures of social engagement. The Social Rhythm Metric (SRM)
(Appendix D) consists of 17 events that occur in an individual's life over the course
of a day, and was designed to assess social support and social networks of an
individual. Participants keep track of when each activity occurred, who was present
during the activity, and their own level of involvement. Individuals were asked to
manually track these 17 activities and enter them into an online computer database
each evening from a personal computer. These items include when participants get
out of bed each morning, when they have meals and when they participate in
activities such as school, exercise, or watching television. For each item the
participant is asked to enter the time the item was completed, whether or not they
were alone at the time, and, if others were present, whether they were just present
or actively involved. The SRM is calculated using an algorithm found in Monk,
Kupfer, Frank, & Ritenour (1990) and several indices can be calculated including
active social engagement, and minimal to no social engagement (Carney, Edinger,
Meyer, Lindman & Istre, 2006). The test-retest reliability for the SRM is moderate
with a significant correlation between week 1 and week 2 (rho=0.60, p < 0.001)
(Monk, Petrie, Hayes & Kupfer, 1994). Additionally the SRM has been described as a
valid instrument by several studies and in a personal communication by the creator of
the measure (Haynes, Ancoli-Israel, & McQuaid, 2005; Meyer & Maier, 2005; T.H.
Monk, personal communication, July 23, 2009; Monk, et al., 1994; Monk, Frank,
Potts, & Kupfer, 2002; Monk, Kupfer, Frank, & Ritenour, 1990).

The UCLA Loneliness Scale, Version 3 (Appendix E) is also a measure of
social engagement. It consists of 20 questions that are answered on a 4-point Likert
scale ranging from 1 (never) to 4 (often). Questions address how the individual feels
in regard to companionship. Scores range from 20 to 80 with higher scores
indicating greater degrees of loneliness. Cronbach's α for the UCLA Loneliness
Scale, Version 3 ranges from 0.89 to 0.94, and the scale has a test-retest reliability of 0.73 over a
1-year period (Russell, 1996).
Measure of depression. The Center for Epidemiological Studies Depression
Scale (CES-D) (Appendix F) measures levels of depression in a general population
(Radloff, 1977). It consists of 20 questions rated on a 4-point Likert scale that ranges
from 0 (rarely or none of the time) to 3 (most or all of the time). Possible scores range
from 0 to 60 with higher scores indicating greater levels of depression symptoms.
Internal consistency for the general population is in the good range with a Cronbach's α
of 0.85 (Hann, Winter, & Jacobsen, 1999).
Measure of social anxiety. The Brief Fear of Negative Evaluation, Revised
(BFNE-II) (Appendix G) is a measure of social anxiety that consists of 12 questions
answered on a 5-point Likert scale. Responses range from 0 (not at all characteristic
of me) to 4 (extremely characteristic of me). Scores range from 0 to 48 with higher
scores reflecting greater levels of social anxiety (Carleton, McCreary, Norton, &
Asmundson, 2006). Internal consistency for this measure is in the excellent range
with item coefficients between 0.94 and 0.95 and an overall Cronbach's α of 0.95
(Carleton et al., 2006).
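The total-score ranges reported for the three questionnaires follow directly from the number of items and the lowest and highest response values. The sketch below is illustrative only: the item anchors are chosen so that they reproduce the reported ranges (20 to 80, 0 to 60, and 0 to 48), the function names are hypothetical, and reverse-keyed items, which some of these instruments contain, are deliberately ignored.

```python
# Illustrative sketch; reverse-keyed items are not handled here.
# Each entry: (number of items, lowest response value, highest response value).
SCALES = {
    "UCLA-3": (20, 1, 4),
    "CES-D": (20, 0, 3),
    "BFNE-II": (12, 0, 4),
}


def score_range(n_items, low, high):
    """Return the minimum and maximum possible total score for a summed scale."""
    return n_items * low, n_items * high


def score_scale(responses, n_items, low, high):
    """Sum item responses after checking the item count and response range."""
    if len(responses) != n_items:
        raise ValueError(f"expected {n_items} responses, got {len(responses)}")
    if any(r < low or r > high for r in responses):
        raise ValueError(f"responses must lie between {low} and {high}")
    return sum(responses)


for name, spec in SCALES.items():
    print(name, score_range(*spec))
```

Validating the response range before summing guards against data-entry errors, which matters when participants enter responses online without supervision.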

Procedures
Selecting participants. All participants were randomly selected by the subject
pool coordinator and contacted via email or telephone to request their participation in
this study. All participants were informed of the nature of the study and the time
commitment involved at the time of first contact and given the opportunity to decline
participation. Students electing to participate were met by an assistant experimenter
who explained the time requirements of the study and again gave participants the
chance to decline participation. Those who elected to participate were required to sign
an informed consent form (Appendix A). Participants were informed that the
researcher was looking for possible connections between Internet usage,
psychological well-being, and relationships. No deception was used during this study.
Additionally, participants were given a resource sheet for campus and community
referrals (Appendix I) as a precaution should they experience feelings of concern
when completing the study measures.
Phase one. After signing the informed consent form, participants were
directed to a university computer with Internet access where they completed the
demographic questionnaire and the measures of depression (CES-D), social anxiety
(BFNE-II), and one of the social engagement measures (UCLA Loneliness scale).
Participants were asked to complete the BFNE-II and UCLA Loneliness Scale twice.
The first time they completed these two measures they were asked to focus on
face-to-face relationships; the second time the focus was on online relationships.
Participants were asked to consider face-to-face and online relationships separately in
order to determine if there was a difference in their perception of experienced anxiety
or loneliness based on the population with which the participant was interacting.
Participants completed this first phase of the study in approximately 30 minutes.
Phase two. After completing these psychological measures, participants were
given verbal directions for tracking their Internet use and daily social interactions.
Additionally, they were instructed in how they were to enter their Internet use and
social interactions online using their personal computers. Participants were also given
paper copies of the measures to aid in their ability to keep track of their interactions
while not at a computer. Finally, an email reminder was sent from the Applied
Research Lab, a campus department devoted to assisting in research, to participants
each day for seven days to prompt participants to respond. This email reminder was
based on the email contact address provided by the subject and was not tied to
specific results in order to protect confidentiality of responses. It is estimated that this
aspect of the study took approximately 15 minutes each evening for the course of
seven days.
Phase three. At the end of seven days, participants were sent an email with a
link to access the final part of the study, a questionnaire (Appendix C, part 2) that
asked participants to estimate the amount of time they spent studying and using the
Internet, and to rank-order 13 potential activities in which they engaged while online.
After answering these questions, participants were thanked and debriefed (Appendix
H) online and provided with the experimenter's contact information should they wish
to receive the results of the study. Additionally, participants were again provided with
a copy of local community and campus resources (Appendix I) to access if they felt
concerned about any of the information that they were prompted to think about over
the course of this study.

CHAPTER 4
RESULTS
Descriptive Statistics
Out of the 150 individuals that were originally approached to participate in
this study, 12 declined to participate after being informed of the time commitment for
this study. Of the remaining 138 individuals, 99 successfully completed all three
phases of the study and were included for analysis. Of the 99 participant scores
included in the analyses, 31 (31.3%) were male and 68 (68.7%) were female. A
chi-square test of goodness-of-fit was performed to determine if the difference in group
size by sex of participant was statistically significant. Sex was not equally distributed
across the sample, χ² (1, n = 99) = 13.828, p < 0.001. This means that possible
gender effects may not have been detected due to the difference in group sizes.
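The reported chi-square statistic can be reproduced from the two group counts alone. The following is a minimal sketch, assuming equal expected frequencies for the two groups; the function names are illustrative, and the p-value helper applies only to one degree of freedom.

```python
import math


def chi_square_gof_equal(observed):
    """Chi-square goodness-of-fit statistic against equal expected counts."""
    expected = sum(observed) / len(observed)
    return sum((count - expected) ** 2 / expected for count in observed)


def chi_square_p_value_df1(statistic):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom."""
    # With 1 df the statistic is a squared standard normal deviate.
    return math.erfc(math.sqrt(statistic / 2))


statistic = chi_square_gof_equal([31, 68])  # 31 male, 68 female participants
p_value = chi_square_p_value_df1(statistic)
print(f"chi-square = {statistic:.3f}")  # chi-square = 13.828
print(f"p < 0.001: {p_value < 0.001}")
```

With an expected count of 49.5 per group, each group contributes 18.5² / 49.5 to the statistic, which gives the reported value of 13.828.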
The majority of the sample was comprised of college freshmen, with 87
(87.9%) of the participants in their first year of college at the time of this study, nine
(9.1%) were sophomores, two (2%) were juniors, and one student (1%) reported
being a continuing education student. Participants reported that they were in 43
different majors, with 16 (16.2%) listing their major as undecided. The majority of
participants with chosen majors were in the college of Health and Human Services
(24.2%), with 18.2% in the college of Natural Sciences and Mathematics, 15.2% in
the college of Business and Information Technology, 15.2% in the college of
Education and Education Technology, 10.1% in the college of Humanities and Social
Sciences, and 1% in the college of Fine Arts. All participants were enrolled in an
undergraduate general psychology course at the time of this study.

Internal Consistency of the Social Rhythm Metric


All of the previously published measures used in this study, with the
exception of the Social Rhythm Metric (SRM), had internal consistency, in the
form of Cronbach's α, reported in previous research. Thus, the first analysis conducted
for this study was determining Cronbach's α for the SRM for this population. One
hundred and fourteen of the original 138 participants completed the SRM and so
analysis for this statistic was completed on this larger population rather than on the 99
who had completed all three phases of the study. The obtained internal consistency of
the SRM for this population was found to be in the good range (α = 0.879, n = 114).
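For readers unfamiliar with the statistic, Cronbach's α compares the sum of the item variances with the variance of the total score. The sketch below shows the standard formula on made-up toy data; it is not the study's data, and how the SRM items were coded before the coefficient was computed is not specified here, so the input layout is an assumption.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    n_items = len(rows[0])
    if n_items < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two respondents")
    if any(len(row) != n_items for row in rows):
        raise ValueError("every respondent must answer the same number of items")
    # Sample variance of each item across respondents.
    item_variances = [variance([row[i] for row in rows]) for i in range(n_items)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in rows])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Toy data (invented for illustration): three respondents, three items that
# agree perfectly, so alpha reaches its maximum of 1.
toy_rows = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
alpha = cronbach_alpha(toy_rows)
print(round(alpha, 3))
```

Values closer to 1 indicate that the items vary together, which is the sense in which a coefficient of 0.879 is described as good.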
Internet Use as a Predictor of Social Anxiety, Social Engagement, and Loneliness
This study hypothesized that the amount of time participants spent on the
Internet would predict their reported loneliness, social anxiety and social engagement
scores in both offline settings (e.g., face-to-face relationships) and in online settings
(e.g., online relationships). Two separate linear regressions were performed to test the
hypothesis, one testing this relationship between participants' loneliness, social
anxiety and social engagement in offline settings and one testing the hypothesized
relationship in online settings. The first model produced an R² of 0.040, F(3,98) =
1.303, p = 0.278 and did not support the hypothesis for offline settings since no
relationship between time spent on the Internet and participants' scores on measures
of social anxiety, social engagement or loneliness was found. Table 1 reports the
results of the first of these analyses.

Table 1
Time Spent on the Internet and its Influence on Social Engagement, Social Anxiety,
and Loneliness with Face-to-Face Relationships
Measure                           R²       B        SE       β        p
SRM                               0.040    0.142    0.255    0.57     0.578
BFNE-II: Offline Relationships             0.020    0.020    0.112    0.322
UCLA-3: Offline Relationships              0.024    0.023    0.116    0.302

The second linear regression similarly revealed no support for the hypothesis that
participants' loneliness, social engagement and social anxiety in online settings were a
function of the amount of time they spent on the Internet. This model produced an R²
of 0.028, F(2,98) = 1.393, p = 0.253 and is displayed in Table 2.

Table 2
Time Spent on the Internet and its Influence on Social Anxiety, and Loneliness with
Online Relationships
Measure                           R²       B        SE       β        p
BFNE-II: Online Relationships     0.028    0.021    0.021    0.107    0.301
UCLA-3: Online Relationships               0.021    0.020    0.109    0.293

These analyses indicated that the amount of time spent on the Internet was not predictive
of subjects' reported social anxiety, social engagement and loneliness in either offline
or online settings.

Social Activity on the Internet as a Predictor of Social Anxiety, Social
Engagement, and Loneliness
A second hypothesis forecast that social activity on the Internet would predict
subjects' reported loneliness, social anxiety and social engagement in both offline
settings and online settings. Two linear regressions were performed to test this
hypothesis. The model testing the relationship in offline settings produced an R² of
0.043, F(3,98) = 1.408, p = 0.245, revealing no support for this prediction. The model
is shown in Table 3.
Table 3
Social Activity on the Internet and its Influence on Social Engagement, Social
Anxiety, and Loneliness with Face-to-Face Relationships
Measure                           R²       B        SE       β        p
SRM                               0.043    -0.181   0.092    -0.199   0.053
BFNE-II: Offline Relationships             -0.003   0.007    -0.042   0.705
UCLA-3: Offline Relationships              -0.003   0.008    -0.034   0.763

A second regression testing this hypothesis for online settings was performed and
similarly did not support this prediction. This model produced an R² of 0.019,
F(2,98) = 0.916, p = 0.404 and is displayed in Table 4.

Table 4
Social Activity on the Internet and its Influence on Social Anxiety, and Loneliness
with Online Relationships
Measure                           R²       B        SE       β        p
BFNE-II: Online Relationships     0.019    -0.002   0.007    -0.021   0.838
UCLA-3: Online Relationships               -0.009   0.007    -0.131   0.209

These analyses indicated that the type of activity engaged in while on the Internet was not
predictive of subjects' reported social anxiety, social engagement and loneliness in
either offline or online settings.
Internet Use and Social Activity on the Internet as a Predictor of
Depression
A series of linear regressions was used to investigate the relationship
between depression, Internet use, and social activity on the Internet. The first
regression was performed to predict participants' depression as a function of the
amount of time they spent on the Internet. This model produced an R² of 0.007,
F(1,98) = 0.682, p = 0.411. A second linear regression was performed to predict
participants' depression as a function of the type of activities in which they engaged
while using the Internet. This model produced an R² of 0.002, F(1,98) = 0.198, p =
0.657. Neither analysis supported the hypotheses that time or activity was linked to
participants' scores on a measure of their reported depression. Both models are
shown in Table 5.

Table 5
Depression, Internet Use, and Types of Activities Engaged in Online
Predictor                                  R²       B        SE       β        p
Internet Use (time spent online)           0.007    0.338    0.409    0.084    0.411
Internet Activity (social v. non-social)   0.002    0.504    1.131    0.045    0.657
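Because each model in Table 5 has a single predictor, its entries can be cross-checked against one another: the F statistic equals the squared ratio of B to its standard error, and R² equals the squared standardized coefficient. The sketch below applies these identities to the tabled values; the function names are illustrative, and small discrepancies are expected because the tabled values are rounded.

```python
def f_from_slope(b, se):
    """F statistic of a one-predictor regression, computed as t squared."""
    return (b / se) ** 2


def r_squared_from_beta(beta):
    """R squared of a one-predictor regression, computed as beta squared."""
    return beta ** 2


# Tabled values: (B, SE, beta) for each single-predictor model.
models = {
    "Internet Use": (0.338, 0.409, 0.084),
    "Internet Activity": (0.504, 1.131, 0.045),
}

for name, (b, se, beta) in models.items():
    print(name, round(f_from_slope(b, se), 2), round(r_squared_from_beta(beta), 3))
```

Both checks agree with the reported F values (0.682 and 0.198) and R² values (0.007 and 0.002) to within rounding.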

Gender Effects on Internet Use, Social Activity and Depression


A series of one-way between-groups analyses of variance was performed to
detect if the sex of the participant influenced the amount of time individuals spent on
the Internet, the amount of social activity individuals engaged in on the Internet, and
the level of depression individuals reported during Phase One of the study. As shown
in Table 6, these analyses of variance showed no effect of sex on the amount of time
spent on the Internet, F(1,97) = 0.080, p = 0.778, or the amount of social activity
engaged in on the Internet, F(1,97) = 0.510, p = 0.477, and sex of participant did not
significantly impact the level of depression reported, F(1,97) = 3.561, p = 0.062.
Table 6
Influence of Gender on Internet Use, Social Activity Online, and Depression
Variable                    Group             SS          df    MS         F        p
Daily Internet Use          Between Groups    0.424       1     0.424      0.080    0.778
                            Within Groups     516.369     97    5.323
                            Total             516.793     98
Social Interaction Online   Between Groups    0.355       1     0.355      0.510    0.477
                            Within Groups     67.484      97    0.696
                            Total             67.838      98
Depression (CES-D)          Between Groups    298.675     1     298.675    3.561    0.062
                            Within Groups     8135.164    97    83.868
                            Total             8433.838    98
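Each F ratio in Table 6 is the between-groups mean square divided by the within-groups mean square. The sketch below recomputes the three ratios from the tabled sums of squares. It assumes one between-groups degree of freedom (two groups, consistent with the between-groups SS and MS being equal) and 97 within-groups degrees of freedom; the function name is illustrative.

```python
def anova_f(ss_between, df_between, ss_within, df_within):
    """One-way ANOVA F ratio: mean square between over mean square within."""
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between / ms_within


# Tabled sums of squares: (SS between, SS within) for each outcome.
outcomes = {
    "Daily Internet Use": (0.424, 516.369),
    "Social Interaction Online": (0.355, 67.484),
    "Depression (CES-D)": (298.675, 8135.164),
}

f_values = {
    name: anova_f(ss_b, 1, ss_w, 97) for name, (ss_b, ss_w) in outcomes.items()
}
for name, f in f_values.items():
    print(f"{name}: F(1, 97) = {f:.3f}")
```

The recomputed ratios match the tabled F values of 0.080, 0.510, and 3.561.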

Perception of Loneliness and Social Anxiety with Face-to-Face and Online
Relationships
After preliminary analysis, additional analysis was done to detect differences
between participants' perception of loneliness and social anxiety within face-to-face
relationships and their perception of loneliness and social anxiety within online
relationships. Table 7 includes the means for participants' responses on the social
anxiety and loneliness measures for both offline and online relationships.
Table 7
Means of Responses for Loneliness and Social Anxiety Measures
Measure                     Relationship Type    Mean     SD        SE
Social Anxiety (BFNE-II)    Offline              21.26    12.811    1.288
                            Online               13.19    11.485    1.154
Loneliness (UCLA)           Offline              37.37    11.109    1.116
                            Online               39.05    11.807    1.187
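The standard errors in Table 7 are consistent with the usual standard error of the mean, the standard deviation divided by the square root of the sample size. The short sketch below recomputes them, assuming all 99 participants contributed to each mean; the function name is illustrative.

```python
import math


def standard_error(sd, n):
    """Standard error of the mean from a standard deviation and sample size."""
    return sd / math.sqrt(n)


# Tabled standard deviations, keyed by (measure, relationship type).
standard_deviations = {
    ("BFNE-II", "Offline"): 12.811,
    ("BFNE-II", "Online"): 11.485,
    ("UCLA", "Offline"): 11.109,
    ("UCLA", "Online"): 11.807,
}

standard_errors = {
    key: standard_error(sd, 99) for key, sd in standard_deviations.items()
}
for key, se in standard_errors.items():
    print(key, round(se, 3))
```

The paired-samples t statistics cannot be recovered this way, because they depend on the standard deviation of the within-person differences, which is not tabled.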

A paired-samples t-test was conducted to compare loneliness in offline relationships
and loneliness in online relationships. No significant difference in the scores for
face-to-face relationships and online relationships, t(98) = -1.827, p = 0.071, was
found. This indicates that participants did not perceive a difference in their feelings of
loneliness based on the type of relationship (i.e., offline v. online) in which they
were interacting. A paired-samples t-test was also conducted to compare social
anxiety in face-to-face relationships and social anxiety in online relationships. A
significant difference between scores for offline relationships and online
relationships, t(98) = 7.319, p < 0.001, was found. This indicates that participants
perceived significantly more social anxiety when interacting with others in
face-to-face relationships than when socializing in online formats. Results from both the
social anxiety and loneliness paired-samples t-tests can be found in Table 8.
Table 8
Paired Samples T-Tests for Social Anxiety and Loneliness Measures
Measure                     t         df    p
Social Anxiety (BFNE-II)    7.319     98    < 0.001
Loneliness (UCLA)           -1.827    98    0.071

Comparison of Estimated Internet Usage with Actual Usage


Participants were asked in Phase Three to estimate the amount of time they
spent online. This estimation was then compared with the amount of time they had
entered each day to determine if there were any significant differences between their
estimated use and their actual use. Means of the average amount of time spent daily
on the Internet can be found in Table 9. No significant differences were found
between participants' estimated use of the Internet and their actual use as tracked on
the daily self-report measure, t(98) = 1.424, p = 0.157. This indicates that the
self-report measure of time spent online was not significantly different from the amount of
time that participants perceive they are using the Internet. Additionally, this suggests
the time participants spent on the Internet was not significantly impacted by having
participants track their time in one-hour segments.

Table 9
Means of Participant Time Spent Online
Measure                  Mean     N     SD       SE
Estimated Daily Hours    4.491    99    4.733    0.476
Tracked Daily Hours      3.911    99    2.296    0.230

Summary of Results
Overall, the analyses used to test the three primary hypotheses did not lend
support to the predictions. No significant results were found for
predicting scores on the measures of social engagement, social anxiety, or depression
based on time spent on the Internet or amount of social activity engaged in while
online. There was, however, the significant finding that participants reported greater
levels of social anxiety when referring to their face-to-face relationships as
opposed to their online relationships even though the time and activity online did not
predict their overall level of social anxiety.

CHAPTER 5
DISCUSSION
The intention of this study was to clarify discrepant portrayals of Internet use
for social interaction by exploring the impact of Internet use on social engagement in
offline and online settings in a college-aged population with particular attention to
symptoms of social anxiety and depression. Sixty-eight female and 31 male
undergraduate college students spanning 43 different majors served as participants in
this study.
Gender Differences and the Internet
Previous research reported differences in the way the gender of participants
influenced one's interactions with the Internet (Neu, 2009; Ybarra, 2004). Based on
this literature, it was hypothesized that the sex of the participant would result in a
difference in either the amount of time spent on the Internet or in the types of
activities (e.g., social or non-social) in which they engaged while online. It was also
hypothesized that males who spent more time on the Internet would report higher
levels of depression than would females. Contrary to this hypothesis, the
analysis found no differences in the amount of time spent on the Internet, the
type of activities accessed while online, or reported levels of depression between
male and female participants. However, it is important to note that although the
researcher attempted to have an equal number of male and female participants, a
large majority of the participants were female. The fact that a focus was
placed on having an equal number of male and female participants and the sample
still was disproportionately female may point to some effects of gender that are not
visible and thus not measurable. One possible explanation is that men declined to participate in
this study because of the open nature of what was being measured and because they did not
want to report the types of activities they engage in online. Due to the low number of
male participants, it is possible that differences exist that could not be detected
by these analyses as a result of the low statistical power.
Social Engagement and the Internet
This study defined social engagement as the quality and quantity of
interactions that an individual had with others on a daily basis. Previous literature
reported mixed results that indicated that the amount of Internet usage was linked
with both increases and decreases in face-to-face social interaction (Anderson, 2001;
Campbell et al., 2006; Kraut et al., 1998; Kraut et al., 2002; Madell & Muncer, 2007;
Sanders et al., 2000; Shaw & Gant, 2002). In contrast, popular media articles
frequently focus on a perceived negative effect of Internet usage on face-to-face social
interactions (Geldof, 2007; Grossman, 2008; USA Today, 2007).
Based on this review of both the psychological literature and the popular
media, it was hypothesized that either the amount of time spent on the Internet or the
amount of social interaction engaged in while online could be used to predict
loneliness and social interaction in offline settings (i.e., face-to-face relationships) and
in online settings (i.e., online relationships). Results suggest that neither the amount
of time spent on the Internet nor the amount of social activity engaged in while online
was predictive of participants' scores on measures of social engagement and
loneliness in offline and online settings. This lack of a statistically significant result
should not be dismissed because it adds to the previous literature suggesting that use of
the Internet does not necessarily result in individuals who are less socially engaged in
their day-to-day lives.
After investigating the primary hypothesis concerning social engagement and
the Internet, an additional analysis was completed to look at potential differences in
participants' perceptions of loneliness for face-to-face social engagement settings and
online social engagement settings (e.g., online relationships). The primary reason for
conducting this analysis was to investigate if the frequently negative conception of
the Internet's effect on social engagement in the popular media is related to the
intuitive perception of individuals. Previous research has shown that individuals will
overlook information that does not fit with their intuitive sense of how things should
occur, particularly if they are already confident that the information should fit in a
particular intuitive way (Simmons & Nelson, 2006). With this in mind, participants'
scores on the loneliness measure for face-to-face relationships and online relationships
were compared to look for apparent differences in their perception of loneliness.
Contrary to popular media accounts of social engagement and the Internet, there were
no apparent differences in the respondents' perception of the loneliness aspect of
social engagement for these seemingly disparate relationships. Thus it can be
hypothesized that when the popular media refer to the negative impact of the Internet
on social engagement they are not referring to the loneliness aspects of social
engagement.
Social Anxiety and the Internet
This study used the traditional definition of social anxiety as defined by the
Diagnostic and Statistical Manual, fourth edition, text revision (American Psychiatric
Association, 2000). The psychological literature on social anxiety and the Internet
(Campbell et al., 2006; Madell & Muncer, 2006; Neu, 2009) contradicted the popular
perception that socially anxious individuals were more likely to use the Internet for
interpersonal interactions (Ayushveda, 2008; Cuncic, 2009; Sorryforsilence, 2009).
Based on the review of both the psychological literature and the popular
media accounts of social anxiety, it was hypothesized that either the amount of time
spent online or the amount of social interaction engaged in while online could be used
to predict levels of social anxiety. This study confirmed previous findings in the
psychological literature that neither the amount of time spent on the Internet nor the
amount of social activity engaged in while online was predictive of the level of social
anxiety reported by participants when interacting with both offline and online
relationships.
After the analysis of the primary hypothesis was completed and found to not
be statistically significant, additional analysis was completed in order to investigate
the potential differences in participants' perceptions of social anxiety while engaging
with the Internet. As stated previously, earlier research had shown that individuals
were more likely to overlook information that is counterintuitive based on their own
level of confidence in the erroneous information (Simmons & Nelson, 2006) and it
was theorized that this may account for some of the discrepancy between the
psychological literature and the popular media. Unlike with the loneliness analysis,
the additional analysis on the participants' perceptions when they were asked to
respond to questions measuring social anxiety showed that they were more likely to
perceive differences in their social anxiety level when asked to focus on offline
relationships versus online relationships. This discrepancy in participants' perceptions
of social anxiety may contribute to the popular media accounts that assume a link
between higher levels of social anxiety when socializing in face-to-face situations and
higher levels of comfort when socializing while using the Internet to communicate.
Depression and the Internet
For the purpose of this study, the American Psychiatric Association definiton
of depression was used, defining it as a depressed mood most of the day with
anhedonia, changes in appetite, and feelings of worthlessness and guilt (American
Psychiatric Association, 2000). Previous studies investigating the association between
the Internet and levels of depression found that the link between depression and the
Internet was complicated by gender (Ybarra, 2004), type of activity engaged in
online (Morgan & Cotton, 2003), and the amount of time spent on the Internet
(Campbell et al., 2006), with all aspects being implicated in levels of depression
among users of the Internet. Other studies found no correlation between the Internet
and depression (Kraut et al., 2002; Sanders et al., 2000).
Using this review of the literature, it was hypothesized that the amount of time
spent on the Internet, the amount of social activity engaged in while online, or the
gender of the participant may be predictive of levels of depression. However, similar
to the Kraut and Sanders studies, this study also found that the amount of time spent
on the Internet, the type of activity engaged in while online (e.g., social versus nonsocial), and/or the gender of the participants were not predictive of reported levels of
depression.

CHAPTER 6
CONCLUSION, LIMITATIONS, AND RECOMMENDATIONS
This study was intended to clarify the psychological literature concerning use
of the Internet and its effect on social engagement in a college-aged population with
particular attention to levels of social anxiety and depression. Thorough investigation
of three primary questions revealed no correlation between the amount of time spent
on the Internet and levels of social engagement, social anxiety and depression in
either offline (e.g., face-to-face) relationships or online relationships. Additionally, no
correlation between the type of activity engaged in while on the Internet and levels of
social engagement, social anxiety, or depression in offline or online relationships was
found in the current study. One finding of note was that the perception of social
anxiety decreased for participants when asked to answer for online relationships, even
though their actual levels of social anxiety were still not significantly influenced by
Internet use.
The seeming implication of this study is that the Internet, like so many other
aspects of daily life, is merely a tool that individuals access and use in ways that they
can choose. The amount of social engagement in which a person engages, both on- and offline, is not significantly influenced by this tool, nor are their reported levels of
depression symptoms or social anxiety.
There are several limitations to the findings of this study that must be
considered. First, the population at the rural university where this study was
conducted is 87% White, non-Hispanic (IUP, 2010), and thus the sample can be
assumed to have been disproportionately White. This assumption is made because
ethnicity was inadvertently absent in the demographic questionnaire this study used.
This prevented exploration of differences based on ethnicity and represents a
limitation for generalizing the results of this study to non-White populations.
Collection of this variable would benefit future investigations of this topic.
A second and common limitation is the analog nature of the current study.
Although participants were asked to enter their Internet use and social engagement
into a computer database each evening, they were still required to manually keep
track and enter their self-report. Due to the nature of self-report it is possible that the
data entered is not as accurate as it would be if their usage had been tracked digitally.
Future studies would benefit from gaining permission from participants to install a
computer tracking program to automatically gather the information needed.
The third major limitation of this study is the age group that was tracked.
Although this study focused on college students because this population is
assumed to have had more access to the Internet as a part of their daily lives
for the majority of their lives, it is possible that different results
regarding the predictive nature of Internet usage would have been found in older
populations. Future research would benefit from exploring the hypothesized links of
this study across both ethnicity and the lifespan.
The last major limitation is that the population of this study was
disproportionately female despite investigator efforts to obtain equal representation of
sex across participants and thus it is possible that the lack of significant gender effects
was due to this discrepancy. Future studies would benefit from using a population
with equally distributed sex in order to explore or rule out any potential effects due to
the sex or gender of the standard Internet user. Perhaps future studies would benefit
by masking the study to help prevent reactivity effects, and thus encourage more
individuals of all genders to participate.
This study confirmed what previous psychological studies have alluded to,
and what the popular media has appeared to deny: the Internet is a valuable tool that
individuals use on a daily basis in order to access information concerning the world
around them. It appears that this tool does not significantly increase a person's
reported levels of depression, social engagement, or social anxiety. However, one
finding that has not been seen in other studies is that although the Internet does not
change a person's actual level of social anxiety, it may decrease their perception of
social anxiety when interacting online. Future studies would benefit from continuing
to explore this difference between the individual's perceived and actual levels of
social anxiety to determine what, if any, aspect of individuals' online relationships
helps. Specifically, more research needs to be done looking at both online and offline
relationships in all aspects of mental health and across all ethnicities and ages.
Additionally, given the rate at which the Internet and all aspects of social media are
expanding, it would be beneficial to have participants rate their activities online in
terms of how social they consider each activity. For example, one person playing
chess online may find the interaction to be highly social while another individual may
find it to be a solitary activity. This perception of social interaction online would be a
rich area to explore in future research.
Thus, contrary to the original hypotheses of this study, the Internet is simply a
tool that can be used to broaden a person's experience of the world in any way that
they see fit. The Internet puts the world at a person's fingertips, and in the United
States, is theoretically the type of tool that any individual can access regardless of
their socioeconomic status, ethnicity, class standing, or geographic location.

REFERENCES
American Psychiatric Association. (2000). Diagnostic and Statistical Manual of
Mental Disorders (text revision). Washington, DC: Author.
Amichai-Hamburger, Y., Wainapel, G., & Fox, S. (2002). On the Internet No One
Knows I'm an Introvert: Extroversion, Neuroticism, and Internet Interaction.
CyberPsychology & Behavior, 5(2), 125-128.
Anderson, K.J. (2001). Internet Use Among College Students: An Exploratory Study.
Journal of American College Health, 50(1), 21-26.
Ayushveda (2008, July). How to Reduce Anxiety. Online magazine by Ayushveda.
Retrieved from http://www.ayushveda.com/
Braveman, P., Cubbin, C., Marchi, K., Egerter, S., & Chavez, G. (2001). Measuring
Socioeconomic Status/Position in Studies of Racial/Ethnic Disparities:
Maternal and Infant Health. Public Health Reports, 116, 449-463.
Brignall, T.W., III & Van Valey, T. (2005). The Impact of Internet Communications
on Social Interactions. Sociological Spectrum, 25, 335-348.
Canadian Council on Social Development (2006). Social Engagement: The Progress
of Canada's Children and Youth, 2006. Retrieved from
http://www.ccsd.ca/pccy/2006/pdf/pccy_socialengagement.pdf.
Campbell, A.J., Cummings, S.R., & Hughes, I. (2006). Internet Use by the Socially
Fearful: Addiction or Therapy? CyberPsychology & Behavior, 9(1), 69-81.
Caplan, S.E. (2007). Relations Among Loneliness, Social Anxiety, and Problematic
Internet Use. CyberPsychology & Behavior, 10(2), 234-242.

Carleton, R.N., McCreary, D.R., Norton, P.J., & Asmundson, G.J.G. (2006). Brief
Fear of Negative Evaluation Scale Revised. Depression and Anxiety, 23,
297-303.
Carney, C.E., Edinger, J.D., Meyer, B., Lindman, L., & Istre, T. (2006). Daily
Activities and Sleep Quality in College Students. Chronobiology
International, 23(3), 623-637.
CBS News (2007, July 15). Parents Played Video Games As Kids Starved. Retrieved
from http://www.cbsnews.com/stories/2007/07/15/national/
main3058816.shtml. Article on file with author.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ:
Lawrence Erlbaum Associates.
Cuncic, A. (2009, June 30). Is Facebook Good for Social Anxiety [Web log post].
Retrieved from
http://socialanxietydisorder.about.com/b/2009/06/30/facebook-good-forsocial-anxiety.htm
Davis, D.C. (2007). MySpace Isn't Your Space. ExpressO Preprint Series. Working
Paper 1943. Retrieved from http://law.bepress.com/expresso/eps/1943
Email (2009). The American Heritage Dictionary of the English Language, Fourth
Edition. Retrieved from http://dictionary.reference.com/browse/email
Faul, F., Erdfelder, E., Lang, A.-G. & Buchner, A. (2007). G*Power 3: A flexible
statistical power analysis program for the social, behavioral, and biomedical
sciences. Behavior Research Methods, 39, 175-191.

Federal Networking Council. (1995). FNC Resolution: Definition of Internet.
Retrieved from http://nitrd.gov/fnc/Internet_res.html
Fiske, S.T. (2004). Social Beings: A Core Motives Approach to Social Psychology.
Hoboken, N.J.: John Wiley & Sons, Inc.
Fox News (2007, July 15). Nevada Couple Blame Internet for Neglect. Retrieved
from http://www.foxnews.com/2007Jul15/0,4675,Neglect
InternetAddiction,00.html. Article on file with author.
Geldof, P. (2007, March 30). It may start as innocent flirtation, but be warned, you
too could become a lonely MySpace Addict. Retrieved from
http://www.guardian.co.uk/
Grossman, L. (2008, July 10). Post Apocalypse. Time. Retrieved from
http://www.time.com/time/
Hann, D., Winter, K., & Jacobsen, P. (1999). Measurement of depressive symptoms
in cancer patients: Evaluation of the Center for Epidemiological Studies
Depression Scale (CES-D). Journal of Psychosomatic Research, 46, 437-443.
Haynes, P.L., Ancoli-Israel, S., & McQuaid, J. (2005). Illuminating the Impact of
Habitual Behaviors in Depression. Chronobiology International, 22(2), 279-297.
Howe, D. (2007). The Free On-Line Dictionary of Computing. Retrieved from
http://fodoc.org/
Instant Messaging (2009). The American Heritage Dictionary of the English
Language, Fourth Edition. Retrieved from
http://dictionary.reference.com/browse/instant_messaging

Internet World Stats (2008) United States of America: Internet Usage and Broadband
Usage Report. Retrieved from http://www.internetworldstats.com/am/us.htm
IRC (2009). The American Heritage Dictionary of the English Language, Fourth
Edition. Retrieved from http://dictionary.reference.com/browse/irc
IUP (2010). Facts about IUP. Retrieved from http://www.iup.edu/about/default.aspx
Jones, S. & Fox, S. (2009). Pew Internet & American Life Project: Generations
online in 2009. Retrieved from http://www.pewinternet.org
Kraut, R., Patterson, M., Lundmark, V., Kiesler, S., Mukopadhyay, T., & Scherlis, W.
(1998). Internet Paradox: A Social Technology that Reduces Social
Involvement and Psychological Well-Being? American Psychologist, 53(9),
1017-1031.
Kraut, R., Kiesler, S., Boneva, B., Cummings, J., Helgeson, V., & Crawford, A.
(2002). Internet Paradox Revisited. Journal of Social Issues, 58(1), 49-74.
Luo, J.S. (2007). Social Networking: Now Professionally Ready. Primary Psychiatry,
14(2), 21-24.
Madell, D., & Muncer, S. (2006). Internet Communication: An Activity that Appeals
to Shy and Socially Phobic People? CyberPsychology & Behavior, 9(5), 618-622.
Madell, D., & Muncer, S.J. (2007). Control over Social Interactions: An Important
Reason for Young People's Use of the Internet and Mobile Phones for
Communication. CyberPsychology & Behavior, 10(1), 137-140.
Meyer, T.D., & Maier, S. (2006). Is there evidence for social rhythm instability in
people at risk for affective disorders? Psychiatry Research, 141, 103-114.

Monk, T.H., Frank, E., Potts, J.M., & Kupfer, D.J. (2002). A simple way to measure
daily lifestyle regularity. J. Sleep Res., 11, 183-190.
Monk, T.H., Kupfer, D.J., Frank, E., & Ritenour, A.M. (1990). The Social Rhythm
Metric (SRM): Measuring Daily Social Rhythms Over 12 Weeks. Psychiatry
Research, 36, 195-207.
Monk, T.H., Petrie, S.R., Hayes, A.J., & Kupfer, D.J. (1994). Regularity in daily life
in relation to personality, age, gender, sleep quality, and circadian rhythms. J.
Sleep Res, 3, 196-205.
Morgan, C., & Cotton, S.R. (2003). The Relationship between Internet Activities and
Depressive Symptoms in a Sample of College Freshmen. CyberPsychology &
Behavior, 6(2), 133-142.
Neu, S. (2009). Use of Massively Multiplayer Online Role Play Games by College
Students (Doctoral dissertation). Available from ProQuest Dissertations and
Thesis Database (ATT No. 3344420).
Odell, P.M., Korgen, K.O., Schumacher, P., & Delucchi, M. (2000). Internet Use
Among Female and Male College Students. CyberPsychology & Behavior,
3(5), 855-862.
Pew Internet Tracking Survey (2007a) Demographics of Internet Users. Retrieved
from http://www.pewinternet.org/
Pew Internet Tracking Survey (2007b) Daily Internet Activities. Retrieved from
http://www.pewinternet.org/

Radloff, L.S. (1977). The CES-D Scale: A Self-Report Depression Scale for
Research in the General Population. Applied Psychological Measurement, 1,
385-401.
Russell, D. (1996). UCLA Loneliness Scale (Version 3): Reliability, Validity, and
Factor Structure. Journal of Personality Assessment, 66(1), 20-40.
Sanders, C.E., Field, T.M., Diego, M. & Kaplan, M. (2000). The Relationship of
Internet Use to Depression and Social Isolation Among Adolescents.
Adolescence, 35(138), 237-242.
Sellers, P. (2006, August 29). MySpace Cowboys. Fortune Magazine. Retrieved from
http://money.cnn.com/magazines/fortune/
Shaw, L.H., & Gant, L.M. (2002). In Defense of the Internet: The Relationship
between Communication and Depression, Loneliness, Self-Esteem, and
Perceived Social Support. CyberPsychology & Behavior, 5(2), 157-171.
Simmons, J.P. & Nelson, L.D. (2006). Intuitive Confidence: Choosing Between
Intuitive and Nonintuitive Alternatives. Journal of Experimental Psychology:
General, 135(3), 409-428.
Sorryforsilence (2009, June 22). Are online friends BAD for people with social
anxiety disorder? [Web log post]. Retrieved from
http://sorryforsilence.wordpress.com/2009/06/22/is-having-online-friendsbad-for-people-with-social-anxiety-disorder/
Teske, J.A. (2002). Cyberpsychology, Human Relationships, and Our Virtual
Interiors. Zygon, 37(3), 677-700.

Thibaut, J., & Kelley, H. (1986). Interference and Facilitation in Interaction. In The
Social Psychology of Groups. (p. 60). Edison, New Jersey: Transaction
Publishers.
Turk, C.L., Heimberg, R.G., & Hope, D.A. (2001). Social Anxiety Disorder. In D.H.
Barlow (Ed.), Clinical Handbook of Psychological Disorders (3rd edition).
(pp. 114-153). New York: The Guilford Press.
USA Today (2007, July 15). Couple Accused of starving children while on the
Internet. Retrieved from http://www.usatoday/news/nation/2007-07-15internet-neglect_N.htm. Article on file with author.
Watters, E. (2003). Urban Tribes. New York: Bloomsbury.
Weiser, E. (2000). The Functions of Internet use and their social, psychological, and
interpersonal consequences (Doctoral dissertation). Available from ProQuest
Dissertations and Thesis Database (ATT No. 9980637).
White, E. (2007). Text Appeal: In the Age of Computers and Cell Phones,
Relationships Progress from Email to Text to the Real Commitment: A Phone
Call. The Houston Chronicle. Retrieved from http://www.chron.com/
World Wide Web (2002). The American Heritage Science Dictionary. Retrieved
from http://dictionary.reference.com/browse/world_wide_web
Ybarra, M.L. (2004). Linkages between Depressive Symptomatology and Internet
Harassment among Young Regular Internet Users. CyberPsychology &
Behavior, 7(2), 247-257.

Young, J.E., Weinberger, A.D., & Beck, A.T. (2001). Cognitive Therapy for
Depression. In D.H. Barlow (Ed.), Clinical Handbook of Psychological
Disorders (3rd edition). (pp. 114-153). New York: The Guilford Press.
Yule, T. (2004). Lotus Illustrated Dictionary of Internet. Twin Lakes, WI: Lotus
Press.

APPENDIX A
Informed Consent

Clinical Psychology Doctoral Program


Psychology Department
Uhler Hall, Room 201 / 1020 Oakland Avenue
Indiana, Pennsylvania 15705-1064
724-357-4519 (office) 724-357-4519 (fax)

Informed Consent
You are invited to participate in this research study. The following information is
provided in order to help you make an informed decision about whether or not to
participate in this study. If you have any questions please do not hesitate to ask via
the provided researcher email listed below. You are eligible to participate because
you are an undergraduate at Indiana University of Pennsylvania and enrolled in PSYC
101 General Psychology.
The purpose of this study is to learn about college students' habits when using the
Internet and the impact that it may have on social relationships and psychological
well-being. This study is particularly interested in looking at the amount of time you
spend on the Internet per week and the particular activities you engage in while on the
Internet. In an effort to get a complete picture of respondents, some demographic
information is included for this study. Several questionnaires include items of a
personal nature related to feelings of loneliness, depression and anxiety. It is
estimated that completion of questionnaires will take one hour at the initial interview
and an additional 15 minutes each night for a period of 7 days for no longer than a
total commitment of 3 hours. Your completed participation in this study will earn 4
of the 6 points required to complete your research participation in your PSYC 101
course.
Your participation in this study is voluntary. You may choose not to participate in this
study or to withdraw at any time without adversely affecting your relationship with
the investigators, with IUP, or your psychology professor. If you choose not to
participate, your name will be returned to the subject pool and your research
participation obligation will remain the same. If you choose to participate you may
withdraw at any time by notifying the researcher. Upon your request to withdraw, all
information pertaining to you will be destroyed. If you choose to participate, all
information will be held in strict confidence. Your responses will be considered only
in combination with those from other participants. The information obtained in this
study may be published in scientific journals or presented at scientific meetings but
your identity will always be kept strictly confidential.

This research is sponsored by Indiana University of Pennsylvania's Department of
Psychology. If you have any questions, please contact the researchers listed below:
Primary Researcher:
Kimberlee D. DeRushia, M.A.
Graduate Student
Psychology Department
201 Uhler Hall
Indiana, PA 15705
k.d.derushia@iup.edu

Faculty Sponsor:
Kimberely J. Husenits, Psy.D.
Associate Professor
Psychology Department
238A Uhler Hall
Indiana, PA 15705
husenits@iup.edu

If you are willing to participate in this study, please sign the statement below. If you
choose not to participate, please inform the researcher now.
I have read the above information and understand that participation in this study is
voluntary. I agree to be a part of this research.

Signature of Participant

Date

APPENDIX B
Demographic Questionnaire
Where do you currently reside?
_____ On campus in student housing
_____ Off campus in student housing
_____ Off campus with family
_____ Off campus with friends
_____ Off campus alone
_____ Other:

Do you own a personal computer?


_____ yes
_____ no

Where do you have Internet access (check all that apply)?


_____ On campus
_____ At home
_____ At parents' house
_____ At work
_____ Other:

What are the 3 most important activities you use the Internet for?
(1)
(2)
(3)

What is your current college major(s)?


(1)
(2)

Have you declared a minor?


_____ Yes, if yes, in what field:
_____ No

What is your current academic standing?


_____ Freshman
_____ Sophomore
_____ Junior
_____ Senior
_____ Continuing Education

What is your gender?


_____ Male
_____ Female
What is your mother's highest level of education?
_____ Did not complete high school
_____ High school
_____ Vocational or trade school
_____ Some college
_____ Bachelor's degree
_____ Master's degree
_____ Doctorate degree

What is your father's highest level of education?
_____ Did not complete high school
_____ High school
_____ Vocational or trade school
_____ Some college
_____ Bachelor's degree
_____ Master's degree
_____ Doctorate degree
What is your family's estimated total income?
_____ less than $7,550
_____ between $7,550 and $30,650
_____ between $30,650 and $61,850
_____ between $61,850 and $94,225
_____ between $94,225 and $168,275
_____ over $168,275

APPENDIX C
Part One: Internet Usage Tracking Chart

Days (columns): Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday

Hourly time slots (rows):
Midnight to 1 am
1 am to 2 am
2 am to 3 am
3 am to 4 am
4 am to 5 am
5 am to 6 am
6 am to 7 am
7 am to 8 am
8 am to 9 am
9 am to 10 am
10 am to 11 am
11 am to 12 pm
12 pm to 1 pm
1 pm to 2 pm
2 pm to 3 pm
3 pm to 4 pm
4 pm to 5 pm
5 pm to 6 pm
6 pm to 7 pm
7 pm to 8 pm
8 pm to 9 pm
9 pm to 10 pm
10 pm to 11 pm
11 pm to Midnight

Scoring: Add up the total hours online for each day, then divide by the number of days tracked to obtain the average hours online per day.
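
The scoring rule above amounts to a simple mean. The sketch below is only an illustration: the function name `average_hours_per_day` and the sample daily totals are hypothetical and do not come from the study's data.

```python
def average_hours_per_day(daily_totals):
    """Return the mean number of hours online per tracked day."""
    if not daily_totals:
        raise ValueError("at least one tracked day is required")
    return sum(daily_totals) / len(daily_totals)


# Hypothetical totals for Monday through Sunday (hours marked on the chart).
week = [3, 2, 4, 1, 5, 6, 0]
print(average_hours_per_day(week))  # 21 hours / 7 days = 3.0
```

If a participant tracked fewer than seven days, the same rule would divide by the number of days actually recorded.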

Part Two: Internet Usage Follow-up Questions

How many hours per week do you spend studying?
How many courses have you taken that require Internet use?
How many total hours per week do you spend using the Internet?
How many hours per week do you use the Internet for school related work?
How many hours per week do you use the Internet for emailing?
How many hours per week do you use the Internet for instant messaging?
How many hours per week do you spend in Internet chat rooms?
How many hours per week do you spend browsing Internet sites?
How many hours per week do you use the Internet for gaming?
How many hours per week do you use the Internet for blogging or on social
networking sites (Facebook, MySpace, etc.)?
Please rank-order these Internet activities from most likely to be what you do online
to least likely:
_____ Email
_____ Research for school
_____ WebCT / Online Course
_____ Research for personal knowledge
_____ Sex sites
_____ Chat
_____ Shopping
_____ Researching items for purchasing
_____ News
_____ Games
_____ Music
_____ Blogs / Social networking sites
_____ Gambling

APPENDIX D

Social Rhythm Metric

Columns recorded for each activity:
Activity
Check if did not do
Time (clock time, AM or PM)
Check if alone
People (1 = just present, 2 = actively involved): spouse / partner, children, other family members, other person(s)

Activities:
Out of bed
First contact (in person or by phone) with another person
Have morning beverage
Have breakfast
Go outside for the first time
Start work, school, housework, volunteer activities, child or family care
Have lunch
Take an afternoon nap
Have dinner
Physical exercise
Have an evening snack / drink
Watch evening TV news program
Watch another TV program
Activity A
Activity B
Return home (last time)
Go to bed

APPENDIX E
UCLA Loneliness Scale (Version 3)
The following statements describe how people sometimes feel. For each statement, please indicate how often you feel
the way described by writing a number in the space provided.
Here is an example: How often do you feel happy?
If you never felt happy, you would respond "never"; if you always feel happy, you would respond "always."

Response options: Never, Rarely, Sometimes, Often

1. How often do you feel that you are "in tune" with the people
around you?

2. How often do you feel that you lack companionship?

3. How often do you feel that there is no one you can turn to?

4. How often do you feel alone?

5. How often do you feel part of a group of friends?

6. How often do you feel that you have a lot in common with
the people around you?

7. How often do you feel that you are no longer close to
anyone?

8. How often do you feel that your interests and ideas are not
shared by those around you?

9. How often do you feel outgoing and friendly?

10. How often do you feel close to people?

11. How often do you feel left out?

12. How often do you feel that your relationships with others are
not meaningful?

13. How often do you feel that no one really knows you well?

14. How often do you feel isolated from others?

15. How often do you feel that you can find companionship
when you want it?

16. How often do you feel that there are people who really
understand you?

17. How often do you feel shy?

18. How often do you feel that people are around you but not
with you?

19. How often do you feel that there are people you can talk to?

20. How often do you feel that there are people you can turn to?

APPENDIX F
Center For Epidemiologic Studies Depression Scale (CES-D)

Below is a list of the ways you might have felt or behaved.
Please indicate how often you have felt this way during the past week:

Response options:
Rarely or none of the time (less than 1 day)
Some or a little of the time (1-2 days)
Occasionally or a moderate amount of time (3-4 days)
Most or all of the time (5-7 days)

1. I was bothered by things that usually don't bother me.

2. I did not feel like eating; my appetite was poor.

3. I felt that I could not shake off the blues even with help from
my family or friends.

4. I felt I was just as good as other people.

5. I had trouble keeping my mind on what I was doing.

6. I felt depressed.

7. I felt that everything I did was an effort.

8. I felt hopeful about the future.

9. I thought my life had been a failure.

10. I felt fearful.

11. My sleep was restless.

12. I was happy.

13. I talked less than usual.

14. I felt lonely.

15. People were unfriendly.

16. I enjoyed life.

17. I had crying spells.

18. I felt sad.

19. I felt that people dislike me.

20. I could not get going.

APPENDIX G
Brief Fear of Negative Evaluation, Revised (BFNE-II)

For the following statements, please indicate how characteristic each is of you using the following rating scale:

Not at all characteristic of me
Slightly characteristic of me
Moderately characteristic of me
Very characteristic of me
Extremely characteristic of me

1. I worry about what other people will think of me even when I know
it doesn't make any difference

2. It bothers me when people form an unfavorable impression of me

3. I am frequently afraid of other people noticing my shortcomings

4. I worry about what kind of impression I make on people

5. I am afraid that people will find fault with me

6. I am afraid that others will not approve of me

7. I am concerned about other people's opinions of me

8. When I am talking to someone, I worry about what they may be
thinking of me

9. I am usually worried about what kind of impression I make

10. If I know someone is judging me, it tends to bother me

11. Sometimes I think I am too concerned with what other people
think of me

12. I often worry that I will say or do wrong things

APPENDIX H
Debriefing
Clinical Psychology Doctoral Program
Psychology Department
Uhler Hall, Room 201 / 1020 Oakland Avenue
Indiana, Pennsylvania 15705-1064
724-357-4519 (office) 724-357-4519 (fax)

Debriefing
Thank you for participating in this research study. The Internet has become an
integral part of Western society, with approximately 69.2% of the population of the
United States using the Internet on a regular basis (Internet World Stats, 2007). This
study was conducted with the purpose of learning about college students' habits when
using the Internet and the impact that it may have on the social relationships and
psychological well-being of regular users of this medium. The connection between
Internet use and social engagement, depression, and social anxiety has received mixed
results in both the popular media and the psychological research; an example of this
can be found in the journal article Internet Paradox Revisited (Kraut et al., 2002). The
study in which you participated is designed to more accurately track students' daily
use of the Internet in terms of time spent on a variety of online activities in order to
clarify links between use and psychological outcomes.
The responses that you gave will be considered only in combination with those from
other participants in the study so that you cannot be personally identified. Although
the information obtained in this study may be published in scientific journals or
presented at scientific meetings, your identity will always be kept strictly
confidential.
This research is sponsored by Indiana University of Pennsylvania's Department of
Psychology. If you have any questions concerning this study, if you would like more
examples of the mixed results found between popular media and the psychological
research, or if you feel that you need to speak with a professional and would like a
referral, please contact the primary researcher listed below:

Primary Researcher:
Kimberlee D. DeRushia, M.A.
Graduate Student
Psychology Department
201 Uhler Hall
Indiana, PA 15705
k.d.derushia@iup.edu

Faculty Sponsor:
Kimberely J. Husenits, Psy.D.
Associate Professor
Psychology Department
238A Uhler Hall
Indiana, PA 15705
husenits@iup.edu

APPENDIX I
Campus and Community Resources
Counseling / Psychotherapy Resources:
1. IUP Counseling and Student Development Center
307 Pratt Hall (IUP campus)
724.357.2621
2. Crisis Intervention, Drug and Alcohol Counseling:
Open Door Counseling & Crisis Center
334 Philadelphia Street
Indiana, PA
724.465.2605
Suicide Hotline: 800.794.2112
3. Indiana County Guidance Center
793 Old Route 119 Highway North
Indiana, PA
724.465.5576
4. Center for Applied Psychology
Includes Stress & Habit Disorders Clinic, Child & Family Clinic, and Assessment
Clinic
210 Uhler Hall (IUP campus)
724.357.6228
Domestic Violence or Rape Crisis:
1. Alice Paul House
724.349.4444 or 800.435.7249
Child Abuse or Neglect:
1. Indiana County Children and Youth Services
350 N. 4th Street
Indiana, PA
724.465.3895

Academic / Learning Difficulties:


1. Tutorial Center
306 Pratt Hall (IUP campus)
724.357.2159
2. Advising and Testing / Disability Services
106 Pratt Hall (IUP campus)
724.357.4067
3. Learning Center
202 Pratt Hall (IUP campus)
724.357.2727
Career Planning:
1. Career Services
302 Pratt Hall (IUP campus)
724.357.2235
Legal Services:
1. Student Legal Services
936 Philadelphia Street
Indiana, PA
724.349.6020
Activities / Campus Events:
1. Center for Student Life
Student Activities and Organizations
102 Pratt Hall (IUP campus)
724.357.2315
2. Student Cooperative Association
Hadley Union (IUP campus)
724.357.2590
Other Resources:
1. Gay, Lesbian, Bisexual and Transgender Concerns
Dr. Rita Drapkin: safe-zone@iup.edu
2. Interfaith Council
www.iup.edu/student dev/ministry.shtm

Recommending or Persuading?
The Impact of a Shopping Agent's
Algorithm on User Behavior

Gerald Häubl
School of Business
University of Alberta
Edmonton, AB
Canada, T6G 2R6
Gerald.Haeubl@ualberta.ca

Kyle B. Murray
School of Business
University of Alberta
Edmonton, AB
Canada, T6G 2R6
kbmurray@ualberta.ca

ABSTRACT
This paper investigates the potential of recommendation agents
for electronic shopping to influence human decision making by
shaping user preferences. Specifically, we examine how the type
of information that is elicited by a shopping agent for use in its
recommendation algorithm may affect consumers' preference for
product features and ultimately their product choice in an
electronic marketplace. A recommendation agent is defined as a
software tool that (a) calibrates a model of a user's preference
based on his/her input and (b) uses this model to make
personalized product recommendations. We report the results of a
controlled experiment that demonstrates that, everything else
being equal, the inclusion of a product feature in a
recommendation agent renders this feature more prominent in
shoppers' purchase decisions. In addition, we find that this effect
is moderated by an important property of the marketplace: the
correlation structure among the features of available products. We
conclude that electronic shopping agents, through the design of
their recommendation algorithms, have the potential to influence
user preferences in a systematic fashion.

Keywords
Recommendation Systems, Shopping Agents, Personalization,
Human Decision Making, Consumer Behavior, Product Choice,
Online Shopping, User Preferences, Persuasion, Influence.

1. INTRODUCTION
The constraints of physical space no longer dictate the
organization of information in electronic shopping environments
[6]. One consequence of this is that online vendors are able to
offer a very large number of products due to their virtually infinite
shelf space, i.e., the lack of physical constraints with respect to
product display. Combined with the fact that the cost of searching
for product information across merchants is substantially lower in
electronic marketplaces than in the physical world [1, 7], this
results in the availability of a potentially vast amount of
information about market offerings to consumers.
Easy access to large amounts of product information is both a
blessing and a curse. It is a blessing in the sense that more
information may allow consumers to make better purchase
decisions (e.g., to select products that better match their personal
preferences) than they would otherwise. However, the curse of
having access to vast amounts of information is that consumers,
due to their limited cognitive capacity, may be unable to
adequately process this information. The idea that human decision
makers have limited resources for information processing
(whether those limits are in memory, attention, motivation, or
elsewhere) has deep roots in the literature of both marketing
and psychology [9, 11, 12]. In electronic shopping environments,
consumers are less constrained by the availability of product
information, yet they remain bounded by the cognitive limitations
of human information processing.

Categories and Subject Descriptors
H.1.2 [Models and Principles]: User/Machine Systems - human
information processing, human factors, software psychology.
H.5.2 [Information Interfaces and Presentation (e.g., HCI)]:
User Interfaces - evaluation/methodology, interaction styles,
screen design, theory and methods.

General Terms
Algorithms, Management, Design, Economics, Experimentation,
Human Factors, Theory.

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies
are not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
EC'01, October 14-17, 2001, Tampa, Florida, USA.
Copyright 2001 ACM 1-58113-387-1/01/0010$5.00.

A response to the problem of information overload in digital
marketplaces is the emergence of electronic decision aids for
consumers. The latter represent a technology that takes advantage
of an important and unique characteristic of digital shopping
interfaces: the potential for real-time personalization of an
information environment based on explicit input by, or other
information about, a user [5]. Software tools that generate
personalized product recommendations in the form of a list in
which alternatives are sorted by their predicted attractiveness to
an individual shopper, thus allowing the latter to screen a large set
of alternatives in a systematic and efficient manner, are
particularly valuable to consumers and have become highly
prevalent in real-world digital marketplaces (e.g., Amazon,
MySimon, Microsoft's MSN eShop, and the Yahoo! shopping
site). We refer to electronic decision aids of this type as
recommendation agents.

Following Häubl and Trifts [5], we conceptualize an electronic
recommendation agent as a software tool that (a) attempts to
understand a human decision maker's multi-attribute preference
with respect to a particular domain or product category based on a
learning (or calibration) phase during which the human reveals
subjective preference information to the agent and (b) makes
recommendations in the form of a sorted list of alternatives
provided to the human in a decision task based on its
understanding of that individual's subjective preference structure.

Our focus is on recommendation agents that attempt to understand
a consumer's preference in terms of a multi-attribute preference
model (based, e.g., on a weighted additive evaluation rule) that is
calibrated using subjective preference information revealed to the
agent by the human user. Such feature- or attribute-based
recommendation agents are an integral component of many of the
major online shopping sites (e.g., MySimon), and they represent a
standardized technology that can be licensed by vendors for
inclusion in their electronic stores (e.g., Active Buyer's Guide and
Frictionless Commerce's PurchaseSource).

Almost inevitably, real-world attribute-based recommendation
agents are selective in the sense that only a subset of all the
relevant product features can be used in their calibration and,
thus, in the algorithm used to generate the recommendations. This
is apparent in the implementation of many commercial
recommendation systems for online shopping (see, e.g.,
MySimon, Active Buyer's Guide, or Nike's online product
recommender). The reasons for such selectivity in
recommendation agents include (a) the very large number of
attributes that exist in many product categories, (b) the substantial
amount of data about, or interaction with, a consumer that would
be required to develop an accurate understanding of the
consumer's subjective preference in high-dimensional attribute
space, (c) an inclination to use only those attributes that are
common to most or all alternatives, and (d) a tendency to include
only attributes that are quantitative in nature (i.e., the levels of
which can be represented numerically). Apart from these reasons,
the selective inclusion of attributes in a recommendation agent
may also be driven by strategic objectives (e.g., to de-emphasize
specific attributes) on the part of whoever controls the design of
the agent.

A recommendation agent may be made available either by a
particular online vendor (e.g., Nike's online store), in order to
assist shoppers in choosing one of the products in its own
assortment, or by a third-party provider (e.g., Active Buyer's
Guide), in order to help consumers in selecting a product from
among different vendors. This distinction might be associated
with different motivations for including certain attributes in the
decision aid. For example, Nike's product recommender does not
include any features that are not available in Nike products.
Similarly, Active Buyer's Guide typically does not include all
possible product attributes, as this would defeat one of the main
purposes of recommendation agents, namely to simplify
consumers' product search and purchase decisions. The present
work pertains equally to both provider scenarios (vendor and
third-party provider), as long as the recommendation agent is
selective in terms of attribute inclusion. Throughout the remainder
of this paper, we refer to attribute-based recommendation tools in
general, regardless of who might have control over the specific
aspects of their design.

Recent empirical research shows that the availability of an
attribute-based recommendation agent in an electronic shopping
environment may result in a substantial reduction in the amount of
consumers' pre-purchase information search [5]. This finding
suggests that, due to the limited information-processing capacity
of the human mind, users tend to rely heavily upon an electronic
agent's product recommendation in order to reduce the amount of
effort required to make a purchase decision. Given this tendency
to rely on suggestions made by recommendation agents, and given
the rapidly increasing prevalence of such decision aids in digital
marketplaces, it is critically important to develop an
understanding of whether and how electronic recommendation
agents may influence consumers' preferences.

Based on recent theorizing in the area of preference construction,
we propose that the characteristics of a recommendation agent
may systematically influence decision makers' preferences for
objects in multi-attribute space. Since real-world recommendation
agents are almost inevitably selective in the sense that they
consider only a subset of the pertinent product features, the
particular set of features that is included in an agent (i.e., used
at its calibration stage and considered by its sorting algorithm)
is a key characteristic of a recommendation agent and of the
digital shopping environment that it is embedded in. We
hypothesize that whether or not a particular attribute is included
will affect the subjective importance of that attribute to the
decision maker. More specifically, we predict an inclusion effect,
such that an attribute will be rendered more prominent in
preferential choice merely as a result of its inclusion in a
recommendation agent.

The remainder of the paper is organized as follows. First, we
briefly review the notion of constructive consumer preferences
and discuss the potential for preference construction in digital
marketplaces. Specifically, we focus on the possible role of
electronic recommendation agents in consumers' construction of
preferences. This is followed by a discussion of the methods and
results of a controlled experiment aimed at enhancing our
understanding of how attribute inclusion in an electronic agent
may influence individuals' preferences. The results of this
experiment provide support for the existence of the predicted
inclusion effect: attributes receive greater weight in an
agent-assisted shopping task when they are included in the
recommendation agent than when they are not. The paper
concludes with a discussion of how the findings contribute to the
expansion of our understanding of human decision making in
connection with electronic shopping agents.

2. THE POTENTIAL OF AN
ELECTRONIC AGENT TO PERSUADE
The information-processing approach to human decision making
recognizes that individuals' information-processing capacity is
limited [2] and that most decisions are consistent with the notion
of bounded rationality in that decision makers seek to attain some
satisfactory, although not necessarily maximal, level of
achievement [13]. As a result of these constraints, individuals
typically do not have well-defined preferences that are stable over
time and invariant to the context in which decisions are made [3].
That is, in a domain (e.g., product category) involving alternatives
that are characterized in terms of multiple attributes, individuals
typically do not have specific pre-formed strategies with respect to
exactly how important each of several attributes is to them
personally, what kind of integration rule they should use to
combine different pieces of attribute information into overall
assessments of alternatives, or precisely how they wish to make
trade-offs between attributes. Instead, decision makers tend to
construct their preferences on the spot when they are prompted
either to express an evaluative judgment or to make a decision [8].

The basic idea that underlies the constructive-preferences
perspective is that decision makers typically do not have a "master
list" that they can refer to with regard to their preferences [3].
This perspective adheres to two major tenets: (a) expressions of
preference are generally constructed at the time at which the
valuation of an object is required, and (b) this construction
process will be shaped by the interaction between the properties
of the human information-processing system and the properties of
the decision task, leading to highly contingent behavior [10].

Given the large amount of empirical evidence suggesting that the
particular characteristics of the decision environment may play a
central role in individuals' construction of preference [14], the
potential of electronic shopping environments, which are
interactive and personalizable, to influence consumer preferences
and, ultimately, purchase decisions is very significant [6]. In
particular, since shoppers tend to be quite willing to rely on
product recommendations made to them by digital agents [5], the
latter may be an important determinant of how consumers
construct their preferences in electronic shopping environments.
In view of the fact that real-world recommendation tools for
online shopping almost inevitably base their product suggestions
on only a subset of the pertinent attributes (see above), we
propose that such selective recommendation agents may
systematically influence consumers' preferences. More
specifically, we predict that the relative importance that shoppers
attach to different product attributes may be influenced by
whether or not a particular attribute is used in the
recommendation algorithm of an electronic agent. Our key
hypothesis is that the inclusion of an attribute in an agent will,
everything else being equal, render this attribute more prominent
when consumers make product choices in digital marketplaces.
We refer to this type of preference-construction effect as an
"inclusion effect." For a more detailed discussion of the
psychological mechanisms that may underlie this effect, see [4].

The proposed inclusion effect was examined in a controlled
experiment using human subjects, in which we gathered empirical
evidence as to the existence of such a preference-construction
effect in connection with product choice in an electronic shopping
environment. In this study, we also investigated the potential
moderating effect of an important property of the marketplace, the
inter-attribute correlation structure across the set of available
products, with respect to any inclusion effect that might exist.

3. METHOD
3.1 Overview
The objective of this experiment was to examine the possibility of
preference construction due to the selective inclusion of attributes
in a recommendation agent. The study was fully computer-based,
and involved a simulated shopping trip in an Internet-based
electronic store equipped with a recommendation agent and the
subsequent completion of an online questionnaire. Subjects were
informed that the purpose of the research was to test a new
electronic shopping environment and its features. Their task was
to shop for a backpacking tent in the Internet-based store and to
complete their simulated shopping trip by selecting from the set of
available tents the one that was the most attractive to them
personally. A total of 347 subjects completed the study remotely,
via a secure Internet site. Participants were randomly assigned to
one of the treatment conditions (see below).

3.2 Selective Inclusion of Attributes in the
Recommendation Agent
The backpacking tents were described within a four-dimensional
attribute space. For the selective inclusion of attributes in the
recommendation agent, the four attributes were divided into two
subsets, block 1 and block 2, each containing two attributes.
Each block contained one attribute that, according to a pilot study,
is of high importance (primary attribute) and one that is of only
moderate importance (secondary attribute). Block 1 included
durability (primary attribute) and fly fabric (secondary attribute),
and block 2 included weight (primary attribute) and warranty
(secondary attribute). The inclusion of attributes was manipulated
by using either attribute block 1 or attribute block 2 in the agent's
calibration interface and sorting algorithm. Through this
counterbalancing, it was possible to manipulate attribute inclusion
in the recommendation agent independently of the characteristics
of the actual attributes, such as their ecological importance.

3.3 Available Alternatives
A total of 16 backpacking tents were available during the
shopping task. These alternatives were hypothetical, but presented
as actual products in the study. Each tent was identified by a
fictitious model name and described in terms of four attributes.
The two primary and the two secondary attributes were varied at
eight and two levels, respectively. In order to allow for a clear and
simple test of the predicted inclusion effect, the available
alternatives were constructed such that subjects' choice of their
most-preferred product would be informative with respect to
which attribute was the most important one to them in making
their decision. Specifically, this was accomplished by combining
(in products) the most attractive level of each primary attribute
with a level of the other primary attribute that is not the most
attractive one. Two alternatives had the best level of durability,
and two others had the most attractive level of weight. No
alternative had the most attractive level of both primary attributes,
and, thus, subjects had to rely relatively more heavily on one of
the two primary attributes when making their choice. Which of the
two primary attributes the selected alternative was superior on was
an indicator of the relative importance of these attributes to a
respondent in making his/her decision. The price of the
backpacking tents was held constant: subjects were informed
that the price of each tent was $249.
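
The construction rule described above can be checked against the product set. The following is a minimal sketch, not the authors' procedure: it copies the durability and weight columns from Table 1 and assumes that a higher durability rating and a lower weight are the more attractive levels. The helper name `best_models` is hypothetical.

```python
# (model name, durability rating, weight in kilograms), copied from Table 1.
TENTS = [
    ("Coyote", 76, 3.4), ("Adventurer", 76, 3.4),
    ("Sunlight", 79, 3.5), ("Grizzly", 79, 3.5),
    ("Oasis", 82, 3.6), ("Solitude", 82, 3.6),
    ("Summit", 85, 3.3), ("Drifter", 85, 3.3),
    ("Challenger", 88, 3.8), ("Serenity", 88, 3.8),
    ("Raven", 91, 3.9), ("Waterfall", 91, 3.9),
    ("Naturalist", 94, 4.0), ("Skyline", 94, 4.0),
    ("Neptune", 97, 3.7), ("Freestyle", 97, 3.7),
]


def best_models(tents, column, higher_is_better):
    """Return the names of the tents holding the most attractive level in one column."""
    levels = [tent[column] for tent in tents]
    target = max(levels) if higher_is_better else min(levels)
    return {tent[0] for tent in tents if tent[column] == target}


# Assumption: higher durability is better, lower weight is better.
best_durability = best_models(TENTS, column=1, higher_is_better=True)
best_weight = best_models(TENTS, column=2, higher_is_better=False)

# No tent is best on both primary attributes, so a choice among these four
# tents shows which primary attribute the subject leaned on more heavily.
print(sorted(best_durability), sorted(best_weight))
print(best_durability.isdisjoint(best_weight))
```

Under these assumptions, two tents hold the best durability level, two others hold the best weight level, and the two sets do not overlap, which is the property the design relies on.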

3.4 Inter-Attribute Correlation
In order to examine whether the predicted inclusion effect might
be moderated by the nature of the marketplace, we systematically
manipulated the correlation structure among the attributes (in
terms of attribute-level utilities) of the available products.
Specifically, two different product spaces (markets) were
created. The first one was characterized by negative inter-attribute
correlation. This marketplace is "efficient" in the sense that no
alternative is clearly superior to another and choosing a product
involves making (potentially very difficult) trade-offs among
attributes. Thus, the setting in which attributes are negatively
correlated resembles a typical real-world market. The second type
of marketplace that we used was characterized by positive
inter-attribute correlation. Such a market is "inefficient" in the sense
that some alternatives are clearly superior to others and choosing
a product involves few, if any, trade-offs among attributes. While
this type of market structure is somewhat atypical, it provides us
with an opportunity to also test the predicted inclusion effect in a
setting that requires little effort on the part of the decision maker.
Subjects were randomly assigned to one of the two treatments of
inter-attribute correlation. Descriptions of the sets of products that
were available in each of the two marketplaces are provided in
Table 1 and Table 2.

3.5 Subjects' Interaction with the
Recommendation Agent
At the beginning of the experiment, subjects read instructions
relating to the task, including an explanation of the
recommendation agent's purpose and functionality. In order to
calibrate the agent, subjects were asked to indicate how important
they personally considered each of the two included attributes to
be on a 100-point rating scale (see Figure 1). Based on these
subjective attribute-importance weights, and using (standardized)
utility scale values for the different levels of each attribute, the
recommendation agent computed a linear-weighted overall utility
score for each available product. It then returned a personalized
list of recommended products in which the alternatives were
sorted by their utility score in descending order (see Figure 2). For
each product, this list contained the model name and the levels of
the two attributes that were included in the agent. From the
recommendation list, subjects were able to request a detailed
description of a particular product (i.e., the levels of all four
attributes) by clicking on a hyperlink. From the screen containing
the detailed description, subjects were able to either return to the
personalized list of recommended products for further search or
proceed to complete their purchase.

3.6 Experimental Design
The experimental design for this study is a 2 (inclusion of
attribute block in recommendation agent) × 2 (inter-attribute
correlation between included and excluded attributes)
between-subjects full factorial, yielding four different treatment conditions.
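
The scoring rule that the agent applies in each treatment condition (Section 3.5: importance ratings combined with standardized attribute-level utilities into a linear-weighted score, then a descending sort) can be sketched as follows. This is an illustration under stated assumptions rather than the agent used in the study: the paper does not say how utilities were standardized, so the sketch uses min-max scaling to the range 0 to 1, treats utility as linear in the attribute level, and normalizes the two importance ratings to sum to one. The names `standardize` and `recommend` are hypothetical.

```python
from typing import Dict, List, Tuple


def standardize(levels: List[float], higher_is_better: bool) -> List[float]:
    """Min-max scale attribute levels to [0, 1]; 1 is the most attractive level.

    Assumption: min-max scaling stands in for the unspecified standardization.
    """
    lo, hi = min(levels), max(levels)
    if hi == lo:
        return [0.0 for _ in levels]
    scaled = [(x - lo) / (hi - lo) for x in levels]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def recommend(
    products: Dict[str, Dict[str, float]],
    importance: Dict[str, float],
    higher_is_better: Dict[str, bool],
) -> List[Tuple[str, float]]:
    """Score each product with a weighted additive rule and sort best-first."""
    total = sum(importance.values())
    if total <= 0:
        raise ValueError("at least one importance rating must be positive")
    names = list(products)
    scores = {name: 0.0 for name in names}
    for attribute, rating in importance.items():
        utilities = standardize(
            [products[name][attribute] for name in names],
            higher_is_better[attribute],
        )
        for name, utility in zip(names, utilities):
            scores[name] += (rating / total) * utility
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Four tents from Table 1, described on the block 2 attributes only.
tents = {
    "Summit": {"weight": 3.3, "warranty": 4},
    "Drifter": {"weight": 3.3, "warranty": 3},
    "Neptune": {"weight": 3.7, "warranty": 4},
    "Naturalist": {"weight": 4.0, "warranty": 4},
}
# Hypothetical ratings on the 100-point importance scale.
ranking = recommend(
    tents,
    importance={"weight": 80, "warranty": 20},
    higher_is_better={"weight": False, "warranty": True},
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because only the included block enters `importance`, the excluded attributes have no influence on the sort order, which is the selectivity the experiment manipulates.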

Table 1: Set of Available Products, Negative Inter-Attribute Correlation

No.  Model Name   Durability Rating  Fly Fabric        Weight (kilograms)  Warranty (years)
1    Coyote       76                 2.3 oz Nylon      3.4                 4
2    Adventurer   76                 1.9 oz Polyester  3.4                 3
3    Sunlight     79                 2.3 oz Nylon      3.5                 4
4    Grizzly      79                 1.9 oz Polyester  3.5                 3
5    Oasis        82                 2.3 oz Nylon      3.6                 4
6    Solitude     82                 1.9 oz Polyester  3.6                 3
7    Summit       85                 2.3 oz Nylon      3.3*                4
8    Drifter      85                 1.9 oz Polyester  3.3*                3
9    Challenger   88                 2.3 oz Nylon      3.8                 4
10   Serenity     88                 1.9 oz Polyester  3.8                 3
11   Raven        91                 2.3 oz Nylon      3.9                 4
12   Waterfall    91                 1.9 oz Polyester  3.9                 3
13   Naturalist   94                 2.3 oz Nylon      4.0                 4
14   Skyline      94                 1.9 oz Polyester  4.0                 3
15   Neptune      97*                2.3 oz Nylon      3.7                 4
16   Freestyle    97*                1.9 oz Polyester  3.7                 3

Note: The most attractive level of each of the two primary attributes (gray shading in the original) is marked with an asterisk.

The two experimental factors were manipulated as follows:

• The inclusion of attributes in the electronic agent was
  implemented by having either attribute block 1 or attribute
  block 2 included in the agent's algorithm for generating
  product recommendations.

• The inter-attribute correlation was manipulated by
  constructing the available alternatives such that the
  correlation (in terms of attribute-level utilities) between the
  primary attribute that was included in the recommendation
  agent and the one that was not was either negative
  (ρ = -0.71) or positive (ρ = +0.71). Descriptions of the sets
  of products for the two treatments of inter-attribute
  correlation are provided in Table 1 and Table 2.
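
The reported correlations of magnitude 0.71 can be reproduced from the two product sets. The sketch below is a check under one assumption not stated in the text: attribute-level utility is taken to be linear in the level, with durability utility rising with the rating and weight utility falling with the weight in kilograms. Each durability level appears twice with the same weight, so the eight distinct pairs give the same coefficient as all 16 rows. The function names are hypothetical.

```python
import math

# Durability levels and the weight paired with each level, from Tables 1 and 2.
DURABILITY = [76, 79, 82, 85, 88, 91, 94, 97]
WEIGHT_NEGATIVE_MARKET = [3.4, 3.5, 3.6, 3.3, 3.8, 3.9, 4.0, 3.7]  # Table 1
WEIGHT_POSITIVE_MARKET = [3.9, 4.0, 3.5, 3.6, 3.7, 3.8, 3.3, 3.4]  # Table 2


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def utility_correlation(durability, weight):
    """Correlate utilities, assuming a lighter tent has higher weight utility."""
    return pearson(durability, [-w for w in weight])


rho_negative = utility_correlation(DURABILITY, WEIGHT_NEGATIVE_MARKET)
rho_positive = utility_correlation(DURABILITY, WEIGHT_POSITIVE_MARKET)
print(round(rho_negative, 2), round(rho_positive, 2))
```

Under the linear-utility assumption, the two markets yield coefficients of equal magnitude and opposite sign, matching the values given in the manipulation description.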

Each of the 347 study participants was randomly assigned to one
of the four treatment conditions implied by the 2 × 2 full factorial
design.

Figure 1: Calibration of Recommendation Agent

4. RESULTS
As a test of the predicted inclusion effect, we examine the relative
choice shares in the shopping task for alternatives that have the
most attractive level of the primary included attribute, i.e., the
primary attribute that was considered by the recommendation
agent. Our directional prediction is that alternatives that are
superior on the included attribute are more likely to be chosen
than ones that are superior on the excluded attribute. The
corresponding null hypothesis is that the extent to which an
attribute drives subjects' choices of products is independent of
whether that attribute was included in the recommendation agent
or not, i.e., that half of the subjects select an alternative that has
the most attractive level of the included attribute and the other
half choose a product that has the most attractive level of the
excluded attribute (when controlling for potential differences in
the ecological importance of the actual attributes through
counterbalancing). A significant departure from such a fifty-fifty
split in choice shares in the predicted direction (i.e., greater
importance of an attribute when it is included in the
recommendation agent) would provide support for the inclusion
effect. Since attribute-specific characteristics were controlled for
through the counterbalancing of the two blocks of attributes, any
such departure would be independent of the relative importance of
the actual attributes used.

Table 2: Set of Available Products, Positive Inter-Attribute Correlation

No.  Model Name   Durability Rating  Fly Fabric        Weight (kilograms)  Warranty (years)
1    Traveler     76                 2.3 oz Nylon      3.9                 4
2    Journey      76                 1.9 oz Polyester  3.9                 3
3    Seabreeze    79                 2.3 oz Nylon      4.0                 4
4    Moonscape    79                 1.9 oz Polyester  4.0                 3
5    Galaxy       82                 2.3 oz Nylon      3.5                 4
6    Lakeside     82                 1.9 oz Polyester  3.5                 3
7    BackTrail    85                 2.3 oz Nylon      3.6                 4
8    Eagle        85                 1.9 oz Polyester  3.6                 3
9    Eclipse      88                 2.3 oz Nylon      3.7                 4
10   Daydream     88                 1.9 oz Polyester  3.7                 3
11   Spirit       91                 2.3 oz Nylon      3.8                 4
12   Westwind     91                 1.9 oz Polyester  3.8                 3
13   Glacier      94                 2.3 oz Nylon      3.3*                4
14   Wanderer     94                 1.9 oz Polyester  3.3*                3
15   Mountain     97*                2.3 oz Nylon      3.4                 4
16   Outfitter    97*                1.9 oz Polyester  3.4                 3

Note: The most attractive level of each of the two primary attributes (gray shading in the original) is marked with an asterisk.

Figure 2: Personalized List of Recommended Products

We find that 60.7 percent of subjects chose an alternative that had
the most desirable level of the primary included attribute and only
39.3 percent (see Figure 3) chose an alternative that had the most
desirable level of the primary excluded attribute. Based on a
binomial test (using equal choice probabilities as the null
hypothesis), this departure from a fifty-fifty split in choice shares
is statistically significant (p < 0.0001) and provides strong support
for the predicted inclusion effect. This demonstrates that,
everything else being equal, the weight that a user attaches to a
particular product attribute when making a purchase decision in
an agent-assisted shopping environment may in fact be enhanced
as a result of the inclusion of that attribute in the recommendation
agent. Thus, the empirical evidence obtained in this experiment
illustrates the potential of electronic agents for online shopping to
influence consumer preferences.
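
The binomial test against a fifty-fifty split can be sketched as an exact computation. The number of choices that entered the test is not stated in this section, so the counts below (182 of 300, about 60.7 percent) are hypothetical and serve only to show the mechanics; the resulting p-value depends on the actual sample size and should not be read as the paper's figure. The function name is also hypothetical.

```python
from math import comb


def binomial_two_sided_p(successes: int, n: int) -> float:
    """Exact two-sided p-value for H0: both outcomes are equally likely.

    With a success probability of 0.5 the distribution is symmetric, so the
    two-sided p-value is twice the tail at least as extreme as the observed
    count, capped at 1.
    """
    extreme = max(successes, n - successes)
    upper_tail = sum(comb(n, k) for k in range(extreme, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)


# Hypothetical counts: 182 of 300 choices favor the included attribute.
p_value = binomial_two_sided_p(182, 300)
print(f"{p_value:.6f}")
```

The same function applies to the per-condition shares reported below (for example, 71 versus 29 percent), once the corresponding counts are known.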

This effect of attribute inclusion in the recommendation agent is
moderated by the correlation between included and excluded
attributes in the set of available alternatives. A strong inclusion
effect was observed when the primary attribute included in the
agent was negatively correlated (in terms of utility) with the
excluded primary attribute. In conditions with negative
inter-attribute correlation, 71 percent of subjects purchased an
alternative that was superior on the primary included attribute and
only 29 percent chose a product that was superior on the primary
attribute that was excluded from the agent. According to a
binomial test, these relative choice shares depart significantly
from the base case of equal choice shares (p < 0.0001). By
contrast, when inter-attribute correlation was positive, we do not
find evidence of such preference construction. The choice shares
for the two types of alternatives (superior on the included versus
on the excluded attribute) are not significantly different from each
other in this case (p > 0.75). A graphical representation of the
moderating effect of inter-attribute correlation with respect to the
inclusion effect, which is highly significant (χ² = 12.234, df = 1,
p < 0.0001), is provided in Figure 4. We observe a strong
agent-induced preference-construction effect when the marketplace is
efficient in the sense that no alternative is clearly superior to
another and choosing a product involves making trade-offs among
attributes (i.e., market with negative inter-attribute correlation),
but not when the marketplace is inefficient (i.e., positive
inter-attribute correlation). Thus, the inclusion of an attribute in an
electronic agent tends to influence user preferences only when
decision making involves some degree of difficulty (e.g., when
decision makers are forced to trade off one feature for another),
and not when arriving at a decision is easy (e.g., when an
attractive level of one attribute tends to be associated with
attractive levels of other attributes and, thus, some products are
clearly superior to others). This suggests that the agent-induced
preference-construction effect may require that a certain level of
cognitive effort on the part of the user be invoked by the shopping
task or decision environment. Finally, it is worth pointing out that,
since most real-world markets tend to be competitive (and
therefore efficient), the setting in which we do find the predicted
inclusion effect is the one that most closely resembles reality.

Figure 3: Attribute Inclusion in the Agent and Choice Shares in Agent-Assisted Shopping
[Bar chart of choice shares: alternative superior on included attribute, 60.7%; alternative superior on excluded attribute, 39.3%.]

Figure 4: Moderating Effect of Inter-Attribute Correlation
[Bar chart of choice shares by inter-attribute correlation. Negative: alternative superior on included attribute, 71.0%; alternative superior on excluded attribute, 29.0%. Positive: 51.5% and 48.5%.]

5. DISCUSSION
Although electronic shopping environments are not subject to the
space constraints of bricks-and-mortar stores, consumers remain
bounded by the familiar cognitive constraints in terms of their
ability to process information. Electronic recommendation agents
can play a key role in reducing the amount of information about
available products that has to be processed by human users,
thereby assisting shoppers in making better decisions with limited
cognitive effort [5]. However, for such a decision aid to be
effective, the consumer must place some confidence in the
product recommendations made by the agent, as well as in the
process by which these recommendations are being generated.
This required level of confidence, or trust, raises the potential for
an electronic agent to not only assist the user in the
decision-making process given his/her subjective preference, but also to
influence this preference. The objective of this paper has been to
investigate the potential of recommendation agents to
systematically influence user preferences.

Our key hypothesis has been that, everything else being equal, the
inclusion of an attribute in a selective recommendation agent
renders this attribute more prominent in consumers' purchase
decisions in an electronic shopping environment. The results of
our controlled experiment provide strong support for the existence
of such an inclusion effect under typical market conditions where
no alternative is clearly superior to another and choosing a
product involves making trade-offs among attributes (i.e.,
negative inter-attribute correlation). Our findings suggest that, in
addition to providing a recommendation, an electronic agent has
the potential, whether intentionally or unintentionally, to
persuade users that certain alternatives are preferable to others.
The research presented here demonstrates that the preferences of
human decision makers can be influenced in a systematic and
predictable manner by merely altering the composition of the set
of product attributes that are included in a recommendation agent
for online shopping. In combination with the results from Häubl
and Murray [4], which demonstrate that the inclusion effect may
persist over time and into settings where an electronic agent is no
longer present, this stream of research illustrates the considerable
potential for systematically manipulating consumer behavior and
consumer preferences in digital marketplaces through the design
of electronic decision aids.

This paper extends the existing body of literature on constructive
consumer preferences by proposing and demonstrating a new type
of preference-construction effect that, given the rapidly increasing
prevalence of electronic decision aids for online shopping, is of
growing importance. In addition, this research also makes a
contribution to the emerging literature on consumer behavior in
the context of electronic commerce, in that it represents a step
towards a more complete understanding of human decision
making in agent-assisted electronic shopping environments.

6. ACKNOWLEDGMENTS
and Jim Bettman for their valuable input. Correspondence should
be addressed to the first author at University of Alberta, School of
Business, Edmonton, Alberta, T6G 2R6, Canada, E-mail:
Gerald.Haeubl@ualberta.ca, Voice: 780-492-6886.

7. REFERENCES

[1]

Bakos, Yannis (1997), Reducing Buyer Search Costs:


Implications for Electronic Marketplaces, Management
Science, 43, 12, 1676-1692.

[2]

Bettman, James R. (1979), An Information Processing


Theory of Consumer Choice, Reading, MA: AddisonWesley.

[3]

Bettman, James R., Mary Frances Luce, and John W. Payne


(1998), Constructive Consumer Choice Processes,
Journal of Consumer Research, 25 (December), 187-217.

[4]

Hubl, Gerald and Kyle B. Murray (2002), Preference


Construction and Persistence in Digital Marketplaces: The
Role of Electronic Recommendation Agents, Journal of
Consumer Psychology, forthcoming.

[5]

Hubl, Gerald and Valerie Trifts (2000), Consumer


Decision Making in Online Shopping Environments: The
Effects of Interactive Decision Aids, Marketing Science,
19, 1, 4-21.

[6]

Johnson, Eric J., Gerald L. Lohse, and Naomi Mandel


(2001), Computer-Based Choice Environments: Four
Approaches to Designing Marketplaces of the Artificial,
working paper, Columbia University.

[7]

Lynch, John G. and Dan Ariely (2000), Wine Online:


Search Costs and Competition on Price, Quality, and
Distribution, Marketing Science, 19, 1, 83-103.

[8]

Payne, John W., James R. Bettman, and Eric J. Johnson


(1992), Behavioral Decision Research: A Constructive
Processing Perspective, Annual Review of Psychology, 43,
87-131.

[9]

Payne, John W., James R. Bettman, and Eric J. Johnson


(1993), The Adaptive Decision Maker, New York:
Cambridge University Press.

[10] Payne, John W., James R. Bettman, and David Schkade


(1999), Measuring Constructed Preferences: Towards a
Building Code, Journal of Risk and Uncertainty, 19, 243271.

[11] Shugan, Steven M. (1980), The Cost of Thinking,


Journal of Consumer Research, 7 (September), 99-111.

6. ACKNOWLEDGMENTS

[12] Simon, Herbert A. (1955), A Behavioral Model of

The authors gratefully acknowledge the support provided by the


Social Sciences and Humanities Research Council of Canada
(SSHRC grant 410-99-0677), the University of Alberta,
Macromedia Inc., and the Institute for Online Consumer Studies
(www.iocs.org). This research also benefited from the Banister
Professorship in Electronic Commerce, the Petro-Canada Young
Innovator Award, and the Southam Faculty Fellowship awarded
to Gerald Hubl, as well as from the Poole PhD Endowment
Fellowship, the University of Alberta School of Business PhD
Fellowship, and the Province of Alberta Graduate Fellowship
awarded to Kyle Murray. The authors wish to thank Terry Elrod

Rational Choice, Quarterly Journal of Economics, 69


(February), 99-118.

[13] Simon, Herbert A. (1990), Invariants of Human Behavior,


Annual Review of Psychology, 41, 1-19.

[14] Slovic, Paul (1995), The Construction of Preference,


American Psychologist, 50 (May), 364-371.


White Paper: The Deep Web: Surfacing Hidden Value
Bergman, Michael K.
Journal of Electronic Publishing
Volume 7, Issue 1: Taking License, August, 2001
DOI: http://dx.doi.org/10.3998/3336451.0007.104

This White Paper is a version of the one on the BrightPlanet [http://www.brightplanet.com] site.
Although it is designed as a marketing tool for a
program "for existing Web portals that need to provide targeted, comprehensive
information to their site visitors," its insight into the structure of the Web makes it
worthwhile reading for all those involved in e-publishing. J.A.T.
Searching on the Internet today can be compared to dragging a net across the surface of the ocean.
While a great deal may be caught in the net, there is still a wealth of information that is deep, and
therefore, missed. The reason is simple: Most of the Web's information is buried far down on
dynamically generated sites, and standard search engines never find it.
Traditional search engines create their indices by spidering or crawling surface Web pages. To be
discovered, the page must be static and linked to other pages. Traditional search engines cannot
"see" or retrieve content in the deep Web; those pages do not exist until they are created
dynamically as the result of a specific search. Because traditional search engine crawlers cannot
probe beneath the surface, the deep Web has heretofore been hidden.
The deep Web is qualitatively different from the surface Web. Deep Web sources store their
content in searchable databases that only produce results dynamically in response to a direct
request. But a direct query is a "one at a time" laborious way to search. BrightPlanet's search
technology automates the process of making dozens of direct queries simultaneously using
multiple-thread technology and thus is the only search technology, so far, that is capable of
identifying, retrieving, qualifying, classifying, and organizing both "deep" and "surface" content.
If the most coveted commodity of the Information Age is indeed information, then the value of
deep Web content is immeasurable. With this in mind, BrightPlanet has quantified the size and
relevancy of the deep Web in a study based on data collected between March 13 and 30, 2000. Our
key findings include:
Public information on the deep Web is currently 400 to 550 times larger than the commonly
defined World Wide Web.
The deep Web contains 7,500 terabytes of information compared to nineteen terabytes of
information in the surface Web.
The deep Web contains nearly 550 billion individual documents compared to the one billion of
the surface Web.
More than 200,000 deep Web sites presently exist.
Sixty of the largest deep-Web sites collectively contain about 750 terabytes of information,
sufficient by themselves to exceed the size of the surface Web forty times.
On average, deep Web sites receive fifty per cent greater monthly traffic than surface sites and
are more highly linked to than surface sites; however, the typical (median) deep Web site is not
well known to the Internet-searching public.

The deep Web is the largest growing category of new information on the Internet.
Deep Web sites tend to be narrower, with deeper content, than conventional surface sites.
Total quality content of the deep Web is 1,000 to 2,000 times greater than that of the surface
Web.
Deep Web content is highly relevant to every information need, market, and domain.
More than half of the deep Web content resides in topic-specific databases.
A full ninety-five per cent of the deep Web is publicly accessible information not subject to fees
or subscriptions.
To put these findings in perspective, a study at the NEC Research Institute [1] [#fn1] , published in
Nature, estimated that the search engines with the largest number of Web pages indexed (such as
Google or Northern Light) each index no more than sixteen per cent of the surface Web. Since they
are missing the deep Web when they use such search engines, Internet searchers are therefore
searching only 0.03% (one in 3,000) of the pages available to them today. Clearly,
simultaneous searching of multiple surface and deep Web sources is necessary when
comprehensive information retrieval is needed.

The Deep Web


Internet content is considerably more diverse and the volume certainly much larger than
commonly understood.
First, though sometimes used synonymously, the World Wide Web (HTTP protocol) is but a subset
of Internet content. Other Internet protocols besides the Web include FTP (file transfer protocol),
e-mail, news, Telnet, and Gopher (most prominent among pre-Web protocols). This paper does not
consider further these non-Web protocols. [2] [#fn2]
Second, even within the strict context of the Web, most users are aware only of the content
presented to them via search engines such as Excite [http://www.excite.com] , Google
[http://www.google.com/] , AltaVista [http://www.altavista.com/] , or Northern Light
[http://www.northernlight.com/] , or search directories such as Yahoo! [http://www.yahoo.com/] ,
About.com [http://www.about.com/] , or LookSmart [http://www.looksmart.com/] . Eighty-five percent
of Web users use search engines to find needed information, but nearly as high a percentage cite
the inability to find desired information as one of their biggest frustrations. [3] [#fn3] According to a
recent survey of search-engine satisfaction by market-researcher NPD, [http://www.npd.com] search
failure rates have increased steadily since 1997. [4a] [#fn4]
The importance of information gathering on the Web and the central and unquestioned role of
search engines, plus the frustrations expressed by users about the adequacy of these engines,
make them an obvious focus of investigation.
Until Van Leeuwenhoek first looked at a drop of water under a microscope in the late 1600s,
people had no idea there was a whole world of "animalcules" beyond their vision. Deep-sea
exploration in the past thirty years has turned up hundreds of strange creatures that challenge old
ideas about the origins of life and where it can exist. Discovery comes from looking at the world in
new ways and with new tools. The genesis of the BrightPlanet study was to look afresh at the nature
of information on the Web and how it is being identified and organized.

How Search Engines Work


Search engines obtain their listings in two ways: Authors may submit their own Web pages, or the
search engines "crawl" or "spider" documents by following one hypertext link to another. The latter
returns the bulk of the listings. Crawlers work by recording every hypertext link in every page they
index while crawling. Like ripples propagating across a pond, search-engine crawlers are able to extend
their indices further and further from their starting points.
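The link-following behavior described above can be sketched as a breadth-first traversal. This is a minimal illustration only, not any search engine's actual crawler; the toy link graph and the names `crawl` and `toy_web` are hypothetical. It shows why a page that nothing links to is never reached.

```python
from collections import deque

def crawl(link_graph, seeds):
    """Breadth-first crawl: return pages in the order they are discovered."""
    discovered = list(seeds)
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        # Record every outgoing link on the page and follow the unseen ones.
        for target in link_graph.get(page, []):
            if target not in seen:
                seen.add(target)
                discovered.append(target)
                queue.append(target)
    return discovered

# Hypothetical toy Web: keys are pages, values are the pages they link to.
toy_web = {
    "home": ["news", "about"],
    "news": ["story"],
    "about": [],
    "story": ["home"],
    "orphan": ["home"],  # links out, but no other page links to it
}

indexed = crawl(toy_web, ["home"])
```

Starting from "home", the crawl reaches every linked page but never sees "orphan", because discovery depends entirely on inbound links.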

The surface Web contains an estimated 2.5 billion documents, growing at a rate of 7.5 million
documents per day. [5a] [#fn5] The largest search engines have done an impressive job in extending
their reach, though Web growth itself has exceeded the crawling ability of search engines.
[6a] [#fn6] [7a] [#fn7] Today, the three largest search engines in terms of internally reported
documents indexed are Google with 1.35 billion documents (500 million available to most
searches), [8] [#fn8] Fast [http://www.alltheweb.com] with 575 million documents, [9] [#fn9] and
Northern Light with 327 million documents. [10] [#fn10]
Legitimate criticism has been leveled against search engines for these indiscriminate crawls,
mostly because they provide too many results (search on "Web," for example, with Northern Light,
and you will get about 47 million hits). Also, because new documents are found from links within
other documents, those documents that are cited are more likely to be indexed than new
documents, up to eight times as likely. [5b] [#fn5]
To overcome these limitations, the most recent generation of search engines (notably Google) has
replaced the random link-following approach with directed crawling and indexing based on the
"popularity" of pages. In this approach, documents more frequently cross-referenced than other
documents are given priority both for crawling and in the presentation of results. This approach
provides superior results when simple queries are issued, but exacerbates the tendency to overlook
documents with few links. [5c] [#fn5]
And, of course, once a search engine needs to update literally millions of existing Web pages, the
freshness of its results suffers. Numerous commentators have noted the increased delay in posting
and recording new information on conventional search engines. [11a] [#fn11] Our own empirical tests of
search engine currency suggest that listings are frequently three or four months or more out of
date.
Moreover, return to the premise of how a search engine obtains its listings in the first place,
whether adjusted for popularity or not. That is, without a linkage from another Web document, the
page will never be discovered. But the main failing of search engines is that they depend on the
Web's linkages to identify what is on the Web.
Figure 1 is a graphical representation of the limitations of the typical search engine. The content
identified is only what appears on the surface and the harvest is fairly indiscriminate. There is
tremendous value that resides deeper than this surface content. The information is there, but it is
hiding beneath the surface of the Web.

Figure 1. Search Engines: Dragging a Net Across the Web's Surface

Searchable Databases: Hidden Value on the Web


How does information appear and get presented on the Web? In the earliest days of the Web, there
were relatively few documents and sites. It was a manageable task to post all documents as static
pages. Because all pages were persistent and constantly available, they could be crawled easily by
conventional search engines. In July 1994, the Lycos search engine went public with a catalog of
54,000 documents. [12] [#fn12] Since then, the compound growth rate in Web documents has been on
the order of more than 200% annually! [13a] [#fn13]
Sites that were required to manage tens to hundreds of documents could easily do so by posting
fixed HTML pages within a static directory structure. However, beginning about 1996, three
phenomena took place. First, database technology was introduced to the Internet through such
vendors as Bluestone's Sapphire/Web (Bluestone [http://www.bluestone.com] has since been bought
by HP) and later Oracle. [http://www.oracle.com/] Second, the Web became commercialized initially
via directories and search engines, but rapidly evolved to include e-commerce. And, third, Web
servers were adapted to allow the "dynamic" serving of Web pages (for example, Microsoft's ASP
and the Unix PHP technologies).
This confluence produced a true database orientation for the Web, particularly for larger sites. It is
now accepted practice that large data producers such as the U.S. Census Bureau
[http://www.census.gov] , Securities and Exchange Commission [http://www.sec.gov] , and Patent and
Trademark Office [http://www.uspto.gov] , not to mention whole new classes of Internet-based
companies, choose the Web as their preferred medium for commerce and information transfer.
What has not been broadly appreciated, however, is that the means by which these entities provide
their information is no longer through static pages but through database-driven designs.
It has been said that what cannot be seen cannot be defined, and what is not defined cannot be
understood. Such has been the case with the importance of databases to the information content of
the Web. And such has been the case with a lack of appreciation for how the older model of
crawling static Web pages (today's paradigm for conventional search engines) no longer applies
to the information content of the Internet.
In 1994, Dr. Jill Ellsworth first coined the phrase "invisible Web" to refer to information content
that was "invisible" to conventional search engines. [14] [#fn14] The potential importance of searchable
databases was also reflected in the first search site devoted to them, the AT1 engine that was
announced with much fanfare in early 1997. [15] [#fn15] However, PLS, AT1's owner, was acquired by
AOL in 1998, and soon thereafter the AT1 service was abandoned.
For this study, we have avoided the term "invisible Web" because it is inaccurate. The only thing
"invisible" about searchable databases is that they can be neither indexed nor queried by
conventional search engines. Using BrightPlanet technology, they are totally "visible" to those who
need to access them.
Figure 2 represents, in a non-scientific way, the improved results that can be obtained by
BrightPlanet technology. By first identifying where the proper searchable databases reside, a
directed query can then be placed to each of these sources simultaneously to harvest only the
results desired with pinpoint accuracy.

Figure 2. Harvesting the Deep and Surface Web with a Directed Query Engine

Additional aspects of this representation will be discussed throughout this study. For the moment,
however, the key points are that content in the deep Web is massive (approximately 500 times
greater than that visible to conventional search engines), with much higher quality throughout.
BrightPlanet's technology is uniquely suited to tap the deep Web and bring its results to the
surface. The simplest way to describe our technology is a "directed-query engine." It has other
powerful features in results qualification and classification, but it is this ability to query multiple
search sites directly and simultaneously that allows deep Web content to be retrieved.
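The idea of placing one direct query to several searchable sources at the same time can be sketched with a thread pool. This is not BrightPlanet's implementation, which the paper does not describe; the in-memory `SOURCES` dictionary stands in for real searchable databases, and `query_source` and `directed_query` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for searchable databases: record id -> record text.
SOURCES = {
    "patents": {1: "solar cell patent", 2: "wind turbine patent"},
    "filings": {1: "annual report on solar revenue"},
    "library": {1: "history of wind power", 2: "solar energy handbook"},
}

def query_source(name, term):
    """Send one direct query to one source and return its matching records."""
    records = SOURCES[name]
    return [(name, rid, text) for rid, text in sorted(records.items()) if term in text]

def directed_query(term, source_names):
    """Query every source concurrently, one worker thread per source."""
    with ThreadPoolExecutor(max_workers=len(source_names)) as pool:
        per_source = pool.map(lambda name: query_source(name, term), source_names)
        # pool.map yields results in the same order as source_names.
        return [hit for hits in per_source for hit in hits]

hits = directed_query("solar", ["patents", "filings", "library"])
```

With real network sources, the threads would spend most of their time waiting on responses, which is where issuing the queries simultaneously rather than "one at a time" saves effort.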

Study Objectives
To perform the study discussed, we used our technology in an iterative process. Our goal was to:
Quantify the size and importance of the deep Web.
Characterize the deep Web's content, quality, and relevance to information seekers.
Discover automated means for identifying deep Web search sites and directing queries to them.
Begin the process of educating the Internet-searching public about this heretofore hidden and
valuable information storehouse.
Like any newly discovered phenomenon, the deep Web is just being defined and understood. Daily,
as we have continued our investigations, we have been amazed at the massive scale and rich
content of the deep Web. This white paper concludes with requests for additional insights and
information that will enable us to continue to better understand the deep Web.

What Has Not Been Analyzed or Included in Results


This paper does not investigate non-Web sources of Internet content. This study also purposely
ignores private intranet information hidden behind firewalls. Many large companies have internal
document stores that exceed terabytes of information. Since access to this information is restricted,
its scale cannot be defined, nor can it be characterized. Also, while on average 44% of the
"contents" of a typical Web document reside in HTML and other coded information (for example,
XML or Javascript), [16] [#fn16] this study does not evaluate specific information within that code. We
do, however, include those codes in our quantification of total content (see next section).
Finally, the estimates for the size of the deep Web include neither specialized search engine sources
(which may be partially "hidden" to the major traditional search engines) nor the contents of
major search engines themselves. This latter category is significant. Simply accounting for the
three largest search engines and average Web document sizes suggests search-engine contents
alone may equal 25 terabytes or more, [17] [#fn17] somewhat larger than the known size of the
surface Web.

A Common Denominator for Size Comparisons


All deep-Web and surface-Web size figures use both total number of documents (or database
records in the case of the deep Web) and total data storage. Data storage is based on "HTML
included" Web-document size estimates. [13b] [#fn13] This basis includes all HTML and related code
information plus standard text content, exclusive of embedded images and standard HTTP
"header" information. Use of this standard convention allows apples-to-apples size comparisons
between the surface and deep Web. The HTML-included convention was chosen because:
Most standard search engines that report document sizes do so on this same basis.
When saving documents or Web pages directly from a browser, the file size byte count uses this
convention.
BrightPlanet's technology reports document sizes on this same basis.
All document sizes used in the comparisons use actual byte counts (1024 bytes per kilobyte).
In actuality, data storage from deep-Web documents will therefore be considerably less than the
figures reported. [18] [#fn18] Actual records retrieved from a searchable database are forwarded to a
dynamic Web page template that can include items such as standard headers and footers, ads, etc.
While including this HTML code content overstates the size of searchable databases, standard
"static" information on the surface Web is presented in the same manner.
HTML-included Web page comparisons provide the common denominator for comparing deep
and surface Web sources.

Use and Role of BrightPlanet Technology


All retrievals, aggregations, and document characterizations in this study used BrightPlanet's
technology. The technology uses multiple threads for simultaneous source queries and then
document downloads. It completely indexes all documents retrieved (including HTML content).
After being downloaded and indexed, the documents are scored for relevance using four different
scoring algorithms, prominently vector space modeling (VSM) and standard and modified
extended Boolean information retrieval (EBIR). [19] [#fn19]
Automated deep Web search-site identification and qualification also used a modified version of
the technology employing proprietary content and HTML evaluation methods.
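As a rough illustration of vector space modeling, one of the scoring approaches named above, the sketch below ranks documents by the cosine similarity between term-frequency vectors. The paper does not give the actual scoring algorithms, so this is a textbook form under simple assumptions (whitespace tokenization, raw term counts); `term_vector` and `cosine_score` are hypothetical names.

```python
import math
from collections import Counter

def term_vector(text):
    """Represent a text as a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine_score(query, document):
    """Cosine of the angle between the query and document term vectors."""
    q, d = term_vector(query), term_vector(document)
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

# Hypothetical documents to rank against the query "deep web".
docs = [
    "deep web databases",
    "surface web pages and search engines",
    "mortgage calculators",
]
ranked = sorted(docs, key=lambda doc: cosine_score("deep web", doc), reverse=True)
```

Documents sharing more of the query's terms, relative to their length, score closer to 1; documents sharing none score 0.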

Surface Web Baseline


The most authoritative studies to date of the size of the surface Web have come from Lawrence and
Giles of the NEC Research Institute in Princeton, NJ. Their analyses are based on what they term
the "publicly indexable" Web. Their first major study, published in Science magazine in 1998, using
analysis from December 1997, estimated the total size of the surface Web as 320 million
documents. [4b] [#fn4] An update to their study employing a different methodology was published in
Nature magazine in 1999, using analysis from February 1999. [5d] [#fn5] This study documented 800
million documents within the publicly indexable Web, with a mean page size of 18.7 kilobytes
exclusive of images and HTTP headers. [20] [#fn20]

In partnership with Inktomi, NEC updated its Web page estimates to one billion documents in
early 2000. [21] [#fn21] We have taken this most recent size estimate and updated total document
storage for the entire surface Web based on the 1999 Nature study:
Table 1. Baseline Surface Web Size Assumptions
Total No. of Documents    Content Size (GBs) (HTML basis)
1,000,000,000             18,700

These are the baseline figures used for the size of the surface Web in this paper. (A more recent
study from Cyveillance [5e] [#fn5] has estimated the total surface Web size to be 2.5 billion documents,
growing at a rate of 7.5 million documents per day. This is likely a more accurate number, but the
NEC estimates are still used because they were based on data gathered closer to the dates of our
own analysis.)
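The content size in Table 1 follows from the document count and the mean page size reported in the 1999 Nature study. The sketch below reproduces that arithmetic; the decimal conversion of one million kilobytes per gigabyte is an assumption, chosen because it matches the 18,700 GB figure, and `content_size_gb` is a hypothetical helper name.

```python
def content_size_gb(num_documents, mean_page_kb, kb_per_gb=1_000_000):
    """Total storage in GB from a document count and a mean page size in KB.

    kb_per_gb defaults to a decimal conversion, which is an assumption.
    """
    return num_documents * mean_page_kb / kb_per_gb

# One billion documents at a mean of 18.7 KB each (figures from the text).
surface_gb = content_size_gb(1_000_000_000, 18.7)
```

The same helper applies to any pool of documents once a count and a mean HTML-included size are available.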
Other key findings from the NEC studies that bear on this paper include:
Surface Web coverage by individual, major search engines has dropped from a maximum of 32%
in 1998 to 16% in 1999, with Northern Light showing the largest coverage.
Metasearching using multiple search engines can improve retrieval coverage by a factor of 3.5 or
so, though combined coverage from the major engines dropped to 42% from 1998 to 1999.
More popular Web documents, that is, those with many link references from other documents,
have up to an eight-fold greater chance of being indexed by a search engine than those with no
link references.

Analysis of Largest Deep Web Sites


More than 100 individual deep Web sites were characterized to produce the listing of sixty sites
reported in the next section.
Site characterization required three steps:
1. Estimating the total number of records or documents contained on that site.
2. Retrieving a random sample of a minimum of ten results from each site and then computing
the expressed HTML-included mean document size in bytes. This figure, times the number of
total site records, produces the total site size estimate in bytes.
3. Indexing and characterizing the search-page form on the site to determine subject coverage.
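Step 2 above amounts to multiplying a sampled mean document size by the site's total record count. A minimal sketch follows, with made-up sample sizes and record count; `estimate_site_size_bytes` is a hypothetical name, and the minimum of ten samples is taken from the text.

```python
from statistics import mean

def estimate_site_size_bytes(sample_doc_sizes, total_records, min_sample=10):
    """Mean HTML-included document size times the site's total record count."""
    if len(sample_doc_sizes) < min_sample:
        raise ValueError(f"need at least {min_sample} sampled documents")
    return mean(sample_doc_sizes) * total_records

# Hypothetical byte counts for ten randomly sampled result documents.
sample = [12_000, 15_500, 9_800, 14_200, 13_100, 11_900, 16_400, 10_700, 12_600, 13_800]
site_bytes = estimate_site_size_bytes(sample, total_records=250_000)
```

Because the result scales linearly with the record count, errors in that count carry straight through to the size estimate, which is why the record-count tests matter.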
Estimating total record count per site was often not straightforward. A series of tests was applied to
each site; the tests are listed in descending order of importance and confidence in deriving the total
document count:
1. E-mail messages were sent to the webmasters or contacts listed for all sites identified,
requesting verification of total record counts and storage sizes (uncompressed basis); about
13% of the sites shown in Table 2 provided direct documentation in response to this request.
2. Total record counts as reported by the site itself. This involved inspecting related pages on
the site, including help sections, site FAQs, etc.
3. Documented site sizes presented at conferences, estimated by others, etc. This step involved
comprehensive Web searching to identify reference sources.
4. Record counts as provided by the site's own search function. Some site searches provide total
record counts for all queries submitted. For others that use the NOT operator and allow its
stand-alone use, a query term known not to occur on the site such as "NOT ddfhrwxxct" was
issued. This approach returns an absolute total record count. Failing these two options, a
broad query was issued that would capture the general site content; this number was then
corrected for an empirically determined "coverage factor," generally in the 1.2 to 1.4 range.
[22] [#fn22]
5. A site that failed all of these tests could not be measured and was dropped from the results
listing.
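The ordered tests above behave like a fallback chain: use the most trusted figure available, and drop the site if none exists. The sketch below is one possible encoding, not the study's procedure in code. The `evidence` dictionary and its keys are hypothetical, the default coverage factor of 1.3 is simply the midpoint of the stated range, and treating the correction as a multiplication is an assumption.

```python
def estimate_record_count(evidence, coverage_factor=1.3):
    """Return (count, method) using the most trusted evidence available.

    `evidence` is a hypothetical dict; a missing or None entry means that
    test produced no figure for the site.
    """
    # Tests 1-3 and the search-function totals, in descending order of confidence.
    for method in ("webmaster_reply", "site_reported", "published_estimate",
                   "search_total", "not_query_total"):
        count = evidence.get(method)
        if count is not None:
            return count, method
    # Last resort: a broad query, scaled up by the empirical coverage factor
    # (assumed here to be a multiplier).
    broad = evidence.get("broad_query_hits")
    if broad is not None:
        return round(broad * coverage_factor), "broad_query"
    # Test 5: the site cannot be measured and is dropped.
    return None, "dropped"
```

For example, a site with only a broad-query hit count of 1,000 would be estimated at 1,300 records under these assumptions, while a site with no usable evidence returns `(None, "dropped")`.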

Analysis of Standard Deep Web Sites


Analysis and characterization of the entire deep Web involved a number of discrete tasks:
Qualification as a deep Web site.
Estimation of total number of deep Web sites.
Size analysis.
Content and coverage analysis.
Site page views and link references.
Growth analysis.
Quality analysis.
The methods applied to these tasks are discussed separately below.

Deep Web Site Qualification


An initial pool of 53,220 possible deep Web candidate URLs was identified from existing
compilations at seven major sites and three minor ones. [23] [#fn23] After harvesting, this pool
resulted in 45,732 actual unique listings after tests for duplicates. Cursory inspection indicated that
in some cases the subject page was one link removed from the actual search form. Criteria were
developed to predict when this might be the case. The BrightPlanet technology was used to retrieve
the complete pages and fully index them for both the initial unique sources and the one-link
removed sources. A total of 43,348 resulting URLs were actually retrieved.
We then applied filter criteria to these sites to determine if they were indeed search sites. This
proprietary filter involved inspecting the HTML content of the pages, plus analysis of page text
content. This brought the total pool of deep Web candidates down to 17,579 URLs.
Subsequent hand inspection of 700 random sites from this listing identified further filter criteria.
Ninety-five of these 700, or 13.6%, did not fully qualify as search sites. This correction has been
applied to the entire candidate pool and the results presented.
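The correction described above can be illustrated by scaling the candidate pool by the share of hand-inspected sites that qualified. The paper does not state the resulting pool size, so the sketch shows only one straightforward reading of "applied to the entire candidate pool"; `corrected_pool` is a hypothetical name.

```python
def corrected_pool(candidates, inspected, disqualified):
    """Scale a candidate pool by the qualification rate seen in a hand-inspected sample."""
    qualified_fraction = (inspected - disqualified) / inspected
    return candidates * qualified_fraction

# Figures from the text: 17,579 candidates; 95 of 700 inspected sites did not qualify.
estimate = corrected_pool(17_579, 700, 95)
disqualified_rate = 95 / 700
```

Under this reading the qualified pool is on the order of 15,000 sites.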
Some of the criteria developed when hand-testing the 700 sites were then incorporated back into
an automated test within the BrightPlanet technology for qualifying search sites with what we
believe is 98% accuracy. Additionally, automated means for discovering further search sites have
been incorporated into our internal version of the technology based on what we learned.

Estimation of Total Number of Sites


The basic technique for estimating total deep Web sites uses "overlap" analysis, the accepted
technique chosen for two of the more prominent surface Web size analyses. [6b] [#fn6] [24] [#fn24] We
used overlap analysis based on search engine coverage and the deep Web compilation sites noted
above (see results in Table 3 through Table 5).
The technique is illustrated in the diagram below:

Figure 3. Schematic Representation of "Overlap" Analysis

Overlap analysis involves pairwise comparisons of the number of listings individually within two
sources, $n_a$ and $n_b$, and the degree of shared listings or overlap, $n_0$, between them. Assuming
random listings for both $n_a$ and $n_b$, the total size of the population, $N$, can be estimated. The
estimate of the fraction of the total population covered by $n_a$ is $n_0/n_b$; dividing this fraction
into the total size of $n_a$ gives an estimate of the total population size, $N = n_a / (n_0/n_b)$.
These pairwise estimates are repeated for all of the individual sources used in the analysis.
To illustrate this technique, assume, for example, we know our total population is 100. Then if two
sources, A and B, each contain 50 items, we could predict on average that 25 of those items would
be shared by the two sources and 25 items would not be listed by either. According to the formula
above, this can be represented as: 100 = 50 / (25/50)
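The worked example can be checked with a few lines of code. The sketch below implements the pairwise estimate exactly as described in the text (the size of one source divided by the fraction of the other source that it shares); the function names are hypothetical.

```python
def overlap_estimate(n_a, n_b, n_0):
    """Estimate total population N from two listing sizes and their overlap."""
    if n_0 == 0:
        raise ValueError("no shared listings: the population size cannot be estimated")
    fraction_covered_by_a = n_0 / n_b  # share of the population that source A lists
    return n_a / fraction_covered_by_a

def overlap_from_listings(listing_a, listing_b):
    """Same estimate, computed directly from the two listings."""
    a, b = set(listing_a), set(listing_b)
    return overlap_estimate(len(a), len(b), len(a & b))

# The example from the text: two sources of 50 items sharing 25.
population = overlap_estimate(n_a=50, n_b=50, n_0=25)
```

A larger overlap for the same source sizes yields a smaller population estimate, which is why any factor that inflates the overlap pushes the estimate downward.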
There are two keys to overlap analysis. First, it is important to have a relatively accurate estimate
for total listing size for at least one of the two sources in the pairwise comparison. Second, both
sources should obtain their listings randomly and independently from one another.
This second premise is in fact violated for our deep Web source analysis. Compilation sites are
purposeful in collecting their listings, so their sampling is directed. And, for search engine listings,
searchable databases are more frequently linked to because of their information value, which
increases their relative prevalence within the engine listings. [5f] [#fn5] Thus, the overlap analysis
represents a lower bound on the size of the deep Web since both of these factors will tend to
increase the degree of overlap, $n_0$, reported between the pairwise sources.

Deep Web Size Analysis


In order to analyze the total size of the deep Web, we need an average site size in documents and
data storage to use as a multiplier applied to the entire population estimate. Results are shown in
Figure 4 and Figure 5.
As discussed for the large site analysis, obtaining this information is not straightforward and
involves considerable time evaluating each site. To keep estimation time manageable, we chose a
+/- 10% confidence interval at the 95% confidence level, requiring a total of 100 random sites to be
fully characterized. [25a] [#fn25]
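The link between a +/- 10% interval at 95% confidence and a sample of about 100 sites can be illustrated with the usual normal-approximation formula for a proportion. The paper's own derivation is in its footnote and is not reproduced here, so the formula, the worst-case proportion of 0.5, and the name `required_sample_size` are assumptions.

```python
import math

def required_sample_size(margin, z=1.96, p=0.5):
    """Sample size for estimating a proportion to within +/- margin.

    z is the normal critical value (1.96 for 95% confidence);
    p = 0.5 is the worst-case proportion, assumed here.
    """
    return math.ceil(z * z * p * (1 - p) / (margin * margin))

n_sites = required_sample_size(0.10)
```

Under these assumptions the formula gives 97 sites, in line with the 100 sites the study characterized. Halving the margin to 5% would require roughly four times as many.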
We randomized our listing of 17,000 search site candidates. We then proceeded to work through
this list until 100 sites were fully characterized. To determine the total record or document count
for each site, we followed a less intensive process than the one used for the large-sites analysis.

Exactly 700 sites were inspected in their randomized order to obtain the 100 fully characterized
sites. All sites inspected received characterization as to site type and coverage; this information was
used in other parts of the analysis.
The 100 sites that could have their total record/document count determined were then sampled
for average document size (HTML-included basis). Random queries were issued to the searchable
database with results reported as HTML pages. A minimum of ten of these were generated, saved
to disk, and then averaged to determine the mean site page size. In a few cases, such as
bibliographic databases, multiple records were reported on a single HTML page. In these
instances, three total query results pages were generated, saved to disk, and then averaged based
on the total number of records reported on those three pages.

Content Coverage and Type Analysis


Content coverage was analyzed across all 17,000 search sites in the qualified deep Web pool
(results shown in Table 6); the type of deep Web site was determined from the 700
hand-characterized sites (results shown in Figure 6).
Broad content coverage for the entire pool was determined by issuing queries for twenty top-level
domains against the entire pool. Because of topic overlaps, total occurrences exceeded the number
of sites in the pool; this total was used to adjust all categories back to a 100% basis.
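The adjustment back to a 100% basis amounts to dividing each category's hit count by the sum of all hits rather than by the number of sites. A minimal sketch, with hypothetical hit counts and an illustrative function name:

```python
def normalize_to_percent(counts: dict[str, int]) -> dict[str, float]:
    """Rescale raw category hits so that the shares sum to 100%."""
    # The total exceeds the number of sites when one site matches several topics.
    total = sum(counts.values())
    return {topic: 100.0 * hits / total for topic, hits in counts.items()}

# Hypothetical hit counts; a single site may be counted under several topics.
raw_hits = {"Agriculture": 60, "Health": 120, "Law/Politics": 90, "Humanities": 330}
shares = normalize_to_percent(raw_hits)
print(shares)
```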
Hand characterization by search-database type resulted in assigning each site to one of twelve
arbitrary categories that captured the diversity of database types. These twelve categories are:
1. Topic Databases: subject-specific aggregations of information, such as SEC corporate
filings, medical databases, patent records, etc.
2. Internal site: searchable databases for the internal pages of large sites that are dynamically
created, such as the knowledge base on the Microsoft site.
3. Publications: searchable databases for current and archived articles.
4. Shopping/Auction.
5. Classifieds.
6. Portals: broader sites that included more than one of these other categories in searchable
databases.
7. Library: searchable internal holdings, mostly for university libraries.
8. Yellow and White Pages: people and business finders.
9. Calculators: while not strictly databases, many do include an internal data component for
calculating results. Mortgage calculators, dictionary look-ups, and translators between
languages are examples.
10. Jobs: job and resume postings.
11. Message or Chat.
12. General Search: searchable databases most often relevant to Internet search topics and
information.
These 700 sites were also characterized as to whether they were public or subject to subscription or
fee access.

Site Pageviews and Link References


Netscape's "What's Related" browser option, a service from Alexa, provides site popularity
rankings and link reference counts for a given URL. [26a] [#fn26] About 71% of deep Web sites have
such rankings. The universal power function (a logarithmic growth rate or logarithmic
distribution) allows pageviews per month to be extrapolated from the Alexa popularity rankings.
[27] [#fn27] The "What's Related" report also shows external link counts to the given URL.
Random samples of 100 deep Web sites and 100 surface Web sites for which complete "What's
Related" reports could be obtained were used for the comparisons.

Growth Analysis
The best method for measuring growth is with time-series analysis. However, since the discovery of
the deep Web is so new, a different gauge was necessary.
Whois [http://www.whois.net] [28] [#fn28] searches associated with domain-registration services [25b]
[#fn25] return records listing domain owner, as well as the date the domain was first obtained (and
other information). Using a random sample of 100 deep Web sites [26b] [#fn26] and another sample of
100 surface Web sites [29] [#fn29] we issued the domain names to a Whois search and retrieved the
date the site was first established. These results were then combined and plotted for the deep vs.
surface Web samples.

Quality Analysis
Quality comparisons between the deep and surface Web content were based on five diverse,
subject-specific queries issued via the BrightPlanet technology to three search engines (AltaVista,
Fast, Northern Light) [30] [#fn30] and three deep sites specific to that topic and included in the 600
sites presently configured for our technology. The five subject areas were agriculture, medicine,
finance/business, science, and law.
The queries were specifically designed to limit total results returned from any of the six sources to a
maximum of 200 to ensure complete retrieval from each source. [31] [#fn31] The specific technology
configuration settings are documented in the endnotes. [32] [#fn32]
The "quality" determination was based on an average of our technology's VSM and mEBIR
computational linguistic scoring methods. [33] [#fn33] [34] [#fn34] The "quality" threshold was set at our
score of 82, empirically determined as roughly accurate from millions of previous scores of surface
Web documents.
Deep Web vs. surface Web scores were obtained by using the BrightPlanet technology's selection by
source option and then counting total documents and documents above the quality scoring
threshold.

Results and Discussion


This study is the first known quantification and characterization of the deep Web. Very little has
been written or known of the deep Web. Estimates of size and importance have been anecdotal at
best and certainly underestimate scale. For example, Intelliseek's "invisible Web" says that, "In our
best estimates today, the valuable content housed within these databases and searchable sources is
far bigger than the 800 million plus pages of the 'Visible Web.'" They also estimate total deep Web
sources at about 50,000 or so. [35] [#fn35]
Ken Wiseman, who has written one of the most accessible discussions about the deep Web,
intimates that it might be about equal in size to the known Web. He also goes on to say, "I can
safely predict that the invisible portion of the Web will continue to grow exponentially before the
tools to uncover the hidden Web are ready for general use." [36] [#fn36] A mid-1999 survey by
About.com's Web search guide concluded the size of the deep Web was "big and getting bigger." [37]
[#fn37] A paper at a recent library science meeting suggested that only "a relatively small fraction of
the Web is accessible through search engines." [38] [#fn38]
The deep Web is about 500 times larger than the surface Web, with, on average, about three times
higher quality based on our document scoring methods on a per-document basis. On an absolute
basis, total deep Web quality exceeds that of the surface Web by thousands of times. Total number
of deep Web sites likely exceeds 200,000 today and is growing rapidly. [39] [#fn39] Content on the
deep Web has meaning and importance for every information seeker and market. More than 95%
of deep Web information is publicly available without restriction. The deep Web also appears to be
the fastest growing information component of the Web.

General Deep Web Characteristics


Deep Web content has some significant differences from surface Web content. Deep Web
documents (13.7 KB mean size; 19.7 KB median size) are on average 27% smaller than surface Web
documents. Though individual deep Web sites have tremendous diversity in their number of
records, ranging from tens or hundreds to hundreds of millions (a mean of 5.43 million records per
site but with a median of only 4,950 records), these sites are on average much, much larger than
surface sites. The rest of this paper will serve to amplify these findings.
The mean deep Web site has a Web-expressed (HTML-included basis) database size of 74.4 MB
(median of 169 KB). Actual record counts and size estimates can be derived from one-in-seven
deep Web sites.
On average, deep Web sites receive about half again as much monthly traffic as surface sites
(123,000 pageviews per month vs. 85,000). The median deep Web site receives somewhat more
than two times the traffic of a random surface Web site (843,000 monthly pageviews vs. 365,000).
Deep Web sites on average are more highly linked to than surface sites by nearly a factor of two
(6,200 links vs. 3,700 links), though the median deep Web site is less so (66 vs. 83 links). This
suggests that well-known deep Web sites are highly popular, but that the typical deep Web site is
not well known to the Internet search public.
One of the more counter-intuitive results is that 97.4% of deep Web sites are publicly available
without restriction; a further 1.6% are mixed (limited results publicly available with greater results
requiring subscription and/or paid fees); only 1.1% of results are totally subscription or fee limited.
This result is counter-intuitive because of the visible prominence of subscriber-limited sites such as
Dialog, Lexis-Nexis, Wall Street Journal Interactive, etc. (We got the document counts from the
sites themselves or from other published sources.)
However, once the broader pool of deep Web sites is looked at beyond the large, visible, fee-based
ones, public availability dominates.

60 Deep Sites Already Exceed the Surface Web by 40 Times
Table 2 indicates that the sixty known, largest deep Web sites contain data of about 750 terabytes
(HTML-included basis) or roughly forty times the size of the known surface Web. These sites
appear in a broad array of domains from science to law to images and commerce. We estimate the
total number of records or documents within this group to be about eighty-five billion.
Roughly two-thirds of these sites are public ones, representing about 90% of the content available
within this group of sixty. The absolutely massive size of the largest sites shown also illustrates the
universal power function distribution of sites within the deep Web, not dissimilar to Web site
popularity [40] [#fn40] or surface Web sites. [41] [#fn41] One implication of this type of distribution is that
there is no real upper size boundary to which sites may grow.
Table 2. Sixty Largest Deep Web Sites

Name | Type | URL | Web Size (GBs)
National Climatic Data Center (NOAA) | Public | http://www.ncdc.noaa.gov/ol/satellite/satelliteresources.html | 366,00
NASA EOSDIS | Public | http://harp.gsfc.nasa.gov/~imswww/pub/imswelcome/plain.html | 219,60
National Oceanographic (combined with Geophysical) Data Center (NOAA) | Public/Fee | http://www.nodc.noaa.gov/, http://www.ngdc.noaa.gov/ | 32,940
Alexa | Public (partial) | http://www.alexa.com/ | 15,860
Right-to-Know Network (RTK Net) | Public | http://www.rtk.net/ | 14,640
MP3.com | Public | http://www.mp3.com/ | 4,300
Terraserver | Public/Fee | http://terraserver.microsoft.com/ | 4,270
HEASARC (High Energy Astrophysics Science Archive Research Center) | Public | http://heasarc.gsfc.nasa.gov/W3Browse/ | 2,562
US PTO Trademarks + Patents | Public | http://www.uspto.gov/tmdb/, http://www.uspto.gov/patft/ | 2,440
Informedia (Carnegie Mellon Univ.) | Public (not yet) | http://www.informedia.cs.cmu.edu/ | 1,830
Alexandria Digital Library | Public | http://www.alexandria.ucsb.edu/adl.html | 1,220
JSTOR Project | Limited | http://www.jstor.org/ | 1,220
10K Search Wizard | Public | http://www.tenkwizard.com/ | 769
UC Berkeley Digital Library Project | Public | http://elib.cs.berkeley.edu/ | 766
SEC Edgar | Public | http://www.sec.gov/edgarhp.htm | 610
US Census | Public | http://factfinder.census.gov | 610
NCI CancerNet Database | Public | http://cancernet.nci.nih.gov/ | 488
Amazon.com | Public | http://www.amazon.com/ | 461
IBM Patent Center | Public/Private | http://www.patents.ibm.com/boolquery | 345
NASA Image Exchange | Public | http://nix.nasa.gov/ | 337
InfoUSA.com | Public/Private | http://www.abii.com/ | 195
Betterwhois (many similar) | Public | http://betterwhois.com/ | 152
GPO Access | Public | http://www.access.gpo.gov/ | 146
Adobe PDF Search | Public | http://searchpdf.adobe.com/ | 143
Internet Auction List | Public | http://www.internetauctionlist.com/search_products.html | 130
Commerce, Inc. | Public | http://search.commerceinc.com/ | 122
Library of Congress Online Catalog | Public | http://catalog.loc.gov/ | 116
Sunsite Europe | Public | http://src.doc.ic.ac.uk/ | 98
Uncover Periodical DB | Public/Fee | http://uncweb.carl.org/ | 97
Astronomer's Bazaar | Public | http://cdsweb.u-strasbg.fr/Cats.html | 94
eBay.com | Public | http://www.ebay.com/ | 82
REALTOR.com Real Estate Search | Public | http://www.realtor.com/ | 60
Federal Express | Public (if shipper) | http://www.fedex.com/ | 53
Integrum | Public/Private | http://www.integrumworld.com/eng_test/index.html | 49
NIH PubMed | Public | http://www.ncbi.nlm.nih.gov/PubMed/ | 41
Visual Woman (NIH) | Public | http://www.nlm.nih.gov/research/visible/visible_human.html | 40
AutoTrader.com | Public | http://www.autoconnect.com/index.jtmpl/?LNX=M1DJAROSTEXT | 39
UPS | Public (if shipper) | http://www.ups.com/ | 33
NIH GenBank | Public | http://www.ncbi.nlm.nih.gov/Genbank/index.html | 31
AustLi (Australasian Legal Information Institute) | Public | http://www.austlii.edu.au/austlii/ | 24
Digital Library Program (UVa) | Public | http://www.lva.lib.va.us/ | 21
Subtotal Public and Mixed Sources | | | 673,0
DBT Online | Fee | http://www.dbtonline.com/ | 30,500
Lexis-Nexis | Fee | http://www.lexis-nexis.com/lncc/ | 12,200
Dialog | Fee | http://www.dialog.com/ | 10,980
Genealogy ancestry.com | Fee | http://www.ancestry.com/ | 6,500
ProQuest Direct (incl. Digital Vault) | Fee | http://www.umi.com | 3,172
Dun & Bradstreet | Fee | http://www.dnb.com | 3,113
Westlaw | Fee | http://www.westlaw.com/ | 2,684
Dow Jones News Retrieval | Fee | http://dowjones.wsj.com/p/main.html | 2,684
infoUSA | Fee/Public | http://www.infousa.com/ | 1,584
Elsevier Press | Fee | http://www.elsevier.com | 570
EBSCO | Fee | http://www.ebsco.com | 481
Springer-Verlag | Fee | http://link.springer.de/ | 221
OVID Technologies | Fee | http://www.ovid.com | 191
Investext | Fee | http://www.investext.com/ | 157
Blackwell Science | Fee | http://www.blackwell-science.com | 146
GenServ | Fee | http://gs01.genserv.com/gs/bcc.htm | 106
Academic Press IDEAL | Fee | http://www.idealibrary.com | 104
Tradecompass | Fee | http://www.tradecompass.com/ | 61
INSPEC | Fee | http://www.iee.org.uk/publish/inspec/online/online.html | 16
Subtotal Fee-Based Sources | | | 75.46
TOTAL | | | 748,5

This listing is preliminary and likely incomplete since we lack a complete census of deep Web sites.
Our inspection of the 700 random-sample deep Web sites identified another three that were not in
the initially identified pool of 100 potentially large sites. If that ratio were to hold across the entire
estimated 200,000 deep Web sites (see next table), perhaps only a very small percentage of sites
shown in this table would prove to be the largest. However, since many large sites are anecdotally
known, we believe our listing, while highly inaccurate, may represent 10% to 20% of the actual
largest deep Web sites in existence.
This inability to identify all of the largest deep Web sites today should not be surprising. The
awareness of the deep Web is a new phenomenon and has received little attention. We solicit
nominations for additional large sites on our comprehensive CompletePlanet site and will
document new instances as they arise.

Deep Web is 500 Times Larger than the Surface Web


We employed three types of overlap analysis to estimate the total numbers of deep Web sites. In
the first approach, shown in Table 3, we issued 100 random deep Web URLs from our pool of
17,000 to the search engines that support URL search. These results, with the accompanying
overlap analysis, are:
Table 3. Estimation of Deep Web Sites, Search Engine Overlap Analysis
Search Engine A | A no dupes | Search Engine B | B no dupes | A plus B | Unique | Database Fraction | Database Size | Total Est. Deep Web Sites
AltaVista | | Northern Light | 60 | | | 0.133 | 20,635 | 154,763
AltaVista | | Fast | 57 | | | 0.140 | 20,635 | 147,024
Fast | 57 | AltaVista | | | 49 | 0.889 | 27,940 | 31,433
Northern Light | 60 | AltaVista | | | 52 | 0.889 | 27,195 | 30,594
Northern Light | 60 | Fast | 57 | 44 | 16 | 0.772 | 27,195 | 35,230
Fast | 57 | Northern Light | 60 | 44 | 13 | 0.733 | 27,940 | 38,100

This table shows greater diversity in deep Web site estimates as compared to normal surface Web
overlap analysis. We believe the reasons for this variability are: 1) the relatively small sample size
matched against the engines; 2) the high likelihood of inaccuracy in the baseline for total deep Web
database sizes from Northern Light [42] [#fn42] ; and 3) the indiscriminate scaling of Fast and
AltaVista deep Web site coverage based on the surface ratios of these engines to Northern Light. As
a result, we have little confidence in these results.
An alternate method is to compare NEC reported values [5g] [#fn5] for surface Web coverage to the
reported deep Web sites from the Northern Light engine. These numbers were further adjusted by
the final qualification fraction obtained from our hand scoring of 700 random deep Web sites.
These results are shown below:
Table 4. Estimation of Deep Web Sites, Search Engine Market Share Basis
Search Engine | Reported Deep Web Sites | Surface Web Coverage % | Qualification Fraction | Total Est. Deep Web Sites
Northern Light | 27,195 | 16.0% | 86.4% | 146,853
AltaVista | 20,635 | 15.5% | 86.4% | 115,023

This approach, too, suffers from the limitations of using the Northern Light deep Web site baseline.
It is also unclear, though likely, that deep Web search coverage is more highly represented in the
search engines' listing as discussed above.
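The Table 4 totals can be reproduced by dividing each engine's reported deep Web site count by its surface Web coverage and then applying the qualification fraction. The sketch below assumes that this is the whole calculation; the function name is illustrative.

```python
def estimate_by_market_share(reported_sites: int, surface_coverage: float,
                             qualification_fraction: float) -> int:
    """Scale a reported site count up by the engine's surface coverage,
    then keep only the fraction that passed hand qualification."""
    return round(reported_sites / surface_coverage * qualification_fraction)

# Inputs taken from Table 4.
northern_light = estimate_by_market_share(27_195, 0.160, 0.864)
altavista = estimate_by_market_share(20_635, 0.155, 0.864)
print(northern_light, altavista)
```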
Our third approach is more relevant and is shown in Table 5.
Under this approach, we use overlap analysis for the three largest compilation sites for deep Web
sites used to build our original 17,000 qualified candidate pool. To our knowledge, these are the
three largest listings extant, excepting our own CompletePlanet site.
This approach has the advantages of providing an absolute count of sites, ensuring final
qualification as to whether the sites are actually deep Web search sites, and using relatively large
sample sizes.
Because each of the three compilation sources has a known population, the table shows only three
pairwise comparisons (e.g., there is no uncertainty in the ultimate A or B population counts).
Table 5. Estimation of Deep Web Sites, Searchable Database Compilation Overlap
Analysis
DB A | A no dups | DB B | B no dups | A + B | Unique | DB Fract. | DB Size | Total Estimated Deep Web Sites
Lycos | 5,081 | Internets | 3,449 | 256 | 4,825 | 0.074 | 5,081 | 68,455
Lycos | 5,081 | Infomine | 2,969 | 156 | 4,925 | 0.053 | 5,081 | 96,702
Internets | 3,449 | Infomine | 2,969 | 234 | 3,215 | 0.079 | 3,449 | 43,761

As discussed above, there is certainly sampling bias in these compilations since they were
purposeful and not randomly obtained. Despite this, there is a surprising amount of uniqueness
among the compilations.
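The pairwise estimates in Table 5 follow from a simple ratio: the share of listing B that also appears in listing A is taken as A's coverage of the whole population, and A's size is divided by that coverage. This is the same arithmetic as the Lincoln-Petersen capture-recapture estimator. A minimal sketch using the Table 5 inputs (the function and variable names are illustrative):

```python
from statistics import mean

def overlap_estimate(size_a: int, size_b: int, shared: int) -> int:
    """Estimate the total population from two listings and their overlap.

    The fraction of B also found in A (shared / size_b) is taken as A's
    coverage of the whole population, so total = size_a / coverage.
    """
    coverage_of_a = shared / size_b
    return round(size_a / coverage_of_a)

# (listing A size, listing B size, sites found in both), as reported in Table 5
pairs = {
    "Lycos-Internets": (5_081, 3_449, 256),
    "Lycos-Infomine": (5_081, 2_969, 156),
    "Internets-Infomine": (3_449, 2_969, 234),
}
estimates = {name: overlap_estimate(*sizes) for name, sizes in pairs.items()}
average = mean(estimates.values())
print(estimates, round(average))
```

The estimator assumes the two listings were compiled independently. Purposeful, non-random compilation tends to inflate the overlap and therefore pushes the estimate downward.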
The Lycos and Internets listings are more similar in focus in that they are commercial sites. The
Infomine site was developed from an academic perspective. For this reason, we adjudge the Lycos-Infomine pairwise comparison to be most appropriate. Though sampling was directed for both
sites, the intended coverage and perspective is different.
There is obviously much uncertainty in these various tables. Because of lack of randomness, these
estimates are likely at the lower bounds for the number of deep Web sites. Across all estimating
methods the mean estimate for number of deep Web sites is about 76,000, with a median of about
56,000. For the searchable database compilation only, the average is about 70,000.
The under count due to lack of randomness and what we believe to be the best estimate above,
namely the Lycos-Infomine pair, indicate to us that the ultimate number of deep Web sites today is
on the order of 200,000.

Figure 4. Inferred Distribution of Deep Web Sites, Total Record Size

Plotting the fully characterized random 100 deep Web sites against total record counts produces
Figure 4. Plotting these same sites against database size (HTML-included basis) produces Figure 5.
Multiplying the mean size of 74.4 MB per deep Web site times a total of 200,000 deep Web sites
results in a total deep Web size projection of 7.44 petabytes, or 7,440 terabytes. [43] [#fn43] [44a] [#fn44]
Compared to the current surface Web content estimate of 18.7 TB (see Table 1), this suggests a
deep Web size about 400 times larger than the surface Web. Even at the lowest end of the deep
Web size estimates in Table 3 through Table 5, the deep Web size calculates as 120 times larger
than the surface Web. At the highest end of the estimates, the deep Web is about 620 times the size
of the surface Web.
Alternately, multiplying the mean document/record count per deep Web site of 5.43 million times
200,000 total deep Web sites results in a total record count across the deep Web of 543 billion
documents. [44b] [#fn44] Compared to the Table 1 estimate of one billion documents, this implies a
deep Web 550 times larger than the surface Web. At the low end of the deep Web size estimate this
factor is 170 times; at the high end, 840 times.
Clearly, the scale of the deep Web is massive, though uncertain. Since 60 deep Web sites alone are
nearly 40 times the size of the entire surface Web, we believe that the 200,000 deep Web site basis
is the most reasonable one. Thus, across database and record sizes, we estimate the deep Web to be
about 500 times the size of the surface Web.
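The two headline ratios can be checked directly from the totals quoted in the text. The sketch below takes the projected deep Web totals as given and does not attempt to rederive them from the per-site means.

```python
SURFACE_TB = 18.7       # surface Web content estimate (Table 1)
SURFACE_DOCS = 1e9      # surface Web document estimate (Table 1)
DEEP_TB = 7_440         # projected deep Web size stated in the text
DEEP_DOCS = 543e9       # projected deep Web record count stated in the text

size_ratio = DEEP_TB / SURFACE_TB
doc_ratio = DEEP_DOCS / SURFACE_DOCS
print(f"by size: {size_ratio:.0f}x, by documents: {doc_ratio:.0f}x")
```

The size basis gives a ratio a little under 400 and the document basis a ratio a little under 550, which brackets the rounded figure of about 500 times.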

Figure 5. Inferred Distribution of Deep Web Sites, Total Database Size (MBs)

Deep Web Coverage is Broad, Relevant


Table 6 represents the subject coverage across all 17,000 deep Web sites used in this study. These
subject areas correspond to the top-level subject structure of the CompletePlanet site. The table
shows a surprisingly uniform distribution of content across all areas, with no category lacking
significant representation of content. Actual inspection of the CompletePlanet site by node shows
some subjects are deeper and broader than others. However, it is clear that deep Web content also
has relevance to every information need and market.
Table 6. Distribution of Deep Sites by Subject Area
Subject Area | Deep Web Coverage
Agriculture | 2.7%
Arts | 6.6%
Business | 5.9%
Computing/Web | 6.9%
Education | 4.3%
Employment | 4.1%
Engineering | 3.1%
Government | 3.9%
Health | 5.5%
Humanities | 13.5%
Law/Politics | 3.9%
Lifestyles | 4.0%
News, Media | 12.2%
People, Companies | 4.9%
Recreation, Sports | 3.5%
References | 4.5%
Science, Math | 4.0%
Travel | 3.4%
Shopping | 3.2%

Figure 6 displays the distribution of deep Web sites by type of content.

Figure 6. Distribution of Deep Web Sites by Content Type

More than half of all deep Web sites feature topical databases. Topical databases plus large internal
site documents and archived publications make up nearly 80% of all deep Web sites. Purchase-transaction sites, including true shopping sites with auctions and classifieds, account for
another 10% or so of sites. The other eight categories collectively account for the remaining 10% or
so of sites.

Deep Web is Higher Quality


"Quality" is subjective: If you get the results you desire, that is high quality; if you don't, there is no
quality at all.
When BrightPlanet assembles quality results for its Web-site clients, it applies additional filters
and tests to computational linguistic scoring. For example, university course listings often contain
many of the query terms that can produce high linguistic scores, but they have little intrinsic
content value unless you are a student looking for a particular course. Various classes of these
potential false positives exist and can be discovered and eliminated through learned business rules.
Our measurement of deep vs. surface Web quality did not apply these more sophisticated filters.
We relied on computational linguistic scores alone. We also posed five queries across various
subject domains. Using only computational linguistic scoring does not introduce systematic bias in
comparing deep and surface Web results because the same criteria are used in both. The relative
differences between surface and deep Web should maintain, even though the absolute values are
preliminary and will overestimate "quality." The results of these limited tests are shown in Table 7.

Table 7. "Quality" Document Retrieval, Deep vs. Surface Web


Query | Surface Web Total | Surface Web "Quality" | Surface Web Yield | Deep Web Total | Deep Web "Quality" | Deep Web Yield
Agriculture | 400 | 20 | 5.0% | 300 | 42 | 14.0%
Medicine | 500 | 23 | 4.6% | 400 | 50 | 12.5%
Finance | 350 | 18 | 5.1% | 600 | 75 | 12.5%
Science | 700 | 30 | 4.3% | 700 | 80 | 11.4%
Law | 260 | 12 | 4.6% | 320 | 38 | 11.9%
TOTAL | 2,210 | 103 | 4.7% | 2,320 | 285 | 12.3%

This table shows that, on average for this limited sample set, quality results are about three times
as likely to be obtained from the deep Web as from the surface Web. Also, the
absolute number of results shows that deep Web sites tend to return about 5% more documents (2,320 vs. 2,210) than
surface Web sites and nearly triple the number of quality documents. While each query used three
of the largest and best search engines and three of the best known deep Web sites, these results are
somewhat misleading and likely underestimate the "quality" difference between the surface and
deep Web. First, there are literally hundreds of applicable deep Web sites for each query subject
area. Some of these additional sites would likely not return as high an overall quality yield, but
would add to the total number of quality results returned. Second, even with increased numbers of
surface search engines, total surface coverage would not go up significantly and yields would
decline, especially if duplicates across all search engines were removed (as they should be). And,
third, we believe the degree of content overlap between deep Web sites to be much less than for
surface Web sites.(45) Though the quality tests applied in this study are not definitive, we believe
they point to a defensible conclusion that quality is many times greater for the deep Web than for
the surface Web. Moreover, the deep Web has the prospect of yielding quality results that cannot
be obtained by any other means, with absolute numbers of quality results increasing as a function
of the number of deep Web sites simultaneously searched. The deep Web thus appears to be a
critical source when it is imperative to find a "needle in a haystack."
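The totals and yields in Table 7 can be recomputed from the per-query counts. The sketch below uses the table's figures as inputs; the function and variable names are illustrative.

```python
# (total documents, documents above the quality threshold) per query, from Table 7
surface = {"Agriculture": (400, 20), "Medicine": (500, 23), "Finance": (350, 18),
           "Science": (700, 30), "Law": (260, 12)}
deep = {"Agriculture": (300, 42), "Medicine": (400, 50), "Finance": (600, 75),
        "Science": (700, 80), "Law": (320, 38)}

def overall_yield(results: dict[str, tuple[int, int]]) -> tuple[int, int, float]:
    """Return (total documents, quality documents, quality yield in percent)."""
    total = sum(t for t, _ in results.values())
    quality = sum(q for _, q in results.values())
    return total, quality, 100.0 * quality / total

s_total, s_quality, s_yield = overall_yield(surface)
d_total, d_quality, d_yield = overall_yield(deep)
yield_ratio = d_yield / s_yield
print(f"surface {s_yield:.1f}%, deep {d_yield:.1f}%, ratio {yield_ratio:.2f}")
```

Pooling the counts before dividing weights each document equally, which is how the TOTAL row of the table is formed.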

Deep Web Growing Faster than Surface Web


Lacking time-series analysis, we used the proxy of domain registration date to measure the growth
rates for each of 100 randomly chosen deep and surface Web sites. These results are presented as a
scattergram with superimposed growth trend lines in Figure 7.

Figure 7. Comparative Deep and Surface Web Site Growth Rates

Use of site domain registration as a proxy for growth has a number of limitations. First, sites are
frequently registered well in advance of going "live." Second, the domain registration is at the root
or domain level (e.g., www.mainsite.com [http://www.mainsite.com] ). The search function and page,
whether for surface or deep sites, often is introduced after the site is initially unveiled and may
itself reside on a subsidiary form not discoverable by the whois analysis.
The best way to test for actual growth is a time series analysis. BrightPlanet plans to institute such
tracking mechanisms to obtain better growth estimates in the future.
However, this limited test does suggest faster growth for the deep Web. Both median and average
deep Web sites are four or five months "younger" than surface Web sites (Mar. 95 v. Aug. 95). This
is not surprising. The Internet has become the preferred medium for public dissemination of
records and information, and more and more information disseminators (such as government
agencies and major research projects) that have enough content to qualify as deep Web are moving
their information online. Moreover, the technology for delivering deep Web sites has been around
for a shorter period of time.

Thousands of Conventional Search Engines Remain Undiscovered
While we have specifically defined the deep Web to exclude search engines (see next section), many
specialized search engines, such as those shown in Table 8 below or @griculture.com
[http://www.agriculture.com/] , AgriSurf [http://www.agrisurf.com/agrisurfscripts/agrisurf.asp?index=_25] ,
or joefarmer [formerly http://www.joefarmer.com/] in the agriculture domain, provide unique
content not readily indexed by major engines such as AltaVista, Fast or Northern Light. The key
reasons that specialty search engines may contain information not on the major ones are indexing
frequency and limitations the major search engines may impose on documents indexed per site.
[11b] [#fn11]

To find out whether the specialty search engines really do offer unique information, we applied
similar retrieval and qualification methods (pairwise overlap analysis) to them in a new
investigation. The results of this analysis are shown in the table below.

Table 8. Estimated Number of Surface Site Search Engines


Search Engine A | A no dupes | Search Engine B | B no dupes | A plus B | Unique | Search Engine Fraction | Search Engine Size | Est. # of Search Engines
FinderSeeker | 2,012 | SEG | 1,268 | 233 | 1,779 | 0.184 | 2,012 | 10,949
FinderSeeker | 2,012 | Netherlands | 1,170 | 167 | 1,845 | 0.143 | 2,012 | 14,096
FinderSeeker | 2,012 | LincOne | 783 | 129 | 1,883 | 0.165 | 2,012 | 12,212
SearchEngineGuide | 1,268 | FinderSeeker | 2,012 | 233 | 1,035 | 0.116 | 1,268 | 10,949
SearchEngineGuide | 1,268 | Netherlands | 1,170 | 160 | 1,108 | 0.137 | 1,268 | 9,272
SearchEngineGuide | 1,268 | LincOne | 783 | 28 | 1,240 | 0.036 | 1,268 | 35,459
Netherlands | 1,170 | FinderSeeker | 2,012 | 167 | 1,003 | 0.083 | 1,170 | 14,096
Netherlands | 1,170 | SEG | 1,268 | 160 | 1,010 | 0.126 | 1,170 | 9,272
Netherlands | 1,170 | LincOne | 783 | 44 | 1,126 | 0.056 | 1,170 | 20,821
LincOne | 783 | FinderSeeker | 2,012 | 129 | 654 | 0.064 | 783 | 12,212
LincOne | 783 | SEG | 1,268 | 28 | 755 | 0.022 | 783 | 35,459
LincOne | 783 | Netherlands | 1,170 | 44 | 739 | 0.038 | 783 | 20,821

These results suggest there may be on the order of 20,000 to 25,000 total search engines currently
on the Web. (Recall that all of our deep Web analysis excludes these additional search engine sites.)
M. Hofstede, of the Leiden University Library in the Netherlands, reports that one compilation
alone contains nearly 45,000 search site listings. [46] [#fn46] Thus, our best current estimate is that
deep Web searchable databases and search engines have a combined total of 250,000 sites.
Whatever the actual number proves to be, comprehensive Web search strategies should include the
specialty search engines as well as deep Web sites. Thus, BrightPlanet's CompletePlanet Web site
also includes specialty search engines in its listings.

Commentary
The most important findings from our analysis of the deep Web are that there is massive and
meaningful content not discoverable with conventional search technology and that there is a nearly
uniform lack of awareness that this critical content even exists.

Original Deep Content Now Exceeds All Printed Global Content
International Data Corporation predicts that the number of surface Web documents will grow from
the current two billion or so to 13 billion within three years, a factor increase of 6.5 times; [47] [#fn47]
deep Web growth should exceed this rate, perhaps increasing about nine-fold over the same period.
Figure 8 compares this growth with trends in the cumulative global content of print information
drawn from a recent UC Berkeley study. [48a] [#fn48]

Figure 8. 10-yr Growth Trends in Cumulative Original Information Content (log scale)

The total volume of printed works (books, journals, newspapers, newsletters, office documents)
has held steady at about 390 terabytes (TBs). [48b] [#fn48] By about 1998, deep Web original
information content equaled all print content produced through history up until that time. By
2000, original deep Web content is estimated to have exceeded print by a factor of seven and is
projected to exceed print content by a factor of sixty-three by 2003.
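The factor of sixty-three is what results from combining the 2000 ratio with the nine-fold deep Web growth mentioned above, provided print volume stays flat. A minimal sketch of that arithmetic follows; the annualized growth factors are derived here for illustration and are not figures reported in the study.

```python
YEARS = 3
surface_growth = 13e9 / 2e9    # surface Web documents, per the IDC projection cited above
deep_growth = 9.0              # assumed nine-fold deep Web growth over the same period
print_ratio_2000 = 7.0         # deep Web content relative to print in 2000

# Print volume is assumed to hold steady, so the ratio scales with deep Web growth.
print_ratio_2003 = print_ratio_2000 * deep_growth

# Implied yearly growth factors if growth compounds evenly over the three years.
surface_annual = surface_growth ** (1 / YEARS)
deep_annual = deep_growth ** (1 / YEARS)
print(print_ratio_2003, round(surface_annual, 2), round(deep_annual, 2))
```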
Other indicators suggest that the deep Web is the fastest growing component of the Web and will
continue to dominate it. [49] [#fn49] Even today, at least 240 major libraries have their catalogs on
line; [50] [#fn50] UMI, a former subsidiary of Bell & Howell, has plans to put more than 5.5 billion
document images online; [51] [#fn51] and major astronomy data initiatives are moving toward putting
petabytes of data online. [52] [#fn52]
These trends are being fueled by the phenomenal growth and cost reductions in digital, magnetic
storage. [48c] [#fn48] [53] [#fn53] International Data Corporation estimates that the amount of disk
storage capacity sold annually grew from 10,000 terabytes in 1994 to 116,000 terabytes in 1998,
and it is expected to increase to 1,400,000 terabytes in 2002. [54] [#fn54] Deep Web content accounted
for about 1/338th of magnetic storage devoted to original content in 2000; it is projected to
increase to 1/200th by 2003. As the Internet is expected to continue as the universal medium for
publishing and disseminating information, these trends are sure to continue.

The Gray Zone


There is no bright line that separates content sources on the Web. There are circumstances where
"deep" content can appear on the surface, and, especially with specialty search engines, when
"surface" content can appear to be deep.
Surface Web content is persistent on static pages discoverable by search engines through crawling,
while deep Web content is only presented dynamically in response to a direct request. However,
once directly requested, deep Web content comes associated with a URL, most often containing the
database record number, that can be re-used later to obtain the same document.
We can illustrate this point using one of the best searchable databases on the Web, 10Kwizard
[http://www.10kwizard.com] . 10Kwizard provides full-text searching of SEC corporate filings. [55] [#fn55]
We issued a query on "NCAA basketball" with a restriction to review only annual filings filed
between March 1999 and March 2000. One result was produced for Sportsline USA, Inc. Clicking
on that listing produces full-text portions for the query string in that annual filing. With another
click, the full filing text can also be viewed. The URL resulting from this direct request is:
http://www.10kwizard.com/blurbs.php?repo=tenk&ipage=1067295&exp=%22ncaa+basketball%22&g=
Note two things about this URL. First, our query terms appear in it. Second, the "ipage=" shows a
unique record number, in this case 1067295. It is via this record number that the results are served
dynamically from the 10KWizard database.
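The two observations about this URL can be checked mechanically. The following minimal sketch uses Python's standard urllib.parse module to split the URL into its script path and decoded query parameters. The URL is the one quoted above, written without the spaces introduced by line wrapping; the function name describe_deep_url is hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

# URL from the text, with the line-wrapping spaces around "&" removed.
url = ("http://www.10kwizard.com/blurbs.php"
       "?repo=tenk&ipage=1067295&exp=%22ncaa+basketball%22&g=")

def describe_deep_url(url):
    """Return the script path and a dict of decoded query parameters."""
    parts = urlsplit(url)
    # keep_blank_values=True retains empty parameters such as the trailing "g=".
    raw = parse_qs(parts.query, keep_blank_values=True)
    params = {name: values[0] for name, values in raw.items()}
    return parts.path, params

path, params = describe_deep_url(url)
print(path)             # the server-side script that builds the page
print(params["ipage"])  # the database record number
print(params["exp"])    # the decoded query expression
```

The record number and the query terms are both ordinary query-string parameters, which is why anyone holding the URL (including a crawler) can replay the same request.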
Now, if we were doing comprehensive research on this company and posting these results on our
own Web page, other users could click on this URL and get the same information. Importantly, if
we had posted this URL on a static Web page, search engine crawlers could also discover it, use the
same URL as shown above, and then index the contents.
It is by doing searches and making the resulting URLs available that deep content can be brought
to the surface. Any deep content listed on a static Web page is discoverable by crawlers and
therefore indexable by search engines. As the next section describes, it is impossible to completely
"scrub" large deep Web sites for all content in this manner. But it does show why some deep Web
content occasionally appears on surface Web search engines.
This gray zone also encompasses surface Web sites that are available through deep Web sites. For
instance, the Open Directory Project [http://dmoz.org] is an effort to organize the best of surface
Web content using volunteer editors or "guides." [56] [#fn56] The Open Directory looks something like
Yahoo!; that is, it is a tree structure with directory URL results at each branch. The results pages
are static, laid out like disk directories, and are therefore easily indexable by the major search
engines.
The Open Directory claims a subject structure of 248,000 categories, [57] [#fn57] each of which is a
static page. [58] [#fn58] The key point is that every one of these 248,000 pages is indexable by major
search engines.
Four major search engines with broad surface coverage allow searches to be specified based on
URL. The query "URL:dmoz.org" (the address for the Open Directory site) was posed to these
engines with these results:
Table 9. Incomplete Indexing of Surface Web Sites

Engine                  OPD Pages   Yield
Open Directory (OPD)    248,706
AltaVista               17,833      7.2%
Fast                    12,199      4.9%
Northern Light          11,120      4.5%
Go (Infoseek)           1,970       0.8%

Although there are almost 250,000 subject pages at the Open Directory site, only a tiny percentage
are recognized by the major search engines. Clearly the engines' search algorithms have rules about
either depth or breadth of surface pages indexed for a given site. We also found a broad variation in
the timeliness of results from these engines. Specialized surface sources or engines should
therefore be considered when truly deep searching is desired. The "bright line" between the deep and
surface Web is really shades of gray.
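The yield column of Table 9 is simply each engine's page count divided by the 248,706 pages the Open Directory reports for itself. A short sketch that reproduces those percentages from the counts in the table (the function name yield_percent is hypothetical):

```python
OPD_PAGES = 248_706  # pages reported by the Open Directory itself (Table 9)

# Pages returned by each engine for the "URL:dmoz.org" query (Table 9).
indexed = {
    "AltaVista": 17_833,
    "Fast": 12_199,
    "Northern Light": 11_120,
    "Go (Infoseek)": 1_970,
}

def yield_percent(pages_indexed: int, pages_total: int = OPD_PAGES) -> float:
    """Share of the site's pages that an engine returned, in percent."""
    return 100.0 * pages_indexed / pages_total

for engine, pages in indexed.items():
    print(f"{engine:<15} {pages:>7,} {yield_percent(pages):5.1f}%")
```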

The Impossibility of Complete Indexing of Deep Web Content

Consider how a directed query works: specific requests need to be posed against the searchable
database by stringing together individual query terms (and perhaps other filters such as date
restrictions). If you do not ask the database specifically what you want, you will not get it.
Let us take, for example, our own listing of 38,000 deep Web sites. Within this compilation, we
have some 430,000 unique terms and a total of 21,000,000 terms. If these numbers represented
the contents of a searchable database, then we would have to issue 430,000 individual queries to
ensure we had comprehensively "scrubbed" or obtained all records within the source database. Our
database is small compared to some large deep Web databases. For example, one of the largest
collections of text terms is the British National Corpus containing more than 100 million unique
terms. [59] [#fn59]
It is infeasible to issue many hundreds of thousands or millions of direct queries to individual deep
Web search databases. It is implausible to repeat this process across tens to hundreds of thousands
of deep Web sites. And, of course, because content changes and is dynamic, it is impossible to
repeat this task on a reasonable update schedule. For these reasons, the predominant share of the
deep Web content will remain below the surface and can only be discovered within the context of a
specific information request.
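To give a feel for the scale involved, the sketch below turns the figures above into a back-of-the-envelope time estimate. The 430,000 unique terms come from the text and the 200,000 potential deep Web sites from the endnotes; the rate of one query per second is purely an assumption chosen for illustration, as is treating every site as if it were the size of our own compilation.

```python
UNIQUE_TERMS_PER_SITE = 430_000  # from the text: one query per unique term
DEEP_WEB_SITES = 200_000         # potential deep Web sites (see endnotes)
QUERIES_PER_SECOND = 1.0         # ASSUMPTION: illustrative rate, not measured
SECONDS_PER_DAY = 86_400

def days_to_scrub(unique_terms: int, queries_per_second: float) -> float:
    """Days needed to issue one query per unique term against one database."""
    return unique_terms / queries_per_second / SECONDS_PER_DAY

per_site_days = days_to_scrub(UNIQUE_TERMS_PER_SITE, QUERIES_PER_SECOND)
total_queries = UNIQUE_TERMS_PER_SITE * DEEP_WEB_SITES
all_sites_years = per_site_days * DEEP_WEB_SITES / 365

print(f"one site:  {per_site_days:.1f} days")
print(f"all sites: {total_queries:,} queries, about {all_sites_years:,.0f} years")
```

Even if the assumed rate were off by several orders of magnitude, a single pass would still be far too slow to keep up with content that changes continuously.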

Possible Double Counting


Web content is distributed and, once posted, "public" to any source that chooses to replicate it.
How much of deep Web content is unique, and how much is duplicated? And, are there differences
in duplicated content between the deep and surface Web?

This study was not able to resolve these questions. Indeed, it is not known today how much
duplication occurs within the surface Web.
Observations from working with the deep Web sources and data suggest there are important
information categories where duplication does exist. Prominent among these are yellow/white
pages, genealogical records, and public records with commercial potential such as SEC filings.
There are, for example, numerous sites devoted to company financials.
On the other hand, there are entire categories of deep Web sites whose content appears uniquely
valuable. These mostly fall within the categories of topical databases, publications, and internal site
indices, accounting in total for about 80% of deep Web sites, and include such sources as
scientific databases, library holdings, unique bibliographies such as PubMed, and unique
government data repositories such as satellite imaging data and the like.
But duplication is also rampant on the surface Web. Many sites are "mirrored." Popular documents
are frequently appropriated by others and posted on their own sites. Common information such as
book and product listings, software, press releases, and so forth may turn up multiple times on
search engine searches. And, of course, the search engines themselves duplicate much content.
Duplication potential thus seems to be a function of public availability, market importance, and
discovery. The deep Web is not as easily discovered, and while mostly public, not as easily copied
by other surface Web sites. These factors suggest that duplication may be lower within the deep
Web. But, for the present, this observation is conjecture.

Deep vs. Surface Web Quality


The issue of quality has been raised throughout this study. A quality search result is not a long list
of hits, but the right list. Searchers want answers. Providing those answers has always been a
problem for the surface Web, and without appropriate technology will be a problem for the deep
Web as well.
Effective searches should both identify the relevant information desired and present it in order of
potential relevance quality. Sometimes what is most important is comprehensive discovery:
everything referring to a commercial product, for instance. Other times the most authoritative
result is needed: the complete description of a chemical compound, as an example. The searches
may be the same for the two sets of requirements, but the answers will have to be different.
Meeting those requirements is daunting, and knowing that the deep Web exists only complicates
the solution because it often contains useful information for either kind of search. If useful
information is obtainable but excluded from a search, the requirements of either user cannot be
met.
We have attempted to bring together some of the metrics included in this paper, [60] [#fn60] defining
quality as both actual quality of the search results and the ability to cover the subject.
Table 10. Total "Quality" Potential, Deep vs. Surface Web

Search Type                  Total Docs (million)   Quality Docs (million)
Surface Web
  Single Site Search         160                    7
  Metasite Search            840                    38
  TOTAL SURFACE POSSIBLE     1,000                  45
Deep Web
  Mega Deep Search           110,000                14,850
Deep vs. Surface Web Ratio
  Single Site Search         688:1                  2,063:1
  Metasite Search            131:1                  393:1
  TOTAL POSSIBLE             655:1                  2,094:1

These strict numerical ratios ignore that including deep Web sites may be the critical factor in
actually discovering the information desired. In terms of discovery, inclusion of deep Web sites
may improve discovery by 600 fold or more.
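The single-site and metasite ratios in Table 10 can be reproduced from the assumptions listed in endnote 60 (16% and 84% surface coverage, 4.5% surface quality retrieval, 13.5% deep Web quality retrieval). The sketch below is one way to do that arithmetic; it uses exact fractions and round-half-up so the printed ratios match the table, and the helper names are hypothetical. The TOTAL POSSIBLE row is not reproduced here because not all of its inputs appear in the table.

```python
from fractions import Fraction
from math import floor

def round_half_up(x: Fraction) -> int:
    """Round to the nearest integer, with halves rounded upward."""
    return floor(x + Fraction(1, 2))

SURFACE_TOTAL = 1_000                  # million documents (Table 10)
SURFACE_QUALITY = Fraction(45, 1000)   # 4.5% quality retrieval (endnote 60)
COVERAGE = {                           # surface coverage (endnote 60)
    "Single Site Search": Fraction(16, 100),
    "Metasite Search": Fraction(84, 100),
}
DEEP_MEGA_TOTAL = 110_000              # million documents (Table 10)
DEEP_QUALITY = Fraction(135, 1000)     # 13.5% quality retrieval (endnote 60)

deep_quality_docs = DEEP_MEGA_TOTAL * DEEP_QUALITY

def ratios(search_type: str) -> tuple[int, int]:
    """Deep-to-surface ratios for total documents and for quality documents."""
    surface_docs = SURFACE_TOTAL * COVERAGE[search_type]
    surface_quality_docs = surface_docs * SURFACE_QUALITY
    return (round_half_up(DEEP_MEGA_TOTAL / surface_docs),
            round_half_up(deep_quality_docs / surface_quality_docs))

for search_type in COVERAGE:
    total_ratio, quality_ratio = ratios(search_type)
    print(f"{search_type:<20} {total_ratio:,}:1  {quality_ratio:,}:1")
```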
Surface Web sites are fraught with quality problems. For example, a study in 1999 indicated that
44% of 1998 Web sites were no longer available in 1999 and that 45% of existing sites were half-finished, meaningless, or trivial. [61] [#fn61] Lawrence and Giles' NEC studies suggest that individual
major search engine coverage dropped from a maximum of 32% in 1998 to 16% in 1999. [7b] [#fn7]
Peer-reviewed journals and services such as Science Citation Index have evolved to provide the
authority necessary for users to judge the quality of information. The Internet lacks such authority.
An intriguing possibility with the deep Web is that individual sites can themselves establish that
authority. For example, an archived publication listing from a peer-reviewed journal such as
Nature or Science or user-accepted sources such as the Wall Street Journal or The Economist
carries with it authority based on their editorial and content efforts. The owner of the site vets
what content is made available. Professional content suppliers typically have the kinds of database-based sites that make up the deep Web; the static HTML pages that typically make up the surface
Web are less likely to be from professional content suppliers.
By directing queries to deep Web sources, users can choose authoritative sites. Search engines,
because of their indiscriminate harvesting, do not direct queries. By careful selection of searchable
sites, users can make their own determinations about quality, even though a solid metric for that
value is difficult or impossible to assign universally.

Conclusion

Serious information seekers can no longer avoid the importance or quality of deep Web
information. But deep Web information is only a component of total information available.
Searching must evolve to encompass the complete Web.
Directed query technology is the only means to integrate deep and surface Web information. The
information retrieval answer has to involve both "mega" searching of appropriate deep Web sites
and "meta" searching of surface Web search engines to overcome their coverage problem. Client-side tools are not universally acceptable because of the need to download the tool and issue
effective queries to it. [62] [#fn62] Pre-assembled storehouses for selected content are also possible, but
will not be satisfactory for all information requests and needs. Specific vertical market services are
already evolving to partially address these challenges. [63] [#fn63] These will likely need to be
supplemented with a persistent query system customizable by the user that would set the queries,
search sites, filters, and schedules for repeated queries.
These observations suggest a splitting within the Internet information search market: search
directories that offer hand-picked information chosen from the surface Web to meet popular
search needs; search engines for more robust surface-level searches; and server-side content-aggregation vertical "infohubs" for deep Web information to provide answers where
comprehensiveness and quality are imperative.

Michael K. Bergman is chairman and VP, products and technology of BrightPlanet Corporation, a
Sioux Falls, SD automated Internet content-aggregation service. Although he trained for a Ph.D. in
population genetics at Duke University, he has been involved in Internet and database-software
ventures for the last decade. He was chairman of The WebTools Co., and is president and chairman
of VisualMetrics Corporation in Iowa City, IA, which developed a genome informatics data system.
He has frequently testified before the U.S. Congress on technology and commercialization issues,
and has been a keynote or invited speaker at more than 80 national industry meetings. He is also
the author of BrightPlanet's award-winning "Tutorial: A Guide to Effective Searching of the
Internet." http://completeplanet.com/Tutorials/Search/index.asp
[http://completeplanet.com/Tutorials/Search/index.asp] . You may reach him by e-mail at
mkb@brightplanet.com [mailto:mkb@brightplanet.com] .

Endnotes
1. Data for the study were collected between March 13 and 30, 2000. The study was originally
published on BrightPlanet's Web site on July 26, 2000. ([formerly
http://www.completeplanet.com/Tutorials/DeepWeb/index.asp]) Some of the references and Web
status statistics were updated on October 23, 2000, with further minor additions on February 22,
2001. [#fn1-ptr1]
2. A couple of good starting references on various Internet protocols can be found at
http://wdvl.com/Internet/Protocols/ [http://wdvl.com/Internet/Protocols/] and
http://www.webopedia.com/Internet_and_Online_Services/Internet/Internet_Protocols/.
[http://www.webopedia.com/Internet_and_Online_Services/Internet/Internet_Protocols/]

[#fn2-ptr1]

3. Tenth edition of GVU's (graphics, visualization and usability) WWW User Survey, May 14, 1999.
[formerly http://www.gvu.gatech.edu/user_surveys/survey-1998-10/tenthreport.html.] [#fn3ptr1]

4. 4a, 4b. "4th Q NPD Search and Portal Site Study," as reported by SearchEngineWatch [formerly
http://searchenginewatch.com/reports/npd.html]. NPD's Web site is at http://www.npd.com/.
[http://www.npd.com/]

[#fn4-ptr1]

[#fn4-ptr2]

5. 5a, 5b, 5c, 5d, 5e, 5f, 5g. "Sizing the Internet," Cyveillance [formerly
http://www.cyveillance.com/web/us/downloads/Sizing_the_Internet.pdf]. [#fn5-ptr1] [#fn5-ptr2]
[#fn5-ptr3] [#fn5-ptr4] [#fn5-ptr5] [#fn5-ptr6] [#fn5-ptr7]

6. 6a, 6b. S. Lawrence and C.L. Giles, "Searching the World Wide Web," Science 280:98-100, April
3, 1998. [#fn6-ptr1] [#fn6-ptr2]
7. 7a, 7b. S. Lawrence and C.L. Giles, "Accessibility of Information on the Web," Nature 400:107-109, July 8, 1999. [#fn7-ptr1] [#fn7-ptr2]
8. See http://www.google.com. [http://www.google.com]

[#fn8-ptr1]

9. See http://www.alltheweb.com [http://www.alltheweb.com] and quoted numbers on entry page.


[#fn9-ptr1]

10. Northern Light is one of the engines that allows a "NOT meaningless" query to be issued to get
an actual document count from its data stores. See http://www.northernlight.com
[http://www.northernlight.com.] NL searches used in this article exclude its "Special Collections"
listing. [#fn10-ptr1]
11. 11a, 11b. An excellent source for tracking the currency of search engine listings is Danny
Sullivan's site, Search Engine Watch (see http://www.searchenginewatch.com
[http://www.searchenginewatch.com/] ). [#fn11-ptr1] [#fn11-ptr2]
12. See http://www.wiley.com/compbooks/sonnenreich/history.html.
[http://www.wiley.com/compbooks/sonnenreich/history.html]

[#fn12-ptr1]

13. 13a, 13b. This analysis assumes there were 1 million documents on the Web as of mid-1994.
[#fn13-ptr1]

[#fn13-ptr2]

14. See http://www.tcp.ca/Jan96/BusandMark.html. [formerly
http://www.tcp.ca/Jan96/BusandMark.html] [#fn14-ptr1]
15. See, for example, G Notess, "Searching the Hidden Internet," in Database, June 1997
(http://www.onlineinc.com/database/JunDB97/nets6.html
[http://notess.com/write/archive/9706.html] ). [#fn15-ptr1]
16. Empirical BrightPlanet results from processing millions of documents provide an actual mean
value of 43.5% for HTML and related content. Using a different metric, NEC researchers found
HTML and related content with white space removed to account for 61% of total page content (see
7). Both measures ignore images and so-called HTML header content. [#fn16-ptr1]
17. Rough estimate based on 700 million total documents indexed by AltaVista, Fast, and Northern
Light, at an average document size of 18.7 KB (see reference 7) and a 50% combined representation
by these three sources for all major search engines. Estimates are on an "HTML included" basis.
[#fn17-ptr1]

18. Many of these databases also store their information in compressed form. Actual disk storage
space on the deep Web is therefore perhaps 30% of the figures reported in this paper. [#fn18-ptr1]
19. See further, BrightPlanet, LexiBot Pro v. 2.1 User's Manual, April 2000, 126 pp. [#fn19-ptr1]
20. This value is equivalent to page sizes reported by most search engines and is equivalent to
reported sizes when an HTML document is saved to disk from a browser. The 1999 NEC study also
reported average Web document size after removal of all HTML tag information and white space to
be 7.3 KB. While a more accurate view of "true" document content, we have used the HTML basis
because of the equivalency in reported results from search engines themselves, browser document
saving and our technology. [#fn20-ptr1]
21. Inktomi Corp., "Web Surpasses One Billion Documents," press release issued January 18,
2000; see http://www.inktomi.com/new/press/2000/billion.html
[http://www.inktomi.com/new/press/2000/billion.html] and http://www.inktomi.com/webmap/
[http://www.inktomi.com/webmap/]

[#fn21-ptr1]

22. For example, the query issued for an agriculture-related database might be "agriculture." Then,
by issuing the same query to Northern Light and comparing it with a comprehensive query that
does not mention the term "agriculture" [such as "(crops OR livestock OR farm OR corn OR rice
OR wheat OR vegetables OR fruit OR cattle OR pigs OR poultry OR sheep OR horses) AND NOT
agriculture"] an empirical coverage factor is calculated. [#fn22-ptr1]
23. The compilation sites used for initial harvest were: [#fn23-ptr1]
AlphaSearch [formerly http://www.calvin.edu/library/searreso/internet/as/]
Direct Search http://www.freepint.com/gary/direct.htm
[http://www.freepint.com/gary/direct.htm]

Infomine Multiple Database Search http://infomine.ucr.edu/ [http://infomine.ucr.edu/]


The BigHub (formerly Internet Sleuth) [formerly http://www.thebighub.com/]
Lycos Searchable Databases [formerly
http://dir.lycos.com/Reference/Searchable_Databases/]
Internets (Search Engines and News) [formerly http://www.internets.com/]
HotSheet http://www.hotsheet.com [http://www.hotsheet.com/]
Plus minor listings from three small sites.
24. K. Bharat and A. Broder, "A Technique for Measuring the Relative Size and Overlap of Public
Web Search Engines," paper presented at the Seventh International World Wide Web Conference,
Brisbane, Australia, April 14-18, 1998. The full paper is available at
http://www7.scu.edu.au/1937/com1937.htm. [http://www7.scu.edu.au/1937/com1937.htm] [#fn24ptr1]

25. 25a, 25b. See, for example, http://www.surveysystem.com/sscalc.htm
[http://www.surveysystem.com/sscalc.htm] , for a sample size calculator. [#fn25-ptr1] [#fn25-ptr2]

26. 26a, 26b. See http://cgi.netscape.com/cgi-bin/rlcgi.cgi?URL=www.mainsite.com./devscripts/dpd [formerly http://cgi.netscape.com/cgi-bin/rlcgi.cgi?URL=www.mainsite.com./devscripts/dpd] [#fn26-ptr1] [#fn26-ptr2]


27. See reference 38. Known pageviews for the logarithmic popularity rankings of selected sites
tracked by Alexa are used to fit a growth function for estimating monthly pageviews based on the
Alexa ranking for a given URL. [#fn27-ptr1]
28. See, for example among many, BetterWhois at http://betterwhois.com. [http://betterwhois.com/]
[#fn28-ptr1]

29. The surface Web domain sample was obtained by first issuing a meaningless query to Northern
Light, 'the AND NOT ddsalsrasve' and obtaining 1,000 URLs. This 1,000 was randomized to
remove (partially) ranking prejudice in the order Northern Light lists results. [#fn29-ptr1]
30. These three engines were selected because of their large size and support for full Boolean
queries. [#fn30-ptr1]
31. An example specific query for the "agriculture" subject areas is "agricultur* AND (swine OR
pig) AND 'artificial insemination' AND genetics." [#fn31-ptr1]
32. The BrightPlanet technology configuration settings were: max. Web page size, 1 MB; min. page
size, 1 KB; no date range filters; no site filters; 10 threads; 3 retries allowed; 60 sec. Web page
timeout; 180 minute max. download time; 200 pages per engine. [#fn32-ptr1]
33. The vector space model, or VSM, is a statistical model that represents documents and queries
as term sets, and computes the similarities between them. Scoring is a simple sum-of-products
computation, based on linear algebra. See further: Salton, Gerard, Automatic Information
Organization and Retrieval, McGraw-Hill, New York, N.Y., 1968; and, Salton, Gerard, Automatic

Text Processing, Addison-Wesley, Reading, MA, 1989. [#fn33-ptr1]


34. The Extended Boolean Information Retrieval (EBIR) uses generalized distance functions to
determine the similarity between weighted Boolean queries and weighted document vectors; see
further Salton, Gerard, Fox, Edward A. and Wu, Harry, (Cornell Technical Report TR82-511)
Extended Boolean Information Retrieval. Cornell University. August 1982. We have modified EBIR
to include minimal term occurrences, term frequencies and other items, which we term mEBIR.
[#fn34-ptr1]

35. See the Help and then FAQ pages at [formerly http://www.invisibleweb.com]. [#fn35-ptr1]
36. K. Wiseman, "The Invisible Web for Educators," [formerly
http://www3.dist214.k12.il.us/invisible/article/invisiblearticle.html] [#fn36-ptr1]
37. C. Sherman, "The Invisible Web," [formerly
http://websearch.about.com/library/weekly/aa061199.htm] [#fn37-ptr1]
38. I. Zachery, "Beyond Search Engines," presented at the Computers in Libraries 2000
Conference, March 15-17, 2000, Washington, DC; [formerly
http://www.pgcollege.org/library/zac/beyond/index.htm] [#fn38-ptr1]
39. The initial July 26, 2000, version of this paper stated an estimate of 100,000 potential deep
Web search sites. Subsequent customer projects have allowed us to update this analysis, again
using overlap analysis, to 200,000 sites. This site number is updated in this paper, but overall deep
Web size estimates have not. In fact, still more recent work with foreign language deep Web sites
strongly suggests the 200,000 estimate is itself low. [#fn39-ptr1]
40. Alexa Corp., "Internet Trends Report 4Q 99." [#fn40-ptr1]
41. B.A. Huberman and L.A. Adamic, "Evolutionary Dynamics of the World Wide Web," 1999; see
http://www.hpl.hp.com/research/idl/papers/webgrowth/
[http://www.hpl.hp.com/research/idl/papers/webgrowth/]

[#fn41-ptr1]

42. The Northern Light total deep Web sites count is based on issuing the query "search OR
database" to the engine restricted to Web documents only, and then picking its Custom Folder on
Web search engines and directories, producing the 27,195 count listing shown. Hand inspection of
the first 100 results yielded only three true searchable databases; this increased in the second 100
to 7. Many of these initial sites were for standard search engines or Web site promotion services.
We believe the yield of actual search sites would continue to increase with depth through the
results. We also believe the query restriction eliminated many potential deep Web search sites.
Unfortunately, there is no empirical way within reasonable effort to verify either of these assertions
nor to quantify their effect on accuracy. [#fn42-ptr1]
43. 1024 bytes = 1 kilobyte (KB); 1000 KB = 1 megabyte (MB); 1000 MB = 1 gigabyte (GB); 1000
GB = 1 terabyte (TB); 1000 TB = 1 petabyte (PB). In other words, 1 PB = 1,024,000,000,000,000
bytes, or approximately 10^15. [#fn43-ptr1]
44. 44a, 44b. Our original paper published on July 26, 2000, used estimates of one billion surface
Web documents and about 100,000 deep Web searchable databases. Since publication, new
information suggests a total of about 200,000 deep Web searchable databases. Since surface Web
document growth is now on the order of 2 billion documents, the ratio of surface to deep Web
documents (400 to 550 times greater in the deep Web) still approximately holds. These trends
would also suggest roughly double the amount of deep Web data storage, to fifteen petabytes, than is
indicated in the main body of the report. [#fn44-ptr1] [#fn44-ptr2]
45. We have not empirically tested this assertion in this study. However, from a logical standpoint,
surface search engines are all indexing ultimately the same content, namely the public indexable
Web. Deep Web sites reflect information from different domains and producers.
46. M. Hofstede, pers. comm., Aug. 3, 2000, referencing http://www.alba36.com/ [formerly
http://www.alba36.com/]. [#fn46-ptr1]

47. As reported in Sequoia Software's IPO filing to the SEC, March 23, 2000; see
http://www.10kwizard.com/filing.php?
repo=tenk&ipage=1117423&doc=1&total=266&back=2&g=. [http://www.10kwizard.com/filing.php?
repo=tenk&ipage=1117423&doc=1&total=266&back=2&g=]

[#fn47-ptr1]

48. 48a, 48b, 48c. P. Lyman and H.R. Varian, "How Much Information," published by the UC
Berkeley School of Information Management and Systems, October 18, 2000. See
http://www.sims.berkeley.edu/research/projects/how-much-info/index.html
[http://www2.sims.berkeley.edu/research/projects/how-much-info/] . The comparisons here are limited to
archivable and retrievable public information, exclusive of entertainment and communications
content such as chat or e-mail. [#fn48-ptr1] [#fn48-ptr2] [#fn48-ptr3]
49. As this analysis has shown, in numerical terms the deep Web already dominates. However,
from a general user perspective, it is unknown. [#fn49-ptr1]
50. See http://lcweb.loc.gov/z3950/. [http://lcweb.loc.gov/z3950/]

[#fn50-ptr1]

51. See http://www.infotoday.com/newsbreaks/nb0713-3.htm.


[http://www.infotoday.com/newsbreaks/nb0713-3.htm]

[#fn51-ptr1]

52. A. Hall, "Drowning in Data," Scientific American, Oct. 1999; [formerly
http://www.sciam.com/explorations/1999/100499data/]. [#fn52-ptr1]
53. As reported in Sequoia Software's IPO filing to the SEC, March 23, 2000 ; see
http://www.10kwizard.com/filing.php?
repo=tenk&ipage=1117423&doc=1&total=266&back=2&g=. [http://www.10kwizard.com/filing.php?
repo=tenk&ipage=1117423&doc=1&total=266&back=2&g=]

[#fn53-ptr1]

54. From Advanced Digital Information Corp., Sept. 1, 1999, SEC filing; [formerly
http://www.tenkwizard.com/fil_blurb.asp?iacc=991114&exp=terabytes%20and%20online&g=]. [#fn54-ptr1]
55. See http://www.10kwizard.com/. [http://www.10kwizard.com/]

[#fn55-ptr1]

56. Though the Open Directory is licensed to many sites, including prominently Lycos and
Netscape, it maintains its own site at http://dmoz.org. [http://dmoz.org/] An example of a node
reference for a static page that could be indexed by a search engine is:
http://dmoz.org/Business/E-Commerce/Strategy/New_Business_Models/E-Markets_for_Businesses/ [formerly http://dmoz.org/Business/E-Commerce/Strategy/New_Business_Models/E-Markets_for_Businesses/]. One characteristic of
most so-called search directories is they present their results through a static page structure. There
are some directories, LookSmart most notably, that present their results dynamically. [#fn56-ptr1]
57. As of Feb. 22, 2001, the Open Directory Project was claiming more than 345,000 categories.
[#fn57-ptr1]

58. See previous reference. This number of categories may seem large, but is actually easily
achievable, because subject node number is a geometric progression. For example, the URL
example in the previous reference represents a five-level tree: 1 - Business; 2 - E-commerce; 3 - Strategy; 4 - New Business Models; 5 - E-markets for Businesses. The Open Directory has 15 top-level
node choices, on average about 30 second-level node choices, etc. Not all parts of these subject
trees are as complete or "bushy" as other ones, and some branches of the tree extend deeper
because there is a richer amount of content to organize. Nonetheless, through this simple
progression of subject choices at each node, one can see how total subject categories - and the static
pages associated with them for presenting results - can grow quite large. Thus, for a five-level
structure with an average number of node choices at each level, Open Directory could have ((15 *
30 * 15 * 12 * 3) + 15 + 30 + 15 + 12) choices, or a total of 243,072 nodes. This is close to the
248,000 nodes actually reported by the site. [#fn58-ptr1]
59. See http://info.ox.ac.uk/bnc/. [http://info.ox.ac.uk/bnc/]

[#fn59-ptr1]

60. Assumptions: SURFACE WEB: for single surface site searches - 16% coverage; for metasearch
surface searchers - 84% coverage [higher than NEC estimates in reference 4; based on empirical
BrightPlanet searches relevant to specific topics]; 4.5% quality retrieval from all surface searches.
DEEP WEB: 20% of potential deep Web sites in initial CompletePlanet release; 200,000 potential
deep Web sources; 13.5% quality retrieval from all deep Web searches. [#fn60-ptr1]
61. Online Computer Library Center, Inc., "June 1999 Web Statistics," Web Characterization
Project, OCLC, July 1999. See the Statistics section in http://wcp.oclc.org/ [http://wcp.oclc.org/] .
[#fn61-ptr1]

62. Most surveys suggest the majority of users are not familiar or comfortable with Boolean
constructs or queries. Also, most studies suggest users issue on average 1.5 keywords per query;
even professional information scientists issue 2 or 3 keywords per search. See further
BrightPlanet's search tutorial at http://www.completeplanet.com/searchresources/tutorial.htm.
[#fn62-ptr1]

63. See, as one example among many, CareData.com, at [formerly
http://www.citeline.com/pro_info.html]. [#fn63-ptr1]

Some of the information in this document is preliminary. BrightPlanet plans future revisions as
better information and documentation is obtained. We welcome submission of improved
information and statistics from others involved with the Deep Web. Copyright BrightPlanet
Corporation. This paper is the property of BrightPlanet Corporation. Users are free to copy and
distribute it for personal use.

Michael K. Bergman may be reached by e-mail at mkb@brightplanet.com


[mailto:mkb@brightplanet.com] .

Links from this article:

10Kwizard http://www.10kwizard.com
About.com http://www.about.com/
Agriculture.com http://www.agriculture.com/
AgriSurf http://www.agrisurf.com/agrisurfscripts/agrisurf.asp?index=_25
AltaVista http://www.altavista.com/
Bluestone [formerly http://www.bluestone.com]
Excite http://www.excite.com
Google http://www.google.com/
joefarmer [formerly http://www.joefarmer.com/]
LookSmart http://www.looksmart.com/
Northern Light http://www.northernlight.com/
Open Directory Project http://dmoz.org
Oracle http://www.oracle.com/
Patent and Trademark Office http://www.uspto.gov
Securities and Exchange Commission http://www.sec.gov
U.S. Census Bureau http://www.census.gov
Whois http://www.whois.net
Yahoo! http://www.yahoo.com/

Product of Michigan Publishing, University of Michigan Library

jep-info@umich.edu

ISSN 1080-2711

The Structure and Content of Online Child Exploitation Networks

Richard Frank
School of Computing Science, School of Criminology
Simon Fraser University
Burnaby, BC, Canada
rfrank@sfu.ca

Bryce Westlake
School of Criminology
Simon Fraser University
Burnaby, BC, Canada
bwestlak@sfu.ca

Martin Bouchard
School of Criminology
Simon Fraser University
Burnaby, BC, Canada
mbouchard@sfu.ca

ABSTRACT
The emergence of the Internet has provided people with the ability
to find and communicate with others of common interests.
Unfortunately, those involved in the practices of child
exploitation have also received the same benefits. Although law
enforcement continues its efforts to shut down websites dedicated
to child exploitation, the problem remains uncurbed. Despite this,
law enforcement has yet to examine these websites as a network
and determine their structure, stability and susceptibility to
attack. We extract the structure and features of four online child
exploitation networks using a custom-written webpage crawler.
Social network analysis is then applied with the purpose of
finding key players: websites whose removal would result in the
greatest fragmentation of the network and largest loss of hardcore
material. Our results indicate that websites do not link based on
the hardcore content of the target website; however, blogs do
contain more hardcore content per page than non-blog websites.

Categories and Subject Descriptors


H.5.3 [Information Interfaces and Presentation]: Group and
Organization Interfaces - Web-based interaction

General Terms
Algorithms, Measurement, Security, Human Factors

Keywords
Child exploitation, social network analysis, target prioritization,
Internet

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, to republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
ISI-KDD 2010, July 25, 2010, Washington, D.C., USA
Copyright 2010 ACM ISBN 978-1-4503-0223-4/10/07 $10.00.

1. INTRODUCTION
It is estimated that 1.8 billion individuals worldwide use the
Internet, with 260 million users being from North America [13].
Of the 1.8 billion users, adolescents and college students make up
the largest proportion [10, 24, 31]. Through access at home and at
school, it is estimated that 90% of youth have regular access to the
Internet [5]. Although the vast majority of individuals who use the
Internet for sexual pursuits do so in a safe and legal way [4, 9],
the anonymity of the Internet has resulted in a growing percentage
who sexually solicit youth [23]. What makes this problem worse
is the ease with which one can obtain illegal pornographic
material [30, 35]. Searching the words "boy", "teen", or "child"
brings up countless websites and photos of youth in sexually
exploitive roles [24, 34].
The growth of the Internet has resulted in a substantial
increase in research aimed at understanding online networks [8,
17, 29, 33]. However, most of the research to date has focused on
the structure of social networking websites such as Facebook and
MySpace, and has stopped short of investigating child
exploitation networks. This is despite the United Nations
announcement that there are more than four million websites
containing child pornography [6].
Many of the existing efforts to curb child exploitation have
taken the form of Internet chat room stings and injunctions against
online groups seen to be facilitating the proliferation of child
sexual abuse (e.g., North American Man-Boy Love Association,
Pedophile Information Network, Freespirit and BoyChat). At
times this process has run into roadblocks from those who
argue Internet stings are a form of entrapment¹ [7]. In addition,
website owners often find loopholes, arguing that their websites
are merely support forums that do not host exploitative material
and that they cannot be held responsible for the private messages
people send back and forth, which may or may not contain
information on obtaining illegal material².
¹ One such example is the FBI posting fake links to explicit
images of children and then raiding the homes of those who
clicked on the links [20].

² For instance, one of the best-known sites, Free Spirits, states
that the sites linked from these pages are operated by private
citizens exercising their right to free speech under the U.S.
Constitution and Universal International Human Rights
Convention [12].

As online child exploitation is seen as a global issue, the
International Criminal Police Organization
(INTERPOL) has taken a leading role in addressing the problem.
One of the ways child exploitation has been combated is with the
creation of a database containing all known sexually explicit
photos of children (the International Child Sexual Exploitation
image database) [14]. Additionally, INTERPOL partners with the
COSPOL Internet Related Child Abuse Material Project and the
Virtual Global Taskforce to help coordinate multi-country
investigations and spread awareness of the problem. These efforts
have had some results. In 2001, a thirteen-country operation,
organized by the British National Crime Squad, resulted in the
arrest of 107 suspected members of the Wonderland Club, the
largest Internet pedophile ring [28]. This resulted in the
conviction of seven individuals and the confiscation of 750,000
images and 1,800 videos, containing 1,263 identifiable children³.

Algorithm CENE(StartPage, PageLimit, WebsiteLimit, Keywords(), BadWebsites())
 1: Queue() ← {StartPage}                                       //initialize variables
 2: KeywordsInWebsiteCounter() ← 0, LinkFrequency() ← {}, WebsitesUsed() ← {}, FollowedPages() ← {}
 3: while |FollowedPages| < PageLimit and |Queue| > 0
 4:    P ← Queue(1), DP ← domain of P                           //start evaluating next page in queue
 5:    if DP ∉ WebsitesUsed() and |WebsitesUsed| < WebsiteLimit then
 6:       WebsitesUsed() ← WebsitesUsed() + DP
 7:    if DP ∈ WebsitesUsed() and DP ∉ BadWebsites() then
 8:       PageContents ← Retrieve page P                        //evaluate this page
 9:       FollowedPages ← FollowedPages + P
10:       if PageContents contains Keywords()
11:          KeywordsInWebsiteCounter() ← get frequency of all Keywords()
12:          LinksToFollow() ← all {href} elements in PageContents
13:          for each L in LinksToFollow()
14:             if L ∉ Queue() and L ∉ FollowedPages
15:                Queue() ← Queue() + L
16:             DL ← domain of L
17:             LinkFrequency(DP, DL) ← LinkFrequency(DP, DL) + 1
18:          KeywordsInWebsite(DP) ← KeywordsInWebsite(DP) + KeywordsInWebsiteCounter()
19: return WebsitesUsed(), KeywordsInWebsite(), LinkFrequency()
Figure 1 - Algorithm CENE

Although a lot of time and money has been invested in
various units across the world, child exploitation is nowhere near
under control. The best available statistics suggest that less than
1% of all virtual pedophiles are apprehended [22]. This is not
necessarily an attack against law enforcement, but rather speaks to
the extent of the problem. With so many websites containing child
sexual abuse images (and videos), and the limited resources
available to various organizations to combat the problem, there
need to be continued efforts to automate and simplify the process
of selecting and prioritizing targets for the purpose of criminal
investigation. With the cessation of online child exploitation
unlikely, the focus needs to be on the severity and exposure of the
content rather than simply the presence of the content.
Social Network Analysis (SNA) is a tool that can be used to
fulfill these objectives. SNA focuses on the patterns of
connections among various entities, whether they are individuals,
organizations or, in our case, websites. It has been shown to be a
valuable tool for criminologists and law enforcement in
determining the organizational structure of various criminal
networks, from street gangs [21, 27] and drug trafficking
organizations [19, 26, 25] to terrorist groups [18]. It has also been
used to analyze the online communication of terrorist groups by
collecting information from terrorist forums on the web [16].
Prior to the Internet, child exploitation could have been
viewed as more of a solitary crime with very sparse networks [2].
Although images and videos were transferred through the mail,
the speed of the exchange was low and the chance of getting
caught sending material was high. More importantly, it was
difficult for people to find and get in contact with one another.
However, the advent of the Internet has changed the crime of
child exploitation and sexual abuse [32]. With many websites
(e.g., Bliss and Rene Guyon Society) outwardly supporting
relationships between young children and adults, the ease with
which material can be obtained and shared has grown
exponentially. To our knowledge, child exploitation websites have
not yet been studied from the perspective of a network. Doing so
has direct implications for law enforcement agencies involved in
targeting websites and offenders through the Internet as it allows
them to determine the key websites to target.
The current study develops a method to extract child
exploitation networks, map their structure and analyze their
content. Our objective is to uncover the structure of online child
porn networks, and to identify their hardcore key players:
websites whose removal would result in the greatest fragmentation
of the network and largest loss of hardcore material. From a
law-enforcement perspective, this would allow the prioritization of
targets, limiting them to highly connected websites that also display
more harmful content.

³ The children in the images and videos ranged from 3 months to
16 years. The majority were under the age of 10 with many
being 2 or 3 years old.

2. METHODS
We propose a method to undertake this analysis efficiently
by extracting networks of websites, and their features, then
creating measures to determine the severity of content on each
website and its importance within the network. We use a
custom-written web-crawler which, given a starting webpage, will
recursively follow the links out of that webpage, until some
termination conditions apply. During this process, in order to
construct a coherent network for analysis, the web-crawler
establishes the links between websites and collects statistics on
the type of content on the pages hosted on that website. The
algorithm we designed to do this is described below, along with a
description of the networks extracted.

Figure 2 - The 4 networks: a) Blog A, b) Blog B, c) Site A, d) Site B

2.1 NETWORK EXTRACTION


For this paper, we use a custom-written crawler (Figure 1),
called Child Exploitation Network Extractor (CENE) to extract
the network structure and statistics (features) of the network for
analysis. A variety of starting locations can be used to extract
multiple networks for comparison purposes. For each network
extracted, features are collected about the content of the pages and
the links between them. The statistics are then aggregated up to
the website level. For example, the features for www.website.com
are calculated from the statistics collected from all pages on that
website.

A few conditions were used to keep the network manageable
in size and relevance. Since the Internet is extremely large and a
crawler would most likely never stop crawling, we had to
implement limits into CENE in two ways. First, to keep the
network extraction time bounded, a limit was put on the number
of pages retrieved (PageLimit, line 3). Second, the network size
was fixed at a specific number of websites (WebsiteLimit, line 5).
This was done so that the network extracted would be focused on
websites dealing only with the specified topic. The end result of
this process is a network where all the websites in the network are
sampled approximately equally, with (PageLimit/WebsiteLimit) pages
being sampled per website.
Website   # of Pages on      Severity   % of All Websites   Degree Centrality
          Starting Website   Score      Connected To It     (Normalized)
Blog A    285                62.93      100.0               100.0
Blog B    583                 1.82       81.8                81.6
Site A    237                80.19       78.8                78.6
Site B      1                 2.00       27.1                68.4
Figure 3 - Description of Starting Websites

Network                      Blog A        Blog B        Site A        Site B
Density (Ties)               0.13 (1214)   0.21 (2006)   0.09 (866)    0.04 (371)
Density by Severity Score
   High                      0.23 (n=22)   0.37 (n=27)   0.11 (n=23)   0.08 (n=27)
   Low                       0.12 (n=77)   0.16 (n=72)   0.08 (n=76)   0.02 (n=69)
Clustering Coefficient       0.39          0.48          0.28          0.22
Fragmentation                0.04          0.06          0.04          0.02
Centralization
   Out-Degree                88.38%        61.58%        70.36%        65.03%
   In-Degree                 71.89%        31.65%        60.05%        36.31%
Reciprocity                  0.25          0.37          0.17          0.02
Figure 4 - Social Network Analysis Summary

In order to keep the network extraction process relevant, and
on the chosen topic, a set of websites (BadWebsites) and a set of
keywords (Keywords) were also defined. BadWebsites contained
websites known to be safe and assumed to not host any pages
relevant to child exploitation. Examples of these websites
included www.microsoft.com and www.google.com. Without
these made explicit, the crawler could wander into a search engine,
leading it completely off topic and making the resulting network
irrelevant to the specified topic. Keywords also gave CENE some
boundaries which guided it during the exploration. For the
crawler to include the page being analyzed, at least one keyword
from Keywords had to exist on it (line 10). If a keyword existed on
the page, the page was assumed to be relevant to the network and the
statistics on that webpage were calculated (line 11). The links
pointing out of the page were also retrieved (line 12) and added to
the queue of pages to visit if they had not been visited yet (lines
14-16). If, however, no keywords existed on the page, it was
discarded and none of its links were followed.
In order to construct the features of the network, the links
between websites were tracked (line 17), as well as the occurrence
of each keyword aggregated to the website level (line 18). Thus,
all pages on a website contributed to the features for that website.
This allowed for the construction of a coherent network, complete
with features assigned to both the websites and links (line 19).
Because of the keywords and the set of websites CENE could not
explore, the network constructed remained on topic.
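
The website-level aggregation described above (lines 17 and 18 of Figure 1) can be illustrated with a minimal Python sketch. It is not CENE itself: it performs no crawling and only rolls up page records that are assumed to have been collected already. The function name aggregate_to_websites, the record layout, the placeholder keywords k1 and k2, and the example domains are all hypothetical and chosen for illustration.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse


def aggregate_to_websites(pages):
    """Roll page-level statistics up to the website (domain) level.

    `pages` is an iterable of (url, keyword_counts, out_links) records;
    this record layout is a hypothetical one chosen for the sketch.
    """
    keywords_in_website = defaultdict(Counter)  # domain -> keyword totals
    link_frequency = Counter()                  # (source domain, target domain) -> count
    for url, keyword_counts, out_links in pages:
        dp = urlparse(url).netloc               # DP: domain of the page
        keywords_in_website[dp].update(keyword_counts)
        for link in out_links:
            dl = urlparse(link).netloc          # DL: domain of the linked page
            link_frequency[(dp, dl)] += 1
    return keywords_in_website, link_frequency


# Placeholder records: two pages on one website, linking to a second website.
pages = [
    ("http://a.example/1", {"k1": 2}, ["http://b.example/x", "http://a.example/2"]),
    ("http://a.example/2", {"k1": 1, "k2": 3}, ["http://b.example/y"]),
]
keywords_in_website, link_frequency = aggregate_to_websites(pages)
```

In this sketch, links between pages of the same website are counted as well, which mirrors line 17 of Figure 1 as written; whether such self-links were kept in the final networks is not stated in the text.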

2.2 CHILD EXPLOITATION NETWORKS


Four websites were randomly chosen as starting points
through four separate search engine searches, using the keywords
"boy" and "love". A search using the keywords "girl"/"lolita"/"lolli"
and "love" was also attempted, but the results were unsatisfactory,
leading to adult pornography websites. Of the four starting
websites, two were based on user-generated posts, referred to as a
"Blog", and two had the traditional structure of interlinking pages,
simply called a "Site". These were selected so that we could
compare the findings within type (Blog A vs. Blog B and Site A
vs. Site B) and between type (Blogs vs. Sites). Although this type
of content can most likely be found via other Internet Protocols,
such as IRC, NNTP (Usenet) or FTP, they require different types
of analysis and hence were not included in this study. Forums are
also significantly different from Blogs and Sites in that they
require a slightly different approach for access, extraction, and
analysis.
Each resulting network was analyzed for the presence of
specific keywords which were divided between two categories:
hardcore and softcore words. Although the focus of this study
is on the most harmful content, it was important to collect a
broader range of keywords for comparative and network
extraction purposes (see above). Keywords labeled as hardcore
were those with explicit sexual content: mastur*, sex, penis,
vagina, anal/anus, oral, virgin, and naked/nude. The softcore
words were: boy, girl, child, love, teen, lolli, young, and bath*. As
smooth/hairless could be found in both hardcore and softcore
settings, they were included in both categories. Each network was
capped at 100 websites and 50,000 pages. Therefore, the networks
analyzed should not be viewed as complete networks but rather
samples of larger networks. CENE retrieved up to 25 pages in
parallel, requiring between 6-12 hours to extract each network.

3. RESULTS
First, we draw on SNA to examine the structure of the four
extracted networks. More specifically, we derive the following
measures:
- Density: the percentage of network connections present
  in relation to all possible network connections [11, 15]
- Clustering coefficient: the likelihood that two websites,
  both connected to another website, are connected to
  each other [11, 19]
- Fragmentation: the percentage of the network
  connections disconnected by the removal of any one
  website [3]
- In-degree centrality: for website a, the number of
  other network websites that link to a
- Out-degree: the number of other websites that
  website a links to [11]
- Reciprocity: the proportion of websites that reference
  one another [11, 15]

Figure 5 - Site A network after removal of the website with
the highest fragmentation score

Second, we analyze the content of the networks through the
keyword analysis developed earlier. We compare the relative
severity of content by examining the mean number of hardcore
words present per page. Third, we turn to the main research
question examined in this paper: are the Blogs/Sites with the most
harmful content also the most central in the overall network? We
do so by comparing network centrality measures to a severity
score: number of hardcore words found per page.
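
The structural measures defined above can be sketched in a few lines of Python on a toy directed network. This is an illustration under stated assumptions, not the software used in the study: reciprocity is taken here as the share of directed ties whose reverse tie also exists, the clustering coefficient is computed as the average local clustering with ties treated as undirected, and the node names are placeholders.

```python
from itertools import combinations


def density(nodes, edges):
    """Share of all possible directed ties (no self-links) that are present."""
    n = len(nodes)
    return len(edges) / (n * (n - 1)) if n > 1 else 0.0


def in_degree(nodes, edges):
    """Number of other websites linking to each website."""
    return {v: sum(1 for _, t in edges if t == v) for v in nodes}


def out_degree(nodes, edges):
    """Number of other websites each website links to."""
    return {v: sum(1 for s, _ in edges if s == v) for v in nodes}


def reciprocity(edges):
    """Assumed definition: share of directed ties whose reverse tie also exists."""
    return sum(1 for s, t in edges if (t, s) in edges) / len(edges) if edges else 0.0


def clustering_coefficient(nodes, edges):
    """Average local clustering, treating ties as undirected (an assumption)."""
    neighbours = {v: set() for v in nodes}
    for s, t in edges:
        neighbours[s].add(t)
        neighbours[t].add(s)
    scores = []
    for v in nodes:
        k = len(neighbours[v])
        if k < 2:
            scores.append(0.0)
            continue
        linked = sum(1 for a, b in combinations(neighbours[v], 2) if b in neighbours[a])
        scores.append(linked / (k * (k - 1) / 2))
    return sum(scores) / len(scores) if scores else 0.0


# Toy network of four placeholder websites.
nodes = ["a", "b", "c", "d"]
edges = {("a", "b"), ("b", "a"), ("a", "c"), ("c", "b"), ("d", "a")}
```

For this toy network, five of the twelve possible directed ties are present, and two of the five ties are reciprocated.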

3.1 NETWORK STRUCTURE


Figure 2 shows the four networks extracted, with a triangle
towards the upper left corner denoting the starting location for
each network. The circles denoting the websites vary in size based
on the severity score (number of hardcore words/page). As shown
in Figure 3, the final networks consisted of fewer than 100
websites as a few of them were aliases for each other and were
consequently merged. The starting Blogs and Sites differed in size
and content. For example, the starting blog for Blog A comprised
285 pages, averaging 63 hardcore words per page, while Blog B's
starting blog was 583 pages in size, with an average of 2 hardcore
words per page. Site A's starting point was 237 pages and 80
hardcore words per page and Site B was 1 page and averaged 2
hardcore words per page⁴. Such variety was an advantage as one
of the objectives of the paper was to examine whether such
differences led to different network structure and content.
⁴ The starting page for Site B was a front page for a much larger
site. For example, all sections of the website www.hostsite.xxx
followed the url www.section.hostsite.xxx. Therefore, the
number of pages and hardcore words are low as there were no
additional pages on the front page.

Keyword            Blog A   Blog B   Site A   Site B
Boy                60.82    35.89    55.78    70.59
Girl                0.61     4.90     0.43     4.54
Child               1.42     4.20     1.06     6.42
Love                6.75    30.53    19.15     7.66
Teen                4.09     2.30     4.04     0.95
Lolli*              0.00     0.15     0.00     0.04
Young               2.42     3.73     1.70     0.48
Bath*               0.02     0.01     0.12     0.04
Innocent            0.01     0.04     0.00     0.03
Smooth/Hairless     0.21     0.27     0.16     0.41
Mastur*             0.65     0.03     0.27     0.03
Sex                 9.58     8.95     8.70     2.46
Penis               2.76     1.10     0.30     0.06
Vagina              0.00     0.12     0.00     0.03
Anal                4.23     3.38     1.03     4.17
Oral                0.74     0.45     0.35     0.55
Naked               5.42     3.27     6.70     0.81
Virgin              0.06     0.40     0.04     0.32
Figure 6 - Percentage of All Keywords Found in Network

A simple visual examination of the networks in Figure 2
reveals that different structures emerged for all four. Figure 4
provides more details on the similarities and differences that
emerged. First, we found that the Blog networks were much
denser than the Site networks. Blog A and Blog B had a density of
0.13 and 0.21, respectively, compared to 0.09 and 0.04 for the
two Site networks. The clustering coefficients were also higher for
the Blog networks at 0.39 (A) and 0.48 (B) compared to 0.28 (A)
and 0.22 (B) for the Site networks. As the clustering coefficient
for each network is more than double its network density, this
indicates that the average densities of individual neighborhoods
(websites) are diverse in size and dominated by several large,
highly connected websites. One of the reasons why the Blog
networks are denser is their higher levels of reciprocity.
As shown in Figure 4, Blog A and B had high levels of
reciprocation (25% and 37%), while Site A and B had much lower
levels of reciprocal ties (17% and 2%). This also ties in quite well
with (similarly focused) blogs being perceived as a community
of sorts, at least more so than other types of websites. The
significantly lower level of reciprocity for Site B may be
attributed to the high number of dead websites (19) or websites
without any of our keywords (24). However, the lack of
reciprocity may again be a precautionary tactic from the owners.
In the world of blogs there are few repercussions for being
found to have illicit material besides getting shut down.
However, for an independent website, the risk is a lot greater as
individuals are tied to it through website registration and hosting
services. This increased risk may limit the number of reciprocal
ties that are present. Furthermore, as search engines rank pages
based on their popularity, having more links to a site increases its
exposure on search engines, which in turn likely increases the
possibility of being shut down.
                                  Blog A              Blog B               Site A               Site B
Number of Websites in Network     99                  99                   99                   96
Number of Pages/Website
   Range                          0-651               0-470                0-1,420              0-1,575
   Average                        405                 265                  268                  394
Hardcore Words
   Average (Range)                1501 (0-14,203)     9352 (0-133,526)     7435 (0-107,016)     1287 (0-21,226)
   Average/Page (Range)           3 (0-27)            38 (0-583)           52 (0-593)           3 (0-30)
   % of Keywords                  23.64               17.98                17.55                8.83
Softcore Words
   Average (Range)                6,847 (0-41,588)    30,214 (0-298,602)   34,934 (0-63,951)    13,283 (0-617,748)
   Average/Page (Range)           15 (0-93)           108 (0-1061)         97 (0-896)           39 (0-546)
   % of Keywords                  76.37               82.02                82.45                91.17
Total Words
   Range                          0-45,061            0-380,348            0-746,526            0-618,586
   Average                        8,348               39,566               42,369               14,570
   Network Total                  3,917,045           826,441              4,194,544            1,398,756
Figure 7 - Word content analysis for the four networks

Second, we found that content matters in determining the
overall structure of the network. When dividing the network
between those with higher severity scores (greater than the
network average) and those with lower scores, we found that the
Blogs/Sites with the most harmful content were more likely to be
connected to each other. Figure 4 shows, for example, that the
network density was always higher for those websites compared to
others.
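
The comparison of densities between higher- and lower-severity websites can be sketched as follows. This is a minimal Python illustration, not the study's code: it assumes that each subgroup's density is computed only over ties among members of that subgroup, and the severity values, node names, and function names are hypothetical.

```python
def split_by_severity(severity):
    """Split websites into those above the network-average severity and the rest."""
    mean = sum(severity.values()) / len(severity)
    high = {w for w, s in severity.items() if s > mean}
    low = set(severity) - high
    return high, low


def subgroup_density(members, edges):
    """Density among the members only (assumed: ties to outsiders are ignored)."""
    n = len(members)
    if n < 2:
        return 0.0
    ties = sum(1 for s, t in edges if s in members and t in members)
    return ties / (n * (n - 1))


# Placeholder severity scores (hardcore words per page) and directed ties.
severity = {"a": 10.0, "b": 8.0, "c": 1.0, "d": 1.0}
edges = {("a", "b"), ("b", "a"), ("a", "c"), ("c", "d")}

high, low = split_by_severity(severity)
high_density = subgroup_density(high, edges)
low_density = subgroup_density(low, edges)
```

In this toy case the two high-severity websites link to each other in both directions, so their subgroup density is higher than that of the low-severity pair, which is the pattern reported in Figure 4.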
Overall, these results suggest that each of the networks is
dominated by several mega websites, or key players. Initially,
this does not seem to be the case as the fragmentation scores were
very low for each network: Blog A (0.04), Blog B (0.06), Site A
(0.04), and Site B (0.02). Recall that this score indicates that the
removal of any random website would result in a loss of 2-6% of
connections within the network. However, a targeted removal of a
website may produce more disruption within the networks. For
example, the removal of the starting blog for the Blog A network
would result in a 16% fragmentation. This was followed by a
second blog, whose removal would result in a 10% fragmentation
of the network independent of the removal of the starting blog.
For Site A, Figure 5 illustrates how the removal of the starting
website for the network results in a 62% fragmentation of the
network. For Site B, the removal of the starting website would
only result in a 6% disruption; however, there were two other
websites, whose removal would result in a fragmentation of 48%
each. This indicates that substantial fragmentation can occur if the
proper websites are targeted by law enforcement agencies. That is,
removing a random website does little to disrupt the network;
however, targeting specific websites that link to a lot of other
websites can have a larger impact on the online network.
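
The effect of a targeted removal can be illustrated with a small Python sketch. The paper does not spell out its fragmentation formula, so the sketch assumes a pair-based measure in the spirit of the key player approach [3]: the share of website pairs that can no longer reach each other, with ties treated as undirected. The function names and the toy network are hypothetical.

```python
def components(nodes, edges):
    """Connected components, treating ties as undirected."""
    neighbours = {v: set() for v in nodes}
    for s, t in edges:
        neighbours[s].add(t)
        neighbours[t].add(s)
    seen, result = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            v = stack.pop()
            if v in group:
                continue
            group.add(v)
            stack.extend(neighbours[v] - group)
        seen |= group
        result.append(group)
    return result


def fragmentation(nodes, edges):
    """Assumed measure: share of ordered website pairs that cannot reach each other."""
    n = len(nodes)
    if n < 2:
        return 0.0
    reachable = sum(len(c) * (len(c) - 1) for c in components(nodes, edges))
    return 1 - reachable / (n * (n - 1))


def fragmentation_after_removal(nodes, edges, target):
    """Fragmentation of the network that remains once `target` is removed."""
    rest = [v for v in nodes if v != target]
    kept = {(s, t) for s, t in edges if target not in (s, t)}
    return fragmentation(rest, kept)


# Toy network: hub "h" ties three placeholder websites together.
nodes = ["h", "a", "b", "c"]
edges = {("h", "a"), ("h", "b"), ("h", "c"), ("a", "b")}
hub_removed = fragmentation_after_removal(nodes, edges, "h")
leaf_removed = fragmentation_after_removal(nodes, edges, "c")
```

Removing the hub leaves one website isolated from the other two, whereas removing a peripheral website leaves the rest fully connected, which mirrors the contrast between random and targeted removal described above.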

3.2 NETWORK CONTENT


CENE collected statistics about every single page on the
website which was crawled. This was done through the frequency
of keywords on each page which was aggregated up to the website
level. For a complete list of keywords, and their frequencies, see
Figure 6.
Given that our initial search engine search was boy-centered,
it comes as no surprise that 90.2% of the websites across networks
were classified as such, while the other 9.8% were girl-centered.
This was based on the higher ratio of "boy" to "girl" references on
the website. Blog A and B were 100% and 83% boy-centered,
while Site A and B were 99% and 75%.
It is important to note that there was some cross-over
between boy- and girl-centered websites. However, given the small
number of girl-centered websites, it is unclear whether the roughly
10% of websites classified as girl-oriented is evidence of the two
network types being connected or simply due to chance. Regardless,
most websites within the network seemed to be predominantly
boy-centered or girl-centered. This implies that child exploitation
websites do not mix boy/girl material; rather they tend to focus on
a specific gender, possibly impacting the choice of strategies for
police investigations.
Across networks, 81.3% of keywords found belonged to the
group we defined as "softcore", while 18.7% belonged to
"hardcore". Figure 6 shows that the most common keywords were
"boy" (58.1%), "love" (13.8%), "sex" (8.2%), "nude/naked" (5.1%),
and "anus/anal" (2.3%).
Figure 7 presents the results of the content analysis of the
four networks extracted. All networks contained the same average
number of pages per website; however, blogs had higher counts of
hardcore words, expressed as higher severity scores per page
(16.2 to 13.4) and per website (5426.3 to 4408.4). Despite this,
Figure 7 shows that Site A and B had the larger ranges. Blogs
were fairly consistent, while there was a wide range of values
obtained from Sites. Additionally, Blogs contain more hardcore
content per page. This could be attributed to the ease of setting up
a blog as well as the increased anonymity afforded to the operator.
This is in comparison to sites, whose operators' personal
information is linked to the website and who are thus at an increased
risk of facing formal charges.
Website    Blog A                Blog B                Site A                Site B
Ranking    In-degree  Severity   In-degree  Severity   In-degree  Severity   In-degree  Severity
1          82           1.20     50           2.05     77          80.19     38           3.50
2          80           1.20     49          14.17     23           4.01     15           1.12
3          79           1.00     48           3.61     22          62.00     13           1.17
4          79           1.13     47           4.16     19          29.00     12          10.04
5          78           0.91     46           2.34     18          17.36     10           9.64
6          30          62.93     45           2.23     17         119.00     -            0.15
7          26           1.13     45           4.31     17          36.00     -            0.33
8          25          51.21     45           3.27     16          39.56     -            0.75
9          22          20.94     43           7.89     16          12.39     -           11.00
10         22          10.07     42           1.97     15         137.00     -            3.76
Mean for
Top 10     52.30       15.17     46.00        4.60     18.40       53.65     13.10        4.15
Mean for
Network    12.26       38.26     20.26        3.16      8.75       52.21      3.86        2.98
Figure 8 - Top 10 In-degree websites in each network compared to the overall network

To determine whether severity was related to the number of
links coming into, and going out of, a website, the total number of
hardcore words per website and per page was correlated with in-
and out-degree centrality. Although none of the correlations were
significant, the pattern suggested that hardcore blogs and sites
have a tendency to reach out more to others (r=0.10, 0.13, 0.12, 0.05)
than others reach out to them (r=-0.09, 0.04, -0.16, 0.12).
These findings support the previous analyses that there are mega
websites with a lot of material and a lot of connections, as well as
small independent websites, with only a little bit of material and
relatively unconnected to the rest of the network. Put another way,
the mean number of hardcore words per website and per page is
mainly driven by several extreme websites on both ends (websites
with a lot of content and websites with little to no content).
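
The correlation between severity and degree centrality can be sketched with a plain Pearson coefficient. This assumes Pearson's r was the statistic reported (the text gives only the r values); the helper name pearson and the sample values are placeholders and do not come from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Placeholder values: severity per website paired with its out-degree and in-degree.
severity = [1.0, 2.0, 3.0, 4.0]
out_deg = [2, 4, 6, 8]    # rises with severity
in_deg = [8, 6, 4, 2]     # falls with severity

r_out = pearson(severity, out_deg)
r_in = pearson(severity, in_deg)
```

With these placeholder values the two coefficients sit at the extremes (perfectly positive and perfectly negative); the coefficients reported above are far weaker and not significant.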

3.3 FINDING THE HARDCORE KEY PLAYERS


The last set of results focuses on the question at the core of
this paper: are the most central websites also the ones hosting the
most serious content? Figure 8 examines the top 10 most central
websites (sorted by in-degree) and compares their severity score
to the overall network. The results show that the top 10 most
central websites were no more or no less likely to contain
offensive material. Instead, there seemed to be a more or less
random variation in the four networks analyzed. Thus, offensive
material on a website does not seem to influence centrality.
Figure 9 proceeds the other way around, listing the top 10
websites with the highest severity score and examines their
centrality. These results are similar: the most hardcore websites
were no more or no less likely to be central players in their
networks. This procedure, however, allowed us to identify the
hardcore key players: websites that were both central in the
network and contained offensive content. For example, one of the
blogs with the most offensive material in the Blog A network
(62.93 score) had 30 other websites linking to it (rank 6, Figure
8). Compare this to the website ranking as number 2 for the Blog
B network in Figure 8 (and ranking 4 in Figure 9): a much lower
severity score of 14.17, but more websites (49) linking to it.
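
The identification of hardcore key players, websites that rank highly on both centrality and severity, can be sketched as the overlap of two top-k lists. This is one possible reading of the procedure rather than the authors' implementation; the function names, the cut-off k, and the toy scores are assumptions.

```python
def top_k(scores, k):
    """Websites with the k highest scores (ties broken by name for stability)."""
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


def hardcore_key_players(in_degree, severity, k):
    """Websites that are in the top k by in-degree AND the top k by severity."""
    most_severe = set(top_k(severity, k))
    return [w for w in top_k(in_degree, k) if w in most_severe]


# Placeholder scores for four websites.
in_degree = {"a": 5, "b": 4, "c": 1, "d": 0}
severity = {"a": 1.0, "b": 60.0, "c": 80.0, "d": 2.0}

players = hardcore_key_players(in_degree, severity, k=2)
```

Here only website "b" appears in both lists: "a" is central but carries little severe content, and "c" is severe but peripheral, which is the kind of mismatch reported in Figures 8 and 9.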

4. DISCUSSION
The Internet has changed the way society communicates and
obtains information. Despite the positive contributions the
Internet has made to society, it has also created a new avenue
where individuals can engage in criminal activity. Given the ease
with which people can communicate with one another, from all over
the world, the Internet has facilitated the proliferation of child
exploitation. The simplicity of obtaining and sharing this material
was clearly evident within the current study. Clearly, there is no
way to effectively eliminate online child exploitation. Therefore,
steps need to be taken to lessen the impact and severity, as well as
maximize the efficiency of current efforts. This is where SNA can
be of the greatest use to law enforcement.
The use of blog websites for child exploitation provides
plenty of advantages. If an individual were to set up their own
non-blog website, they would have to have some knowledge of
how to design a website as well as have the financial capital to
pay for the website. In addition, they would have to be cautious of
detection by law enforcement. However, blogs provide a much
cheaper, more efficient, and more anonymous way to distribute
material. Many blog webhosts such as Blogger, LiveJournal or
Sensualwritter provide members with free space to post their
blogs. This eliminates the out-of-pocket expense and knowledge
needed by an individual to set-up their own website. In addition,
and possibly the most important advantage, blog webhosts do not
verify personal information about their members. Therefore, the
blog webhost allows the individual to be completely anonymous.
Although each blog webhost has terms of service that state that
copyrighted or illegal material is not allowed, it is usually the
responsibility of patrons to report a blog containing material that
violates the terms of service. If a blog is found to be in violation of
the terms of service, it is usually removed by the webhost.
However, there is nothing preventing the owner of the blog from
creating a new account and starting the blog all over again.
Therefore, the removal of the blog could be viewed as no more
than a mild inconvenience for the blog creator.
As online child exploitation continues to grow and more websites
become hosts of material, the responsibility to combat the
problem does not rest solely on law enforcement. Recognizing the
immense number of hours and resources that go into finding
sexually explicit material online, one
of the largest search engines in the world, Google, has begun to
help. In conjunction with the National Center for Missing and
Exploited Children (NCMEC), Google announced the creation of
new software that would aid in organizing and indexing
NCMEC's information so that analysts can both deal with new
images and videos more efficiently and also reference historical
material more effectively [1]. However, blog webhosts need to
get on board as well. Considering that one of the top blog
webhosts, Blogger, is owned by Google, it should be possible for
Google Inc. to also use its image recognition software on Blogger.
Obviously, this process would have to be automated, as it would
be very difficult, and highly inefficient, for people to
routinely check all blogs for illegal material. However, its
implementation could have a substantial impact on curbing child
exploitation on blogs.

Website      Blog A               Blog B               Site A               Site B
Ranking      In-degree  Severity  In-degree  Severity  In-degree  Severity  In-degree  Severity
1            ?          583.08    ?          27.50     ?          593.00    ?          30.00
2            ?          543.14    12         24.07     ?          531.00    ?          26.00
3            ?          193.71    36         17.33     ?          431.00    ?          16.00
4            ?          177.53    49         14.17     ?          292.45    ?          14.92
5            15         130.06    18         12.97     ?          244.00    ?          13.48
6            10         128.05    30         11.39     ?          244.00    ?          12.00
7            ?          125.13    ?          11.11     ?          182.00    ?          11.00
8            14         123.95    20         10.40     11         149.88    ?          10.08
9            16         113.83    ?          10.00     ?          146.00    12         10.04
10           ?          95.81     37         8.90      ?          137.00    ?          10.00
Mean for
Top 10       9.50       221.43    21.5       14.78     3.20       295.03    4.60       15.35
Mean for
Network      12.26      38.26     20.26      3.16      8.75       52.21     3.86       2.98

Figure 9. Top 10 severe websites in each network compared to overall network.


5. CONCLUSIONS
The current study drew on social network analysis to
examine the content and structure of online child exploitation
networks. We extracted the structure and features of child
exploitation networks by performing a guided crawl of the
Internet. Our crawler, CENE, was guided by a set of keywords
and a list of excluded websites, which kept it on topic. This provided very
focused networks for analysis.
Using social network analysis we attempted to find the key
players: those websites displaying a combination of connectivity
and hardcore material. This analysis looked at two types of
websites, blogs and sites, covering four independent starting
points. Our results indicate, first, that the presence of hardcore
content is not the basis for linkages between websites and, second,
that blogs contain more hardcore content per page than sites.
This exploratory study has added to
our current understanding of online child
exploitation and has laid the groundwork for the incorporation
of SNA into future research on this topic. Subsequent research
needs to expand on the network size(s) and shift to a more
detailed analysis of the attributes, including the content of forums,
videos and pictures, as well as data on the number of people
visiting the websites. Finally, there needs to be a refinement of a)
the keywords list (are "hardcore" words truly hardcore?), b) the
list of websites the crawler cannot enter, and c) the criteria to
reduce the occurrence of false positives.

6. ACKNOWLEDGEMENTS
Partial funding for this project was provided by the
International Cybercrime Research Centre, Simon Fraser
University.

7. REFERENCES
1) Baluja, S. Building software tools to find child victims.
Retrieved April 2, 2010, from
http://googleblog.blogspot.com/2008/04/buildingsoftware-tools-to-find-child.html

2) Beech, A.R., Elliot, I.A., Birgden, A., & Findlater, D.
(2008). The Internet and child sexual offending: A
criminological review. Aggression and Violent Behavior,
13, 216-228.

3) Borgatti, S.P. (2003). The key player problem. In
R. Breiger, K. Carley, & P. Pattison (Eds.), Dynamic
social network modeling and analysis: Workshop
summary and papers (pp. 241-252). Washington, D.C.:
National Academy of Sciences Press.

4) Cooper, A., Scherer, C.R., Boies, S.C., & Gordon, B.L.
(1999). Sexuality on the Internet: From sexual exploration
to pathology expression. Professional Psychology:
Research and Practice, 30, 154-161.

5) Dretzin, R. (Writer), & Dretzin, R., & Maggio, J.
(Directors). (2008). Growing up online [Television series
episode]. In D. Fanning (Executive Producer), Frontline.

6)

Engeler, E. (2009, September 16). UN expert: Chil