
Text Categorization

Moshe Koppel

Lecture 5: Author Profiling


With Shlomo Argamon, Jonathan Schler, James Pennebaker, Kfir Zigdon and others

Profiling
In real life:
1. We don't have a closed set of candidate authors
2. We don't have writing samples from each of them
We can still try to say something about the author:
  Gender
  Age group
  Linguistic background

Which is Male/Female?
My aim in this article is to show that given a relevance theoretic approach to utterance interpretation, it is possible to develop a better understanding of what some of these so-called apposition markers indicate. It will be argued that the decision to put something in other words is essentially a decision about style, a point which is, perhaps, anticipated by Burton-Roberts when he describes loose apposition as a rhetorical device. However, he does not justify this suggestion by giving the criteria for classifying a mode of expression as a rhetorical device. Nor does he specify what kind of effects might be achieved by a reformulation or explain how it achieves those effects. In this paper I follow Sperber and Wilson's (1986) suggestion that rhetorical devices like metaphor, irony and repetition are particular means of achieving relevance. As I have suggested, the corrections that are made in unplanned discourse are also made in the pursuit of optimal relevance. However, these are made because the speaker recognises that the original formulation did not achieve optimal relevance.

The main aim of this article is to propose an exercise in stylistic analysis which can be employed in the teaching of English language. It details the design and results of a workshop activity on narrative carried out with undergraduates in a university department of English. The methods proposed are intended to enable students to obtain insights into aspects of cohesion and narrative structure: insights, it is suggested, which are not as readily obtainable through more traditional techniques of stylistic analysis. The text chosen for analysis is a short story by Ernest Hemingway comprising only 11 sentences. A jumbled version of this story is presented to students who are asked to assemble a cohesive and well formed version of the story. Their re-constructions are then compared with the original Hemingway version.

British National Corpus


920 documents labelled for
  author gender
  document genre

Used 566, controlled for genre:

  Fiction / Female            132
  Fiction / Male              132
  Non-fiction / Female        151
  Non-fiction / Male          151

Non-fiction documents by genre:

  Arts (Non-academic)          16
  Arts (Academic)              24
  Belief & Thought             24
  Biography                    54
  Commerce                     10
  Leisure                      16
  Science                      26
  Soc. Sci. (Non-ac.)          52
  Soc. Sci. (Ac.)              38
  World Affairs                42

Experiment
Features: 400+ function words (FW); 600+ POS n-grams
Learner: exponential gradient / linear SVM
Test: 10-fold cross-validation
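
The function-word part of such a feature set can be pictured as a vector of relative frequencies, one entry per word. The sketch below is only an illustration: the ten-word list `FUNCTION_WORDS`, the regular-expression tokenizer and the function name are assumptions, not the 400+ words or the tooling used in the experiment, and POS n-grams are left out because they need a tagger.

```python
import re
from collections import Counter

# Hypothetical, tiny function-word list; the lecture uses 400+ words.
FUNCTION_WORDS = ["a", "the", "as", "she", "for", "with", "not", "of", "and", "in"]

def function_word_features(text, vocab=FUNCTION_WORDS):
    """Return the relative frequency of each vocabulary word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())   # crude tokenizer (assumption)
    counts = Counter(tokens)
    total = len(tokens) or 1                        # avoid division by zero
    return [counts[w] / total for w in vocab]

vector = function_word_features("The cat and the dog")
print(dict(zip(FUNCTION_WORDS, vector)))
```

Each document then becomes one such vector, which is what the learner sees.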

Results per Feature Set


[Figure: bar chart of accuracy (%, axis from 50 to 85) for the feature sets FW, POS and FW+POS, shown for All docs, Fiction and Non-Fiction]

Handle fiction and non-fiction separately

Use full feature set

Results per Genre


Testing on Genre                    # of docs   Train on All   Train on Fiction   Train on Non-fiction
Fiction                                264         74.5            79.5
  Fiction / Female                     132         74.8            81.7
  Fiction / Male                       132         74.2            77.3
Non-fiction                            302         79.7                                82.6
  Non-fiction / Female                 151         79.2                                83.3
  Non-fiction / Male                   151         80.2                                81.9
  Arts (Non-academic)                   16         76.0                                76.3
  Arts (Academic)                       24         75.6                                77.5
  Belief & Thought                      24         85.0                                85.0
  Biography                             54         87.0                                90.0
  Commerce                              10         60.0                                84.0
  Leisure                               16         85.7                                81.3
  Science                               26         74.2                                78.5
  Social Science (Non-academic)         52         77.5                                83.0
  Social Science (Academic)             38         82.9                                78.4
  World Affairs                         42         79.2                                82.9

Learning-Based Feature Reduction


1. Apply learning algorithm
2. Eliminate features with low weights
3. Learn again
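
The loop of learning, dropping low-weight features and learning again can be sketched as follows. This is a minimal illustration under stated assumptions: a plain perceptron stands in for the learner (the experiment used exponential gradient / linear SVM), and the names `train_perceptron` and `reduce_features` are hypothetical.

```python
def train_perceptron(X, y, active, epochs=20):
    """Train a linear perceptron that uses only the feature indices in `active`."""
    w = {j: 0.0 for j in active}
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):                  # label is +1 or -1
            score = b + sum(w[j] * x[j] for j in active)
            if label * score <= 0:                  # misclassified: update weights
                for j in active:
                    w[j] += label * x[j]
                b += label
    return w, b

def reduce_features(X, y, keep_sizes=(128, 64, 32, 16, 8)):
    """Repeatedly learn, keep the k features with the largest |weight|, learn again."""
    active = list(range(len(X[0])))
    history = []
    for k in keep_sizes:
        w, _ = train_perceptron(X, y, active)       # learn on the current features
        if k < len(active):                         # eliminate low-weight features
            active = sorted(active, key=lambda j: abs(w[j]), reverse=True)[:k]
        history.append(sorted(active))
    return history
```

The default `keep_sizes` mirrors the feature counts on the horizontal axis of the feature-reduction plots; evaluating accuracy at each size would additionally need held-out data.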

Results: Feature Reduction


[Figure: Fiction. Accuracy (axis from 0.6 to 0.9) against number of features (all, 128, 64, 32, 16, 8) for FW+POS, FW and POS]

Results: Feature Reduction


[Figure: Feature reduction for Non-fiction. Accuracy (axis from 0.6 to 0.9) against number of features (all, 128, 64, 32, 16, 8) for FW+POS, POS and FW]

What are the Distinguishing Features?


Fiction
  Male: a, the, as
  Female: she, for, with, not

Non-Fiction
  Male: that, one, of, PRP, AT0
  Female: she, for, with, and, in, PNP

Feature Frequencies
           Fiction                               Non-fiction
Feature    Male   stderr   Female   stderr       Male   stderr   Female   stderr
PNP        732    14       809      15           291    12       331      17
he         145    4.7      135      4.7          47.5   3.5      48.1     4.3
she        67     4.3      139      6.9          8.73   1.7      21.5     2.3
AT0        735    9.5      626      8.7          884    9.1      822      12
DT0        160    2.9      153      2.0          220    4.0      204      4.6
the        520    8.6      418      7.5          611    8.4      614      12
XX0        84     2.4      98       2.2          54     1.5      55       2.3
PRP        623    6.0      615      5.7          767    5.9      763      7.0
PRF        170    4.2      158      3.7          355    7.2      324      7.9
for        55.7   1.1      61.3     1.0          77.9   1.6      90.7     1.4
with       58.6   1.1      66.5     1.0          56.9   1.1      67.8     1.4
and        234    4.9      249      5.5          242    3.9      287      5.2

Summary: Male vs. Female Style


Males use more informational features:
  Determiners
  Adjectives
  "of" modifiers (e.g. pot of gold)

Females use more involvedness features:
  Pronouns
  "for" and "with"
  Negation
  Present tense


Blog Corpus
85,000 blogs
Blogger-provided profiles (gender, age, occupation, astrological sign)
Harvested August 2004
Non-text ignored (formatting, quoting)

Example 1
Yesterday we had our second jazz competition. Thank God we weren't competing. We were sooo bad. Like, I was so ashamed, I didn't even want to talk to anyone after. I felt so rotton, and I wanted to cry, but...it's ok.

Example 2
My gracious boss had agreed to let me have one week off of "work." He did finally give me my report back after eight freakin' days! Now I only have the rest of this week and then one full week after my vacation to finish this damned thing.

Example 3
So about a month or two ago, I met Katy N. at a party in New York. Katy's friend, Kevin M., whom she met while living in Barcelona last year, lives in Miami and is working on getting a TV series produced. Kevin is friends with a guy named Charlie P.

Blog Corpus
Age        Female    Male     Total
unknown    12287     12259    24546
13-17       6949      4120    11069
18-22       7393      7690    15083
23-27       4043      6062    10105
28-32       1686      3057     4743
33-37        860      1827     2687
38-42        374       819     1193
43-48        263       584      847
>48          314       906     1220
Total      34169     37324    71493

Final balanced corpus: 19,320 total blogs
  8,240 in 10s
  8,086 in 20s
  2,994 in 30s

681,288 total posts
141,106,859 total words

Experimental Setup
Feature sets:
  Content: words (filtered by infogain on train set)
  Style: parts-of-speech, function words, blog slang

Learning algorithms:
  Real-valued balanced winnow (RBW)
  Bayesian Multinomial Regression (BMR)

Evaluation: 10-fold cross-validation
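
Filtering content words by information gain can be illustrated with a short sketch. It assumes each document is reduced to the set of words it contains (binary presence rather than counts), which the slides do not specify; the function names are hypothetical. In a cross-validation setting the scores would be computed on the training folds only.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(docs, labels, word):
    """Information gain of the binary feature 'word occurs in the document'.

    docs   -- list of sets of tokens, one set per training document
    labels -- list of class labels, aligned with docs
    """
    classes = sorted(set(labels))

    def class_counts(indices):
        return [sum(1 for i in indices if labels[i] == c) for c in classes]

    everything = list(range(len(docs)))
    present = [i for i in everything if word in docs[i]]
    absent = [i for i in everything if word not in docs[i]]
    conditional = sum(len(part) / len(docs) * entropy(class_counts(part))
                      for part in (present, absent))
    return entropy(class_counts(everything)) - conditional
```

Ranking all words by this score and keeping the top of the list gives a content-word feature set.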

Age: Classification

                    RBW      BMR
Style & Content     75.0%    77.4%
Function Words      67.7%    69.4%
Content Words       75.9%    76.2%

The lifecycle of the common blogger...

feature        10s     20s     30s
bored          3.84    1.11    0.47
boring         3.69    1.02    0.63
awesome        2.92    1.28    0.57
mad            2.16    0.8     0.53
homework       1.37    0.18    0.15
mum            1.25    0.41    0.23
maths          1.05    0.03    0.02
dumb           0.89    0.45    0.22
sis            0.74    0.26    0.1
crappy         0.46    0.28    0.11

feature        10s     20s     30s
college        1.51    1.92    1.31
bar            0.45    1.53    1.11
apartment      0.18    1.23    0.55
beer           0.32    1.15    0.7
student        0.65    0.98    0.61
drunk          0.77    0.88    0.41
album          0.64    0.84    0.56
dating         0.31    0.52    0.37
semester       0.22    0.44    0.18
someday        0.35    0.4     0.28

feature        10s     20s     30s
son            0.51    0.92    2.37
local          0.38    1.18    1.85
marriage       0.27    0.83    1.41
development    0.16    0.5     0.82
tax            0.14    0.38    0.72
campaign       0.14    0.38    0.7
provide        0.15    0.54    0.69
democratic     0.13    0.29    0.59
systems        0.12    0.36    0.55
workers        0.1     0.35    0.46

Gender: Classification

                    RBW      BMR
Style & Content     80.0%
Style Words         77.0%
Content Words       73.0%

Men are from Mars... Women are from Venus...


LIWC category    male           female
job              68.1±0.6       56.5±0.5
money            43.6±0.4       37.1±0.4
sports           31.2±0.4       20.4±0.2
tv               21.1±0.3       15.9±0.2
sex              32.4±0.4       43.2±0.5
family           27.5±0.3       40.6±0.4
eating           23.9±0.3       30.4±0.3
friends          20.5±0.2       25.9±0.3
sleep            18.4±0.2       23.5±0.2
pos-emotions     248.2±1.9      265.1±2
neg-emotions     159.5±1.3      178±1.4

Relating Age & Gender


Let's examine the connection between age and gender a little more generally...
Consider the most distinctive words for both Age and Gender:
  Intersection of the 1000 words with highest Age information gain and the 1000 words with highest Gender information gain
  Total of 316 words
Consider log(30s/10s) vs. log(male/female)
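
One way to compute the two log-ratios for the shared words is sketched below. The function name and input layout are assumptions, the natural logarithm is assumed because the slides do not state a base, and the gender frequencies in the example are made-up placeholders; only the two age rows are taken from the blogger word tables in this lecture.

```python
import math

def log_ratio_points(age_freq, gender_freq):
    """Map each shared word to (log(male/female), log(30s/10s)).

    age_freq    -- {word: (freq_in_10s, freq_in_30s)}
    gender_freq -- {word: (freq_male, freq_female)}
    """
    points = {}
    for word in age_freq.keys() & gender_freq.keys():   # intersection of both lists
        f10, f30 = age_freq[word]
        male, female = gender_freq[word]
        points[word] = (math.log(male / female), math.log(f30 / f10))
    return points

age = {"homework": (1.37, 0.15), "tax": (0.14, 0.72)}   # values from the age tables
gender = {"homework": (1.0, 2.0), "tax": (2.0, 1.0)}    # hypothetical placeholders
for word, (x, y) in sorted(log_ratio_points(age, gender).items()):
    print(f"{word}: log(male/female)={x:+.2f}, log(30s/10s)={y:+.2f}")
```

Each word becomes one point, with the gender ratio on the horizontal axis and the age ratio on the vertical axis.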

Relating Age & Gender


[Figure: scatter plot of log(30s/10s) (vertical axis, from -8 to 8) against log(male/female) (horizontal axis, from -2 to 2) for the shared words; in a second version of the plot the word "husband" is labelled]

Native Language
Given English text, can we determine the author's native language?

Try it yourself. These were written by Russian, French and Spanish speakers, though not in that order. Can you tell which is which?

In the second part of this outhors novel, called Time Passes, time has passed indeed and Mrs Ramsay has died.

There are pejudments of small groups, such as homosexuals, inmigrants, aids diseaseds, etc. But "political correctness" has have positive and negative consecuences.

There is one more kind of films irritating many television viewers - "soap" serials. Santa Barbara has even won "Oskar" prize.

Possible Clues
Patterns of native language are typically reflected in how other languages are spoken (Rado61, Corder81):
  Word selection
  Syntax
  Spelling

Measurable Features for Automated Native Language Detection

Frequency of function words
Frequency of letter sequences (adapted from Peng+ 04)
Idiosyncrasies
We will gather idiosyncrasies data automatically.
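
Letter-sequence frequencies can be illustrated with a short sketch. It assumes sequences are counted inside words and defaults to two-letter sequences; the sequence length, the whitespace tokenization and the function name are assumptions rather than details given in the lecture.

```python
from collections import Counter

def letter_ngram_frequencies(text, n=2):
    """Relative frequency of each letter n-gram, counted within words."""
    grams = Counter()
    for word in text.lower().split():
        letters = "".join(ch for ch in word if ch.isalpha())   # drop punctuation
        for i in range(len(letters) - n + 1):
            grams[letters[i:i + n]] += 1
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()} if total else {}

freqs = letter_ngram_frequencies("positive and negative consecuences")
print(sorted(freqs.items(), key=lambda item: -item[1])[:5])
```

A fixed list of the most common sequences in the training data would then define the vector positions.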

Orthographic Idiosyncrasies
Repeated letter (e.g. remmit instead of remit)
Double letter appears once (e.g. comit instead of commit)
One letter instead of another (e.g. firsd instead of first)
Letter inversion (e.g. fisrt instead of first)
Inserted letter (e.g. friegnd instead of friend)
Missing letter (e.g. frend instead of friend)
Conflated words (e.g. stucktogether)

Syntactic Idiosyncrasies
Sentence Fragment
Run-on Sentence
Repeated Word
Missing Word
Mismatched Singular/Plural
Mismatched Tense
that/which confusion
Rare POS pairs (Chodorow-Leacock 00)

Automatically Finding Idiosyncrasies


1. Run text through automated spell/grammar checker
2. Compare flagged word to best suggestion
3. Mark error accordingly
   e.g. text=remmit, suggestion=remit: mark as repeated letter
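
Steps 2 and 3 can be sketched as a comparison between the flagged word and the checker's suggestion. The spell checker itself is not modelled here: the suggestion is simply passed in. The function name, the label strings and the exact matching rules are assumptions made for illustration, not the rules used in the study.

```python
def classify_misspelling(word, suggestion):
    """Label how a flagged word differs from the checker's best suggestion."""
    w, s = word.lower(), suggestion.lower()
    if w == s:
        return "no error"
    if len(w) == len(s) + 1:                      # written word has one extra letter
        for i in range(len(w)):
            if w[:i] + w[i + 1:] == s:
                doubled = (i > 0 and w[i] == w[i - 1]) or \
                          (i + 1 < len(w) and w[i] == w[i + 1])
                return "repeated letter" if doubled else "inserted letter"
    if len(w) + 1 == len(s):                      # written word lacks one letter
        for i in range(len(s)):
            if s[:i] + s[i + 1:] == w:
                doubled = (i > 0 and s[i] == s[i - 1]) or \
                          (i + 1 < len(s) and s[i] == s[i + 1])
                return "double letter appears once" if doubled else "missing letter"
    if len(w) == len(s):
        diff = [i for i in range(len(w)) if w[i] != s[i]]
        if len(diff) == 1:                        # a single position differs
            return "letter substitution"
        if (len(diff) == 2 and diff[1] == diff[0] + 1
                and w[diff[0]] == s[diff[1]] and w[diff[1]] == s[diff[0]]):
            return "letter inversion"             # two adjacent letters swapped
    return "other"

for written, suggested in [("remmit", "remit"), ("comit", "commit"),
                           ("fisrt", "first"), ("frend", "friend")]:
    print(written, "->", classify_misspelling(written, suggested))
```

Counting how often each label occurs in a document yields its orthographic error features.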

Summary: Features Used


400 function words
200 letter sequences
185 error types
250 rare POS pairs

Each document is represented as a numerical vector of length 1035

Test Corpus
International Corpus of Learner English (Granger98)

11 countries
Subjects same age, proficiency level
Samples same genre, length
Actually used in study: 258 docs from each of
  France
  Spain
  Bulgaria
  Czech Rep.
  Russia

SVM Classification Accuracy (10-fold CV)


[Figure: bar chart of accuracy (%, axis from 30 to 90) for the feature sets Errors, Letter n-grams, Function words, and Function words + Letter n-grams. Shaded bars: without error features; white bars: with error features. Baseline = 20%]

Confusion Matrix
                        Classified As
Actual       Czech    French    Bulgarian    Russian    Spanish
Czech         209       18         20          10         1*
French          9      219         13          12         5
Bulgarian      14        8        211          18         7
Russian        24        8         24         194         8
Spanish        16       10         10           7       215

* Value missing in the source; 1 is implied by the 258 documents per language.

What Gives It Away?


Russian: over, the (infrequent), number_reladverb
French: indeed, Mr (no period), misused o (e.g. outhor)
Spanish: c-q confusion (e.g. cuality), m-n confusion (e.g. confortable), undoubled consonant (e.g. comit)
Bulgarian: most_ADVERB, cannot (uncontracted)
Czech: doubled consonant (e.g. remmit)

Let's look back at our examples. Now it's pretty obvious.

French: In the second part of this outhors novel, called Time Passes, time has passed indeed and Mrs Ramsay has died.

Spanish: There are pejudments of small groups, such as homosexuals, inmigrants, aids diseaseds, etc. But "political correctness" has have positive and negative consecuences.

Russian: There is one more kind of films irritating many television viewers - "soap" serials. Santa Barbara has even won "Oskar" prize.

Real-Life Issues
Many candidate languages
Very short texts
Unpredictable English proficiency

Personality
Pennebaker data:
  Students wrote essays
  Same students took personality assessment tests

Experiment: Given text, determine if author is
  Open
  Conscientious
  Neurotic
  Extroverted
  Agreeable

Accuracy Results
Open             66%
Conscientious    65%
Neurotic         63%
Extroverted      62%
Agreeable        60%

Key Features
Openness
  consciousness, strange, thoughts, maybe, you
  hope, feel, home, friends, football, team

Conscientiousness
  school, always, high, grades
  damn, bad, hate, you, more
