Semisupervised Project Ideas

Christopher Brown
Department of Linguistics
University of Texas at Austin
1 University Station B5100
Austin, TX 78712-0198 USA
chbrown@mail.utexas.edu

Summary

Two very different ideas: 1) A personally adaptive spelling-correction and style-matching text-entry engine. 2) Colloquial/formal classification of texts based on cross-classification between colloquial and formal corpora.

1 Spelling Correction

Spelling correction is sub-optimal in many of its modern implementations. While some text editing programs allow personal dictionaries, none that I know of track personal mistake tendencies. Two treatments of spelling correction algorithms, Jurafsky and Martin (2008) and Kukich (1992), describe various complications of and solutions to the problem (the latter in much more depth), but neither considers the continuous process of user input in the production and correction of errors. Their spellchecking algorithms assume a completed document and a cessation of user input.

I propose a more intelligent text entry engine, in which the errors that contribute to easily corrected misspellings (by minimum edit distance methods, for instance) will be incorporated into the model of edit distances. For example, if a user mistypes “aling,” the engine will quickly autocorrect it as “align,” but if the user then mistypes “desing,” the engine will remember the “ng” → “gn” edit it performed before, and suggest “design” instead of “dewing” (my naive iPad suggests “dewing” even though I make that typo all the time). This might be more like active learning, but as I understand it, the difference between active learning and semisupervised learning over a non-finite (continuous?) data set is that active learning involves the human user telling the machine what to learn, and semisupervised learning entails the machine inferring what to learn from the user’s normal behavior (i.e. unlabeled data); my implementation would be a combination of the two.

2 Colloquial/formal classification

This classifier will train on very large corpora that are labeled at the global level; I will consider the LDC’s Switchboard corpus or the Simple English Wikipedia as “colloquial,” and New York Times or Wall Street Journal articles as “formal.” These will be very noisy data sets, but I believe some traits of formality or colloquialism can be teased from the documents. Given those two poles of corpora, I will calculate the distance between my unlabeled, to-be-classified documents and the documents in those corpora. Do and Ng (2006) present an improved algorithm for the “multiclass text classification task” that seems to be suited to similar classification between such corpora.¹

¹ The results section is somewhat opaque to me, but overall it seems like a reasonable place to start; it was not so easy finding relevant articles for this idea.

References

Chuong Do and Andrew Ng. Transfer learning for text classification. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 299–306. MIT Press, Cambridge, MA, 2006.

Daniel Jurafsky and James H. Martin. Speech and Language Processing (2nd Edition), pages 72–79. Prentice Hall, 2nd edition, 2008. ISBN 0131873210.

Karen Kukich. Techniques for automatically correcting words in text. ACM Comput. Surv., 24(4):377–439, 1992. ISSN 0360-0300. doi: http://doi.acm.org/10.1145/146370.146380.