
Classifying Text using CNN

Sohail Manzoor
(17-MS-SE-19)
Muhammad Zeshan
(17-MS-SE-02)
Hadeed Ullah
(17-MS-SE-08)
Outline

• Goal of this presentation


• Text classification at Walmart
• Why use Deep Learning
• CNN for Text Classification
• Characters as input
• Word tokens as input
• Comparison against SVM
• Conclusion

Goal of this presentation

Understand how a Convolutional Neural Network can be used in text classification

Text Classification at Walmart

• Assign an item to a category


• Assign a query to a category
• Identify positive/negative reviews
• Determine relevant/irrelevant attributes

Today we will focus on a simpler problem of determining the "level 2 category" from the title of a given item.

Steps of Text Classification

Traditional Approach:
• Read Documents
• Feature Extraction: tokenization, n-grams, stemming, phrase detection, topic modeling
• Feature Selection: Information Gain (IG), Chi-square, odds ratio
• Vector Representation: binary, tf, tf*idf
• Learning Algorithm: Naïve Bayes, logistic regression, SVM, decision trees

Deep Learning Approach:
• Read Documents
• Tokenize
• Network Design: CNN, RNN, number of layers
• Parameter Tuning


Traditional vs Deep Learning Approach

Traditional Approach:
• Well understood
• More than 2 decades of active research
• Successfully used in many applications
• More steps, with several choices for each step
• Right choices are well established
• Major time is spent on feature engineering
• Easy to serve the model in real time

Deep Learning Approach:
• Nascent, started around 2014-2015
• Fewer steps
• Major time is spent on parameter tuning
• Real-time serving of the model can be challenging

It is hard to beat the accuracy of the traditional approach in text classification!!!

Why use Deep Learning in Text Classification

• Leverage the high level of activity and volume of research in deep learning

• Create a uniform approach for all kinds of data (image, video, voice, text)
• Enables multi-modal learning from text and image

• Replace domain-specific feature engineering knowledge with broader knowledge of network design and parameter tuning
• Enables more sharing of knowledge
• Enables sharing of pre-trained models
• Most deep learning networks are open source

Democratize Machine Learning through a uniform approach and knowledge sharing

Deep Neural Networks for Text: RNN or CNN

• CNN extracts features
• Works well where feature detection is important (e.g. sentiment classification, positive/negative review classification)

• CNN is faster to train
• Convolutions can be done in parallel, taking full advantage of GPU parallelism

• Historically RNN has outperformed CNN where length of the document is important (e.g.
language translation)
• But RNN takes longer to train due to its sequential nature
• Recent research shows CNN can outperform RNN accuracy on language translation
https://code.facebook.com/posts/1978007565818999/a-novel-approach-to-neural-machine-translation/

CNN Architectures for Text Classification

We experimented with the following 2 architectures:

1. Character-level CNN
• Zhang, X. et al. Character-level Convolutional Networks for Text Classification, 2015, https://arxiv.org/pdf/1509.01626.pdf
• Absolutely no preprocessing of input
• More familiar deep CNN architecture
• Convolution and max-pooling layers followed by fully connected layers

2. Word-level CNN
• Kim, Y. Convolutional Neural Networks for Sentence Classification, EMNLP 2014, https://arxiv.org/pdf/1408.5882.pdf
• Only word tokenization is used as preprocessing
• Uses max-pooling across the input

Character-level CNN

• Input text is represented as a k x n matrix of one-hot encodings of the characters (sketched in the example below)
• k is the size of the alphabet
• n is the maximum number of characters in the input text. Padded/truncated when necessary
• Imagine this as a single-channel, grayscale, k x n image
• Apply a series of convolution, max-pooling and then fully connected layers

Figure from https://arxiv.org/pdf/1509.01626.pdf
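
As a rough illustration, a minimal sketch of this character encoding follows. The alphabet below is illustrative and the handling of unknown characters is an assumption; Zhang et al. use an alphabet of 70 characters.

import numpy as np

# Illustrative alphabet; the actual set used by Zhang et al. has 70 characters.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+=<>()[]{}"
CHAR_INDEX = {c: i for i, c in enumerate(ALPHABET)}
MAX_CHARS = 1014  # n, maximum characters per input text

def encode_chars(text, max_chars=MAX_CHARS):
    """Return a (k, n) one-hot matrix for one input text."""
    matrix = np.zeros((len(ALPHABET), max_chars), dtype=np.float32)
    for pos, char in enumerate(text.lower()[:max_chars]):  # truncate long inputs
        idx = CHAR_INDEX.get(char)
        if idx is not None:                                 # unknown chars stay all-zero
            matrix[idx, pos] = 1.0
    return matrix                                           # shorter inputs are zero-padded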


10
Character-level CNN: Characteristics

Layer          Filter        Subsample   Output shape     Activation   Param #
Input          -             -           70 x 1014        -            -
Convolution1   256@70 x 7    1 x 3       1 x 336 x 256    ReLU         125,696
Convolution2   256@1 x 7     1 x 3       1 x 110 x 256    ReLU         2,048
Convolution3   256@1 x 3     -           1 x 108 x 256    ReLU         1,024
Convolution4   256@1 x 3     -           1 x 106 x 256    ReLU         1,024
Convolution5   256@1 x 3     -           1 x 104 x 256    ReLU         1,024
Convolution6   256@1 x 3     1 x 3       1 x 34 x 256     ReLU         1,024
Flatten        -             -           8704             -            -
FC1            -             -           1024             -            8,913,920
FC2            -             -           1024             -            1,049,600
FC3            -             -           380              -            389,500
Total                                                                  10,484,860

• Slow to train
• Slow during inference, more than 100 milliseconds on a P100 GPU
• Achieves 79% accuracy on the test set
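
For concreteness, here is a minimal Keras sketch that reproduces the layer shapes in the table above; dropout placement, optimizer and loss are assumptions not stated on the slide.

from tensorflow.keras import layers, models

ALPHABET_SIZE = 70   # k: size of the character alphabet
MAX_CHARS = 1014     # n: maximum characters per title
NUM_CLASSES = 380    # number of level 2 categories

model = models.Sequential([
    # Input: one-hot characters, shape (n, k)
    layers.Conv1D(256, 7, activation="relu", input_shape=(MAX_CHARS, ALPHABET_SIZE)),
    layers.MaxPooling1D(3),                      # -> 336 x 256
    layers.Conv1D(256, 7, activation="relu"),
    layers.MaxPooling1D(3),                      # -> 110 x 256
    layers.Conv1D(256, 3, activation="relu"),    # -> 108 x 256
    layers.Conv1D(256, 3, activation="relu"),    # -> 106 x 256
    layers.Conv1D(256, 3, activation="relu"),    # -> 104 x 256
    layers.Conv1D(256, 3, activation="relu"),
    layers.MaxPooling1D(3),                      # -> 34 x 256
    layers.Flatten(),                            # -> 8704
    layers.Dense(1024, activation="relu"),
    layers.Dropout(0.5),                         # dropout placement is an assumption
    layers.Dense(1024, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])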

Word-level CNN

Figure from https://arxiv.org/pdf/1408.5882.pdf

• Input text is represented as an n x k matrix using word embeddings
• n is the maximum number of words in the text. Padded/truncated when necessary
• k is the length of the embedding
• Apply multiple convolutions of width k and different heights fi
• Height of a filter output is (n – fi + 1)
• Apply max-pooling across the (n – fi + 1) height to select 1 output per filter
• Intuitively detects the presence of a feature in the text
• The n x k representation can be learned as part of the network, or a pre-trained word embedding can be used
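
A minimal Keras sketch of this word-level architecture follows, using the hyperparameters from the implementation slide below (n = 25, v = 500K, k = 128, filter sizes 2/3/4, p = 128 filters each, o = 380 classes); the dropout rate, optimizer and loss are assumptions.

from tensorflow.keras import layers, models

MAX_WORDS, VOCAB_SIZE, EMB_DIM = 25, 500_000, 128              # n, v, k
NUM_FILTERS, FILTER_SIZES, NUM_CLASSES = 128, (2, 3, 4), 380   # p, (f1, f2, f3), o

tokens = layers.Input(shape=(MAX_WORDS,), dtype="int32")       # word ids, padded to n
embedded = layers.Embedding(VOCAB_SIZE, EMB_DIM)(tokens)       # n x k, trained with the task

pooled = []
for f in FILTER_SIZES:
    conv = layers.Conv1D(NUM_FILTERS, f, activation="relu")(embedded)  # (n - f + 1) x p
    pooled.append(layers.GlobalMaxPooling1D()(conv))                   # max over time -> 1 x p

features = layers.Concatenate()(pooled)        # 1 x 3p
features = layers.Dropout(0.5)(features)       # dropout rate is an assumption
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(features)

model = models.Model(tokens, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])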
Word-level CNN: Our Implementation

[Network diagram: one-hot encoding (n x v) → embedding layer (v x k) → n x k matrix → parallel convolutions p@f1 x k, p@f2 x k, p@f3 x k → outputs of size (n – f1 + 1) x p, (n – f2 + 1) x p, (n – f3 + 1) x p → max pooling to 1 x p each → fully connected → output 1 x o]

Parameter Settings
• Sentence length n = 25
• Vocabulary size v = 500K
• Embedding size k = 128
• Convolution filter sizes f1, f2, f3 = 2, 3, 4
• Filters per size (max-pooled outputs) p = 128
• Output classes o = 380

Total Number of Parameters
• Embedding: v x k = 64M
• Convolution filters: (f1 + f2 + f3) * k * p = 147K
• Fully connected: 3 * p x o = 145K
• Total = 64,293,376
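
As a sanity check, the parameter counts above can be reproduced directly from the settings table (a small sketch; variable names mirror the table):

# Reproduce the parameter counts listed above from the settings table.
v, k, p, o = 500_000, 128, 128, 380
f1, f2, f3 = 2, 3, 4

embedding = v * k                       # 64,000,000  (~64M)
conv_filters = (f1 + f2 + f3) * k * p   # 147,456     (~147K)
fully_connected = 3 * p * o             # 145,920     (~145K)

print(embedding + conv_filters + fully_connected)  # 64,293,376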
Convolution Output

Example: item categorized under "Personal Care / Bath & Body"

Phrase                               Weight
sensitive skin moisturizing cream    3.814296
dry sensitive skin moisturizing      2.8061242
cream 16.0 oz END_TOKEN              2.5697758
skin moisturizing cream 16.0         2.3056493
moisturizing cream 16.0              2.1790688

Example: item categorized under "Clothing/Shoes"

Phrase                               Weight
fairytale dress sandal END_TOKEN     4.5367112
dress sandal END_TOKEN               3.122334
fairytale dress sandal END_TOKEN     2.9044547
mojo moxy                            2.8222353
dress sandal END_TOKEN END_TOKEN     2.6823337

Tokens around "moisturizing cream" are weighted highly to categorize the first item under "Personal Care / Bath & Body". Tokens around "dress sandal" are weighted highly to categorize the second item under "Clothing/Shoes"; the brand "mojo moxy", which makes shoes, also got a high weight.
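
One hedged way to obtain lists like these is to take, for each convolution filter, the input window with the largest activation. A rough sketch follows; the activation array and token list are assumed to come from running the trained model on one title, and this may differ from how the weights above were actually computed.

import numpy as np

def top_phrases(conv_activations, tokens, filter_size, top_n=5):
    """conv_activations: (n - filter_size + 1, p) activations for one title;
    tokens: the n word tokens of that title. Returns the strongest phrases."""
    best_pos = conv_activations.argmax(axis=0)   # best window position per filter
    best_val = conv_activations.max(axis=0)      # its activation value
    strongest = np.argsort(-best_val)[:top_n]    # filters with the highest activations
    return [(" ".join(tokens[best_pos[i]:best_pos[i] + filter_size]), float(best_val[i]))
            for i in strongest]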
Word Embedding

[Figure: visualization of the learned word embeddings, illustrated with the words "the", "pasta", "lego"]

• Obtained from the v x k embedding layer


• Randomly initialized
• Trained as part of the classification task
Learning Curve

[Figure: accuracy vs. training steps]

Achieves 85% accuracy on the validation and test set

Parameter Tuning

Method                                            Accuracy
Baseline                                          85.20%
More filters, of sizes [2, 3, 4, 5, 6]            85.50%
Dropout probability increased from 0.5 to 0.75    85.97%
Batch size 2048 instead of 512                    84.91%
Batch size 64 instead of 512                      79.00%

Scaling

Training time in minutes for 1 epoch over 10s of millions of product titles:

Processor                     Word-CNN   Char-CNN
P100                          112        395
K80                           209        662
Intel Xeon 1.8 GHz, 8 core    301        8000

Inference time for one example:

Word-CNN             Char-CNN
4-8 milliseconds     >100 milliseconds

Inference can be done on a CPU in a few milliseconds!!!


Scaling ideas – low-hanging fruit

- More than 60% of the time was spent preparing the next batch for Word-CNN on a P100
- Batch preparation can be done in parallel
- TensorFlow's input readers can possibly be of great help

- TensorFlow compiled with SSE, AVX2 and FMA support can be 4-8x faster
- Word-CNN training can then be completed in 4-5 hours on 10s of millions of examples on a CPU

- Data-parallel training in case of multiple GPUs
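
As an illustration of parallel batch preparation, here is a minimal tf.data sketch using the modern TensorFlow API; the file name, hashing-based tokenization, and batch size are assumptions.

import tensorflow as tf

def encode_title(line):
    # Placeholder preprocessing: split the title and hash words to ids.
    # A real pipeline would use a vocabulary lookup and also parse the label.
    tokens = tf.strings.split(line)
    ids = tf.strings.to_hash_bucket_fast(tokens, 500_000)
    return ids[:25]                                                  # truncate to n = 25 words

dataset = (tf.data.TextLineDataset("product_titles.txt")
           .map(encode_title, num_parallel_calls=tf.data.AUTOTUNE)   # prepare batches in parallel
           .padded_batch(512, padded_shapes=[25])                    # pad shorter titles to n = 25
           .prefetch(tf.data.AUTOTUNE))                              # overlap input prep with training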

- Data parallel training in case of multiple GPUs

Comparison against SVM

• SVM with unigram + bigram features also achieves 85% accuracy when trained on 1/10th of the data

• Stochastic gradient descent on the full data does not achieve more than 80% accuracy after the same number of epochs

• SVM has comparable accuracy with faster training and inference
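
A minimal scikit-learn sketch of such a baseline, assuming tf-idf weighting and a linear SVM (the actual features and hyperparameters used are not stated on the slide):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

svm_baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # unigram + bigram features
    LinearSVC(),
)
# Hypothetical usage with lists of product titles and their level 2 categories:
# svm_baseline.fit(train_titles, train_categories)
# accuracy = svm_baseline.score(test_titles, test_categories)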

Conclusion

• Word-CNN is better and faster than Character-CNN


• Tokenization (i.e. some feature engineering) is still important even in case of DNN

• Word-CNN is a very promising network for Text classification


• Very robust, easy to achieve good accuracy with very little parameter tuning
• Can be trained in a few hours on a CPU on 10s of millions of examples
• Inference can be done within a few milliseconds even on a CPU
• Can be deployed to do inference (scoring) in real time

• It is promising to see CNN achieving state-of-the-art accuracy on a very well studied problem with very little effort
• And the field is rapidly making progress
• Hopefully much higher accuracy soon!!!

We are Hiring!!!
https://www.linkedin.com/in/somnath-banerjee

