
Classifying Text using CNN

Sohail Manzoor
(17-MS-SE-19)
Muhammad Zeshan
(17-MS-SE-02)
Hadeed Ullah
(17-MS-SE-08)
Outline

• Goal of this presentation


• Text classification at Walmart
• Why use Deep Learning
• CNN for Text Classification
• Characters as input
• Word tokens as input
• Comparison against SVM
• Conclusion

Goal of this presentation

Understand how a Convolutional Neural Network can be used in text classification

Text Classification at Walmart

• Assign an item to a category


• Assign a query to a category
• Identify positive/negative reviews
• Determine relevant/irrelevant attributes

Today we will focus on a simpler problem of determining the "level 2 category" from the title of a given item.

Steps of Text Classification

Traditional Approach:
• Read Documents
• Feature Extraction: tokenization, n-grams, stemming, phrase detection, topic modeling
• Feature Selection: Information Gain (IG), Chi-square, odds ratio
• Vector Representation: binary, tf, tf*idf
• Learning Algorithm: Naïve Bayes, logistic regression, SVM, decision trees

Deep Learning Approach:
• Read Documents
• Tokenize
• Network Design: CNN, RNN, number of layers
• Parameter Tuning


Traditional vs Deep Learning Approach

Traditional Approach:
• Well understood
• More than 2 decades of active research
• Successfully used in many applications
• More steps, with several choices for each step
• Right choices are well established
• Major time is spent on feature engineering
• Easy to serve the model in real time

Deep Learning Approach:
• Nascent, started around 2014-2015
• Fewer steps
• Major time is spent on parameter tuning
• Real-time serving of the model can be challenging

It is hard to beat the accuracy of the traditional approach in text classification!!!

Why use Deep Learning in Text Classification

• Leverage the high level of activity and volume of research in deep learning

• Create a uniform approach for all kinds of data (image, video, voice, text)
• Enables multi-modal learning from text and image

• Replace domain-specific feature engineering knowledge with broader knowledge of network design and parameter tuning
• Enables more sharing of knowledge
• Enables sharing of pre-trained models
• Most deep learning networks are open source

Democratize Machine Learning through a uniform approach and knowledge sharing

Deep Neural Networks for Text: RNN or CNN

• CNN extracts features
• Works well where feature detection is important (e.g. sentiment classification, positive/negative review classification)

• CNN is faster to train
• Convolutions can be done in parallel, taking full advantage of GPU parallelism

• Historically RNN has outperformed CNN where length of the document is important (e.g.
language translation)
• But RNN takes longer to train due to its sequential nature
• Recent research shows CNN can outperform RNN accuracy on language translation
https://code.facebook.com/posts/1978007565818999/a-novel-approach-to-neural-machine-translation/

CNN Architectures for Text Classification

We experimented with the following 2 architectures:

1. Character-level CNN
• Zhang, X. et al. Character-level Convolutional Networks for Text Classification, 2015, https://arxiv.org/pdf/1509.01626.pdf
• Absolutely no preprocessing of input
• More familiar deep CNN architecture
• Convolution and max-pooling layers followed by fully connected layers

2. Word-level CNN
• Kim, Y. Convolutional Neural Networks for Sentence Classification, EMNLP 2014, https://arxiv.org/pdf/1408.5882.pdf
• Only word tokenization is used as preprocessing
• Uses max-pooling across the input

Character-level CNN

• Input text is represented as a k x n matrix of one-hot encodings of the characters (sketched in the example below)
• k is the size of the alphabet
• n is the maximum number of characters in the input text. Padded/truncated when necessary
• Imagine this as a single-channel, grayscale, k x n image
• Apply a series of convolution, max-pooling and then fully connected layers

Figure from https://arxiv.org/pdf/1509.01626.pdf
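
As a rough illustration, a minimal sketch of this character encoding follows. The alphabet below is illustrative and the handling of unknown characters is an assumption; Zhang et al. use an alphabet of 70 characters.

import numpy as np

# Illustrative alphabet; the actual set used by Zhang et al. has 70 characters.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+=<>()[]{}"
CHAR_INDEX = {c: i for i, c in enumerate(ALPHABET)}
MAX_CHARS = 1014  # n, maximum characters per input text

def encode_chars(text, max_chars=MAX_CHARS):
    """Return a (k, n) one-hot matrix for one input text."""
    matrix = np.zeros((len(ALPHABET), max_chars), dtype=np.float32)
    for pos, char in enumerate(text.lower()[:max_chars]):  # truncate long inputs
        idx = CHAR_INDEX.get(char)
        if idx is not None:                                 # unknown chars stay all-zero
            matrix[idx, pos] = 1.0
    return matrix                                           # shorter inputs are zero-padded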


10
Character-level CNN: Characteristics

Layer          Filter        Subsample   Output shape     Activation   Param #
Input          -             -           70 x 1014        -            -
Convolution1   256@70 x 7    1 x 3       1 x 336 x 256    ReLU         125,696
Convolution2   256@1 x 7     1 x 3       1 x 110 x 256    ReLU         2,048
Convolution3   256@1 x 3     -           1 x 108 x 256    ReLU         1,024
Convolution4   256@1 x 3     -           1 x 106 x 256    ReLU         1,024
Convolution5   256@1 x 3     -           1 x 104 x 256    ReLU         1,024
Convolution6   256@1 x 3     1 x 3       1 x 34 x 256     ReLU         1,024
Flatten        -             -           8704             -            -
FC1            -             -           1024             -            8,913,920
FC2            -             -           1024             -            1,049,600
FC3            -             -           380              -            389,500
Total                                                                  10,484,860

• Slow to train
• Slow during inference, more than 100 milliseconds on a P100 GPU
• Achieves 79% accuracy on the test set
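
For concreteness, here is a minimal Keras sketch that reproduces the layer shapes in the table above; dropout placement, optimizer and loss are assumptions not stated on the slide.

from tensorflow.keras import layers, models

ALPHABET_SIZE = 70   # k: size of the character alphabet
MAX_CHARS = 1014     # n: maximum characters per title
NUM_CLASSES = 380    # number of level 2 categories

model = models.Sequential([
    # Input: one-hot characters, shape (n, k)
    layers.Conv1D(256, 7, activation="relu", input_shape=(MAX_CHARS, ALPHABET_SIZE)),
    layers.MaxPooling1D(3),                      # -> 336 x 256
    layers.Conv1D(256, 7, activation="relu"),
    layers.MaxPooling1D(3),                      # -> 110 x 256
    layers.Conv1D(256, 3, activation="relu"),    # -> 108 x 256
    layers.Conv1D(256, 3, activation="relu"),    # -> 106 x 256
    layers.Conv1D(256, 3, activation="relu"),    # -> 104 x 256
    layers.Conv1D(256, 3, activation="relu"),
    layers.MaxPooling1D(3),                      # -> 34 x 256
    layers.Flatten(),                            # -> 8704
    layers.Dense(1024, activation="relu"),
    layers.Dropout(0.5),                         # dropout placement is an assumption
    layers.Dense(1024, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])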

Word-level CNN

Figure from https://arxiv.org/pdf/1408.5882.pdf

• Input text is represented as an n x k matrix using word embeddings
• n is the maximum number of words in the text. Padded/truncated when necessary
• k is the length of the embedding
• Apply multiple convolutions of width k and different heights fi
• Height of a filter output is (n – fi + 1)
• Apply max-pooling across the (n – fi + 1) height to select 1 output per filter
• Intuitively detects the presence of a feature in the text
• The n x k representation can be learned as part of the network, or a pre-trained word embedding can be used
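
A minimal Keras sketch of this word-level architecture follows, using the hyperparameters from the implementation slide below (n = 25, v = 500K, k = 128, filter sizes 2/3/4, p = 128 filters each, o = 380 classes); the dropout rate, optimizer and loss are assumptions.

from tensorflow.keras import layers, models

MAX_WORDS, VOCAB_SIZE, EMB_DIM = 25, 500_000, 128              # n, v, k
NUM_FILTERS, FILTER_SIZES, NUM_CLASSES = 128, (2, 3, 4), 380   # p, (f1, f2, f3), o

tokens = layers.Input(shape=(MAX_WORDS,), dtype="int32")       # word ids, padded to n
embedded = layers.Embedding(VOCAB_SIZE, EMB_DIM)(tokens)       # n x k, trained with the task

pooled = []
for f in FILTER_SIZES:
    conv = layers.Conv1D(NUM_FILTERS, f, activation="relu")(embedded)  # (n - f + 1) x p
    pooled.append(layers.GlobalMaxPooling1D()(conv))                   # max over time -> 1 x p

features = layers.Concatenate()(pooled)        # 1 x 3p
features = layers.Dropout(0.5)(features)       # dropout rate is an assumption
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(features)

model = models.Model(tokens, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])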
Word-level CNN: Our Implementation

[Network diagram: one-hot encoding (n x v) → embedding layer (v x k) → n x k matrix → parallel convolutions p@f1 x k, p@f2 x k, p@f3 x k → outputs of size (n – f1 + 1) x p, (n – f2 + 1) x p, (n – f3 + 1) x p → max pooling to 1 x p each → fully connected → output 1 x o]

Parameter Settings
• Sentence length n = 25
• Vocabulary size v = 500K
• Embedding size k = 128
• Convolution filter sizes f1, f2, f3 = 2, 3, 4
• Filters per size (max-pooled outputs) p = 128
• Output classes o = 380

Total Number of Parameters
• Embedding: v x k = 64M
• Convolution filters: (f1 + f2 + f3) * k * p = 147K
• Fully connected: 3 * p x o = 145K
• Total = 64,293,376
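
As a sanity check, the parameter counts above can be reproduced directly from the settings table (a small sketch; variable names mirror the table):

# Reproduce the parameter counts listed above from the settings table.
v, k, p, o = 500_000, 128, 128, 380
f1, f2, f3 = 2, 3, 4

embedding = v * k                       # 64,000,000  (~64M)
conv_filters = (f1 + f2 + f3) * k * p   # 147,456     (~147K)
fully_connected = 3 * p * o             # 145,920     (~145K)

print(embedding + conv_filters + fully_connected)  # 64,293,376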
Convolution Output

Example: item categorized under "Personal Care / Bath & Body"

Phrase                               Weight
sensitive skin moisturizing cream    3.814296
dry sensitive skin moisturizing      2.8061242
cream 16.0 oz END_TOKEN              2.5697758
skin moisturizing cream 16.0         2.3056493
moisturizing cream 16.0              2.1790688

Example: item categorized under "Clothing/Shoes"

Phrase                               Weight
fairytale dress sandal END_TOKEN     4.5367112
dress sandal END_TOKEN               3.122334
fairytale dress sandal END_TOKEN     2.9044547
mojo moxy                            2.8222353
dress sandal END_TOKEN END_TOKEN     2.6823337

Tokens around "moisturizing cream" are weighted highly to categorize the first item under "Personal Care / Bath & Body". Tokens around "dress sandal" are weighted highly to categorize the second item under "Clothing/Shoes"; the brand "mojo moxy", which makes shoes, also got a high weight.
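
One hedged way to obtain lists like these is to take, for each convolution filter, the input window with the largest activation. A rough sketch follows; the activation array and token list are assumed to come from running the trained model on one title, and this may differ from how the weights above were actually computed.

import numpy as np

def top_phrases(conv_activations, tokens, filter_size, top_n=5):
    """conv_activations: (n - filter_size + 1, p) activations for one title;
    tokens: the n word tokens of that title. Returns the strongest phrases."""
    best_pos = conv_activations.argmax(axis=0)   # best window position per filter
    best_val = conv_activations.max(axis=0)      # its activation value
    strongest = np.argsort(-best_val)[:top_n]    # filters with the highest activations
    return [(" ".join(tokens[best_pos[i]:best_pos[i] + filter_size]), float(best_val[i]))
            for i in strongest]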
Word Embedding

[Figure: visualization of the learned word embeddings, illustrated with the words "the", "pasta", "lego"]

• Obtained from the v x k embedding layer


• Randomly initialized
• Trained as part of the classification task
Learning Curve

[Figure: accuracy vs. training steps]

Achieves 85% accuracy on the validation and test set

Parameter Tuning

Method                                            Accuracy
Baseline                                          85.20%
More filters, of sizes [2, 3, 4, 5, 6]            85.50%
Dropout probability increased from 0.5 to 0.75    85.97%
Batch size 2048 instead of 512                    84.91%
Batch size 64 instead of 512                      79.00%

Scaling

Training time in minutes for 1 epoch over 10s of millions of product titles:

Processor                     Word-CNN   Char-CNN
P100                          112        395
K80                           209        662
Intel Xeon 1.8 GHz, 8 core    301        8000

Inference time for one example:

Word-CNN             Char-CNN
4-8 milliseconds     >100 milliseconds

Inference can be done on a CPU in a few milliseconds!!!


Scaling ideas – low-hanging fruit

- More than 60% of the time was spent preparing the next batch for Word-CNN on a P100
- Batch preparation can be done in parallel
- TensorFlow's input readers can possibly be of great help

- TensorFlow compiled with SSE, AVX2 and FMA support can be 4-8x faster
- Word-CNN training can then be completed in 4-5 hours on 10s of millions of examples on a CPU

- Data-parallel training in case of multiple GPUs
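
As an illustration of parallel batch preparation, here is a minimal tf.data sketch using the modern TensorFlow API; the file name, hashing-based tokenization, and batch size are assumptions.

import tensorflow as tf

def encode_title(line):
    # Placeholder preprocessing: split the title and hash words to ids.
    # A real pipeline would use a vocabulary lookup and also parse the label.
    tokens = tf.strings.split(line)
    ids = tf.strings.to_hash_bucket_fast(tokens, 500_000)
    return ids[:25]                                                  # truncate to n = 25 words

dataset = (tf.data.TextLineDataset("product_titles.txt")
           .map(encode_title, num_parallel_calls=tf.data.AUTOTUNE)   # prepare batches in parallel
           .padded_batch(512, padded_shapes=[25])                    # pad shorter titles to n = 25
           .prefetch(tf.data.AUTOTUNE))                              # overlap input prep with training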

- Data parallel training in case of multiple GPUs

Comparison against SVM

• SVM with unigram + bigram features also achieves 85% accuracy when trained on 1/10th of the data

• Stochastic gradient descent on the full data does not achieve more than 80% accuracy after the same number of epochs

• SVM has comparable accuracy with faster training and inference
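
A minimal scikit-learn sketch of such a baseline, assuming tf-idf weighting and a linear SVM (the actual features and hyperparameters used are not stated on the slide):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

svm_baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # unigram + bigram features
    LinearSVC(),
)
# Hypothetical usage with lists of product titles and their level 2 categories:
# svm_baseline.fit(train_titles, train_categories)
# accuracy = svm_baseline.score(test_titles, test_categories)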

Conclusion

• Word-CNN is better and faster than Character-CNN


• Tokenization (i.e. some feature engineering) is still important even in case of DNN

• Word-CNN is a very promising network for Text classification


• Very robust, easy to achieve good accuracy with very little parameter tuning
• Can be trained in a few hours on a CPU on 10s of millions of examples
• Inference can be done within a few milliseconds even on a CPU
• Can be deployed to do inference (scoring) in real time

• It is promising to see CNN achieving state-of-the-art accuracy on a very well studied problem with very little effort
• And the field is rapidly making progress
• Hopefully much higher accuracy soon!!!

We are Hiring!!!
https://www.linkedin.com/in/somnath-banerjee

