Ensemble Application of Convolutional and Recurrent Neural Networks for Multi-label Text Categorization

Ensemble Application ofConvolutional and Recurrent Neural Networks for Multi-label Text Categorization Guibin Chen1 , Deheng Ye1 , Zhenchang Xing2 , Jieshan Chen3 , Erik Cambria1 1 School of Computer Science and Engineering, Nanyang Technological University, Singapore 2 Research School of Computer Science, Australian National University, Australia 3 School of Mathematics, Sun Yat-sen University, China [email protected], [email protected], [email protected], [email protected], [email protected] Abstract—Text categorization, or text classification, is one of the state-of-the-art word-vector based CNN feature extraction key tasks for representing the semantic information of documents. and we deal with high-order label correlation while keeping Multi-label text categorization is finer-grained approach to text a tractable computational complexity by using RNN. We categorization which consists of assigning multiple target labels to documents. It is more challenging compared to the task of perform experiments to investigate the influence of parameters multi-class text categorization due to the exponential growth in our model. Additionally, we compare our method with of label combinations. Existing approaches to multi-label text several baselines using two publicly available datasets, i.e., categorization fall short to extract local semantic information and Reuters-21578 and RCV1-v2. The former reflects the behavior to model label correlations. In this paper, we propose an ensemble of our model in a relatively small dataset, and the latter is application of convolutional and recurrent neural networks to capture both the global and the local textual semantics and considered as a large-scale dataset. to model high-order label correlations while having a tractable The remainder of the paper is organized as follows. Section computational complexity. Extensive experiments show that our II covers the basic concepts for multi-label text categorization. approach achieves the state-of-the-art performance when the We review related work in Section III. Section IV presents the CNN-RNN model is trained using a large-sized dataset. details of our CNN-RNN method. Experiments for exploring the parameters of our model and comparisons with the base- I. I NTRODUCTION lines are illustrated from Section V to Section VII. And finally, Multi-label text categorization refers to the task of assigning Section VIII concludes the paper. one or multiple categories (or labels) to a textual document, which can be time-consuming and sometimes intractable. Text II. BACKGROUND categorization and multi-label text categorization have been Multi-label text categorization can generally be divided enormously applied to real-world problems, e.g., information into two sub-tasks: text feature extraction and multi-label retrieval [1], affective computing [2], sentiment analysis [3], classification. An introduction of these two sub-tasks will be email spam detection [4], multimodal content analysis [5], etc. described in the following two subsections. Efficient ways to tackling this task require advanced semantic representations of texts that go beyond the bag-of- A. Text features words model which ignores local orderings of words and Raw text information cannot be directly used in the sub- phrases, and hence, is unable to grasp local semantic informa- sequent multi-label classifier. Many research works represent tion [6]. Moreover, labels often exhibit strong co-occurrence the features of texts and words as vectors in a framework dependences. For example, a color label “green” often occurs termed vector space model (VSM). Instead of representing the with the subject labels “tree” or “leaf”, however, the color whole text in one step, e.g., tf-idf weighting for a document, label “blue” will never come with the subject label “dog”. most recent works focus on the distributional representation By considering such label correlations instead of considering of individual words, which is nevertheless pre-processed and labels independently, a more efficient prediction model is tokenized from the original raw text data. Grouping similar expected to be obtained. words in the vector space has shown good performance in In this work, we propose an ensemble application of convo- many natural language processing (NLP) tasks, e.g., multi- lutional neural network (CNN) and recurrent neural network task learning [7], sentence classification [8], sentiment analysis (RNN) to tackle the problem of multi-label text categorization. [9], semantically equivalent detection [10], etc. Among them, We develop a CNN-RNN architecture to model the global research efforts starting from [11] bring in a word2vec model and local semantic information of texts, and we then utilize that can capture both syntactic and semantic information with such label correlations for prediction. In particular, we employ demonstrated effectiveness in many NLP tasks. The basic idea 978-1-5090-6182-2/17/$31.00 ©2017 IEEE 2377 [30] Since text features can be well represented as a feature claims that the pairwise ranking loss is not consistent with the vector. Then. words wi . can only be used to capture term b ∈ Rh . personality detection representation of word vectors will be used in our word-vector [28]. We can treat the so that any binary classification algorithm can be adapted. ML-KNN the filter. sarcasm local context of words to get its hidden semantics. wi+h−1 . where with widths equal to the dimensionality of the word vectors the subsequent classifiers in the chain will take into account (i. the number of adjacent words considered jointly. the next step is to do multi-label classification for such rank loss. text feature extraction feature vector for the input text. and more. e. The method are introduced. to solve multi-label classification a filter w ∈ Rhk (a vector of h × k dimensions) and a bias problems. the details of the proposed CNN-RNN multi-label assignments for each instance simultaneously. the generalization to the multi-label problem presents In this section. and · is the dot product between including multi-task learning [7]. Our approach consists of two parts: exponential growth of possible label combinations. to various NLP tasks. and Rank-SVM [21] adapt popular classi- Let wi:i+h−1 refer to the concatenation vector of h adjacent fication algorithms. Previous works use either tf-idf weighting RNN prediction can model the high-order correlation between [12] or n-gram information [13] for the whole document or labels so that it is expected to improve the performance over local sequence of words. decision trees. is tor X ∈ Rm .. such as CNN and RNN. aspect extraction [27]. unlike traditional supervised classification IV. For algorithm adaptation. where i = 1. classification algorithms fall into two main categories: problem Thus. For this reason. k).e. more recent neural network method. k-nearest-neighbors. [16]. which is assigned to a subset y of the label space introduced in [31]. large-scale data. xi is the i-th instance combined method for multi-label text categorization. neural network based methods have been proposed. Problem adaption is about transforming multi-label classification into a simpler problem. which Each word will look up the word embedding matrix obtained have been successfully used in many NLP tasks. However. it is reasonable to use filters problem into a chain of binary classification problems.. the “initial state” or prior knowledge of the RNN Many algorithms have been proposed for representing the is determined and used to predict a sequence of labels. vectors.of the word2vec model in [11] is to consider the contextual multimodal analysis [22].. [19].e. i. because each word is repre- correlation between labels by transforming the multi-label sented as a k-dimensional vector. which utilizes the [23]. Thus we simply vary the window size (or height) of the preceding predictions. we pre-train a word2vec model for capturing the local III. with these feature and multi-label classification.. sentence vector as “image”. [17] tracts each label as individual classification where ⊕ is the vector concatenation operator. words as a prediction given a center word. these types of neural networks Y with L possible labels. we introduce a CNN-RNN where N is total amount of training data.. . yi )|1 ≤ i ≤ N }. however.. ordering of words and loses some semantic information. CNN-RNN M ULTI -L ABEL T EXT C ATEGORIZATION problems where each instance is only assigned with one label. oi = wT · xi:i+h−1 + b (2) Recently. CNN is used to extract a global fix-length of multi-label text categorization. a text of length m is represented as: adaption and algorithm adaption. i. Multi-label classification and a pairwise ranking loss function to tackle this task.. This motivates some other research the one used in [8]. Most methods. however. Word-vector based CNN while incorporating more n-gram information is not effective The way to use CNN to extract text features is similar to due to the data sparsity. from the word2vec model as described in Section II-A. machine translation [25].g..e. wi+1 . This filter is applied 2378 .n − h + 1.. [29] pro- poses a BP-MLL that utilizes a full-connected neural network B. polarity detection [24]. A. Thus. ⊕ wm (1) For example. semantic relevance detection [10]. the CNN part for extracting text features and the RNN part for makes multi-label classification a very challenging problem. Before training the whole CNN-RNN model. [11] to learn a low-dimensional vector firstly preprocessed and tokenized into a sequence of words. several deep learning works have applied neural networks. R ELATED W ORK features for each word and feed it in the following supervised We discuss some representative works for the two subtasks CNN-RNN training. multi-label prediction. recent developments in multi-label original text is then a concatenation of word vectors wi ∈ Rk .. which is applied to h words to produce a new the first or second order label correlation or are computational value oi ∈ R: intractable in considering high-order label correlations. The features of text. Another features. The task is to learn a function h: are still not efficient enough to tackle high dimensional and Rm → 2Y from the training data D = {(xi . Each input instance is a high dimensional feature vec. tf-idf ignores the local the predictors that model the lower-order labels correlation. The On the other hand. the filter vector and the word vector. As for multi-label text categorization. A convolution operation involves support vector machines. In text applications. and adopts the cross-entropy loss instead. sentence classification [8]. called ML-HARAM. input feature vector in Rm and yi ⊂ Y is a subset of the label space Y. ML-DT [20]. S = w1:m = w1 ⊕ w2 ⊕ . and perform convolution on it via Classifier chains [18] are designed to model the high-order linear filters. [15].. The details are as follows: the input text is works [14]. However. representation of words considering its local context. This vector detection [26]. several based CNN as the initial representation of the raw text. o = [o1 . For example. activation function is chosen as ReLu. we obtain a low-dimension feature representation for the text. o. concatenating output of k filters of each the three of memory cells. the CNN is used to extract the this text feature vector is used as the input for the RNN. we apply a pooling function to each feature map if the lower layer is word embedding of the labels or can be to induce a fixed-length vector. then ht ∈ Rq . For gradients problems. 5) will result in a 3k-dimension output control what to read into and write out of from the memory vector. etc. LSTMs have been involved Rq×t . [35] illustrates that the training of LSTM <START>. The non-linear o = σ(W (o) xt + U (o) ht−1 + b(o) ). there is softmax layer above the can be stabilized by pre-training the LSTMs as sequence auto. As a result. top layer of LSTM to calculate the probability of each label by encoders or recurrent language models. o2 . x2:h+1 . The label sequence here is the assignments of ordered labels to Therefore.. U (. an output input text is “the cat is sleeping near a small American flag gate o and a forget gate f . Forget gate f is used to erase some parts example. W (. whole network is able to predict a sequence of related labels tion.) is the The dimensionality of the feature map generated by each filter sigmoid function. In the text mining domain. a filter Rq×d . where T is the successful RNN variants. auto-encoder. The feeding of the feature vector into the that introduces nonlinearity transitions to model long-term LSTM is by a linear transformation as an additional addition nonlinear dependences. u). Here. We apply a non-linear activation function f to each oi to produce a i = σ(W (i) xt + U (i) ht−1 + b(i) ). . By projecting its output vector into a lower dimensional cells. x1:h . and the subject (Animal). what to output. In each time step. labels that belongs to that text... As shown in Section IV-A. ReLu(oi ) = max(0. for each piece of text. Label sequence prediction by RNN into LSTM as the initial state for the label prediction. problem is avoided when we use LSTMs for label sequence Then.) ∈ Rq for all types (i. the RNN is mainly used for sequential labeling or predic. relationship can be modeled. the standard sequence will be predicted. When using it for the whole long document. which introduces three additional output text feature from CNN with a fix-dimension t. feature map c ∈ Rn−h+1 where ci = f (oi ). or be used as the output vector for different NLP tasks. the label with maximal probability will be predicted..) ∈ Rq×q and b(. Although several variants of LSTMs exists. because this function f = σ(W (f ) xt + U (f ) ht−1 + b(f ) ). sentence classification the formula for the input gate is changed to: [34]. a various length of label a text. If there are q LSTM units. and d can be the dimension of word vector of the labels size. o. One may also specify multiple kinds of filters with different ht = o tanh(ct ) window sizes. The three gates work collaboratively in a bed” and this corresponding ground truth labels include to control what to be read by input. 2379 . in the task of sentiment analysis [33]. the A LSTM consists of three gates: an input gate i. Combining CNN and RNN fed into the next layer of the CNN for further convolution.. (4) u = tanh(W (u) xt + U (u) ht−1 + b(u) ). f.repeatedly to each possible window of h words in the sen. where denotes element-wise multiplication and σ(. on−h+1 ].. in Figure 1. etc. However.) ∈ 1) with the maximum value for each filter. while the input gate i and output gate o window-size (1. space using a full-connect layer. f. or use multiple filters for the same window size to learn complementary features from the same word windows. After feeding this output vector B.. feature of text so that the semantic information is represented as a output vector from CNN. so that some complex long-term (Midnight). which is typically much shorter than a document.e. An additional word embedding layer is also sequence of each input text is exactly matched to the subset of applied for the labels. xn−h+1:n ) to produce an output as follows: sequence o ∈ Rn−h+1 ..e. W (T ) ∈ gates. and q is the hidden dimension of LSTM. Moreover. 3. For example. the location (United States) and the time what should be forgotten. the training of LSTMs is not stable and underperformed traditional i = σ(W (i) xt + U (i) ht−1 + b(i) + W (T ) T ) (5) linear predictors as shown in [35]. such as language model. The standard LSTM is expressed tence (i. LSTM is considered as one of most term W (T ) T in (4) for each types (i. the above using linear transform the hidden-state of top LSTM layer first. A common strategy is 1-max the hidden-state dimension of the lower layer if the lower layer pooling. as studied in [32]. there will be i × j outputs in total. xt ∈ Rd is the input from lower layer at time will vary as a function of the text length and the filter’s window step t. have more benefits over sigmoid and tanh. The ideal case will be that the label LSTM is used.e. u). which extracts a scalar (i. the training and testing time of LSTMs are also time/resource consuming for The label sequence prediction is always started with the token the long document. This feature vector can be C. Memory cell ct is the key in LSTM which keeps the long- which means that if there are i window-size filters and each term dependencies while getting rid of the vanishing/exploding of them have j filters. . oi ) (3) ct = i u + f ct−1 . predictions. with a certain window size will only produce one scalar value. a feature vector of length is LSTM. i. The prediction of labels is ended with the <END> token. It can be to the input documents according to the features extracted by considered as an extension of the hidden Markov model CNN (Figure 1. Thus.. #'!$''$'''$'$ *!!& . '!$'' $&''!.!'(2 !$ #&'"$' $%''("!'$'$ #% "0%! "01 "0&" !" *$'% ( . $''!$( +%. . +%.$''!$( %$'"$% '$'$ ) ) #! $ "$% "$% . By correcting the errors of the raw precision.&&" ) 3 ' ''''''''''('''(! "' ///'''''''''''''. we preprocess these categorization since 1996 (now superseded by RCV1). that is the supervised training for the whose amount is 103 and also collect all these dataset to CNN-RNN model.24 V. and F1 score. rics. Then. The first training step train/test split in [37].& Fig. We only consider the topic categories as labels second training step. the ranking archive of over 800.000 manually categorized newswire metric evaluated here is one-error. there are a training set (document is to train a word2vec model by using all the tokenized texts IDs 2286 to 26150) and test set document IDs 26151 to as unlabelled data. for the average number of labels per document regularization. There are two types of evaluation metrics for multi-label validation to compare various algorithms. recall. the cross-entropy loss is then back-propagated from RNN down to CNN to update the weights of both the CNN-RNN model. cosine-normalized) versions. 2380 . Since the RCV1-v2 dataset provides both the token publicly-available datasets. As in the LYRL2004 Two steps of trainings are performed. we use all these documents for 10-fold cross. In the Modified Apte as a 1000-dimensional feature vector.603 docs for training then be used for all baseline algorithms. This vector model can (“ModApte”) Split. C is gradient decent (SGD) for faster convergence. We use this split to do the VI. layer of LSTM for labels prediction. One state-of-the. Reuters-21578 10789 18637 90 1. M is the number of documents. each document is presented now contains 21.. it their document frequency. 1: Our proposed CNN-RNN model D. For the classification met- stories recently made available by Reuters Ltd for re. documents and choose the top 1000 tf-idf features according to Adapted from this preview Reuters-22173 version. we apply l2-constraints over all weights in dataset M D L C CNN and RNN and use dropout with rate 0.299 docs for testing. the Reuters- once a popular dataset for researchers who worked in text 21578 only has a raw dataset. Using the softmax classifier as the upper do 10-fold cross-validation instead of original split. Since our model does • RCV1-v2: Reuters Corpus Volume I (RCV1) is an not provide a complete ranking for all the labels. TABLE I: Dataset Overview.5 in CNN. Reuters-21578 and RCV1-v2. a text categorization test collection is pro- duced which is called RCV1-v2. classification: ranking and classification. and vector (tf-idf. this word2vec model is fed into the 810596). Then. DATASETS The characteristics of the above dataset are summarized in To test the performance of our algorithm. This was model for all the baseline algorithms. and 3. art update policy Adam [36] is chosen instead of stochastic D is the total vocabulary size. Thus. the macro/micro-averaged search purposes. L is the number of labels. Training RCV1 data. we use two Table I. there are 9. i. "0%! "01 "0&" -$&'%.e. However. After that.13 RCV1-v2 804414 47236 103 3. we consider the hamming loss. we can directly • Reuters-21578: The documents in this dataset were use the token version for our CNN-RNN model and the vector collected from the Reuters newswire in 1987. E VALUATION M ETRICS empirical studies of parameters of our neural network.578 documents. Besides. Besides. and ML-HARAM. ML. Therefore. Different The parameters of RNN include the dimension of labels. while micro-averaged counts all true positives. There are two ways to calculate these metrics over the whole test data: macro-averaged and micro-averaged. tn. VII. Besides. f n) to evaluate the performance of a classification problem. • Precision. Recall and F1 score) over labels. true negatives (tn). the hidden dimension and the number of layers. true negatives. when the ratio is set to 50%. Here. 2: Metrics versus training epoch for different RNN used as the input or the prior knowledge of RNN. the system performance does not change significantly generalization. The dimension of word vector is chosen as 400 while the number of filters in each window size is chosen as 128. For the parameter studies. More layers As shown in the figures. Complexity of RNN versus performance different RNN complexities is shown in Figure 2. and begins to converge. Here. Macro- averaged refers to the average performance (Precision. E XPERIMENTAL R ESULTS (c) MiP (d) MiR In this section. 1- Max-pooling is applied in all filters so that each window size will output a 128-dimension feature vector. for simplicity. the parameters include the dimension of label embedding. • The one-error counts the fraction of instances whose top- 1 predicted label is not in the relevant labels. Apart from the investigation of the is the dimension of the feature vector). The parameters of CNN is chosen based on (e) MiF (f) MaP the validation results for the Reuters-21578 dataset. we perform experiments to investigate the influence of complexity of the network on the performance and also compare our method with several baselines. We baseline: Binary Relevance (BR). observe the various performances over the validation set of the kNN. because [32] shows that a rectified linear unit has more benefits than sigmoid and tanh function. curves correspond to different α values. In the RNN settings part. when the training epoch goes be- one layer LSTM parameter investigation without the loss of yond 50. We monotonically network itself. false positives and false negatives first among all labels and then has a binary evaluation for its overall counts. to save the computational 2381 . An empirical study over these parameters are We simultaneously set the dimension of labels and hidden performed by using the Reuters-21578 dataset with Modified dimension to be α of the input dimension from CNN (which Apte (“ModApte”) Split. the structure of the CNN consists of a pre- trained word embedding layer and one layer CNN with several window sizes. The window size is chosen up to 9-gram. we (g) MaR (h) MaF concatenate these feature and use a full-connection layer to project it into a 256-dimension output vector. Reuters-21578 data with Modified Apte (“ModApte”) Split. The nonlinear function we choose is Relu. we also compare its results with the several increase α from 12. Recall and F1 score are binary evaluation measures B(tp. Classifier Chain (CC). Then the various performances versus training epoch for the A. we focus with the 100% case. the dimension of hidden state and also the number of layer of LSTM.5% to 100% with the doubling step. • The hamming loss counts the symmetric difference of the predicted labels and the relevant labels and calculates the fraction of its difference over the label space. since the investigation of CNN structure is already shown in [8]. f p. such that it will not limit the overall performance when incorporating RNN. per- will generally be able to get higher abstract content and formance is superior than the lower ratio ones and comparable keep more long-term memory. which is calculated based on the number of true positives (tp). (a) One error (b) Hamming loss false positives (f p) and false negatives (f n). we mainly focus on the parameters of the RNN part. which is then Fig. is a transformation-based algorithm.012 ± 0.026 MiF 0.853 ± 0.937 ± 0. BR is used to of training data.002 N/A N/A 0.008 N/A N/A 0.046 0. Classifier chain (CC).712 ± 0. We iments.687 ± 0.028 0. the system may suffer from overfitting. We perform 10 iteration experiments. in the experiment that ML-kNN is inferior to any other methods.004 MaR 0.0005 0. the proposed makes not much sense to model high-order correlation.898 ± 0.0003 0.646 ± 0.849 ± 0. For the RCV1-v2 dataset.940 ± 0.005 0. We test various algorithms over the above datasets.470 ± 0.012 0.008 0.855 ± 0.032 0.009 H-loss 0.006 0. The result is shown in Table II.0004 MiP 0. However.693 ± 0. not scalable with the data size and results in memory error. into consideration the high-order label correlation.237 ± 0.775 ± 0.002 H-loss 0.008 effort and keep a comparable performance. especially in macro-averaged metrics.012 MaP 0.0032 ± 0.010 0.008 0. averaged metrics relative to the previous dataset are due to the Each of the two datasets is randomly divided into 10 equal.035 0.811 ± 0.004 N/A N/A 0.878 ± 0.002 MaR 0. CC is observed to have a slightly better plexity. we calculate the mean and deviation of the 10 exper.847 ± 0. we set the CNN-RNN parameter ratio as 50% of-art neural network based approach ML-HARAM is also and the training epoch is limited to 50.239 ± 0.001 0.361 ± 0. inferior to our proposed neural network based approach in all metrics.023 0.038 0.020 0.768 ± 0.0009 0.0066 ± 0.794 ± 0.014 MiR 0.015 MaP 0.350 ± 0.0005 0.005 N/A N/A 0.902 ± 0. metrics and the hamming loss.040 0.012 MaF 0.889 ± 0.474 ± 0.007 N/A N/A 0.010 MiR 0.0086 ± 0.803 ± 0. The implementation expect that our model can also outperform these two methods of all the baseline methods are done in Scikit-multilearn [38]. We investigate the number of neighbors in ML-kNN from 5 to 20 with the VIII. The increases in the Macro- 10-fold cross-validation is performed over these algorithms.0031 ± 0.0011 0.014 0. when trained over a large-scale dataset.016 0. and ML-HARAM is a show substantially decreased performances in micro-averaged state-of-the-art neural network based algorithm.879 ± 0. per document in the Reuters-21578 are about one. TABLE II: Comparison of various algorithms (Mean ± Deviation) Algorithm BR CC ML-kNN ML-HARAM CNN-RNN Reuters-21578 One-error 0.010 ± 0.032 0.815 ± 0. we proposed a convolutional neural network (CNN) and performance over training data. the to multi-label text categorization fall short to extract local CNN-RNN model does not show an improvement over the semantic information and to model label correlations. and each the metrics for ML-kNN and ML-HARAM are not available of them takes 9 subsets.024 0.287 ± 0. ML-HARAM follows its default Multi-label text categorization is the task of assigning pre- setting. the micro-averaged precision in the training dataset order label correlation with a reasonable computational com- is 99%. CC is an algorithm that takes less frequently in the dataset.030 0.0088 ± 0.027 0.385 ± 0.019 0. Notice model can achieve the state-of-the-art performance. both of the traditional methods.026 MaF 0. Besides. and choose 10 finally that is optimized for both Hamming-loss and F1 score.0088 ± 0.0093 ± 0. for the dataset Reuters-21578. Note that sized subsets.014 0. built on top of the scikit-learn ecosystem [39].083 ± 0. If the The reason could be that the average number of labels data size is too small.369 ± 0. so that it However.828 ± 0.006 0.032 0.813 ± 0. Baseline comparison outperforms any other baseline methods in most of the metrics. learning methods would generally benefit from a vast number kNN. C ONCLUSION step of 5.0038 ± 0.823 ± 0.748 ± 0. as it does in the small dataset. 2382 . In BR and CC. and find that this model tends recurrent neural network (RNN) based method that is capable to overfit the training data of this relative small dataset. For of efficiently representing textual features and modeling high- example.812 ± 0.803 ± 0. further experiments an open-source library for the multi-label classification that is are required to confirm it.886 ± 0. in. The state- with baselines.084 ± 0.008 0.298 ± 0. and the rest one is hold out for testing.019 0.017 RCV1-v2 One-error 0. In this two traditional methods (BR and CC).0004 N/A N/A 0.041 0.473 ± 0.009 N/A N/A 0.015 MiF 0. defined categories to textual documents.045 ± 0.016 0. the proposed method B.644 ± 0.029 0. ML.464 ± 0. We investigate the paper. the linear SVMs are used as the base classifier.762 ± 0. Existing approaches As shown in Table II.019 0. ML-HARAM and our CNN-RNN model. ML-kNN Meanwhile. more balanced distribution of labels in this dataset.0004 MiP 0.242 ± 0. as the Scikit-multilearn implementation of these algorithms is Then.024 0.091 ± 0.395 ± 0.595 ± 0. This verifies that deep cluding Binary relevance (BR).112 ± 0. We conjecture that modeling the high-order represent multi-label algorithms without taking into account label correlation helps to predict the minor labels that occurs the correlation between labels. BR and CC.205 ± 0. method is affected by the size of the training dataset.036 0.623 ± 0. Our evaluations reveal that the power of the proposed performance than BR except for the micro-averaged precision.342 ± 0.322 ± 0.023 0.006 N/A N/A 0.031 0. Schwenk. Williams. 37. P. L. word dependencies in text by means of a deep recurrent belief network. [18] J. D.3781. scale multi-label text classification—revisiting neural networks. 2004. and E. and R. 1422–1432.” in Proceedings of the 2015 p. no. Gelbukh. Zhang and Z. Elisseeff and J. Zhang and Z. G. Ong. [28] N. Gulcehre. Majumder. “Ml-knn: A lazy learning approach to python. J. vol. tically linkable knowledge in developer online forums via convolutional [2] S. Grisel. H. Pfahringer. vol. J. B. 50–59. pp. D. [33] D. and E. 2017. Zweig. scene classification. F. 40. no. pp. 37. vol.00075. vol. Cambria. semantically equivalent questions in online user forums.” in Advances in neural information processing systems. pp. Weston. Dai and Q. Ba.-L. Bengio. P. “Learning represen. Luo. A. Y. Vander- [20] A. M. 2013. 2.6980. Manning. “Deep learning [8] Y. 31. and J.” CoNLL 2015. “A review of affective neural network. Kingma and J. R. pp. and C. Gelbukh.” IEEE Computational Intelligence Mag. research. 12. E. [21] A.” arXiv preprint arXiv:1301. vol. 361–397.-H. “Natural language processing (almost) from scratch. pp. [3] E. [10] D. E. K. Welsch.” Computer Networks and ISDN Systems.” Knowledge-Based ings of the 25th international conference on Machine learning. [13] A.-S. 2017. Szymański. and C. sarcastic tweets using deep convolutional neural networks. Conference on Empirical Methods in Natural Language Processing. vol. 42–53. Senécal. “Learning ligent Systems. and S. Vij. vol. Shen. [39] F. 42–49. and T. E. Mencı́a. pp. pp. L. Hussain. Corrado. York. Gurevych. and B. “Predicting seman- tional journal of computer vision. USA: ACM. Poria. 174. 2014. “Improved semantic represen- Learning. 123. Poria. pp. Barbosa. Z. Mikolov. 847–854. vol. 2014. 2014. Kavukcuoglu. pp. 437–452. Tsang. 1157–1166. Bogdanova. Lewis.” IEEE arXiv preprint arXiv:1408.5882. and A. “Learning phrase representations using audio.1078. Nair and G. Socher. pp. Gong. “Learning multi-label 3087. [36] D. 5. Glassman. 144–154. Poria. Yang. Gramfort. Z. G. 2015. Carreras and L. 2016. M. X. 2016. no. A. pp.-S. E.” Journal of machine learning Learning and Knowledge Discovery in Databases. IEEE. 2014. B. A. “Multilabel neural networks with applica- affective intuition for concept-level sentiment analysis.” 2014.” in AAAI.” Cognitive modeling. “Document modeling with gated recurrent tations by back-propagating errors. Cambria and B. Cambria. C. Bottou.” in European Conference on Principles of Data Mining and esnay. 3. 108. 18.-H. Benites and E. Li. multi-label learning. Cournapeau.” in Joint European Conference on Machine collection for text categorization research. “A deeper look into language processing research. Kim. Howard. R. T. pp. and C.” [4] X. “Classifier chains for [37] D. 2001. Van Merriënboer. Broder.” in Proceed. no. 108. Qin. I. Gauvain. pp. “Rectified linear units improve restricted boltz- clustering of the web. and S. D. Bajpai. C. and Y. E. Isard.” Journal of Machine Knowledge Discovery.” The [35] A. F. Passos. E. ACM. Perrot. 1975.” in Joint [11] T. 2825–2830. arXiv:1406. O. [7] R. pp. pp.” Pattern recognition. “Detecting 2006. 2011. 2038–2048. “Affective computing and sentiment analysis. “A multi-view embedding multimodal sentiment analysis. Xia. J. 2001. vol. M. [25] K. R EFERENCES [22] S. Chaturvedi. J.-L. and A. 2008. and their semantics. “Adam: A method for stochastic optimization. “Efficient estimation of European Conference on Machine Learning and Knowledge Discovery word representations in vector space. 106.” Pattern recognition. Cambria. vol. pp. no. Karlen. Cambria. Huang. Marquez. Nam. B. R. Weston. 2015. Frank. V. and A. 2016. A. 2. B.” 2004. 11. pp. Zhou. tags. [24] I. 2. “A vector space model for network for large-scale text classification. Springer. S. 2016. Schwenk. Q. M. Apr. plas. V. “Knowledge discovery in multi-label pheno. 1. [38] P. 10. “AffectiveSpace 2: Enabling [29] M. Y. New Fusion. C. Michel. [14] D. 2. 2007. “Aspect extraction for opinion processing: Deep neural networks with multitask learning. 210–233. Poria. 9. NY. pp. mann machines. Dubourg. azine. “Scikit-multilearn: Enhancing multi-label classification in [19] M. vol. S. pp. M. Advances in Neural Information Processing Systems.-B. Manasse. pp. Cambria. 2010. Intelligent Systems. Cambria.” Interna- [23] B. D.” in Journal of Machine Learning Research. Bengio. “Rcv1: A new benchmark multi-label classification. 681–687. Holmes. on Machine Learning (ICML-10).” Communications of the ACM. Fu. Springer. M. Thirion. no. 58–64. pp. Blondel. no. 137–186.” based document modeling for personality detection from text. “A kernel method for multi-labelled classification. [9] E. Fürnkranz. Li. 1338–1351. Collobert and J. L. 1997. V. manuscript in preparation. and preprint arXiv:1503. vol. 2015. 8. 1988. 160–167. Duch- type data. 2009. 2014. Cambria. 3079– [17] M. no.-S. Ye. F. R. vol. 29. Bisio. tations from tree-structured long short-term memory networks. mining with a deep convolutional neural network. Kim. vol. tions to functional genomics and text categorization. and E. Pedregosa. 254–269. 2539–2544. Poria. White. in Databases. no. vol. 18. J.” in RANLP. 2016. D. R. 32. M. Weiss.” in Innovations in Machine [34] K. Hussain. Systems. [15] Y. and E. 2383 . Morin.” in EMNLP. 98–125. Varoquaux. K. “Neural probabilistic language models. Read. “Semi-supervised sequence learning. 48–57. 2001. S. [31] F. [30] J.” IEEE Intel. Hazarika. 2015. “Jumping NLP curves: A review of natural [26] S. pp. Zadrozny. “Syntactic [32] V. Wong. 51–62.” in COLING. 2011. Lazebnik. 2014. 1757–1771. E. Dean. Chen. visual and textual clues for sentiment analysis from multimodal rnn encoder-decoder for statistical machine translation. [5] S. Tai. and A. Salton. Bahdanau. 2015. “Haram: a hierarchical aram neural [12] G. King. vol. pp. D. pp. Brucher. 5.” arXiv [16] R. Rumelhart. Ke. Cambria.” Neurocomputing.-L. Weston. G. pp. pp. Springer. J. Xu. Collobert. Le. filtering. Boutell. Brown. Hinton. Learning Research. 1601–1612. E. N. Zhou. and J. no. arXiv preprint arXiv:1412. G. J. vol. and S. “Large- p. Tang. S. E. “Fusing H. 102–107. neural network for sentiment classification. Cho. 807–814. 613–620. I. “A unified architecture for natural language [27] S. Hinton. Gelbukh. “Boosting trees for anti-spam email Knowledge-Based Systems. Sapozhnikova. G.” in Proceedings of the 27th International Conference no. G. 2016. D. on Knowledge and Data Engineering. “Deep convolutional neural network textual features and multiple kernel learning for utterance-level [1] Y. 2006. Chen. 2015. and J. Liu. pp. space for modeling internet images.” Information Conference on Automated Software Engineering (ASE 2016). and F. Conference on Data Mining Workshop (ICDMW). pp. pp. Cambria. Poria. dos Santos. Xing. Rose. Prettenhofer. 7. G. 2493–2537. [6] E. Clare and R. “Convolutional neural networks for sentence classification. Bougares. X.” IEEE transactions 2015. 9. and G. Kuksa. 12. “Scikit-learn: Machine learning in Python. and P. Yang. M. Poria. Springer.” in Proceedings of the 31st IEEE/ACM International computing: From unimodal analysis to multimodal fusion.” in 2015 IEEE International automatic indexing. pp. 508–514. Austin.” arXiv preprint content.

Ensemble Application of Convolutional and Recurrent Neural Networks for Multi-label Text Categorization

Comments

Description