Natural Language Processing Projects Summary

Posted by Z on December 17, 2019

Hi, I am Yuji. Here are some NLP projects I have implemented.

Always updating :)

1. Commenter-Based Prediction on the Helpfulness of Online Product Reviews

code available

Baselines: Implementations of features previously used in the related field (a small feature-extraction sketch follows this list)

• STR (Structure) (Kim et al., 2006; Xiong and Litman, 2011)

  • total number of tokens
  • total number of sentences
  • average length of sentences
  • number of exclamation marks
  • the percentage of question sentences

• UGR (Unigram) (Kim et al., 2006; Xiong and Litman, 2011; Agarwal et al., 2011)

  • Very reliable features
  • a vocabulary with all stop-words and non-frequent words (df < 3) removed
  • the size of the vocabulary was limited to 1000
  • each review is represented over this vocabulary, with TF-IDF weights for each term that appears in it

• GALC (Geneva Affect Label Coder) (Scherer, 2005)

  • proposes to recognize 36 affective states commonly distinguished by words.
  • construct a feature vector with the number of occurrences of each emotion plus one additional dimension for non-emotional words

• SEN (Semantic features) (Yang et al., 2015)

  • are introduced to describe the sentiment in texts
  • LIWC and the General Inquirer (INQUIRER) were used in prior work; since we do not have free access to those dictionary data, we extracted and designed two new semantic features:
  • sentiment polarity [-1, +1]
  • subjectivity [0, +1]; both are computed with the open-source Natural Language Processing tool TextBlob
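As a rough illustration of how these baseline features can be computed, here is a minimal sketch in Python. The use of scikit-learn's TfidfVectorizer for UGR and of TextBlob for SEN, as well as the helper names, are my own choices for this post; the actual experiments may differ in tokenization and preprocessing details.

```python
# Minimal sketch of the STR, UGR, and SEN baseline features.
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob


def str_features(review: str) -> list:
    """STR: simple structural statistics of a review."""
    sentences = [s for s in re.split(r'[.!?]+', review) if s.strip()]
    tokens = review.split()
    n_sent = max(len(sentences), 1)
    return [
        len(tokens),                  # total number of tokens
        len(sentences),               # total number of sentences
        len(tokens) / n_sent,         # average length of sentences (in tokens)
        review.count('!'),            # number of exclamation marks
        review.count('?') / n_sent,   # rough percentage of question sentences
    ]


def sen_features(review: str) -> list:
    """SEN: sentiment polarity in [-1, +1] and subjectivity in [0, +1] via TextBlob."""
    blob = TextBlob(review)
    return [blob.sentiment.polarity, blob.sentiment.subjectivity]


def ugr_matrix(reviews: list) -> np.ndarray:
    """UGR: TF-IDF over a vocabulary with stop words and rare terms (df < 3) removed,
    capped at 1000 terms."""
    vec = TfidfVectorizer(stop_words='english', min_df=3, max_features=1000)
    return vec.fit_transform(reviews).toarray()
```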

Proposed Feature: USR (Commenter Based Features)

  • Extract commenter-based features by focusing on statistics computed on the users’ historical information.
  • Extract commenter-based features by FunkSVD
    $H_{m\times n} = U^T_{m\times k}P_{k\times n}$
    H is the helpfulness matrix, U can be interpreted as user (commenter) information, and P as product information. We use the training data to learn these matrices.
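Below is a minimal sketch of a FunkSVD-style factorization of the helpfulness matrix, assuming missing entries are marked as NaN; the rank, learning rate, regularization, and epoch count are illustrative, not the values used in the project.

```python
import numpy as np


def funk_svd(H, k=10, lr=0.01, reg=0.02, epochs=100, seed=0):
    """Factorize H (m x n, NaN = unknown) as U^T P, where U (k x m) holds
    user factors and P (k x n) holds product factors, via SGD on known entries."""
    rng = np.random.default_rng(seed)
    m, n = H.shape
    U = 0.1 * rng.standard_normal((k, m))
    P = 0.1 * rng.standard_normal((k, n))
    observed = np.argwhere(~np.isnan(H))          # train only on observed entries
    for _ in range(epochs):
        for i, j in observed:
            err = H[i, j] - U[:, i] @ P[:, j]     # error on (user i, product j)
            U[:, i] += lr * (err * P[:, j] - reg * U[:, i])
            P[:, j] += lr * (err * U[:, i] - reg * P[:, j])
    return U, P
```

The learned column $U_{:,i}$ can then serve as the commenter-based feature vector for user $i$.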

2. Relation Classification Based on CNN with Relation Feature

• To study how the relations contained in other texts influence the extraction of a target relation, a deep neural network framework is used that models word-level and sentence-level features separately: words are vectorized with word-embedding representations, and sentence-level features are extracted from the input word features, position features, and relation features with a Convolutional Neural Network (CNN). A rough architecture sketch follows this list.

• The relation features proposed in this work are constructed by introducing texts with different relations and the similarity between these texts and the target text. When all features are used, the model reaches an F1-score of 0.66 on the SemEval-2010 Task 8 test set, the best result among our configurations on the relation classification task.

• The experiments not only demonstrate the validity of the selected features and the deep neural network model, but also show that descriptive texts of different relations can, to some extent, help the model understand certain other relations.
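The following is a rough PyTorch sketch of the architecture described in the first bullet: word and position embeddings are concatenated, passed through a 1-D convolution with max-over-time pooling, and the resulting sentence vector is concatenated with a relation feature vector before classification. The dimensions, layer sizes, and the RelationCNN name are illustrative, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn


class RelationCNN(nn.Module):
    """Word + position embeddings -> Conv1d + max-pool -> concat relation features -> logits."""

    def __init__(self, vocab_size, n_classes, emb_dim=100, pos_dim=10,
                 max_len=100, n_filters=150, rel_feat_dim=20, window=3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        # two position embeddings: relative distance to each of the two entities
        self.pos1_emb = nn.Embedding(2 * max_len, pos_dim)
        self.pos2_emb = nn.Embedding(2 * max_len, pos_dim)
        in_dim = emb_dim + 2 * pos_dim
        self.conv = nn.Conv1d(in_dim, n_filters, kernel_size=window, padding=1)
        self.fc = nn.Linear(n_filters + rel_feat_dim, n_classes)

    def forward(self, words, pos1, pos2, rel_feats):
        # words, pos1, pos2: (batch, seq_len); rel_feats: (batch, rel_feat_dim)
        x = torch.cat([self.word_emb(words),
                       self.pos1_emb(pos1),
                       self.pos2_emb(pos2)], dim=-1)          # (batch, seq_len, in_dim)
        x = torch.relu(self.conv(x.transpose(1, 2)))           # (batch, n_filters, seq_len)
        sent = torch.max(x, dim=2).values                      # max-over-time pooling
        return self.fc(torch.cat([sent, rel_feats], dim=1))    # logits over relation classes
```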

3. Chinese Address Standardization

• Provided an accurate, real-time Chinese address standardization service based on Conditional Random Fields (CRF), well-designed Natural Language Processing (NLP) techniques, and a multi-city, scalable, fine-grained standard address database.

• Word Segmentation: Based on the Conditional Random Field algorithm and a Bi-LSTM model (Accuracy: 99%+); a small CRF tagging sketch follows this list

• Address Gradation: Rule-based reasoning method with standard gazetteer (Accuracy: 99%+)

• Address Correction: Natural Language Processing techniques with standard address DB (Accuracy: 95%+)
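To make the word segmentation step more concrete, here is a small, hedged sketch of CRF-based Chinese word segmentation framed as character-level BMES tagging with sklearn-crfsuite; the production system's features, training corpus, and the Bi-LSTM component are not reproduced here.

```python
# Character-level BMES tagging for Chinese word segmentation with a linear-chain CRF.
import sklearn_crfsuite


def char_features(sent, i):
    """Features for character i: the character itself plus a small context window."""
    feats = {'char': sent[i], 'is_digit': sent[i].isdigit()}
    if i > 0:
        feats['prev_char'] = sent[i - 1]
    if i < len(sent) - 1:
        feats['next_char'] = sent[i + 1]
    return feats


def to_features(sent):
    return [char_features(sent, i) for i in range(len(sent))]


# Toy training example: "浙江省 / 杭州市 / 西湖区", each word tagged B-M-E.
X_train = [to_features("浙江省杭州市西湖区")]
y_train = [["B", "M", "E", "B", "M", "E", "B", "M", "E"]]

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict([to_features("西湖区")]))
```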

4. Knowledge Priority in Domain-Specific Knowledge Bases

• Jointly developed a method to calculate the priority of knowledge based on specificity and popularity, computed from cross-domain predicate distributions

• Implemented experiments on the Yahoo! Answers corpus, compared the high-priority predicates generated by our model with those determined by experts, and verified the effectiveness of our model

5. Sentence Order Sentiment Analysis of Weibo texts

code available

• Proposed a sentiment classification model based on sentence order information

• Conducted extensive experiments on two real datasets in different languages, verified results against classical models

Training phase

Sentiment Classifier:

We use the training data to train a Sentiment Classifier $C_1$ that classifies a Weibo (a Chinese Twitter-like platform) text as positive or negative.

Sentence Order Classifier:

For a given text, we first use the Sentiment Classifier $C_1$ to predict whether it is positive or negative; the predicted label is $y$.

Then we split the text into $M$ sentences $[s_1, \cdots, s_M]$. For each sentence $s_j$, we use the Sentiment Classifier $C_1$ to predict whether it is positive or negative; if the sentiment of $s_j$ is the same as $y$, then $y^j = 1$, otherwise $y^j = 0$, for $j \in \{1, 2, \cdots, M\}$. We call $y^j$ the consistency label.

Then we use a Sentence Order Vector $S_j$ to represent each sentence, where $S_j$ is a 10-dimensional indicator vector:

X = [is it the first sentence, the second sentence, the third sentence, the fourth sentence, the fifth sentence, the fifth sentence from the end, the fourth sentence from the end, the third sentence from the end, the second sentence from the end, the last sentence]

y = $y^j$

Then we have the training data for the Sentence Order Classifier.
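As a minimal sketch of how this training data can be assembled, assuming a pretrained sentiment classifier c1 exposing a hypothetical predict(text) -> {+1, -1} interface and a sentence splitter:

```python
def order_vector(j, m):
    """10-dim Sentence Order Vector for sentence j (0-based) out of m sentences:
    indicators for positions 1..5 from the start and 5..1 from the end."""
    return ([int(j == p) for p in range(5)] +
            [int(m - 1 - j == p) for p in (4, 3, 2, 1, 0)])


def build_order_training_data(texts, c1, split_sentences):
    """X: Sentence Order Vectors, y: consistency labels (sentence sentiment == text sentiment)."""
    X, y = [], []
    for text in texts:
        doc_label = c1.predict(text)                   # sentiment y of the whole text
        sentences = split_sentences(text)
        m = len(sentences)
        for j, s in enumerate(sentences):
            X.append(order_vector(j, m))
            y.append(int(c1.predict(s) == doc_label))  # consistency label y^j
    return X, y
```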

Predicting phase

For a Weibo text, we first predict its sentiment with the Sentiment Classifier $C_1$, and then split the text into sentences. For each sentence, we use $C_1$ to predict its sentiment.

We then use the Sentence Order Classifier to predict the consistency label of each sentence, which can be understood as the reliability/confidence of that sentence.

Finally, we use this confidence to weight each sentence's sentiment in a vote that produces the final sentiment label.
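A correspondingly hedged sketch of the predicting phase, reusing the hypothetical c1.predict interface and the order_vector helper above, and assuming the Sentence Order Classifier c2 exposes a scikit-learn-style predict_proba giving the probability that a sentence is consistent:

```python
def predict_sentiment(text, c1, c2, split_sentences):
    """Confidence-weighted vote: each sentence's sentiment (+1/-1) is weighted by the
    probability, from the Sentence Order Classifier, that the sentence is consistent."""
    sentences = split_sentences(text)
    m = len(sentences)
    score = 0.0
    for j, s in enumerate(sentences):
        sentiment = c1.predict(s)                                  # +1 or -1
        confidence = c2.predict_proba([order_vector(j, m)])[0][1]  # P(consistent)
        score += confidence * sentiment
    return 1 if score >= 0 else -1
```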