Please see the GitHub repository.
This repository presents our work on a project carried out for the IEOR 4523 Data Analytics class at Columbia University. This natural language processing project originally comes from a Kaggle competition.
The problem is described on Kaggle as follows: “The competition dataset contains text from works of fiction written by spooky authors of the public domain: Edgar Allan Poe, HP Lovecraft and Mary Shelley. The data was prepared by chunking larger texts into sentences using CoreNLP’s MaxEnt sentence tokenizer, so you may notice the odd non-sentence here and there. Your objective is to accurately identify the author of the sentences in the test set.”
This is a multi-class classification problem with 3 classes:

- EAP: Edgar Allan Poe
- HPL: HP Lovecraft
- MWS: Mary Shelley
Kaggle provides 2 datasets:

- a labeled training set (referred to below as TR0)
- an unlabeled test set (referred to below as TS0)
Goal: for each extract, assign a probability to each of the three potential authors, so as to determine which one most likely wrote it.
The evaluation metric is the multi-class logarithmic loss.
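Concretely, with $N$ extracts and $M = 3$ authors, the metric is:

```math
\text{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log p_{ij}
```

where $y_{ij}$ is 1 if extract $i$ was written by author $j$ (0 otherwise) and $p_{ij}$ is the submitted probability for that pairing. Lower is better; confidently wrong predictions are penalized heavily.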
The project is divided into 3 parts:
The files are:
Since the dataset comes from a Kaggle contest, it was already clean, with no missing values. From the texts, we were able to generate two kinds of features:
This collection of several thousand numerical features (a consequence of the bag-of-words technique) was then scaled using min-max normalization.
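A minimal sketch of this step, assuming a plain CountVectorizer for the bag-of-words features (the vectorizer settings and file name are illustrative, not the project's exact choices):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MinMaxScaler

# Kaggle training file: one row per extract, with 'text' and 'author' columns.
train = pd.read_csv("train.csv")

# Bag-of-words: one count column per token (vocabulary capped here for illustration).
vectorizer = CountVectorizer(max_features=2000)
X_bow = vectorizer.fit_transform(train["text"]).toarray()

# Min-max normalization rescales every feature into [0, 1].
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_bow)
y = train["author"]  # 'EAP', 'HPL' or 'MWS'
```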
We used scikit-learn's Pipeline class to define the data-processing pipeline. Because the text features produce thousands of columns, we first needed to reduce the number of features before applying the machine learning algorithm.
We defined pipelines composed of 3 elements (see the sketch after this list):

- a feature-selection step
- a dimensionality-reduction step (PCA)
- a classifier
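A sketch of one such pipeline, assuming SelectKBest for the selection step (the selectors, reducers, and classifiers we actually compared may differ; `n_components` is illustrative):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ("select", SelectKBest(chi2, k=892)),        # keep a subset of the raw features
    ("reduce", PCA(n_components=100)),           # illustrative number of components
    ("clf", LogisticRegression(max_iter=1000)),  # final classifier
])
```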
We then took 80% of TR0 (called tr1) and used 10-fold cross-validation to compare the performance of the pipelines.
Among all the trained pipelines, the 10 best were selected for a final test: we used tr1 as the training set and the remaining 20% of TR0 (called ts1) as the test set.
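Under the same assumptions, the evaluation protocol looks like this (`X_scaled`, `y`, and `pipeline` come from the sketches above):

```python
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split, cross_val_score

# 80/20 split of TR0 into tr1 and ts1.
X_tr1, X_ts1, y_tr1, y_ts1 = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# 10-fold cross-validation on tr1; scikit-learn maximizes scores,
# so log loss is reported as its negative.
cv_scores = cross_val_score(pipeline, X_tr1, y_tr1, cv=10, scoring="neg_log_loss")
print("CV log loss:", -cv_scores.mean())

# Final test of a candidate pipeline: train on tr1, score on ts1.
pipeline.fit(X_tr1, y_tr1)
print("ts1 log loss:", log_loss(y_ts1, pipeline.predict_proba(X_ts1)))
```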
The best pipeline, which selects half of the features (892), applies PCA, and uses logistic regression, achieved a log loss of 0.61.
This pipeline was then retrained on the whole of TR0 (tr1 + ts1) and used to predict the authors of the TS0 extracts.
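A sketch of this final step, reusing the fitted vectorizer and scaler from above (the test file name is illustrative):

```python
import pandas as pd

# Retrain the selected pipeline on all of TR0 (tr1 + ts1).
pipeline.fit(X_scaled, y)

# Transform the TS0 texts with the transformers fitted on TR0.
test = pd.read_csv("test.csv")
X_test = scaler.transform(vectorizer.transform(test["text"]).toarray())

# One probability per author per extract, as required by the competition.
proba = pipeline.predict_proba(X_test)
```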