
STACK OVERFLOW TAG PREDICTOR
CONTENTS
● REAL BUSINESS PROBLEM
● BUSINESS OBJECTIVES & CONSTRAINTS
● DATA OVERVIEW
● TYPE OF MACHINE LEARNING PROBLEM
● PERFORMANCE METRICS
● ANALYSIS OF TAGS
● DATA PREPROCESSING
● FEATURIZATION
● CLASSIFIERS TO BE USED
REAL BUSINESS PROBLEM
Each question in the data set has three parts: Title, Description and Tags.

We want to suggest the tags that match the subject of a question automatically, using only the text of its title and description.
BUSINESS OBJECTIVES & CONSTRAINTS
i. Predict as many tags as possible with high precision and recall.

ii. Incorrect tags could hurt the user experience on Stack Overflow.

iii. No strict latency constraints.


DATA OVERVIEW
Train: 6.75 GB

Test: 2 GB

Data set contains 6,034,195 rows.

The columns include:

Id: Unique identifier for each question

Title: The title of the question

Body: The body of the question

Tags: The tags associated with the question, in a space-separated format
TYPE OF MACHINE LEARNING PROBLEM
Multi-class classification:

Each label yi takes exactly one value out of more than two possible classes, say 0, 1, 2, ..., 9.

A sample cannot belong to two classes at the same time.

Multi-label classification:

Each sample xi is assigned a set of target labels, so yi is a set of classes.

A sample can belong to one or more classes at once. Tag prediction is a multi-label problem, because a question can carry several tags.
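To make the multi-label setup concrete, here is a minimal sketch of how a space-separated Tags field could be turned into a 0/1 label matrix with one column per tag. The helper names and the three sample tag strings are hypothetical and only serve as illustration.

```python
def parse_tags(tag_string):
    """Split a space-separated Tags field into a list of tags."""
    return tag_string.split()


def build_vocabulary(tag_lists):
    """Map every distinct tag to a column index, in sorted order."""
    all_tags = sorted({tag for tags in tag_lists for tag in tags})
    return {tag: index for index, tag in enumerate(all_tags)}


def to_multi_hot(tags, vocabulary):
    """Return a 0/1 vector with a 1 in the column of every tag present."""
    row = [0] * len(vocabulary)
    for tag in tags:
        row[vocabulary[tag]] = 1
    return row


# Hypothetical Tags values, for illustration only.
raw_tags = ["python pandas", "c++ pointers", "python"]
tag_lists = [parse_tags(s) for s in raw_tags]
vocabulary = build_vocabulary(tag_lists)
label_matrix = [to_multi_hot(tags, vocabulary) for tags in tag_lists]
```

Each row of `label_matrix` may contain several 1s, which is exactly what separates multi-label from multi-class classification.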


PERFORMANCE METRICS
As part of the business requirement, we want high precision and recall for every predicted tag.

The F1 score fits here, because it is high only when both precision and recall are high. In the multi-label setting we use:

i. Micro-averaged F1 score

ii. Macro-averaged F1 score

iii. Hamming loss
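The three metrics can be sketched in plain Python as follows. This is an illustrative implementation under the usual definitions (micro F1 pools the counts over all tags, macro F1 averages the per-tag F1 scores, Hamming loss is the fraction of wrong label cells); the function names and the tiny label matrices are assumptions, not part of the original work.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); taken as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tag_counts(y_true, y_pred):
    """Return a (TP, FP, FN) triple for each tag column of two 0/1 matrices."""
    counts = []
    for j in range(len(y_true[0])):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        counts.append((tp, fp, fn))
    return counts


def micro_f1(y_true, y_pred):
    """Pool TP, FP and FN over all tags, then compute a single F1."""
    counts = tag_counts(y_true, y_pred)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1_from_counts(tp, fp, fn)


def macro_f1(y_true, y_pred):
    """Compute F1 per tag, then take the unweighted mean."""
    scores = [f1_from_counts(*c) for c in tag_counts(y_true, y_pred)]
    return sum(scores) / len(scores)


def hamming_loss(y_true, y_pred):
    """Fraction of label cells where prediction and truth disagree."""
    wrong = sum(t != p for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp))
    return wrong / (len(y_true) * len(y_true[0]))


# Hypothetical labels: 3 questions, 3 tags.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]

micro = micro_f1(y_true, y_pred)      # 0.75
macro = macro_f1(y_true, y_pred)      # 5/9, about 0.556
hamming = hamming_loss(y_true, y_pred)  # 2/9, about 0.222
```

Micro averaging is dominated by frequent tags, while macro averaging weights every tag equally, so rare tags that are never predicted pull the macro score down sharply.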


ANALYSIS OF TAGS
Tags are our class labels. After removing all duplicate rows, we are left with 4.2 million data points and 42k unique tags.
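A minimal sketch of the deduplication and tag counting could look like this. It assumes that a duplicate is a row whose Title, Body and Tags all repeat an earlier row; the slides do not state the exact key, and the sample rows are hypothetical.

```python
from collections import Counter


def drop_duplicates(rows):
    """Keep the first occurrence of each (Title, Body, Tags) triple."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["Title"], row["Body"], row["Tags"])  # assumed duplicate key
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def count_tags(rows):
    """Count how many questions carry each tag."""
    counter = Counter()
    for row in rows:
        counter.update(row["Tags"].split())
    return counter


# Hypothetical rows, for illustration only.
rows = [
    {"Title": "Merge two frames", "Body": "...", "Tags": "python pandas"},
    {"Title": "Merge two frames", "Body": "...", "Tags": "python pandas"},
    {"Title": "Pointer arithmetic", "Body": "...", "Tags": "c pointers"},
    {"Title": "List comprehension", "Body": "...", "Tags": "python"},
]
unique_rows = drop_duplicates(rows)
tag_frequency = count_tags(unique_rows)
```

`len(tag_frequency)` gives the number of unique tags, and `tag_frequency.most_common(n)` lists the most frequent ones.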
OBSERVATIONS:
DATA PREPROCESSING
Preprocessing steps:

i. Sampled 1M data points because of compute and memory limitations

ii. Separated code snippets from the Body

iii. Removed special characters from the question title and description (but not from code)

iv. Removed stop words (except ‘C’)

v. Removed HTML tags using regular expressions

vi. Converted all characters to lowercase

vii. Used SnowballStemmer to stem the words
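The text-cleaning steps above can be sketched with the standard library as below. Several details are assumptions: the stop-word list is a deliberately tiny hypothetical one, code is assumed to sit inside `<code>` elements, the steps run in an order that keeps HTML removal before special-character removal, and the "except C" rule is read as "drop stop words and single-character tokens, but keep the token c". Stemming is left as a pluggable function; NLTK's `SnowballStemmer("english").stem` could be passed in where NLTK is installed.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration.
STOP_WORDS = {"a", "an", "the", "is", "in", "to", "how", "do", "of"}

CODE_RE = re.compile(r"<code>(.*?)</code>", flags=re.DOTALL)
HTML_TAG_RE = re.compile(r"<[^>]+>")
SPECIAL_RE = re.compile(r"[^a-z0-9]+")


def split_code(body):
    """Return (body text without code, list of code snippets)."""
    snippets = CODE_RE.findall(body)
    text = CODE_RE.sub(" ", body)
    return text, snippets


def clean_text(text, stem=lambda word: word):
    """Strip HTML tags, lowercase, drop special characters and stop words, then stem."""
    text = HTML_TAG_RE.sub(" ", text)
    text = text.lower()
    text = SPECIAL_RE.sub(" ", text)
    tokens = [
        t for t in text.split()
        if t == "c" or (t not in STOP_WORDS and len(t) > 1)  # keep 'c' on purpose
    ]
    return " ".join(stem(t) for t in tokens)


# Hypothetical question body.
body = '<p>How do I read a file in C?</p><pre><code>fopen("a.txt", "r");</code></pre>'
text, snippets = split_code(body)
cleaned = clean_text(text)
```

Keeping `c` matters because it names a programming language, so dropping it would erase the main signal for the corresponding tag.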


FEATURIZATION
Term Frequency Inverse Document Frequency [TFIDF]:

Bag Of Words:
CLASSIFIERS TO BE USED
A One-vs-Rest classifier can wrap any base model.

Preferred:

Logistic Regression

Not preferred:

Support Vector Machine (Linear SVM)

Random Forest

GBDT
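As an illustration of the preferred setup, the sketch below trains a One-vs-Rest Logistic Regression on TF-IDF features with scikit-learn. It assumes scikit-learn is installed; the toy questions, the tags and all hyperparameters are hypothetical and are not taken from the original experiments.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical preprocessed questions and their tag sets.
texts = [
    "read csv file pandas dataframe",
    "merg two pandas dataframe",
    "pointer arithmet array c",
    "malloc free memori c",
    "python list comprehens",
    "python read file line",
]
tags = [
    ["python", "pandas"],
    ["python", "pandas"],
    ["c", "pointers"],
    ["c"],
    ["python"],
    ["python"],
]

# Turn the tag sets into a 0/1 matrix with one column per tag.
binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(tags)

# The token pattern keeps single-character tokens such as "c",
# which the default TfidfVectorizer pattern would drop.
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(texts)

# One binary Logistic Regression is fitted per tag.
classifier = OneVsRestClassifier(LogisticRegression(max_iter=1000))
classifier.fit(X, y)

new_question = vectorizer.transform(["c pointer memori"])
predicted = classifier.predict(new_question)
predicted_tags = binarizer.inverse_transform(predicted)
```

Swapping `TfidfVectorizer` for `CountVectorizer` gives the bag-of-words variant, and any other scikit-learn estimator can replace `LogisticRegression` inside `OneVsRestClassifier`.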
