Bag of Words · TF-IDF · Logistic Regression · Sentiment Analysis
Text classification is teaching a computer to automatically sort text into categories. Given a tweet, can the computer decide if it is about a real disaster or not? That is classification!
| Term | Meaning | In this lesson |
|---|---|---|
| target | The thing we are trying to predict | Is a tweet disaster-relevant? (1 or 0) |
| label | The known answer for each example | 'Relevant' or 'Not Relevant' |
| features | The input data the model uses | Word counts (Bag of Words / TF-IDF) |
| training set | Data used to teach the model | 80% of tweets |
| test set | Data used to measure accuracy | 20% of tweets |
| accuracy | % of correct predictions | ~75% or higher |
Computers cannot read words, only numbers. The Bag of Words method converts each tweet into a row of numbers, where each number represents how many times a word appeared in that tweet.
| Tweet | fire | flood | lunch | bridge | happy |
|---|---|---|---|---|---|
| "fire at the bridge" | 1 | 0 | 0 | 1 | 0 |
| "flood and fire warning" | 1 | 1 | 0 | 0 | 0 |
| "happy lunch today" | 0 | 0 | 1 | 0 | 1 |
Each row is a tweet turned into a vector of counts. The model learns that rows with high "fire" and "flood" counts are disaster tweets.
Bag of Words counts words equally. But some words appear in almost every tweet (like "the", "http") and tell us nothing useful. TF-IDF rewards words that appear often in one tweet but rarely across all tweets.
| Bag of Words | TF-IDF |
|---|---|
| Counts word frequency | Weights by importance |
| Treats all words equally | Rewards rare, specific words |
| "the" gets the same weight as "fire" | "fire" gets a much higher weight than "the" |
TfidfTransformer does all of this automatically, with no manual loops needed!

Logistic Regression is a machine learning algorithm for classification. Despite its name, it classifies rather than regresses: it learns to predict whether something belongs to one category or another (relevant = 1 or not relevant = 0).
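A small sketch (using toy tweets invented here, not the lesson's dataset) of how these two pieces fit together: TfidfTransformer pushes the weight of "the" down and the weight of "fire" up, and Logistic Regression then learns from those weights.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

tweets = [
    "the fire spread fast",
    "the flood rose overnight",
    "the lunch was great",
    "the party was fun",
]
labels = [1, 1, 0, 0]  # 1 = disaster-relevant, 0 = not

vec = CountVectorizer()
counts = vec.fit_transform(tweets)       # Bag of Words counts
tfidf = TfidfTransformer()
X = tfidf.fit_transform(counts)          # counts reweighted by importance

# "the" appears in every tweet, so its TF-IDF weight is low;
# "fire" appears in only one tweet, so its weight is high.
row0 = X.toarray()[0]
print(row0[vec.vocabulary_["fire"]] > row0[vec.vocabulary_["the"]])  # True

# Logistic Regression learns to separate the two classes.
clf = LogisticRegression().fit(X, labels)
new = tfidf.transform(vec.transform(["flood and fire warning"]))
print(clf.predict(new))  # [1]
```

Note that the new tweet is converted with `transform()`, not `fit_transform()`, so it is encoded with the vocabulary learned from the training tweets.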
| Split | Size | What it's for |
|---|---|---|
| Training set | ~8,600 tweets | The model sees these tweets and their correct labels, and learns patterns from them |
| Test set | ~2,200 tweets | The model has NEVER seen these; we use them to measure real accuracy |
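The 80/20 split can be sketched with scikit-learn's train_test_split (shown here with placeholder data rather than the lesson's tweets):

```python
from sklearn.model_selection import train_test_split

tweets = [f"tweet {i}" for i in range(100)]  # placeholder documents
labels = [i % 2 for i in range(100)]         # placeholder 0/1 labels

# Hold out 20% for testing, as in this lesson; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    tweets, labels, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 80 20
```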
Everything in this lesson connects into one end-to-end pipeline: a raw tweet goes in, and a classification (disaster or not) comes out.
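That end-to-end flow can be sketched with scikit-learn's Pipeline, which chains the three steps together (the tweets and labels below are toy examples, not the lesson's data):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

train_tweets = [
    "fire at the bridge",
    "flood and fire warning",
    "massive flood downtown",
    "happy lunch today",
    "great movie last night",
    "lovely sunny afternoon",
]
train_labels = [1, 1, 1, 0, 0, 0]  # 1 = disaster, 0 = not

pipeline = Pipeline([
    ("counts", CountVectorizer()),   # raw text -> word counts
    ("tfidf", TfidfTransformer()),   # counts -> TF-IDF weights
    ("clf", LogisticRegression()),   # weights -> 0/1 prediction
])
pipeline.fit(train_tweets, train_labels)

# A raw tweet goes in; a classification comes out.
print(pipeline.predict(["forest fire near the bridge"]))  # [1]
```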
Sentiment analysis classifies text as positive or negative based on the writer's opinion. Instead of "disaster or not", we ask: "does this person like or dislike the movie?"
The .pkl files contain thousands of IMDB movie reviews, each labelled 1 (positive) or 0 (negative). A .pkl file is a Python-specific (pickle) format that stores data exactly as it was saved: columns, types, and all.
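A minimal round-trip sketch of the pickle format with pandas (the DataFrame and file name here are made up; the lesson's actual files differ):

```python
import os
import tempfile

import pandas as pd

# A tiny stand-in for the lesson's review data.
df = pd.DataFrame({
    "review": ["Loved it, wonderful film!", "Terrible. Walked out."],
    "label": [1, 0],  # 1 = positive, 0 = negative
})

path = os.path.join(tempfile.gettempdir(), "reviews_demo.pkl")
df.to_pickle(path)                 # pickle stores columns, dtypes, and index as-is
restored = pd.read_pickle(path)

print(restored.equals(df))  # True
```

Unlike a CSV round trip, nothing needs to be re-parsed: the restored DataFrame is identical to the one that was saved.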
fit_transform() learns the vocabulary FROM the training data AND converts it. It is used only on the training data.

transform() uses the already-learned vocabulary to convert NEW data. It is used on the test data. This is important: the test data must be encoded with the same vocabulary as the training data, not build its own!

| Manual approach (this lesson) | scikit-learn approach (challenge) |
|---|---|
| Write map_book(), bagofwords() yourself | CountVectorizer does it automatically |
| Write TF-IDF loop yourself | TfidfTransformer does it automatically |
| More code, full understanding | Less code, faster results |
| ~75% accuracy | ~80%+ accuracy |
In machine learning, a label is the correct answer for each piece of data. In our tweet dataset, human volunteers read each tweet and labelled it as one of three categories: "Relevant", "Not Relevant", or "Can't Decide".
These labels are stored in the choose_one column of the dataframe.
set() is a Python built-in that removes duplicates and returns only unique values.
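A quick illustration of set() on a list of labels like ours:

```python
labels = ["Relevant", "Not Relevant", "Relevant", "Can't Decide", "Relevant"]

# set() drops duplicates; the order of the result is arbitrary.
unique_labels = set(labels)
print(unique_labels)
print(len(unique_labels))  # 3
```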
We want binary classification โ just two categories: Relevant (1) or Not Relevant (0). Rows labelled "Can't Decide" are ambiguous and would confuse the model, so we remove them:
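The filtering step can be sketched with pandas boolean indexing (the three-row DataFrame here is a toy stand-in for the real dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["fire downtown", "nice day", "hmm not sure"],
    "choose_one": ["Relevant", "Not Relevant", "Can't Decide"],
})

# Keep only rows with an unambiguous label.
df = df[df["choose_one"] != "Can't Decide"]
print(len(df))  # 2
```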
Computers and machine learning models work with numbers, not text strings. We need to convert our text labels ("Relevant", "Not Relevant") into numbers (1, 0).
map() replaces every value in a column using a dictionary as a lookup table:
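For example, mapping our two text labels to 1 and 0:

```python
import pandas as pd

df = pd.DataFrame({"choose_one": ["Relevant", "Not Relevant", "Relevant"]})

# Dictionary as a lookup table: text label -> number.
df["target"] = df["choose_one"].map({"Relevant": 1, "Not Relevant": 0})
print(df["target"].tolist())  # [1, 0, 1]
```

Any value missing from the dictionary would become NaN, which is why the ambiguous "Can't Decide" rows are removed beforehand.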
Before tokenizing tweets, we need to clean them โ remove punctuation, special characters, URLs, etc. We use the re (regular expressions) library for this.
| Pattern | Matches | Example |
|---|---|---|
| [^\w] | NOT a word character (removes punctuation) | ! , . @ # → space |
| \w | Word characters (letters, digits, _) | a-z, A-Z, 0-9, _ |
| \s | Whitespace (space, tab, newline) | spaces and tabs |
| \d | Digits only | 0, 1, 2 ... 9 |
| http\S+ | URLs starting with http | http://t.co/abc |
| @\w+ | Twitter @mentions | @username |
| #\w+ | Hashtags | #NLP |
For example, re.sub(r"[^\w]", " ", text) replaces every non-word character in a tweet with a space:
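Combining the patterns from the table above into one cleaning sketch (order matters: URLs and @mentions are stripped before punctuation, or their word characters would survive):

```python
import re

tweet = "Fire at the bridge!! http://t.co/abc #disaster @user"

cleaned = re.sub(r"http\S+", " ", tweet)   # remove URLs
cleaned = re.sub(r"@\w+", " ", cleaned)    # remove @mentions
cleaned = re.sub(r"[^\w]", " ", cleaned)   # non-word characters -> spaces
cleaned = " ".join(cleaned.split())        # collapse repeated spaces

print(cleaned)  # Fire at the bridge disaster
```

Note that the hashtag's `#` is removed but the word "disaster" survives, which is usually what we want, since hashtag words often carry signal.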
Well done!