
🤖 NLP Lesson 2: Text Classification

Bag of Words · TF-IDF · Logistic Regression · Sentiment Analysis


🤖 What is Text Classification?

classification · supervised learning · labels · target

Text classification is teaching a computer to automatically sort text into categories. Given a tweet, can the computer decide if it is about a real disaster or not? That is classification!

๐ŸŒ Real-world examples
Gmail classifying emails as spam or not spam. YouTube deciding if a comment is toxic. A bank flagging a transaction as fraud. Your phone's autocorrect predicting the next word. All of these are classification problems.

Key vocabulary

Term | Meaning | In this lesson
target | The thing we are trying to predict | Is a tweet disaster-relevant? (1 or 0)
label | The known answer for each example | 'Relevant' or 'Not Relevant'
features | The input data the model uses | Word counts (Bag of Words / TF-IDF)
training set | Data used to teach the model | 80% of tweets
test set | Data used to measure accuracy | 20% of tweets
accuracy | % of correct predictions | ~75% or higher
🤔 Why do we need labels?
Our model learns from examples that already have the right answer. This is called supervised learning. Human volunteers read thousands of tweets and labelled each one; this is expensive but necessary. Platforms like Amazon Mechanical Turk pay workers to do this labelling.

The 4 steps in this lesson

1. Prepare the data: Clean, filter, and map labels to numbers (Relevant=1, Not Relevant=0)
2. Convert text to numbers: Bag of Words → TF-IDF. Computers only understand numbers, not words.
3. Train the model: Feed the numbers into a Logistic Regression model and let it learn
4. Test and predict: Measure accuracy on unseen data and make predictions on new tweets

🎒 Bag of Words: Turning Text into Numbers

bag of words · hash map · vocabulary · word frequency

Computers cannot read words, only numbers. The Bag of Words method converts each tweet into a row of numbers, where each number represents how many times a word appeared in that tweet.

🎒 The analogy
Imagine emptying a tweet into a bag and shaking it so the word order is lost. All you know is which words are in the bag and how many times each appears. That's Bag of Words: order doesn't matter, only frequency does.

Step by step

1. Build a vocabulary: Find all unique words across every tweet. If there are 500 unique words, your vocabulary has size 500.
2. Create a vector for each tweet: A vector is a row of 500 numbers. Each position corresponds to one word in the vocabulary.
3. Fill in the counts: For each word in the tweet, put how many times it appears at its position in the vector.
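The three steps above can be sketched as a tiny toy function. This is only an illustrative stand-in for the lesson's bagofwords() helper; the name bagofwords_demo and the five-word vocabulary are made up here:

```python
# A toy sketch of the Bag of Words steps above (bagofwords_demo and
# this five-word vocabulary are made up for illustration).
def bagofwords_demo(tokens, vocab):
    vector = [0] * len(vocab)          # one slot per vocabulary word
    for word in tokens:
        if word in vocab:              # words outside the vocab are ignored
            vector[vocab.index(word)] += 1
    return vector

vocab = ['fire', 'flood', 'lunch', 'bridge', 'happy']
print(bagofwords_demo(['fire', 'at', 'the', 'bridge'], vocab))
# → [1, 0, 0, 1, 0]
```

This matches the first row of the example table: "at" and "the" are not in the vocabulary, so only "fire" and "bridge" are counted.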

Example

Tweet | fire | flood | lunch | bridge | happy
"fire at the bridge" | 1 | 0 | 0 | 1 | 0
"flood and fire warning" | 1 | 1 | 0 | 0 | 0
"happy lunch today" | 0 | 0 | 1 | 0 | 1

Each row is a tweet turned into a vector of counts. The model learns that rows with high "fire" and "flood" counts are disaster tweets.

# Build a hash map (dictionary of word frequencies)
def map_book(hash_map, tokens):
    for word in tokens:
        if word in hash_map:
            hash_map[word] += 1   # word exists → increment
        else:
            hash_map[word] = 1    # new word → set to 1
    return hash_map

# Keep only the top 500 most frequent words
vocab = frequent_vocab(hash_map, 500)
💡 Why only 500 words?
If we used every word, we might have 35,000+ features. This is too slow to process and many rare words add noise. We keep only the 500 most common words, since they carry the most signal.

๐Ÿ“ TF-IDF โ€” Smarter Word Weighting

TF-IDF · term frequency · inverse document frequency

Bag of Words counts words equally. But some words appear in almost every tweet (like "the", "http") and tell us nothing useful. TF-IDF rewards words that appear often in one tweet but rarely across all tweets.

🤔 The intuition
"fire" appearing in 5 tweets out of 10,000 is very informative: that tweet is probably about a disaster. "the" appearing in 9,000 tweets out of 10,000 tells us nothing. TF-IDF gives "fire" a high score and "the" a low score automatically.

The formula

TF-IDF = TF × IDF

TF (Term Frequency) = how many times the word appears in THIS tweet
IDF (Inverse Document Frequency) = log(total tweets ÷ tweets containing this word)

A word that appears in every tweet has IDF ≈ 0 (useless). A word that appears in only 10 tweets has a high IDF (informative).
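Plugging the intuition above into the formula, with made-up counts (10,000 tweets; "fire" in 5 of them, "the" in 9,000):

```python
import numpy as np

# Made-up counts, purely to illustrate the IDF formula above:
# 10,000 tweets total; "fire" appears in 5 of them, "the" in 9,000.
N = 10_000
idf_fire = np.log(N / 5)     # rare word → large IDF
idf_the = np.log(N / 9000)   # near-universal word → IDF close to 0
print(round(idf_fire, 2))    # → 7.6
print(round(idf_the, 2))     # → 0.11
```

For the same term frequency, "fire" therefore contributes roughly 70 times more weight than "the".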

🎒 Bag of Words | 📏 TF-IDF
Counts word frequency | Weights by importance
Treats all words equally | Rewards rare, specific words
"the" gets same weight as "fire" | "fire" gets much higher weight than "the"

import numpy as np

# Calculate IDF for each word
N = numdocs  # total number of tweets
word_frequency = np.empty(numwords)
for word in range(numwords):
    word_frequency[word] = np.sum(bag_o[:, word] > 0)
idf = np.log(N / word_frequency)

# Multiply Bag of Words × IDF to get TF-IDF
tfidf = np.empty([numdocs, numwords])
for doc in range(numdocs):
    tfidf[doc, :] = bag_o[doc, :] * idf
🎓 In the challenge section, scikit-learn's TfidfTransformer does all of this automatically, with no manual loops needed!

🧠 Machine Learning: Teaching a Computer to Classify

logistic regression · train/test split · accuracy · scikit-learn

Logistic Regression is a machine learning algorithm for classification. Despite the word "regression" in its name, it is a classifier: it learns to predict whether something belongs to one category or another (relevant=1 or not relevant=0).

📚 Analogy
Imagine you studied 1,000 past exam papers and marked each "pass" or "fail". Now a new paper arrives โ€” based on patterns you learned from the 1,000 examples, you predict if it will pass or fail. That's exactly what the model does with tweets.

Train / Test split โ€” why we split the data

Training set (80%): The model sees these tweets and their correct labels. It learns patterns. (~8,600 tweets)

Test set (20%): The model has NEVER seen these. We use them to measure real accuracy. (~2,200 tweets)

โš ๏ธ Why not just test on training data?
A student who memorises all the exam answers scores 100% โ€” but hasn't actually understood the topic. Same problem in ML: testing on training data gives falsely high scores. The test set reveals the true performance.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Step 1: Split data (80% train / 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    tfidf, df['relevance'].values, test_size=0.2, shuffle=True)

# Step 2: Create model instance
logreg = LogisticRegression(solver='lbfgs')

# Step 3: Train the model
logreg.fit(X_train, y_train)

# Step 4: Measure accuracy
score = logreg.score(X_test, y_test)
print(f"Accuracy: {score:.3f}")  # → e.g. 0.762
🎯 What does 0.762 mean?
The model correctly classifies 76.2% of tweets it has never seen before. For a simple first model, that is quite good! The challenge section pushes this even higher using scikit-learn's built-in tools.
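A single accuracy number hides which kinds of mistakes the model makes. Here is a minimal sketch, with made-up labels rather than the lesson's actual predictions, of counting the four outcome types behind an accuracy score:

```python
# Toy labels, purely illustrative (not the lesson's model output)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # correct answers
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model's guesses

pairs = list(zip(y_true, y_pred))
tp = pairs.count((1, 1))   # disaster tweets correctly caught
tn = pairs.count((0, 0))   # non-disaster tweets correctly ignored
fp = pairs.count((0, 1))   # false alarms
fn = pairs.count((1, 0))   # missed disasters
accuracy = (tp + tn) / len(pairs)
print(tp, tn, fp, fn, accuracy)   # → 3 3 1 1 0.75
```

Two models with the same accuracy can have very different false-alarm and miss rates, which matters when missing a real disaster is worse than a false alarm.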

🔄 The Full Classification Pipeline

Everything in this lesson connects into one end-to-end pipeline. A raw tweet goes in; a classification (disaster or not) comes out.

1. Load raw data: pd.read_csv() loads the tweet CSV into a DataFrame
2. Clean data: Remove 'Can't Decide' rows. Keep only 'text' and 'choose_one' columns.
3. Map labels to numbers: 'Relevant'→1, 'Not Relevant'→0 using .map()
4. Tokenize with extract_words(): Split tweets into words, remove stop words, lowercase
5. Build Bag of Words: map_book() builds the frequency dictionary; bagofwords() creates the vector for each tweet.
6. Apply TF-IDF: Multiply Bag of Words by IDF weights to get the final feature matrix
7. Train/test split: train_test_split() randomly puts 80% in training, 20% in testing
8. Train model → Predict → Measure: LogisticRegression.fit() → .predict() → .score()
# The twitter_predictor: full pipeline in one function
def twitter_predictor(tweet):
    word_vector = bagofwords(tweet, vocab)   # text → numbers
    word_tfidf = word_vector * idf           # apply TF-IDF
    prediction = logreg.predict(word_tfidf.reshape(1, -1))
    results = {1: 'Relevant', 0: 'Not Relevant'}
    print(results[int(prediction)])

😊 Sentiment Analysis: The Challenge Section

sentiment analysis · IMDB reviews · pkl files · CountVectorizer · TfidfTransformer

Sentiment analysis classifies text as positive or negative based on the writer's opinion. Instead of "disaster or not", we ask: "does this person like or dislike the movie?"

🎬 The dataset: IMDB movie reviews
The two .pkl files contain thousands of IMDB movie reviews, each labelled 1 (positive) or 0 (negative). A .pkl file is a Python-specific format that stores data exactly as it was saved: columns, types, and all.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Load pkl files (pre-saved DataFrames)
df_raw = pd.read_pickle('df_raw.pkl')            # training reviews
df_raw_test = pd.read_pickle('df_raw_test.pkl')  # test reviews

# scikit-learn does BoW + TF-IDF automatically
vectorizer = CountVectorizer(max_features=5000)
train_bow = vectorizer.fit_transform(df_raw['text'])
test_bow = vectorizer.transform(df_raw_test['text'])

tfidfier = TfidfTransformer()
train_tfidf = tfidfier.fit_transform(train_bow)
test_tfidf = tfidfier.transform(test_bow)
💡 fit_transform vs transform
fit_transform() learns the vocabulary FROM the training data AND converts it. Used only on training data.
transform() uses the already-learned vocabulary to convert NEW data. Used on test data. This is important: the test data must use the same vocabulary as training, not build its own!
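A tiny demonstration of this rule, assuming scikit-learn is installed (the toy sentences are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

train = ["fire at the bridge", "flood and fire warning"]
new = ["earthquake fire today"]   # two words never seen in training

vectorizer = CountVectorizer()
train_bow = vectorizer.fit_transform(train)   # learns the vocab AND converts
new_bow = vectorizer.transform(new)           # reuses the training vocab

print(sorted(vectorizer.vocabulary_))
# → ['and', 'at', 'bridge', 'fire', 'flood', 'the', 'warning']
print(new_bow.toarray())
# "earthquake" and "today" are silently dropped: they are not in the
# training vocabulary, so only "fire" is counted.
```

If transform() instead built a fresh vocabulary from the new data, the columns would no longer line up with the features the model was trained on.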
Manual approach (this lesson) | scikit-learn approach (challenge)
Write map_book(), bagofwords() yourself | CountVectorizer does it automatically
Write the TF-IDF loop yourself | TfidfTransformer does it automatically
More code, full understanding | Less code, faster results
~75% accuracy | ~80%+ accuracy

๐Ÿท๏ธ Labels & the set() Function

labels · set() · unique values · choose_one column

What are labels?

In machine learning, a label is the correct answer for each piece of data. In our tweet dataset, human volunteers read each tweet and labelled it as one of three categories:

✓ Relevant · ✗ Not Relevant · ? Can't Decide

These labels are stored in the choose_one column of the dataframe.

💰 Why labelling is expensive
Someone had to read every single tweet and decide its label. For 10,000 tweets, that is a lot of human effort! Platforms like Amazon Mechanical Turk pay workers to do this labelling task.

The set() function

set() is a Python built-in that removes duplicates and returns only unique values.

# set() removes duplicates: only unique values remain
set(['apple', 'orange', 'apple', 'orange', 'pears'])
# → {'apple', 'orange', 'pears'}  (only 3 unique values!)

# Find unique labels in the dataset
set(df_raw.choose_one.values)
# → {"Can't Decide", 'Not Relevant', 'Relevant'}


Why we filter out "Can't Decide"

We want binary classification, just two categories: Relevant (1) or Not Relevant (0). Rows labelled "Can't Decide" are ambiguous and would confuse the model, so we remove them:

# Keep only rows that are NOT "Can't Decide"
df = df_raw[df_raw.choose_one != "Can't Decide"]
print(len(df_raw), "→", len(df), "rows after filtering")

๐Ÿ—บ๏ธ The map() Function โ€” Converting Labels to Numbers

map() · dictionary · numerical labels

Computers and machine learning models work with numbers, not text strings. We need to convert our text labels ("Relevant", "Not Relevant") into numbers (1, 0).

How map() works

map() replaces every value in a column using a dictionary as a lookup table:

['Relevant', 'Not Relevant', 'Relevant']  →  map()  →  [1, 0, 1]
# Step 1: create a dictionary mapping text → number
relevance = {'Relevant': 1, 'Not Relevant': 0}

# Step 2: apply the mapping to the column
df['relevance'] = df.choose_one.map(relevance)

# Result: new column with 1s and 0s
print(df[['choose_one', 'relevance']].head())


Also: keep only the needed columns

# Keep only the columns we need: text and choose_one
df = df[['text', 'choose_one']]

# Result: from 13 columns → just 3
# (text, choose_one, and the relevance column added by map())
print(df.head())

๐Ÿ” Regular Expressions (Regex) โ€” Cleaning Text

regex · re.sub() · text cleaning · special characters

Before tokenizing tweets, we need to clean them โ€” remove punctuation, special characters, URLs, etc. We use the re (regular expressions) library for this.

🤔 What is a Regular Expression?
A regular expression is a pattern that describes a set of strings. Think of it like a very powerful search-and-replace that can match complex patterns, not just exact words.

The pattern used in this lesson

import re

# re.sub(pattern, replacement, string)
# Pattern r"[^\w]" means: anything that is NOT a word character
# Replace it with " " (a space)
sentence = "Hello, world! #NLP @twitter http://t.co/abc"
cleaned = re.sub(r"[^\w]", " ", sentence)
print(cleaned)
# → "Hello  world   NLP  twitter http   t co abc"
# All punctuation replaced with spaces! (Each non-word character becomes
# one space, so runs of spaces appear; .split() later collapses them.)

Common regex patterns

Pattern | Matches | Example
[^\w] | NOT a word character (removes punctuation) | ! , . @ # → space
\w | Word characters (letters, digits, _) | a-z, A-Z, 0-9, _
\s | Whitespace (space, tab, newline) | spaces and tabs
\d | Digits only | 0, 1, 2 ... 9
http\S+ | URLs starting with http | http://t.co/abc
@\w+ | Twitter @mentions | @username
#\w+ | Hashtags | #NLP
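The URL, mention, and hashtag patterns from the table can be chained before the final [^\w] pass. The ordering here is an illustrative choice: stripping URLs while they are still intact avoids shredding them into fragments like "http t co abc":

```python
import re

# Made-up tweet, chaining the patterns from the table above
tweet = "Flood warning! http://t.co/abc via @weather #disaster"
tweet = re.sub(r"http\S+", " ", tweet)   # remove URLs first, while intact
tweet = re.sub(r"@\w+", " ", tweet)      # remove @mentions
tweet = re.sub(r"#\w+", " ", tweet)      # remove hashtags
tweet = re.sub(r"[^\w]", " ", tweet)     # remove remaining punctuation
print(tweet.split())
# → ['Flood', 'warning', 'via']
```
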


Full extract_words() function with regex

import re

def extract_words(sentence):
    '''Clean and tokenize a sentence'''
    ignore_words = ['a', 'the', 'if', 'and', 'of', 'to', 'is', 'are', 'it', 'how']

    # Step 1: replace all special chars with a space using regex
    words = re.sub(r"[^\w]", " ", sentence).split()

    # Step 2: lowercase everything
    words = [word.lower() for word in words]

    # Step 3: remove stop words
    words_cleaned = [w for w in words if w not in ignore_words]
    return words_cleaned

# Test it
print(extract_words("Good morning! It is a good day."))
# → ['good', 'morning', 'good', 'day']