
🤖 NLP Lesson 2: Text Classification

Bag of Words · TF-IDF · Logistic Regression · Sentiment Analysis


🤖 What is Text Classification?

classification · supervised learning · labels · target

Text classification is teaching a computer to automatically sort text into categories. Given a tweet, can the computer decide if it is about a real disaster or not? That is classification!

๐ŸŒ Real-world examples
Gmail classifying emails as spam or not spam. YouTube deciding if a comment is toxic. A bank flagging a transaction as fraud. Your phone's autocorrect predicting the next word. All of these are classification problems.

Key vocabulary

Term | Meaning | In this lesson
target | The thing we are trying to predict | Is a tweet disaster-relevant? (1 or 0)
label | The known answer for each example | 'Relevant' or 'Not Relevant'
features | The input data the model uses | Word counts (Bag of Words / TF-IDF)
training set | Data used to teach the model | 80% of tweets
test set | Data used to measure accuracy | 20% of tweets
accuracy | % of correct predictions | ~75% or higher
🤔 Why do we need labels?
Our model learns from examples that already have the right answer. This is called supervised learning. Human volunteers read thousands of tweets and labelled each one; this is expensive but necessary. Platforms like Amazon Mechanical Turk pay workers to do this labelling.

The 4 steps in this lesson

1. Prepare the data: Clean, filter, and map labels to numbers (Relevant=1, Not Relevant=0)
2. Convert text to numbers: Bag of Words → TF-IDF. Computers only understand numbers, not words.
3. Train the model: Feed the numbers into a Logistic Regression model and let it learn
4. Test and predict: Measure accuracy on unseen data and make predictions on new tweets

🎒 Bag of Words: Turning Text into Numbers

bag of words · hash map · vocabulary · word frequency

Computers cannot read words, only numbers. The Bag of Words method converts each tweet into a row of numbers, where each number represents how many times a word appeared in that tweet.

🎒 The analogy
Imagine emptying a tweet into a bag and shaking it so the word order is lost. All you know is which words are in the bag and how many times each appears. That's Bag of Words: order doesn't matter, only frequency does.

Step by step

1. Build a vocabulary: Find all unique words across every tweet. If there are 500 unique words, your vocabulary has size 500.
2. Create a vector for each tweet: A vector is a row of 500 numbers. Each position corresponds to one word in the vocabulary.
3. Fill in the counts: For each word in the tweet, put how many times it appears at its position in the vector.
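The three steps above can be sketched as a tiny toy function. This is only an illustrative stand-in for the lesson's bagofwords() helper; the name bagofwords_demo and the five-word vocabulary are made up here:

```python
# A toy sketch of the Bag of Words steps above (bagofwords_demo and
# this five-word vocabulary are made up for illustration).
def bagofwords_demo(tokens, vocab):
    vector = [0] * len(vocab)          # one slot per vocabulary word
    for word in tokens:
        if word in vocab:              # words outside the vocab are ignored
            vector[vocab.index(word)] += 1
    return vector

vocab = ['fire', 'flood', 'lunch', 'bridge', 'happy']
print(bagofwords_demo(['fire', 'at', 'the', 'bridge'], vocab))
# → [1, 0, 0, 1, 0]
```

This matches the first row of the example table: "at" and "the" are not in the vocabulary, so only "fire" and "bridge" are counted.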

Example

Tweet | fire | flood | lunch | bridge | happy
"fire at the bridge" | 1 | 0 | 0 | 1 | 0
"flood and fire warning" | 1 | 1 | 0 | 0 | 0
"happy lunch today" | 0 | 0 | 1 | 0 | 1

Each row is a tweet turned into a vector of counts. The model learns that rows with high "fire" and "flood" counts are disaster tweets.

# Build a hash map (dictionary of word frequencies)
def map_book(hash_map, tokens):
    for word in tokens:
        if word in hash_map:
            hash_map[word] += 1   # word exists → increment
        else:
            hash_map[word] = 1    # new word → set to 1
    return hash_map

# Keep only the top 500 most frequent words
vocab = frequent_vocab(hash_map, 500)
💡 Why only 500 words?
If we used every word, we might have 35,000+ features. This is too slow to process and many rare words add noise. We keep only the 500 most common words, since they carry the most signal.

๐Ÿ“ TF-IDF โ€” Smarter Word Weighting

TF-IDF · term frequency · inverse document frequency

Bag of Words counts words equally. But some words appear in almost every tweet (like "the", "http") and tell us nothing useful. TF-IDF rewards words that appear often in one tweet but rarely across all tweets.

🤔 The intuition
"fire" appearing in 5 tweets out of 10,000 is very informative: that tweet is probably about a disaster. "the" appearing in 9,000 tweets out of 10,000 tells us nothing. TF-IDF gives "fire" a high score and "the" a low score automatically.

The formula

TF-IDF = TF × IDF

TF (Term Frequency) = how many times the word appears in THIS tweet
IDF (Inverse Document Frequency) = log(total tweets ÷ tweets containing this word)

A word that appears in every tweet has IDF ≈ 0 (useless). A word that appears in only 10 tweets has a high IDF (informative).
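Plugging the intuition above into the formula, with made-up counts (10,000 tweets; "fire" in 5 of them, "the" in 9,000):

```python
import numpy as np

# Made-up counts, purely to illustrate the IDF formula above:
# 10,000 tweets total; "fire" appears in 5 of them, "the" in 9,000.
N = 10_000
idf_fire = np.log(N / 5)     # rare word → large IDF
idf_the = np.log(N / 9000)   # near-universal word → IDF close to 0
print(round(idf_fire, 2))    # → 7.6
print(round(idf_the, 2))     # → 0.11
```

For the same term frequency, "fire" therefore contributes roughly 70 times more weight than "the".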

🎒 Bag of Words | 📏 TF-IDF
Counts word frequency | Weights by importance
Treats all words equally | Rewards rare, specific words
"the" gets same weight as "fire" | "fire" gets much higher weight than "the"

import numpy as np

# Calculate IDF for each word
N = numdocs  # total number of tweets
word_frequency = np.empty(numwords)
for word in range(numwords):
    word_frequency[word] = np.sum(bag_o[:, word] > 0)
idf = np.log(N / word_frequency)

# Multiply Bag of Words × IDF to get TF-IDF
tfidf = np.empty([numdocs, numwords])
for doc in range(numdocs):
    tfidf[doc, :] = bag_o[doc, :] * idf
🎓 In the challenge section, scikit-learn's TfidfTransformer does all of this automatically, with no manual loops needed!

🧠 Machine Learning: Teaching a Computer to Classify

logistic regression · train/test split · accuracy · scikit-learn

Logistic Regression is a machine learning algorithm for classification. Despite the word "regression" in its name, it is a classifier: it learns to predict whether something belongs to one category or another (relevant=1 or not relevant=0).

📚 Analogy
Imagine you studied 1,000 past exam papers and marked each "pass" or "fail". Now a new paper arrives โ€” based on patterns you learned from the 1,000 examples, you predict if it will pass or fail. That's exactly what the model does with tweets.

Train / Test split โ€” why we split the data

Training set (80%): The model sees these tweets and their correct labels. It learns patterns. (~8,600 tweets)

Test set (20%): The model has NEVER seen these. We use them to measure real accuracy. (~2,200 tweets)

โš ๏ธ Why not just test on training data?
A student who memorises all the exam answers scores 100% โ€” but hasn't actually understood the topic. Same problem in ML: testing on training data gives falsely high scores. The test set reveals the true performance.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Step 1: Split data (80% train / 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    tfidf, df['relevance'].values, test_size=0.2, shuffle=True)

# Step 2: Create model instance
logreg = LogisticRegression(solver='lbfgs')

# Step 3: Train the model
logreg.fit(X_train, y_train)

# Step 4: Measure accuracy
score = logreg.score(X_test, y_test)
print(f"Accuracy: {score:.3f}")  # → e.g. 0.762
🎯 What does 0.762 mean?
The model correctly classifies 76.2% of tweets it has never seen before. For a simple first model, that is quite good! The challenge section pushes this even higher using scikit-learn's built-in tools.
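A single accuracy number hides which kinds of mistakes the model makes. Here is a minimal sketch, with made-up labels rather than the lesson's actual predictions, of counting the four outcome types behind an accuracy score:

```python
# Toy labels, purely illustrative (not the lesson's model output)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # correct answers
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model's guesses

pairs = list(zip(y_true, y_pred))
tp = pairs.count((1, 1))   # disaster tweets correctly caught
tn = pairs.count((0, 0))   # non-disaster tweets correctly ignored
fp = pairs.count((0, 1))   # false alarms
fn = pairs.count((1, 0))   # missed disasters
accuracy = (tp + tn) / len(pairs)
print(tp, tn, fp, fn, accuracy)   # → 3 3 1 1 0.75
```

Two models with the same accuracy can have very different false-alarm and miss rates, which matters when missing a real disaster is worse than a false alarm.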

🔄 The Full Classification Pipeline

Everything in this lesson connects into one end-to-end pipeline. A raw tweet goes in; a classification (disaster or not) comes out.

1. Load raw data: pd.read_csv() loads the tweet CSV into a DataFrame
2. Clean data: Remove 'Can't Decide' rows. Keep only 'text' and 'choose_one' columns.
3. Map labels to numbers: 'Relevant'→1, 'Not Relevant'→0 using .map()
4. Tokenize with extract_words(): Split tweets into words, remove stop words, lowercase
5. Build Bag of Words: map_book() builds the frequency dictionary; bagofwords() creates the vector for each tweet.
6. Apply TF-IDF: Multiply Bag of Words by IDF weights to get the final feature matrix
7. Train/test split: train_test_split() randomly puts 80% in training, 20% in testing
8. Train model → Predict → Measure: LogisticRegression.fit() → .predict() → .score()
# The twitter_predictor: full pipeline in one function
def twitter_predictor(tweet):
    word_vector = bagofwords(tweet, vocab)   # text → numbers
    word_tfidf = word_vector * idf           # apply TF-IDF
    prediction = logreg.predict(word_tfidf.reshape(1, -1))
    results = {1: 'Relevant', 0: 'Not Relevant'}
    print(results[int(prediction)])

😊 Sentiment Analysis: The Challenge Section

sentiment analysis · IMDB reviews · pkl files · CountVectorizer · TfidfTransformer

Sentiment analysis classifies text as positive or negative based on the writer's opinion. Instead of "disaster or not", we ask: "does this person like or dislike the movie?"

🎬 The dataset: IMDB movie reviews
The two .pkl files contain thousands of IMDB movie reviews, each labelled 1 (positive) or 0 (negative). A .pkl file is a Python-specific format that stores data exactly as it was saved: columns, types, and all.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Load pkl files (pre-saved DataFrames)
df_raw = pd.read_pickle('df_raw.pkl')            # training reviews
df_raw_test = pd.read_pickle('df_raw_test.pkl')  # test reviews

# scikit-learn does BoW + TF-IDF automatically
vectorizer = CountVectorizer(max_features=5000)
train_bow = vectorizer.fit_transform(df_raw['text'])
test_bow = vectorizer.transform(df_raw_test['text'])

tfidfier = TfidfTransformer()
train_tfidf = tfidfier.fit_transform(train_bow)
test_tfidf = tfidfier.transform(test_bow)
💡 fit_transform vs transform
fit_transform() learns the vocabulary FROM the training data AND converts it. Used only on training data.
transform() uses the already-learned vocabulary to convert NEW data. Used on test data. This is important: the test data must use the same vocabulary as training, not build its own!
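A tiny demonstration of this rule, assuming scikit-learn is installed (the toy sentences are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

train = ["fire at the bridge", "flood and fire warning"]
new = ["earthquake fire today"]   # two words never seen in training

vectorizer = CountVectorizer()
train_bow = vectorizer.fit_transform(train)   # learns the vocab AND converts
new_bow = vectorizer.transform(new)           # reuses the training vocab

print(sorted(vectorizer.vocabulary_))
# → ['and', 'at', 'bridge', 'fire', 'flood', 'the', 'warning']
print(new_bow.toarray())
# "earthquake" and "today" are silently dropped: they are not in the
# training vocabulary, so only "fire" is counted.
```

If transform() instead built a fresh vocabulary from the new data, the columns would no longer line up with the features the model was trained on.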
Manual approach (this lesson) | scikit-learn approach (challenge)
Write map_book(), bagofwords() yourself | CountVectorizer does it automatically
Write the TF-IDF loop yourself | TfidfTransformer does it automatically
More code, full understanding | Less code, faster results
~75% accuracy | ~80%+ accuracy

๐Ÿท๏ธ Labels & the set() Function

labels · set() · unique values · choose_one column

What are labels?

In machine learning, a label is the correct answer for each piece of data. In our tweet dataset, human volunteers read each tweet and labelled it as one of three categories:

✓ Relevant · ✗ Not Relevant · ? Can't Decide

These labels are stored in the choose_one column of the dataframe.

💰 Why labelling is expensive
Someone had to read every single tweet and decide its label. For 10,000 tweets, that is a lot of human effort! Platforms like Amazon Mechanical Turk pay workers to do this labelling task.

The set() function

set() is a Python built-in that removes duplicates and returns only unique values.

# set() removes duplicates: only unique values remain
set(['apple', 'orange', 'apple', 'orange', 'pears'])
# → {'apple', 'orange', 'pears'}  (only 3 unique values!)

# Find unique labels in the dataset
set(df_raw.choose_one.values)
# → {"Can't Decide", 'Not Relevant', 'Relevant'}


Why we filter out "Can't Decide"

We want binary classification, just two categories: Relevant (1) or Not Relevant (0). Rows labelled "Can't Decide" are ambiguous and would confuse the model, so we remove them:

# Keep only rows that are NOT "Can't Decide"
df = df_raw[df_raw.choose_one != "Can't Decide"]
print(len(df_raw), "→", len(df), "rows after filtering")

๐Ÿ—บ๏ธ The map() Function โ€” Converting Labels to Numbers

map() · dictionary · numerical labels

Computers and machine learning models work with numbers, not text strings. We need to convert our text labels ("Relevant", "Not Relevant") into numbers (1, 0).

How map() works

map() replaces every value in a column using a dictionary as a lookup table:

['Relevant', 'Not Relevant', 'Relevant']  →  map()  →  [1, 0, 1]
# Step 1: create a dictionary mapping text → number
relevance = {'Relevant': 1, 'Not Relevant': 0}

# Step 2: apply the mapping to the column
df['relevance'] = df.choose_one.map(relevance)

# Result: new column with 1s and 0s
print(df[['choose_one', 'relevance']].head())


Also: keep only the needed columns

# Keep only the columns we need: text and choose_one
df = df[['text', 'choose_one']]

# Result: from 13 columns → just 3
# (text, choose_one, and the relevance column added by map())
print(df.head())

๐Ÿ” Regular Expressions (Regex) โ€” Cleaning Text

regex · re.sub() · text cleaning · special characters

Before tokenizing tweets, we need to clean them โ€” remove punctuation, special characters, URLs, etc. We use the re (regular expressions) library for this.

🤔 What is a Regular Expression?
A regular expression is a pattern that describes a set of strings. Think of it like a very powerful search-and-replace that can match complex patterns, not just exact words.

The pattern used in this lesson

import re

# re.sub(pattern, replacement, string)
# Pattern r"[^\w]" means: anything that is NOT a word character
# Replace it with " " (a space)
sentence = "Hello, world! #NLP @twitter http://t.co/abc"
cleaned = re.sub(r"[^\w]", " ", sentence)
print(cleaned)
# → "Hello  world   NLP  twitter http   t co abc"
# All punctuation replaced with spaces! (Each non-word character becomes
# one space, so runs of spaces appear; .split() later collapses them.)

Common regex patterns

Pattern | Matches | Example
[^\w] | NOT a word character (removes punctuation) | ! , . @ # → space
\w | Word characters (letters, digits, _) | a-z, A-Z, 0-9, _
\s | Whitespace (space, tab, newline) | spaces and tabs
\d | Digits only | 0, 1, 2 ... 9
http\S+ | URLs starting with http | http://t.co/abc
@\w+ | Twitter @mentions | @username
#\w+ | Hashtags | #NLP
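The URL, mention, and hashtag patterns from the table can be chained before the final [^\w] pass. The ordering here is an illustrative choice: stripping URLs while they are still intact avoids shredding them into fragments like "http t co abc":

```python
import re

# Made-up tweet, chaining the patterns from the table above
tweet = "Flood warning! http://t.co/abc via @weather #disaster"
tweet = re.sub(r"http\S+", " ", tweet)   # remove URLs first, while intact
tweet = re.sub(r"@\w+", " ", tweet)      # remove @mentions
tweet = re.sub(r"#\w+", " ", tweet)      # remove hashtags
tweet = re.sub(r"[^\w]", " ", tweet)     # remove remaining punctuation
print(tweet.split())
# → ['Flood', 'warning', 'via']
```
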


Full extract_words() function with regex

import re

def extract_words(sentence):
    '''Clean and tokenize a sentence'''
    ignore_words = ['a', 'the', 'if', 'and', 'of', 'to', 'is', 'are', 'it', 'how']

    # Step 1: replace all special chars with a space using regex
    words = re.sub(r"[^\w]", " ", sentence).split()

    # Step 2: lowercase everything
    words = [word.lower() for word in words]

    # Step 3: remove stop words
    words_cleaned = [w for w in words if w not in ignore_words]
    return words_cleaned

# Test it
print(extract_words("Good morning! It is a good day."))
# → ['good', 'morning', 'good', 'day']