NLP Lesson 1 — Coding Exercises

🌐 What is Natural Language Processing (NLP)?

NLP Artificial Intelligence Text Processing

Natural Language Processing (NLP) is a branch of Artificial Intelligence that teaches computers to understand, read, and make sense of human language — the kind of language we use every day when we write text messages, tweets, or emails.

💬 Real-world example
When you ask Siri "What is the weather today?", Siri uses NLP to understand your words and find the right answer. Gmail's spam filter uses NLP to decide if an email is junk. Google Translate uses NLP to convert one language to another.

Why is this hard for computers?

Humans understand language naturally — we grew up learning it. Computers only understand numbers (0s and 1s). NLP is the bridge that converts human words into numbers that a computer can work with.

🤔 Think about it
The sentence "I saw the man with the telescope" could mean: (A) I used a telescope to see a man, or (B) I saw a man who had a telescope. Humans understand from context. Computers find this very difficult!

What will we do in this lesson?

Collect text dataUse web scraping to gather text from websites automatically

Load tweet dataLoad a real disaster tweet dataset using pandas

Clean and process the textTokenize, stem, and remove stop words from the tweets

Analyse the dataMeasure vocabulary size and tweet length distributions

🕷 Web Scraping — Collecting Data from Websites

requests BeautifulSoup HTML GET request

The internet has enormous amounts of text data. Web scraping is the technique of writing code to automatically visit websites and collect their text — just like a robot reading a page for you.

Step 1 — How a website works

📡 What happens when you visit a website?
Your browser sends a GET request to the website's server — it is like knocking on a door and saying "please give me your content". The server replies with an HTML file (a text file full of special tags). Your browser reads the HTML and displays it as the pretty page you see.

HTML uses tags — words inside angle brackets — to organise content on a page. Every tag has an opening and a closing version:

# Common HTML tags you will encounter: <h1>This is the biggest heading</h1> <h2>This is a medium heading</h2> <h3>This is a smaller heading</h3> <p>This is a paragraph of normal text.</p> <a href="http://...">This is a clickable link</a>

Step 2 — Response codes

✅ Response [200]

Success! The website found your page and sent back the HTML content.

r = requests.get(url)
print(r) → <Response [200]>

❌ Response [404]

Not found. The URL you typed does not exist — like a wrong address.

r = requests.get(bad_url)
print(r) → <Response [404]>

Step 3 — The two packages we use

import requests # Package 1: talks to websites import bs4 # Package 2: reads and understands HTML # Step A: visit the website and get its HTML # ✅ Fix: add User-Agent header — without this Wikipedia returns 403 headers_req = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0)'} r = requests.get('https://en.wikipedia.org/wiki/Jupiter', headers=headers_req) print(r) # → <Response [200]> means success! # Step B: ask BeautifulSoup to read and understand the HTML soup = bs4.BeautifulSoup(r.text, 'html5lib') # Step C: find all <h3> heading tags headers = [] for tag in soup.find_all("h3"): headers.append(tag.text) print(headers[:3]) # first 3 headings

Step 4 — Getting paragraph text under a heading

Once we find a heading, we want the paragraph text that follows it. We use find_next() to get the next tag after our heading:

# Find one paragraph after the first heading deet = soup.find('h3', string=lambda t: t and headers[0].strip() in t) para = deet.find_next('p') print(para.get_text()) # To get ALL paragraphs under a heading: for para in deet.find_next(): if para.name == "h2" or para.name == "h3": break # stop when we reach the next heading elif para.name == "p": print(para.get_text())

✅ Updated — fixes applied for bs4 v4.0+

⚠️ Additional fix: text= → string= and NoneType guard added

🔧 Two fixes for your Jupyter notebook:
Fix 1 — 403 error: Always add a User-Agent header: headers={'User-Agent':'Mozilla/5.0'} to requests.get()
Fix 2 — DeprecationWarning: Use find_all() instead of findAll() — findAll() is deprecated since bs4 v4.0

🎯 Complete working example — run as ONE cell:

headers_req = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0)'}
r = requests.get('https://en.wikipedia.org/wiki/Jupiter', headers=headers_req)
soup = bs4.BeautifulSoup(r.text, 'html5lib')

headers = [tag.text.strip() for tag in soup.find_all("h3")]
print(f"Found {len(headers)} headings")

all_para = ""
for header in headers:
    deet = soup.find('h3', string=lambda t: t and header.strip() in t)
    if deet:
        for para in deet.find_next():
            if para.name in ["h2", "h3"]: break
            elif para.name == "p": all_para += para.get_text()

print(all_para[:300])

🎯 Remember requests talks to the website. BeautifulSoup reads what came back. Together they let you collect text from any webpage automatically.

🐼 Pandas — Working with Data Tables

pandas DataFrame .head() .sample() len()

pandas is a Python package for working with data organised in rows and columns — like a spreadsheet. In this lesson we use it to load and explore our tweet dataset.

📊 What is a DataFrame?
A DataFrame is pandas' name for a table. It has columns (categories of information) and rows (individual records). Our tweet dataset has one row per tweet, with columns like 'text', 'choose_one', and 'keyword'.

Loading and peeking at data

import pandas as pd # Load the CSV file into a DataFrame df_raw = pd.read_csv('socialmedia disaster tweets DFE.csv', encoding='ISO-8859-1') # Preview the first 5 rows (like peeking at a spreadsheet) df_raw.head(5) # View 5 random rows (good for checking variety) df_raw.sample(5) # Count total rows (total tweets) print(len(df_raw)) # → e.g. 10876

Selecting a single column

Use square brackets [ ] with the column name in quotes to select just one column:

# Select just the tweet text column df_raw['text'] # → all tweet strings df_raw['choose_one'] # → 'Relevant' or 'Not Relevant' df_raw['keyword'] # → disaster keyword # Make a copy so we don't change the original df_text = df_raw['text'].copy() # Access a specific row by position (0 = first row) df_text.iloc[0] # → first tweet string df_text.iloc[100] # → 101st tweet string

Filtering rows

You can filter a DataFrame to keep only rows that match a condition — like a search filter:

# Keep only tweets labelled 'Relevant' df_relevant = df_raw[df_raw['choose_one'] == 'Relevant'] print(len(df_relevant), "relevant tweets") # Keep only tweets labelled 'Not Relevant' df_irrelevant = df_raw[df_raw['choose_one'] == 'Not Relevant']

🎯 Key pandas methods in this lesson
.head(n) — show first n rows | .sample(n) — show n random rows | .copy() — duplicate without changing original | .iloc[n] — get row at position n

✂️ Tokenization — Splitting Text into Words

tokenization token corpus vocabulary word_tokenize

A computer sees a sentence as one long stream of characters. Before it can do anything useful, we need to split the text into individual tokens — usually individual words and punctuation marks. This process is called tokenization.

Visual example

Original tweet (one long string):

"Huge fire at the bridge!! People running away!"

After tokenization (a list of individual tokens):

Hugefireatthebridge!!Peoplerunningaway!

Each coloured box = one token. Notice that punctuation (!! and !) become separate tokens.

The code

import nltk nltk.download('punkt') # download the tokenizer data sample_tweet = df_text.iloc[100] print('Before:', sample_tweet) tokenized_tweet = nltk.tokenize.word_tokenize(sample_tweet) print('After:', tokenized_tweet)

Key vocabulary

Term	Meaning	Example
token	One individual word or punctuation mark	"fire", "!!", "bridge"
corpus	Your complete collection of all texts	All 10,000+ tweets together
vocabulary	The set of unique tokens in the corpus	35,335 unique words found

Counting vocabulary

# Tokenize all tweets and count unique tokens tokenized_raw = [nltk.tokenize.word_tokenize(x) for x in list(df_text)] # Flatten all token lists and count unique ones vocab_size = len(set([y for x in tokenized_raw for y in x])) print("Vocabulary size:", vocab_size) # → 35335

💡 Why does vocabulary size matter?
The vocabulary is like a dictionary your AI model uses. A huge vocabulary means the model has more to learn. One goal of preprocessing is to reduce vocabulary size by merging similar words — making the model faster and more accurate.

🌿 Stemming & Lemmatization — Normalising Words

stemming lemmatization PorterStemmer WordNetLemmatizer

After tokenizing, we have thousands of unique words. But many of these are just different forms of the same word. Stemming and lemmatization group them together to reduce vocabulary size.

🤔 The problem
"run", "running", "ran", "runs" all mean the same thing, but a computer treats them as 4 completely different words. We want to collapse them into one form so the computer sees them as related.

Stemming — the fast, rough approach

A stemmer chops the end off a word to find its "stem". It is fast, but sometimes produces words that are not real English words.

porter = nltk.stem.PorterStemmer() examples = ['running', 'accident', 'terrible', 'fires', 'bridges'] for word in examples: print(word, '→', porter.stem(word)) # running → run # accident → accid ← not a real word! # terrible → terribl ← not a real word! # fires → fire # bridges → bridg ← not a real word!

Lemmatization — the accurate, slower approach

A lemmatizer uses a dictionary to find the true root word. It always returns a real word.

wnl = nltk.stem.WordNetLemmatizer() examples = ['running', 'better', 'geese', 'fires', 'worst'] for word in examples: print(word, '→', wnl.lemmatize(word)) # running → running (verbs need pos tag to lemmatize) # better → good # geese → goose # fires → fire # worst → bad

Side-by-side comparison

🪓 Stemming

✅ Very fast

✅ Simple algorithm

❌ May produce non-words

❌ Less accurate

Used in this lesson's pipeline

📖 Lemmatization

✅ Always real words

✅ More accurate

❌ Slower

❌ Needs more setup

Better for production NLP

🎓 In this lesson we use stemming in the process_tweet() function because it is faster to run on thousands of tweets. Both techniques serve the same goal: grouping related word forms together to reduce vocabulary size.

🚫 Stop Words — Removing Meaningless Words

stop words noise normalisation NLTK

Stop words are extremely common words that appear in almost every sentence but add very little meaning on their own. Words like "the", "is", "a", "in", "and", "of" are stop words. We remove them to keep only the meaningful, informative words.

Visual example — before and after

Original: "There is a huge fire at the warehouse and people are running away"

Thereisahugefireatthewarehouseandpeoplearerunningaway

red = stop words removed green = meaningful words kept

After removing stop words: "huge fire warehouse people running away" — still perfectly clear!

NLTK's built-in stop word list

stop = nltk.corpus.stopwords.words('english') print(stop[:20]) # → ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', # 'you', "you're", 'your', 'he', 'him', 'she', 'her', 'it', # 'the', 'a', 'an', 'and', ...] # Add extra stop words specific to tweets stop.append('@') # Twitter mentions stop.append('#') # Hashtag symbol stop.append('http') # URLs stop.append(':') # Colons # Remove stop words from a list of tokens clean = [word for word in tokens if word not in stop]

Why does this matter for NLP?

Without removing stop words

Vocabulary: ~35,000 words

Many are "the", "is", "a"

These words are the same in disaster AND non-disaster tweets → not useful for classification

After removing stop words

Vocabulary: ~25,000 words

Mostly meaningful words

Words like "fire", "flood", "evacuate" are kept → very useful for disaster detection!

🔄 The Full NLP Pipeline — Putting It All Together

pipeline preprocessing process_tweet() vocabulary reduction

A pipeline is a sequence of processing steps applied one after another. In this lesson, every tweet goes through the same 4 steps automatically. The pipeline is packaged into a function called process_tweet().

The pipeline visualised

Raw tweet 🐦

→

1. Tokenize

→

2. Stem

→

3. Lowercase

→

4. Remove stops ✅

Tracing one tweet through the pipeline

Step	Output
Raw	"A Huge Fire at the Bridge!!"
1. Tokenize	['A','Huge','Fire','at','the','Bridge','!!']
2. Stem	['A','Huge','Fire','at','the','Bridg','!!']
3. Lowercase	['a','huge','fire','at','the','bridg','!!']
4. Remove stops	['huge','fire','bridg']

The complete process_tweet() function

def process_tweet(tweet): # Step 1: split the tweet string into individual tokens tokenized_tweet = nltk.tokenize.word_tokenize(tweet) # Step 2: stem each token (shorten to root form) stemmed = [porter.stem(word) for word in tokenized_tweet] # Step 3: lowercase and remove stop words in one step processed = [w.lower() for w in stemmed if w not in stop] return processed # Run the pipeline on ALL tweets in the dataset def tokenizer(df): tweets = [] for _, tweet in df.iterrows(): tweets.append(process_tweet(tweet['text'])) return tweets tweets = tokenizer(df_raw) # process all 10,000+ tweets!

The result — vocabulary reduced by ~10,000 words!

Before pipeline

35,335

unique tokens in raw tweets

After pipeline

~25,000

unique tokens — cleaner, more meaningful

🎓 Why does this help AI?
The cleaned, smaller vocabulary is what gets fed into a machine learning model. Fewer, more meaningful features = a faster, more accurate classifier. This is the foundation of tweet disaster detection!

📦 What is NLTK?

Natural Language Toolkit Python library text processing

NLTK (Natural Language Toolkit) is one of the most popular Python libraries for working with human language data. It was built by researchers at the University of Pennsylvania and has been used worldwide since 2001.

🎯 What can NLTK do?
NLTK can split text into words, find root forms of words, remove common filler words, tag words as nouns/verbs/adjectives, and much more. Think of it as a complete Swiss Army knife for text processing.

Setting up NLTK — the downloads

Before using NLTK's tools, you must download the data they need. Each nltk.download() call downloads a specific dataset:

import nltk # Download required datasets (run once) nltk.download('punkt') # ← tokenizer rules nltk.download('wordnet') # ← dictionary for lemmatizer nltk.download('stopwords') # ← list of stop words nltk.download('omw-1.4') # ← multilingual wordnet support print("NLTK version:", nltk.__version__)

Download	What it gives you	Used for
punkt	Tokenization rules for English	word_tokenize(), sent_tokenize()
wordnet	A large English word dictionary	WordNetLemmatizer()
stopwords	Lists of common words in 20+ languages	stopwords.words('english')
omw-1.4	Open Multilingual Wordnet	Multilingual lemmatization

✂️ NLTK Tokenizers — Three Ways to Split Text

NLTK provides multiple tokenizers. Each splits text differently depending on your need.

1. word_tokenize() — split into words

from nltk.tokenize import word_tokenize text = "The building's on fire!! Call 999 now." tokens = word_tokenize(text) print(tokens) # → ['The', 'building', "'s", 'on', 'fire', '!', '!', 'Call', '999', 'now', '.'] # Notice: '!!' becomes two '!' tokens, "'s" is split from 'building'

2. sent_tokenize() — split into sentences

from nltk.tokenize import sent_tokenize text = "The fire started at 3am. Ten people were evacuated. No injuries reported." sentences = sent_tokenize(text) print(sentences) # → ['The fire started at 3am.', # 'Ten people were evacuated.', # 'No injuries reported.'] print(len(sentences), "sentences") # → 3 sentences

3. TweetTokenizer() — built for social media

from nltk.tokenize import TweetTokenizer tknzr = TweetTokenizer() tweet = "@JohnDoe Huge flood near #KL!! http://t.co/abc :(" tokens = tknzr.tokenize(tweet) print(tokens) # → ['@JohnDoe', 'Huge', 'flood', 'near', '#KL', '!!', 'http://t.co/abc', ':('] # Keeps @mentions, #hashtags, URLs, and emoticons intact!

word_tokenize

Best for: general English text, academic writing

Handles: contractions, punctuation

TweetTokenizer

Best for: tweets, social media posts

Handles: @mentions, #tags, emoticons, URLs

🤔 Think about it — For our disaster tweet dataset, which tokenizer makes more sense? TweetTokenizer preserves @mentions and #hashtags which might be useful features for our classifier!

🌿 NLTK Stemmers — Three Different Algorithms

NLTK includes multiple stemming algorithms. Each has a different level of aggressiveness in how much it cuts words.

from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer porter = PorterStemmer() lancaster = LancasterStemmer() snowball = SnowballStemmer('english') words = ['running', 'generously', 'electricity', 'disasters', 'evacuating'] print(f"{'Word':<15} {'Porter':<12} {'Lancaster':<12} {'Snowball'}") print("-" * 52) for w in words: print(f"{w:<15} {porter.stem(w):<12} {lancaster.stem(w):<12} {snowball.stem(w)}") # Word Porter Lancaster Snowball # running run run run # generously generous gen generous # electricity electr elect electr # disasters disast disast disast # evacuating evacuat evacu evacuat

Stemmer	Speed	Aggressiveness	Best for
PorterStemmer	Fast	Moderate — balanced cuts	Most general NLP tasks (used in this lesson)
LancasterStemmer	Fastest	Very aggressive — cuts a lot	When you want maximum reduction
SnowballStemmer	Fast	Moderate — more languages	Multilingual projects (supports 15+ languages)

⚠️ Watch out! Lancaster is so aggressive it can make words unrecognisable. "generously" becomes "gen" — losing too much meaning. Porter and Snowball are usually safer choices.

📖 NLTK Lemmatizer — Dictionary-Based Root Finding

from nltk.stem import WordNetLemmatizer wnl = WordNetLemmatizer() # Basic lemmatization (default: treats all words as nouns) words = ['running', 'better', 'geese', 'caring', 'worst'] for w in words: print(f"{w} → {wnl.lemmatize(w)}") # running → running ← stays same! (default assumes noun) # better → better ← stays same! (default assumes noun) # geese → goose ← correct! # caring → caring ← stays same! # worst → worst ← stays same! # ✅ Better: tell it the Part-of-Speech (pos) print(wnl.lemmatize('running', pos='v')) # → run (v = verb) print(wnl.lemmatize('better', pos='a')) # → good (a = adjective) print(wnl.lemmatize('worst', pos='a')) # → bad (a = adjective) print(wnl.lemmatize('caring', pos='v')) # → care (v = verb)

pos= code	Meaning	Example
'n'	noun (default)	geese → goose
'v'	verb	running → run, caring → care
'a'	adjective	better → good, worst → bad
'r'	adverb	better → well

Stemming vs Lemmatization — direct comparison

porter = PorterStemmer() wnl = WordNetLemmatizer() tests = [('accidents','n'),('evacuating','v'),('terrible','a'),('fires','n'),('flooded','v')] print(f"{'Original':<15} {'Stemmed':<15} {'Lemmatized'}") print("-" * 45) for word, pos in tests: print(f"{word:<15} {porter.stem(word):<15} {wnl.lemmatize(word, pos=pos)}") # Original Stemmed Lemmatized # accidents accid accident # evacuating evacuat evacuate # terrible terribl terrible # fires fire fire # flooded flood flood

🎓 Key insight — Lemmatization with pos tags gives real words like "accident" and "evacuate". Stemming gives "accid" and "evacuat". For a human-readable output, lemmatization is better. For pure machine processing speed, stemming wins.

🚫 NLTK Stop Words — Advanced Usage

from nltk.corpus import stopwords # English stop words (179 words) en_stop = stopwords.words('english') print("Total stop words:", len(en_stop)) # → 179 print(en_stop[:10]) # → ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"] # NLTK supports many languages! print(stopwords.fileids()) # → ['arabic', 'azerbaijani', 'basque', 'bengali', 'catalan', # 'chinese', 'danish', 'dutch', 'english', 'finnish', # 'french', 'german', 'greek', 'hebrew', 'hungarian', # 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', # 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', # 'swedish', 'tajik', 'turkish'] # Custom stop words for tweets tweet_stop = set(en_stop) # use set for faster lookup tweet_stop.update(['@', '#', 'http', 'https', 'rt', 'amp', '...']) # Efficient filtering using set membership tokens = ['the', 'fire', 'is', 'huge', '@', 'bridge', 'http'] clean = [t for t in tokens if t.lower() not in tweet_stop] print(clean) # → ['fire', 'huge', 'bridge']

💡 Pro tip — Use set() instead of a list for stop words. Checking word in set is 10x faster than word in list because sets use hash lookup. For 10,000 tweets this speed difference is significant!

Checking what is and isn't a stop word

# Test any word test_words = ['fire', 'the', 'disaster', 'is', 'flood', 'a'] for w in test_words: label = "STOP" if w in en_stop else "KEEP" print(f"{w:<12} → {label}") # fire → KEEP # the → STOP # disaster → KEEP # is → STOP # flood → KEEP # a → STOP

📊 NLTK FreqDist — Counting Word Frequencies

NLTK's FreqDist (Frequency Distribution) counts how often each token appears in your text — like a word counter built specifically for NLP.

from nltk import FreqDist # Example: count words in a list of tokens tokens = ['fire','flood','fire','evacuate','fire','flood','disaster'] fdist = FreqDist(tokens) # Most common words print(fdist.most_common(3)) # → [('fire', 3), ('flood', 2), ('evacuate', 1)] # Count a specific word print(fdist['fire']) # → 3 # Apply to ALL processed tweets all_tokens = [tok for tweet in tweets for tok in tweet] fdist_all = FreqDist(all_tokens) print(fdist_all.most_common(10)) # → top 10 most common words across all disaster tweets # Plot frequency chart (in Jupyter) fdist_all.plot(20, cumulative=False)

📈 Why is word frequency useful?
Knowing that "fire", "flood", "evacuate" appear most often in disaster tweets — but "lunch", "movie", "happy" appear in non-disaster tweets — gives our AI model a huge clue for classification. Word frequency is one of the most powerful features in NLP.

Comparing frequencies: Relevant vs Not Relevant

# Split tweets by label df_result = df_raw['choose_one'].copy().to_frame() df_result['tokens'] = tweets pos_tokens = [t for row in df_result[df_result['choose_one']=='Relevant']['tokens'] for t in row] neg_tokens = [t for row in df_result[df_result['choose_one']=='Not Relevant']['tokens'] for t in row] print("Top 5 disaster words:", FreqDist(pos_tokens).most_common(5)) print("Top 5 non-disaster words:", FreqDist(neg_tokens).most_common(5))

💻 NLTK Practice Exercises

Try these in your Jupyter notebook. Each exercise builds on the previous one.

Exercise A — Tokenize and count

# EXERCISE A: Complete the code below tweet = "Massive earthquake hits the city!! 50+ buildings collapsed. #disaster" # Step 1: tokenize the tweet tokens = ___________ # Step 2: count the tokens print("Token count:", ___________) # Step 3: print only tokens longer than 4 characters long_words = [t for t in tokens if len(t) > 4] print("Long words:", long_words) # ANSWER: # tokens = word_tokenize(tweet) # print("Token count:", len(tokens))

Exercise B — Apply the full pipeline to one sentence

# EXERCISE B: Apply process_tweet() to this sentence sentence = "The flooding of the river banks has destroyed hundreds of homes" # Step 1: tokenize tokens = word_tokenize(sentence) # Step 2: stem each token stemmed = [porter.stem(w) for w in tokens] # Step 3: remove stop words and lowercase processed = [w.lower() for w in stemmed if w.lower() not in stop] print("Original tokens:", len(tokens), tokens) print("After pipeline:", len(processed), processed) # Original: 14 tokens # After: ~6 tokens — ['flood', 'river', 'bank', 'destroi', 'hundr', 'home']

Exercise C — FreqDist on a paragraph

from nltk import FreqDist from nltk.tokenize import word_tokenize para = """Floods are one of the most common and deadly natural disasters. Heavy rain causes rivers to overflow. Floods destroy homes and farmland. Many people die in floods every year due to lack of warning.""" # Process the paragraph tokens = word_tokenize(para.lower()) clean = [t for t in tokens if t.isalpha() and t not in stop] fdist = FreqDist(clean) print("Top 5 words:", fdist.most_common(5)) # → [('floods', 3), ('people', 2), ('destroy', 1), ...] # Notice 'floods' appears 3 times — a strong disaster signal!

🎓 Challenge yourself — Modify Exercise C to compare word frequency between the first and second sentences. Which words appear only in the first sentence? This is how NLP models learn to distinguish between topics!

⚔️ NLTK vs BeautifulSoup — What's the Difference?

Students often wonder: are NLTK and BeautifulSoup the same thing? No — they solve completely different problems and are used at different stages of the pipeline.

🌐 BeautifulSoup (bs4)

Stage: Data collection

Job: Gets text FROM the internet

Works on: HTML files, websites

Knows about: Tags, links, page structure

Output: Raw text strings

🔬 NLTK

Stage: Text processing

Job: Understands and cleans text

Works on: Text strings and word lists

Knows about: Grammar, word roots, language

Output: Processed token lists

🔄 How they work together in this lesson

requests.get() — visit the websiteFetches raw HTML from Wikipedia/any site

BeautifulSoup — read the HTMLFinds <h3> and <p> tags, extracts text strings

NLTK — process the textTokenizes, stems, removes stop words from those strings

Machine Learning model — classifyUses the clean tokens to detect disaster tweets

📋 Full comparison table

Feature	🌐 BeautifulSoup (bs4)	🔬 NLTK
Full name	Beautiful Soup 4	Natural Language Toolkit
Purpose	Parse and navigate HTML/XML	Process and analyse human language
Input	HTML source code from a website	Plain text strings or word lists
Output	Extracted text, links, tags	Tokens, stems, POS tags, frequency counts
Understands	Web page structure (<tags>)	Language structure (words, grammar)
Used when	Collecting data from websites	Cleaning and processing collected text
Can it scrape?	✅ Yes — that's its main job	❌ No — it processes text, not websites
Can it tokenize?	❌ No	✅ Yes — multiple tokenizers
Can it stem?	❌ No	✅ Yes — Porter, Lancaster, Snowball
Can it remove stop words?	❌ No	✅ Yes — 28 languages
Used with requests?	✅ Yes — always paired together	❌ requests not needed
Analogy	A robot that reads web pages	A linguist that analyses words

💻 Side-by-side code — what each library does

BeautifulSoup — finding and extracting text from HTML

import requests, bs4 # BEAUTIFULSOUP JOBS: # 1. Fetch the web page # ✅ User-Agent prevents 403 hdrs = {'User-Agent': 'Mozilla/5.0'} r = requests.get('https://en.wikipedia.org/wiki/Jupiter', headers=hdrs) # 2. Parse the HTML structure soup = bs4.BeautifulSoup(r.text, 'html5lib') # 3. Navigate the page — find all headings headings = [h.text for h in soup.find_all('h3')] # 4. Extract paragraph text under the first heading first_h = soup.find('h3', text=headings[0]) raw_text = first_h.find_next_sibling('p').get_text() # raw_text is now a plain string ↓ BeautifulSoup is done here print("Raw text:", raw_text[:80])

NLTK — processing that raw text string

import nltk # NLTK JOBS: (takes the raw_text string from above) # 1. Tokenize the plain string into words tokens = nltk.tokenize.word_tokenize(raw_text) # 2. Stem each token porter = nltk.stem.PorterStemmer() stemmed = [porter.stem(w) for w in tokens] # 3. Remove stop words stop = set(nltk.corpus.stopwords.words('english')) processed = [w.lower() for w in stemmed if w.lower() not in stop] # 4. Count word frequency from nltk import FreqDist fdist = FreqDist(processed) print("Most common:", fdist.most_common(5))

🎯 The handoff point — BeautifulSoup's job ends when it calls .get_text() and returns a plain string. NLTK's job begins at that exact point — it receives the plain string and starts processing the language inside it. They never directly talk to each other — plain text is the bridge between them.

❓ Common student questions

Q: Can I use NLTK without BeautifulSoup?
Yes! If you already have text (e.g. from a CSV file like our tweet dataset), you go straight to NLTK. You only need BeautifulSoup when collecting text FROM a website.

Q: Can BeautifulSoup do what NLTK does?
No. BeautifulSoup understands HTML page structure (tags, links, layout) but knows nothing about English grammar. It cannot tokenize, stem, or find stop words.

Q: Do I always need both?
Only when your workflow is: scrape website → process text. In this lesson: Part 1 (Wikipedia) uses both. Part 2 (tweet CSV) uses only NLTK — because the tweets are already in a CSV file, no scraping needed.

Q: Which is harder to learn?
BeautifulSoup has a simpler API — learn 5 functions and you can scrape most sites. NLTK is deeper — it has dozens of tools, but you only need 4-5 for basic NLP (tokenize, stem, lemmatize, stopwords, FreqDist).

🐍 NLP Lesson 1 — Coding Exercises

🏆 Challenge Problems

📤 Submit your results to your teacher

How to submit