
๐Ÿ NLP Lesson 1 โ€” Coding Exercises

Tasks ยท fill-in quizzes ยท type-in exercises ยท challenges ยท hints ยท review


๐ŸŒ What is Natural Language Processing (NLP)?

NLP Artificial Intelligence Text Processing

Natural Language Processing (NLP) is a branch of Artificial Intelligence that teaches computers to understand, read, and make sense of human language – the kind of language we use every day when we write text messages, tweets, or emails.

💬 Real-world example
When you ask Siri "What is the weather today?", Siri uses NLP to understand your words and find the right answer. Gmail's spam filter uses NLP to decide if an email is junk. Google Translate uses NLP to convert one language to another.

Why is this hard for computers?

Humans understand language naturally โ€” we grew up learning it. Computers only understand numbers (0s and 1s). NLP is the bridge that converts human words into numbers that a computer can work with.
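To make that bridge concrete, here is a tiny sketch (the sentences and the mapping are invented for illustration) that gives each unique word an integer ID – the simplest possible way to turn words into numbers:

```python
# Toy example: turn words into numbers by giving each unique word an ID
sentences = ["huge fire at the bridge", "people running away"]

vocab = {}
for sentence in sentences:
    for word in sentence.split():
        if word not in vocab:
            vocab[word] = len(vocab)  # next unused integer

# Encode one sentence as the numbers a computer can work with
encoded = [vocab[word] for word in "people running away".split()]
print(vocab)
print(encoded)  # → [5, 6, 7]
```

Real NLP systems build far richer numeric representations, but they all start from a word-to-number mapping like this one.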

🤔 Think about it
The sentence "I saw the man with the telescope" could mean: (A) I used a telescope to see a man, or (B) I saw a man who had a telescope. Humans understand from context. Computers find this very difficult!

What will we do in this lesson?

1. Collect text data – use web scraping to gather text from websites automatically
2. Load tweet data – load a real disaster tweet dataset using pandas
3. Clean and process the text – tokenize, stem, and remove stop words from the tweets
4. Analyse the data – measure vocabulary size and tweet length distributions

🕷 Web Scraping – Collecting Data from Websites

requests BeautifulSoup HTML GET request

The internet has enormous amounts of text data. Web scraping is the technique of writing code to automatically visit websites and collect their text – just like a robot reading a page for you.

Step 1 – How a website works

📡 What happens when you visit a website?
Your browser sends a GET request to the website's server – it is like knocking on a door and saying "please give me your content". The server replies with an HTML file (a text file full of special tags). Your browser reads the HTML and displays it as the pretty page you see.

HTML uses tags – words inside angle brackets – to organise content on a page. Every tag has an opening and a closing version:

# Common HTML tags you will encounter:
<h1>This is the biggest heading</h1>
<h2>This is a medium heading</h2>
<h3>This is a smaller heading</h3>
<p>This is a paragraph of normal text.</p>
<a href="http://...">This is a clickable link</a>
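To see how code reads those tags, here is a small sketch that parses a made-up HTML snippet with BeautifulSoup (the page content is invented; Python's built-in 'html.parser' is used so nothing extra needs installing):

```python
import bs4

# A miniature HTML page as a plain string (invented for this example)
html = """
<h1>Jupiter</h1>
<p>Jupiter is the largest planet.</p>
<a href="http://example.com">Read more</a>
"""

soup = bs4.BeautifulSoup(html, "html.parser")
print(soup.find("h1").text)    # the heading text
print(soup.find("p").text)     # the paragraph text
print(soup.find("a")["href"])  # the link address
```

The same find() calls work identically on a real page downloaded with requests.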

Step 2 – Response codes

✅ Response [200]

Success! The website found your page and sent back the HTML content.

r = requests.get(url)
print(r)  # → <Response [200]>

❌ Response [404]

Not found. The URL you typed does not exist – like a wrong address.

r = requests.get(bad_url)
print(r)  # → <Response [404]>
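In a script you usually branch on r.status_code, the number shown inside <Response [...]>. Here is a hedged sketch with a hypothetical helper (describe_status is our own name for illustration, not part of the requests library):

```python
def describe_status(code):
    """Turn an HTTP status code into a friendly message."""
    if code == 200:
        return "Success - HTML received"
    elif code == 404:
        return "Page not found - check the URL"
    return f"Unexpected status: {code}"

print(describe_status(200))
print(describe_status(404))
```

In a real script you would call describe_status(r.status_code) right after requests.get().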

Step 3 – The two packages we use

import requests  # Package 1: talks to websites
import bs4       # Package 2: reads and understands HTML

# Step A: visit the website and get its HTML.
# Add a User-Agent header – without it Wikipedia returns 403 (Forbidden).
headers_req = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0)'}
r = requests.get('https://en.wikipedia.org/wiki/Jupiter', headers=headers_req)
print(r)  # → <Response [200]> means success!

# Step B: ask BeautifulSoup to read and understand the HTML
soup = bs4.BeautifulSoup(r.text, 'html5lib')

# Step C: find all <h3> heading tags
headers = []
for tag in soup.find_all("h3"):
    headers.append(tag.text)
print(headers[:3])  # first 3 headings

Step 4 – Getting paragraph text under a heading

Once we find a heading, we want the paragraph text that follows it. We use find_next() to get the next tag after our heading:

# Find the <h3> tag whose text matches our first heading
deet = soup.find('h3', string=lambda t: t and headers[0].strip() in t)

# Find one paragraph after that heading
para = deet.find_next('p')
print(para.get_text())

# To get ALL paragraphs under a heading, walk forward through the document.
# Use find_all_next(): it iterates over every later tag, whereas
# find_next() returns only a single tag and cannot be looped over.
for para in deet.find_all_next():
    if para.name in ("h2", "h3"):
        break  # stop when we reach the next heading
    elif para.name == "p":
        print(para.get_text())
🔧 Two fixes for your Jupyter notebook:
Fix 1 – 403 error: always add a User-Agent header, headers={'User-Agent': 'Mozilla/5.0'}, to requests.get()
Fix 2 – DeprecationWarning: use find_all() instead of findAll() – the camelCase name has been deprecated since bs4 v4.0
🎯 Complete working example – run as ONE cell:

headers_req = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0)'}
r = requests.get('https://en.wikipedia.org/wiki/Jupiter', headers=headers_req)
soup = bs4.BeautifulSoup(r.text, 'html5lib')

headers = [tag.text.strip() for tag in soup.find_all("h3")]
print(f"Found {len(headers)} headings")

all_para = ""
for header in headers:
    deet = soup.find('h3', string=lambda t: t and header in t)
    if deet:  # guard against None when a heading is not matched
        for para in deet.find_all_next():
            if para.name in ["h2", "h3"]:
                break
            elif para.name == "p":
                all_para += para.get_text()
print(all_para[:300])
🎯 Remember: requests talks to the website. BeautifulSoup reads what came back. Together they let you collect text from any webpage automatically.

๐Ÿผ Pandas โ€” Working with Data Tables

pandas DataFrame .head() .sample() len()

pandas is a Python package for working with data organised in rows and columns – like a spreadsheet. In this lesson we use it to load and explore our tweet dataset.

📊 What is a DataFrame?
A DataFrame is pandas' name for a table. It has columns (categories of information) and rows (individual records). Our tweet dataset has one row per tweet, with columns like 'text', 'choose_one', and 'keyword'.
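A DataFrame can also be built by hand from a Python dictionary, which is a handy way to see the rows-and-columns idea before loading the real CSV (the two rows below are invented for illustration):

```python
import pandas as pd

# A miniature stand-in for the tweet dataset
df = pd.DataFrame({
    "text": ["Huge fire at the bridge!!", "Lovely sunny day :)"],
    "choose_one": ["Relevant", "Not Relevant"],
    "keyword": ["fire", None],
})

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # the three column names
```

Each dictionary key becomes a column; each list position becomes a row.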

Loading and peeking at data

import pandas as pd

# Load the CSV file into a DataFrame
df_raw = pd.read_csv('socialmedia disaster tweets DFE.csv', encoding='ISO-8859-1')

# Preview the first 5 rows (like peeking at a spreadsheet)
df_raw.head(5)

# View 5 random rows (good for checking variety)
df_raw.sample(5)

# Count total rows (total tweets)
print(len(df_raw))  # → e.g. 10876

Selecting a single column

Use square brackets [ ] with the column name in quotes to select just one column:

# Select just the tweet text column
df_raw['text']        # → all tweet strings
df_raw['choose_one']  # → 'Relevant' or 'Not Relevant'
df_raw['keyword']     # → disaster keyword

# Make a copy so we don't change the original
df_text = df_raw['text'].copy()

# Access a specific row by position (0 = first row)
df_text.iloc[0]    # → first tweet string
df_text.iloc[100]  # → 101st tweet string

Filtering rows

You can filter a DataFrame to keep only rows that match a condition – like a search filter:

# Keep only tweets labelled 'Relevant'
df_relevant = df_raw[df_raw['choose_one'] == 'Relevant']
print(len(df_relevant), "relevant tweets")

# Keep only tweets labelled 'Not Relevant'
df_irrelevant = df_raw[df_raw['choose_one'] == 'Not Relevant']
🎯 Key pandas methods in this lesson
.head(n) – show first n rows  |  .sample(n) – show n random rows  |  .copy() – duplicate without changing the original  |  .iloc[n] – get the row at position n
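The four methods above can be tried on a toy DataFrame (ten made-up rows) without downloading anything:

```python
import pandas as pd

df = pd.DataFrame({"text": [f"tweet {i}" for i in range(10)]})

print(df.head(3))    # first 3 rows
print(df.sample(2))  # 2 random rows - different every run

df_text = df["text"].copy()  # independent copy of one column
df_text.iloc[0] = "edited"   # changing the copy...
print(df["text"].iloc[0])    # ...leaves the original intact
```

The last two lines show why .copy() matters: without it, edits to the selection could propagate back to the original DataFrame.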

โœ‚๏ธ Tokenization โ€” Splitting Text into Words

tokenization token corpus vocabulary word_tokenize

A computer sees a sentence as one long stream of characters. Before it can do anything useful, we need to split the text into individual tokens – usually words and punctuation marks. This process is called tokenization.

Visual example

Original tweet (one long string):

"Huge fire at the bridge!! People running away!"

After tokenization (a list of individual tokens):

['Huge', 'fire', 'at', 'the', 'bridge', '!', '!', 'People', 'running', 'away', '!']

Each token is one element of the list. Notice that punctuation becomes separate tokens – word_tokenize even splits "!!" into two individual "!" tokens.

The code

import nltk
nltk.download('punkt')  # download the tokenizer data

sample_tweet = df_text.iloc[100]
print('Before:', sample_tweet)

tokenized_tweet = nltk.tokenize.word_tokenize(sample_tweet)
print('After:', tokenized_tweet)

Key vocabulary

Term       | Meaning                                 | Example
token      | One individual word or punctuation mark | "fire", "!", "bridge"
corpus     | Your complete collection of all texts   | All 10,000+ tweets together
vocabulary | The set of unique tokens in the corpus  | 35,335 unique words found

Counting vocabulary

# Tokenize all tweets and count unique tokens
tokenized_raw = [nltk.tokenize.word_tokenize(x) for x in list(df_text)]

# Flatten all token lists and count unique ones
vocab_size = len(set([y for x in tokenized_raw for y in x]))
print("Vocabulary size:", vocab_size)  # → 35335
💡 Why does vocabulary size matter?
The vocabulary is like a dictionary your AI model uses. A huge vocabulary means the model has more to learn. One goal of preprocessing is to reduce vocabulary size by merging similar words – making the model faster and more accurate.

🌿 Stemming & Lemmatization – Normalising Words

stemming lemmatization PorterStemmer WordNetLemmatizer

After tokenizing, we have thousands of unique words. But many of these are just different forms of the same word. Stemming and lemmatization group them together to reduce vocabulary size.

🤔 The problem
"run", "running", "ran", "runs" all mean the same thing, but a computer treats them as 4 completely different words. We want to collapse them into one form so the computer sees them as related.
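You can check this collapsing directly with NLTK's Porter stemmer (a quick sketch; note that the irregular form "ran" survives, which is exactly the kind of case where a lemmatizer with a part-of-speech tag does better):

```python
import nltk

porter = nltk.stem.PorterStemmer()
forms = ["run", "running", "runs", "ran"]

stems = [porter.stem(w) for w in forms]
print(stems)  # the regular forms collapse to 'run'
print(len(set(forms)), "surface forms →", len(set(stems)), "stems")
```

Four surface forms shrink to two stems: a small example of the vocabulary reduction the whole pipeline is after.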

Stemming – the fast, rough approach

A stemmer chops the end off a word to find its "stem". It is fast, but sometimes produces words that are not real English words.

porter = nltk.stem.PorterStemmer()

examples = ['running', 'accident', 'terrible', 'fires', 'bridges']
for word in examples:
    print(word, '→', porter.stem(word))

# running  → run
# accident → accid    ← not a real word!
# terrible → terribl  ← not a real word!
# fires    → fire
# bridges  → bridg    ← not a real word!

Lemmatization – the accurate, slower approach

A lemmatizer uses a dictionary to find the true root word. It always returns a real word.

nltk.download('wordnet')  # dictionary data the lemmatizer needs
wnl = nltk.stem.WordNetLemmatizer()

# Without a part-of-speech hint, every word is treated as a noun:
print(wnl.lemmatize('geese'))  # → goose
print(wnl.lemmatize('fires'))  # → fire

# Verbs and adjectives need a pos tag to lemmatize correctly:
print(wnl.lemmatize('running', pos='v'))  # → run
print(wnl.lemmatize('better', pos='a'))   # → good
print(wnl.lemmatize('worst', pos='a'))    # → bad

Side-by-side comparison

🪓 Stemming

✅ Very fast
✅ Simple algorithm
❌ May produce non-words
❌ Less accurate

Used in this lesson's pipeline

📖 Lemmatization

✅ Always real words
✅ More accurate
❌ Slower
❌ Needs more setup

Better for production NLP

🎓 In this lesson we use stemming in the process_tweet() function because it is faster to run on thousands of tweets. Both techniques serve the same goal: grouping related word forms together to reduce vocabulary size.

🚫 Stop Words – Removing Meaningless Words

stop words noise normalisation NLTK

Stop words are extremely common words that appear in almost every sentence but add very little meaning on their own. Words like "the", "is", "a", "in", "and", "of" are stop words. We remove them to keep only the meaningful, informative words.

Visual example – before and after

Original: "There is a huge fire at the warehouse and people are running away"

Removed (stop words): There, is, a, at, the, and, are
Kept (meaningful):    huge, fire, warehouse, people, running, away

After removing stop words: "huge fire warehouse people running away" – still perfectly clear!

NLTK's built-in stop word list

nltk.download('stopwords')  # download the stop word list
stop = nltk.corpus.stopwords.words('english')
print(stop[:20])
# → ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves',
#    'you', "you're", "you've", ...] – common words like 'the', 'a',
#    'an', 'and' appear further down the list

# Add extra stop words specific to tweets
stop.append('@')     # Twitter mentions
stop.append('#')     # hashtag symbol
stop.append('http')  # URLs
stop.append(':')     # colons

# Remove stop words from a list of tokens
clean = [word for word in tokens if word not in stop]

Why does this matter for NLP?

Without removing stop words

Vocabulary: ~35,000 words

Many are "the", "is", "a"

These words are the same in disaster AND non-disaster tweets → not useful for classification

After removing stop words

Vocabulary: ~25,000 words

Mostly meaningful words

Words like "fire", "flood", "evacuate" are kept → very useful for disaster detection!

🔄 The Full NLP Pipeline – Putting It All Together

pipeline preprocessing process_tweet() vocabulary reduction

A pipeline is a sequence of processing steps applied one after another. In this lesson, every tweet goes through the same 4 steps automatically. The pipeline is packaged into a function called process_tweet().

The pipeline visualised

Raw tweet 🐦 → 1. Tokenize → 2. Stem → 3. Lowercase → 4. Remove stops ✅

Tracing one tweet through the pipeline

Step            | Output
Raw             | "A Huge Fire at the Bridge!!"
1. Tokenize     | ['A', 'Huge', 'Fire', 'at', 'the', 'Bridge', '!', '!']
2. Stem         | ['a', 'huge', 'fire', 'at', 'the', 'bridg', '!', '!']
3. Lowercase    | ['a', 'huge', 'fire', 'at', 'the', 'bridg', '!', '!']
4. Remove stops | ['huge', 'fire', 'bridg', '!', '!']

(word_tokenize splits "!!" into two "!" tokens, and NLTK's Porter stemmer already lowercases as it stems, so step 3 changes nothing here. Append '!' to the stop list if you also want the exclamation marks removed.)

The complete process_tweet() function

def process_tweet(tweet):
    # Step 1: split the tweet string into individual tokens
    tokenized_tweet = nltk.tokenize.word_tokenize(tweet)
    # Step 2: stem each token (shorten to root form)
    stemmed = [porter.stem(word) for word in tokenized_tweet]
    # Steps 3 + 4: lowercase, then drop stop words
    # (compare the lowercased form, since the stop list is lowercase)
    processed = [w.lower() for w in stemmed if w.lower() not in stop]
    return processed

# Run the pipeline on ALL tweets in the dataset
def tokenizer(df):
    tweets = []
    for _, tweet in df.iterrows():
        tweets.append(process_tweet(tweet['text']))
    return tweets

tweets = tokenizer(df_raw)  # process all 10,000+ tweets!

The result – vocabulary reduced by ~10,000 words!

Before pipeline

35,335

unique tokens in raw tweets

After pipeline

~25,000

unique tokens – cleaner, more meaningful

🎓 Why does this help AI?
The cleaned, smaller vocabulary is what gets fed into a machine learning model. Fewer, more meaningful features = a faster, more accurate classifier. This is the foundation of tweet disaster detection!