Natural Language Processing (NLP) is a branch of Artificial Intelligence that teaches computers to read, understand, and make sense of human language: the kind of language we use every day when we write text messages, tweets, or emails.
Humans understand language naturally; we grew up learning it. Computers only understand numbers (0s and 1s). NLP is the bridge that converts human words into numbers that a computer can work with.
The internet has enormous amounts of text data. Web scraping is the technique of writing code to automatically visit websites and collect their text, just like a robot reading a page for you.
HTML uses tags, words inside angle brackets, to organise content on a page. Every tag has an opening version (like <p>) and a matching closing version (like </p>).
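To see tags in action, here is a minimal sketch using Python's built-in html.parser (the lesson itself uses BeautifulSoup later); the page string is invented for illustration:

```python
from html.parser import HTMLParser  # standard library, no install needed

# A tiny invented page: each tag opens (<h1>, <p>) and closes (</h1>, </p>),
# and the readable text lives between the opening and closing tags.
page = "<h1>Jupiter</h1><p>Jupiter is the largest planet.</p>"

class TextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called with the text found between an opening and a closing tag.
        self.parts.append(data)

collector = TextCollector()
collector.feed(page)
print(collector.parts)  # ['Jupiter', 'Jupiter is the largest planet.']
```

The parser fires `handle_data` once per stretch of text between tags, which is exactly the content a scraper wants to keep.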
Success! The website found your page and sent back the HTML content.
r = requests.get(url)
print(r) → <Response [200]>
Not found. The URL you typed does not exist, like a wrong address.
r = requests.get(bad_url)
print(r) → <Response [404]>
Once we find a heading, we want the paragraph text that follows it. find_next() returns the single next matching tag after our heading, and find_all_next() returns every later matching tag, which lets us walk forward until the next section starts:
Two quick fixes that make the scraper work reliably:
- Pass headers={'User-Agent': 'Mozilla/5.0'} to requests.get(), so the site treats the request like a normal browser visit.
- Use find_all() instead of findAll(); the camelCase findAll() has been deprecated since bs4 v4.0.
```python
import bs4
import requests

# Identify ourselves as a browser; some sites block the default user agent.
headers_req = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0)'}
r = requests.get('https://en.wikipedia.org/wiki/Jupiter', headers=headers_req)

# 'html5lib' needs `pip install html5lib`; the built-in 'html.parser' also works.
soup = bs4.BeautifulSoup(r.text, 'html5lib')

# Collect the text of every h3 heading on the page.
headers = [tag.text.strip() for tag in soup.find_all("h3")]
print(f"Found {len(headers)} headings")

all_para = ""
for header in headers:
    # Locate the h3 tag whose text contains this heading.
    deet = soup.find('h3', string=lambda t: t and header in t)
    if deet:
        # Walk forward through the document, collecting paragraphs
        # until the next heading starts a new section.
        for para in deet.find_all_next(["h2", "h3", "p"]):
            if para.name in ["h2", "h3"]:
                break
            all_para += para.get_text()
print(all_para[:300])
pandas is a Python package for working with data organised in rows and columns, like a spreadsheet. In this lesson we use it to load and explore our tweet dataset, which has the columns 'text', 'choose_one', and 'keyword'.
Use square brackets [ ] with the column name in quotes to select just one column:
You can filter a DataFrame to keep only rows that match a condition, like a search filter:
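A minimal sketch of both operations on a hypothetical mini version of the dataset (the column names match the lesson's, but these three rows are invented):

```python
import pandas as pd

# Invented mini tweet dataset with the lesson's column names.
df = pd.DataFrame({
    "text": ["Forest fire near the town", "I love this sunny day", "Flood warning issued"],
    "choose_one": ["Relevant", "Not Relevant", "Relevant"],
    "keyword": ["fire", None, "flood"],
})

texts = df["text"]                              # square brackets select one column
disasters = df[df["choose_one"] == "Relevant"]  # keep only rows matching a condition

print(df.head(2))       # first 2 rows
print(len(disasters))   # 2
```

The condition `df["choose_one"] == "Relevant"` produces a True/False value per row, and indexing with it keeps only the True rows.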
| Method | What it does |
|---|---|
| .head(n) | show first n rows |
| .sample(n) | show n random rows |
| .copy() | duplicate without changing the original |
| .iloc[n] | get the row at position n |

A computer sees a sentence as one long stream of characters. Before it can do anything useful, we need to split the text into individual tokens, usually individual words and punctuation marks. This process is called tokenization.
Original tweet (one long string):
After tokenization (a list of individual tokens):
In the visual above, each coloured box is one token. Notice that punctuation marks (!! and !) become separate tokens.
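A minimal regex-based tokenizer sketch (the lesson's own tokenizer may differ, e.g. NLTK's word_tokenize), run on the tweet used later in the pipeline table:

```python
import re

def tokenize(text):
    # \w+ grabs runs of word characters; [^\w\s]+ grabs runs of punctuation.
    return re.findall(r"\w+|[^\w\s]+", text)

print(tokenize("A Huge Fire at the Bridge!!"))
# ['A', 'Huge', 'Fire', 'at', 'the', 'Bridge', '!!']
```

Words and punctuation come out as separate tokens, exactly the split the coloured boxes illustrate.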
| Term | Meaning | Example |
|---|---|---|
| token | One individual word or punctuation mark | "fire", "!!", "bridge" |
| corpus | Your complete collection of all texts | All 10,000+ tweets together |
| vocabulary | The set of unique tokens in the corpus | 35,335 unique words found |
After tokenizing, we have thousands of unique words. But many of these are just different forms of the same word. Stemming and lemmatization group them together to reduce vocabulary size.
A stemmer chops the end off a word to find its "stem". It is fast, but sometimes produces words that are not real English words.
A lemmatizer uses a dictionary to find the true root word. It always returns a real word.
| Stemming | Lemmatization |
|---|---|
| ✓ Very fast | ✓ Always real words |
| ✓ Simple algorithm | ✓ More accurate |
| ✗ May produce non-words | ✗ Slower |
| ✗ Less accurate | ✗ Needs more setup |
| Used in this lesson's pipeline | Better for production NLP |
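To see the trade-off concretely, here is a naive suffix-stripping stemmer; it is a toy stand-in, far cruder than the PorterStemmer the lesson's pipeline uses, but it shows how stemming can emit non-words:

```python
def naive_stem(word):
    """Chop a common suffix off the end of a word (a crude stemmer sketch)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when enough of the word remains to be a plausible stem.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(naive_stem("fires"))    # fire  (a real word)
print(naive_stem("running"))  # runn  (not a real word!)
```

A lemmatizer would instead look "running" up in a dictionary and return the real root "run", which is why lemmatization is slower but more accurate.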
Stop words are extremely common words that appear in almost every sentence but add very little meaning on their own. Words like "the", "is", "a", "in", "and", "of" are stop words. We remove them to keep only the meaningful, informative words.
Original: "There is a huge fire at the warehouse and people are running away"
(red = stop words removed, green = meaningful words kept)
After removing stop words: "huge fire warehouse people running away". Still perfectly clear!
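A minimal sketch of stop-word removal using a small hand-picked stop list (NLTK's nltk.corpus.stopwords provides a fuller English list):

```python
# Small hand-picked stop list; a real pipeline would use a fuller one.
STOP_WORDS = {"there", "is", "a", "at", "the", "and", "are", "in", "of"}

def remove_stop_words(text):
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

kept = remove_stop_words("There is a huge fire at the warehouse and people are running away")
print(" ".join(kept))  # huge fire warehouse people running away
```

Only the meaningful words survive, matching the example sentence above.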
Before removing stop words:
- Vocabulary: ~35,000 words
- Many of them are just "the", "is", "a"
- These words appear in disaster AND non-disaster tweets alike, so they are not useful for classification

After removing stop words:
- Vocabulary: ~25,000 words
- Mostly meaningful words
- Words like "fire", "flood", "evacuate" are kept: very useful for disaster detection!
A pipeline is a sequence of processing steps applied one after another. In this lesson, every tweet goes through the same 4 steps automatically. The pipeline is packaged into a function called process_tweet().
| Step | Output |
|---|---|
| Raw | "A Huge Fire at the Bridge!!" |
| 1. Tokenize | ['A','Huge','Fire','at','the','Bridge','!!'] |
| 2. Stem | ['A','Huge','Fire','at','the','Bridg','!!'] |
| 3. Lowercase | ['a','huge','fire','at','the','bridg','!!'] |
| 4. Remove stops | ['huge','fire','bridg'] |
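The four steps above can be wired together as a simplified stand-in for the lesson's process_tweet() function. This sketch uses a regex tokenizer and a naive suffix-stripping stemmer instead of NLTK's, so "Bridge" stems to "bridge" here rather than Porter's "bridg":

```python
import re

STOP_WORDS = {"a", "the", "at", "is", "in", "and", "of"}

def naive_stem(word):
    # Toy suffix-stripper standing in for NLTK's PorterStemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def process_tweet(tweet):
    tokens = re.findall(r"\w+|[^\w\s]+", tweet)   # 1. tokenize
    tokens = [naive_stem(t) for t in tokens]      # 2. stem
    tokens = [t.lower() for t in tokens]          # 3. lowercase
    # 4. remove stop words and punctuation-only tokens
    return [t for t in tokens if t.isalpha() and t not in STOP_WORDS]

print(process_tweet("A Huge Fire at the Bridge!!"))  # ['huge', 'fire', 'bridge']
```

Every tweet in the corpus goes through the same four steps, which is what shrinks the vocabulary from 35,335 raw tokens to roughly 25,000 cleaner ones.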
Before the pipeline: 35,335 unique tokens in raw tweets.
After the pipeline: ~25,000 unique tokens, cleaner and more meaningful.
Well done!