In this tutorial, we will uncover the magic of Natural Language Processing (NLP) and explore its common techniques such as tokenization, named entity recognition, and sentiment analysis.
By the end of this tutorial, you'll understand what NLP is, the main techniques involved, and how to implement them. You'll be able to create a simple NLP pipeline using Python's NLTK and SpaCy libraries.
Tokenization is the process of breaking down text into words, phrases, symbols, or other meaningful elements called tokens. The goal is to understand the context and make the text computationally manageable.
Named Entity Recognition (NER) is the process of locating entities in text and classifying them into predefined categories such as people, organizations, locations, dates, and monetary values.
Sentiment Analysis is the process of computationally identifying and categorizing opinions expressed in a piece of text, especially to determine whether the writer's attitude towards a particular topic is positive, negative, or neutral.
Let's dive into the implementation of each of these techniques using Python.
# Importing necessary library
import nltk
nltk.download('punkt')
# Sample text
text = "Hello, world. We are exploring NLP."
# Tokenization
tokens = nltk.word_tokenize(text)
print(tokens)
In this code snippet, we first import the necessary library, nltk, and download the 'punkt' package, which contains a pre-trained tokenizer. We then define a sample text and tokenize it using nltk.word_tokenize().
The expected output is:
['Hello', ',', 'world', '.', 'We', 'are', 'exploring', 'NLP', '.']
# Importing necessary library
import spacy
# Loading English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load("en_core_web_sm")
# Process whole documents
text = ("When Sebastian Thrun started working on self-driving cars at Google, few people took him seriously.")
doc = nlp(text)
# Extract named entities
for entity in doc.ents:
    print(entity.text, entity.label_)
In this code snippet, we first import SpaCy and load the small English language model. We then define a text, run it through the pipeline, and iterate over the named entities exposed by doc.ents.
The expected output is:
Sebastian Thrun PERSON
Google ORG
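If a label like ORG is unfamiliar, SpaCy can describe it for you. A small sketch using spacy.explain(), which only needs the spacy package itself, not the downloaded model:

```python
import spacy

# spacy.explain() maps an entity label to a human-readable description
print(spacy.explain("PERSON"))  # e.g. "People, including fictional"
print(spacy.explain("ORG"))     # e.g. "Companies, agencies, institutions, etc."
```

This is handy when a model emits labels such as GPE or NORP whose meaning is not obvious from the abbreviation alone.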
# Importing necessary libraries
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
# Initialize the sentiment intensity analyzer
vader = SentimentIntensityAnalyzer()
# Define a text
text = "I love this tutorial! It's very informative."
# Analyze the sentiment of the text
sentiment = vader.polarity_scores(text)
print(sentiment)
In this code snippet, we first import the necessary library and download the 'vader_lexicon' package, which is used for sentiment analysis. We then initialize the SentimentIntensityAnalyzer and define a text. Finally, we analyze the sentiment of the text using vader.polarity_scores().
The expected output is:
{'neg': 0.0, 'neu': 0.326, 'pos': 0.674, 'compound': 0.6696}
We learned about various NLP techniques, including tokenization, NER, and sentiment analysis. We also learned how to implement these techniques using Python's NLTK and SpaCy libraries.
Tokenize the following text: "NLP is fascinating. It makes machines understand human language."
Extract the named entities from this text: "Apple is planning to buy a UK startup for $1 billion."
Analyze the sentiment of this text: "I hate this movie. It's boring and the acting is terrible."
text = "NLP is fascinating. It makes machines understand human language."
tokens = nltk.word_tokenize(text)
print(tokens)
text = "Apple is planning to buy a UK startup for $1 billion."
doc = nlp(text)
for entity in doc.ents:
    print(entity.text, entity.label_)
text = "I hate this movie. It's boring and the acting is terrible."
sentiment = vader.polarity_scores(text)
print(sentiment)
To practice further, you can apply these techniques on different datasets to extract insights. Happy coding!