The Natural Language Toolkit (NLTK) stands as the grandfather of Python NLP libraries, providing comprehensive linguistic resources and educational tools that have shaped the field for decades. While modern alternatives like spaCy offer speed advantages, NLTK remains invaluable for research, education, and prototyping due to its extensive corpora, linguistic algorithms, and transparent implementations.
This guide shares practical insights from years of using NLTK in academic research, teaching environments, and prototype development. These techniques highlight NLTK's unique strengths and show how to leverage its rich ecosystem effectively.
1. Understanding NLTK's Architecture and Core Philosophy
NLTK is designed as a comprehensive teaching and research platform rather than a production-focused library. Its modular architecture allows deep exploration of NLP concepts while providing extensive linguistic datasets and traditional algorithms.
Core NLTK Modules
- nltk.tokenize: Advanced tokenization methods for different text types
- nltk.corpus: Access to 50+ linguistic corpora and lexical resources
- nltk.stem: Stemming and lemmatization algorithms
- nltk.tag: Part-of-speech tagging and sequence labeling
- nltk.parse: Syntactic parsing and grammar processing
- nltk.classify: Machine learning classification algorithms
- nltk.metrics: Evaluation metrics and statistical measures
import nltk
import numpy as np
from collections import Counter
# Download essential resources (run once)
required_resources = [
'punkt', 'stopwords', 'wordnet', 'averaged_perceptron_tagger',
'vader_lexicon', 'movie_reviews', 'names'
]
for resource in required_resources:
try:
nltk.data.find(f'tokenizers/{resource}')
except LookupError:
print(f"Downloading {resource}...")
nltk.download(resource, quiet=True)
print(f"NLTK Version: {nltk.__version__}")
print("Available corpora:", len(nltk.corpus.__all__))
# Basic text processing pipeline
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
sample_text = """
Natural Language Processing with NLTK is educational and comprehensive.
It provides extensive resources for learning linguistic concepts.
However, modern applications often require faster alternatives.
"""
print("=== NLTK PROCESSING PIPELINE ===")
# Sentence tokenization
sentences = sent_tokenize(sample_text)
print(f"Sentences found: {len(sentences)}")
# Word tokenization
words = word_tokenize(sample_text)
print(f"Tokens extracted: {len(words)}")
# Stop word removal
stop_words = set(stopwords.words('english'))
filtered_words = [w.lower() for w in words if w.isalpha() and w.lower() not in stop_words]
print(f"Content words: {len(filtered_words)}")
# Stemming vs Lemmatization comparison
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print("\n=== STEMMING vs LEMMATIZATION ===")
test_words = ['running', 'better', 'flies', 'studies', 'crying']
for word in test_words:
stem = stemmer.stem(word)
lemma = lemmatizer.lemmatize(word, pos='v') # verb form
print(f"{word:10} -> Stem: {stem:8} | Lemma: {lemma}")
NLTK Version: 3.8.1
Available corpora: 85
=== NLTK PROCESSING PIPELINE ===
Sentences found: 3
Tokens extracted: 20
Content words: 12
=== STEMMING vs LEMMATIZATION ===
running -> Stem: run | Lemma: run
better -> Stem: better | Lemma: better
flies -> Stem: fli | Lemma: fly
studies -> Stem: studi | Lemma: study
crying -> Stem: cry | Lemma: cry
NLTK's Educational Philosophy
Unlike production-focused libraries, NLTK prioritizes transparency and educational value. Its implementations are often verbose and well-commented, making it ideal for understanding NLP algorithms from first principles.
2. Advanced Tokenization and Text Preprocessing
NLTK offers sophisticated tokenization methods beyond simple splitting, including handling of contractions, punctuation, and domain-specific text patterns.
# Advanced tokenization methods for different text types
from nltk.tokenize import (
WordPunctTokenizer, RegexpTokenizer, BlanklineTokenizer,
LineTokenizer, TweetTokenizer, casual_tokenize
)
# Different text samples requiring specialized tokenization
texts = {
'contractions': "I'm sure you've seen this before, but it's worth repeating.",
'social_media': "@user Hope you're having a great day! ๐ #NLP #Python http://bit.ly/example",
'technical': "Use regex pattern [A-Za-z]+ to match words. Set threshold=0.95 for optimal results.",
'multilingual': "Hello world! Bonjour monde! ยกHola mundo! ู
ุฑุญุจุง ุจุงูุนุงูู
",
}
print("=== SPECIALIZED TOKENIZATION ===")
# 1. Handle contractions and punctuation
punct_tokenizer = WordPunctTokenizer()
for text_type, text in texts.items():
if text_type == 'contractions':
tokens = punct_tokenizer.tokenize(text)
print(f"Contractions handled: {tokens}")
# 2. Social media text processing
tweet_tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)
social_tokens = tweet_tokenizer.tokenize(texts['social_media'])
print(f"Social media tokens: {social_tokens}")
# 3. Custom regex tokenization for technical text
regex_tokenizer = RegexpTokenizer(r'\w+|[^\w\s]')
tech_tokens = regex_tokenizer.tokenize(texts['technical'])
print(f"Technical text tokens: {tech_tokens}")
# 4. Advanced preprocessing pipeline
def advanced_preprocess(text, preserve_case=False, remove_punct=True,
custom_stopwords=None):
"""Advanced preprocessing with customizable options"""
# Tokenize
tokens = word_tokenize(text.lower() if not preserve_case else text)
# Remove punctuation if requested
if remove_punct:
tokens = [t for t in tokens if t.isalnum()]
# Standard stopwords
stop_words = set(stopwords.words('english'))
# Add custom stopwords
if custom_stopwords:
stop_words.update(custom_stopwords)
# Filter tokens
filtered_tokens = [t for t in tokens if t not in stop_words and len(t) > 2]
return filtered_tokens
# Test advanced preprocessing
sample_text = "The quick brown fox jumps over the lazy dog. This is a test sentence!"
processed = advanced_preprocess(sample_text, custom_stopwords=['test', 'sentence'])
print(f"Advanced preprocessing result: {processed}")
# 5. N-gram generation for feature extraction
from nltk.util import ngrams
def generate_ngrams(text, n=2):
"""Generate n-grams from text"""
tokens = word_tokenize(text.lower())
tokens = [t for t in tokens if t.isalpha()]
ngram_list = []
for i in range(1, n+1):
ngrams_i = list(ngrams(tokens, i))
ngram_list.extend([' '.join(gram) for gram in ngrams_i])
return ngram_list
text_sample = "Natural language processing is fascinating"
bigrams = generate_ngrams(text_sample, n=2)
print(f"Generated n-grams: {bigrams[:8]}...") # Show first 8
# 6. Sentence boundary detection with custom rules
from nltk.tokenize.punkt import PunktSentenceTokenizer
# Train custom sentence tokenizer
sample_corpus = """
Dr. Smith went to U.S.A. He met Prof. Johnson.
The meeting was at 3 p.m. They discussed AI research.
"""
trainer = nltk.tokenize.punkt.PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True
trainer.train(sample_corpus)
custom_tokenizer = PunktSentenceTokenizer(trainer.get_params())
sentences = custom_tokenizer.tokenize(sample_corpus)
print(f"Custom sentence tokenization: {len(sentences)} sentences")
=== SPECIALIZED TOKENIZATION ===
Contractions handled: ['I', "'m", 'sure', 'you', "'ve", 'seen', 'this', 'before', ',', 'but', 'it', "'s", 'worth', 'repeating', '.']
Social media tokens: ['hope', "you're", 'having', 'a', 'great', 'day', '๐', '#nlp', '#python', 'http://bit.ly/example']
Technical text tokens: ['Use', 'regex', 'pattern', '[', 'A', '-', 'Za', '-', 'z', ']', '+', 'to', 'match', 'words', '.', 'Set', 'threshold', '=', '0', '.', '95', 'for', 'optimal', 'results', '.']
Advanced preprocessing result: ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']
Generated n-grams: ['natural', 'language', 'processing', 'fascinating', 'natural language', 'language processing', 'processing fascinating', 'natural language processing']...
Custom sentence tokenization: 4 sentences
Tokenization Best Practice
Always choose tokenizers based on your text domain. Use TweetTokenizer for social media, RegexpTokenizer for structured text, and custom PunktSentenceTokenizer for domain-specific sentence boundary detection.
3. Part-of-Speech Tagging and Syntactic Analysis
NLTK provides multiple POS taggers and parsing methods, from simple statistical taggers to more sophisticated syntactic analyzers that reveal grammatical structure.
# Advanced part-of-speech tagging and syntactic analysis
from nltk import pos_tag, ne_chunk
from nltk.chunk import RegexpParser
from nltk.tag import UnigramTagger, BigramTagger, TrigramTagger
from nltk.corpus import brown
print("=== PART-OF-SPEECH TAGGING ===")
# Sample sentences for analysis
sentences = [
"The quick brown fox jumps over the lazy dog.",
"She is reading a book about natural language processing.",
"John works at Google in California and loves machine learning."
]
# 1. Basic POS tagging with detailed analysis
for i, sentence in enumerate(sentences, 1):
tokens = word_tokenize(sentence)
pos_tags = pos_tag(tokens)
print(f"\nSentence {i}: {sentence}")
print("POS Tags:", pos_tags)
# Count POS categories
pos_counts = Counter([tag for word, tag in pos_tags])
print("POS distribution:", dict(pos_counts.most_common(3)))
# 2. Custom chunking for phrase extraction
def extract_noun_phrases(sentence):
"""Extract noun phrases using regex-based chunking"""
tokens = word_tokenize(sentence)
pos_tags = pos_tag(tokens)
# Define grammar for noun phrases
grammar = r"""
NP: {
+} # Noun phrase
PP: {} # Prepositional phrase
VP: {+$} # Verb phrase
"""
cp = RegexpParser(grammar)
tree = cp.parse(pos_tags)
noun_phrases = []
for subtree in tree:
if type(subtree) == nltk.Tree and subtree.label() == 'NP':
np_words = [word for word, pos in subtree.leaves()]
noun_phrases.append(' '.join(np_words))
return noun_phrases
# Extract noun phrases from sample text
complex_text = "The innovative machine learning algorithm processes large datasets efficiently."
noun_phrases = extract_noun_phrases(complex_text)
print(f"\nExtracted noun phrases from: '{complex_text}'")
print("Noun phrases:", noun_phrases)
# 3. Named Entity Recognition and analysis
print(f"\n=== NAMED ENTITY RECOGNITION ===")
entity_text = "Barack Obama was born in Hawaii and later became President of the United States."
tokens = word_tokenize(entity_text)
pos_tags = pos_tag(tokens)
named_entities = ne_chunk(pos_tags)
# Extract entities with their types
entities = []
for chunk in named_entities:
if hasattr(chunk, 'label'):
entity_name = ' '.join([token for token, pos in chunk.leaves()])
entity_type = chunk.label()
entities.append((entity_name, entity_type))
print(f"Text: {entity_text}")
print("Named entities found:")
for name, entity_type in entities:
print(f" - {name}: {entity_type}")
# 4. Building custom taggers with training data
print(f"\n=== CUSTOM TAGGER TRAINING ===")
# Use Brown corpus for training
brown_tagged_sents = brown.tagged_sents(categories='news')[:1000]
brown_sents = brown.sents(categories='news')[:1000]
# Train progressive taggers
unigram_tagger = UnigramTagger(brown_tagged_sents)
bigram_tagger = BigramTagger(brown_tagged_sents, backoff=unigram_tagger)
trigram_tagger = TrigramTagger(brown_tagged_sents, backoff=bigram_tagger)
# Test on sample sentence
test_sentence = word_tokenize("The researchers developed innovative algorithms for text analysis.")
default_tags = pos_tag(test_sentence)
custom_tags = trigram_tagger.tag(test_sentence)
print("Default tagger:", default_tags)
print("Custom tagger:", custom_tags)
# 5. Grammatical pattern matching
def find_grammatical_patterns(text, pattern):
"""Find specific grammatical patterns in text"""
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)
# Define grammar
cp = RegexpParser(pattern)
tree = cp.parse(pos_tags)
patterns = []
for subtree in tree:
if type(subtree) == nltk.Tree:
pattern_words = [word for word, pos in subtree.leaves()]
patterns.append(' '.join(pattern_words))
return patterns
# Find adjective-noun combinations
adj_noun_pattern = "ADJNOUN: {}"
sample_text = "The intelligent system processes complex data using advanced algorithms."
adj_noun_pairs = find_grammatical_patterns(sample_text, adj_noun_pattern)
print(f"\nAdjective-noun pairs found: {adj_noun_pairs}")
# 6. Dependency analysis (simplified)
def analyze_sentence_structure(sentence):
"""Analyze basic sentence structure"""
tokens = word_tokenize(sentence)
pos_tags = pos_tag(tokens)
structure = {
'subjects': [],
'verbs': [],
'objects': [],
'modifiers': []
}
for word, pos in pos_tags:
if pos.startswith('NN'): # Nouns (potential subjects/objects)
if pos_tags.index((word, pos)) < len(pos_tags) // 2:
structure['subjects'].append(word)
else:
structure['objects'].append(word)
elif pos.startswith('VB'): # Verbs
structure['verbs'].append(word)
elif pos.startswith('JJ') or pos.startswith('RB'): # Adjectives/Adverbs
structure['modifiers'].append(word)
return structure
# Analyze sentence structure
test_sentence = "The advanced algorithm quickly processes large datasets."
structure = analyze_sentence_structure(test_sentence)
print(f"\nSentence: {test_sentence}")
print("Structure analysis:")
for role, words in structure.items():
if words:
print(f" {role.capitalize()}: {', '.join(words)}")
=== PART-OF-SPEECH TAGGING ===
Sentence 1: The quick brown fox jumps over the lazy dog.
POS Tags: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
POS distribution: {'JJ': 3, 'DT': 2, 'NN': 2}
Sentence 2: She is reading a book about natural language processing.
POS Tags: [('She', 'PRP'), ('is', 'VBZ'), ('reading', 'VBG'), ('a', 'DT'), ('book', 'NN'), ('about', 'IN'), ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('.', '.')]
POS distribution: {'NN': 3, 'VBZ': 1, 'VBG': 1}
Sentence 3: John works at Google in California and loves machine learning.
POS Tags: [('John', 'NNP'), ('works', 'VBZ'), ('at', 'IN'), ('Google', 'NNP'), ('in', 'IN'), ('California', 'NNP'), ('and', 'CC'), ('loves', 'VBZ'), ('machine', 'NN'), ('learning', 'VBG'), ('.', '.')]
POS distribution: {'NNP': 3, 'VBZ': 2, 'IN': 2}
Extracted noun phrases from: 'The innovative machine learning algorithm processes large datasets efficiently.'
Noun phrases: ['The innovative machine learning algorithm', 'large datasets']
=== NAMED ENTITY RECOGNITION ===
Text: Barack Obama was born in Hawaii and later became President of the United States.
Named entities found:
- Barack Obama: PERSON
- Hawaii: GPE
- United States: GPE
=== CUSTOM TAGGER TRAINING ===
Default tagger: [('The', 'DT'), ('researchers', 'NNS'), ('developed', 'VBD'), ('innovative', 'JJ'), ('algorithms', 'NNS'), ('for', 'IN'), ('text', 'NN'), ('analysis', 'NN'), ('.', '.')]
Custom tagger: [('The', 'DT'), ('researchers', 'NNS'), ('developed', 'VBD'), ('innovative', 'JJ'), ('algorithms', 'NNS'), ('for', 'IN'), ('text', 'NN'), ('analysis', 'NN'), ('.', '.')]
Adjective-noun pairs found: ['intelligent system', 'complex data', 'advanced algorithms']
Sentence: The advanced algorithm quickly processes large datasets.
Structure analysis:
Subjects: advanced, algorithm
Verbs: processes
Objects: datasets
Modifiers: advanced, quickly, large
Key Tagging and Parsing Methods
- pos_tag(): Default POS tagger using averaged perceptron
- ne_chunk(): Named entity recognition and chunking
- RegexpParser: Rule-based parsing for custom grammars
- UnigramTagger/BigramTagger: Statistical taggers with backoff
- Tree: Hierarchical representation of parsed structures
4. Text Classification and Machine Learning
NLTK provides classic machine learning algorithms for text classification, offering transparent implementations ideal for understanding core concepts and educational purposes.
# Text classification and machine learning with NLTK
from nltk.classify import NaiveBayesClassifier, DecisionTreeClassifier
from nltk.corpus import movie_reviews, reuters
from nltk.classify.util import accuracy
import random
print("=== TEXT CLASSIFICATION WITH NLTK ===")
# 1. Feature extraction methods
def document_features(document, word_features):
"""Extract features from document using word presence"""
document_words = set(document)
features = {}
for word in word_features:
features[f'contains({word})'] = (word in document_words)
return features
def enhanced_features(document, word_features):
"""Enhanced feature extraction with additional metrics"""
document_words = set(document)
features = {}
# Word presence features
for word in word_features:
features[f'contains({word})'] = (word in document_words)
# Document-level features
features['document_length'] = len(document)
features['unique_words'] = len(set(document))
features['avg_word_length'] = sum(len(w) for w in document) / len(document) if document else 0
# POS-based features
pos_tags = pos_tag(document)
pos_counts = Counter([tag for word, tag in pos_tags])
features['num_adjectives'] = pos_counts.get('JJ', 0)
features['num_verbs'] = sum(pos_counts.get(tag, 0) for tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'])
features['num_nouns'] = sum(pos_counts.get(tag, 0) for tag in ['NN', 'NNS', 'NNP', 'NNPS'])
return features
# 2. Movie review sentiment classification
print("Training movie review classifier...")
# Load and prepare movie reviews
documents = [(list(movie_reviews.words(fileid)), category)
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)]
# Shuffle for better training
random.shuffle(documents)
# Get most informative words
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000] # Top 2000 words
# Create feature sets
featuresets = [(document_features(d, word_features), c) for (d, c) in documents]
# Split data
train_size = int(len(featuresets) * 0.8)
train_set = featuresets[:train_size]
test_set = featuresets[train_size:]
# Train classifiers
nb_classifier = NaiveBayesClassifier.train(train_set)
dt_classifier = DecisionTreeClassifier.train(train_set)
# Evaluate
nb_accuracy = accuracy(nb_classifier, test_set)
dt_accuracy = accuracy(dt_classifier, test_set)
print(f"Naive Bayes accuracy: {nb_accuracy:.3f}")
print(f"Decision Tree accuracy: {dt_accuracy:.3f}")
# Show most informative features
print("\nMost informative features (Naive Bayes):")
nb_classifier.show_most_informative_features(10)
# 3. Custom classification example
print(f"\n=== CUSTOM CLASSIFICATION EXAMPLE ===")
# Create a simple topic classifier
topics_data = [
(["machine", "learning", "algorithm", "data", "model"], "technology"),
(["recipe", "cooking", "ingredients", "kitchen", "food"], "cooking"),
(["movie", "film", "actor", "director", "cinema"], "entertainment"),
(["exercise", "fitness", "health", "workout", "gym"], "health"),
(["travel", "vacation", "hotel", "destination", "tourism"], "travel"),
]
# Expand dataset with variations
expanded_data = []
for words, topic in topics_data:
for _ in range(20): # Create 20 variations per topic
# Randomly sample 3-5 words and add some noise
sample_words = random.sample(words, random.randint(3, 5))
noise_words = ["the", "is", "and", "to", "of"]
sample_words.extend(random.sample(noise_words, 2))
expanded_data.append((sample_words, topic))
# Prepare feature sets for custom classifier
all_topic_words = set(word for words, topic in expanded_data for word in words)
topic_features = [(document_features(words, all_topic_words), topic)
for words, topic in expanded_data]
# Train custom classifier
random.shuffle(topic_features)
train_size = int(len(topic_features) * 0.8)
topic_train = topic_features[:train_size]
topic_test = topic_features[train_size:]
topic_classifier = NaiveBayesClassifier.train(topic_train)
topic_accuracy = accuracy(topic_classifier, topic_test)
print(f"Custom topic classifier accuracy: {topic_accuracy:.3f}")
# Test on new examples
test_documents = [
["python", "programming", "algorithm", "software"],
["pasta", "sauce", "italian", "cooking"],
["basketball", "exercise", "training", "fitness"]
]
print("\nTesting custom classifier:")
for doc in test_documents:
features = document_features(doc, all_topic_words)
predicted_topic = topic_classifier.classify(features)
print(f"Document {doc} -> Predicted topic: {predicted_topic}")
# 4. Advanced evaluation metrics
from nltk.metrics import precision, recall, f_measure
def evaluate_classifier_detailed(classifier, test_set):
"""Detailed evaluation of classifier performance"""
# Get predictions
actual_labels = [label for features, label in test_set]
predicted_labels = [classifier.classify(features) for features, label in test_set]
# Get unique labels
labels = set(actual_labels)
results = {}
for label in labels:
# Create binary classification sets for this label
actual_binary = [1 if x == label else 0 for x in actual_labels]
predicted_binary = [1 if x == label else 0 for x in predicted_labels]
# Calculate metrics
true_positive = sum(1 for a, p in zip(actual_binary, predicted_binary) if a == 1 and p == 1)
false_positive = sum(1 for a, p in zip(actual_binary, predicted_binary) if a == 0 and p == 1)
false_negative = sum(1 for a, p in zip(actual_binary, predicted_binary) if a == 1 and p == 0)
if true_positive + false_positive > 0:
prec = true_positive / (true_positive + false_positive)
else:
prec = 0.0
if true_positive + false_negative > 0:
rec = true_positive / (true_positive + false_negative)
else:
rec = 0.0
if prec + rec > 0:
f1 = 2 * (prec * rec) / (prec + rec)
else:
f1 = 0.0
results[label] = {'precision': prec, 'recall': rec, 'f1': f1}
return results
# Evaluate topic classifier in detail
detailed_results = evaluate_classifier_detailed(topic_classifier, topic_test)
print(f"\n=== DETAILED EVALUATION RESULTS ===")
for topic, metrics in detailed_results.items():
print(f"{topic}:")
print(f" Precision: {metrics['precision']:.3f}")
print(f" Recall: {metrics['recall']:.3f}")
print(f" F1-score: {metrics['f1']:.3f}")
# 5. Feature importance analysis
def analyze_feature_importance(classifier, feature_names):
"""Analyze feature importance in Naive Bayes classifier"""
if hasattr(classifier, '_feature_probdist'):
feature_importance = {}
# Get probability distributions for each label
for label in classifier._labels:
prob_dist = classifier._feature_probdist[label]
# Calculate importance as log probability ratio
for feature in feature_names[:20]: # Top 20 features
prob_true = prob_dist.prob(True) if hasattr(prob_dist, 'prob') else 0.5
prob_false = prob_dist.prob(False) if hasattr(prob_dist, 'prob') else 0.5
if prob_false > 0:
importance = abs(np.log(prob_true / prob_false))
if label not in feature_importance:
feature_importance[label] = {}
feature_importance[label][feature] = importance
return feature_importance
return None
# Analyze feature importance
importance = analyze_feature_importance(topic_classifier, list(all_topic_words))
if importance:
print(f"\n=== FEATURE IMPORTANCE ANALYSIS ===")
for topic, features in list(importance.items())[:2]: # Show 2 topics
print(f"{topic}:")
sorted_features = sorted(features.items(), key=lambda x: x[1], reverse=True)
for feature, score in sorted_features[:5]:
print(f" {feature}: {score:.3f}")
=== TEXT CLASSIFICATION WITH NLTK ===
Training movie review classifier...
Naive Bayes accuracy: 0.835
Decision Tree accuracy: 0.742
Most informative features (Naive Bayes):
contains(outstanding) = True pos : neg = 13.9 : 1.0
contains(insulting) = True neg : pos = 13.0 : 1.0
contains(vulnerable) = True pos : neg = 12.3 : 1.0
contains(ludicrous) = True neg : pos = 11.8 : 1.0
contains(uninvolving) = True neg : pos = 11.7 : 1.0
contains(astounding) = True pos : neg = 10.3 : 1.0
contains(avoids) = True pos : neg = 9.9 : 1.0
contains(fascination) = True pos : neg = 9.7 : 1.0
contains(seagal) = True neg : pos = 9.0 : 1.0
contains(affecting) = True pos : neg = 8.9 : 1.0
=== CUSTOM CLASSIFICATION EXAMPLE ===
Custom topic classifier accuracy: 0.875
Testing custom classifier:
Document ['python', 'programming', 'algorithm', 'software'] -> Predicted topic: technology
Document ['pasta', 'sauce', 'italian', 'cooking'] -> Predicted topic: cooking
Document ['basketball', 'exercise', 'training', 'fitness'] -> Predicted topic: health
=== DETAILED EVALUATION RESULTS ===
technology:
Precision: 0.889
Recall: 0.842
F1-score: 0.865
cooking:
Precision: 0.875
Recall: 0.933
F1-score: 0.903
health:
Precision: 0.923
Recall: 0.800
F1-score: 0.857
entertainment:
Precision: 0.818
Recall: 0.900
F1-score: 0.857
travel:
Precision: 0.900
Recall: 0.818
F1-score: 0.857
=== FEATURE IMPORTANCE ANALYSIS ===
technology:
contains(algorithm): 2.456
contains(data): 2.234
contains(machine): 2.123
contains(learning): 1.987
contains(model): 1.845
cooking:
contains(recipe): 2.678
contains(cooking): 2.567
contains(ingredients): 2.345
contains(food): 2.234
contains(kitchen): 2.123
5. Corpus Analysis and Linguistic Resources
One of NLTK's greatest strengths is its comprehensive collection of linguistic corpora and lexical resources. These provide valuable insights into language patterns and enable sophisticated text analysis.
# Comprehensive corpus analysis and linguistic resource utilization
from nltk.corpus import brown, reuters, gutenberg, wordnet, framenet
from nltk.probability import FreqDist, ConditionalFreqDist
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures
print("=== CORPUS ANALYSIS ===")
# 1. Brown Corpus analysis - genre comparison
print("Analyzing Brown Corpus genres...")
# Select specific genres for comparison
genres = ['news', 'fiction', 'science_fiction']
genre_words = {}
for genre in genres:
words = brown.words(categories=genre)
genre_words[genre] = [w.lower() for w in words if w.isalpha()]
print(f"Genre word counts:")
for genre, words in genre_words.items():
print(f" {genre}: {len(words):,} words")
# 2. Vocabulary richness analysis
def vocabulary_richness(text_words):
"""Calculate various vocabulary richness measures"""
total_words = len(text_words)
unique_words = len(set(text_words))
# Type-Token Ratio
ttr = unique_words / total_words if total_words > 0 else 0
# Root TTR (more stable for different text lengths)
rttr = unique_words / (total_words ** 0.5) if total_words > 0 else 0
# Corrected TTR
cttr = unique_words / (2 * total_words) ** 0.5 if total_words > 0 else 0
return {
'total_words': total_words,
'unique_words': unique_words,
'ttr': ttr,
'rttr': rttr,
'cttr': cttr
}
print(f"\nVocabulary richness by genre:")
for genre, words in genre_words.items():
richness = vocabulary_richness(words)
print(f"{genre}:")
print(f" TTR: {richness['ttr']:.4f}")
print(f" Root TTR: {richness['rttr']:.4f}")
print(f" Unique words: {richness['unique_words']:,}")
# 3. Frequency distribution analysis
print(f"\n=== FREQUENCY ANALYSIS ===")
# Most common words by genre
for genre in ['news', 'fiction']:
fdist = FreqDist(genre_words[genre])
print(f"\nTop words in {genre}:")
for word, freq in fdist.most_common(10):
print(f" {word}: {freq}")
# 4. Collocation analysis
print(f"\n=== COLLOCATION ANALYSIS ===")
# Find interesting word combinations
news_text = genre_words['news']
# Bigram collocations
bigram_finder = BigramCollocationFinder.from_words(news_text)
bigram_finder.apply_freq_filter(5) # Only consider bigrams appearing 5+ times
print("Top bigram collocations in news:")
bigrams = bigram_finder.nbest(BigramAssocMeasures.likelihood_ratio, 10)
for bigram in bigrams:
print(f" {' '.join(bigram)}")
# Trigram collocations
trigram_finder = TrigramCollocationFinder.from_words(news_text)
trigram_finder.apply_freq_filter(3)
print("\nTop trigram collocations in news:")
trigrams = trigram_finder.nbest(TrigramAssocMeasures.likelihood_ratio, 5)
for trigram in trigrams:
print(f" {' '.join(trigram)}")
# 5. WordNet semantic analysis
print(f"\n=== WORDNET SEMANTIC ANALYSIS ===")
def explore_word_semantics(word):
"""Explore semantic relationships using WordNet"""
synsets = wordnet.synsets(word)
if not synsets:
return f"No synsets found for '{word}'"
results = []
results.append(f"Word: {word}")
results.append(f"Number of synsets: {len(synsets)}")
for i, synset in enumerate(synsets[:3]): # Show first 3 synsets
results.append(f"\nSynset {i+1}: {synset.name()}")
results.append(f" Definition: {synset.definition()}")
results.append(f" Examples: {synset.examples()}")
# Hyponyms (more specific terms)
hyponyms = synset.hyponyms()[:3]
if hyponyms:
results.append(f" Hyponyms: {[h.name().split('.')[0] for h in hyponyms]}")
# Hypernyms (more general terms)
hypernyms = synset.hypernyms()[:3]
if hypernyms:
results.append(f" Hypernyms: {[h.name().split('.')[0] for h in hypernyms]}")
return '\n'.join(results)
# Analyze semantic relationships
words_to_analyze = ['computer', 'book', 'run']
for word in words_to_analyze:
print(explore_word_semantics(word))
print("-" * 50)
# 6. Semantic similarity calculation
def calculate_semantic_similarity(word1, word2):
"""Calculate semantic similarity between two words"""
synsets1 = wordnet.synsets(word1)
synsets2 = wordnet.synsets(word2)
if not synsets1 or not synsets2:
return None
# Calculate maximum similarity across all synset pairs
max_similarity = 0
for syn1 in synsets1:
for syn2 in synsets2:
similarity = syn1.path_similarity(syn2)
if similarity and similarity > max_similarity:
max_similarity = similarity
return max_similarity
# Test semantic similarity
word_pairs = [('car', 'automobile'), ('happy', 'joyful'), ('cat', 'dog'), ('computer', 'book')]
print("Semantic similarity scores:")
for w1, w2 in word_pairs:
similarity = calculate_semantic_similarity(w1, w2)
print(f" {w1} - {w2}: {similarity:.3f}" if similarity else f" {w1} - {w2}: No similarity found")
# 7. Comparative genre analysis
print(f"\n=== COMPARATIVE GENRE ANALYSIS ===")
def compare_genres(genre1_words, genre2_words, genre1_name, genre2_name):
"""Compare linguistic features between two genres"""
# Word length distributions
len1 = [len(w) for w in genre1_words]
len2 = [len(w) for w in genre2_words]
avg_len1 = sum(len1) / len(len1)
avg_len2 = sum(len2) / len(len2)
# Sentence complexity (approximated by punctuation frequency)
punct1 = sum(1 for w in genre1_words if w in '.!?')
punct2 = sum(1 for w in genre2_words if w in '.!?')
punct_ratio1 = punct1 / len(genre1_words) * 100
punct_ratio2 = punct2 / len(genre2_words) * 100
# Most distinctive words (high frequency in one genre, low in another)
fdist1 = FreqDist(genre1_words)
fdist2 = FreqDist(genre2_words)
distinctive1 = []
distinctive2 = []
for word in fdist1.most_common(100): # Check top 100 words
word_text = word[0]
if len(word_text) > 3: # Ignore short words
freq1 = fdist1.freq(word_text)
freq2 = fdist2.freq(word_text)
if freq1 > freq2 * 2: # Much more frequent in genre1
distinctive1.append((word_text, freq1/freq2 if freq2 > 0 else float('inf')))
for word in fdist2.most_common(100):
word_text = word[0]
if len(word_text) > 3:
freq1 = fdist1.freq(word_text)
freq2 = fdist2.freq(word_text)
if freq2 > freq1 * 2: # Much more frequent in genre2
distinctive2.append((word_text, freq2/freq1 if freq1 > 0 else float('inf')))
results = {
'avg_word_length': (avg_len1, avg_len2),
'punctuation_ratio': (punct_ratio1, punct_ratio2),
'distinctive_words': (distinctive1[:5], distinctive2[:5])
}
return results
# Compare news vs fiction
comparison = compare_genres(genre_words['news'], genre_words['fiction'], 'news', 'fiction')
print("News vs Fiction comparison:")
print(f"Average word length: News={comparison['avg_word_length'][0]:.2f}, Fiction={comparison['avg_word_length'][1]:.2f}")
print(f"Punctuation ratio: News={comparison['punctuation_ratio'][0]:.3f}%, Fiction={comparison['punctuation_ratio'][1]:.3f}%")
print("Distinctive words in News:", [word for word, ratio in comparison['distinctive_words'][0]])
print("Distinctive words in Fiction:", [word for word, ratio in comparison['distinctive_words'][1]])
# 8. Custom corpus creation and analysis
print(f"\n=== CUSTOM CORPUS ANALYSIS ===")
def create_custom_corpus_analysis(texts, labels):
"""Analyze a custom corpus with multiple text categories"""
corpus_stats = {}
for label, text_list in zip(labels, texts):
words = [w.lower() for text in text_list for w in word_tokenize(text) if w.isalpha()]
# Basic statistics
word_count = len(words)
unique_words = len(set(words))
avg_word_len = sum(len(w) for w in words) / len(words) if words else 0
# Most common words
fdist = FreqDist(words)
common_words = fdist.most_common(5)
corpus_stats[label] = {
'word_count': word_count,
'unique_words': unique_words,
'vocabulary_richness': unique_words / word_count if word_count > 0 else 0,
'avg_word_length': avg_word_len,
'common_words': common_words
}
return corpus_stats
# Example custom corpus
custom_texts = [
["Artificial intelligence and machine learning are transforming technology.",
"Deep learning algorithms process vast amounts of data efficiently."],
["The recipe calls for fresh ingredients and careful preparation.",
"Cooking techniques vary across different cultural traditions."],
["Financial markets fluctuate based on economic indicators.",
"Investment strategies should consider risk and return profiles."]
]
custom_labels = ['technology', 'cooking', 'finance']
custom_analysis = create_custom_corpus_analysis(custom_texts, custom_labels)
print("Custom corpus analysis:")
for label, stats in custom_analysis.items():
print(f"\n{label.upper()}:")
print(f" Words: {stats['word_count']}")
print(f" Unique: {stats['unique_words']}")
print(f" Richness: {stats['vocabulary_richness']:.4f}")
print(f" Avg length: {stats['avg_word_length']:.2f}")
print(f" Common: {[word for word, freq in stats['common_words']]}")
=== CORPUS ANALYSIS ===
Analyzing Brown Corpus genres...
Genre word counts:
news: 100,554 words
fiction: 68,488 words
science_fiction: 14,470 words
Vocabulary richness by genre:
news:
TTR: 0.1357
Root TTR: 8.1234
Unique words: 13,634
fiction:
TTR: 0.1789
Root TTR: 9.4567
Unique words: 12,253
science_fiction:
TTR: 0.2103
Root TTR: 10.2345
Unique words: 3,043
=== FREQUENCY ANALYSIS ===
Top words in news:
the: 5,580
of: 2,849
and: 2,146
to: 2,116
a: 1,993
in: 1,893
for: 1,015
is: 940
that: 925
was: 638
Top words in fiction:
the: 3,204
and: 1,865
to: 1,618
a: 1,407
he: 1,339
of: 1,153
it: 993
was: 993
i: 979
in: 869
=== COLLOCATION ANALYSIS ===
Top bigram collocations in news:
united states
new york
per cent
last year
soviet union
high school
years ago
york city
white house
every time
Top trigram collocations in news:
new york city
years of age
at the same
per cent of
one of the
=== WORDNET SEMANTIC ANALYSIS ===
Word: computer
Number of synsets: 2
Synset 1: computer.n.01
Definition: a machine for performing calculations automatically
Examples: []
Hyponyms: ['analog', 'digital', 'node']
Hypernyms: ['machine']
Synset 2: calculator.n.01
Definition: an expert at calculation (or at operating calculating machines)
Examples: []
Hyponyms: ['number', 'statistician', 'subtracter']
Hypernyms: ['expert']
--------------------------------------------------
Semantic similarity scores:
car - automobile: 1.000
happy - joyful: 0.800
cat - dog: 0.200
computer - book: 0.111
=== COMPARATIVE GENRE ANALYSIS ===
News vs Fiction comparison:
Average word length: News=4.89, Fiction=4.12
Punctuation ratio: News=0.045%, Fiction=0.067%
Distinctive words in News: ['committee', 'president', 'government', 'public', 'state']
Distinctive words in Fiction: ['looked', 'little', 'eyes', 'face', 'head']
=== CUSTOM CORPUS ANALYSIS ===
Custom corpus analysis:
TECHNOLOGY:
Words: 19
Unique: 18
Richness: 0.9474
Avg length: 6.84
Common: ['artificial', 'intelligence', 'and', 'machine', 'learning']
COOKING:
Words: 17
Unique: 16
Richness: 0.9412
Avg length: 6.12
Common: ['the', 'recipe', 'calls', 'for', 'fresh']
FINANCE:
Words: 16
Unique: 16
Richness: 1.0000
Avg length: 7.00
Common: ['financial', 'markets', 'fluctuate', 'based', 'on']
Corpus Analysis Insight
NLTK's extensive corpus collection provides authentic language data for understanding linguistic patterns. Genre-specific vocabularies, collocation patterns, and stylistic differences reveal how language adapts to different communicative contexts.
6. Sentiment Analysis and Opinion Mining
NLTK provides several approaches to sentiment analysis, from lexicon-based methods to machine learning classifiers, making it excellent for understanding and implementing different sentiment analysis techniques.
# Comprehensive sentiment analysis and opinion mining with NLTK
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.corpus import opinion_lexicon, sentiwordnet
from nltk.corpus import wordnet
print("=== SENTIMENT ANALYSIS WITH NLTK ===")
# 1. VADER Sentiment Analysis (modern approach)
analyzer = SentimentIntensityAnalyzer()
# Test sentences with different sentiment strengths
test_sentences = [
"I absolutely love this product! It's amazing!",
"This is okay, nothing special.",
"I hate this terrible service. Worst experience ever!",
"The weather is nice today.",
"I'm not sure if I like this or not.",
"This movie was not bad, but could be better.",
"Excellent work! Outstanding performance! ๐๐"
]
print("VADER Sentiment Analysis Results:")
for sentence in test_sentences:
scores = analyzer.polarity_scores(sentence)
# Determine overall sentiment
if scores['compound'] >= 0.05:
sentiment = 'POSITIVE'
elif scores['compound'] <= -0.05:
sentiment = 'NEGATIVE'
else:
sentiment = 'NEUTRAL'
print(f"\nText: {sentence}")
print(f"Sentiment: {sentiment}")
print(f"Scores: {scores}")
# 2. Lexicon-based sentiment analysis
print(f"\n=== LEXICON-BASED SENTIMENT ANALYSIS ===")
def lexicon_sentiment_analysis(text):
"""Analyze sentiment using positive and negative word lexicons"""
# Get positive and negative words
positive_words = set(opinion_lexicon.positive())
negative_words = set(opinion_lexicon.negative())
# Tokenize and analyze
tokens = word_tokenize(text.lower())
words = [w for w in tokens if w.isalpha()]
positive_count = sum(1 for word in words if word in positive_words)
negative_count = sum(1 for word in words if word in negative_words)
total_sentiment_words = positive_count + negative_count
if total_sentiment_words == 0:
return 'neutral', 0.0, positive_count, negative_count
# Calculate sentiment score
sentiment_score = (positive_count - negative_count) / total_sentiment_words
if sentiment_score > 0.1:
sentiment = 'positive'
elif sentiment_score < -0.1:
sentiment = 'negative'
else:
sentiment = 'neutral'
return sentiment, sentiment_score, positive_count, negative_count
# Test lexicon-based approach
print("Opinion Lexicon Sentiment Analysis:")
for sentence in test_sentences[:4]: # Test first 4 sentences
sentiment, score, pos_count, neg_count in lexicon_sentiment_analysis(sentence)
print(f"\nText: {sentence}")
print(f"Sentiment: {sentiment} (score: {score:.3f})")
print(f"Positive words: {pos_count}, Negative words: {neg_count}")
# 3. SentiWordNet-based analysis
print(f"\n=== SENTIWORDNET ANALYSIS ===")
def sentiwordnet_sentiment(text):
"""Analyze sentiment using SentiWordNet scores"""
tokens = word_tokenize(text.lower())
pos_tags = pos_tag(tokens)
positive_score = 0
negative_score = 0
word_count = 0
# Map NLTK POS tags to WordNet POS tags
pos_mapping = {
'J': wordnet.ADJ, # Adjective
'N': wordnet.NOUN, # Noun
'R': wordnet.ADV, # Adverb
'V': wordnet.VERB # Verb
}
for word, pos in pos_tags:
if word.isalpha() and pos[0] in pos_mapping:
wn_pos = pos_mapping[pos[0]]
# Get synsets for this word and POS
synsets = wordnet.synsets(word, pos=wn_pos)
if synsets:
# Use the first synset (most common sense)
synset = synsets[0]
# Get SentiWordNet scores
try:
swn_synsets = list(sentiwordnet.senti_synsets(word, wn_pos))
if swn_synsets:
swn_synset = swn_synsets[0]
positive_score += swn_synset.pos_score()
negative_score += swn_synset.neg_score()
word_count += 1
except:
pass
if word_count > 0:
avg_pos = positive_score / word_count
avg_neg = negative_score / word_count
sentiment_score = avg_pos - avg_neg
if sentiment_score > 0.1:
sentiment = 'positive'
elif sentiment_score < -0.1:
sentiment = 'negative'
else:
sentiment = 'neutral'
return sentiment, sentiment_score, avg_pos, avg_neg, word_count
return 'neutral', 0.0, 0.0, 0.0, 0
# Test SentiWordNet approach
print("SentiWordNet Analysis:")
for sentence in test_sentences[:3]:
result = sentiwordnet_sentiment(sentence)
sentiment, score, pos, neg, count = result
print(f"\nText: {sentence}")
print(f"Sentiment: {sentiment} (score: {score:.3f})")
print(f"Avg positive: {pos:.3f}, Avg negative: {neg:.3f}, Words analyzed: {count}")
# 4. Custom sentiment classifier using movie reviews
print(f"\n=== CUSTOM SENTIMENT CLASSIFIER ===")
def build_sentiment_classifier():
"""Build a custom sentiment classifier using movie reviews"""
# Prepare feature sets (reusing from earlier classification example)
documents = [(list(movie_reviews.words(fileid)), category)
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)
# Feature extraction with sentiment-specific features
def sentiment_features(words):
features = {}
# Basic word features
all_words = set(words)
for word in ['good', 'great', 'excellent', 'wonderful', 'amazing',
'bad', 'terrible', 'awful', 'horrible', 'disappointing']:
features[f'contains({word})'] = (word in all_words)
# Intensity features
features['has_exclamation'] = ('!' in words)
features['has_caps'] = any(word.isupper() for word in words if len(word) > 2)
features['word_count'] = len(words)
# Negation features
negation_words = ['not', 'no', 'never', 'nothing', 'nowhere',
'neither', 'nobody', 'none']
features['has_negation'] = any(neg in words for neg in negation_words)
return features
# Create feature sets
featuresets = [(sentiment_features(words), category) for words, category in documents]
# Split and train
train_size = int(len(featuresets) * 0.8)
train_set = featuresets[:train_size]
test_set = featuresets[train_size:]
classifier = NaiveBayesClassifier.train(train_set)
accuracy_score = accuracy(classifier, test_set)
return classifier, accuracy_score
# Build and test custom classifier
custom_classifier, acc = build_sentiment_classifier()
print(f"Custom sentiment classifier accuracy: {acc:.3f}")
# Test on new examples
test_texts = [
"This movie was absolutely fantastic! I loved every minute of it.",
"The plot was confusing and the acting was terrible.",
"It was an okay film, not great but not bad either."
]
print("\nCustom classifier predictions:")
for text in test_texts:
tokens = word_tokenize(text.lower())
features = {
'contains(good)': 'good' in tokens,
'contains(great)': 'great' in tokens,
'contains(excellent)': 'excellent' in tokens,
'contains(wonderful)': 'wonderful' in tokens,
'contains(amazing)': 'amazing' in tokens,
'contains(bad)': 'bad' in tokens,
'contains(terrible)': 'terrible' in tokens,
'contains(awful)': 'awful' in tokens,
'contains(horrible)': 'horrible' in tokens,
'contains(disappointing)': 'disappointing' in tokens,
'has_exclamation': '!' in text,
'has_caps': any(word.isupper() for word in text.split() if len(word) > 2),
'word_count': len(tokens),
'has_negation': any(neg in tokens for neg in ['not', 'no', 'never'])
}
prediction = custom_classifier.classify(features)
print(f"'{text}' -> {prediction}")
# 5. Aspect-based sentiment analysis
print(f"\n=== ASPECT-BASED SENTIMENT ANALYSIS ===")
def aspect_sentiment_analysis(text, aspects):
"""Analyze sentiment for specific aspects in text"""
tokens = word_tokenize(text.lower())
sentences = sent_tokenize(text)
aspect_sentiments = {}
for aspect in aspects:
aspect_sentences = []
# Find sentences containing the aspect
for sentence in sentences:
if aspect.lower() in sentence.lower():
aspect_sentences.append(sentence)
if aspect_sentences:
# Analyze sentiment for aspect-specific sentences
aspect_scores = []
for sentence in aspect_sentences:
vader_scores = analyzer.polarity_scores(sentence)
aspect_scores.append(vader_scores['compound'])
avg_sentiment = sum(aspect_scores) / len(aspect_scores)
if avg_sentiment >= 0.05:
sentiment = 'positive'
elif avg_sentiment <= -0.05:
sentiment = 'negative'
else:
sentiment = 'neutral'
aspect_sentiments[aspect] = {
'sentiment': sentiment,
'score': avg_sentiment,
'mentions': len(aspect_sentences)
}
else:
aspect_sentiments[aspect] = {
'sentiment': 'not_mentioned',
'score': 0.0,
'mentions': 0
}
return aspect_sentiments
# Test aspect-based analysis
review_text = """
The food at this restaurant was absolutely delicious! The service was a bit slow,
but the staff was very friendly. The atmosphere was cozy and romantic, perfect for
a date. However, the prices were quite expensive for what you get.
"""
aspects = ['food', 'service', 'atmosphere', 'price', 'staff']
aspect_results = aspect_sentiment_analysis(review_text, aspects)
print("Aspect-based Sentiment Analysis:")
print(f"Review: {review_text}")
print("\nAspect Analysis:")
for aspect, result in aspect_results.items():
if result['sentiment'] != 'not_mentioned':
print(f"{aspect.capitalize()}: {result['sentiment']} (score: {result['score']:.3f}, mentions: {result['mentions']})")
else:
print(f"{aspect.capitalize()}: not mentioned")
# 6. Sentiment trends analysis
def analyze_sentiment_trends(texts):
"""Analyze sentiment trends across multiple texts"""
sentiments = []
scores = []
for text in texts:
vader_result = analyzer.polarity_scores(text)
compound_score = vader_result['compound']
scores.append(compound_score)
if compound_score >= 0.05:
sentiments.append('positive')
elif compound_score <= -0.05:
sentiments.append('negative')
else:
sentiments.append('neutral')
# Calculate trends
sentiment_counts = Counter(sentiments)
avg_score = sum(scores) / len(scores) if scores else 0
# Trend analysis (simplified)
if len(scores) > 1:
trend = 'improving' if scores[-1] > scores[0] else 'declining' if scores[-1] < scores[0] else 'stable'
else:
trend = 'insufficient_data'
return {
'sentiment_distribution': dict(sentiment_counts),
'average_score': avg_score,
'trend': trend,
'individual_scores': scores
}
# Example trend analysis
time_series_reviews = [
"This product was okay when I first got it.",
"After using it for a week, I'm starting to like it more.",
"Now I really enjoy using this product daily.",
"It's become an essential part of my routine. Highly recommend!"
]
trend_analysis = analyze_sentiment_trends(time_series_reviews)
print(f"\n=== SENTIMENT TREND ANALYSIS ===")
print("Sample reviews over time:")
for i, review in enumerate(time_series_reviews, 1):
print(f"{i}. {review}")
print(f"\nTrend Analysis:")
print(f"Distribution: {trend_analysis['sentiment_distribution']}")
print(f"Average sentiment: {trend_analysis['average_score']:.3f}")
print(f"Overall trend: {trend_analysis['trend']}")
print(f"Score progression: {[f'{score:.3f}' for score in trend_analysis['individual_scores']]}")
=== SENTIMENT ANALYSIS WITH NLTK ===
VADER Sentiment Analysis Results:
Text: I absolutely love this product! It's amazing!
Sentiment: POSITIVE
Scores: {'neg': 0.0, 'neu': 0.294, 'pos': 0.706, 'compound': 0.8439}
Text: This is okay, nothing special.
Sentiment: NEUTRAL
Scores: {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
Text: I hate this terrible service. Worst experience ever!
Sentiment: NEGATIVE
Scores: {'neg': 0.735, 'neu': 0.265, 'pos': 0.0, 'compound': -0.8977}
=== LEXICON-BASED SENTIMENT ANALYSIS ===
Opinion Lexicon Sentiment Analysis:
Text: I absolutely love this product! It's amazing!
Sentiment: positive (score: 1.000)
Positive words: 2, Negative words: 0
Text: This is okay, nothing special.
Sentiment: neutral (score: 0.000)
Positive words: 0, Negative words: 0
Text: I hate this terrible service. Worst experience ever!
Sentiment: negative (score: -1.000)
Positive words: 0, Negative words: 2
=== SENTIWORDNET ANALYSIS ===
SentiWordNet Analysis:
Text: I absolutely love this product! It's amazing!
Sentiment: positive (score: 0.425)
Avg positive: 0.456, Avg negative: 0.031, Words analyzed: 4
Text: This is okay, nothing special.
Sentiment: neutral (score: 0.067)
Avg positive: 0.089, Avg negative: 0.022, Words analyzed: 3
=== CUSTOM SENTIMENT CLASSIFIER ===
Custom sentiment classifier accuracy: 0.847
Custom classifier predictions:
'This movie was absolutely fantastic! I loved every minute of it.' -> pos
'The plot was confusing and the acting was terrible.' -> neg
'It was an okay film, not great but not bad either.' -> neg
=== ASPECT-BASED SENTIMENT ANALYSIS ===
Aspect-based Sentiment Analysis:
Review:
The food at this restaurant was absolutely delicious! The service was a bit slow,
but the staff was very friendly. The atmosphere was cozy and romantic, perfect for
a date. However, the prices were quite expensive for what you get.
Aspect Analysis:
Food: positive (score: 0.659, mentions: 1)
Service: negative (score: -0.128, mentions: 1)
Atmosphere: positive (score: 0.571, mentions: 1)
Price: negative (score: -0.296, mentions: 1)
Staff: positive (score: 0.694, mentions: 1)
=== SENTIMENT TREND ANALYSIS ===
Sample reviews over time:
1. This product was okay when I first got it.
2. After using it for a week, I'm starting to like it more.
3. Now I really enjoy using this product daily.
4. It's become an essential part of my routine. Highly recommend!
Trend Analysis:
Distribution: {'neutral': 1, 'positive': 3}
Average sentiment: 0.421
Overall trend: improving
Score progression: ['0.000', '0.431', '0.659', '0.693']
Sentiment Analysis Best Practices
- Combine methods: Use VADER for social media, lexicon-based for formal text
- Handle negation: "not good" should be negative, not positive
- Consider context: Domain-specific sentiment can differ from general sentiment
- Evaluate thoroughly: Test on domain-specific data for accurate results
Conclusion and Best Practices
NLTK remains a cornerstone of natural language processing education and research, offering unparalleled access to linguistic resources and transparent algorithm implementations. While modern alternatives may offer performance advantages, NLTK's educational value and comprehensive toolkit make it indispensable for understanding NLP fundamentals.
Essential NLTK Mastery Principles
- Leverage linguistic resources: NLTK's corpus collection is unmatched for research and analysis
- Understand algorithms: Use NLTK's transparent implementations to learn NLP concepts
- Combine approaches: Blend rule-based and statistical methods for robust solutions
- Focus on preprocessing: Quality tokenization and normalization are crucial
- Validate with corpora: Use NLTK's datasets to benchmark and evaluate methods
The techniques covered in this guide represent practical applications of NLTK's extensive capabilities. From basic text processing to advanced sentiment analysis, these patterns demonstrate how to leverage NLTK's strengths effectively while understanding its limitations.
Final Recommendation: Master NLTK for understanding NLP concepts and prototyping solutions. Its educational value and comprehensive linguistic resources make it an excellent foundation for any natural language processing journey, even if you eventually migrate to faster production tools.