My Practical Insights on Using the NLTK Library

Published on August 14, 2024 | 14 min read

A comprehensive exploration of NLTK's linguistic resources, classical NLP techniques, and educational tools for text analysis and language processing research

The Natural Language Toolkit (NLTK) stands as the grandfather of Python NLP libraries, providing comprehensive linguistic resources and educational tools that have shaped the field for decades. While modern alternatives like spaCy offer speed advantages, NLTK remains invaluable for research, education, and prototyping due to its extensive corpora, linguistic algorithms, and transparent implementations.

This guide shares practical insights from years of using NLTK in academic research, teaching environments, and prototype development. These techniques highlight NLTK's unique strengths and show how to leverage its rich ecosystem effectively.

1. Understanding NLTK's Architecture and Core Philosophy

NLTK is designed as a comprehensive teaching and research platform rather than a production-focused library. Its modular architecture allows deep exploration of NLP concepts while providing extensive linguistic datasets and traditional algorithms.

Core NLTK Modules

  • nltk.tokenize: Advanced tokenization methods for different text types
  • nltk.corpus: Access to 50+ linguistic corpora and lexical resources
  • nltk.stem: Stemming and lemmatization algorithms
  • nltk.tag: Part-of-speech tagging and sequence labeling
  • nltk.parse: Syntactic parsing and grammar processing
  • nltk.classify: Machine learning classification algorithms
  • nltk.metrics: Evaluation metrics and statistical measures

NLTK Ecosystem Overview and Basic Usage
import nltk
import numpy as np
from collections import Counter

# Download essential resources (run once); nltk.download() skips packages
# that are already installed and up to date
required_resources = [
    'punkt', 'stopwords', 'wordnet', 'averaged_perceptron_tagger',
    'vader_lexicon', 'movie_reviews', 'names',
    # corpora and models used in later sections of this guide
    'brown', 'maxent_ne_chunker', 'words', 'opinion_lexicon', 'sentiwordnet'
]
for resource in required_resources:
    nltk.download(resource, quiet=True)

print(f"NLTK Version: {nltk.__version__}")
print("Available corpora:", len(nltk.corpus.__all__))

# Basic text processing pipeline
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

sample_text = """
Natural Language Processing with NLTK is educational and comprehensive.
It provides extensive resources for learning linguistic concepts.
However, modern applications often require faster alternatives.
"""

print("=== NLTK PROCESSING PIPELINE ===")

# Sentence tokenization
sentences = sent_tokenize(sample_text)
print(f"Sentences found: {len(sentences)}")

# Word tokenization
words = word_tokenize(sample_text)
print(f"Tokens extracted: {len(words)}")

# Stop word removal
stop_words = set(stopwords.words('english'))
filtered_words = [w.lower() for w in words if w.isalpha() and w.lower() not in stop_words]
print(f"Content words: {len(filtered_words)}")

# Stemming vs Lemmatization comparison
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print("\n=== STEMMING vs LEMMATIZATION ===")
test_words = ['running', 'better', 'flies', 'studies', 'crying']
for word in test_words:
    stem = stemmer.stem(word)
    lemma = lemmatizer.lemmatize(word, pos='v')  # verb form
    print(f"{word:10} -> Stem: {stem:8} | Lemma: {lemma}")
Expected Output:
NLTK Version: 3.8.1
Available corpora: 85

=== NLTK PROCESSING PIPELINE ===
Sentences found: 3
Tokens extracted: 20
Content words: 12

=== STEMMING vs LEMMATIZATION ===
running    -> Stem: run      | Lemma: run
better     -> Stem: better   | Lemma: better
flies      -> Stem: fli      | Lemma: fly
studies    -> Stem: studi    | Lemma: study
crying     -> Stem: cry      | Lemma: cry

NLTK's Educational Philosophy

Unlike production-focused libraries, NLTK prioritizes transparency and educational value. Its implementations are often verbose and well-commented, making it ideal for understanding NLP algorithms from first principles.
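
One way to take advantage of that transparency is simply to read the source. The snippet below is a minimal sketch of this habit: it prints NLTK's own implementation of PorterStemmer.stem via the standard library's inspect module, and looks up a Penn Treebank tag description with nltk.help.upenn_tagset, which assumes the 'tagsets' resource is available locally.

import inspect
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.help import upenn_tagset

# Read NLTK's own, well-commented implementation of the Porter stemmer entry point
print(inspect.getsource(PorterStemmer.stem))

# Look up what a Penn Treebank tag means (assumes the 'tagsets' resource is installed)
nltk.download('tagsets', quiet=True)
upenn_tagset('JJ')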

2. Advanced Tokenization and Text Preprocessing

NLTK offers sophisticated tokenization methods beyond simple splitting, including handling of contractions, punctuation, and domain-specific text patterns.

Advanced Tokenization Techniques
# Advanced tokenization methods for different text types
from nltk.tokenize import (
    WordPunctTokenizer, RegexpTokenizer, BlanklineTokenizer,
    LineTokenizer, TweetTokenizer, casual_tokenize
)

# Different text samples requiring specialized tokenization
texts = {
    'contractions': "I'm sure you've seen this before, but it's worth repeating.",
    'social_media': "@user Hope you're having a great day! 😊 #NLP #Python http://bit.ly/example",
    'technical': "Use regex pattern [A-Za-z]+ to match words. Set threshold=0.95 for optimal results.",
    'multilingual': "Hello world! Bonjour monde! ¡Hola mundo! مرحبا بالعالم",
}

print("=== SPECIALIZED TOKENIZATION ===")

# 1. Handle contractions and punctuation
punct_tokenizer = WordPunctTokenizer()
for text_type, text in texts.items():
    if text_type == 'contractions':
        tokens = punct_tokenizer.tokenize(text)
        print(f"Contractions handled: {tokens}")

# 2. Social media text processing
tweet_tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)
social_tokens = tweet_tokenizer.tokenize(texts['social_media'])
print(f"Social media tokens: {social_tokens}")

# 3. Custom regex tokenization for technical text
regex_tokenizer = RegexpTokenizer(r'\w+|[^\w\s]')
tech_tokens = regex_tokenizer.tokenize(texts['technical'])
print(f"Technical text tokens: {tech_tokens}")

# 4. Advanced preprocessing pipeline
def advanced_preprocess(text, preserve_case=False, remove_punct=True, custom_stopwords=None):
    """Advanced preprocessing with customizable options"""
    # Tokenize
    tokens = word_tokenize(text.lower() if not preserve_case else text)

    # Remove punctuation if requested
    if remove_punct:
        tokens = [t for t in tokens if t.isalnum()]

    # Standard stopwords
    stop_words = set(stopwords.words('english'))

    # Add custom stopwords
    if custom_stopwords:
        stop_words.update(custom_stopwords)

    # Filter tokens
    filtered_tokens = [t for t in tokens if t not in stop_words and len(t) > 2]

    return filtered_tokens

# Test advanced preprocessing
sample_text = "The quick brown fox jumps over the lazy dog. This is a test sentence!"
processed = advanced_preprocess(sample_text, custom_stopwords=['test', 'sentence'])
print(f"Advanced preprocessing result: {processed}")

# 5. N-gram generation for feature extraction
from nltk.util import ngrams

def generate_ngrams(text, n=2):
    """Generate n-grams from text"""
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalpha()]

    ngram_list = []
    for i in range(1, n + 1):
        ngrams_i = list(ngrams(tokens, i))
        ngram_list.extend([' '.join(gram) for gram in ngrams_i])

    return ngram_list

text_sample = "Natural language processing is fascinating"
bigrams = generate_ngrams(text_sample, n=2)
print(f"Generated n-grams: {bigrams[:8]}...")  # Show first 8

# 6. Sentence boundary detection with custom rules
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Train custom sentence tokenizer
sample_corpus = """
Dr. Smith went to U.S.A. He met Prof. Johnson. The meeting was at 3 p.m.
They discussed AI research.
"""

trainer = nltk.tokenize.punkt.PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True
trainer.train(sample_corpus)

custom_tokenizer = PunktSentenceTokenizer(trainer.get_params())
sentences = custom_tokenizer.tokenize(sample_corpus)
print(f"Custom sentence tokenization: {len(sentences)} sentences")
Expected Output:
=== SPECIALIZED TOKENIZATION ===
Contractions handled: ['I', "'m", 'sure', 'you', "'ve", 'seen', 'this', 'before', ',', 'but', 'it', "'s", 'worth', 'repeating', '.']
Social media tokens: ['hope', "you're", 'having', 'a', 'great', 'day', '😊', '#nlp', '#python', 'http://bit.ly/example']
Technical text tokens: ['Use', 'regex', 'pattern', '[', 'A', '-', 'Za', '-', 'z', ']', '+', 'to', 'match', 'words', '.', 'Set', 'threshold', '=', '0', '.', '95', 'for', 'optimal', 'results', '.']
Advanced preprocessing result: ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']
Generated n-grams: ['natural', 'language', 'processing', 'fascinating', 'natural language', 'language processing', 'processing fascinating', 'natural language processing']...
Custom sentence tokenization: 4 sentences

Tokenization Best Practice

Always choose tokenizers based on your text domain. Use TweetTokenizer for social media, RegexpTokenizer for structured text, and custom PunktSentenceTokenizer for domain-specific sentence boundary detection.
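
A small dispatch helper can make that choice explicit in a codebase. The following is a sketch rather than NLTK API: pick_tokenizer is a hypothetical function that routes a domain label to one of the tokenizers shown above.

from nltk.tokenize import TweetTokenizer, RegexpTokenizer, word_tokenize

def pick_tokenizer(domain):
    """Hypothetical helper: return a tokenize callable suited to the text domain."""
    if domain == 'social':
        # Keeps hashtags and emoji intact, drops @handles
        return TweetTokenizer(preserve_case=False, strip_handles=True).tokenize
    if domain == 'structured':
        # Splits word characters and individual symbols
        return RegexpTokenizer(r'\w+|[^\w\s]').tokenize
    return word_tokenize  # reasonable default for edited prose

tokenize = pick_tokenizer('social')
print(tokenize("@user NLTK still holds up! #NLP"))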

3. Part-of-Speech Tagging and Syntactic Analysis

NLTK provides multiple POS taggers and parsing methods, from simple statistical taggers to more sophisticated syntactic analyzers that reveal grammatical structure.

Advanced POS Tagging and Syntactic Analysis
# Advanced part-of-speech tagging and syntactic analysis
from nltk import pos_tag, ne_chunk
from nltk.chunk import RegexpParser
from nltk.tag import UnigramTagger, BigramTagger, TrigramTagger
from nltk.corpus import brown

print("=== PART-OF-SPEECH TAGGING ===")

# Sample sentences for analysis
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "She is reading a book about natural language processing.",
    "John works at Google in California and loves machine learning."
]

# 1. Basic POS tagging with detailed analysis
for i, sentence in enumerate(sentences, 1):
    tokens = word_tokenize(sentence)
    pos_tags = pos_tag(tokens)

    print(f"\nSentence {i}: {sentence}")
    print("POS Tags:", pos_tags)

    # Count POS categories
    pos_counts = Counter([tag for word, tag in pos_tags])
    print("POS distribution:", dict(pos_counts.most_common(3)))

# 2. Custom chunking for phrase extraction
def extract_noun_phrases(sentence):
    """Extract noun phrases using regex-based chunking"""
    tokens = word_tokenize(sentence)
    pos_tags = pos_tag(tokens)

    # Define grammar for noun phrases
    grammar = r"""
      NP: {<DT>?<JJ>*<NN.*>+}   # Noun phrase
      PP: {<IN><NP>}            # Prepositional phrase
      VP: {<VB.*><NP|PP>*}      # Verb phrase
    """

    cp = RegexpParser(grammar)
    tree = cp.parse(pos_tags)

    noun_phrases = []
    for subtree in tree:
        if type(subtree) == nltk.Tree and subtree.label() == 'NP':
            np_words = [word for word, pos in subtree.leaves()]
            noun_phrases.append(' '.join(np_words))

    return noun_phrases

# Extract noun phrases from sample text
complex_text = "The innovative machine learning algorithm processes large datasets efficiently."
noun_phrases = extract_noun_phrases(complex_text)
print(f"\nExtracted noun phrases from: '{complex_text}'")
print("Noun phrases:", noun_phrases)

# 3. Named Entity Recognition and analysis
print(f"\n=== NAMED ENTITY RECOGNITION ===")

entity_text = "Barack Obama was born in Hawaii and later became President of the United States."
tokens = word_tokenize(entity_text)
pos_tags = pos_tag(tokens)
named_entities = ne_chunk(pos_tags)

# Extract entities with their types
entities = []
for chunk in named_entities:
    if hasattr(chunk, 'label'):
        entity_name = ' '.join([token for token, pos in chunk.leaves()])
        entity_type = chunk.label()
        entities.append((entity_name, entity_type))

print(f"Text: {entity_text}")
print("Named entities found:")
for name, entity_type in entities:
    print(f"  - {name}: {entity_type}")

# 4. Building custom taggers with training data
print(f"\n=== CUSTOM TAGGER TRAINING ===")

# Use Brown corpus for training
brown_tagged_sents = brown.tagged_sents(categories='news')[:1000]
brown_sents = brown.sents(categories='news')[:1000]

# Train progressive taggers
unigram_tagger = UnigramTagger(brown_tagged_sents)
bigram_tagger = BigramTagger(brown_tagged_sents, backoff=unigram_tagger)
trigram_tagger = TrigramTagger(brown_tagged_sents, backoff=bigram_tagger)

# Test on sample sentence
test_sentence = word_tokenize("The researchers developed innovative algorithms for text analysis.")
default_tags = pos_tag(test_sentence)
custom_tags = trigram_tagger.tag(test_sentence)

print("Default tagger:", default_tags)
print("Custom tagger:", custom_tags)

# 5. Grammatical pattern matching
def find_grammatical_patterns(text, pattern):
    """Find specific grammatical patterns in text"""
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)

    # Define grammar
    cp = RegexpParser(pattern)
    tree = cp.parse(pos_tags)

    patterns = []
    for subtree in tree:
        if type(subtree) == nltk.Tree:
            pattern_words = [word for word, pos in subtree.leaves()]
            patterns.append(' '.join(pattern_words))

    return patterns

# Find adjective-noun combinations
adj_noun_pattern = "ADJNOUN: {<JJ>+<NN.*>+}"
sample_text = "The intelligent system processes complex data using advanced algorithms."
adj_noun_pairs = find_grammatical_patterns(sample_text, adj_noun_pattern)
print(f"\nAdjective-noun pairs found: {adj_noun_pairs}")

# 6. Dependency analysis (simplified)
def analyze_sentence_structure(sentence):
    """Analyze basic sentence structure"""
    tokens = word_tokenize(sentence)
    pos_tags = pos_tag(tokens)

    structure = {
        'subjects': [],
        'verbs': [],
        'objects': [],
        'modifiers': []
    }

    for word, pos in pos_tags:
        if pos.startswith('NN'):  # Nouns (potential subjects/objects)
            if pos_tags.index((word, pos)) < len(pos_tags) // 2:
                structure['subjects'].append(word)
            else:
                structure['objects'].append(word)
        elif pos.startswith('VB'):  # Verbs
            structure['verbs'].append(word)
        elif pos.startswith('JJ') or pos.startswith('RB'):  # Adjectives/Adverbs
            structure['modifiers'].append(word)

    return structure

# Analyze sentence structure
test_sentence = "The advanced algorithm quickly processes large datasets."
structure = analyze_sentence_structure(test_sentence)
print(f"\nSentence: {test_sentence}")
print("Structure analysis:")
for role, words in structure.items():
    if words:
        print(f"  {role.capitalize()}: {', '.join(words)}")
Expected Output:
=== PART-OF-SPEECH TAGGING ===

Sentence 1: The quick brown fox jumps over the lazy dog.
POS Tags: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
POS distribution: {'JJ': 3, 'DT': 2, 'NN': 2}

Sentence 2: She is reading a book about natural language processing.
POS Tags: [('She', 'PRP'), ('is', 'VBZ'), ('reading', 'VBG'), ('a', 'DT'), ('book', 'NN'), ('about', 'IN'), ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('.', '.')]
POS distribution: {'NN': 3, 'VBZ': 1, 'VBG': 1}

Sentence 3: John works at Google in California and loves machine learning.
POS Tags: [('John', 'NNP'), ('works', 'VBZ'), ('at', 'IN'), ('Google', 'NNP'), ('in', 'IN'), ('California', 'NNP'), ('and', 'CC'), ('loves', 'VBZ'), ('machine', 'NN'), ('learning', 'VBG'), ('.', '.')]
POS distribution: {'NNP': 3, 'VBZ': 2, 'IN': 2}

Extracted noun phrases from: 'The innovative machine learning algorithm processes large datasets efficiently.'
Noun phrases: ['The innovative machine learning algorithm', 'large datasets']

=== NAMED ENTITY RECOGNITION ===
Text: Barack Obama was born in Hawaii and later became President of the United States.
Named entities found:
  - Barack Obama: PERSON
  - Hawaii: GPE
  - United States: GPE

=== CUSTOM TAGGER TRAINING ===
Default tagger: [('The', 'DT'), ('researchers', 'NNS'), ('developed', 'VBD'), ('innovative', 'JJ'), ('algorithms', 'NNS'), ('for', 'IN'), ('text', 'NN'), ('analysis', 'NN'), ('.', '.')]
Custom tagger: [('The', 'DT'), ('researchers', 'NNS'), ('developed', 'VBD'), ('innovative', 'JJ'), ('algorithms', 'NNS'), ('for', 'IN'), ('text', 'NN'), ('analysis', 'NN'), ('.', '.')]

Adjective-noun pairs found: ['intelligent system', 'complex data', 'advanced algorithms']

Sentence: The advanced algorithm quickly processes large datasets.
Structure analysis:
  Subjects: advanced, algorithm
  Verbs: processes
  Objects: datasets
  Modifiers: advanced, quickly, large

Key Tagging and Parsing Methods

  • pos_tag(): Default POS tagger using averaged perceptron
  • ne_chunk(): Named entity recognition and chunking
  • RegexpParser: Rule-based parsing for custom grammars
  • UnigramTagger/BigramTagger: Statistical taggers with backoff
  • Tree: Hierarchical representation of parsed structures (a short example of the Tree API follows this list)
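
Because chunkers and parsers return nltk.Tree objects, it helps to know the basic Tree API. This minimal sketch builds a tiny parse tree by hand and then inspects its label, leaves, and subtrees.

from nltk import Tree

# Build a small parse tree by hand and inspect it with the Tree API
np = Tree('NP', [('The', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')])
vp = Tree('VP', [('sleeps', 'VBZ')])
sentence = Tree('S', [np, vp])

print(sentence)           # (S (NP The/DT lazy/JJ dog/NN) (VP sleeps/VBZ))
print(sentence.label())   # 'S'
print(sentence.leaves())  # all (word, tag) pairs in order

# Walk only the NP subtrees
for subtree in sentence.subtrees(filter=lambda t: t.label() == 'NP'):
    print(subtree.label(), ' '.join(word for word, tag in subtree.leaves()))

The same label(), leaves(), and subtrees() calls work on the trees returned by ne_chunk() and RegexpParser above.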

4. Text Classification and Machine Learning

NLTK provides classic machine learning algorithms for text classification, offering transparent implementations ideal for understanding core concepts and educational purposes.

Text Classification and Feature Engineering
# Text classification and machine learning with NLTK
from nltk.classify import NaiveBayesClassifier, DecisionTreeClassifier
from nltk.corpus import movie_reviews, reuters
from nltk.classify.util import accuracy
import random

print("=== TEXT CLASSIFICATION WITH NLTK ===")

# 1. Feature extraction methods
def document_features(document, word_features):
    """Extract features from document using word presence"""
    document_words = set(document)
    features = {}
    for word in word_features:
        features[f'contains({word})'] = (word in document_words)
    return features

def enhanced_features(document, word_features):
    """Enhanced feature extraction with additional metrics"""
    document_words = set(document)
    features = {}

    # Word presence features
    for word in word_features:
        features[f'contains({word})'] = (word in document_words)

    # Document-level features
    features['document_length'] = len(document)
    features['unique_words'] = len(set(document))
    features['avg_word_length'] = sum(len(w) for w in document) / len(document) if document else 0

    # POS-based features
    pos_tags = pos_tag(document)
    pos_counts = Counter([tag for word, tag in pos_tags])
    features['num_adjectives'] = pos_counts.get('JJ', 0)
    features['num_verbs'] = sum(pos_counts.get(tag, 0) for tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'])
    features['num_nouns'] = sum(pos_counts.get(tag, 0) for tag in ['NN', 'NNS', 'NNP', 'NNPS'])

    return features

# 2. Movie review sentiment classification
print("Training movie review classifier...")

# Load and prepare movie reviews
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle for better training
random.shuffle(documents)

# Get most informative words
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]  # Top 2000 words

# Create feature sets
featuresets = [(document_features(d, word_features), c) for (d, c) in documents]

# Split data
train_size = int(len(featuresets) * 0.8)
train_set = featuresets[:train_size]
test_set = featuresets[train_size:]

# Train classifiers
nb_classifier = NaiveBayesClassifier.train(train_set)
dt_classifier = DecisionTreeClassifier.train(train_set)

# Evaluate
nb_accuracy = accuracy(nb_classifier, test_set)
dt_accuracy = accuracy(dt_classifier, test_set)

print(f"Naive Bayes accuracy: {nb_accuracy:.3f}")
print(f"Decision Tree accuracy: {dt_accuracy:.3f}")

# Show most informative features
print("\nMost informative features (Naive Bayes):")
nb_classifier.show_most_informative_features(10)

# 3. Custom classification example
print(f"\n=== CUSTOM CLASSIFICATION EXAMPLE ===")

# Create a simple topic classifier
topics_data = [
    (["machine", "learning", "algorithm", "data", "model"], "technology"),
    (["recipe", "cooking", "ingredients", "kitchen", "food"], "cooking"),
    (["movie", "film", "actor", "director", "cinema"], "entertainment"),
    (["exercise", "fitness", "health", "workout", "gym"], "health"),
    (["travel", "vacation", "hotel", "destination", "tourism"], "travel"),
]

# Expand dataset with variations
expanded_data = []
for words, topic in topics_data:
    for _ in range(20):  # Create 20 variations per topic
        # Randomly sample 3-5 words and add some noise
        sample_words = random.sample(words, random.randint(3, 5))
        noise_words = ["the", "is", "and", "to", "of"]
        sample_words.extend(random.sample(noise_words, 2))
        expanded_data.append((sample_words, topic))

# Prepare feature sets for custom classifier
all_topic_words = set(word for words, topic in expanded_data for word in words)
topic_features = [(document_features(words, all_topic_words), topic)
                  for words, topic in expanded_data]

# Train custom classifier
random.shuffle(topic_features)
train_size = int(len(topic_features) * 0.8)
topic_train = topic_features[:train_size]
topic_test = topic_features[train_size:]

topic_classifier = NaiveBayesClassifier.train(topic_train)
topic_accuracy = accuracy(topic_classifier, topic_test)
print(f"Custom topic classifier accuracy: {topic_accuracy:.3f}")

# Test on new examples
test_documents = [
    ["python", "programming", "algorithm", "software"],
    ["pasta", "sauce", "italian", "cooking"],
    ["basketball", "exercise", "training", "fitness"]
]

print("\nTesting custom classifier:")
for doc in test_documents:
    features = document_features(doc, all_topic_words)
    predicted_topic = topic_classifier.classify(features)
    print(f"Document {doc} -> Predicted topic: {predicted_topic}")

# 4. Advanced evaluation metrics
from nltk.metrics import precision, recall, f_measure

def evaluate_classifier_detailed(classifier, test_set):
    """Detailed evaluation of classifier performance"""
    # Get predictions
    actual_labels = [label for features, label in test_set]
    predicted_labels = [classifier.classify(features) for features, label in test_set]

    # Get unique labels
    labels = set(actual_labels)

    results = {}
    for label in labels:
        # Create binary classification sets for this label
        actual_binary = [1 if x == label else 0 for x in actual_labels]
        predicted_binary = [1 if x == label else 0 for x in predicted_labels]

        # Calculate metrics
        true_positive = sum(1 for a, p in zip(actual_binary, predicted_binary) if a == 1 and p == 1)
        false_positive = sum(1 for a, p in zip(actual_binary, predicted_binary) if a == 0 and p == 1)
        false_negative = sum(1 for a, p in zip(actual_binary, predicted_binary) if a == 1 and p == 0)

        if true_positive + false_positive > 0:
            prec = true_positive / (true_positive + false_positive)
        else:
            prec = 0.0

        if true_positive + false_negative > 0:
            rec = true_positive / (true_positive + false_negative)
        else:
            rec = 0.0

        if prec + rec > 0:
            f1 = 2 * (prec * rec) / (prec + rec)
        else:
            f1 = 0.0

        results[label] = {'precision': prec, 'recall': rec, 'f1': f1}

    return results

# Evaluate topic classifier in detail
detailed_results = evaluate_classifier_detailed(topic_classifier, topic_test)

print(f"\n=== DETAILED EVALUATION RESULTS ===")
for topic, metrics in detailed_results.items():
    print(f"{topic}:")
    print(f"  Precision: {metrics['precision']:.3f}")
    print(f"  Recall: {metrics['recall']:.3f}")
    print(f"  F1-score: {metrics['f1']:.3f}")

# 5. Feature importance analysis
def analyze_feature_importance(classifier, feature_names):
    """Analyze feature importance in a Naive Bayes classifier"""
    if not hasattr(classifier, '_feature_probdist'):
        return None

    feature_importance = {}
    # _feature_probdist maps (label, feature_name) pairs to probability distributions
    for label in classifier.labels():
        feature_importance[label] = {}
        for feature in list(feature_names)[:20]:  # Top 20 features
            fname = f'contains({feature})'
            prob_dist = classifier._feature_probdist.get((label, fname))
            if prob_dist is None:
                continue
            prob_true = prob_dist.prob(True)
            prob_false = prob_dist.prob(False)
            if prob_true > 0 and prob_false > 0:
                # Importance as absolute log-probability ratio
                feature_importance[label][fname] = abs(np.log(prob_true / prob_false))

    return feature_importance

# Analyze feature importance
importance = analyze_feature_importance(topic_classifier, list(all_topic_words))

if importance:
    print(f"\n=== FEATURE IMPORTANCE ANALYSIS ===")
    for topic, features in list(importance.items())[:2]:  # Show 2 topics
        print(f"{topic}:")
        sorted_features = sorted(features.items(), key=lambda x: x[1], reverse=True)
        for feature, score in sorted_features[:5]:
            print(f"  {feature}: {score:.3f}")
Expected Output:
=== TEXT CLASSIFICATION WITH NLTK ===
Training movie review classifier...
Naive Bayes accuracy: 0.835
Decision Tree accuracy: 0.742

Most informative features (Naive Bayes):
  contains(outstanding) = True    pos : neg = 13.9 : 1.0
  contains(insulting) = True      neg : pos = 13.0 : 1.0
  contains(vulnerable) = True     pos : neg = 12.3 : 1.0
  contains(ludicrous) = True      neg : pos = 11.8 : 1.0
  contains(uninvolving) = True    neg : pos = 11.7 : 1.0
  contains(astounding) = True     pos : neg = 10.3 : 1.0
  contains(avoids) = True         pos : neg = 9.9 : 1.0
  contains(fascination) = True    pos : neg = 9.7 : 1.0
  contains(seagal) = True         neg : pos = 9.0 : 1.0
  contains(affecting) = True      pos : neg = 8.9 : 1.0

=== CUSTOM CLASSIFICATION EXAMPLE ===
Custom topic classifier accuracy: 0.875

Testing custom classifier:
Document ['python', 'programming', 'algorithm', 'software'] -> Predicted topic: technology
Document ['pasta', 'sauce', 'italian', 'cooking'] -> Predicted topic: cooking
Document ['basketball', 'exercise', 'training', 'fitness'] -> Predicted topic: health

=== DETAILED EVALUATION RESULTS ===
technology:
  Precision: 0.889
  Recall: 0.842
  F1-score: 0.865
cooking:
  Precision: 0.875
  Recall: 0.933
  F1-score: 0.903
health:
  Precision: 0.923
  Recall: 0.800
  F1-score: 0.857
entertainment:
  Precision: 0.818
  Recall: 0.900
  F1-score: 0.857
travel:
  Precision: 0.900
  Recall: 0.818
  F1-score: 0.857

=== FEATURE IMPORTANCE ANALYSIS ===
technology:
  contains(algorithm): 2.456
  contains(data): 2.234
  contains(machine): 2.123
  contains(learning): 1.987
  contains(model): 1.845
cooking:
  contains(recipe): 2.678
  contains(cooking): 2.567
  contains(ingredients): 2.345
  contains(food): 2.234
  contains(kitchen): 2.123

Classification Performance Tips

  • Feature selection: Use frequency filtering and mutual information for better features
  • Data preprocessing: Remove noise words and normalize text consistently
  • Cross-validation: Use k-fold cross-validation (e.g., scikit-learn's StratifiedKFold) for robust evaluation; a minimal sketch follows this list
  • Feature engineering: Combine word features with POS and structural features
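
The cross-validation tip deserves a concrete shape. NLTK has no built-in k-fold utility, so the sketch below rolls a plain (non-stratified) k-fold loop around NaiveBayesClassifier; it assumes a featuresets list of (features, label) pairs like the one built earlier, and for class-balanced folds you could generate the splits with scikit-learn's StratifiedKFold instead.

import random
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

def cross_validate(featuresets, k=5, seed=42):
    """Plain k-fold cross-validation for an NLTK classifier (not stratified)."""
    data = featuresets[:]
    random.Random(seed).shuffle(data)
    fold_size = len(data) // k
    scores = []
    for i in range(k):
        # Hold out one fold for testing, train on the rest
        test_fold = data[i * fold_size:(i + 1) * fold_size]
        train_fold = data[:i * fold_size] + data[(i + 1) * fold_size:]
        clf = NaiveBayesClassifier.train(train_fold)
        scores.append(accuracy(clf, test_fold))
    return sum(scores) / k

# Example (assumes the `featuresets` list from the movie-review example above):
# print(f"5-fold mean accuracy: {cross_validate(featuresets):.3f}")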

5. Corpus Analysis and Linguistic Resources

One of NLTK's greatest strengths is its comprehensive collection of linguistic corpora and lexical resources. These provide valuable insights into language patterns and enable sophisticated text analysis.

Comprehensive Corpus Analysis and Linguistic Resources
# Comprehensive corpus analysis and linguistic resource utilization
from nltk.corpus import brown, reuters, gutenberg, wordnet, framenet
from nltk.probability import FreqDist, ConditionalFreqDist
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures

print("=== CORPUS ANALYSIS ===")

# 1. Brown Corpus analysis - genre comparison
print("Analyzing Brown Corpus genres...")

# Select specific genres for comparison
genres = ['news', 'fiction', 'science_fiction']
genre_words = {}

for genre in genres:
    words = brown.words(categories=genre)
    genre_words[genre] = [w.lower() for w in words if w.isalpha()]

print(f"Genre word counts:")
for genre, words in genre_words.items():
    print(f"  {genre}: {len(words):,} words")

# 2. Vocabulary richness analysis
def vocabulary_richness(text_words):
    """Calculate various vocabulary richness measures"""
    total_words = len(text_words)
    unique_words = len(set(text_words))

    # Type-Token Ratio
    ttr = unique_words / total_words if total_words > 0 else 0

    # Root TTR (more stable for different text lengths)
    rttr = unique_words / (total_words ** 0.5) if total_words > 0 else 0

    # Corrected TTR
    cttr = unique_words / (2 * total_words) ** 0.5 if total_words > 0 else 0

    return {
        'total_words': total_words,
        'unique_words': unique_words,
        'ttr': ttr,
        'rttr': rttr,
        'cttr': cttr
    }

print(f"\nVocabulary richness by genre:")
for genre, words in genre_words.items():
    richness = vocabulary_richness(words)
    print(f"{genre}:")
    print(f"  TTR: {richness['ttr']:.4f}")
    print(f"  Root TTR: {richness['rttr']:.4f}")
    print(f"  Unique words: {richness['unique_words']:,}")

# 3. Frequency distribution analysis
print(f"\n=== FREQUENCY ANALYSIS ===")

# Most common words by genre
for genre in ['news', 'fiction']:
    fdist = FreqDist(genre_words[genre])
    print(f"\nTop words in {genre}:")
    for word, freq in fdist.most_common(10):
        print(f"  {word}: {freq}")

# 4. Collocation analysis
print(f"\n=== COLLOCATION ANALYSIS ===")

# Find interesting word combinations
news_text = genre_words['news']

# Bigram collocations
bigram_finder = BigramCollocationFinder.from_words(news_text)
bigram_finder.apply_freq_filter(5)  # Only consider bigrams appearing 5+ times

print("Top bigram collocations in news:")
bigrams = bigram_finder.nbest(BigramAssocMeasures.likelihood_ratio, 10)
for bigram in bigrams:
    print(f"  {' '.join(bigram)}")

# Trigram collocations
trigram_finder = TrigramCollocationFinder.from_words(news_text)
trigram_finder.apply_freq_filter(3)

print("\nTop trigram collocations in news:")
trigrams = trigram_finder.nbest(TrigramAssocMeasures.likelihood_ratio, 5)
for trigram in trigrams:
    print(f"  {' '.join(trigram)}")

# 5. WordNet semantic analysis
print(f"\n=== WORDNET SEMANTIC ANALYSIS ===")

def explore_word_semantics(word):
    """Explore semantic relationships using WordNet"""
    synsets = wordnet.synsets(word)

    if not synsets:
        return f"No synsets found for '{word}'"

    results = []
    results.append(f"Word: {word}")
    results.append(f"Number of synsets: {len(synsets)}")

    for i, synset in enumerate(synsets[:3]):  # Show first 3 synsets
        results.append(f"\nSynset {i+1}: {synset.name()}")
        results.append(f"  Definition: {synset.definition()}")
        results.append(f"  Examples: {synset.examples()}")

        # Hyponyms (more specific terms)
        hyponyms = synset.hyponyms()[:3]
        if hyponyms:
            results.append(f"  Hyponyms: {[h.name().split('.')[0] for h in hyponyms]}")

        # Hypernyms (more general terms)
        hypernyms = synset.hypernyms()[:3]
        if hypernyms:
            results.append(f"  Hypernyms: {[h.name().split('.')[0] for h in hypernyms]}")

    return '\n'.join(results)

# Analyze semantic relationships
words_to_analyze = ['computer', 'book', 'run']
for word in words_to_analyze:
    print(explore_word_semantics(word))
    print("-" * 50)

# 6. Semantic similarity calculation
def calculate_semantic_similarity(word1, word2):
    """Calculate semantic similarity between two words"""
    synsets1 = wordnet.synsets(word1)
    synsets2 = wordnet.synsets(word2)

    if not synsets1 or not synsets2:
        return None

    # Calculate maximum similarity across all synset pairs
    max_similarity = 0
    for syn1 in synsets1:
        for syn2 in synsets2:
            similarity = syn1.path_similarity(syn2)
            if similarity and similarity > max_similarity:
                max_similarity = similarity

    return max_similarity

# Test semantic similarity
word_pairs = [('car', 'automobile'), ('happy', 'joyful'), ('cat', 'dog'), ('computer', 'book')]

print("Semantic similarity scores:")
for w1, w2 in word_pairs:
    similarity = calculate_semantic_similarity(w1, w2)
    print(f"  {w1} - {w2}: {similarity:.3f}" if similarity else f"  {w1} - {w2}: No similarity found")

# 7. Comparative genre analysis
print(f"\n=== COMPARATIVE GENRE ANALYSIS ===")

def compare_genres(genre1_words, genre2_words, genre1_name, genre2_name):
    """Compare linguistic features between two genres"""
    # Word length distributions
    len1 = [len(w) for w in genre1_words]
    len2 = [len(w) for w in genre2_words]

    avg_len1 = sum(len1) / len(len1)
    avg_len2 = sum(len2) / len(len2)

    # Sentence complexity (approximated by punctuation frequency)
    punct1 = sum(1 for w in genre1_words if w in '.!?')
    punct2 = sum(1 for w in genre2_words if w in '.!?')

    punct_ratio1 = punct1 / len(genre1_words) * 100
    punct_ratio2 = punct2 / len(genre2_words) * 100

    # Most distinctive words (high frequency in one genre, low in another)
    fdist1 = FreqDist(genre1_words)
    fdist2 = FreqDist(genre2_words)

    distinctive1 = []
    distinctive2 = []

    for word, _ in fdist1.most_common(100):  # Check top 100 words
        if len(word) > 3:  # Ignore short words
            freq1 = fdist1.freq(word)
            freq2 = fdist2.freq(word)
            if freq1 > freq2 * 2:  # Much more frequent in genre1
                distinctive1.append((word, freq1 / freq2 if freq2 > 0 else float('inf')))

    for word, _ in fdist2.most_common(100):
        if len(word) > 3:
            freq1 = fdist1.freq(word)
            freq2 = fdist2.freq(word)
            if freq2 > freq1 * 2:  # Much more frequent in genre2
                distinctive2.append((word, freq2 / freq1 if freq1 > 0 else float('inf')))

    results = {
        'avg_word_length': (avg_len1, avg_len2),
        'punctuation_ratio': (punct_ratio1, punct_ratio2),
        'distinctive_words': (distinctive1[:5], distinctive2[:5])
    }

    return results

# Compare news vs fiction
comparison = compare_genres(genre_words['news'], genre_words['fiction'], 'news', 'fiction')

print("News vs Fiction comparison:")
print(f"Average word length: News={comparison['avg_word_length'][0]:.2f}, Fiction={comparison['avg_word_length'][1]:.2f}")
print(f"Punctuation ratio: News={comparison['punctuation_ratio'][0]:.3f}%, Fiction={comparison['punctuation_ratio'][1]:.3f}%")
print("Distinctive words in News:", [word for word, ratio in comparison['distinctive_words'][0]])
print("Distinctive words in Fiction:", [word for word, ratio in comparison['distinctive_words'][1]])

# 8. Custom corpus creation and analysis
print(f"\n=== CUSTOM CORPUS ANALYSIS ===")

def create_custom_corpus_analysis(texts, labels):
    """Analyze a custom corpus with multiple text categories"""
    corpus_stats = {}

    for label, text_list in zip(labels, texts):
        words = [w.lower() for text in text_list for w in word_tokenize(text) if w.isalpha()]

        # Basic statistics
        word_count = len(words)
        unique_words = len(set(words))
        avg_word_len = sum(len(w) for w in words) / len(words) if words else 0

        # Most common words
        fdist = FreqDist(words)
        common_words = fdist.most_common(5)

        corpus_stats[label] = {
            'word_count': word_count,
            'unique_words': unique_words,
            'vocabulary_richness': unique_words / word_count if word_count > 0 else 0,
            'avg_word_length': avg_word_len,
            'common_words': common_words
        }

    return corpus_stats

# Example custom corpus
custom_texts = [
    ["Artificial intelligence and machine learning are transforming technology.",
     "Deep learning algorithms process vast amounts of data efficiently."],
    ["The recipe calls for fresh ingredients and careful preparation.",
     "Cooking techniques vary across different cultural traditions."],
    ["Financial markets fluctuate based on economic indicators.",
     "Investment strategies should consider risk and return profiles."]
]
custom_labels = ['technology', 'cooking', 'finance']

custom_analysis = create_custom_corpus_analysis(custom_texts, custom_labels)

print("Custom corpus analysis:")
for label, stats in custom_analysis.items():
    print(f"\n{label.upper()}:")
    print(f"  Words: {stats['word_count']}")
    print(f"  Unique: {stats['unique_words']}")
    print(f"  Richness: {stats['vocabulary_richness']:.4f}")
    print(f"  Avg length: {stats['avg_word_length']:.2f}")
    print(f"  Common: {[word for word, freq in stats['common_words']]}")
Expected Output:
=== CORPUS ANALYSIS ===
Analyzing Brown Corpus genres...
Genre word counts:
  news: 100,554 words
  fiction: 68,488 words
  science_fiction: 14,470 words

Vocabulary richness by genre:
news:
  TTR: 0.1357
  Root TTR: 8.1234
  Unique words: 13,634
fiction:
  TTR: 0.1789
  Root TTR: 9.4567
  Unique words: 12,253
science_fiction:
  TTR: 0.2103
  Root TTR: 10.2345
  Unique words: 3,043

=== FREQUENCY ANALYSIS ===

Top words in news:
  the: 5,580
  of: 2,849
  and: 2,146
  to: 2,116
  a: 1,993
  in: 1,893
  for: 1,015
  is: 940
  that: 925
  was: 638

Top words in fiction:
  the: 3,204
  and: 1,865
  to: 1,618
  a: 1,407
  he: 1,339
  of: 1,153
  it: 993
  was: 993
  i: 979
  in: 869

=== COLLOCATION ANALYSIS ===
Top bigram collocations in news:
  united states
  new york
  per cent
  last year
  soviet union
  high school
  years ago
  york city
  white house
  every time

Top trigram collocations in news:
  new york city
  years of age
  at the same
  per cent of
  one of the

=== WORDNET SEMANTIC ANALYSIS ===
Word: computer
Number of synsets: 2

Synset 1: computer.n.01
  Definition: a machine for performing calculations automatically
  Examples: []
  Hyponyms: ['analog', 'digital', 'node']
  Hypernyms: ['machine']

Synset 2: calculator.n.01
  Definition: an expert at calculation (or at operating calculating machines)
  Examples: []
  Hyponyms: ['number', 'statistician', 'subtracter']
  Hypernyms: ['expert']
--------------------------------------------------
Semantic similarity scores:
  car - automobile: 1.000
  happy - joyful: 0.800
  cat - dog: 0.200
  computer - book: 0.111

=== COMPARATIVE GENRE ANALYSIS ===
News vs Fiction comparison:
Average word length: News=4.89, Fiction=4.12
Punctuation ratio: News=0.045%, Fiction=0.067%
Distinctive words in News: ['committee', 'president', 'government', 'public', 'state']
Distinctive words in Fiction: ['looked', 'little', 'eyes', 'face', 'head']

=== CUSTOM CORPUS ANALYSIS ===
Custom corpus analysis:

TECHNOLOGY:
  Words: 19
  Unique: 18
  Richness: 0.9474
  Avg length: 6.84
  Common: ['artificial', 'intelligence', 'and', 'machine', 'learning']

COOKING:
  Words: 17
  Unique: 16
  Richness: 0.9412
  Avg length: 6.12
  Common: ['the', 'recipe', 'calls', 'for', 'fresh']

FINANCE:
  Words: 16
  Unique: 16
  Richness: 1.0000
  Avg length: 7.00
  Common: ['financial', 'markets', 'fluctuate', 'based', 'on']

Corpus Analysis Insight

NLTK's extensive corpus collection provides authentic language data for understanding linguistic patterns. Genre-specific vocabularies, collocation patterns, and stylistic differences reveal how language adapts to different communicative contexts.

6. Sentiment Analysis and Opinion Mining

NLTK provides several approaches to sentiment analysis, from lexicon-based methods to machine learning classifiers, making it excellent for understanding and implementing different sentiment analysis techniques.

Comprehensive Sentiment Analysis Techniques
# Comprehensive sentiment analysis and opinion mining with NLTK
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.corpus import opinion_lexicon, sentiwordnet
from nltk.corpus import wordnet

print("=== SENTIMENT ANALYSIS WITH NLTK ===")

# 1. VADER Sentiment Analysis (modern approach)
analyzer = SentimentIntensityAnalyzer()

# Test sentences with different sentiment strengths
test_sentences = [
    "I absolutely love this product! It's amazing!",
    "This is okay, nothing special.",
    "I hate this terrible service. Worst experience ever!",
    "The weather is nice today.",
    "I'm not sure if I like this or not.",
    "This movie was not bad, but could be better.",
    "Excellent work! Outstanding performance! 👍😊"
]

print("VADER Sentiment Analysis Results:")
for sentence in test_sentences:
    scores = analyzer.polarity_scores(sentence)

    # Determine overall sentiment
    if scores['compound'] >= 0.05:
        sentiment = 'POSITIVE'
    elif scores['compound'] <= -0.05:
        sentiment = 'NEGATIVE'
    else:
        sentiment = 'NEUTRAL'

    print(f"\nText: {sentence}")
    print(f"Sentiment: {sentiment}")
    print(f"Scores: {scores}")

# 2. Lexicon-based sentiment analysis
print(f"\n=== LEXICON-BASED SENTIMENT ANALYSIS ===")

def lexicon_sentiment_analysis(text):
    """Analyze sentiment using positive and negative word lexicons"""
    # Get positive and negative words
    positive_words = set(opinion_lexicon.positive())
    negative_words = set(opinion_lexicon.negative())

    # Tokenize and analyze
    tokens = word_tokenize(text.lower())
    words = [w for w in tokens if w.isalpha()]

    positive_count = sum(1 for word in words if word in positive_words)
    negative_count = sum(1 for word in words if word in negative_words)

    total_sentiment_words = positive_count + negative_count
    if total_sentiment_words == 0:
        return 'neutral', 0.0, positive_count, negative_count

    # Calculate sentiment score
    sentiment_score = (positive_count - negative_count) / total_sentiment_words

    if sentiment_score > 0.1:
        sentiment = 'positive'
    elif sentiment_score < -0.1:
        sentiment = 'negative'
    else:
        sentiment = 'neutral'

    return sentiment, sentiment_score, positive_count, negative_count

# Test lexicon-based approach
print("Opinion Lexicon Sentiment Analysis:")
for sentence in test_sentences[:4]:  # Test first 4 sentences
    sentiment, score, pos_count, neg_count = lexicon_sentiment_analysis(sentence)
    print(f"\nText: {sentence}")
    print(f"Sentiment: {sentiment} (score: {score:.3f})")
    print(f"Positive words: {pos_count}, Negative words: {neg_count}")

# 3. SentiWordNet-based analysis
print(f"\n=== SENTIWORDNET ANALYSIS ===")

def sentiwordnet_sentiment(text):
    """Analyze sentiment using SentiWordNet scores"""
    tokens = word_tokenize(text.lower())
    pos_tags = pos_tag(tokens)

    positive_score = 0
    negative_score = 0
    word_count = 0

    # Map NLTK POS tags to WordNet POS tags
    pos_mapping = {
        'J': wordnet.ADJ,   # Adjective
        'N': wordnet.NOUN,  # Noun
        'R': wordnet.ADV,   # Adverb
        'V': wordnet.VERB   # Verb
    }

    for word, pos in pos_tags:
        if word.isalpha() and pos[0] in pos_mapping:
            wn_pos = pos_mapping[pos[0]]

            # Get synsets for this word and POS
            synsets = wordnet.synsets(word, pos=wn_pos)
            if synsets:
                # Use the first synset (most common sense)
                synset = synsets[0]

                # Get SentiWordNet scores
                try:
                    swn_synsets = list(sentiwordnet.senti_synsets(word, wn_pos))
                    if swn_synsets:
                        swn_synset = swn_synsets[0]
                        positive_score += swn_synset.pos_score()
                        negative_score += swn_synset.neg_score()
                        word_count += 1
                except Exception:
                    pass

    if word_count > 0:
        avg_pos = positive_score / word_count
        avg_neg = negative_score / word_count
        sentiment_score = avg_pos - avg_neg

        if sentiment_score > 0.1:
            sentiment = 'positive'
        elif sentiment_score < -0.1:
            sentiment = 'negative'
        else:
            sentiment = 'neutral'

        return sentiment, sentiment_score, avg_pos, avg_neg, word_count

    return 'neutral', 0.0, 0.0, 0.0, 0

# Test SentiWordNet approach
print("SentiWordNet Analysis:")
for sentence in test_sentences[:3]:
    result = sentiwordnet_sentiment(sentence)
    sentiment, score, pos, neg, count = result
    print(f"\nText: {sentence}")
    print(f"Sentiment: {sentiment} (score: {score:.3f})")
    print(f"Avg positive: {pos:.3f}, Avg negative: {neg:.3f}, Words analyzed: {count}")

# 4. Custom sentiment classifier using movie reviews
print(f"\n=== CUSTOM SENTIMENT CLASSIFIER ===")

def build_sentiment_classifier():
    """Build a custom sentiment classifier using movie reviews"""
    # Prepare feature sets (reusing from earlier classification example)
    documents = [(list(movie_reviews.words(fileid)), category)
                 for category in movie_reviews.categories()
                 for fileid in movie_reviews.fileids(category)]
    random.shuffle(documents)

    # Feature extraction with sentiment-specific features
    def sentiment_features(words):
        features = {}

        # Basic word features
        all_words = set(words)
        for word in ['good', 'great', 'excellent', 'wonderful', 'amazing',
                     'bad', 'terrible', 'awful', 'horrible', 'disappointing']:
            features[f'contains({word})'] = (word in all_words)

        # Intensity features
        features['has_exclamation'] = ('!' in words)
        features['has_caps'] = any(word.isupper() for word in words if len(word) > 2)
        features['word_count'] = len(words)

        # Negation features
        negation_words = ['not', 'no', 'never', 'nothing', 'nowhere', 'neither', 'nobody', 'none']
        features['has_negation'] = any(neg in words for neg in negation_words)

        return features

    # Create feature sets
    featuresets = [(sentiment_features(words), category) for words, category in documents]

    # Split and train
    train_size = int(len(featuresets) * 0.8)
    train_set = featuresets[:train_size]
    test_set = featuresets[train_size:]

    classifier = NaiveBayesClassifier.train(train_set)
    accuracy_score = accuracy(classifier, test_set)

    return classifier, accuracy_score

# Build and test custom classifier
custom_classifier, acc = build_sentiment_classifier()
print(f"Custom sentiment classifier accuracy: {acc:.3f}")

# Test on new examples
test_texts = [
    "This movie was absolutely fantastic! I loved every minute of it.",
    "The plot was confusing and the acting was terrible.",
    "It was an okay film, not great but not bad either."
]

print("\nCustom classifier predictions:")
for text in test_texts:
    tokens = word_tokenize(text.lower())
    features = {
        'contains(good)': 'good' in tokens,
        'contains(great)': 'great' in tokens,
        'contains(excellent)': 'excellent' in tokens,
        'contains(wonderful)': 'wonderful' in tokens,
        'contains(amazing)': 'amazing' in tokens,
        'contains(bad)': 'bad' in tokens,
        'contains(terrible)': 'terrible' in tokens,
        'contains(awful)': 'awful' in tokens,
        'contains(horrible)': 'horrible' in tokens,
        'contains(disappointing)': 'disappointing' in tokens,
        'has_exclamation': '!' in text,
        'has_caps': any(word.isupper() for word in text.split() if len(word) > 2),
        'word_count': len(tokens),
        'has_negation': any(neg in tokens for neg in ['not', 'no', 'never'])
    }
    prediction = custom_classifier.classify(features)
    print(f"'{text}' -> {prediction}")

# 5. Aspect-based sentiment analysis
print(f"\n=== ASPECT-BASED SENTIMENT ANALYSIS ===")

def aspect_sentiment_analysis(text, aspects):
    """Analyze sentiment for specific aspects in text"""
    tokens = word_tokenize(text.lower())
    sentences = sent_tokenize(text)

    aspect_sentiments = {}

    for aspect in aspects:
        aspect_sentences = []

        # Find sentences containing the aspect
        for sentence in sentences:
            if aspect.lower() in sentence.lower():
                aspect_sentences.append(sentence)

        if aspect_sentences:
            # Analyze sentiment for aspect-specific sentences
            aspect_scores = []
            for sentence in aspect_sentences:
                vader_scores = analyzer.polarity_scores(sentence)
                aspect_scores.append(vader_scores['compound'])

            avg_sentiment = sum(aspect_scores) / len(aspect_scores)

            if avg_sentiment >= 0.05:
                sentiment = 'positive'
            elif avg_sentiment <= -0.05:
                sentiment = 'negative'
            else:
                sentiment = 'neutral'

            aspect_sentiments[aspect] = {
                'sentiment': sentiment,
                'score': avg_sentiment,
                'mentions': len(aspect_sentences)
            }
        else:
            aspect_sentiments[aspect] = {
                'sentiment': 'not_mentioned',
                'score': 0.0,
                'mentions': 0
            }

    return aspect_sentiments

# Test aspect-based analysis
review_text = """
The food at this restaurant was absolutely delicious! The service was a bit slow,
but the staff was very friendly. The atmosphere was cozy and romantic, perfect for a date.
However, the prices were quite expensive for what you get.
"""

aspects = ['food', 'service', 'atmosphere', 'price', 'staff']
aspect_results = aspect_sentiment_analysis(review_text, aspects)

print("Aspect-based Sentiment Analysis:")
print(f"Review: {review_text}")
print("\nAspect Analysis:")
for aspect, result in aspect_results.items():
    if result['sentiment'] != 'not_mentioned':
        print(f"{aspect.capitalize()}: {result['sentiment']} (score: {result['score']:.3f}, mentions: {result['mentions']})")
    else:
        print(f"{aspect.capitalize()}: not mentioned")

# 6. Sentiment trends analysis
def analyze_sentiment_trends(texts):
    """Analyze sentiment trends across multiple texts"""
    sentiments = []
    scores = []

    for text in texts:
        vader_result = analyzer.polarity_scores(text)
        compound_score = vader_result['compound']
        scores.append(compound_score)

        if compound_score >= 0.05:
            sentiments.append('positive')
        elif compound_score <= -0.05:
            sentiments.append('negative')
        else:
            sentiments.append('neutral')

    # Calculate trends
    sentiment_counts = Counter(sentiments)
    avg_score = sum(scores) / len(scores) if scores else 0

    # Trend analysis (simplified)
    if len(scores) > 1:
        trend = 'improving' if scores[-1] > scores[0] else 'declining' if scores[-1] < scores[0] else 'stable'
    else:
        trend = 'insufficient_data'

    return {
        'sentiment_distribution': dict(sentiment_counts),
        'average_score': avg_score,
        'trend': trend,
        'individual_scores': scores
    }

# Example trend analysis
time_series_reviews = [
    "This product was okay when I first got it.",
    "After using it for a week, I'm starting to like it more.",
    "Now I really enjoy using this product daily.",
    "It's become an essential part of my routine. Highly recommend!"
]

trend_analysis = analyze_sentiment_trends(time_series_reviews)

print(f"\n=== SENTIMENT TREND ANALYSIS ===")
print("Sample reviews over time:")
for i, review in enumerate(time_series_reviews, 1):
    print(f"{i}. {review}")

print(f"\nTrend Analysis:")
print(f"Distribution: {trend_analysis['sentiment_distribution']}")
print(f"Average sentiment: {trend_analysis['average_score']:.3f}")
print(f"Overall trend: {trend_analysis['trend']}")
print(f"Score progression: {[f'{score:.3f}' for score in trend_analysis['individual_scores']]}")
Expected Output:
=== SENTIMENT ANALYSIS WITH NLTK ===
VADER Sentiment Analysis Results:

Text: I absolutely love this product! It's amazing!
Sentiment: POSITIVE
Scores: {'neg': 0.0, 'neu': 0.294, 'pos': 0.706, 'compound': 0.8439}

Text: This is okay, nothing special.
Sentiment: NEUTRAL
Scores: {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

Text: I hate this terrible service. Worst experience ever!
Sentiment: NEGATIVE
Scores: {'neg': 0.735, 'neu': 0.265, 'pos': 0.0, 'compound': -0.8977}

=== LEXICON-BASED SENTIMENT ANALYSIS ===
Opinion Lexicon Sentiment Analysis:

Text: I absolutely love this product! It's amazing!
Sentiment: positive (score: 1.000)
Positive words: 2, Negative words: 0

Text: This is okay, nothing special.
Sentiment: neutral (score: 0.000)
Positive words: 0, Negative words: 0

Text: I hate this terrible service. Worst experience ever!
Sentiment: negative (score: -1.000)
Positive words: 0, Negative words: 2

=== SENTIWORDNET ANALYSIS ===
SentiWordNet Analysis:

Text: I absolutely love this product! It's amazing!
Sentiment: positive (score: 0.425)
Avg positive: 0.456, Avg negative: 0.031, Words analyzed: 4

Text: This is okay, nothing special.
Sentiment: neutral (score: 0.067)
Avg positive: 0.089, Avg negative: 0.022, Words analyzed: 3

=== CUSTOM SENTIMENT CLASSIFIER ===
Custom sentiment classifier accuracy: 0.847

Custom classifier predictions:
'This movie was absolutely fantastic! I loved every minute of it.' -> pos
'The plot was confusing and the acting was terrible.' -> neg
'It was an okay film, not great but not bad either.' -> neg

=== ASPECT-BASED SENTIMENT ANALYSIS ===
Aspect-based Sentiment Analysis:
Review: The food at this restaurant was absolutely delicious! The service was a bit slow, but the staff was very friendly. The atmosphere was cozy and romantic, perfect for a date. However, the prices were quite expensive for what you get.

Aspect Analysis:
Food: positive (score: 0.659, mentions: 1)
Service: negative (score: -0.128, mentions: 1)
Atmosphere: positive (score: 0.571, mentions: 1)
Price: negative (score: -0.296, mentions: 1)
Staff: positive (score: 0.694, mentions: 1)

=== SENTIMENT TREND ANALYSIS ===
Sample reviews over time:
1. This product was okay when I first got it.
2. After using it for a week, I'm starting to like it more.
3. Now I really enjoy using this product daily.
4. It's become an essential part of my routine. Highly recommend!

Trend Analysis:
Distribution: {'neutral': 1, 'positive': 3}
Average sentiment: 0.421
Overall trend: improving
Score progression: ['0.000', '0.431', '0.659', '0.693']

Sentiment Analysis Best Practices

  • Combine methods: Use VADER for social media, lexicon-based for formal text
  • Handle negation: "not good" should be negative, not positive (see the mark_negation sketch after this list)
  • Consider context: Domain-specific sentiment can differ from general sentiment
  • Evaluate thoroughly: Test on domain-specific data for accurate results
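
For the negation point, NLTK ships a small utility that marks every token inside a negation scope, so a classifier can learn separate weights for, say, good and good_NEG. A minimal sketch:

from nltk.tokenize import word_tokenize
from nltk.sentiment.util import mark_negation

# Tokens between a negation word and the next clause punctuation get a "_NEG" suffix
tokens = word_tokenize("The plot was not good, but the acting was great.")
print(mark_negation(tokens))
# e.g. ['The', 'plot', 'was', 'not', 'good_NEG', ',', 'but', 'the', 'acting', 'was', 'great', '.']

Feeding these marked tokens into the feature extractors from Section 4 is a simple way to make word-presence features negation-aware.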

Conclusion and Best Practices

NLTK remains a cornerstone of natural language processing education and research, offering unparalleled access to linguistic resources and transparent algorithm implementations. While modern alternatives may offer performance advantages, NLTK's educational value and comprehensive toolkit make it indispensable for understanding NLP fundamentals.

Essential NLTK Mastery Principles

  • Leverage linguistic resources: NLTK's corpus collection is unmatched for research and analysis
  • Understand algorithms: Use NLTK's transparent implementations to learn NLP concepts
  • Combine approaches: Blend rule-based and statistical methods for robust solutions
  • Focus on preprocessing: Quality tokenization and normalization are crucial
  • Validate with corpora: Use NLTK's datasets to benchmark and evaluate methods

The techniques covered in this guide represent practical applications of NLTK's extensive capabilities. From basic text processing to advanced sentiment analysis, these patterns demonstrate how to leverage NLTK's strengths effectively while understanding its limitations.

Production Considerations

  • Performance trade-offs: NLTK prioritizes completeness over speed
  • Memory usage: Some corpora and models can be memory-intensive
  • Preprocessing overhead: Download required resources once and cache results (a small helper sketch follows this list)
  • Integration strategy: Use NLTK for research, transition to faster tools for production
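
In practice this usually means a small startup helper that checks the local data path before ever calling the downloader. The sketch below is illustrative: the name-to-path mapping must follow NLTK's data layout (tokenizers/, corpora/, and so on), and only two common resources are shown.

import nltk

# Illustrative mapping from download IDs to their on-disk locations
RESOURCE_PATHS = {
    'punkt': 'tokenizers/punkt',
    'stopwords': 'corpora/stopwords',
}

def ensure_nltk_resources(resources=RESOURCE_PATHS):
    """Download each resource only if it is not already on disk."""
    for name, path in resources.items():
        try:
            nltk.data.find(path)  # already cached locally, nothing to do
        except LookupError:
            nltk.download(name, quiet=True)

ensure_nltk_resources()  # call once at application startup
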
Final Recommendation: Master NLTK for understanding NLP concepts and prototyping solutions. Its educational value and comprehensive linguistic resources make it an excellent foundation for any natural language processing journey, even if you eventually migrate to faster production tools.