Last updated: 2021-08-05 17:13:04
Cover
Copyright Page
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Chapter 1. Simple Classifiers
Introduction
Deserializing and running a classifier
Getting confidence estimates from a classifier
Getting data from the Twitter API
Applying a classifier to a .csv file
Evaluation of classifiers – the confusion matrix
Training your own language model classifier
How to train and evaluate with cross validation
Viewing error categories – false positives
Understanding precision and recall
How to serialize a LingPipe object – classifier example
Eliminate near duplicates with the Jaccard distance
How to classify sentiment – simple version
Chapter 2. Finding and Working with Words
Introduction to tokenizer factories – finding words in a character stream
Combining tokenizers – lowercase tokenizer
Combining tokenizers – stop word tokenizers
Using Lucene/Solr tokenizers
Using Lucene/Solr tokenizers with LingPipe
Evaluating tokenizers with unit tests
Modifying tokenizer factories
Finding words for languages without white spaces
Chapter 3. Advanced Classifiers
A simple classifier
Language model classifier with tokens
Naïve Bayes
Feature extractors
Logistic regression
Multithreaded cross validation
Tuning parameters in logistic regression
Customizing feature extraction
Combining feature extractors
Classifier-building life cycle
Linguistic tuning
Thresholding classifiers
Train a little, learn a little – active learning
Annotation
Chapter 4. Tagging Words and Tokens
Interesting phrase detection
Foreground- or background-driven interesting phrase detection
Hidden Markov Models (HMM) – part-of-speech
N-best word tagging
Confidence-based tagging
Training word tagging
Word-tagging evaluation
Conditional random fields (CRF) for word/token tagging
Modifying CRFs
Chapter 5. Finding Spans in Text – Chunking
Sentence detection
Evaluation of sentence detection
Tuning sentence detection
Marking embedded chunks in a string – sentence chunk example
Paragraph detection
Simple noun phrases and verb phrases
Regular expression-based chunking for NER
Dictionary-based chunking for NER
Translating between word tagging and chunks – BIO codec
HMM-based NER
Mixing the NER sources
CRFs for chunking
NER using CRFs with better features
Chapter 6. String Comparison and Clustering
Distance and proximity – simple edit distance
Weighted edit distance
The Jaccard distance
The Tf-Idf distance
Using edit distance and language models for spelling correction
The case restoring corrector
Automatic phrase completion
Single-link and complete-link clustering using edit distance
Latent Dirichlet allocation (LDA) for multitopic clustering
Chapter 7. Finding Coreference Between Concepts/People
Named entity coreference with a document
Adding pronouns to coreference
Cross-document coreference
The John Smith problem
Index