Last updated: 2021-08-05 18:10:46
Cover page
Hadoop MapReduce Cookbook
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Chapter 1. Getting Hadoop Up and Running in a Cluster
Introduction
Setting up Hadoop on your machine
Writing a WordCount MapReduce sample, bundling it, and running it using standalone Hadoop
Adding the combiner step to the WordCount MapReduce program
Setting up HDFS
Using HDFS monitoring UI
HDFS basic command-line file operations
Setting Hadoop in a distributed cluster environment
Running the WordCount program in a distributed cluster environment
Using MapReduce monitoring UI
Chapter 2. Advanced HDFS
Benchmarking HDFS
Adding a new DataNode
Decommissioning DataNodes
Using multiple disks/volumes and limiting HDFS disk usage
Setting HDFS block size
Setting the file replication factor
Using HDFS Java API
Using HDFS C API (libhdfs)
Mounting HDFS (Fuse-DFS)
Merging files in HDFS
Chapter 3. Advanced Hadoop MapReduce Administration
Tuning Hadoop configurations for cluster deployments
Running benchmarks to verify the Hadoop installation
Reusing Java VMs to improve the performance
Fault tolerance and speculative execution
Debug scripts – analyzing task failures
Setting failure percentages and skipping bad records
Shared-user Hadoop clusters – using fair and other schedulers
Hadoop security – integrating with Kerberos
Using the Hadoop Tool interface
Chapter 4. Developing Complex Hadoop MapReduce Applications
Choosing appropriate Hadoop data types
Implementing a custom Hadoop Writable data type
Implementing a custom Hadoop key type
Emitting data of different value types from a mapper
Choosing a suitable Hadoop InputFormat for your input data format
Adding support for new input data formats – implementing a custom InputFormat
Formatting the results of MapReduce computations – using Hadoop OutputFormats
Hadoop intermediate (map to reduce) data partitioning
Broadcasting and distributing shared resources to tasks in a MapReduce job – Hadoop DistributedCache
Using Hadoop with legacy applications – Hadoop Streaming
Adding dependencies between MapReduce jobs
Hadoop counters for reporting custom metrics
Chapter 5. Hadoop Ecosystem
Installing HBase
Data random access using Java client APIs
Running MapReduce jobs on HBase (table input/output)
Installing Pig
Running your first Pig command
Set operations (join, union) and sorting with Pig
Installing Hive
Running a SQL-style query with Hive
Performing a join with Hive
Installing Mahout
Running K-means with Mahout
Visualizing K-means results
Chapter 6. Analytics
Simple analytics using MapReduce
Performing Group-By using MapReduce
Calculating frequency distributions and sorting using MapReduce
Plotting the Hadoop results using GNU Plot
Calculating histograms using MapReduce
Calculating scatter plots using MapReduce
Parsing a complex dataset with Hadoop
Joining two datasets using MapReduce
Chapter 7. Searching and Indexing
Generating an inverted index using Hadoop MapReduce
Intra-domain web crawling using Apache Nutch
Indexing and searching web documents using Apache Solr
Configuring Apache HBase as the backend data store for Apache Nutch
Deploying Apache HBase on a Hadoop cluster
Whole web crawling with Apache Nutch using a Hadoop/HBase cluster
ElasticSearch for indexing and searching
Generating the in-links graph for crawled web pages
Chapter 8. Classifications, Recommendations, and Finding Relationships
Content-based recommendations
Hierarchical clustering