Apache Spark Machine Learning Blueprints

Summary

Machine learning professionals and data scientists often spend 80% or more of their time on data preparation, which makes it the most important task to perform, even though it can also be the most boring one.

In this chapter, after discussing how to locate datasets and load them into Apache Spark, we covered methods for completing six critical data preparation tasks:

Treating dirty data with a focus on missing cases
Resolving entity problems to match datasets
Reorganizing datasets, with creating subsets and aggregating data as examples
Joining tables together
Developing features
Organizing data preparation workflows and automating them

In covering these tasks, we studied Spark SQL and R as the two primary tools, in combination with some special Spark packages, such as SampleClean, and some R packages, such as reshape. We also explored ways of making data preparation easy and fast.

After this chapter, we should have mastered all the necessary data preparation methods, plus a few advanced ones, and be capable of cleaning datasets such as the four used as examples in this chapter. From now on, we should be able to complete data preparation tasks quickly with a workflow approach and be ready for practical machine learning tasks.
