Gavin (41)
I just don't get this paragraph at all:

Of course for many problems additional training data has a non-zero cost, which, for supervised learning, may be high. In this sense, collecting data from multiple sources allows not only to access to a huge set of data but also to improve the quality of the data solving problems such as data sparsity, misspelling, correctness and so on. Gathering data from a variety of sources is not an issue. We live in the “big data era” due to the abundance of digital data from many sources like the web, sensors, smartphones, corporate databases, and open data. But if the value comes from combining different data sets, so do the problems. Data from this plethora of sources comes in different formats. Before the learner can analyze it, the data must be cleaned up, merged and normalized into a unified and homogeneous schema that the algorithm can understand.
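
For what it's worth, the last sentence of the quoted paragraph (clean up, merge, and normalize into one schema) might be easier to follow with a concrete case. Below is a minimal sketch, not anything taken from the book: the two sources, their field names, and the helper functions `normalize_crm`, `normalize_web`, and `merge` are all hypothetical, made up purely for illustration. It assumes two sources that describe the same people with different column names and inconsistent formatting, and it uses the email address as the shared key.

```python
# Hypothetical example: two sources with different formats are cleaned,
# normalized to one schema (name, email, city), and merged by email.

def normalize_crm(record):
    # Assumed source A uses the keys "Full Name" and "E-mail" and has no city.
    return {
        "name": record["Full Name"].strip().title(),
        "email": record["E-mail"].strip().lower(),
        "city": None,
    }

def normalize_web(record):
    # Assumed source B uses the keys "user", "mail", and "city".
    return {
        "name": record["user"].strip().title(),
        "email": record["mail"].strip().lower(),
        "city": record.get("city") or None,  # treat empty strings as missing
    }

def merge(records):
    # Combine records that share an email; fill missing fields from later records.
    merged = {}
    for rec in records:
        current = merged.setdefault(rec["email"], dict(rec))
        for field, value in rec.items():
            if current.get(field) is None and value is not None:
                current[field] = value
    return list(merged.values())

crm_rows = [{"Full Name": "  ada LOVELACE ", "E-mail": "Ada@Example.com"}]
web_rows = [
    {"user": "ada lovelace", "mail": "ada@example.com ", "city": "London"},
    {"user": "alan turing", "mail": "alan@example.com", "city": ""},
]

unified = merge(
    [normalize_crm(r) for r in crm_rows] + [normalize_web(r) for r in web_rows]
)
for row in unified:
    print(row)
```

Here the first source contributes a record with messy capitalization and no city, the second supplies the city, and the merged result has one row per person in a single schema. That seems to be the sense in which combining sources can reduce sparsity while also creating the cleanup work the paragraph mentions.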