Susan Harkins (406) [Avatar] Offline
Please post errors in the published version of Spark in Action here. We'll publish a comprehensive list for everyone's convenience.

Thank you!
Susan Harkins
Errata Editor
Eugene Teo (6) [Avatar] Offline
Page xvii, "to prove the Mesos execution platform feasible" => "to prove that the Mesos execution platform is feasible"
Page 19, the third edition of Programming in Scala has been available since April 2016.
Page 22, "see Appendix B for details" => "see Appendix A for details"
Page 22, "sudo rm -f /usr/local/spark" => "sudo rm /usr/local/spark"
dwoodbury (10) [Avatar] Offline
Chapter 5

val spark = SparkSession.builder().getOrElse() returns a "not a member" error.

Should this be getOrCreate() instead?
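For reference, here's the builder call that compiles (a minimal sketch, assuming the spark-sql dependency is on the classpath; the app name and master are my own illustrative choices):

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate() returns the existing SparkSession or builds a new one;
// builder() has no getOrElse method, hence the "not a member" error
val spark = SparkSession.builder()
  .appName("example")   // hypothetical name
  .master("local[*]")
  .getOrCreate()
```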

Mostafa (10) [Avatar] Offline
In the Python code, section 4.1.2, line 11: please add the following underneath it:

To show the top customer by purchases.
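If it helps future readers, here's a plain-Python sketch of what that step computes (the variable names and sample data are my own, not the book's RDD code):

```python
# Plain-Python sketch of "top customer by number of purchases"
# (sample data and names are my own, not the book's transaction file)
from collections import Counter

purchases = [(15, 25.5), (15, 11.0), (20, 70.0)]  # (customer_id, amount)
counts = Counter(cid for cid, _ in purchases)
top_customer, num_purchases = counts.most_common(1)[0]
# top_customer == 15, num_purchases == 2
```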

William DeMeo (2) [Avatar] Offline
Update: I figured out how to correct this. See Answer below.

Page 31:

Back in the spark-shell, load the log file:
scala> val lines = sc.textFile("/home/spark/client-ids.log")

Problem: This doesn't work because the client-ids.log file was created (as instructed by the authors) in a new terminal window, so it resides in /home/myusername, not /home/spark. I don't know how to resolve this issue. I have tried a number of variations, for example,

scala> val lines = sc.textFile("/home/myusername/client-ids.log")

I would try moving client-ids.log to the /home/spark directory, but that directory doesn't appear in my file system (since it resides on the virtual machine).

Question: How do we move or create a file in a directory on the virtual machine?

Answer: First exit the Spark shell with :q, and then, at the `spark@spark-in-action:~$` prompt, run

    cat > client-ids.log
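Spelled out a bit more (the here-document below replaces typing the lines interactively and pressing Ctrl-D; the sample IDs are illustrative, not necessarily the book's exact data):

```shell
# Recreate the log file in the current (spark) user's home directory;
# the here-doc stands in for interactive typing ended with Ctrl-D
cat > ~/client-ids.log <<'EOF'
15,16,20,20
77,80,94
94,98,16,31
31,15,20
EOF

# Verify the file is where sc.textFile("/home/spark/client-ids.log") will look
wc -l < ~/client-ids.log
```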

328747 (7) [Avatar] Offline
Page 205 of the PDF and print book:

"model would be move focused on the important features"

should be

"model would be more focused on the important features"
328747 (7) [Avatar] Offline
Page 206 (section 7.6.2)

It seems that


should instead be


(error I get is: "value count is not a member of org.apache.spark.mllib.linalg.Vector")
328747 (7) [Avatar] Offline
Page 208 (7.6.2)

val scalerHP = new StandardScaler(true, true) => x.features))

should instead be

val scalerHP = new StandardScaler(true, true).fit( => x.features))
328747 (7) [Avatar] Offline
Page 230 (pdf), section 8.2.3, I think


should instead be

328747 (7) [Avatar] Offline
Section 7.6.5 (page 212): in the equation you have ||x||, and in the text on the first line below it you have ||w||. I think that ||x|| in the equation should instead be ||w||, since you want to penalize based on the size of the coefficients, not the observations.
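For reference, the standard ridge-regression cost (my notation, not necessarily the book's exact symbols) penalizes the norm of the weight vector:

```latex
C(\mathbf{w}) \;=\; \frac{1}{2m}\sum_{i=1}^{m}\bigl(\mathbf{w}^{\mathsf T}\mathbf{x}^{(i)} - y^{(i)}\bigr)^{2} \;+\; \lambda\,\lVert\mathbf{w}\rVert^{2}
```

so the regularization term involves ||w|| (the coefficients), never ||x|| (an observation).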
Oliver QG (4) [Avatar] Offline
Errata in Spark in Action: appendix C A primer on linear algebra
The addition of matrices is wrong.

The result should be:

a11 + b11   a12 + b12
a21 + b21   a22 + b22
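A concrete numeric instance of the corrected element-wise rule (the numbers are my own):

```scala
// 2x2 element-wise matrix addition: (A + B)ij = aij + bij
// (sample numbers are my own)
val a = Array(Array(1.0, 2.0), Array(3.0, 4.0))
val b = Array(Array(5.0, 6.0), Array(7.0, 8.0))
val sum = a.zip(b).map { case (ra, rb) =>
  ra.zip(rb).map { case (x, y) => x + y }
}
// sum is Array(Array(6.0, 8.0), Array(10.0, 12.0))
```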

Choonoh Lee (14) [Avatar] Offline

  • On page 63, there is a note titled "Pasting blocks of code Into the Spark Scala shell", but this was another note's title on page 33. I guess the title on page 63 should be something like "Submitting a Spark Python Application".

  • On page 122, the short URI refers to an invalid web page. I guess the authors missed the last character '$' at the end of the original scaladoc URL.

  • On page 125, "with built-in DSL and SQL functions and UTFs" should be "with built-in DSL and SQL functions and UDFs".

  • The title of Section 5.1.6 is "Grouping and joining data", whereas Section 5.1.7's title is "Performing joins". There is no joining in Section 5.1.6.
    Choonoh Lee (14) [Avatar] Offline
    In Figure 5.5, a white rectangle says "Analyzed local plan", but I guess it should be "Analyzed logical plan", as the text says.
    Choonoh Lee (14) [Avatar] Offline
    In the book's index (on page xi),
    Section 5.2 is "Beyond DataFrames: introducing DataSets" and
    Section 5.7 is "Beyond DataFrames: introducing DataSets".

    The section is actually 5.2, but the introduction of Chapter 5 says
    "In section 5.2, we show you how to create DataFrames by running SQL queries ..." and
    "In the last section of this chapter we give a brief overview of DataSets".

    There are many incorrect mentions of these sections, so I don't know how to correct them.
    (Actually, I'm having some trouble because I'm officially translating this book into Korean.)
    Choonoh Lee (14) [Avatar] Offline
    The caption of Figure 5.2 is truncated.
    Choonoh Lee (14) [Avatar] Offline
    On page 162, the text says "For this task, you’ll use the reduceByKeyAndWindow method,"
    but in the code below, the window and reduceByKey methods are actually used instead of reduceByKeyAndWindow.
    Then on page 163 there is the mention "The window method isn’t the only window operation available,"
    and on page 164: "instead of using the reduceByKeyAndWindow method in the previous example, you could have also used the window method and then the reduceByKey method."
    I'm not sure which side should be fixed.
    Choonoh Lee (14) [Avatar] Offline
    In the second code snippet on page 169 (although it's not the final code),
    the topic name "metric" should be "metrics", as it was created on page 166.
    Choonoh Lee (14) [Avatar] Offline
    On pages 169 and 170, the text says 'companion object' was described in Chapter 4, but the note about it is in Chapter 2.
    On page 169, the code for KafkaProducerWrapper has two commentary arrows: "Method for sending messages" and "Producer object". They look like they're pointing at the wrong code lines.
    Choonoh Lee (14) [Avatar] Offline
    On page 196, "where n, in this example, is equal to 12." should be "where n, in this example, is equal to 13.", since 12 features are added to a single feature.
    On page 198, on the left-hand side of the first partial derivative formula, wi should be wj.
    On page 198, "(the second point along the black line in figure 7.6)" should be "(the second point along the white line in figure 7.6)".
    On page 214, on the right-hand side of the mini-batch weight-update formula, w should be wj and y should be gamma.
    Choonoh Lee (14) [Avatar] Offline
    On page 219, "two algorithms that can be used for both classification and clustering." should be "two algorithms that can be used for both classification and regression."
    In Figure 8.2, w1 = 2 should be w1 = -2 in the right graph.
    On page 231, "you can see how well it performs on the training dataset." should be "you can see how well it performs on the validation dataset."
    On page 239, "we took the housing dataset used in the previous chapter" should be "we took the adult dataset used in this chapter".
    In Figure 8.8, "< 48" should be "<= 48".
    On page 241, the book mentions "the dark gray and light gray arrows", but I can't see color differences in Figure 8.8.
    On page 241, "j = i" in the entropy formula should be "j = 1".
    Choonoh Lee (14) [Avatar] Offline
    The two tables in Figure 8.7 don't contain the same content.
    There are five true examples in the left table, but four in the right table.
    Besides, the fifth row of the right table has age = 58 but is classified into the "<= 48" group.
    Because of these inconsistencies, the entropy calculation on page 241 and the information-gain calculation on page 242 are invalid.
    (I'm not sure which table is valid, so I couldn't fix these equations either.)
    303763 (9) [Avatar] Offline
    The last line in Section 14.3.1 reads:

    See section 14.4.1 on how to load files into H2O directly, or load them by parsing them into SQL DataFrames, as you did in this section.

    But maybe it should read Spark DataFrames?
    303763 (9) [Avatar] Offline
    Section 12.1.8: spelling error.

    The directory for storing log files is determined by the parameter yarn.nodeamanager.log-dirs (set in yarn-site.xml), which defaults to <Hadoop installation directory>/logs/userlogs.

    Should be yarn.nodemanager.log-dirs (there is an extra "a" in the text).
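For anyone editing the configuration by hand, the corrected property name in yarn-site.xml looks like this (the value shown is just an example path, not necessarily your installation's default):

```xml
<!-- yarn-site.xml: correct property name; the value is an example -->
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>/var/log/hadoop-yarn/userlogs</value>
</property>
```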
    Choonoh Lee (14) [Avatar] Offline
    Thankfully, the authors replaced the whole example in figures 8.7 and 8.8.
    Now the example makes sense and has no errors.

    However, on page 242, the sentences about the information gain of the 'education' column are now entirely incorrect.
    Those sentences should be corrected or removed.
    Choonoh Lee (14) [Avatar] Offline
    In the last (sixth) errata correction list, the example equations for entropy and information gain are corrected.
    However, the corrected versions of these equations seem to correspond to the original tables in Figure 8.7, which had some inconsistent data between the left and right tables.
    In the last errata correction list, this figure was also modified, and there are five positive examples, not four.

    Is the corrected version of Figure 8.7 supposed to be discarded?
    Calvin (13) [Avatar] Offline
    Chapter 4, section 4.3.3

    The mergeComb function is incorrect; it reads:
    def mergeComb:((Double,Double,Int,Double),(Double,Double,Int,Double)) =>  (Double,Double,Int,Double) = {
       case((mn1,mx1,c1,tot1),(mn2,mx2,c2,tot2)) => (scala.math.min(mn1,mn1),scala.math.max(mx1,mx2),c1+c2,tot1+tot2)
    }

    mn1 is accidentally repeated when calculating the minimum, when instead it should be
    scala.math.min(mn1, mn2)

    Here's a cleaned-up version with the problem fixed. Note that def need not be used; val can be used instead, because this is a function literal. The same goes for the other functions defined along with this one.
      val mergeComb: ((Double, Double, Int, Double), (Double, Double, Int, Double)) => (Double, Double, Int, Double) = {
        case ((min1, max1, count1, total1), (min2, max2, count2, total2)) =>
          (scala.math.min(min1, min2), scala.math.max(max1, max2), count1 + count2, total1 + total2)
      }
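A quick sanity check of the fixed function (the sample tuples are my own, not from the book):

```scala
// Merging two per-partition (min, max, count, total) summaries,
// as combineByKey's merge-combiners step does; sample values are my own
val mergeComb: ((Double, Double, Int, Double), (Double, Double, Int, Double)) => (Double, Double, Int, Double) = {
  case ((min1, max1, count1, total1), (min2, max2, count2, total2)) =>
    (scala.math.min(min1, min2), scala.math.max(max1, max2), count1 + count2, total1 + total2)
}

val merged = mergeComb((2.0, 9.0, 3, 15.0), (1.0, 7.0, 2, 8.0))
// merged == (1.0, 9.0, 5, 23.0); with the original bug, the min would wrongly stay 2.0
```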
    matplotlib (3) [Avatar] Offline
    4.1.2. Basic pair RDD functions > Counting values per key (liveBook)
    "map and sum are Scala’s standard methods and aren’t part of Spark’s API."

    Should `map` be `values`?
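For context, a plain-Scala sketch of the pattern in question (the sample Map stands in for a countByKey result; the data is my own):

```scala
// countByKey returns a plain Scala Map; values and sum below are
// Scala collection methods, not part of Spark's API (sample data is my own)
val countsPerKey: Map[Int, Long] = Map(15 -> 2L, 20 -> 1L)
val totalCount = countsPerKey.values.sum
// totalCount == 3
```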
    matplotlib (3) [Avatar] Offline
    In 5.1.3. Using SQL functions to perform calculations on data

    > In the previous sections, we covered only some of the SQL functions Spark supports. We encourage you to explore all the available functions at

    This link has expired. See also
    Susan Harkins (406) [Avatar] Offline
    An updated errata list for Spark in Action is available at Thank you for participating in the collection process. Your contributions are a great help to us and other readers.

    Susan Harkins
    Errata Editor