The Author Online Book Forums are Moving

The Author Online Book Forums will soon redirect to Manning's liveBook and liveVideo. All book forum content will migrate to liveBook's discussion forum and all video forum content will migrate to liveVideo. Log in to liveBook or liveVideo with your Manning credentials to join the discussion!

Thank you for your engagement in the AoF over the years! We look forward to offering you an improved forum experience.

358699 (4)
#1
Hello,

I read your book, and I downloaded the code from https://github.com/Big-Data-Manning.

I built the project with Maven and imported it into Eclipse.

My problem is: how can I run this project?
Is there a class I need to put in a run configuration? Should I install some Hadoop components?

Thanks in advance,
Spierckel Florian
CElliott (12)
#2
I cannot answer your question authoritatively; in fact, I probably should not be trying. However, my plan is to create a main class in the same directory as E:\Development\Big-Data\src\java\manning\batchlayer\BatchWorkflow.java and then work through the code snippets in Chapter 9 one by one, instantiating the class BatchWorkflow and calling its methods. I would call initTestData() first, then create some more test data for the "ingest" example code.

The reason I am writing is that, while I did get the project to compile with Maven, I don't know how to import the project into Eclipse. Can you tell me how you did that?

Thanks in advance for any help you may care to provide.

Charles Elliott
358699 (4)
#3
You have the source code here: https://github.com/Big-Data-Manning/big-data-code
In Eclipse you have two choices. Go to File > Import > Maven and choose either:

- Check out Maven Projects from SCM: you will have to enter the GitHub address.
OR
- Existing Maven Projects: if you have already done a git clone of the project.

Hope it helps !
CElliott (12)
#4
Thank you for your reply. I did download the code as you suggested, and I chose, eventually, to import it into Eclipse using the Maven option. The Eclipse Maven plug-in is installed. The OS is Windows 8.1 Professional, 64-bit. I wrote a driver that instantiates BatchWorkflow and calls BatchWorkflow.initTestData().

I have spent about 4 days on this. There are two problems. First, there are multiple references to log4j in pom.xml, so when the code needs to call log4j, it cannot decide which binding to load. I "solved" the problem by misspelling one of the log4j references in the pom, but then it loads the NOP logger, so there is no logging at all. The second problem is a null pointer exception at org.apache.hadoop.fs.Path.<init>(Path.java:61). I downloaded two copies of Hadoop (2.6.0 and 2.7.0), and in neither of them does line 61 of org.apache.hadoop.fs.Path contain code that could throw this exception. Line 61 is, however, in a section of code that tries to interpret "X:\\" as part of a file reference. The strange part is that some of the BatchWorkflow.initTestData() code is working, because it does create the directory tree:

X:\swaroot
X:\swaroot\data
X:\swaroot\outputs
X:\swaroot\outputs\edb

However, BatchWorkflow.initTestData() apparently fails when it goes to write the data.

Whoever wrote the code in BatchWorkflow.initTestData() must never have tried it, even in a Linux environment, because the logging problem must have always been there.
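For what it's worth, a less fragile way to deal with duplicate logging bindings than misspelling a pom entry is a Maven exclusion. The fragment below is illustrative only; the real groupId/artifactId of the offending dependency has to be found in your own build (mvn dependency:tree will list the duplicates), and "some.group"/"artifact-pulling-in-log4j" are placeholders, not entries from the book's pom.xml:

```xml
<!-- Illustrative sketch: exclude the transitive logging binding that
     conflicts with the one you want to keep. "some.group" and
     "artifact-pulling-in-log4j" are placeholders, not real entries
     from the book's pom.xml. -->
<dependency>
  <groupId>some.group</groupId>
  <artifactId>artifact-pulling-in-log4j</artifactId>
  <version>x.y.z</version>
  <exclusions>
    <exclusion>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-log4j12</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

With a single binding left on the classpath, there is no need for the misspelling trick, and the NOP fallback (which SLF4J uses when it finds no binding at all) is avoided.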

A working copy of big-data-code-master_Final.zip would be greatly appreciated.

Charles Elliott
nathan.marz (88)
#5
The way I run it is by calling BatchWorkflow.initTestData() followed by BatchWorkflow.batchWorkflow(). It works fine on my Mac, and I'm not sure why it's not working on your system. Note that this demo code relies on a fairly old version of Hadoop.
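For anyone wondering how to wire this up (the question in post #1), a minimal driver might look like the sketch below. It assumes the book project's classes and their Hadoop/Cascading dependencies are on the classpath, so it will not compile standalone; RunWorkflow is a made-up name, and the static-style calls follow the method references quoted in this thread rather than any particular version of the source.

```java
package manning.batchlayer;

// Sketch only: depends on the book's BatchWorkflow class and its
// Hadoop/Cascading dependencies, so it will not compile outside
// the big-data-code project.
public class RunWorkflow {
    public static void main(String[] args) throws Exception {
        BatchWorkflow.initTestData();   // create the sample data first
        BatchWorkflow.batchWorkflow();  // then run the full batch workflow
    }
}
```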
CElliott (12)
#6
I appreciate your book; I have finished it now and have at least some idea how big data works. However, I do wish you had read my note with the same attention to detail that I paid your book.

First, as I mentioned in my note, the problem that prevents the book's sample code from running is in the Windows-specific code. The code can create directories under Windows, but throws an exception when it tries to write to them. As you note, the sample code uses a very old version of Hadoop, and only very recent versions of Hadoop work under Windows.

Second, the sample code does not work on Windows; I don't think it ever worked on Windows, or was even tried there. So, as I said, I would have appreciated a copy that worked in my environment.
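The drive-letter behavior described above is consistent with URI-style path parsing in general: a leading drive letter such as X: parses as a URI scheme rather than as part of the path. The self-contained illustration below (not the book's code) uses plain java.net.URI to show the ambiguity:

```java
import java.net.URI;

// Demonstrates why a Windows path such as "X:/swaroot" confuses
// URI-based path handling: the drive letter parses as a URI scheme.
public class DriveLetterScheme {
    static String schemeOf(String s) {
        return URI.create(s).getScheme();
    }

    public static void main(String[] args) {
        System.out.println(schemeOf("X:/swaroot"));   // drive letter becomes the scheme: X
        System.out.println(schemeOf("/tmp/swaroot")); // Unix-style path has no scheme: null
    }
}
```

Code built on URI semantics therefore needs explicit drive-letter handling on Windows, which is the kind of special case later Hadoop releases added.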
352774 (5)
#7
The book explains the concept of the Lambda Architecture very well. The concepts introduced are explained using real-world tools and code samples in "illustration" chapters, with justification for why those tools were chosen. Finally, the entire "SuperWebAnalytics.com" implementation is available on GitHub.

With all this, readers of the book are still unable to see a working implementation of the concepts, mainly because of the lack of clear instructions on how to use the code.

That's really frustrating. While the final edition of the book was released just a few months ago, the code itself has not been kept up to date with the latest versions of Hadoop.

Tools/frameworks used (such as Pail) have not seen any updates for the last year or more!

The feeling of having bought a good book ends, finally, in the frustration of being unable to see a working implementation of the concepts.
352774 (5)
#8
I've tried to run the batch-layer code on Hadoop 2.7.0 (with both a single-node pseudo-cluster and a 1 master + 2 slaves configuration). I am getting the following exception. I am new to the Hadoop ecosystem and Big Data; I have only a theoretical understanding from the Big Data book. Any help fixing this issue would be appreciated.

ubuntu@namenode:~/bigdata-book/big-data-code/target$ hadoop jar big-data-book-1.0.0-SNAPSHOT.jar manning.batchlayer.BatchWorkflow
15/07/16 13:08:20 INFO client.RMProxy: Connecting to ResourceManager at namenode/172.31.22.27:8050
15/07/16 13:08:20 INFO client.RMProxy: Connecting to ResourceManager at namenode/172.31.22.27:8050
15/07/16 13:08:21 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/07/16 13:08:24 INFO mapreduce.JobSubmitter: number of splits:1
15/07/16 13:08:24 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1437051695857_0003
15/07/16 13:08:24 INFO impl.YarnClientImpl: Submitted application application_1437051695857_0003
15/07/16 13:08:24 INFO mapreduce.Job: The url to track the job: http://namenode:8088/proxy/application_1437051695857_0003/
15/07/16 13:08:44 INFO util.HadoopUtil: resolving application jar from found main method on: manning.batchlayer.BatchWorkflow
15/07/16 13:08:44 INFO planner.HadoopPlanner: using application jar: /home/ubuntu/bigdata-book/big-data-code/target/big-data-book-1.0.0-SNAPSHOT.jar
15/07/16 13:08:44 INFO property.AppProps: using app.id: 9246BB150AD022459F7255BE576C2787
15/07/16 13:08:44 INFO Configuration.deprecation: mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
15/07/16 13:08:44 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
15/07/16 13:08:44 INFO util.Version: Concurrent, Inc - Cascading 2.0.0
15/07/16 13:08:44 INFO flow.Flow: [] starting
15/07/16 13:08:44 INFO flow.Flow: []  source: PailTap["PailScheme[['pail_root', 'bytes']->[ALL]]"]["/tmp/swa/newDataSnapshot"]"]
15/07/16 13:08:44 INFO flow.Flow: []  sink: PailTap["PailScheme[['pail_root', 'bytes']->['?data']]"]["/tmp/swa/shredded"]"]
15/07/16 13:08:44 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
15/07/16 13:08:44 INFO flow.Flow: []  parallel execution is enabled: true
15/07/16 13:08:44 INFO flow.Flow: []  starting jobs: 1
15/07/16 13:08:44 INFO flow.Flow: []  allocating threads: 1
15/07/16 13:08:44 INFO flow.FlowStep: [] starting step: (1/1) /tmp/swa/shredded
15/07/16 13:08:44 INFO client.RMProxy: Connecting to ResourceManager at namenode/172.31.22.27:8050
15/07/16 13:08:44 INFO client.RMProxy: Connecting to ResourceManager at namenode/172.31.22.27:8050
15/07/16 13:08:46 INFO mapreduce.JobSubmitter: number of splits:1
15/07/16 13:08:46 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1437051695857_0004
15/07/16 13:08:46 INFO impl.YarnClientImpl: Submitted application application_1437051695857_0004
15/07/16 13:08:46 INFO mapreduce.Job: The url to track the job: http://namenode:8088/proxy/application_1437051695857_0004/
15/07/16 13:08:46 INFO flow.FlowStep: [] submitted hadoop job: job_1437051695857_0004
15/07/16 13:09:33 WARN flow.FlowStep: [] task completion events identify failed tasks
15/07/16 13:09:33 WARN flow.FlowStep: [] task completion events count: 4
15/07/16 13:09:33 WARN flow.FlowStep: [] event = Task Id : attempt_1437051695857_0004_m_000000_0, Status : FAILED
15/07/16 13:09:33 WARN flow.FlowStep: [] event = Task Id : attempt_1437051695857_0004_m_000000_1, Status : FAILED
15/07/16 13:09:33 WARN flow.FlowStep: [] event = Task Id : attempt_1437051695857_0004_m_000000_2, Status : FAILED
15/07/16 13:09:33 WARN flow.FlowStep: [] event = Task Id : attempt_1437051695857_0004_m_000000_3, Status : TIPFAILED
15/07/16 13:09:33 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
15/07/16 13:09:33 INFO flow.Flow: [] stopping all jobs
15/07/16 13:09:33 INFO flow.FlowStep: [] stopping: (1/1) /tmp/swa/shredded
15/07/16 13:09:33 INFO impl.YarnClientImpl: Killed application application_1437051695857_0004
15/07/16 13:09:33 INFO flow.Flow: [] stopped all jobs
15/07/16 13:09:33 INFO util.Hadoop18TapUtil: deleting temp path /tmp/swa/shredded/_temporary
Exception in thread "main" cascading.flow.FlowException: step failed: (1/1) /tmp/swa/shredded, with job id: job_1437051695857_0004, please see cluster logs for failure messages
352774 (5)
#9