albert.ortizl@gmail.com (5)
#1
Hi there,

Lately we've spent a considerable amount of time trying to figure out the reason behind a misleading case (which we'll describe below), and we still can't understand the conditions that trigger the tuple failure. So we thought we'd ask the pros and experienced folks out there, hoping you can guide us through the dark paths of Storm behaviour. Here we go:

We've been performing several stress tests, playing around with different configuration settings. We started off by setting one instance per bolt/spout (hint: 1, tasks: 1) and everything worked smoothly, no failures at all. Then we decided to put some strain on the rest of the topology by using the following configuration (a rough wiring sketch follows the list):

- RabbitMQ spout (hint: 2, tasks:4)
- Discretization bolt (hint: 1, tasks:1)
- Fatten-The-Pig bolt (hint: 1, tasks:1)
- Bouncer bolt (hint: 1, tasks:1)
- Weighting bolt (hint: 1, tasks:1)
- Sysout bolt (hint: 1, tasks:1)
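
To make that layout concrete, here is roughly how the topology is wired (a simplified sketch of v0.9.2 code; the groupings are only placeholders for this example, and this is not our actual build code):

import backtype.storm.topology.TopologyBuilder;

// Simplified wiring sketch: parallelism hint = executors, setNumTasks = tasks.
TopologyBuilder builder = new TopologyBuilder();

// 2 executors, 4 tasks for the spout
builder.setSpout("rabbitmq-spout", new RabbitMQSpout(), 2).setNumTasks(4);

// 1 executor, 1 task for each bolt
builder.setBolt("discretization-bolt", new DiscretizationBolt(), 1).setNumTasks(1)
       .shuffleGrouping("rabbitmq-spout");
builder.setBolt("fatten-the-pig-bolt", new FattenThePigBolt(), 1).setNumTasks(1)
       .shuffleGrouping("discretization-bolt");
builder.setBolt("bouncer-bolt", new BouncerBolt(), 1).setNumTasks(1)
       .shuffleGrouping("fatten-the-pig-bolt");
builder.setBolt("weighting-bolt", new WeightingBolt(), 1).setNumTasks(1)
       .shuffleGrouping("bouncer-bolt");
builder.setBolt("sysout-bolt", new SysOutBolt(), 1).setNumTasks(1)
       .shuffleGrouping("weighting-bolt");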

Also, in order to have more visibility into how contention and backpressure affect overall performance, we used the following configuration:

config.put(Config.TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE, 128);
config.put(Config.TOPOLOGY_EXECUTOR_SEND_BUFFER_SIZE, 128);
config.put(Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS, 120);
config.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 64);


Now, after a while, we started getting traces like the ones below:

2014-08-28 10:44:28 c.d.i.d.s.s.RabbitMQSpout [DEBUG] New reading has been retrieved and successfully parsed [uuid: '4e0ff06d-77a1-4341-ab4e-79cc7a33b3d2', msg: '279', task: '7']
2014-08-28 10:44:28 c.d.i.d.s.s.RabbitMQSpout [DEBUG] EMIT [msg: '4e0ff06d-77a1-4341-ab4e-79cc7a33b3d2']
2014-08-28 10:44:45 b.s.d.executor [DEBUG] Failing message 279: {:stream "default", :values ["4e0ff06d-77a1-4341-ab4e-79cc7a33b3d2" "genericftp-177091" "1" "2014-08-12T21:30:00.000 UTC" "127.0.0.1" "2014-08-27 15:50:42.375541" (#<Tuple (402,1448.3394702747873)>)]}
2014-08-28 10:44:56 c.d.i.d.s.b.DiscretizationBolt [INFO] Discretization requested [uuid: '4e0ff06d-77a1-4341-ab4e-79cc7a33b3d2', nnid: '1', ts: '2014-08-12T21:30:00.000 UTC', raw: '[(402,1448.3394702747873)]']
2014-08-28 10:44:56 c.d.i.d.s.b.DiscretizationBolt [DEBUG] EMIT [msg: '4e0ff06d-77a1-4341-ab4e-79cc7a33b3d2']
2014-08-28 10:44:56 c.d.i.d.s.b.DiscretizationBolt [DEBUG] ACK [msg: '4e0ff06d-77a1-4341-ab4e-79cc7a33b3d2']
2014-08-28 10:44:58 c.d.i.d.s.b.FattenThePigBolt [DEBUG] Tuple enrichment requested [uuid: '4e0ff06d-77a1-4341-ab4e-79cc7a33b3d2', gwmac: 'genericftp-177091', nid: '1']
2014-08-28 10:44:58 c.d.i.d.s.b.FattenThePigBolt [DEBUG] Mote matched [uuid: '4e0ff06d-77a1-4341-ab4e-79cc7a33b3d2', gwmac: 'genericftp-177091', nid: '1', mt_id: '74735', gw_id: '8879']
2014-08-28 10:44:58 c.d.i.d.s.b.FattenThePigBolt [INFO] Tuple successfully enriched [uuid: '4e0ff06d-77a1-4341-ab4e-79cc7a33b3d2', gwmac: 'genericftp-177091', nid: '1', mt_id: '74735', gw_id: '8879']
2014-08-28 10:44:58 c.d.i.d.s.b.BouncerBolt [DEBUG] Mote dealing with The Bouncer [uuid: '4e0ff06d-77a1-4341-ab4e-79cc7a33b3d2', mt_id: '74735', gw_id: '8879']
2014-08-28 10:44:58 c.d.i.d.s.b.BouncerBolt [DEBUG] Mote fullfils specifications. Now checking datapoints [uuid: '4e0ff06d-77a1-4341-ab4e-79cc7a33b3d2', mt_id: '74735', gw_id: '8879']
2014-08-28 10:44:58 c.d.i.d.s.b.BouncerBolt [INFO] Mote accepted [uuid: '4e0ff06d-77a1-4341-ab4e-79cc7a33b3d2', mt_id: '74735', gw_id: '8879']
2014-08-28 10:44:58 c.d.i.d.s.b.WeightingBolt [INFO] Received '1' readings to be weighted [uuid: '4e0ff06d-77a1-4341-ab4e-79cc7a33b3d2', up-to-ts: '2014-08-12T23:30:00.000+02:00', mt_id: '74735']
2014-08-28 10:44:58 c.d.i.d.s.b.WeightingBolt [DEBUG] Weighting datapoint '402' [uuid: '4e0ff06d-77a1-4341-ab4e-79cc7a33b3d2', up-to-ts: '2014-08-12T23:30:00.000+02:00', mt_id: '74735']
2014-08-28 10:44:58 c.d.i.d.s.b.SysOutBolt [INFO] SYSOUT :== source: weighting-bolt:12, stream: default, id: {-4612146741903420812=-8379812311928270863}, [4e0ff06d-77a1-4341-ab4e-79cc7a33b3d2, genericftp-177091, 1, 2014-08-12T21:30:00.000 UTC, 127.0.0.1, 2014-08-27 15:50:42.375541, [DiscretizedReading(datapoint=402, discreteValue=3.3479927538794527, accumulatedValue=1448.3394702747873, frecuency=NONE, operation=ACUM, refTimestamp=2014-08-12T21:15:00.000 UTC, dirty=false)], Mote(gateway=Gateway(id=8879, mac=genericftp-177091, timezone=Europe/London, connected=true, registered=true, depId=1659), id=74735, status=PENDING, nid=1, name=device 1, datapoints=[Datapoint [id=127326, networkid=402, status=PENDING, mote=74735]]), {402=[WeightedReading(year=2014, month=8, day=12, hour=22, minute=15, value=1.1159975846264842, frequency=NONE, dst=true), WeightedReading(year=2014, month=8, day=12, hour=22, minute=20, value=1.1159975846264842, frequency=NONE, dst=true), WeightedReading(year=2014, month=8, day=12, hour=22, minute=25, value=1.1159975846264842, frequency=NONE, dst=true), WeightedReading(year=2014, month=8, day=12, hour=22, minute=30, value=0.0, frequency=NONE, dst=true)]}]
2014-08-28 10:44:58 c.d.i.d.s.b.WeightingBolt [INFO] Weighting emited [uuid: '4e0ff06d-77a1-4341-ab4e-79cc7a33b3d2', up-to-ts: '2014-08-12T23:30:00.000+02:00', mt_id: '74735']


See how a tuple is flagged as failed after it is emitted by the spout, yet it still goes on until the end of the stream (SysOutBolt). As far as we understand, this tuple did not fail; it reached the final bolt as expected, but despite that it is marked as 'FAILED'. We've detected that we increased latency on the first bolt (discretization) up to the point where its capacity is above one, but we don't want to scale out until we figure out why we're getting these false failures on some tuples. Before you suggest the usual stuff from the Google groups: we've double-checked that all acks and emits are correctly invoked, so everything is anchored as expected (sketch below).
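
To show what we mean by "anchored as expected": every bolt follows the standard anchor-then-ack pattern. Here is a minimal sketch of that pattern (this is not our actual DiscretizationBolt, just an illustration; the fields and output names are placeholders):

import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Minimal sketch of the anchor-then-ack pattern used in every bolt (illustrative only).
public class DiscretizationBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        Values discretized = discretize(input);   // domain logic omitted
        collector.emit(input, discretized);       // emit anchored to the incoming tuple
        collector.ack(input);                     // ack only after the anchored emit
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("uuid", "discretized-reading"));
    }

    private Values discretize(Tuple input) {
        // placeholder for the real discretization logic
        return new Values(input.getString(0), "discretized");
    }
}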

By the way, we're using Storm v0.9.2.

Cheers and thanks in advance.
sean.allen (19)
#2
Re: [URGENT] False tuple failure?
It's pretty hard to know what is going on without source code.
albert.ortizl@gmail.com (5)
#3
Re: [URGENT] False tuple failure?
Hi,

Sorry; source code aside, our main problem is that random false-negative fails are reported to our spouts (anywhere from 1 second to 20-30 seconds after emit), yet the tuples complete their full execution trees (100%).

We know these fails are triggered by the framework (executor.clj), but if you look at our configuration you can see it isn't the usual suspects: message timeouts, anchoring, acks or emits.

Tracing tuples, we could see that when bolt capacities are high (over 1 to 1.5), Storm marks some tuples as failed (before they even enter the execute method), but those tuples are eventually processed!
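
For clarity, the failures surface in the spout's ack/fail callbacks. Roughly like this (an illustrative sketch inside our RabbitMQSpout, not our actual code; 'pending' is a hypothetical bookkeeping map for the example):

// Illustrative sketch of the spout's ack/fail callbacks (BaseRichSpout overrides).
@Override
public void ack(Object msgId) {
    LOG.debug("ACK [msg: '{}']", msgId);
    pending.remove(msgId);
}

@Override
public void fail(Object msgId) {
    // Invoked by Storm's executor (executor.clj) when it considers the tuple tree failed,
    // even though our traces show the same tuple later reaching SysOutBolt.
    LOG.warn("FAIL [msg: '{}']", msgId);
    pending.remove(msgId);   // a real spout would typically replay or re-queue here
}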

Thanks for helping us; if you want, we can send our code!
albert.ortizl@gmail.com (5)
#4
Re: [URGENT] False tuple failure?
[SOLVED - IMPORTANT]

Hi there!

We finally figured out what was happening!

Bolts have a buffer in front of their execute methods; this receive queue is handled by Storm with the LMAX Disruptor.

When a bolt has high latency, these queues fill up.

It seems that if your spouts have more tasks than executors (I mean, 2 executors and 4 tasks), some tuples will be marked as failed (false negatives) even though they are fully processed.

[SOLUTION]: Give spouts the same number of tasks as executors: 2 executors -> 2 tasks, 4 -> 4, 8 -> 8.
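
In code the change looks roughly like this (illustrative, reusing the component names from the wiring sketch above):

// Before (2 executors, 4 tasks) -- produced the false-negative fails:
builder.setSpout("rabbitmq-spout", new RabbitMQSpout(), 2).setNumTasks(4);

// After (2 executors, 2 tasks) -- no more false negatives:
builder.setSpout("rabbitmq-spout", new RabbitMQSpout(), 2).setNumTasks(2);
// (equivalently, drop setNumTasks: the task count then defaults to the parallelism hint)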

[RESULT] :
- 0% false-negative fails
- 100% of the remaining fails are genuine (topology timeouts, exceptions, or explicit fails via _collector.fail(...))
sean.allen (19)
#5
Re: [URGENT] False tuple failure?
That's an odd thing to cause the problem.

It doesn't feel like the root of the problem, based on my understanding of Storm's internals. Changing the number of executors and tasks shouldn't change the number of fails.

Whether you have 2 or 4 tasks for 2 spouts, each spout will handle the same amount of work. That shouldn't change the amount being pushed to your bolt and its receive queue.