I have an AggregateBy that does the usual summary stats (Min/Max/Avg/etc.) on a field.
In this case I have 5000 fields (a crazy matrix of data), and it looks like it breaks at around 3000 fields.
Can't seem to find the step-state... not sure if the serialized state is too big, or what...
I'm sure I could write a single aggregate to handle all the fields at once, but I just thought it was interesting...
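For reference, this is roughly what that kind of per-field AggregateBy composition tends to look like in Cascading. It is a hedged sketch, not the actual pipe assembly from the job above; the field names ("key", "f0000"...), the choice of stats, and the helper class are all assumptions:

    // Hypothetical sketch of the pattern described above: one Min/Max/Average
    // sub-assembly per field, composed into a single AggregateBy. Field names
    // and wiring are assumptions, not the poster's actual code.
    import java.util.ArrayList;
    import java.util.List;

    import cascading.pipe.Pipe;
    import cascading.pipe.assembly.AggregateBy;
    import cascading.pipe.assembly.AverageBy;
    import cascading.pipe.assembly.MaxBy;
    import cascading.pipe.assembly.MinBy;
    import cascading.tuple.Fields;

    public class WideStats {
      public static Pipe wideStats(Pipe pipe, int fieldCount) {
        List<AggregateBy> stats = new ArrayList<AggregateBy>();

        for (int i = 0; i < fieldCount; i++) {
          Fields value = new Fields(String.format("f%04d", i));
          stats.add(new MinBy(value, new Fields("min_" + i)));
          stats.add(new MaxBy(value, new Fields("max_" + i)));
          stats.add(new AverageBy(value, new Fields("avg_" + i)));
        }

        // Every sub-assembly ends up in the serialized step state,
        // so 5000 fields x 3 stats means 15000 serialized objects.
        return new AggregateBy(
          pipe,
          new Fields("key"),
          stats.toArray(new AggregateBy[stats.size()]));
      }
    }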
Does the mapper or reducer side of the AggregateBy fail?
Have you run this in Driven? It might show something useful in the drill-down performance view of the Step that chokes. I would be interested in seeing the data if you can share it (and run it on the online EAP, not the GA release; the links are shareable, just not discoverable).
FWIW, we are actively rewriting the "map"-side caching mechanism (it could plan into a prior reducer) for 2.7; this will include a flat hash array vs. a hash map. Evidence suggests that under high entropy on really large datasets (3000+ fields aggravates this) there should be a lot less thrashing than with hashmap operations. I will add that this mechanism is pluggable for situations where the current impl isn't a win or you have a better strategy.
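To illustrate the flat-hash-array idea only (this is a toy sketch of the general technique, not the Cascading 2.7 implementation): a fixed-size slot table where a colliding key evicts the current occupant downstream instead of growing or rehashing, so high-entropy keys degrade toward pass-through rather than thrashing a HashMap.

    // Toy flat-array partial-aggregation cache; all names are illustrative.
    import java.util.function.BiConsumer;

    public class FlatPartialSumCache {
      private final Object[] keys;
      private final long[] sums;
      private final BiConsumer<Object, Long> emit; // downstream (reducer) sink

      public FlatPartialSumCache(int capacity, BiConsumer<Object, Long> emit) {
        this.keys = new Object[capacity];
        this.sums = new long[capacity];
        this.emit = emit;
      }

      public void add(Object key, long value) {
        int slot = Math.floorMod(key.hashCode(), keys.length);
        if (keys[slot] == null) {             // empty slot: start a partial sum
          keys[slot] = key;
          sums[slot] = value;
        } else if (keys[slot].equals(key)) {  // hit: fold into the partial sum
          sums[slot] += value;
        } else {                              // collision: evict, then reuse slot
          emit.accept(keys[slot], sums[slot]);
          keys[slot] = key;
          sums[slot] = value;
        }
      }

      public void flush() {                   // emit whatever is still cached
        for (int i = 0; i < keys.length; i++)
          if (keys[i] != null) emit.accept(keys[i], sums[i]);
      }
    }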
The Map side fails before it can get the first tuple.
I think the problem is much simpler than that. The Flow can't really start because (I believe) the tasks can't find the step-state... So my guess is each field has an AggregateBy object serialized, and for whatever reason the serialized blob gets too big for something somewhere...
Cool 2.7 feature, though. I've run into that problem before: Hive completely choked on a high-entropy Sum(). I think there is a switch in there to 'give up' on the map-side aggregation if there is too much thrashing, but it didn't work. Plain ol' reduce-side aggregation was 10x faster.
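For what it's worth, the Cascading analogue of falling back to plain reduce-side aggregation is to skip the AggregateBy composite and use GroupBy plus an Every/Aggregator directly. A minimal sketch, with the field names assumed:

    // Reduce-side-only aggregation: no map-side cache to thrash.
    import cascading.operation.aggregator.Sum;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.tuple.Fields;

    public class ReduceSideSum {
      public static Pipe reduceSideSum(Pipe pipe) {
        Pipe grouped = new GroupBy(pipe, new Fields("key"));
        // Sum runs entirely in the reducer over each grouping.
        return new Every(grouped, new Fields("value"), new Sum(new Fields("sum")), Fields.ALL);
      }
    }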
I'm curious as to where this is actually failing. No chance for a stack trace?