Saturday, 3 January 2015

[cascading-user] Re: Plunger: A unit testing framework for Cascading

Sorry for the delay in getting back to you on this. I do in fact have a suggestion that stems from our experiences when performing unit testing on complex flows and cascades. For the most part we can effectively leverage existing Java tooling to build suites of tests. We can also practice good unit testing behaviours by composing our flows from modular assemblies and, of course, operations. These components are readily testable. However, in regard to assemblies and larger scale tests of flows and cascades, we find that our Cascading development short-circuits a fundamental piece of automated testing practice: the measurement and reporting of code coverage with tools such as Cobertura.

As it stands it is simple to attain 100% test coverage of assemblies and flows because, in the true Java sense, we are simply exercising the construction logic and not the data processing logic that results from said construction. However, to truly measure test coverage in these instances, what we really need is a way to check that every vertex of the process's corresponding DAG has been exercised. I imagine that this would be as simple as measuring whether or not a vertex (pipe) has transported one or more Tuples. As it stands this is a mental exercise: we can imagine the graph and consider appropriate test scenarios to attain full coverage. However, this is prone to error, especially if the actual DAG differs from the one we think we've constructed (human error).

As a solution to this, it'd be great if there were some generic hooks into our Flows, Cascades, and Assemblies onto which we could build some tooling. I imagine such tools would interrogate the DAG after a test execution and report the names of pipes that did not transport any Tuples.
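To make the idea concrete, here is a minimal sketch of the reporting side in plain Java. It assumes the per-pipe tuple counts have already been gathered by some hook (for example from flow counters after a test run); only the zero-traffic report is shown, and the class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical DAG-coverage check: given per-pipe tuple counts collected
 * after a test execution, report the pipes that transported no Tuples.
 */
public final class DagCoverage {

  /** Returns the names of pipes whose tuple count is zero. */
  public static List<String> uncoveredPipes(Map<String, Long> tuplesPerPipe) {
    List<String> uncovered = new ArrayList<>();
    for (Map.Entry<String, Long> entry : tuplesPerPipe.entrySet()) {
      if (entry.getValue() == 0L) {
        uncovered.add(entry.getKey());
      }
    }
    return uncovered;
  }
}
```

A test would then simply assert that `uncoveredPipes(...)` is empty after the suite has run.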

I'd be keen to hear your thoughts on this.


On 3 November 2014 at 17:01, Chris K Wensel <chris@wensel.net> wrote:
This is great. Will add it to our .org site.

Anything we can do in Cascading core and test apis to help improve things? We are working on 2.7 (in tandem with 3.0), so any suggestions now would help people with a 2.x -> 3.x migration we hope.

ckw

On Nov 3, 2014, at 4:30 AM, Elliot West <teabot@gmail.com> wrote:
Greetings,
Hotels.com are pleased to announce the contribution of a project to the Cascading open source community. ‘Plunger’ is a unit testing framework for Cascading applications whose primary aim is to simplify the creation of automated tests for cascades, flows, assemblies and operations.
At Hotels.com Cascading is the basis for numerous large scale ETL processing jobs. For us Cascading has many virtues, however we were particularly attracted by its amenability to automated testing. We rely heavily on the suites of tests that we’ve developed for our applications and therefore are always keen to lower the effort required to implement them. With this in mind we developed Plunger to streamline the development of Cascading tests. Plunger reduces boilerplate code and provides a concise API for exercising all aspects of Cascading applications. Key features include:
  • A fluent API for declaring test data.
  • A harness for rapidly connecting, exercising, and verifying assemblies.
  • Sourcing and sinking test data from and to taps.
  • Assertions for common Cascading record types.
  • Stub builders for exercising operation implementations.
  • Component serialization verification.
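Purely as an illustration of the fluent style the first bullet describes, here is a self-contained sketch of a fluent test-data builder. These names are invented for illustration and are not Plunger's actual API; see the project README for the real one:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Illustrative fluent builder for declaring test data.
 * NOT Plunger's API; names are hypothetical.
 */
public final class TestData {

  private final List<String> fields;
  private final List<Object[]> rows = new ArrayList<>();

  private TestData(List<String> fields) {
    this.fields = fields;
  }

  public static TestData withFields(String... fields) {
    return new TestData(Arrays.asList(fields));
  }

  public TestData addRow(Object... values) {
    if (values.length != fields.size()) {
      throw new IllegalArgumentException("Expected " + fields.size() + " values");
    }
    rows.add(values);
    return this; // fluent: calls chain
  }

  public int rowCount() {
    return rows.size();
  }
}
```

Usage reads as a single declaration: `TestData.withFields("name", "score").addRow("a", 1).addRow("b", 2)`.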
The project can be found on GitHub and is available under the Apache 2.0 license: https://github.com/HotelsDotCom/plunger
We hope that you find Plunger useful and welcome any feedback or contributions that you may have.
Many thanks - Elliot.
Elliot West
Software Dev Engineer II
Hotels.com



I've tried moving Plunger to cascading-3.0-wip-61 but have encountered a few issues with our code that writes to taps. We've found this feature very useful in practice, helping us get around subtle differences between local/hadoop tap implementations, for building tests that use our own Tap implementations, and for moving test data into easily maintainable Java code instead of being baked into non-human-readable files. To be fair, I was always aware that I was creating some brittle implementations, as I was having to dig down into Cascading internals to obtain the behaviour I wanted. The methods in question are located here:
Example usage scenarios are visible in the tests:
The motivation behind this class is to enable the creation of truly representative data in any format described by a Tap instance. This then allows the creation of integration tests that are as close to a production environment as possible. While this is achievable with a sink, identity pipe, and the respective Tap, such an approach introduces the overhead of executing additional Hadoop/local processes just to write data. Plunger's implementation constructs the minimum amount of scaffolding required around the tap and exercises it directly.

However, with Cascading 3 I see that the scaffolding I create may now need to become more complex. Previously, in Cascading 2, we needed to create a HadoopFlowStep to clean up the '_temporary' folder. Fortunately this was simple to construct. However, in the 3.0.0 version HadoopFlowStep requires more complex initialisation, requiring an ElementGraph and a FlowNodeGraph. These in turn also require complex initialisation values.

Now at this point my experience is telling me that I'm trying to do something that I shouldn't. But in practice we've found this feature very useful so I'm keen to persevere. Would it be possible to structure Taps in such a way that they can be used to write data outside of a flow with minimal dependencies?



Sorry for the delay, holidays and all.

This is totally reasonable, but I don’t have any comments really other than I’m open for suggestions.

And Cascading 3 would be the place to introduce such changes.

one way to do this could be to write a simple rule that injected a Counter operation at the head of every branch. but then you would need logic that could unwind the counters. and work in such a way that you could reconcile the counts across multiple topologies (mr, dag, local, etc)
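A sketch of what that per-branch injection might look like, using Cascading's `cascading.operation.state.Counter` operation. The group name "coverage" and the use of the pipe name as the counter key are assumptions, and the unwinding/reconciliation logic Chris mentions is not shown:

```java
import cascading.operation.state.Counter;
import cascading.pipe.Each;
import cascading.pipe.Pipe;

/**
 * Hedged sketch of the rule described above: prefix a branch head with a
 * Counter operation so per-branch tuple counts appear in the flow counters.
 */
public final class CoverageInstrumentation {

  public static Pipe instrument(Pipe branchHead) {
    // Counter increments the named counter once per tuple that passes
    // through; here the counter is keyed by the branch's pipe name.
    return new Each(branchHead, new Counter("coverage", branchHead.getName()));
  }
}
```

After a test run, a zero-valued "coverage" counter would indicate a branch that transported no Tuples.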



to open a tap to write, you should only need

new Hfs(…).openForWrite( new HadoopFlowProcess() )

See CascadingTestCase for lots of test helpers.
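Expanding that one-liner into a fuller sketch of writing test data through a tap outside of any Flow. The path, fields, and scheme here are illustrative choices, not anything prescribed by Cascading:

```java
import cascading.flow.hadoop.HadoopFlowProcess;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntryCollector;

public class TapWriteExample {

  public static void main(String[] args) throws Exception {
    // Open the tap directly for writing; no Flow is constructed.
    Hfs tap = new Hfs(new TextDelimited(new Fields("name", "value"), "\t"), "/tmp/test-data");
    TupleEntryCollector collector = tap.openForWrite(new HadoopFlowProcess());
    try {
      collector.add(new Tuple("a", 1));
      collector.add(new Tuple("b", 2));
    } finally {
      collector.close(); // flushes and finalises the output
    }
  }
}
```
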

