I wonder if anyone here has used the ORC Scheme?
I've done a couple of comparisons, and it seems like there is a large overhead compared to LZO.
My test was to read LZO and write ORC, versus read LZO and write LZO.
I found that even ORC with no compression took 3x the CPU time of LZO->LZO.
I found a similar result, so I tested the same thing with Hive. Writing data into ORC does cost much more CPU than writing LZO text, sequence files, or Parquet, but the output file size is smaller (using the same compression configuration). I would say the write performance overhead is not caused by the ORC Cascading Scheme; it is a design and implementation trade-off of the ORC file format itself. For example, the built-in block-mode compression should cost more CPU on write but yields smaller output and better read performance.
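To make the size side of that trade-off concrete, here is a rough sketch comparing per-record compression with compressing many records together as one block. It is only an illustration: it uses Python's zlib as a stand-in codec, the sample records and the helper names (`per_record_size`, `block_size`) are made up, and it does not reproduce ORC's actual columnar layout or encodings.

```python
import zlib

def per_record_size(records):
    # Compress each record on its own and sum the compressed sizes.
    return sum(len(zlib.compress(r)) for r in records)

def block_size(records):
    # Concatenate all records and compress them as a single block.
    return len(zlib.compress(b"".join(records)))

# Hypothetical sample data: short, repetitive log-style rows.
records = [
    f"user_{i % 50},2015-01-{(i % 28) + 1:02d},click\n".encode()
    for i in range(10_000)
]

print("per-record bytes:", per_record_size(records))
print("block bytes:     ", block_size(records))
```

On repetitive data like this, the block version should come out far smaller, because the compressor can reuse patterns across records. It says nothing about CPU cost, which would need a real measurement against the actual file formats.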
Thanks for the response and the library!
I have yet to finish my performance comparisons of reading ORC vs. reading LZO.
I'll update as time allows...