Wednesday, December 31, 2014

[cdh-user] Is "NLineInputFormat skips first line of last InputSplit" patched with CDH 4.7.0?

I spun up a cluster on EC2 and installed CDH from packages using Cloudera Manager.
I found that one line from the last file split is being skipped when using NLineInputFormat.
This appears to be a bug that was reported in the past.


Can you please advise on how to proceed?
Do the latest CDH parcels include this patch?



Yes, this change has been in all CDH4 releases since CDH 4.2.0, and in all CDH5 releases. Are you using the MR1 or the MR2 framework (and client libraries)?



MR2. org.apache.hadoop.mapreduce.lib.input.NLineInputFormat



MR2 means MapReduce running on YARN; it does not refer to the new mapreduce.* API package. Are you using YARN+MR2? Looking at the sources, the backport only made it to the MR2 APIs, not the MR1 ones.
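
One way to check which framework a client actually submits to is to read mapreduce.framework.name from the client configuration. The sketch below is only an illustration: the helper name framework_name is hypothetical, the file path is assumed from the usual CDH layout, and the parsing assumes <value> directly follows <name> inside the property. A value of yarn indicates MR2 on YARN; anything else, or a missing property, suggests the job is not going to YARN.

```shell
#!/bin/sh
# Print the value of mapreduce.framework.name from a mapred-site.xml file.
# Assumes <value> immediately follows <name> within the same <property>.
framework_name() {
    tr -d '\n' < "$1" |
        sed -n 's:.*<name>mapreduce\.framework\.name</name>[[:space:]]*<value>\([^<]*\)</value>.*:\1:p'
}

# Assumed location of the active client configuration on a CDH host.
conf=/etc/hadoop/conf/mapred-site.xml
if [ -r "$conf" ]; then
    echo "mapreduce.framework.name = $(framework_name "$conf")"
fi
```

Note that the Java package used in the job (org.apache.hadoop.mapred.* or org.apache.hadoop.mapreduce.*) does not affect this value; both API packages can be present under either framework.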



Sorry, just for my own understanding: what do you mean by "MR2 would mean MR over YARN"? Isn't YARN considered MR2?



I am not using YARN. I am using MR1.
Can you guide me on how to proceed?
Is it sufficient to spin up the Hadoop cluster with the YARN binaries, with no changes needed in the code?



Yes, that would work (if you're okay using YARN; I'd recommend using YARN via CDH5, as it was not ready as a platform in CDH4). On the code end, if you are using Maven you may also need to switch to the non-"mr1" dependencies.



OK, thanks.
I created the cluster with the YARN service included.
I removed the default alternative using the command below, which made it fall back to /etc/hadoop/conf.cloudera.yarn1:

sudo update-alternatives --remove hadoop-conf /etc/hadoop/conf.cloudera.mapreduce1

And then it worked without any changes to the code. I will update the dependency as well, to be on the safe side.



Is there a way to directly point the alternatives to conf.cloudera.yarn1 instead of conf.cloudera.mapreduce1 during the cluster installation?



In CDH5 installations that should automatically be the case, but you
can adjust "Alternatives Priority" configuration under YARN to be a
value higher than the MapReduce service value. More on this is
described at http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-1-x/Cloudera-Manager-Managing-Clusters/cm5mc_client_config.html
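
To confirm which client configuration a host ends up with, the alternatives link can be resolved directly. This is a minimal sketch, not a Cloudera-provided tool: the helper name conf_framework is hypothetical, the directory naming pattern is inferred only from the two paths quoted in this thread, and /etc/hadoop/conf is assumed to be the link managed by the hadoop-conf alternatives group.

```shell
#!/bin/sh
# Classify a Hadoop client-config directory by its name.
# The conf.cloudera.<service> pattern comes from the paths in this thread;
# other deployments may use different names.
conf_framework() {
    case "$1" in
        */conf.cloudera.yarn*)      echo "yarn" ;;
        */conf.cloudera.mapreduce*) echo "mapreduce" ;;
        *)                          echo "unknown" ;;
    esac
}

# Resolve the directory that /etc/hadoop/conf currently points at (assumed link).
active=$(readlink -f /etc/hadoop/conf 2>/dev/null || true)
echo "active client config: ${active:-<not found>} ($(conf_framework "$active"))"

# To inspect or switch by hand (needs root), uncomment:
# sudo update-alternatives --display hadoop-conf
# sudo update-alternatives --set hadoop-conf /etc/hadoop/conf.cloudera.yarn1
```

A manual --set puts the hadoop-conf group into manual mode, so later priority changes are ignored until it is returned to automatic mode with update-alternatives --auto hadoop-conf. Adjusting the priority in Cloudera Manager, as described above, avoids that.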

