Hello CDH users,

We have been fighting an issue with Oozie jobs that launch a large number (several hundred) of workers at the same time: some applications stay in the ACCEPTED state forever, whereas jobs submitted later run and complete just fine. Once an application is in that state, all that is left to do is kill it from the command line or restart the RM. This is on CDH 5.2.

It seems to match the behaviour observed here:
https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/gCRNTmzXNn8

I'd be grateful if people familiar with Oozie/YARN could have a look at these logs; any clues on how to solve this problem would be greatly appreciated.

Thanks!
David

2014-12-30 23:38:55,700 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=mapred IP=10.192.71.19 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1419871364283_21371
2014-12-30 23:38:55,700 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Storing application with id application_1419871364283_21371
2014-12-30 23:38:55,700 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1419871364283_21371 State change from NEW to NEW_SAVING
2014-12-30 23:38:55,700 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1419871364283_21371
...
2014-12-30 23:38:55,706 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1419871364283_21371 State change from NEW_SAVING to SUBMITTED
2014-12-30 23:38:55,734 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Accepted application application_1419871364283_21371from user: mapred, in queue: root.mapred-oozie-launcher, currently num of applications: 480
2014-12-30 23:38:55,734 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1419871364283_21371 State change from SUBMITTED to ACCEPTED
2014-12-30 23:38:55,734 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1419871364283_21371_000001
2014-12-30 23:38:55,734 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1419871364283_21371_000001 State change from NEW to SUBMITTED
... (in the section below you can see transitions for several apps at the same time: 23:38:55,738)
2014-12-30 23:38:55,738 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1419871364283_21371_000001 to scheduler from user: mapred
2014-12-30 23:38:55,738 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application appattempt_1419871364283_20922_000001 is done. finalState=FINISHED
2014-12-30 23:38:55,738 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: Application application_1419871364283_20922 requests cleared
2014-12-30 23:38:55,738 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type ATTEMPT_ADDED for applicationAttempt application_1419871364283_21371
java.util.ConcurrentModificationException
at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:859)
at java.util.ArrayList$Itr.next(ArrayList.java:831)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.getResourceUsage(FSLeafQueue.java:147)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getHeadroom(FSAppAttempt.java:180)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:923)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ScheduleTransition.transition(RMAppAttemptImpl.java:908)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ScheduleTransition.transition(RMAppAttemptImpl.java:893)
at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:757)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:110)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:765)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:746)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
at java.lang.Thread.run(Thread.java:745)
...
2014-12-30 23:47:23,222 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1419871364283_21371_01_000001 Container Transitioned from NEW to ALLOCATED
2014-12-30 23:47:23,222 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=mapred OPERATION=AM Allocated Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1419871364283_21371 CONTAINERID=container_1419871364283_21371_01_000001
2014-12-30 23:47:23,222 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Assigned container container_1419871364283_21371_01_000001 of capacity <memory:2048, vCores:1> on host **************************************:36417, which has 6 containers, <memory:13312, vCores:6> used and <memory:27756, vCores:0> available after allocation
2014-12-30 23:47:23,222 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: CONTAINER_ALLOCATED at SUBMITTED
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:757)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:110)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:765)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:746)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
at java.lang.Thread.run(Thread.java:745)
This is a bug:
- [YARN-2910] - FSLeafQueue can throw ConcurrentModificationException
There are also many other bugs in CDH 5.2.
You should upgrade to CDH 5.3, which was released on December 21, 2014.
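The stack trace above ends in ArrayList's iterator, called from FSLeafQueue.getResourceUsage, which is the typical signature of a list being modified while something else is iterating over it. Below is a minimal sketch of that failure mode in plain Java. The class and method names are hypothetical and this is not Hadoop code; the list mutation is done in the same thread only to make the exception reproducible, whereas in the ResourceManager it would come from another thread. The read/write lock shown is one common way to guard such a list, and is not necessarily how YARN-2910 was fixed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a scheduler queue that tracks per-app memory (MB).
class QueueUsageSketch {
    private final List<Integer> appMemory = new ArrayList<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Writers take the write lock, so they never overlap with a reader's iteration.
    void addApp(int memoryMb) {
        lock.writeLock().lock();
        try {
            appMemory.add(memoryMb);
        } finally {
            lock.writeLock().unlock();
        }
    }

    void removeLastApp() {
        lock.writeLock().lock();
        try {
            if (!appMemory.isEmpty()) {
                appMemory.remove(appMemory.size() - 1);
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Readers iterate under the read lock.
    int usage() {
        lock.readLock().lock();
        try {
            int total = 0;
            for (int mb : appMemory) {
                total += mb;
            }
            return total;
        } finally {
            lock.readLock().unlock();
        }
    }

    // Unguarded variant: the list changes while an iterator is live.
    // Returns true if ConcurrentModificationException was thrown.
    static boolean unguardedIterationFails() {
        List<Integer> apps = new ArrayList<>(Arrays.asList(2048, 2048, 2048));
        try {
            int total = 0;
            for (int mb : apps) {
                total += mb;
                // Simulates an app finishing while usage is being summed.
                apps.remove(apps.size() - 1);
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("unguarded iteration failed: " + unguardedIterationFails());
        QueueUsageSketch queue = new QueueUsageSketch();
        queue.addApp(2048);
        queue.addApp(2048);
        queue.removeLastApp();
        System.out.println("guarded usage (MB): " + queue.usage());
    }
}
```

Because the exception in the log is thrown inside the event dispatcher while handling ATTEMPT_ADDED, the attempt never leaves SUBMITTED, which is consistent with the later "Invalid event: CONTAINER_ALLOCATED at SUBMITTED" error.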
> This is a bug:
> - [YARN-2910] - FSLeafQueue can throw ConcurrentModificationException
Damn! I missed that one despite all the research. Thanks a lot!
As a side note, issuing a "yarn application -kill <app id>" doesn't produce a useful error code in Oozie (which prevents the retries from kicking in). Maybe that's fixed in a later release too, we'll see.
> There are also many other bugs in CDH 5.2.
> You should upgrade to CDH 5.3, which was released on December 21, 2014.
> regards,
> Park
The admins will be delighted as we just managed to migrate from CDH4 and stabilize the thing somehow. :-)
I guess we'll do it next year. Er, wait...
Thanks a lot again, Alex, and happy new year!
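
Until an upgrade is possible, the only workaround mentioned in this thread is killing the stuck applications by hand. The following is a rough sketch of how the affected application IDs could be pulled out of a ResourceManager log, based on the ERROR line format shown in the logs above. The class name StuckAppScanner and the log path argument are hypothetical. It assumes that every application logged with this ATTEMPT_ADDED error is a candidate for being stuck, which should be verified before killing anything, so it only prints the commands instead of running them.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: scans ResourceManager log lines for failed ATTEMPT_ADDED events.
class StuckAppScanner {
    // Matches the ERROR message format seen in the log excerpt of this thread.
    private static final Pattern FAILED_ATTEMPT = Pattern.compile(
        "Error in handling event type ATTEMPT_ADDED for applicationAttempt (application_\\d+_\\d+)");

    // Returns the distinct application IDs, in order of first appearance.
    static Set<String> affectedApps(List<String> logLines) {
        Set<String> ids = new LinkedHashSet<>();
        for (String line : logLines) {
            Matcher m = FAILED_ATTEMPT.matcher(line);
            if (m.find()) {
                ids.add(m.group(1));
            }
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to a ResourceManager log file (site-specific, supplied by the caller).
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        for (String id : affectedApps(lines)) {
            // Print the command for review rather than executing it.
            System.out.println("yarn application -kill " + id);
        }
    }
}
```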