Wednesday, December 31, 2014

[cdh-user] Oozie jobs on YARN stay in ACCEPTED state forever

Hello CDH users,

We have been fighting an issue with Oozie jobs that launch a large number
(several hundred) of workers at the same time: some applications stay in
ACCEPTED state forever, whereas jobs submitted later run and complete just
fine. Once an application is in that state, all that is left to do is to kill it
from the command line or restart the RM. This is on CDH 5.2.
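
For anyone who wants to spot such applications without watching the UI, here is a minimal sketch that filters the ResourceManager REST API response (`/ws/v1/cluster/apps?states=ACCEPTED`) for applications that have been waiting too long. The host name, the waiting threshold, and the function names are assumptions made for illustration, not something taken from our setup; the kill commands are only built, not executed.

```python
import json
import urllib.request

# Assumption: the RM REST endpoint is reachable here (8088 is the default port).
RM_URL = "http://rm-host.example.com:8088"


def fetch_accepted_apps(rm_url=RM_URL):
    """Ask the RM REST API for all applications currently in ACCEPTED state."""
    url = rm_url + "/ws/v1/cluster/apps?states=ACCEPTED"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def stuck_apps(response, now_ms, max_wait_ms):
    """Return IDs of ACCEPTED apps that were started more than max_wait_ms ago."""
    # The RM may return {"apps": null} when nothing matches, hence the fallbacks.
    apps = (response.get("apps") or {}).get("app") or []
    return [
        app["id"]
        for app in apps
        if app.get("state") == "ACCEPTED"
        and now_ms - app["startedTime"] > max_wait_ms
    ]


def kill_commands(app_ids):
    """Build (but do not run) the kill command for each application."""
    return ["yarn application -kill " + app_id for app_id in app_ids]
```

Calling `stuck_apps(fetch_accepted_apps(), now_ms, 10 * 60 * 1000)` with the current time in milliseconds would list everything that has been in ACCEPTED for more than ten minutes; whether ten minutes is a sensible threshold depends on how busy the queue normally is.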

It seems to match the behaviour observed here:
https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/gCRNTmzXNn8

I'd be grateful if people familiar with Oozie/YARN could have a look at these
logs; any clues on how to solve this problem would be greatly appreciated.

Thanks!

David

2014-12-30 23:38:55,700 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=mapred    IP=10.192.71.19    OPERATION=Submit Application Request    TARGET=ClientRMService    RESULT=SUCCESS    APPID=application_1419871364283_21371
2014-12-30 23:38:55,700 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Storing application with id application_1419871364283_21371
2014-12-30 23:38:55,700 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1419871364283_21371 State change from NEW to NEW_SAVING
2014-12-30 23:38:55,700 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1419871364283_21371

...

2014-12-30 23:38:55,706 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1419871364283_21371 State change from NEW_SAVING to SUBMITTED
2014-12-30 23:38:55,734 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Accepted application application_1419871364283_21371from user: mapred, in queue: root.mapred-oozie-launcher, currently num of applications: 480
2014-12-30 23:38:55,734 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1419871364283_21371 State change from SUBMITTED to ACCEPTED
2014-12-30 23:38:55,734 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1419871364283_21371_000001
2014-12-30 23:38:55,734 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1419871364283_21371_000001 State change from NEW to SUBMITTED

... (in the section below you can see transitions for several apps at the same time: 23:38:55,738)

2014-12-30 23:38:55,738 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1419871364283_21371_000001 to scheduler from user: mapred
2014-12-30 23:38:55,738 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application appattempt_1419871364283_20922_000001 is done. finalState=FINISHED
2014-12-30 23:38:55,738 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: Application application_1419871364283_20922 requests cleared
2014-12-30 23:38:55,738 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type ATTEMPT_ADDED for applicationAttempt application_1419871364283_21371
java.util.ConcurrentModificationException

    at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:859)
    at java.util.ArrayList$Itr.next(ArrayList.java:831)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.getResourceUsage(FSLeafQueue.java:147)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getHeadroom(FSAppAttempt.java:180)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:923)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ScheduleTransition.transition(RMAppAttemptImpl.java:908)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ScheduleTransition.transition(RMAppAttemptImpl.java:893)
    at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:757)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:110)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:765)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:746)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    at java.lang.Thread.run(Thread.java:745)

...

2014-12-30 23:47:23,222 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1419871364283_21371_01_000001 Container Transitioned from NEW to ALLOCATED
2014-12-30 23:47:23,222 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=mapred    OPERATION=AM Allocated Container    TARGET=SchedulerApp    RESULT=SUCCESS    APPID=application_1419871364283_21371    CONTAINERID=container_1419871364283_21371_01_000001
2014-12-30 23:47:23,222 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Assigned container container_1419871364283_21371_01_000001 of capacity <memory:2048, vCores:1> on host **************************************:36417, which has 6 containers, <memory:13312, vCores:6> used and <memory:27756, vCores:0> available after allocation
2014-12-30 23:47:23,222 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: CONTAINER_ALLOCATED at SUBMITTED
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:757)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:110)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:765)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:746)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    at java.lang.Thread.run(Thread.java:745)
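
To find every application affected this way, the RM log can be scanned for the "Error in handling event type" line shown above. The following is a rough sketch that assumes the log format matches the excerpt in this post; the function name and the regular expression are my own, and the two sample lines are copied from the log above.

```python
import re

# Matches the ERROR line the RM writes when an attempt event handler throws.
ATTEMPT_ERROR = re.compile(
    r"ERROR .*Error in handling event type (\w+) "
    r"for applicationAttempt (application_\d+_\d+)"
)


def failed_attempt_events(lines):
    """Map each application ID to the list of event types that failed for it."""
    hits = {}
    for line in lines:
        match = ATTEMPT_ERROR.search(line)
        if match:
            event_type, app_id = match.group(1), match.group(2)
            hits.setdefault(app_id, []).append(event_type)
    return hits


# Two lines taken from the log excerpt above: one INFO, one ERROR.
SAMPLE = [
    "2014-12-30 23:38:55,738 INFO org.apache.hadoop.yarn.server.resourcemanager."
    "scheduler.AppSchedulingInfo: Application application_1419871364283_20922 "
    "requests cleared",
    "2014-12-30 23:38:55,738 ERROR org.apache.hadoop.yarn.server.resourcemanager."
    "ResourceManager: Error in handling event type ATTEMPT_ADDED for "
    "applicationAttempt application_1419871364283_21371",
]

for app_id, events in failed_attempt_events(SAMPLE).items():
    print(app_id, events)
```

On a real log, the lines would come from iterating over the open log file rather than from `SAMPLE`. Applications reported with ATTEMPT_ADDED are the likely candidates for being stuck in ACCEPTED, though that is an inference from this one trace.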




This is a bug:
  • [YARN-2910] - FSLeafQueue can throw ConcurrentModificationException
and there are many bugs in CDH 5.2.

You should upgrade the system to CDH 5.3, which was released on December 21, 2014.



> This is a bug:
>   • [YARN-2910] - FSLeafQueue can throw ConcurrentModificationException

Damn! I missed that one despite all the research. Thanks a lot!

As a side note, issuing a "yarn application -kill <app id>" doesn't produce a useful error code in Oozie (which prevents the retries from kicking in). Maybe that's fixed in a later release too; we'll see.

> and there are many bugs in CDH 5.2.
>
> You should upgrade the system to CDH 5.3, which was released on December 21, 2014.
>
> Regards,
> Park



The admins will be delighted, as we just managed to migrate from CDH4 and somehow stabilize the thing. :-)

I guess we'll do it next year. Er, wait...

Thanks a lot again, Alex, and happy new year!

