Tuesday, December 30, 2014

[impala-user] impala query hangs when one node has some RX packets dropped

One of the impalad nodes has network issues and is dropping some RX packets, and some queries hang forever. Looking at the log on the query coordinator node, some of the fragments that ran on the broken node never reported a completion status back. But according to the log on the broken node, those fragments had finished. I wonder whether a network issue can cause queries to hang rather than fail? Is there any retry logic for reporting fragment execution status?



>> Is there any retry logic for reporting fragment execution status?
Taking a look at the code, it seems there is no retry logic for reporting fragment execution status.
The coordinator dispatches the fragments to the different backend hosts for execution:
Coordinator::Exec
    Status fragments_exec_status = ParallelExecutor::Exec(
        bind<Status>(mem_fn(&Coordinator::ExecRemoteFragment), this, _1),
        reinterpret_cast<void**>(&backend_exec_states_[backend_num - num_hosts]),
        num_hosts, &latencies);
        

Then the worker threads execute the fragments in parallel; each one actually makes the RPC call "backend_client->ExecPlanFragment".
ParallelExecutor::Exec
  for (int i = 0; i < num_args; ++i) {
    stringstream ss;
    ss << "worker-thread(" << i << ")";
    worker_threads.AddThread(new Thread("parallel-executor", ss.str(),
        &ParallelExecutor::Worker, function, args[i], &lock, &status, latencies));
  }
  worker_threads.JoinAll();

  return status;
  
  
Hmm, there is retry logic when executing the RPC call "backend_client->ExecPlanFragment" (see "backend_client.Reopen()").
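
To make the reopen-and-retry idea concrete, here is a rough, self-contained sketch. It is not Impala's code: FakeClient and ExecWithRetry are hypothetical stand-ins, the ExecPlanFragment and Reopen methods here only simulate the real calls, and the sketch assumes the connection is reopened once and the call retried exactly once.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a cached RPC client; not Impala's real class.
struct FakeClient {
  int failures_left;     // number of calls that fail before one succeeds
  int reopen_count = 0;  // how many times the connection was reopened

  explicit FakeClient(int failures) : failures_left(failures) {}

  // Simulates an RPC that throws while the connection is stale.
  void ExecPlanFragment(const std::string& fragment) {
    if (failures_left > 0) {
      --failures_left;
      throw std::runtime_error("connection reset while sending " + fragment);
    }
  }

  // Simulates re-establishing the connection.
  void Reopen() { ++reopen_count; }
};

// Tries the RPC once; on a transport error, reopens the connection and
// tries exactly one more time. Returns true on success, false otherwise.
bool ExecWithRetry(FakeClient& client, const std::string& fragment) {
  try {
    client.ExecPlanFragment(fragment);
    return true;
  } catch (const std::runtime_error&) {
    client.Reopen();  // assumption: one reopen, then one retry
  }
  try {
    client.ExecPlanFragment(fragment);
    return true;
  } catch (const std::runtime_error&) {
    return false;
  }
}
```

Note that a retry like this only helps when the call fails visibly. If a message is lost silently and nothing errors out, there is nothing to trigger the retry, which fits the hang described above.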

>> I wonder whether a network issue can cause queries to hang rather than fail?

In your case, the network issue may well have caused the query to hang. However, Impala has various timeout settings, including one for queries: http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/impala_timeouts.html
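
As an illustration of how a timeout turns a hang into a failure, here is a small sketch of a coordinator-side wait that gives up after a deadline instead of blocking forever. It is only a sketch under assumptions, not how Impala implements its timeouts: FragmentTracker, ReportDone, and WaitForAll are hypothetical names.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical tracker for outstanding fragment completion reports.
class FragmentTracker {
 public:
  explicit FragmentTracker(int num_fragments) : remaining_(num_fragments) {}

  // Called when a backend reports that one fragment has finished.
  void ReportDone() {
    std::lock_guard<std::mutex> l(lock_);
    if (remaining_ > 0) --remaining_;
    if (remaining_ == 0) done_cv_.notify_all();
  }

  // Waits until every fragment has reported or the timeout expires.
  // Returns true if all reports arrived, false on timeout.
  bool WaitForAll(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> l(lock_);
    return done_cv_.wait_for(l, timeout, [this] { return remaining_ == 0; });
  }

 private:
  std::mutex lock_;
  std::condition_variable done_cv_;
  int remaining_;  // fragments that have not reported yet
};
```

With a wait like this, a report that never arrives makes WaitForAll return false once the deadline passes, so the caller can fail the query instead of waiting indefinitely.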

