Question

Stream node remains in "JOINING_FAILED" status

Hi All

We have a Production cluster where one node on the Stream tab of the "Decisioning: Services" landing page remains in JOINING_FAILED status (see attached screenshot). We traced it to this snippet in the Kafka server.log file:

...snip...

[2019-11-25 11:13:23,513] INFO Creating /brokers/ids/6 (is it secure? false) (kafka.zk.KafkaZkClient)

[2019-11-25 11:13:23,541] ERROR Error while creating ephemeral at /brokers/ids/6, node already exists and owner '31139484958261249' does not match current session '31140168996945922' (kafka.zk.KafkaZkClient$CheckedEphemeral)

[2019-11-25 11:13:23,541] INFO Result of znode creation at /brokers/ids/6 is: NODEEXISTS (kafka.zk.KafkaZkClient)

[2019-11-25 11:13:23,550] ERROR [KafkaServer id=6] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)

org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists

at org.apache.zookeeper.KeeperException.create(KeeperException.java:119)

at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1476)

at kafka.zk.KafkaZkClient.registerBrokerInZk(KafkaZkClient.scala:84)

at kafka.server.KafkaServer.startup(KafkaServer.scala:254)

at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:38)

at kafka.Kafka$.main(Kafka.scala:75)

at kafka.Kafka.main(Kafka.scala)

[2019-11-25 11:13:23,553] INFO [KafkaServer id=6] shutting down (kafka.server.KafkaServer)

[2019-11-25 11:13:23,555] INFO [SocketServer brokerId=6] Stopping socket server request processors (kafka.network.SocketServer)

...snip...

Does anyone have an idea of what the problem is here? We know next to nothing about Kafka.

Regards,

Johan

Comments


November 27, 2019 - 9:37am

Hello, 

Is this a new environment, or is this the first time you have seen the issue? How long have you been running this Production system, and have there been any changes recently?

November 27, 2019 - 9:57am

Hi Johan,

Try stopping that JVM and check whether any processes are still active on the app server after the JVM has stopped.

If there are any, kill those processes and start again. Also, if the system has been upgraded, make sure you clean the pr_sys_statusnode table.
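
If you want to script that check, here is a rough sketch that lists any Java processes still running after the JVM has been stopped. It assumes the psutil Python package is available on the app server (a plain ps command works just as well):

# List Java processes still running on the app server after the JVM has been stopped.
# Assumes the psutil package is installed (pip install psutil).
import psutil

for proc in psutil.process_iter(["pid", "name", "cmdline"]):
    name = (proc.info["name"] or "").lower()
    if "java" in name:
        cmdline = " ".join(proc.info["cmdline"] or [])
        print(proc.info["pid"], cmdline[:120])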

Hope this helps; let me know how it goes.

Regards,

Anandh

November 28, 2019 - 12:12am
Response to AnandhP3

Thank you for your response, Anandh. We've been having this problem since last Friday. We've been in Production since June with twice-monthly deployments, and there was no deployment last week. I didn't clean the pr_sys_statusnode table since we have not upgraded. Should I delete the row related to the machine with the problem? I tried your suggestion of stopping the JVM and killing any other Java processes after the JVM stopped. There was one, but killing it and restarting didn't help. On the other machines there are two other Java processes, which makes sense since Kafka and Cassandra are both running fine over there.

Pega
November 28, 2019 - 10:37am

Hi Johan,

Hopefully all the nodes are able to communicate with each other. Please try to ping from one node to the others and check the response. If they can talk to each other, then stop all the nodes, clear the pr_sys_statusnodes table, and restart the servers node by node: first start the util nodes, followed by the stream node, and then the web nodes.

Please let me know if the issue still persists after that.
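
After the restart, one quick way to see whether the stream node's broker registered again is to list the ids under /brokers/ids in ZooKeeper. A rough sketch, assuming the Python kazoo client and ZooKeeper reachable on the default port 2181 (the host name below is a placeholder):

# List the Kafka broker ids currently registered in ZooKeeper.
# Assumes the kazoo package is installed (pip install kazoo); replace the placeholder host.
from kazoo.client import KazooClient

zk = KazooClient(hosts="stream-node-host:2181")
zk.start()
try:
    print("Registered broker ids:", sorted(zk.get_children("/brokers/ids")))
finally:
    zk.stop()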

 

Thanks,
Abhinav

December 2, 2019 - 1:41am
Response to Abhinav7

Hi Abhinav

Sorry it took so long to get back to this. Hectic weekend with Black Friday and all. We have permission to restart all the nodes tonight. I will let you know if it helped. All the nodes can ping each other.

Regards,

Johan

Pega
December 4, 2019 - 1:11pm
Response to JohanH55

Hi Johan,

Can you please provide an update? Did it work?

Thanks,

Abhinav

December 5, 2019 - 1:22am
Response to Abhinav7

Hi Abhinav

The restart was postponed to last night to coincide with other downtime. We took the nodes down and when we looked at the table it was empty. We brought the nodes back up and the Kafka server is still down. The four records in the table are back. So no, it didn't work.

Regards,

Johan

Pega
December 5, 2019 - 2:25pm

Hi Johan,

Okay, then something is wrong with Kafka itself. Can you please share the error logs that were generated during Kafka startup?
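
If the startup log still shows the NodeExists error for /brokers/ids/6, the broker's ephemeral registration from an earlier ZooKeeper session is probably still lingering. A rough diagnostic sketch, assuming the Python kazoo client and ZooKeeper on the default port 2181 (the host name is a placeholder), to look at that znode and the session that owns it:

# Inspect the conflicting broker registration znode from the error message.
# Assumes the kazoo package is installed (pip install kazoo); replace the placeholder host.
from kazoo.client import KazooClient

zk = KazooClient(hosts="stream-node-host:2181")
zk.start()
try:
    stat = zk.exists("/brokers/ids/6")
    if stat is None:
        print("/brokers/ids/6 is gone; the broker should be able to register now")
    else:
        data, stat = zk.get("/brokers/ids/6")
        print("Registration data:", data.decode("utf-8", errors="replace"))
        # ephemeralOwner is the ZooKeeper session id that created the node; if it matches
        # the old session id from the error log, the stale registration never went away.
        print("Owning session id:", stat.ephemeralOwner)
finally:
    zk.stop()

ZooKeeper normally removes a stale ephemeral node on its own once the old session times out, so waiting out the session timeout before starting Kafka again can also clear it.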

Thanks,
Abhinav

December 6, 2019 - 1:37am
Response to Abhinav7

Hi Abhinav

In my very first post I included an extract from that log file. I'm attaching a copy of the latest log file.

Regards,

Johan

December 9, 2019 - 6:52am

SR logged: SR-D67690