Hi All,

When a failure occurs on the DC node and fencing is executed, the "Pending Fencing Actions" entries displayed by crm_mon keep increasing as long as fencing continues to fail.

(After cluster startup)
---
[root@rh77hv-01 ~]# crm_mon -1 -Af
Stack: corosync
Current DC: rh77hv-01 (version 2.0.2-57d220459f) - partition with quorum
Last updated: Thu Aug 29 11:05:38 2019
Last change: Thu Aug 29 11:05:20 2019 by root via cibadmin on rh77hv-01

3 nodes configured
3 resources configured

Online: [ rh77hv-01 rh77hv-02 ]
OFFLINE: [ rh77hv-03 ]

Active resources:

 prmPrimitive01 (ocf::pacemaker:Dummy): Started rh77hv-01
 Resource Group: grpStonith1
     prmStonith1-2 (stonith:external/ssh): Started rh77hv-02
 Resource Group: grpStonith2
     prmStonith2-2 (stonith:external/ssh): Started rh77hv-01

Node Attributes:
* Node rh77hv-01:
* Node rh77hv-02:

Migration Summary:
* Node rh77hv-01:
* Node rh77hv-02:
---

(After all fencing of the DC node fails)
---
[root@rh77hv-01 ~]# crm_mon -1 -Af
Stack: corosync
Current DC: rh77hv-01 (version 2.0.2-57d220459f) - partition with quorum
Last updated: Thu Aug 29 11:09:23 2019
Last change: Thu Aug 29 11:05:20 2019 by root via cibadmin on rh77hv-01

3 nodes configured
3 resources configured

Node rh77hv-01: UNCLEAN (online)
Online: [ rh77hv-02 ]
OFFLINE: [ rh77hv-03 ]

Active resources:

 prmPrimitive01 (ocf::pacemaker:Dummy): FAILED rh77hv-01
 Resource Group: grpStonith1
     prmStonith1-2 (stonith:external/ssh): Started rh77hv-02
 Resource Group: grpStonith2
     prmStonith2-2 (stonith:external/ssh): Started rh77hv-01

Node Attributes:
* Node rh77hv-01:
* Node rh77hv-02:

Migration Summary:
* Node rh77hv-01:
   prmPrimitive01: migration-threshold=1 fail-count=1000000 last-failure='Thu Aug 29 11:07:02 2019'
* Node rh77hv-02:

Failed Resource Actions:
* prmPrimitive01_stop_0 on rh77hv-01 'not installed' (5): call=19, status=Not installed, exitreason='', last-rc-change='Thu Aug 29 11:07:02 2019', queued=0ms, exec=0ms

Failed Fencing Actions:
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.1684, origin=rh77hv-02, last-failed='Thu Aug 29 11:09:02 2019'

Pending Fencing Actions:
* reboot of rh77hv-01 pending: client=pacemaker-controld.1684, origin=rh77hv-01
* reboot of rh77hv-01 pending: client=pacemaker-controld.1684, origin=rh77hv-01
* reboot of rh77hv-01 pending: client=pacemaker-controld.1684, origin=rh77hv-01
* reboot of rh77hv-01 pending: client=pacemaker-controld.1684, origin=rh77hv-01
* reboot of rh77hv-01 pending: client=pacemaker-controld.1684, origin=rh77hv-01
* reboot of rh77hv-01 pending: client=pacemaker-controld.1684, origin=rh77hv-01
* reboot of rh77hv-01 pending: client=pacemaker-controld.1684, origin=rh77hv-01
* reboot of rh77hv-01 pending: client=pacemaker-controld.1684, origin=rh77hv-01
* reboot of rh77hv-01 pending: client=pacemaker-controld.1684, origin=rh77hv-01
* reboot of rh77hv-01 pending: client=pacemaker-controld.1684, origin=rh77hv-01
---

The cause of the problem is that the fencing operation created on the DC node remains in the pending state, without ever being marked completed or failed. This crm_mon output will confuse users, and we would like to see it improved.

This problem occurs in both Pacemaker 1.1.21 and Pacemaker 2.0.2.

I attach a crm_report taken while running the latest Pacemaker master.

Best Regards,
Hideo Yamauchi.
Created attachment 345 [details] crm_report files.
Hmm ... that may have to do with the switch to common executor-code for fence & resource-agents. Have you tried that scenario with older pacemaker-versions as well?

I've been playing a lot with failing fence-actions and making them fail due to the executing host rebooting recently. (That had been leaving pending fence-actions behind forever, and I was working on cleaning those up. That fix probably doesn't have anything to do with what you are observing, but it should lead to the pending fence-actions being cleaned up when you reboot / restart pacemaker on rh77hv-02.)

But there might be something special when using fence_legacy, which I don't remember having tested/used for ages. Let me have a look / try to reproduce.
(In reply to Klaus Wenninger from comment #2)

Hi Klaus,

> Hmm ... that may have to do with the switch to common executor-code for
> fence & resource-agents.
> Have you tried that scenario with older pacemaker-versions as well?

No, I have not confirmed this with older Pacemaker versions. I can check if necessary. Which version would you like me to test?

> I've been playing a lot with failing fence-actions and making
> them fail due to the executing host rebooting recently. (That had been
> leaving pending fence actions since ever and I was working on cleaning
> those up. That fix probably doesn't have anything to do with
> what you are observing but it should lead to the pending-fence-actions
> being cleaned up when you reboot / restart pacemaker on rh77hv-02.)
> But there might be something special when using fence_legacy which
> I don't remember to have tested/used for ages.
> Let me have a look / try to reproduce.

Thanking you in advance. Please contact me if you need me to confirm anything.

Best Regards,
Hideo Yamauchi.
(In reply to Hideo Yamauchi from comment #3)
> (In reply to Klaus Wenninger from comment #2)
>
> Hi Klaus,
>
> > Hmm ... that may have to do with the switch to common executor-code for
> > fence & resource-agents.
> > Have you tried that scenario with older pacemaker-versions as well?

The change I had in mind is quite recent, so stepping back one version (2.0.1 or 1.1.20 respectively) should be enough. Thanks for checking - that would be really helpful.

If possible, you might as well exchange your failing fence-agent for an RHCS-style one in a 2nd step. You could use fence_dummy from CTS for that purpose.

And it would of course be interesting to see if restarting pacemaker on the delegate cleans up the issue.

Thanks
Klaus
Hi Klaus,

I'm not good at English, so I may not have explained it correctly earlier: I had not checked the behavior with older versions.

I have now tested the same scenario with Pacemaker 2.0.1, and the same problem occurred.

---
[root@rh77hv-01 pacemaker]# crm_mon -1 -Af
Stack: corosync
Current DC: rh77hv-01 (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Thu Sep 12 10:32:12 2019
Last change: Thu Sep 12 10:27:57 2019 by root via cibadmin on rh77hv-01

3 nodes configured
3 resources configured

Node rh77hv-01: UNCLEAN (online)
Online: [ rh77hv-02 ]
OFFLINE: [ rh77hv-03 ]

Active resources:

 prmPrimitive01 (ocf::pacemaker:Dummy): FAILED rh77hv-01
 Resource Group: grpStonith1
     prmStonith1-2 (stonith:external/ssh): Started rh77hv-02
 Resource Group: grpStonith2
     prmStonith2-2 (stonith:external/ssh): Started rh77hv-01

Node Attributes:
* Node rh77hv-01:
* Node rh77hv-02:

Migration Summary:
* Node rh77hv-01:
   prmPrimitive01: migration-threshold=1 fail-count=1000000 last-failure='Thu Sep 12 10:28:38 2019'
* Node rh77hv-02:

Failed Resource Actions:
* prmPrimitive01_stop_0 on rh77hv-01 'not installed' (5): call=19, status=Not installed, exitreason='', last-rc-change='Thu Sep 12 10:28:38 2019', queued=0ms, exec=0ms

Failed Fencing Actions:
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.27203, origin=rh77hv-02, last-failed='Thu Sep 12 10:30:38 2019'

Pending Fencing Actions:
* reboot of rh77hv-01 pending: client=pacemaker-controld.27203, origin=rh77hv-01
* reboot of rh77hv-01 pending: client=pacemaker-controld.27203, origin=rh77hv-01
* reboot of rh77hv-01 pending: client=pacemaker-controld.27203, origin=rh77hv-01
* reboot of rh77hv-01 pending: client=pacemaker-controld.27203, origin=rh77hv-01
* reboot of rh77hv-01 pending: client=pacemaker-controld.27203, origin=rh77hv-01
* reboot of rh77hv-01 pending: client=pacemaker-controld.27203, origin=rh77hv-01
* reboot of rh77hv-01 pending: client=pacemaker-controld.27203, origin=rh77hv-01
* reboot of rh77hv-01 pending: client=pacemaker-controld.27203, origin=rh77hv-01
* reboot of rh77hv-01 pending: client=pacemaker-controld.27203, origin=rh77hv-01
* reboot of rh77hv-01 pending: client=pacemaker-controld.27203, origin=rh77hv-01
---

Best Regards,
Hideo Yamauchi.
Hi Klaus,

This is not an official fix and likely has problems, but I tried the following change (based on Pacemaker 2.0.2):

- https://github.com/HideoYamauchi/pacemaker/commit/93630e4c94d3fa16fdd1c42290f226678c005cff

With it, requests originating from the DC are shown under Failed Fencing Actions after completion, instead of remaining under Pending Fencing Actions.

-----
[root@rh77hv-01 pacemaker]# crm_mon -1 -Af
Stack: corosync
Current DC: rh77hv-01 (version 2.0.2-744a30d655) - partition with quorum
Last updated: Thu Sep 12 15:10:34 2019
Last change: Thu Sep 12 15:05:19 2019 by root via cibadmin on rh77hv-01

3 nodes configured
3 resources configured

Node rh77hv-01: UNCLEAN (online)
Online: [ rh77hv-02 ]
OFFLINE: [ rh77hv-03 ]

Active resources:

 prmPrimitive01 (ocf::pacemaker:Dummy): FAILED rh77hv-01
 Resource Group: grpStonith1
     prmStonith1-2 (stonith:external/ssh): Started rh77hv-02
 Resource Group: grpStonith2
     prmStonith2-2 (stonith:external/ssh): Started rh77hv-01

Node Attributes:
* Node rh77hv-01:
* Node rh77hv-02:

Migration Summary:
* Node rh77hv-01:
   prmPrimitive01: migration-threshold=1 fail-count=1000000 last-failure='Thu Sep 12 15:05:51 2019'
* Node rh77hv-02:

Failed Resource Actions:
* prmPrimitive01_stop_0 on rh77hv-01 'not installed' (5): call=19, status=Not installed, exitreason='', last-rc-change='Thu Sep 12 15:05:51 2019', queued=0ms, exec=0ms

Failed Fencing Actions:
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.1641, origin=rh77hv-02, last-failed='Thu Sep 12 15:07:51 2019'
* reboot of rh77hv-01 failed: delegate=, client=pacemaker-controld.1641, origin=rh77hv-01, last-failed='Thu Jan 1 09:00:00 1970'
-----

However, if you display with the -m3 option, the number of requests from the DC also increases, so double the number of failed fencing actions is displayed.
-----
[root@rh77hv-01 pacemaker]# crm_mon -1 -Af -m3
Stack: corosync
Current DC: rh77hv-01 (version 2.0.2-744a30d655) - partition with quorum
Last updated: Thu Sep 12 15:10:49 2019
Last change: Thu Sep 12 15:05:19 2019 by root via cibadmin on rh77hv-01

3 nodes configured
3 resources configured

Node rh77hv-01: UNCLEAN (online)
Online: [ rh77hv-02 ]
OFFLINE: [ rh77hv-03 ]

Active resources:

 prmPrimitive01 (ocf::pacemaker:Dummy): FAILED rh77hv-01
 Resource Group: grpStonith1
     prmStonith1-2 (stonith:external/ssh): Started rh77hv-02
 Resource Group: grpStonith2
     prmStonith2-2 (stonith:external/ssh): Started rh77hv-01

Node Attributes:
* Node rh77hv-01:
* Node rh77hv-02:

Migration Summary:
* Node rh77hv-01:
   prmPrimitive01: migration-threshold=1 fail-count=1000000 last-failure='Thu Sep 12 15:05:51 2019'
* Node rh77hv-02:

Failed Resource Actions:
* prmPrimitive01_stop_0 on rh77hv-01 'not installed' (5): call=19, status=Not installed, exitreason='', last-rc-change='Thu Sep 12 15:05:51 2019', queued=0ms, exec=0ms

Failed Fencing Actions:
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.1641, origin=rh77hv-02, completed='Thu Sep 12 15:07:51 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.1641, origin=rh77hv-02, completed='Thu Sep 12 15:07:39 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.1641, origin=rh77hv-02, completed='Thu Sep 12 15:07:27 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.1641, origin=rh77hv-02, completed='Thu Sep 12 15:07:15 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.1641, origin=rh77hv-02, completed='Thu Sep 12 15:07:03 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.1641, origin=rh77hv-02, completed='Thu Sep 12 15:06:50 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.1641, origin=rh77hv-02, completed='Thu Sep 12 15:06:38 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.1641, origin=rh77hv-02, completed='Thu Sep 12 15:06:26 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.1641, origin=rh77hv-02, completed='Thu Sep 12 15:06:14 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.1641, origin=rh77hv-02, completed='Thu Sep 12 15:06:02 2019'
* reboot of rh77hv-01 failed: delegate=, client=pacemaker-controld.1641, origin=rh77hv-01, completed='Thu Jan 1 09:00:00 1970'
* reboot of rh77hv-01 failed: delegate=, client=pacemaker-controld.1641, origin=rh77hv-01, completed='Thu Jan 1 09:00:00 1970'
* reboot of rh77hv-01 failed: delegate=, client=pacemaker-controld.1641, origin=rh77hv-01, completed='Thu Jan 1 09:00:00 1970'
* reboot of rh77hv-01 failed: delegate=, client=pacemaker-controld.1641, origin=rh77hv-01, completed='Thu Jan 1 09:00:00 1970'
* reboot of rh77hv-01 failed: delegate=, client=pacemaker-controld.1641, origin=rh77hv-01, completed='Thu Jan 1 09:00:00 1970'
* reboot of rh77hv-01 failed: delegate=, client=pacemaker-controld.1641, origin=rh77hv-01, completed='Thu Jan 1 09:00:00 1970'
* reboot of rh77hv-01 failed: delegate=, client=pacemaker-controld.1641, origin=rh77hv-01, completed='Thu Jan 1 09:00:00 1970'
* reboot of rh77hv-01 failed: delegate=, client=pacemaker-controld.1641, origin=rh77hv-01, completed='Thu Jan 1 09:00:00 1970'
* reboot of rh77hv-01 failed: delegate=, client=pacemaker-controld.1641, origin=rh77hv-01, completed='Thu Jan 1 09:00:00 1970'
* reboot of rh77hv-01 failed: delegate=, client=pacemaker-controld.1641, origin=rh77hv-01, completed='Thu Jan 1 09:00:00 1970'

Fencing History:
-----

The fix is provisional, but I understand that the essence of the problem lies in the handling of requests from the DC.

Best Regards,
Hideo Yamauchi.
(In reply to Hideo Yamauchi from comment #6)
> Hi Klaus,
>
> I'm not good at English so I couldn't tell you correctly.

No worries. I think I got what you meant. English isn't my mother-tongue either.

> However, if you display with the m3 option, the number of requests from DC
> will also increase, so double the number of failed fencing actions will be
> displayed.

That points a bit in the direction that the pending action just isn't properly overwritten by the failed-action that seems to be created. But knowing that the behaviour is neither a result of the patch that converts pending fence actions from a fencer that has died nor some artefact from using fence_legacy is already helpful.

Let me have a look ...

Klaus
Hi Klaus,

Okay! All right! Please contact me if I can help you.

Best Regards,
Hideo Yamauchi.
Looking at your posts yesterday I thought I had seen that you had switched from external/ssh to some dummy fencing rhcs-style. But obviously I hadn't looked properly, so issues with fence_legacy might still be possible.

But anyway, I could imagine that not merging duplicates in the case of suicide fence-actions might be the reason:

fenced_remote.c (merge_duplicates):

    } else if (safe_str_eq(other->target, other->originator)) {
        crm_trace("Can't be a suicide operation: %s", other->target);
        continue;
    }

As you have the setup to reproduce and have already built your patch, would it be possible to simply disable that if-clause?

Regards,
Klaus
(In reply to Klaus Wenninger from comment #9)
> Looking at your posts yesterday I thought I had seen that you had switched
> from external/ssh to some dummy fencing rhcs-style. But obviously I hadn't
> looked properly. So issues with fence_legacy might still be possible.
>
> But anyway I could imagine that not merging duplicates in case of suicide
> fence-actions might be the reason:
>
> fenced_remote.c (merge_duplicates):
>
>     } else if (safe_str_eq(other->target, other->originator)) {
>         crm_trace("Can't be a suicide operation: %s", other->target);
>         continue;
>     }
>
> As you have the setup to reproduce and already built your patch would it be
> possible to simply disable that if-clause?

Hi Klaus,

Is your request to enable this trace, reproduce the problem, and attach a crm_report? I will try it.

Best Regards,
Hideo Yamauchi.
Hi Hideo,

I believe what Klaus meant was to remove the three lines from "else if" to "continue". That will allow multiple self-fencing requests to be merged.
(In reply to Ken Gaillot from comment #11)
> Hi Hideo,
>
> I believe what Klaus meant was to remove the three lines from "else if" to
> "continue". That will allow multiple self-fencing requests to be merged.

Hi Ken,

Okay! I will try it.

Many thanks,
Hideo Yamauchi.
Hi Klaus,

I made the following modification to Pacemaker 2.0.2 and confirmed the behavior.

--- @daemon/fenced_remote.c
        } else if (safe_str_eq(op->client_name, other->client_name)) {
            crm_trace("Must be for different clients: %s", op->client_name);
            continue;
#if 0
        } else if (safe_str_eq(other->target, other->originator)) {
            crm_trace("Can't be a suicide operation: %s", other->target);
            continue;
#endif
        }
---

I also enabled the fenced_remote trace. The result was as follows ... the problem still occurs.

---
[root@rh77hv-01 fenced]# crm_mon -1 -Af
Stack: corosync
Current DC: rh77hv-01 (version 2.0.2-744a30d655) - partition with quorum
Last updated: Thu Sep 19 09:21:15 2019
Last change: Thu Sep 19 09:17:08 2019 by root via cibadmin on rh77hv-01

3 nodes configured
3 resources configured

Node rh77hv-01: UNCLEAN (online)
Online: [ rh77hv-02 ]
OFFLINE: [ rh77hv-03 ]

Active resources:

 prmPrimitive01 (ocf::pacemaker:Dummy): FAILED rh77hv-01
 Resource Group: grpStonith1
     prmStonith1-2 (stonith:external/ssh): Started rh77hv-02
 Resource Group: grpStonith2
     prmStonith2-2 (stonith:external/ssh): Started rh77hv-01

Node Attributes:
* Node rh77hv-01:
* Node rh77hv-02:

Migration Summary:
* Node rh77hv-01:
   prmPrimitive01: migration-threshold=1 fail-count=1000000 last-failure='Thu Sep 19 09:17:30 2019'
* Node rh77hv-02:

Failed Resource Actions:
* prmPrimitive01_stop_0 on rh77hv-01 'not installed' (5): call=19, status=Not installed, exitreason='', last-rc-change='Thu Sep 19 09:17:30 2019', queued=0ms, exec=0ms

Failed Fencing Actions:
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.2085, origin=rh77hv-02, last-failed='Thu Sep 19 09:19:29 2019'

Pending Fencing Actions:
* reboot of rh77hv-01 pending: client=pacemaker-controld.2085, origin=rh77hv-01
* reboot of rh77hv-01 pending: client=pacemaker-controld.2085, origin=rh77hv-01
* reboot of rh77hv-01 pending: client=pacemaker-controld.2085, origin=rh77hv-01
* reboot of rh77hv-01 pending: client=pacemaker-controld.2085, origin=rh77hv-01
* reboot of rh77hv-01 pending: client=pacemaker-controld.2085, origin=rh77hv-01
* reboot of rh77hv-01 pending: client=pacemaker-controld.2085, origin=rh77hv-01
* reboot of rh77hv-01 pending: client=pacemaker-controld.2085, origin=rh77hv-01
* reboot of rh77hv-01 pending: client=pacemaker-controld.2085, origin=rh77hv-01
* reboot of rh77hv-01 pending: client=pacemaker-controld.2085, origin=rh77hv-01
* reboot of rh77hv-01 pending: client=pacemaker-controld.2085, origin=rh77hv-01
^C
[root@rh77hv-02 ~]# crm_mon -1 -Af
Stack: corosync
Current DC: rh77hv-01 (version 2.0.2-744a30d655) - partition with quorum
Last updated: Thu Sep 19 09:21:34 2019
Last change: Thu Sep 19 09:17:08 2019 by root via cibadmin on rh77hv-01

3 nodes configured
3 resources configured

Node rh77hv-01: UNCLEAN (online)
Online: [ rh77hv-02 ]
OFFLINE: [ rh77hv-03 ]

Active resources:

 prmPrimitive01 (ocf::pacemaker:Dummy): FAILED rh77hv-01
 Resource Group: grpStonith1
     prmStonith1-2 (stonith:external/ssh): Started rh77hv-02
 Resource Group: grpStonith2
     prmStonith2-2 (stonith:external/ssh): Started rh77hv-01

Node Attributes:
* Node rh77hv-01:
* Node rh77hv-02:

Migration Summary:
* Node rh77hv-01:
   prmPrimitive01: migration-threshold=1 fail-count=1000000 last-failure='Thu Sep 19 09:17:30 2019'
* Node rh77hv-02:

Failed Resource Actions:
* prmPrimitive01_stop_0 on rh77hv-01 'not installed' (5): call=19, status=Not installed, exitreason='', last-rc-change='Thu Sep 19 09:17:30 2019', queued=0ms, exec=0ms

Failed Fencing Actions:
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.2085, origin=rh77hv-02, last-failed='Thu Sep 19 09:19:29 2019'
---

I attach the crm_report taken when this was confirmed (t5401.tar.bz2).

Best Regards,
Hideo Yamauchi.
Created attachment 348 [details]
crm_report files, taken after the fenced_remote change.
That is really strange ... Seeing

Sep 19 09:17:53 rh77hv-01 pacemaker-fenced [2081] (process_remote_stonith_exec@fenced_remote.c:1950) debug: Marking call to reboot for rh77hv-01 on behalf of pacemaker-controld.2085@356944e4-ecea-4eee-8ad0-cb74f43bf648.rh77hv-0: No data available (-61)
Sep 19 09:17:53 rh77hv-01 pacemaker-fenced [2081] (remote_op_done@fenced_remote.c:525) notice: Operation reboot of rh77hv-01 by rh77hv-02 for pacemaker-controld.2085@rh77hv-02.356944e4: No data available

and comparing to the code, I would assume that it definitely has found the entry on rh77hv-01 and right after that log it would set the operation to failed. Really strange why crm_mon still shows it as pending. Maybe some kind of duplication of the pending entry ...

On rh77hv-02 we see it taking exactly the same code-path:

Sep 19 09:17:53 rh77hv-02 pacemaker-fenced [14871] (process_remote_stonith_exec@fenced_remote.c:1950) debug: Marking call to reboot for rh77hv-01 on behalf of pacemaker-controld.2085@356944e4-ecea-4eee-8ad0-cb74f43bf648.rh77hv-0: No data available (-61)
Sep 19 09:17:53 rh77hv-02 pacemaker-fenced [14871] (remote_op_done@fenced_remote.c:525) error: Operation reboot of rh77hv-01 by rh77hv-02 for pacemaker-controld.2085@rh77hv-02.356944e4: No data available

But there the states seem to be overwritten properly.

Strange as well is that we see:

Sep 19 09:19:29 rh77hv-02 pacemaker-fenced [14871] (process_remote_stonith_exec@fenced_remote.c:1950) debug: Marking call to reboot for rh77hv-01 on behalf of pacemaker-controld.2085@7e8f6973-7463-4d8f-98a0-b81251c7fb53.rh77hv-0: No data available (-61)
Sep 19 09:19:29 rh77hv-02 pacemaker-fenced [14871] (remote_op_done@fenced_remote.c:525) error: Operation reboot of rh77hv-01 by rh77hv-02 for pacemaker-controld.2085@rh77hv-02.7e8f6973: No data available

2085 is definitely the pid of pacemaker-controld on rh77hv-01, and still we see the origin of the failed operation as rh77hv-02!?!
Looks as if we create both entries on the originating-host in this "Forwarding complex self fencing request to peer %s" case.
And that is probably where you had already been looking, and what inspired your patch above.

The next question is whether we want to have both failure records - the one from the original suicide-request and the relayed one - or whether one should silently disappear. Can we solve it without introducing a new xml-tag? As the original request isn't synced, I think we would at least like the pending history to be the same on all nodes.

Sorry that it took me a bit to understand.

Klaus
Haven't thought it through fully, but maybe we could use the duplicate-state: in merge_duplicates we would make the pre-existent suicide-operation a duplicate of the new relayed-operation. Regarding history, we wouldn't synchronize duplicate-operations and we would purge them from the list after client-notification ... or just the duplicate-suicide-operations ...

What do you think?

Klaus
Hi Klaus,

Thanks for your comment. I will think a little more next week. I will comment again if I have any questions.

Best Regards,
Hideo Yamauchi.
(In reply to Klaus Wenninger from comment #18)
> Haven't thought it through fully but maybe we could use the duplicate-state.
> In merge_duplicates we would make the pre-existent suicide-operation a
> duplicate of the new relayed-operation.
> Regarding history we wouldn't synchronize duplicate-operations and we would
> purge them from the list after client-notification. ... or just for
> duplicate-suicide-operations ...
> What do you think?
>
> Klaus

Wouldn't that prevent the suicide from being attempted?
I'm more and more falling in love with the idea of a transparent solution for relayed fencing:

The originating node would detect the suicide situation, try to find a relay-target/proxy, and then create an entry in the list (pending) with that proxy added. On receiving the relay, the other node wouldn't just create a new operation with itself as originator, but instead one with itself as relay-target/proxy (one that matches the entry the originating node had created - we can still decide whether we want to broadcast that first one already, or have that done just by the proxy).

So we would have just a single entry that first shows as pending and then goes to success or failed. In the history we can print out the proxy as well.

That kind of picks up your idea and takes it a little further. There may be places in the code where the originator is relevant and would have to be kind of overruled by the relay-target/proxy.

Klaus
Hi Klaus, Hi Ken,

I thought about how to modify the messages without changing the remote_fencing_op_t structure. I'll post an example of the fix on github today.

Best Regards,
Hideo Yamauchi.
Hi Klaus, Hi Ken,

It seems it will take a little more time to prepare the proposed fix. Probably next Wednesday...

Best Regards,
Hideo Yamauchi.
Hi Ken, Hi Klaus,

I created a new proposal based on pacemaker 2.0.2. Since it is a draft, it is not a perfect fix yet.

- https://github.com/HideoYamauchi/pacemaker/commit/2f2758ed642ae81ba15bfc29cbef100442374876

This method does not add a new XML item; instead, the REMOTE_OP_ID of the self-fencing node is added to the RELAY message. When the relay target sends the operation result, it also reports the result for the REMOTE_OP_ID carried in the RELAY message.

---After final fencing failure
[root@rh77hv-01 ~]# crm_mon -1 -Af
Stack: corosync
Current DC: rh77hv-01 (version 2.0.2-744a30d655) - partition with quorum
Last updated: Mon Sep 30 10:29:32 2019
Last change: Mon Sep 30 10:24:30 2019 by root via cibadmin on rh77hv-01

3 nodes configured
3 resources configured

Node rh77hv-01: UNCLEAN (online)
Online: [ rh77hv-02 ]
OFFLINE: [ rh77hv-03 ]

Active resources:

 prmPrimitive01 (ocf::pacemaker:Dummy): FAILED rh77hv-01
 Resource Group: grpStonith1
     prmStonith1-2 (stonith:external/ssh): Started rh77hv-02
 Resource Group: grpStonith2
     prmStonith2-2 (stonith:external/ssh): Started rh77hv-01

Node Attributes:
* Node rh77hv-01:
* Node rh77hv-02:

Migration Summary:
* Node rh77hv-01:
   prmPrimitive01: migration-threshold=1 fail-count=1000000 last-failure='Mon Sep 30 10:24:51 2019'
* Node rh77hv-02:

Failed Resource Actions:
* prmPrimitive01_stop_0 on rh77hv-01 'not installed' (5): call=19, status=Not installed, exitreason='', last-rc-change='Mon Sep 30 10:24:51 2019', queued=0ms, exec=0ms

Failed Fencing Actions:
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.2222, origin=rh77hv-02, last-failed='Mon Sep 30 10:26:51 2019'

[root@rh77hv-01 ~]# crm_mon -1 -Af -m3
Stack: corosync
Current DC: rh77hv-01 (version 2.0.2-744a30d655) - partition with quorum
Last updated: Mon Sep 30 10:29:35 2019
Last change: Mon Sep 30 10:24:30 2019 by root via cibadmin on rh77hv-01

3 nodes configured
3 resources configured

Node rh77hv-01: UNCLEAN (online)
Online: [ rh77hv-02 ]
OFFLINE: [ rh77hv-03 ]

Active resources:

 prmPrimitive01 (ocf::pacemaker:Dummy): FAILED rh77hv-01
 Resource Group: grpStonith1
     prmStonith1-2 (stonith:external/ssh): Started rh77hv-02
 Resource Group: grpStonith2
     prmStonith2-2 (stonith:external/ssh): Started rh77hv-01

Node Attributes:
* Node rh77hv-01:
* Node rh77hv-02:

Migration Summary:
* Node rh77hv-01:
   prmPrimitive01: migration-threshold=1 fail-count=1000000 last-failure='Mon Sep 30 10:24:51 2019'
* Node rh77hv-02:

Failed Resource Actions:
* prmPrimitive01_stop_0 on rh77hv-01 'not installed' (5): call=19, status=Not installed, exitreason='', last-rc-change='Mon Sep 30 10:24:51 2019', queued=0ms, exec=0ms

Failed Fencing Actions:
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.2222, origin=rh77hv-01, completed='Mon Sep 30 10:26:51 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.2222, origin=rh77hv-02, completed='Mon Sep 30 10:26:51 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.2222, origin=rh77hv-01, completed='Mon Sep 30 10:26:39 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.2222, origin=rh77hv-02, completed='Mon Sep 30 10:26:39 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.2222, origin=rh77hv-01, completed='Mon Sep 30 10:26:27 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.2222, origin=rh77hv-02, completed='Mon Sep 30 10:26:27 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.2222, origin=rh77hv-01, completed='Mon Sep 30 10:26:15 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.2222, origin=rh77hv-02, completed='Mon Sep 30 10:26:15 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.2222, origin=rh77hv-02, completed='Mon Sep 30 10:26:03 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.2222, origin=rh77hv-01, completed='Mon Sep 30 10:26:03 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.2222, origin=rh77hv-01, completed='Mon Sep 30 10:25:51 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.2222, origin=rh77hv-02, completed='Mon Sep 30 10:25:51 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.2222, origin=rh77hv-01, completed='Mon Sep 30 10:25:39 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.2222, origin=rh77hv-02, completed='Mon Sep 30 10:25:39 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.2222, origin=rh77hv-02, completed='Mon Sep 30 10:25:27 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.2222, origin=rh77hv-01, completed='Mon Sep 30 10:25:27 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.2222, origin=rh77hv-02, completed='Mon Sep 30 10:25:15 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.2222, origin=rh77hv-01, completed='Mon Sep 30 10:25:15 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.2222, origin=rh77hv-01, completed='Mon Sep 30 10:25:02 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.2222, origin=rh77hv-02, completed='Mon Sep 30 10:25:02 2019'

Fencing History:
---

Best Regards,
Hideo Yamauchi.
Hi Hideo!

Yep, that is the other obvious way to tackle the issue: if we are creating a duplicate, memorize the initial operation there so that the outcome can be duplicated as well.

Simplicity and, at first sight, non-intrusiveness definitely speak for that approach. On the other hand, having duplicate entries is not that appealing.

Let me have a closer look and think more about it.

Regards,
Klaus
Hi Klaus,

Thanks for your comment.

> On the other hand having duplicate entries is not that appealing.

I will also consider a better way to eliminate the duplication.

Best Regards,
Hideo Yamauchi.
Hi Klaus,

By the way, for duplicate actions, which do you think should be displayed on the DC?

The DC-side operation:
----
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.2222, origin=rh77hv-01, completed='Mon Sep 30 10:26:51 2019'
(snip)
----

Or the requester's operation:
----
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.2222, origin=rh77hv-02, completed='Mon Sep 30 10:26:51 2019'
(snip)
----

Best Regards,
Hideo Yamauchi.
Hi Hideo!

Of course, ideally we would have the relay-address as well, but currently we are stuck with a structure that would require rebuilding all clients if changed.

Probably we should replace that with allocation by the library, so that the client doesn't actually have to know the size of the structure and we could add things at the end without having to rebuild all clients (similar to what we did with the scheduler-stuff - just a little more challenging, as the clients are working with an array of the structure atm).

But until we have that, I'd prefer staying with the initial origin, so that we know where the request really came from and we keep consistency with the client-pid.

Regards,
Klaus
Hi Klaus,

Thanks for your comment. Sorry... I did not understand your comments well.

What about the following method?

1. The DC node asks another node to RELAY its own fencing.
2. The node that received the RELAY sends a QUERY that includes the REMOTE_OP_ID of the DC node carried in the RELAY. The REMOTE_OP_ID is set in the F_STONITH_CALLDATA area.

----example----
<stonith_command __name__="stonith_command" t="stonith-ng"
    st_async_id="d051d63c-3642-4bfa-bd73-26a78403ec3a" st_op="st_query"
    st_callid="8" st_callopt="64"
    st_remote_op="d051d63c-3642-4bfa-bd73-26a78403ec3a"
    st_target="rh77hv-01" st_device_action="reboot" st_origin="rh77hv-02"
    st_clientid="4e538e4d-9b7e-4685-afb7-6677898f9c4a"
    st_clientname="pacemaker-controld.1776" st_timeout="60" src="rh77hv-02">
  <st_calldata>
    <st_relay st_remote_op="39b24f7f-eb3f-4342-b1ce-40941f27b46c"/>
  </st_calldata>
</stonith_command>
----

3. The DC node that receives the QUERY marks its own operation as a duplicate with merge_duplicate if the REMOTE_OP_ID from F_STONITH_CALLDATA matches the ID of one of its own operations.
4. The fencing node whose operation failed also notifies the DC node of the failure, via the fix presented earlier: https://github.com/HideoYamauchi/pacemaker/commit/2f2758ed642ae81ba15bfc29cbef100442374876
5. If there is a duplicate operation with the same REMOTE_OP_ID in stonith_local_history_diff(), stacking it into the XML is skipped.
6. Duplicate failures of RELAY operations are then no longer displayed in crm_mon.
---
[root@rh77hv-01 pacemaker]# crm_mon -1 -Af
Stack: corosync
Current DC: rh77hv-01 (version 2.0.2-744a30d655) - partition with quorum
Last updated: Fri Oct 4 12:44:37 2019
Last change: Fri Oct 4 12:42:12 2019 by root via cibadmin on rh77hv-01

3 nodes configured
3 resources configured

Node rh77hv-01: UNCLEAN (online)
Online: [ rh77hv-02 ]
OFFLINE: [ rh77hv-03 ]

Active resources:
 prmPrimitive01 (ocf::pacemaker:Dummy): FAILED rh77hv-01
 Resource Group: grpStonith1
     prmStonith1-2 (stonith:external/ssh): Started rh77hv-02
 Resource Group: grpStonith2
     prmStonith2-2 (stonith:external/ssh): Started rh77hv-01

Node Attributes:
* Node rh77hv-01:
* Node rh77hv-02:

Migration Summary:
* Node rh77hv-01:
   prmPrimitive01: migration-threshold=1 fail-count=1000000 last-failure='Fri Oct 4 12:42:44 2019'
* Node rh77hv-02:

Failed Resource Actions:
* prmPrimitive01_stop_0 on rh77hv-01 'not installed' (5): call=19, status=Not installed, exitreason='', last-rc-change='Fri Oct 4 12:42:44 2019', queued=0ms, exec=0ms

Failed Fencing Actions:
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.1776, origin=rh77hv-02, last-failed='Fri Oct 4 12:43:49 2019'

[root@rh77hv-01 pacemaker]# crm_mon -1 -Af -m3
Stack: corosync
Current DC: rh77hv-01 (version 2.0.2-744a30d655) - partition with quorum
Last updated: Fri Oct 4 12:44:49 2019
Last change: Fri Oct 4 12:42:12 2019 by root via cibadmin on rh77hv-01

3 nodes configured
3 resources configured

Node rh77hv-01: UNCLEAN (online)
Online: [ rh77hv-02 ]
OFFLINE: [ rh77hv-03 ]

Active resources:
 prmPrimitive01 (ocf::pacemaker:Dummy): FAILED rh77hv-01
 Resource Group: grpStonith1
     prmStonith1-2 (stonith:external/ssh): Started rh77hv-02
 Resource Group: grpStonith2
     prmStonith2-2 (stonith:external/ssh): Started rh77hv-01

Node Attributes:
* Node rh77hv-01:
* Node rh77hv-02:

Migration Summary:
* Node rh77hv-01:
   prmPrimitive01: migration-threshold=1 fail-count=1000000 last-failure='Fri Oct 4 12:42:44 2019'
* Node rh77hv-02:

Failed Resource Actions:
* prmPrimitive01_stop_0 on rh77hv-01 'not installed' (5): call=19, status=Not installed, exitreason='', last-rc-change='Fri Oct 4 12:42:44 2019', queued=0ms, exec=0ms

Failed Fencing Actions:
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.1776, origin=rh77hv-02, completed='Fri Oct 4 12:43:49 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.1776, origin=rh77hv-02, completed='Fri Oct 4 12:43:42 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.1776, origin=rh77hv-02, completed='Fri Oct 4 12:43:36 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.1776, origin=rh77hv-02, completed='Fri Oct 4 12:43:29 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.1776, origin=rh77hv-02, completed='Fri Oct 4 12:43:23 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.1776, origin=rh77hv-02, completed='Fri Oct 4 12:43:16 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.1776, origin=rh77hv-02, completed='Fri Oct 4 12:43:10 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.1776, origin=rh77hv-02, completed='Fri Oct 4 12:43:03 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.1776, origin=rh77hv-02, completed='Fri Oct 4 12:42:57 2019'
* reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-controld.1776, origin=rh77hv-02, completed='Fri Oct 4 12:42:50 2019'

Fencing History:
---

The main point of this correction method is to communicate the original REMOTE_OP_ID to the DC node using the F_STONITH_CALLDATA area. Other processing is possible by working with the op->request area, but then it is necessary to release the op->request area after remote_op_done.

Best Regards,
Hideo Yamauchi.
As far as I understand, that will give us just a single entry per failed operation, which is definitely preferable. This is achieved by keeping references between the original and the relay-duplicate, and searching for them.

What I don't like so much is that the remaining failed action has origin=rh77hv-02 and client=pacemaker-controld.1776, where pid 1776 is the pid of pacemaker-controld on rh77hv-01.

I don't get your point 5). Filtering every time the client library is asking?

What I had envisioned initially (without having had the time so far to check out all implications) was not to do the duplication in the first place. Having around a single operation that has origin=rh77hv-02, delegate=rh77hv-02, client=pacemaker-controld.1776, relay=rh77hv-02 should actually be sufficient. Without changing the history structure we use to pass the history data to clients, we would have to skip relay for now. I'd like to move anyway to a new structure that is allocated and freed by the library, so that we can expand it later without having to (further) adapt the clients. In the scheduler's case we had done that with an incompatible switch.

Klaus
(In reply to Klaus Wenninger from comment #30)
> Without changing the history-structure we are using to pass the history-data
> to clients we would have to skip relay for now.
> I'd like to go anyway to a new structure that is allocated and freed by the
> library so that we can expand it in the end without having to (further)
> adapt the client. In case of scheduler we had done that with an incompatible
> switch.

Klaus,

I assume you're talking about stonith_history_t? That looks like it's already always dynamically allocated. The history() method takes a stonith_history_t**, so the caller only declares a stonith_history_t*. It would be a good idea to add a doxygen comment to the struct saying it should not be declared statically, but I think we'd be on solid ground adding a new member to the end of the struct.
Hi Klaus, Hi Ken,

> What I don't like so much is that the remaining failed action has
> origin=rh77hv-02 and client=pacemaker-controld.1776 where pid 1776 is the
> pid of pacemaker-controld on rh77hv-01.

I am not concerned about this display, because the original requester is the crmd of the DC node. If it is changed, it becomes the fenced process name and PID of the node that actually performed the fencing.

--ex--
reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-fenced.XXXXX, origin=rh77hv-02, completed='Fri Oct 4 12:42:50 2019'
------

> I don't get your point 5).
> Filtering every time the client-library is asking?

I will think about this a little more. Perhaps the filter becomes unnecessary if the RELAY operation performed by the DC is kept in a duplicated state.

The method presented in my proposed change changes neither the structure of stonith_history_t nor the structure of remote_fencing_op_t.

Best Regards,
Hideo Yamauchi.
(In reply to Ken Gaillot from comment #31)
> (In reply to Klaus Wenninger from comment #30)
> > Without changing the history-structure we are using to pass the history-data
> > to clients we would have to skip relay for now.
> > I'd like to go anyway to a new structure that is allocated and freed by the
> > library so that we can expand it in the end without having to (further)
> > adapt the client. In case of scheduler we had done that with an incompatible
> > switch.
>
> Klaus,
>
> I assume you're talking about stonith_history_t? That looks like it's
> already always dynamically allocated. The history() method takes a
> stonith_history_t**, so the caller only declares stonith_history_t*. It
> would be a good idea to add a doxygen comment for the struct saying it
> should not be declared statically, but I think we'd be on solid ground
> adding a new member to the end of the struct.

Had noted somewhere in the back of my mind that there was still something to be done regarding memory management of the fence history. Hadn't updated my memory from the code recently ;-)

The history is a singly linked list where the elements are allocated by the library, and a full list can be freed by the library as well (next-pointer = NULL makes that usable for single elements too, of course). Unfortunately there is code in crm_mon (reduce_stonith_history) that does partial freeing and moves sub-elements from one list element to another, to overcome the shortcomings of a singly linked list. Of course that doesn't need a change in the library (although a shallow-copy-and-free or something might be another way). With a little more singly-linked-list magic - or zeroing out the moved sub-elements and next-pointer and then using stonith_history_free - we could go with the current library and still be prepared for additional sub-elements.
(In reply to Hideo Yamauchi from comment #32)
> If it is changed, it will be the fenced process name and PID string of the
> node that actually performed the fencing.

Aah ... that is better. I didn't check that in the code; I just saw this strange combination of origin and wrong pid in one of the first comments of this thread.

> The method presented in my proposed change does not change the structure of
> stonith_history_t nor the structure of remote_fencing_op_t.

Didn't say it would. I just wanted to talk about what would have to be done to get one entry that has all the info - even in tools like crm_mon. And it looks as if we need just a slight modification in reduce_stonith_history and no adaptation of the library. (Thanks Ken, and shame on me for not checking properly instead of relying on my memory ;-) )

Regards,
Klaus
Hi Klaus, Hi Ken,

Thanks for your comments. I will consider this a little more on my side.

Best Regards,
Hideo Yamauchi.
Hi Klaus, Hi Ken,

I am busy with some other work at the moment. I will resume this work next week.

Best Regards,
Hideo Yamauchi.
Hi Klaus, Hi Ken,

Using only the duplicate state does not seem to improve the behavior without changing the fencing-history XML message, so I tried handling the duplicated operation on the DC side instead. Here is an example of the fix:
- https://github.com/HideoYamauchi/pacemaker/commit/1eecafa034a3585c3cde51dc61c2b919dee59cd8

The modification is basically the same as the previous method:

(1) The DC node includes the remote_op id in the RELAY message.
(2) The node that received the request includes the remote_op id from the RELAY message in the QUERY message.
(3) The DC node that receives the QUERY message deletes the remote_op id of its own RELAY from the stonith_remote_op_list.
(4) After that, processing proceeds based on the remote_op created by the requesting node.

The output when all fencing actually fails is shown below. (The client of the node that requested fencing is displayed as pacemaker-fenced.)

----
[root@rh77hv-01 pacemaker]# crm_mon -1 -Af -m3
Cluster Summary:
  * Stack: corosync
  * Current DC: rh77hv-01 (version 2.0.3-d863971b7e) - partition with quorum
  * Last updated: Thu Nov 14 12:41:10 2019
  * Last change: Thu Nov 14 12:32:38 2019 by root via cibadmin on rh77hv-01
  * 3 nodes configured
  * 3 resources configured

Node List:
  * Node rh77hv-01: UNCLEAN (online)
  * Online: [ rh77hv-02 ]
  * OFFLINE: [ rh77hv-03 ]

Active Resources:
  * prmPrimitive01 (ocf::pacemaker:Dummy): FAILED rh77hv-01
  * Resource Group: grpStonith1:
    * prmStonith1-2 (stonith:external/ssh): Started rh77hv-02
  * Resource Group: grpStonith2:
    * prmStonith2-2 (stonith:external/ssh): Started rh77hv-01

Migration Summary:
  * Node: rh77hv-01:
    * prmPrimitive01: migration-threshold=1 fail-count=1000000 last-failure=Thu Nov 14 12:33:22 2019:

Failed Resource Actions:
  * prmPrimitive01_stop_0 on rh77hv-01 'not installed' (5): call=19, status='Not installed', exitreason='', last-rc-change='2019-11-14 12:33:22 +09:00', queued=0ms, exec=0ms

Failed Fencing Actions:
  * reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-fenced.32132, origin=rh77hv-02, completed='2019-11-14 12:34:26 +09:00'
  * reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-fenced.32132, origin=rh77hv-02, completed='2019-11-14 12:34:20 +09:00'
  * reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-fenced.32132, origin=rh77hv-02, completed='2019-11-14 12:34:13 +09:00'
  * reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-fenced.32132, origin=rh77hv-02, completed='2019-11-14 12:34:07 +09:00'
  * reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-fenced.32132, origin=rh77hv-02, completed='2019-11-14 12:34:00 +09:00'
  * reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-fenced.32132, origin=rh77hv-02, completed='2019-11-14 12:33:54 +09:00'
  * reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-fenced.32132, origin=rh77hv-02, completed='2019-11-14 12:33:47 +09:00'
  * reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-fenced.32132, origin=rh77hv-02, completed='2019-11-14 12:33:41 +09:00'
  * reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-fenced.32132, origin=rh77hv-02, completed='2019-11-14 12:33:34 +09:00'
  * reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-fenced.32132, origin=rh77hv-02, completed='2019-11-14 12:33:28 +09:00'

Fencing History:

[root@rh77hv-02 ~]# crm_mon -1 -Af -m3
Cluster Summary:
  * Stack: corosync
  * Current DC: rh77hv-01 (version 2.0.3-d863971b7e) - partition with quorum
  * Last updated: Thu Nov 14 12:41:24 2019
  * Last change: Thu Nov 14 12:32:38 2019 by root via cibadmin on rh77hv-01
  * 3 nodes configured
  * 3 resources configured

Node List:
  * Node rh77hv-01: UNCLEAN (online)
  * Online: [ rh77hv-02 ]
  * OFFLINE: [ rh77hv-03 ]

Active Resources:
  * prmPrimitive01 (ocf::pacemaker:Dummy): FAILED rh77hv-01
  * Resource Group: grpStonith1:
    * prmStonith1-2 (stonith:external/ssh): Started rh77hv-02
  * Resource Group: grpStonith2:
    * prmStonith2-2 (stonith:external/ssh): Started rh77hv-01

Migration Summary:
  * Node: rh77hv-01:
    * prmPrimitive01: migration-threshold=1 fail-count=1000000 last-failure=Thu Nov 14 12:33:22 2019:

Failed Resource Actions:
  * prmPrimitive01_stop_0 on rh77hv-01 'not installed' (5): call=19, status='Not installed', exitreason='', last-rc-change='2019-11-14 12:33:22 +09:00', queued=0ms, exec=0ms

Failed Fencing Actions:
  * reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-fenced.32132, origin=rh77hv-02, completed='2019-11-14 12:34:26 +09:00'
  * reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-fenced.32132, origin=rh77hv-02, completed='2019-11-14 12:34:20 +09:00'
  * reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-fenced.32132, origin=rh77hv-02, completed='2019-11-14 12:34:13 +09:00'
  * reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-fenced.32132, origin=rh77hv-02, completed='2019-11-14 12:34:07 +09:00'
  * reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-fenced.32132, origin=rh77hv-02, completed='2019-11-14 12:34:00 +09:00'
  * reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-fenced.32132, origin=rh77hv-02, completed='2019-11-14 12:33:54 +09:00'
  * reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-fenced.32132, origin=rh77hv-02, completed='2019-11-14 12:33:47 +09:00'
  * reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-fenced.32132, origin=rh77hv-02, completed='2019-11-14 12:33:41 +09:00'
  * reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-fenced.32132, origin=rh77hv-02, completed='2019-11-14 12:33:34 +09:00'
  * reboot of rh77hv-01 failed: delegate=rh77hv-02, client=pacemaker-fenced.32132, origin=rh77hv-02, completed='2019-11-14 12:33:28 +09:00'

Fencing History:
----

What do you think?

Best Regards,
Hideo Yamauchi.
Hi Klaus, Hi Ken,

There seems to be a bug in the fix. Please wait while I finish the correction.

Best Regards,
Hideo Yamauchi.
Hi Klaus, Hi Ken,

I fixed the problem. The revised proposal now works as intended:
- https://github.com/HideoYamauchi/pacemaker/tree/bug5401-new

Best Regards,
Hideo Yamauchi.
Hi Klaus, Hi Ken,

I will open a PR with this proposal. Let's discuss the details in the PR.

Best Regards,
Hideo Yamauchi.
Hi Ken, Hi Klaus,

I implemented this correction in the following PR:
- https://github.com/ClusterLabs/pacemaker/pull/1951

However, on reconsideration, it is better to simply include the REMOTE_OP_ID of the RELAY message in the QUERY response.
- Even the simpler fix I now plan will not change the structure of stonith_history_t or remote_fencing_op_t.

I plan to switch to this simpler fix (PR 1959) after the fix I am currently working on is incorporated.

Best Regards,
Hideo Yamauchi.
Fixed by commit df71a07, which will be in Pacemaker 2.0.4.