informix-community


HDR Failover doesn't work when DB server is powered off

naguibenator
Sun, 22 Jul 2018 21:11:01 GMT

The HDR cluster is connected and replication is on. If the Informix service is stopped gracefully on the server, failover to the secondary works fine. However, if the VM crashes or is powered off, no failover happens.

This works OK:

    service informix stop

and this too:

    onmode -yuk

But not powering off the server, which is supposed to simulate the primary node crashing in a DR situation. I assume it is a setting I need to look at? This cannot be expected behaviour.

naguibenator
Mon, 23 Jul 2018 03:19:43 GMT

Automatic switchover is configured in a primary/secondary HDR server setup. The HDR cluster is connected and replication is on. If the service is stopped gracefully on the primary server (NODE-A), failover happens and the secondary server (NODE-B) becomes primary. However, if the VM crashes or is powered off, no failover happens!

ONCONFIG file:

    # DRAUTO - Controls automatic failover of primary
    DRAUTO 3
    # DRINTERVAL - The maximum interval, in seconds, between HDR
    DRINTERVAL 0
    # DRTIMEOUT - The time, in seconds, before a network
    DRTIMEOUT 30
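For context, the documented DRAUTO settings in Informix 12.10 are roughly as follows (paraphrased from the ONCONFIG reference; the value 3 used here delegates failover arbitration to a Connection Manager):

```
# DRAUTO values (Informix 12.10, paraphrased from the ONCONFIG reference):
#   0 - manual failover only (no automatic switchover)
#   1 - RETAIN_TYPE: on primary failure the secondary becomes a standard
#       server, reverting to secondary when the pair reconnects
#   2 - REVERSE_TYPE: on primary failure the secondary becomes the primary;
#       the old primary rejoins as the secondary
#   3 - a Connection Manager arbitrates failover
DRAUTO 3
```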

andreasl
Mon, 23 Jul 2018 10:20:38 GMT

Not seeing any mention of Connection Manager (used in its capacity as Failover Arbitrator) -> so why have DRAUTO 3? On the other hand, this wouldn't explain the difference you're describing. What's in the secondary's online.log when the primary's whole VM goes away?

naguibenator
Tue, 24 Jul 2018 02:11:03 GMT

DRAUTO was set to 3 as per instruction step #5 outlined in the following IBM doc: "This setting specifies that a Connection Manager controls failover arbitration." https://www.ibm.com/support/knowledgecenter/en/SSGU8G_12.1.0/com.ibm.admin.doc/ids_admin_1173.htm

*On secondary node, prior to failure of primary*

*onstat -g cluster*

    IBM Informix Dynamic Server Version 12.10.FC10 -- Read-Only (Sec) -- Up 13:49:21 -- 1556768 Kbytes
    Primary Server:awwdst22a_t22
    Index page logging status: Enabled
    Index page logging was enabled at: 2018/06/05 23:47:35
    Server          ACKed Log     Supports   Status
                    (log, page)   Updates
    awwdst22b_t22   54506,25      No         ASYNC(HDR),Connected,On

*/logs/informix/online.log*

    01:51:43 SMX thread is exiting
    01:51:43 DR: Receive error
    01:51:43 SMX thread is exiting
    01:51:43 dr_secrcv thread : asfcode = -25582: oserr = 4: errstr = : Network connection is broken. System error = 4.
    01:51:43 DR_ERR set to -1
    01:51:43 DR: Turned off on secondary server
    01:51:44 SCHAPI: Issued Task() or Admin() command "task( 'ha make primary force', 'awwdst22b_t22' )".
    01:51:45 Skipping failover callback.

*ONCONFIG*

    # HA_FOC_ORDER - The cluster failover rules.
    HA_FOC_ORDER

*Connection Manager host*

The Connection Manager is installed on the application-server host to prioritize an application server's connectivity to the primary cluster server.

*/etc/informix/cm.sqlhosts*

    cluster_1      group    -          -             i=10
    awwdst22a_t22  onsoctcp awwdst22a  informix_t22  g=cluster_1
    awwdst22b_t22  onsoctcp awwdst22b  informix_t22  g=cluster_1
    sla_cluster_1  onsoctcp localhost  informix_t22

*/etc/informix/cmconfig*

    NAME whics-t22
    DEBUG 1
    LOG 1
    LOGFILE /logs/informix/cmlog
    CM_TIMEOUT 60
    EVENT_TIMEOUT 60
    SECONDARY_EVENT_TIMEOUT 60
    SQLHOSTS local

    CLUSTER cluster_1
    {
        INFORMIXSERVER cluster_1
        SLA sla_cluster_1 DBSERVERS=primary
        FOC ORDER=ENABLED \
            PRIORITY=1
        #CMALARMPROGRAM $INFORMIXDIR/etc/cmalarmprogram.sh
    }

andreasl
Tue, 24 Jul 2018 12:02:16 GMT

Hmmm, so CM failover arbitration is part of the picture, and these two lines indicate the CM has indeed initiated a failover and awwdst22b_t22 has received the 'ha make primary' request:

    01:51:44 SCHAPI: Issued Task() or Admin() command "task( 'ha make primary force', 'awwdst22b_t22' )".
    01:51:45 Skipping failover callback.

The interesting part would be what comes thereafter ...

naguibenator
Wed, 25 Jul 2018 04:03:49 GMT

So I shut down both DB servers, then started NODE-A (primary), then NODE-B (secondary), and the HDR cluster was on and connected. I then turned off NODE-A (primary), which again did not cause failover to the secondary NODE-B. This time around, this is all I got in the logs:

    01:44:10 DR: DRAUTO is 3 (CMSM)
    01:44:10 DR: ENCRYPT_HDR is 0 (HDR encryption Disabled)
    01:44:10 Event notification facility epoll enabled.
    01:44:10 Trusted host cache successfully built: /etc/hosts.equiv.
    01:44:10 CCFLAGS2 value set to 0x200
    01:44:10 SQL_FEAT_CTRL value set to 0x8008
    01:44:10 SQL_DEF_CTRL value set to 0x4b0
    01:44:10 IBM Informix Dynamic Server Version 12.10.FC10 Software Serial Number AAA#B000000
    01:44:11 Performance Advisory: The current size of the physical log buffer is smaller than recommended.
    01:44:11 Results: Transaction performance might not be optimal.
    01:44:11 Action: For better performance, increase the physical log buffer size to 128.
    01:44:12 IBM Informix Dynamic Server Initialized -- Shared Memory Initialized.
    01:44:12 DR: Trying to connect to primary server = awwdst10a_t10
    01:44:12 smx creates 1 transports to server awwdst10a_t10
    01:44:13 Dataskip is now OFF for all dbspaces
    01:44:13 Restartable Restore has been ENABLED
    01:44:13 Recovery Mode
    01:44:15 DR: Secondary server connected
    01:44:15 DR: Secondary server needs failure recovery
    01:44:16 DR: Failure recovery from disk in progress ...
    01:44:16 Logical Recovery Started.
    01:44:16 10 recovery worker threads will be started.
    01:44:16 Start Logical Recovery - Start Log 54497, End Log ?
    01:44:16 Starting Log Position - 54497 0x12018
    01:44:19 Started processing open transactions on secondary during startup
             Server in fast recovery until these open transactions finish
    01:44:19 Finished processing open transactions on secondary during startup.
    01:44:19 Checkpoint Completed: duration was 0 seconds.
    01:44:19 Wed Jul 25 - loguniq 54497, logpos 0x140c0, timestamp: 0xd41ebe31 Interval: 399771
    .
    .
    02:02:03 DR: Receive error
    02:02:03 SMX thread is exiting
    02:02:03 SMX thread is exiting
    02:02:03 dr_secrcv thread : asfcode = -25582: oserr = 0: errstr = : Network connection is broken.
    02:02:03 DR_ERR set to -1
    02:02:04 DR: Turned off on secondary server

And there is nothing else past this point!!
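One way to see the difference between this run and the earlier one is to check whether the CM-issued 'ha make primary' task ever shows up after replication drops. A minimal sketch, using the log lines quoted above as a stand-in for the real /logs/informix/online.log (the /tmp sample path is just for illustration):

```shell
#!/bin/sh
# Write the quoted online.log excerpt to a hypothetical sample file;
# on a real secondary you would read /logs/informix/online.log instead.
cat > /tmp/online_sample.log <<'EOF'
02:02:03 DR: Receive error
02:02:03 SMX thread is exiting
02:02:03 SMX thread is exiting
02:02:03 dr_secrcv thread : asfcode = -25582: oserr = 0: errstr = : Network connection is broken.
02:02:03 DR_ERR set to -1
02:02:04 DR: Turned off on secondary server
EOF

# In the earlier (working) test, an "SCHAPI: Issued Task() ...
# 'ha make primary force'" line followed replication dropping;
# its absence here means the CM never arbitrated a failover.
if grep -q "ha make primary" /tmp/online_sample.log; then
    echo "CM failover was initiated"
else
    echo "no CM failover attempt logged"
fi
```

Run against this excerpt, the script reports that no failover attempt was logged, which matches the "nothing else past this point" observation.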

andreasl
Wed, 25 Jul 2018 09:11:52 GMT

By the looks of it, this secondary never got fully operational (the 'DR: Secondary server operational' message is missing) before the primary went away. This message typically comes right after "Finished processing open transactions on secondary during startup." The reason it is not coming here is probably that at least one of the transactions still open (neither committed nor rolled back) at this point was also still open on the primary. A corner case, sort of, and if this is indeed the reason the failover wouldn't even be attempted, maybe a hole in the failover logic (and a bug). Can you repeat your test with the secondary first reaching 'operational' state? If that state is not reached, make sure no transaction is still open on the primary...

naguibenator
Wed, 25 Jul 2018 11:02:30 GMT

I have repeated the test with the secondary reaching 'operational' state and the cluster on/connected, and this is what I got after switching off the primary node:

    10:39:41 DR: HDR secondary server operational
    10:39:42 Checkpoint Completed: duration was 0 seconds.
    10:39:42 Wed Jul 25 - loguniq 54500, logpos 0x1e018, timestamp: 0xd41f6670 Interval: 399885
    .
    .
    10:46:33 DR: Receive error
    10:46:33 SMX thread is exiting
    10:46:33 SMX thread is exiting
    10:46:33 dr_secrcv thread : asfcode = -25582: oserr = 0: errstr = : Network connection is broken.
    10:46:33 DR_ERR set to -1
    10:46:33 DR: Turned off on secondary server
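To rule out the corner case mechanically, one can verify that 'DR: HDR secondary server operational' appears before the 'DR: Receive error' in the secondary's log. A sketch against the lines quoted above (the /tmp sample path is hypothetical; a real check would read /logs/informix/online.log):

```shell
#!/bin/sh
# Hypothetical sample built from the quoted log lines.
cat > /tmp/online_sample2.log <<'EOF'
10:39:41 DR: HDR secondary server operational
10:39:42 Checkpoint Completed: duration was 0 seconds.
10:46:33 DR: Receive error
10:46:33 dr_secrcv thread : asfcode = -25582: oserr = 0: errstr = : Network connection is broken.
10:46:33 DR_ERR set to -1
10:46:33 DR: Turned off on secondary server
EOF

# grep -n prefixes each match with its line number; comparing the two
# numbers tells us whether 'operational' preceded the failure.
op=$(grep -n "secondary server operational" /tmp/online_sample2.log | cut -d: -f1)
err=$(grep -n "DR: Receive error" /tmp/online_sample2.log | cut -d: -f1)
if [ -n "$op" ] && [ "$op" -lt "$err" ]; then
    echo "secondary was operational before the failure"
else
    echo "secondary never reached operational state"
fi
```

For this run the check passes: the secondary was operational before the failure, so the missing-operational-state corner case no longer explains the absent failover.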

andreasl
Tue, 31 Jul 2018 12:24:54 GMT

Hi, I'm afraid this is going beyond what can be dealt with in this forum. Would a support case, with IBM or HCL, depending on where you bought from, be an option?

deenm
Wed, 20 Mar 2019 19:36:57 GMT

This sounds similar to what I am experiencing. Did you receive any update from support or the Dev team on this? Is it a defect?