Zato unable to deploy services after Redis failover

Hi

I have set up a 3-node Redis cluster with Sentinel HA for our zato-2.0.7 cluster. After a successful Redis node failover, the Zato cluster stops deploying services, whether through web-admin’s upload service function or hot-deployment via the pickup directory. Nothing is written to server.log, neither errors nor debug messages.

My servers config:

[kvdb]
host=
port=6379
unix_socket_path=
password=
db=0
socket_timeout=
charset=
errors=
use_redis_sentinels=True
redis_sentinels=node1:26379,node2:26379,node3:26379
redis_sentinels_master=mymaster
shadow_password_in_logs=True
log_connection_info_sleep_time=5 # In seconds
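To check from outside Zato which node the sentinels currently report as master, something like the sketch below could be used. It assumes the redis-py package is installed on the machine running it and that the sentinel addresses and master name match the config above. The helper names `parse_sentinels` and `current_master` are made up for illustration and are not part of Zato.

```python
def parse_sentinels(value):
    """Turn 'host1:port1,host2:port2' into [('host1', port1), ('host2', port2)]."""
    pairs = []
    for item in value.split(','):
        item = item.strip()
        if not item:
            continue
        host, _, port = item.rpartition(':')
        pairs.append((host, int(port)))
    return pairs


def current_master(sentinels, master_name, socket_timeout=0.5):
    """Ask the sentinels which node is the master right now.

    Needs the third-party redis-py package and network access to the sentinels.
    """
    from redis.sentinel import Sentinel  # imported lazily, only when called
    sentinel = Sentinel(sentinels, socket_timeout=socket_timeout)
    return sentinel.discover_master(master_name)  # a (host, port) tuple


# Same value as redis_sentinels in the [kvdb] section above
sentinels = parse_sentinels('node1:26379,node2:26379,node3:26379')
print(sentinels)

# Run this before and after the failover to compare the reported master:
# print(current_master(sentinels, 'mymaster'))
```

If the reported master changes after the failover but deployments still do not go through, that would suggest the sentinels are fine and the problem is in how long-lived connections are handled on the Zato side.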

Web-admin’s Key/value DB remote commands for setting and getting values still work after the Redis failover, but deploying services does not.

Keith

Can you please confirm whether scheduler jobs are executed both before and after the failover? If you don’t have any jobs, please create a test one that runs every 3 seconds, for instance.

https://zato.io/docs/web-admin/scheduler/main.html

This is relevant to hot-deployment because both the scheduler and hot-deployment use Redis for internal communication.

I created a scheduler job that writes a message to server.log every 3 seconds. Before the failover, I could see the message in server.log, and schedule.log was showing “Job executed”. I then stopped the Redis master to trigger a failover. schedule.log paused for a moment, then resumed showing “Job executed” for my test job and for others such as the “zato.server.cluster-wide-singleton-keep-alive” job. However, my test job no longer wrote its message to server.log. I also did not see either “Cluster-wide singleton keep-alive” or “Not becoming a cluster-wide singleton” in singleton.log.

The only way to get the job working properly again is to restart the singleton node. After that, my job writes its message to server.log again.

If I restart another node first, without restarting the singleton node, that node becomes a singleton node too, and singleton.log shows 2 cluster-wide singleton nodes. My job then writes its message to server.log at intervals shorter than the usual 3 seconds.

Those are my observations from the testing above. I am using the redis-3.2.1-2.el6.remi.x86_64 package.

Hello @keith,

this was worked around in this commit, ported to main, and will be released in 2.0.8 this year:

Previously, when a connection to Redis was lost, such as during a failover, it would happen silently. The reason was that gevent does not log exceptions anywhere; it only prints tracebacks to stdout. It would just quit the greenlet that was responsible for connections to Redis, and nothing would be written to server.log.

Now these connections will be re-established automatically and anything that depends on them will continue to work, such as hot-deployment.
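As a rough illustration of the idea (this is not the code from the commit, just a minimal sketch in plain Python), a listener loop can catch the error, log it explicitly, and reconnect instead of letting its greenlet exit. The names `listen_forever`, `connect`, and `handle` are hypothetical, and the flaky connection below only simulates a failover.

```python
import logging
import time

logger = logging.getLogger('kvdb-listener')


def listen_forever(connect, handle, retry_delay=1.0, max_failures=None, sleep=time.sleep):
    """Consume messages, reconnecting after any error instead of exiting.

    connect      -- hypothetical callable returning an iterable of messages
    handle       -- callable invoked with each message
    max_failures -- give up after this many failures (None means never)
    """
    failures = 0
    while max_failures is None or failures < max_failures:
        try:
            for message in connect():
                handle(message)
            return failures  # the stream ended cleanly
        except Exception:
            failures += 1
            # Log explicitly so the failure is recorded by the logging setup
            logger.exception('Connection lost (failure %d), retrying', failures)
            sleep(retry_delay)
    return failures


def make_flaky_connect():
    """Simulated connection: the first one breaks after a single message."""
    attempts = {'count': 0}

    def connect():
        attempts['count'] += 1
        yield 'message-%d' % attempts['count']
        if attempts['count'] == 1:
            raise IOError('master went away')  # stands in for a failover

    return connect


received = []
failures = listen_forever(make_flaky_connect(), received.append,
                          retry_delay=0, sleep=lambda _: None)
print(received, failures)
```

In this simulation, the message delivered after the reconnect still reaches the handler, which is the behaviour that hot-deployment and the scheduler depend on.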

Great, thanks! Will definitely test it.