Finally following up on the ODB High Availability topic from the Zato general architecture questions, I came across MySQL InnoDB Cluster. Since it is already GA and requires no custom external solution to make it HA, I tried to recreate an environment similar to my production one in my lab:
3 servers, each running:
- a Zato server (2.0.7, since that is the version I am running in production right now) with 2 workers;
- a Zato load balancer;
- Redis with Sentinel;
- MySQL using InnoDB Cluster + MySQL Router.
InnoDB Cluster introduces a new component called MySQL Router, which abstracts the HA details away from Zato: in the server and web-admin configuration I simply use the port of the local MySQL Router, and Router detects which node is the primary and switches connections from one machine to another in case of failure.
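For reference, the ODB section of each server.conf just points at the local Router read-write port rather than at any MySQL node directly, roughly like the excerpt below (key names quoted from memory and values anonymized, so please check against your own server.conf; 6446 is Router's default read-write port, and web-admin.conf points at the same local port):

[odb]
engine=mysql
db_name=zato_odb
host=127.0.0.1
port=6446
username=zato
password=******
pool_size=1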
When everything is up and running, all seems smooth. If a secondary machine goes down, no problem: everything keeps working, as expected. But when the primary goes down, we run into some issues…
Before the node goes down, 2 out of the 3 servers keep logging the ODB last_keep_alive message, like the one below:
2018-01-11 11:16:51,279 - INFO - 7336:Dummy-17395 - zato.server.odb:22 - last_keep_alive:[2018-01-11 13:16:33], grace_time:[90], max_allowed:[2018-01-11 13:18:03], now:[2018-01-11 13:16:51.279006]
As soon as the primary goes down, we see some stack traces, generally with the message “Lost connection to MySQL server during query” (complete stack traces from one of the servers are pasted at the end), or Redis-unavailable messages (since it takes some time for Sentinel to promote another machine to master).
After a couple of minutes the traces stop, since I imagine each ODB connection eventually manages to reconnect to whichever server is now the new primary. But when this happens, the servers no longer log the last_keep_alive message and the scheduled services do not run. In fact, the logs stop printing anything except messages from services invoked externally. Calling a service of mine works normally, including accessing Redis data without any issue.
In my test scenario I created a new scheduled service and it never fired, not even when I clicked the execute button. When I brought the downed machine back up, it rejoined the cluster and the scheduled jobs started running again.
This translates into an HA scenario that still requires human intervention to restart the remaining servers whenever the primary ODB machine goes down, which is not ideal.
Looking at how MySQL Router works, the first query or two performed after the node goes down is effectively lost: it is not rerouted immediately to the new primary, which may break something inside Zato when it happens. See below:
[root@zato03 ~]# mysql -u root -P 6446 -h 127.0.0.1 -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
...
mysql> select @@hostname;
+------------+
| @@hostname |
+------------+
| zato01 |
+------------+
1 row in set (0.00 sec)
mysql> select @@hostname;
ERROR 2013 (HY000): Lost connection to MySQL server during query
mysql> select @@hostname;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id: 40
Current database: *** NONE ***
+------------+
| @@hostname |
+------------+
| zato03 |
+------------+
1 row in set (0.05 sec)
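The same behaviour is visible from Python, which is presumably what Zato's pooled connections run into: the first query on an already-open connection dies with error 2013, and only a fresh connection reaches the new primary. A rough, hypothetical sketch (not Zato code) using PyMySQL against the local Router port:

import pymysql

def current_hostname(conn):
    cur = conn.cursor()
    cur.execute("SELECT @@hostname")
    row = cur.fetchone()
    cur.close()
    return row[0]

# Connect through the local MySQL Router read-write port.
conn = pymysql.connect(host="127.0.0.1", port=6446, user="root", passwd="...")
print(current_hostname(conn))   # zato01 - the current primary

# ... the primary is shut down at this point ...

try:
    print(current_hostname(conn))
except pymysql.err.OperationalError as e:
    # (2013, 'Lost connection to MySQL server during query') - the stale
    # connection is not transparently rerouted by Router.
    print(e)

# Only a brand-new connection is routed to the newly elected primary.
conn = pymysql.connect(host="127.0.0.1", port=6446, user="root", passwd="...")
print(current_hostname(conn))   # zato03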
So my question to you is: is there any configuration I could change that would keep the remaining servers running their scheduled jobs after a primary ODB failure?
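For what it's worth, outside of Zato I would normally deal with this through SQLAlchemy's classic pessimistic disconnect-handling recipe, which pings a pooled connection on checkout and discards dead ones so the pool reconnects transparently through Router. I do not know whether, or where, something like this can be hooked into Zato's ODB pool, which is really what I am asking; the sketch below is only an illustration and the engine URL is an example:

from sqlalchemy import create_engine, exc, event
from sqlalchemy.pool import Pool

@event.listens_for(Pool, "checkout")
def ping_connection(dbapi_connection, connection_record, connection_proxy):
    # Ping the connection before handing it out; if the ping fails, raising
    # DisconnectionError makes the pool drop it and retry with a fresh one
    # instead of handing out a dead socket pointing at the old primary.
    cursor = dbapi_connection.cursor()
    try:
        cursor.execute("SELECT 1")
    except Exception:
        raise exc.DisconnectionError()
    finally:
        cursor.close()

# Example engine URL pointing at the local MySQL Router read-write port.
engine = create_engine("mysql+pymysql://zato:password@127.0.0.1:6446/zato_odb")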
Since I know Zato 3.0 handles the scheduler architecture somewhat differently, is it subject to the same issues reported here? If not, how hard would it be to make it able to reconnect to the DB without losing functionality until the next restart?
This is the last major architecture point I need to solve for HA with Zato, so any help is appreciated.
Logs from one of the nodes at failure time:
2018-01-11 12:31:27,588 - INFO - 2757:Dummy-15 - zato.helpers.input-logger:800 - {u'impl_name': u'zato.server.service.internal.helpers.InputLogger', u'name': u'zato.helpers.input-logger', u'cid': u'K05YNF9KG5TKFF9TF37Q0AWFJ9C7', u'invocation_time': datetime.datetime(2018, 1, 11, 14, 31, 27, 585848), u'job_type': None, u'data_format': None, u'slow_threshold': 99999, u'request.payload': u'Sample payload for a startup service (first worker)', u'wsgi_environ': {u'zato.request_ctx.async_msg': Bunch(action=u'101802', channel=u'startup-service', cid=u'K05YNF9KG5TKFF9TF37Q0AWFJ9C7', msg_type=u'0001', payload=u'Sample payload for a startup service (first worker)', service=u'zato.helpers.input-logger'), u'zato.request_ctx.fanout_cid': None, u'zato.request_ctx.in_reply_to': None, u'zato.request_ctx.parallel_exec_cid': None}, u'environ': {}, u'usage': 32, u'channel': u'startup-service'}
2018-01-11 12:36:48,148 - WARNING - 2757:Dummy-53 - zato.server.service:22 - Traceback (most recent call last):
File "/opt/zato/2.0.7/code/zato-server/src/zato/server/service/__init__.py", line 401, in update_handle
self._invoke(service, channel)
File "/opt/zato/2.0.7/code/zato-server/src/zato/server/service/__init__.py", line 348, in _invoke
service.handle()
File "/opt/zato/2.0.7/code/zato-server/src/zato/server/service/internal/server.py", line 40, in handle
filter(Cluster.id == cluster_id).\
File "/opt/zato/2.0.7/code/eggs/SQLAlchemy-0.9.9-py2.7-linux-x86_64.egg/sqlalchemy/orm/query.py", line 2398, in one
ret = list(self)
File "/opt/zato/2.0.7/code/eggs/SQLAlchemy-0.9.9-py2.7-linux-x86_64.egg/sqlalchemy/orm/query.py", line 2441, in __iter__
return self._execute_and_instances(context)
File "/opt/zato/2.0.7/code/eggs/SQLAlchemy-0.9.9-py2.7-linux-x86_64.egg/sqlalchemy/orm/query.py", line 2456, in _execute_and_instances
result = conn.execute(querycontext.statement, self._params)
File "/opt/zato/2.0.7/code/eggs/SQLAlchemy-0.9.9-py2.7-linux-x86_64.egg/sqlalchemy/engine/base.py", line 841, in execute
return meth(self, multiparams, params)
File "/opt/zato/2.0.7/code/eggs/SQLAlchemy-0.9.9-py2.7-linux-x86_64.egg/sqlalchemy/sql/elements.py", line 322, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File "/opt/zato/2.0.7/code/eggs/SQLAlchemy-0.9.9-py2.7-linux-x86_64.egg/sqlalchemy/engine/base.py", line 938, in _execute_clauseelement
compiled_sql, distilled_params
File "/opt/zato/2.0.7/code/eggs/SQLAlchemy-0.9.9-py2.7-linux-x86_64.egg/sqlalchemy/engine/base.py", line 1070, in _execute_context
context)
File "/opt/zato/2.0.7/code/eggs/SQLAlchemy-0.9.9-py2.7-linux-x86_64.egg/sqlalchemy/engine/base.py", line 1271, in _handle_dbapi_exception
exc_info
File "/opt/zato/2.0.7/code/eggs/SQLAlchemy-0.9.9-py2.7-linux-x86_64.egg/sqlalchemy/util/compat.py", line 199, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb)
File "/opt/zato/2.0.7/code/eggs/SQLAlchemy-0.9.9-py2.7-linux-x86_64.egg/sqlalchemy/engine/base.py", line 1063, in _execute_context
context)
File "/opt/zato/2.0.7/code/eggs/SQLAlchemy-0.9.9-py2.7-linux-x86_64.egg/sqlalchemy/engine/default.py", line 442, in do_execute
cursor.execute(statement, parameters)
File "/opt/zato/2.0.7/code/eggs/PyMySQL-0.6.2-py2.7.egg/pymysql/cursors.py", line 132, in execute
result = self._query(query)
File "/opt/zato/2.0.7/code/eggs/PyMySQL-0.6.2-py2.7.egg/pymysql/cursors.py", line 271, in _query
conn.query(q)
File "/opt/zato/2.0.7/code/eggs/PyMySQL-0.6.2-py2.7.egg/pymysql/connections.py", line 726, in query
self._affected_rows = self._read_query_result(unbuffered=unbuffered)
File "/opt/zato/2.0.7/code/eggs/PyMySQL-0.6.2-py2.7.egg/pymysql/connections.py", line 861, in _read_query_result
result.read()
File "/opt/zato/2.0.7/code/eggs/PyMySQL-0.6.2-py2.7.egg/pymysql/connections.py", line 1064, in read
first_packet = self.connection._read_packet()
File "/opt/zato/2.0.7/code/eggs/PyMySQL-0.6.2-py2.7.egg/pymysql/connections.py", line 825, in _read_packet
packet = packet_type(self)
File "/opt/zato/2.0.7/code/eggs/PyMySQL-0.6.2-py2.7.egg/pymysql/connections.py", line 242, in __init__
self._recv_packet(connection)
File "/opt/zato/2.0.7/code/eggs/PyMySQL-0.6.2-py2.7.egg/pymysql/connections.py", line 248, in _recv_packet
packet_header = connection._read_bytes(4)
File "/opt/zato/2.0.7/code/eggs/PyMySQL-0.6.2-py2.7.egg/pymysql/connections.py", line 841, in _read_bytes
"Lost connection to MySQL server during query")
OperationalError: (OperationalError) (2013, 'Lost connection to MySQL server during query') 'SELECT cluster.id AS cluster_id, cluster.name AS cluster_name, cluster.description AS cluster_description, cluster.odb_type AS cluster_odb_type, cluster.odb_host AS cluster_odb_host, cluster.odb_port AS cluster_odb_port, cluster.odb_user AS cluster_odb_user, cluster.odb_db_name AS cluster_odb_db_name, cluster.odb_schema AS cluster_odb_schema, cluster.broker_host AS cluster_broker_host, cluster.broker_port AS cluster_broker_port, cluster.lb_host AS cluster_lb_host, cluster.lb_port AS cluster_lb_port, cluster.lb_agent_port AS cluster_lb_agent_port, cluster.cw_srv_id AS cluster_cw_srv_id, cluster.cw_srv_keep_alive_dt AS cluster_cw_srv_keep_alive_dt \nFROM cluster \nWHERE cluster.id = %s FOR UPDATE' (1,)
2018-01-11 12:36:48,168 - ERROR - 2757:Dummy-53 - zato.server.base:22 - Could not handle broker msg:[Bunch(action=u'100004', cid=u'K06A1K0EBG5M29KBQXVDMDN2RMM9', job_type=u'interval_based', msg_type=u'0001', name=u'zato.server.cluster-wide-singleton-keep-alive', payload=u'server_id:3;cluster_id:1', service=u'zato.server.cluster-wide-singleton-keep-alive')], e:[Traceback (most recent call last):
File "/opt/zato/2.0.7/code/zato-server/src/zato/server/base/__init__.py", line 47, in on_broker_msg
getattr(self, handler)(msg)
File "/opt/zato/2.0.7/code/zato-server/src/zato/server/base/worker.py", line 1028, in on_broker_msg_SCHEDULER_JOB_EXECUTED
return self.on_message_invoke_service(msg, CHANNEL.SCHEDULER, 'SCHEDULER_JOB_EXECUTED', args)
File "/opt/zato/2.0.7/code/zato-server/src/zato/server/base/worker.py", line 1007, in on_message_invoke_service
environ=msg.get('environ'))
File "/opt/zato/2.0.7/code/zato-server/src/zato/server/service/__init__.py", line 401, in update_handle
self._invoke(service, channel)
File "/opt/zato/2.0.7/code/zato-server/src/zato/server/service/__init__.py", line 348, in _invoke
service.handle()
File "/opt/zato/2.0.7/code/zato-server/src/zato/server/service/internal/server.py", line 40, in handle
filter(Cluster.id == cluster_id).\
File "/opt/zato/2.0.7/code/eggs/SQLAlchemy-0.9.9-py2.7-linux-x86_64.egg/sqlalchemy/orm/query.py", line 2398, in one
ret = list(self)
File "/opt/zato/2.0.7/code/eggs/SQLAlchemy-0.9.9-py2.7-linux-x86_64.egg/sqlalchemy/orm/query.py", line 2441, in __iter__
return self._execute_and_instances(context)
File "/opt/zato/2.0.7/code/eggs/SQLAlchemy-0.9.9-py2.7-linux-x86_64.egg/sqlalchemy/orm/query.py", line 2456, in _execute_and_instances
result = conn.execute(querycontext.statement, self._params)
File "/opt/zato/2.0.7/code/eggs/SQLAlchemy-0.9.9-py2.7-linux-x86_64.egg/sqlalchemy/engine/base.py", line 841, in execute
return meth(self, multiparams, params)
File "/opt/zato/2.0.7/code/eggs/SQLAlchemy-0.9.9-py2.7-linux-x86_64.egg/sqlalchemy/sql/elements.py", line 322, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File "/opt/zato/2.0.7/code/eggs/SQLAlchemy-0.9.9-py2.7-linux-x86_64.egg/sqlalchemy/engine/base.py", line 938, in _execute_clauseelement
compiled_sql, distilled_params
File "/opt/zato/2.0.7/code/eggs/SQLAlchemy-0.9.9-py2.7-linux-x86_64.egg/sqlalchemy/engine/base.py", line 1070, in _execute_context
context)
File "/opt/zato/2.0.7/code/eggs/SQLAlchemy-0.9.9-py2.7-linux-x86_64.egg/sqlalchemy/engine/base.py", line 1271, in _handle_dbapi_exception
exc_info
File "/opt/zato/2.0.7/code/eggs/SQLAlchemy-0.9.9-py2.7-linux-x86_64.egg/sqlalchemy/util/compat.py", line 199, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb)
File "/opt/zato/2.0.7/code/eggs/SQLAlchemy-0.9.9-py2.7-linux-x86_64.egg/sqlalchemy/engine/base.py", line 1063, in _execute_context
context)
File "/opt/zato/2.0.7/code/eggs/SQLAlchemy-0.9.9-py2.7-linux-x86_64.egg/sqlalchemy/engine/default.py", line 442, in do_execute
cursor.execute(statement, parameters)
File "/opt/zato/2.0.7/code/eggs/PyMySQL-0.6.2-py2.7.egg/pymysql/cursors.py", line 132, in execute
result = self._query(query)
File "/opt/zato/2.0.7/code/eggs/PyMySQL-0.6.2-py2.7.egg/pymysql/cursors.py", line 271, in _query
conn.query(q)
File "/opt/zato/2.0.7/code/eggs/PyMySQL-0.6.2-py2.7.egg/pymysql/connections.py", line 726, in query
self._affected_rows = self._read_query_result(unbuffered=unbuffered)
File "/opt/zato/2.0.7/code/eggs/PyMySQL-0.6.2-py2.7.egg/pymysql/connections.py", line 861, in _read_query_result
result.read()
File "/opt/zato/2.0.7/code/eggs/PyMySQL-0.6.2-py2.7.egg/pymysql/connections.py", line 1064, in read
first_packet = self.connection._read_packet()
File "/opt/zato/2.0.7/code/eggs/PyMySQL-0.6.2-py2.7.egg/pymysql/connections.py", line 825, in _read_packet
packet = packet_type(self)
File "/opt/zato/2.0.7/code/eggs/PyMySQL-0.6.2-py2.7.egg/pymysql/connections.py", line 242, in __init__
self._recv_packet(connection)
File "/opt/zato/2.0.7/code/eggs/PyMySQL-0.6.2-py2.7.egg/pymysql/connections.py", line 248, in _recv_packet
packet_header = connection._read_bytes(4)
File "/opt/zato/2.0.7/code/eggs/PyMySQL-0.6.2-py2.7.egg/pymysql/connections.py", line 841, in _read_bytes
"Lost connection to MySQL server during query")
OperationalError: (OperationalError) (2013, 'Lost connection to MySQL server during query') 'SELECT cluster.id AS cluster_id, cluster.name AS cluster_name, cluster.description AS cluster_description, cluster.odb_type AS cluster_odb_type, cluster.odb_host AS cluster_odb_host, cluster.odb_port AS cluster_odb_port, cluster.odb_user AS cluster_odb_user, cluster.odb_db_name AS cluster_odb_db_name, cluster.odb_schema AS cluster_odb_schema, cluster.broker_host AS cluster_broker_host, cluster.broker_port AS cluster_broker_port, cluster.lb_host AS cluster_lb_host, cluster.lb_port AS cluster_lb_port, cluster.lb_agent_port AS cluster_lb_agent_port, cluster.cw_srv_id AS cluster_cw_srv_id, cluster.cw_srv_keep_alive_dt AS cluster_cw_srv_keep_alive_dt \nFROM cluster \nWHERE cluster.id = %s FOR UPDATE' (1,)
]
2018-01-11 12:39:44,370 - INFO - 2757:Dummy-61 - company.xxx.helper:22 - This is a custom message inside my service