Recently we ran into an issue with Zato 2.0 in our production environment.
During a series of bulk enqueues of queue items through Zato HTTP channels, with RabbitMQ as the queue, memory utilization on the machine hosting the Zato application crossed 90%.
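For reference, our enqueue path is roughly equivalent to the sketch below; the service, connection, exchange and routing key names are placeholders, and the `self.outgoing.amqp.send` call is what we understand the Zato 2.0 API for publishing to an AMQP outgoing connection to be:

```python
# -*- coding: utf-8 -*-
from json import dumps
from zato.server.service import Service

class BulkEnqueue(Service):
    """ Receives a JSON list on an HTTP channel and publishes each element
    to RabbitMQ through an AMQP outgoing connection.
    """
    def handle(self):
        # With data_format=json on the channel, self.request.payload
        # is the already-parsed request body
        for item in self.request.payload:
            # 'BulkQueue', 'my-exchange' and 'my.routing.key' stand in
            # for our actual connection, exchange and routing key names
            self.outgoing.amqp.send(dumps(item), 'BulkQueue', 'my-exchange', 'my.routing.key')
```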
At approximately the same time as the memory spike, we saw the following entries in the Zato server log:
2018-09-04 16:43:06,745 - CRITICAL - 31140:Dummy-1 - gunicorn.main:167 - WORKER TIMEOUT (pid:31150)
2018-09-04 16:43:07,916 - CRITICAL - 31140:Dummy-1 - gunicorn.main:167 - WORKER TIMEOUT (pid:31150)
2018-09-04 16:43:07,967 - CRITICAL - 31140:Dummy-1 - gunicorn.main:167 - WORKER TIMEOUT (pid:31150)
2018-09-04 16:43:09,058 - INFO - 7381:Dummy-1 - gunicorn.main:22 - Booting worker with pid: 7381
This essentially means that the gunicorn master killed a worker that had stopped responding and booted a replacement in its place. These entries are followed by the usual sequence of logs re-establishing the outgoing connection pools.
The actual issue is that after this incident the scheduler and the SQL notification services stopped working, which eventually forced us to restart Zato in production to bring all services back up.
Scheduler - we saw no activity in scheduler.log, which is how we realized the scheduler was no longer running.
SQL notifications - we did not notice this until we analysed the records in the target table and realized the service was no longer being invoked.
Is there any way to monitor, or otherwise know, whether the SQL notification service is running?
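As a stopgap, we are considering an external watchdog along these lines; the log path, threshold and alerting mechanism are placeholders for our setup, and a similar query against the notification target table could cover the SQL notification side:

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Crude external watchdog, not part of Zato itself - alerts when
scheduler.log has not been written to for longer than a threshold.
"""
import os
import sys
import time

SCHEDULER_LOG = '/opt/zato/server1/logs/scheduler.log'  # placeholder path
MAX_SILENCE = 15 * 60  # seconds of silence before we assume the scheduler is dead

def main():
    try:
        age = time.time() - os.path.getmtime(SCHEDULER_LOG)
    except OSError:
        sys.exit('Cannot stat {}'.format(SCHEDULER_LOG))

    if age > MAX_SILENCE:
        # Replace the print with real alerting (mail, Nagios, etc.);
        # a SELECT for recent rows in the table watched by the SQL
        # notification could be added here in the same fashion
        print('ALERT: no scheduler activity for {:.0f} seconds'.format(age))
        sys.exit(1)

    print('OK: last scheduler activity was {:.0f} seconds ago'.format(age))

if __name__ == '__main__':
    main()
```

But a check like this only tells us the scheduler went quiet after the fact, which is why we would prefer a supported way of verifying it.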
What could be the cause here? Is this a known issue with Zato 2.0? And what would the possible resolution have been, apart from restarting Zato in the production environment?