Issue with starting SQL Notification and Scheduler - Zato 2.0


#1

Hi,

Recently we had an issue with the Zato 2.0 in our production environment.

After a series of bulk enqueue operations that pushed queue items through Zato HTTP channels into RabbitMQ, memory utilization on the machine hosting the Zato application crossed 90%.

At approximately the same time, we saw the following entries in the Zato server logs.

2018-09-04 16:43:06,745 - CRITICAL - 31140:Dummy-1 - gunicorn.main:167 - WORKER TIMEOUT (pid:31150)
2018-09-04 16:43:07,916 - CRITICAL - 31140:Dummy-1 - gunicorn.main:167 - WORKER TIMEOUT (pid:31150)
2018-09-04 16:43:07,967 - CRITICAL - 31140:Dummy-1 - gunicorn.main:167 - WORKER TIMEOUT (pid:31150)
2018-09-04 16:43:09,058 - INFO - 7381:Dummy-1 - gunicorn.main:22 - Booting worker with pid: 7381

This basically says that a Zato server worker was killed and then rebooted itself. These entries are followed by the usual sequence of logs for re-establishing the outgoing connection pools.

The actual issue is that after this incident the Scheduler and the SQL Notification services stopped working. This eventually forced us to restart Zato in our production environment to bring all the services back up.

Note:
Scheduler - We saw no activity in scheduler.log, which made us realize that the scheduler was not working.
SQL Notification - We did not notice this until we analysed the data in the table records and realized that the service was not being invoked.
Is there any way we can monitor or know whether the SQL Notification service is running?

What could be the issue here? Is this a known issue with Zato 2.0? What would have been a possible resolution apart from restarting Zato in the production environment?


#2

Hello,

In my case a different root cause sometimes makes my workers restart, and when that happens it is “normal” for the scheduler to stop firing jobs until the server is restarted. I have a script that monitors for such cases and restarts the environment automatically (which may be something you cannot do in your environment) - a rough sketch of that approach is below.
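A minimal sketch of such a watchdog, assuming a hypothetical log path, staleness threshold and server directory - it simply treats a scheduler.log that has not been written to for a while as a sign that the scheduler has stalled and restarts the server:

    #!/usr/bin/env python
    # Watchdog sketch - restarts Zato when scheduler.log goes quiet for too long.
    # The log path, threshold and server directory below are assumptions,
    # adjust them to your own environment.

    import os
    import subprocess
    import time

    SCHEDULER_LOG = '/opt/zato/env/server1/logs/scheduler.log'  # hypothetical path
    SERVER_PATH = '/opt/zato/env/server1'                       # hypothetical path
    MAX_SILENCE = 15 * 60  # seconds without log writes before restarting
    CHECK_EVERY = 60       # how often to check, in seconds

    def main():
        while True:
            try:
                silence = time.time() - os.path.getmtime(SCHEDULER_LOG)
            except OSError:
                silence = MAX_SILENCE + 1  # a missing file counts as stalled

            if silence > MAX_SILENCE:
                # Restart the server with the standard zato stop/start commands
                subprocess.call(['zato', 'stop', SERVER_PATH])
                subprocess.call(['zato', 'start', SERVER_PATH])

            time.sleep(CHECK_EVERY)

    if __name__ == '__main__':
        main()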

I’m currently evaluating Zato 3 with the new scheduler isolation and I believe this decoupling may have the side effect of making this a non-issue in the new architecture. If you have the chance to test it in a lab environment and to upgrade in the future, maybe it’s a good fit for you as well. I’m not sure whether SQL Notification would keep its current behaviour - I don’t use it, so I cannot say anything about it.

[]'s


#3

Dariusz,

Do you have any suggestions on how to resolve the issue explained by Kiran?

Thanks,
Prasantarmaye


#4
  • When you issue a command such as zato start /path/to/server, what is actually started first is a process called the arbiter

  • This arbiter is a tiny process which starts the actual server processes - you can find it in the logs, e.g.

INFO - 13071:MainThread - Listening at: http://0.0.0.0:17010 (13071)
INFO - 13090:MainThread - Booting worker with pid: 13090
  • Above, arbiter (pid 13071) started a single server process with pid 13090

  • Server processes need to periodically let the arbiter know that they are still up. They do it by storing a timestamp in a special file that arbiter expects.

  • If the arbiter does not see the update to this file it assumes that the server process is down

  • How long a particular server has between updates is governed by the main.gunicorn_timeout entry in server.conf - by default it is 240 seconds = 4 minutes

  • If a process does not store the timestamp, the arbiter restarts it and stores an entry on CRITICAL level in server.log

  • The reason a server process did not store its status and the arbiter had to restart it is almost certainly that it was short on CPU and/or RAM

  • Remember that Zato is based on multi-processing but each process is an asynchronous server, which means that if you carry out a CPU-intensive operation, the process running it will be completely blocked for as long as the operation lasts (see the first sketch after this list)

  • If such an operation takes longer than main.gunicorn_timeout, arbiter will restart it

  • Also, but only in Zato 2.0 and earlier, if such an operation monopolizes the CPU and the scheduler happens to run in the same process, no scheduled tasks will run because the scheduler gets no access to the CPU

  • This is one of the reasons, rather than a side effect, why in Zato 3.0 the scheduler is a separate component started on its own - you can run it in a small VM and be sure it will never go down because one of the services runs for a long time and takes complete possession of a CPU

  • Also, in Zato 3.0 AMQP connections were completely rewritten from scratch and take vastly less RAM, e.g. 0.5 MB instead of 0.5 GB (incidentally, the same underlying architectural change was applied to IBM MQ and ZeroMQ)

  • It is not possible to achieve the same in 2.0 - it was not a trivial change and required architectural changes, both visible to users (the scheduler is a separate component) and internal ones (AMQP, IBM MQ and ZeroMQ redesigned and re-implemented from scratch)

  • Updates, changes or modifications to 2.0 are delivered only to clients with commercial support and there will be no further changes to 2.0 otherwise

  • There is no way to track the progress of SQL notifications in 2.0 or 3.0 but, if you open a ticket in GH, it could be added to 3.0 - e.g. each time a notification runs, it would log which query it issues and with what parameters. In the meantime, the service that a notification invokes can log each invocation itself, as in the second sketch below

  • I wrote a longer message a while back about how asynchronous servers and Zato in particular work
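To illustrate the point about CPU-intensive operations, below is a minimal sketch of a service whose handle method simply busy-loops on the CPU - the class name and the loop are made up for illustration only. While such code runs, the whole asynchronous worker process is blocked: it serves no other requests, it does not update the heartbeat file the arbiter watches and, in 2.0, no scheduler jobs run in that process. If the loop outlasts main.gunicorn_timeout, the arbiter logs WORKER TIMEOUT and restarts the worker, exactly as in the logs quoted in the first post.

    # -*- coding: utf-8 -*-
    # Illustration only - a hypothetical service that monopolizes the CPU.

    from zato.server.service import Service

    class CPUHeavyExample(Service):
        """ While handle() spins below, this worker process cannot serve other
        requests nor update the heartbeat file the arbiter watches. If this takes
        longer than main.gunicorn_timeout (240 seconds by default), the arbiter
        logs WORKER TIMEOUT on CRITICAL level and restarts the process.
        """
        def handle(self):
            total = 0
            idx = 0
            while idx < 10**8:  # pure CPU work that never yields
                total += idx * idx
                idx += 1
            self.logger.info('Done: %s', total)

Moving such work outside the worker process, or splitting it into chunks that yield control back to the event loop between steps, keeps the heartbeat flowing and, in 2.0, keeps the in-process scheduler alive.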
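As for knowing whether SQL notifications still fire in the meantime, an indirect workaround is to have the service that a notification is configured to invoke log every invocation - then, just as with scheduler.log, a silent log is a signal that the notifier has stopped. A minimal sketch, with a hypothetical service name and no assumptions about the payload structure:

    # -*- coding: utf-8 -*-
    # Illustration only - a hypothetical service invoked by an SQL notification.

    from zato.server.service import Service

    class SQLNotifHandlerExample(Service):
        """ Logs each invocation so that the absence of these entries in the
        server log indicates the SQL notifier is not running.
        """
        def handle(self):
            # The log entry itself is timestamped, so its mere presence
            # (or absence) shows whether notifications keep arriving.
            self.logger.info('SQL notification received, payload: %r',
                self.request.payload)

            # ... actual processing of the notification goes here ...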