Zato services and worker timeouts


#1

Since working with the ideas suggested in Zato scheduler jobs stop working after Gunicorn worker timeout and restart (2.0.7 on RHEL 6.7), I noticed the problem is slightly worse than I initially thought.

I am frequently detecting services getting stuck during calls that move files, even when using the regular Zato FTP outgoing connections, with symptoms similar to those seen with external libraries:

  • 2 different service calls try to use an FTP command (such as move) at the same time (for different files);
  • Neither answers and the services get stuck;
  • Future calls to the same service do not execute anything, because I have a lock to avoid two instances of the service running at the same time (see the sketch after this list);
  • After the lock times out, a new service execution works properly again;
  • After some time, gunicorn workers are detected as TIMEOUT:
    2018-03-16 16:56:11,060 - CRITICAL - 62105:Dummy-1 - gunicorn.main:204 - WORKER TIMEOUT (pid:62619)
    2018-03-16 16:56:11,128 - CRITICAL - 62105:Dummy-1 - gunicorn.main:204 - WORKER TIMEOUT (pid:62619)
  • Life goes on until eventually the entire stack collapses. A cronjob of mine detects this state and restarts the entire Zato platform.
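
For reference, the lock mentioned above is just a guard around the FTP call. A minimal sketch of the pattern, assuming Zato's built-in self.lock() and the standard outgoing FTP API (the connection name and paths are made up):

    # -*- coding: utf-8 -*-
    from zato.server.service import Service

    class MoveIncomingFile(Service):
        """ Moves a file on the LAN FTP server, one execution at a time. """
        def handle(self):

            # A distributed lock so that two concurrent invocations never
            # issue FTP commands at the same time
            with self.lock('move-incoming-file'):

                # 'LAN FTP' is a made-up outgoing FTP connection name
                conn = self.outgoing.ftp.get('LAN FTP')

                # A pyfilesystem-style rename is effectively an FTP move
                conn.rename('/incoming/data.csv', '/processed/data.csv')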

Using interval-based jobs I could not arrange for the 2 instances to avoid running in the same second (and cron-style jobs do not let me control the seconds either).

It seems we have more than just external libraries not playing nice with gevent.

Ideas (same zato 2.0.7 on RHEL 6.7, as usual)?


#2

Hi @rtrind,

is it stock 2.0.7 or do you have some extra libraries installed and connections running, such as SFTP?


#3

It’s my production environment, so it has the libssh2 libraries for the SFTP remote service - not stock.

[]'s


#4

@rtrind - do these libraries make any connections? Are they actually running?


#5

Here is my scenario, simplified:

service1 - uses an SFTP library to get files from an external location to a local LAN server (still not the machine Zato is located on)
services 2, 3 and 4 - use the Zato FTP library to open some files for reading or to move them to other folders on the same local LAN server.

service1 runs once a minute, with 2 different instances (because I have 2 equivalent environments).
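
For completeness, a simplified sketch of what services 2, 3 and 4 roughly look like - again assuming the standard self.outgoing.ftp API, with made-up connection and file names; they only read or move files, no SFTP involved:

    # -*- coding: utf-8 -*-
    from zato.server.service import Service

    class ProcessLanFile(Service):
        """ Reads a file from the LAN FTP server and archives it afterwards. """
        def handle(self):

            # Made-up outgoing FTP connection name
            conn = self.outgoing.ftp.get('LAN FTP')

            # Open the file and read it in ..
            f = conn.open('/incoming/report.txt', 'r')
            data = f.read()
            f.close()

            self.logger.info('Read %d bytes', len(data))

            # .. then move it to an archive folder on the same server
            conn.rename('/incoming/report.txt', '/archive/report.txt')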

In the case I reported in this thread, it was 2 instances of a service that does not use the SFTP library (like service4) running between executions of service1, so no code from the SFTP library was being called at that point in time.

I am afraid that even if service1 were turned off (which I can’t do, since it’s the service that brings in the files the other services work on), we would still see the same behavior.

At this point I’m wondering whether there is anything external to Zato I could do to make sure we are really dealing with a Zato library issue. Do you believe the FTP server itself could be doing something out of protocol and causing this? I have tried 3 or 4 strategies for dealing with remote files and since none of them works, maybe the problem is external to gevent, I don’t know. Today this FTP server runs vsftpd. I’m not sure I could replace it directly, since other products in the company access this same machine and I could break something by changing it, but if we find it’s the culprit I can think of alternative approaches.

Thanks!


#6

I have difficulties with your situation because you are using an external library - anything really can happen in such a case.

If they are pure-Python then everything is usually fine - e.g. we don’t have built-in MongoDB connections but the driver is gevent-friendly and some people use Zato with MongoDB without any issues. The same goes for, say, CouchDB.

But in your case I just don’t know, and it is really impossible to offer deeply involved free support over the forum for a customized environment with non-standard libraries - this is normally covered by commercial arrangements.

I can say for sure that I am not aware of anything out of the ordinary with FTP connections and, since they are built into the core, if there were any issues with them, they would be fixed.

What I also know is that, in the future, SFTP could be handled just like IBM MQ. Earlier this year I added a lot of functionality for embedding non-gevent-friendly libraries, partly based on the feedback I had from you about SFTP. For now it covers only IBM MQ, but SFTP or the likes of DB2 could be covered by it at some point.

This is not a straightforward job - with IBM MQ several hundred lines of code were needed, so again, a commercial contract is the surest way to see it added or, alternatively, you can base a new kind of SFTP connector on how the MQ one works:

Coincidentally, there is a pull request to add SAP support in 3.0, so you could also have a look at how that was done, except that with SFTP you definitely need to base it on MQ rather than Odoo as far as the server side goes. Frontend-wise, yes, you could base it on Odoo, like in this PR:


#7

Alright, @dsuch, thanks for the info. I will wait for the 3.0 release and reassess from there. I really appreciate the help and understand the need for commercial support for something as complex as this.

Thanks!


#8

Hey @dsuch and @rtrind

Having this exact same issue with the scheduler as well.

In our staging environment, we have a quickstart cluster (for testing only) with a standard setup:

  • two servers
  • lb
  • web admin

Redis and the Postgres DB run on the same instance as well.

We have a scheduled job that invokes an API, downloads XML, parses it into JSON and posts it to another API as JSON. The entire process takes anywhere from about 3 seconds to a minute per run, depending on how much data there is.

We have about 8 of these jobs running with different parameters, so 8 jobs running every 30 seconds.
The only change I have made to our Zato server config is raising the worker timeout to 10 minutes. Even when only these jobs run, RAM consumption for Gunicorn climbs like mad. On a 16 GB server with Zato set up as above and 8 jobs running every thirty seconds for about 1-5 seconds each, after two hours of running this Zato is using close to 9 GB of RAM. This should not be the case, as neither the code nor the scheduler should use a lot of RAM…

As per the htop screenshot above, I can see that there are more than 2 gunicorn workers (as per my Zato server config, there should only be two) and that the master processes are using a mad amount of RAM.

My suspicion here is that the scheduler is not killing off its workers and is instead forking them… Can this perhaps be confirmed? I also suspect that the gunicorn workers that are not killed after they complete do not report back to the master gunicorn process, and that this is why the service times out after a while.


#9

Hello @giannis.katsini,

I think your situation is different to the original one, or at least the root cause is different.

For reference, this is what happens in 2.0.x when a new scheduled job is triggered (it is not the same in 3.0 but it is also an internal detail, which is why it is not documented - it may change at any time):

  • Scheduler always runs on one of Zato servers, i.e. scheduler is a background gevent task running in one of Zato workers
  • Scheduler publishes a message to Redis pub/sub about the new job’s existence
  • All server processes are subscribed to pub/sub in Redis
  • One of server processes picks up the message and starts processing it
  • Processing may mean, for instance, running your service
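
To illustrate the fan-out only, here is a rough sketch in plain redis-py - both sides are shown in one file for brevity, the channel name and payload are made up and the real internal format is undocumented:

    # -*- coding: utf-8 -*-
    import json
    import redis

    r = redis.StrictRedis(host='localhost', port=6379)

    # Scheduler side - a background task in one of the server workers
    # announces that a job has just fired
    r.publish('zato.scheduler.job', json.dumps({'service': 'my.service'}))

    # Server side - every worker process keeps a subscription like this open
    # and the job ends up being executed by one of them
    ps = r.pubsub()
    ps.subscribe('zato.scheduler.job')

    for msg in ps.listen():
        if msg['type'] == 'message':
            job = json.loads(msg['data'])  # .. and here the target service would be invoked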

What happens next depends entirely on what your service is doing.

We have tools, used in commercial support, to measure RAM consumption of various user-specific workloads with all of the Python JSON libraries. The reason we have such a specialized tool is that there is a vast discrepancy between them depending on what the documents look like - but in principle, you can easily assume that it is not uncommon for RAM usage to be 10x bigger than the JSON files you are processing.

For instance, 500 MB of invoices can easily inflate to 5 GB - this is how Python libraries work and it is completely outside of Zato. This is parsing alone, without any actual processing, iteration, look-ups, transformation into your business objects or anything else. The same goes for XML; this is not uncommon.
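
To see that inflation for yourself, a rough measurement along these lines compares a document's size on disk with the process's peak RSS growth after parsing it (on Linux ru_maxrss is reported in kilobytes; this is an approximation only):

    # -*- coding: utf-8 -*-
    import json
    import os
    import resource
    import sys

    path = sys.argv[1]  # any large JSON document
    raw_size = os.path.getsize(path)

    before_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

    with open(path) as f:
        doc = json.load(f)  # keep a reference so the parsed objects stay in memory

    after_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    growth = (after_kb - before_kb) * 1024.0

    print('File: %.1f MB, RSS growth: %.1f MB, ratio: %.1fx' % (
        raw_size / 1e6, growth / 1e6, growth / raw_size))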

In some environments, I have also seen channels / outgoing connections configured to use JSON as their data format (which means Zato parses the incoming data as JSON) while the services then parsed it again anyway, which doubles the RAM usage.
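
In a service the double parsing looks more or less like below - a sketch assuming the usual self.request.payload and self.request.raw_request attributes; if the channel's data format is JSON, the second json.loads is pure waste:

    # -*- coding: utf-8 -*-
    import json
    from zato.server.service import Service

    class MyHandler(Service):
        def handle(self):

            # The channel's data format is JSON, so Zato has already parsed
            # the request and this object is ready to use ..
            data = self.request.payload

            # .. whereas parsing the raw request again needs roughly
            # the same amount of RAM a second time
            data_again = json.loads(self.request.raw_request)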

The best way to see whether something is caused by Zato or not is simply to run 500 req/s with small JSON documents for a few hours - the RAM will stay the same. If you keep adding functionality and at some point you notice that RAM usage increases, you will know it is something unrelated to Zato as such and that something in the Python code of your services is causing it.
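
For instance, a throwaway generator along these lines will do - the URL and payload are made up, and a single loop may not actually sustain 500 req/s, so run several copies or use a dedicated load-testing tool:

    # -*- coding: utf-8 -*-
    import json
    import time
    import requests

    URL = 'http://localhost:11223/test/echo'    # made-up Zato channel
    DOC = json.dumps({'id': 1, 'name': 'abc'})  # deliberately tiny document

    RATE = 500  # target requests per second
    INTERVAL = 1.0 / RATE

    while True:
        start = time.time()
        requests.post(URL, data=DOC, headers={'Content-Type': 'application/json'})

        # Sleep off whatever is left of this request's time slot
        time.sleep(max(0.0, INTERVAL - (time.time() - start)))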

Thus, the question is: how big are your documents and what is your code doing with them?

Also, I understand that you have two servers, but how many worker processes are assigned to each? (gunicorn_workers in server.conf)

Thanks.


#10

Hey @dsuch.

Thank you for getting back so soon!

We’re running two servers with 2 workers configured each. As of now, gunicorn is using close to 14 GB of RAM :frowning:

All that the job does is

  • Download XML from an external API
  • Parse each XML node using lxml - we objectify the XML first
  • After each node has been parsed, convert it to a standard JSON object and POST it to another API.
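
In code, the job is essentially the following - a simplified sketch of that pipeline with made-up URLs and no error handling:

    # -*- coding: utf-8 -*-
    import json
    import requests
    from lxml import objectify

    # Download XML from the external API
    xml = requests.get('https://vendor.example.com/feed.xml').content

    # Objectify it and walk the nodes - typically only 2-3 per run
    root = objectify.fromstring(xml)

    for node in root.iterchildren():

        # Convert each node to a plain dict ..
        doc = {child.tag: child.text for child in node.iterchildren()}

        # .. and POST it onwards as JSON
        requests.post('https://internal.example.com/api/items',
            data=json.dumps(doc),
            headers={'Content-Type': 'application/json'})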

Today alone, my logs show that for each thirty-second run there are only 2-3 nodes in the downloaded XML, once every thirty seconds for each of the 8 different vendors.

I’m not quite sure how to diagnose this further… :frowning:


#11

OK, but if you have a total of four workers then you should have four gunicorn processes in htop.

What does zato info /path/to/server1 (and server2) show? This will give you all the subprocesses a given server uses, along with other metadata such as established TCP connections. How can the output from this command be reconciled with what htop shows?


#12

Hi @dsuch

Below is the output from info

None of this looks alarming to me…

What is concerning, though, is that one server is using A LOT more RAM than the other. If I understand what you are saying correctly, the servers should evenly pull messages from the task queue, so there shouldn’t be this much resource bias.

Server 1

+--------------------------------+--------------------------------------------------------------------------------------------+
|              Key               |                                           Value                                            |
+================================+============================================================================================+
| component_details              | {"created_user_host": "zato@ip-10-137-170-186", "version": "Zato 2.0.8.rev-050c6697",      |
|                                | "component": "SERVER", "created_ts": "2018-01-31T08:57:09.957902"}                         |
+--------------------------------+--------------------------------------------------------------------------------------------+
| component_full_path            | /opt/zato/env/server1                                                                      |
+--------------------------------+--------------------------------------------------------------------------------------------+
| component_host                 | ip-10-137-170-186/ip-10-137-170-186.ec2.internal                                           |
+--------------------------------+--------------------------------------------------------------------------------------------+
| component_running              | True                                                                                       |
+--------------------------------+--------------------------------------------------------------------------------------------+
| current_time                   | 2018-04-16T12:35:24.941080                                                                 |
+--------------------------------+--------------------------------------------------------------------------------------------+
| current_time_utc               | 2018-04-16T12:35:24.941100                                                                 |
+--------------------------------+--------------------------------------------------------------------------------------------+
| master_proc_connections        | {'LISTEN': [{u'to': None, u'from': u'127.0.0.1:17010', u'formatted': u'127.0.0.1:17010'}]} |
+--------------------------------+--------------------------------------------------------------------------------------------+
| master_proc_create_time        | 2018-04-16T08:11:36.930000                                                                 |
+--------------------------------+--------------------------------------------------------------------------------------------+
| master_proc_create_time_utc    | 2018-04-16T08:11:36.930000+00:00                                                           |
+--------------------------------+--------------------------------------------------------------------------------------------+
| master_proc_name               | gunicorn:                                                                                  |
+--------------------------------+--------------------------------------------------------------------------------------------+
| master_proc_pid                | 1610                                                                                       |
+--------------------------------+--------------------------------------------------------------------------------------------+
| master_proc_username           | zato                                                                                       |
+--------------------------------+--------------------------------------------------------------------------------------------+
| master_proc_workers_no         | 2                                                                                          |
+--------------------------------+--------------------------------------------------------------------------------------------+
| master_proc_workers_pids       | [1619, 1620]                                                                               |
+--------------------------------+--------------------------------------------------------------------------------------------+
| worker_1619_connections        | {'ESTABLISHED': [{u'to': u'127.0.0.1:6379', u'from': u'127.0.0.1:56644', u'formatted':     |
|                                | u'127.0.0.1:56644       ->        127.0.0.1:6379'}, {u'to': u'127.0.0.1:6379', u'from':    |
|                                | u'127.0.0.1:56900', u'formatted': u'127.0.0.1:56900       ->        127.0.0.1:6379'},      |
|                                | {u'to': u'127.0.0.1:6379', u'from': u'127.0.0.1:56970', u'formatted': u'127.0.0.1:56970    |
|                                | ->        127.0.0.1:6379'}, {u'to': u'127.0.0.1:6379', u'from': u'127.0.0.1:56948',        |
|                                | u'formatted': u'127.0.0.1:56948       ->        127.0.0.1:6379'}, {u'to':                  |
|                                | u'127.0.0.1:6379', u'from': u'127.0.0.1:56902', u'formatted': u'127.0.0.1:56902       ->   |
|                                | 127.0.0.1:6379'}, {u'to': u'127.0.0.1:6379', u'from': u'127.0.0.1:56904', u'formatted':    |
|                                | u'127.0.0.1:56904       ->        127.0.0.1:6379'}, {u'to': u'127.0.0.1:6379', u'from':    |
|                                | u'127.0.0.1:35182', u'formatted': u'127.0.0.1:35182       ->        127.0.0.1:6379'},      |
|                                | {u'to': u'127.0.0.1:6379', u'from': u'127.0.0.1:33862', u'formatted': u'127.0.0.1:33862    |
|                                | ->        127.0.0.1:6379'}], 'LISTEN': [{u'to': None, u'from': u'127.0.0.1:17010',         |
|                                | u'formatted': u'127.0.0.1:17010'}]}                                                        |
+--------------------------------+--------------------------------------------------------------------------------------------+
| worker_1619_create_time        | 2018-04-16T08:11:39.670000                                                                 |
+--------------------------------+--------------------------------------------------------------------------------------------+
| worker_1619_create_time_utc    | 2018-04-16T08:11:39.670000+00:00                                                           |
+--------------------------------+--------------------------------------------------------------------------------------------+
| worker_1620_connections        | {'ESTABLISHED': [{u'to': u'127.0.0.1:6379', u'from': u'127.0.0.1:57892', u'formatted':     |
|                                | u'127.0.0.1:57892       ->        127.0.0.1:6379'}, {u'to': u'127.0.0.1:6379', u'from':    |
|                                | u'127.0.0.1:50898', u'formatted': u'127.0.0.1:50898       ->        127.0.0.1:6379'},      |
|                                | {u'to': u'127.0.0.1:6379', u'from': u'127.0.0.1:56646', u'formatted': u'127.0.0.1:56646    |
|                                | ->        127.0.0.1:6379'}, {u'to': u'127.0.0.1:6379', u'from': u'127.0.0.1:35186',        |
|                                | u'formatted': u'127.0.0.1:35186       ->        127.0.0.1:6379'}, {u'to':                  |
|                                | u'127.0.0.1:6379', u'from': u'127.0.0.1:56932', u'formatted': u'127.0.0.1:56932       ->   |
|                                | 127.0.0.1:6379'}, {u'to': u'127.0.0.1:6379', u'from': u'127.0.0.1:57346', u'formatted':    |
|                                | u'127.0.0.1:57346       ->        127.0.0.1:6379'}, {u'to': u'137.117.185.69:443',         |
|                                | u'from': u'10.148.3.19:41718', u'formatted': u'10.148.3.19:41718     ->                    |
|                                | 137.117.185.69:443'}, {u'to': u'127.0.0.1:6379', u'from': u'127.0.0.1:35184',              |
|                                | u'formatted': u'127.0.0.1:35184       ->        127.0.0.1:6379'}, {u'to':                  |
|                                | u'52.201.30.197:443', u'from': u'10.148.3.19:46870', u'formatted': u'10.148.3.19:46870     |
|                                | ->     52.201.30.197:443'}], 'CLOSE_WAIT': [{u'to': u'52.206.189.228:443', u'from':        |
|                                | u'10.148.3.19:50712', u'formatted': u'10.148.3.19:50712     ->    52.206.189.228:443'},    |
|                                | {u'to': u'52.201.30.197:443', u'from': u'10.148.3.19:43262', u'formatted':                 |
|                                | u'10.148.3.19:43262     ->     52.201.30.197:443'}], 'LISTEN': [{u'to': None, u'from':     |
|                                | u'127.0.0.1:17010', u'formatted': u'127.0.0.1:17010'}]}                                    |
+--------------------------------+--------------------------------------------------------------------------------------------+
| worker_1620_create_time        | 2018-04-16T08:11:39.690000                                                                 |
+--------------------------------+--------------------------------------------------------------------------------------------+
| worker_1620_create_time_utc    | 2018-04-16T08:11:39.690000+00:00                                                           |
+--------------------------------+--------------------------------------------------------------------------------------------+


Server 2

+--------------------------------+--------------------------------------------------------------------------------------------+
|              Key               |                                           Value                                            |
+================================+============================================================================================+
| component_details              | {"created_user_host": "zato@ip-10-137-170-186", "version": "Zato 2.0.8.rev-050c6697",      |
|                                | "component": "SERVER", "created_ts": "2018-01-31T08:57:09.972700"}                         |
+--------------------------------+--------------------------------------------------------------------------------------------+
| component_full_path            | /opt/zato/env/server2                                                                      |
+--------------------------------+--------------------------------------------------------------------------------------------+
| component_host                 | ip-10-137-170-186/ip-10-137-170-186.ec2.internal                                           |
+--------------------------------+--------------------------------------------------------------------------------------------+
| component_running              | True                                                                                       |
+--------------------------------+--------------------------------------------------------------------------------------------+
| current_time                   | 2018-04-16T12:37:47.884327                                                                 |
+--------------------------------+--------------------------------------------------------------------------------------------+
| current_time_utc               | 2018-04-16T12:37:47.884347                                                                 |
+--------------------------------+--------------------------------------------------------------------------------------------+
| master_proc_connections        | {'LISTEN': [{u'to': None, u'from': u'127.0.0.1:17011', u'formatted': u'127.0.0.1:17011'}]} |
+--------------------------------+--------------------------------------------------------------------------------------------+
| master_proc_create_time        | 2018-04-16T08:11:40                                                                        |
+--------------------------------+--------------------------------------------------------------------------------------------+
| master_proc_create_time_utc    | 2018-04-16T08:11:40+00:00                                                                  |
+--------------------------------+--------------------------------------------------------------------------------------------+
| master_proc_name               | gunicorn:                                                                                  |
+--------------------------------+--------------------------------------------------------------------------------------------+
| master_proc_pid                | 1621                                                                                       |
+--------------------------------+--------------------------------------------------------------------------------------------+
| master_proc_username           | zato                                                                                       |
+--------------------------------+--------------------------------------------------------------------------------------------+
| master_proc_workers_no         | 2                                                                                          |
+--------------------------------+--------------------------------------------------------------------------------------------+
| master_proc_workers_pids       | [1631, 1632]                                                                               |
+--------------------------------+--------------------------------------------------------------------------------------------+
| worker_1631_connections        | {'ESTABLISHED': [{u'to': u'52.201.30.197:443', u'from': u'10.148.3.19:47208',              |
|                                | u'formatted': u'10.148.3.19:47208     ->     52.201.30.197:443'}, {u'to':                  |
|                                | u'127.0.0.1:6379', u'from': u'127.0.0.1:35188', u'formatted': u'127.0.0.1:35188       ->   |
|                                | 127.0.0.1:6379'}, {u'to': u'137.117.185.69:443', u'from': u'10.148.3.19:41350',            |
|                                | u'formatted': u'10.148.3.19:41350     ->    137.117.185.69:443'}, {u'to':                  |
|                                | u'127.0.0.1:6379', u'from': u'127.0.0.1:56858', u'formatted': u'127.0.0.1:56858       ->   |
|                                | 127.0.0.1:6379'}, {u'to': u'127.0.0.1:6379', u'from': u'127.0.0.1:58748', u'formatted':    |
|                                | u'127.0.0.1:58748       ->        127.0.0.1:6379'}, {u'to': u'127.0.0.1:6379', u'from':    |
|                                | u'127.0.0.1:56656', u'formatted': u'127.0.0.1:56656       ->        127.0.0.1:6379'},      |
|                                | {u'to': u'127.0.0.1:6379', u'from': u'127.0.0.1:56856', u'formatted': u'127.0.0.1:56856    |
|                                | ->        127.0.0.1:6379'}, {u'to': u'127.0.0.1:6379', u'from': u'127.0.0.1:56854',        |
|                                | u'formatted': u'127.0.0.1:56854       ->        127.0.0.1:6379'}, {u'to':                  |
|                                | u'127.0.0.1:6379', u'from': u'127.0.0.1:56852', u'formatted': u'127.0.0.1:56852       ->   |
|                                | 127.0.0.1:6379'}, {u'to': u'52.206.189.228:443', u'from': u'10.148.3.19:51694',            |
|                                | u'formatted': u'10.148.3.19:51694     ->    52.206.189.228:443'}], 'CLOSE_WAIT': [{u'to':  |
|                                | u'52.206.189.228:443', u'from': u'10.148.3.19:51116', u'formatted': u'10.148.3.19:51116    |
|                                | ->    52.206.189.228:443'}, {u'to': u'52.206.189.228:443', u'from': u'10.148.3.19:41608',  |
|                                | u'formatted': u'10.148.3.19:41608     ->    52.206.189.228:443'}, {u'to':                  |
|                                | u'158.85.167.221:80', u'from': u'10.148.3.19:60504', u'formatted': u'10.148.3.19:60504     |
|                                | ->     158.85.167.221:80'}, {u'to': u'52.201.30.197:443', u'from': u'10.148.3.19:46500',   |
|                                | u'formatted': u'10.148.3.19:46500     ->     52.201.30.197:443'}, {u'to':                  |
|                                | u'52.206.189.228:443', u'from': u'10.148.3.19:49324', u'formatted': u'10.148.3.19:49324    |
|                                | ->    52.206.189.228:443'}, {u'to': u'52.206.189.228:443', u'from': u'10.148.3.19:50024',  |
|                                | u'formatted': u'10.148.3.19:50024     ->    52.206.189.228:443'}], 'LISTEN': [{u'to':      |
|                                | None, u'from': u'127.0.0.1:17011', u'formatted': u'127.0.0.1:17011'}]}                     |
+--------------------------------+--------------------------------------------------------------------------------------------+
| worker_1631_create_time        | 2018-04-16T08:11:42.120000                                                                 |
+--------------------------------+--------------------------------------------------------------------------------------------+
| worker_1631_create_time_utc    | 2018-04-16T08:11:42.120000+00:00                                                           |
+--------------------------------+--------------------------------------------------------------------------------------------+
| worker_1632_connections        | {'ESTABLISHED': [{u'to': u'127.0.0.1:6379', u'from': u'127.0.0.1:56972', u'formatted':     |
|                                | u'127.0.0.1:56972       ->        127.0.0.1:6379'}, {u'to': u'127.0.0.1:6379', u'from':    |
|                                | u'127.0.0.1:56934', u'formatted': u'127.0.0.1:56934       ->        127.0.0.1:6379'},      |
|                                | {u'to': u'127.0.0.1:6379', u'from': u'127.0.0.1:57884', u'formatted': u'127.0.0.1:57884    |
|                                | ->        127.0.0.1:6379'}, {u'to': u'137.117.185.69:443', u'from': u'10.148.3.19:42042',  |
|                                | u'formatted': u'10.148.3.19:42042     ->    137.117.185.69:443'}, {u'to':                  |
|                                | u'127.0.0.1:6379', u'from': u'127.0.0.1:59350', u'formatted': u'127.0.0.1:59350       ->   |
|                                | 127.0.0.1:6379'}, {u'to': u'52.201.30.197:443', u'from': u'10.148.3.19:47194',             |
|                                | u'formatted': u'10.148.3.19:47194     ->     52.201.30.197:443'}, {u'to':                  |
|                                | u'127.0.0.1:6379', u'from': u'127.0.0.1:56658', u'formatted': u'127.0.0.1:56658       ->   |
|                                | 127.0.0.1:6379'}], 'CLOSE_WAIT': [{u'to': u'52.201.30.197:443', u'from':                   |
|                                | u'10.148.3.19:45148', u'formatted': u'10.148.3.19:45148     ->     52.201.30.197:443'},    |
|                                | {u'to': u'52.206.189.228:443', u'from': u'10.148.3.19:48794', u'formatted':                |
|                                | u'10.148.3.19:48794     ->    52.206.189.228:443'}], 'LISTEN': [{u'to': None, u'from':     |
|                                | u'127.0.0.1:17011', u'formatted': u'127.0.0.1:17011'}]}                                    |
+--------------------------------+--------------------------------------------------------------------------------------------+
| worker_1632_create_time        | 2018-04-16T08:11:42.200000                                                                 |
+--------------------------------+--------------------------------------------------------------------------------------------+
| worker_1632_create_time_utc    | 2018-04-16T08:11:42.200000+00:00                                                           |
+--------------------------------+--------------------------------------------------------------------------------------------+


#13

Right, but zato info shows four Zato processes whereas htop displays many more of them - what started them? Can you say what they really are?


#14

I believe htop is listing the threads of each process. It’s a bit confusing here, as usually the “ps” command lists threads with the same PID as the process to indicate that the threads were spawned by that process. The image above does show 2 masters, 4 workers, and the rest are worker threads. One worker did consume around 10.9 GB of memory, though. Hope it helps.
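
One way to tell them apart without htop is a quick script like the one below (assuming psutil is installed) - it prints each gunicorn process once, with its thread count and RSS, which should line up with the 2 masters and 4 workers that zato info reports:

    # -*- coding: utf-8 -*-
    import psutil

    for proc in psutil.process_iter():
        try:
            if 'gunicorn' not in proc.name():
                continue

            rss_mb = proc.memory_info().rss / (1024.0 * 1024.0)
            print('pid=%s threads=%s rss=%.1f MB' % (
                proc.pid, proc.num_threads(), rss_mb))

        except psutil.NoSuchProcess:
            pass  # the process exited while we were iterating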