Too many open files

Hi,

We are receiving the following error on the production servers when Zato is under load and running long-running processes.

2019-05-13 18:15:08,627 - ERROR - 31715:DummyThread-7752189 - service.enqueue:126 - Exception occurred while processing the enqueue service ---- [Errno 24] Too many open files
Traceback (most recent call last):
  File "/opt/zato/perch/server1/work/hot-deploy/current/service.py", line 30, in handle
  File "/opt/zato/3.0/code/local/lib/python2.7/site-packages/openerplib/main.py", line 313, in proxy
    args, kw)
  File "/opt/zato/3.0/code/local/lib/python2.7/site-packages/openerplib/main.py", line 180, in proxy
    result = self.connector.send(self.service_name, method, *args)
  File "/opt/zato/3.0/code/local/lib/python2.7/site-packages/openerplib/main.py", line 83, in send
    return getattr(service, method)(*args)
  File "/usr/lib/python2.7/xmlrpclib.py", line 1243, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib/python2.7/xmlrpclib.py", line 1602, in __request
    verbose=self.__verbose
  File "/usr/lib/python2.7/xmlrpclib.py", line 1283, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib/python2.7/xmlrpclib.py", line 1311, in single_request
    self.send_content(h, request_body)
  File "/usr/lib/python2.7/xmlrpclib.py", line 1459, in send_content
    connection.endheaders(request_body)
  File "/usr/lib/python2.7/httplib.py", line 1038, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python2.7/httplib.py", line 882, in _send_output
    self.send(msg)
  File "/usr/lib/python2.7/httplib.py", line 844, in send
    self.connect()
  File "/usr/lib/python2.7/httplib.py", line 821, in connect
    self.timeout, self.source_address)
  File "/opt/zato/3.0/code/local/lib/python2.7/site-packages/gevent/socket.py", line 86, in create_connection
    raise err
error: [Errno 24] Too many open files
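
For reference, the handle method that raises this error is essentially making an XML-RPC call to Odoo through openerplib, along these lines (a minimal sketch only; the connection details, model and method below are placeholders, not our real code):

```python
# -*- coding: utf-8 -*-

# Rough sketch of the failing call path - hostnames, credentials and the
# model/method used are placeholders, not the real values.
import openerplib

from zato.server.service import Service

class EnqueueService(Service):
    def handle(self):

        # Each call opens a new XML-RPC (HTTP) connection to Odoo under the hood
        connection = openerplib.get_connection(
            hostname='odoo.example.com', database='example_db',
            login='admin', password='admin')

        # The traceback above is raised while this call tries to open
        # a socket after the process has run out of file descriptors
        partner = connection.get_model('res.partner')
        self.logger.info('Partners: %s', partner.search([]))
```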

When we monitored the open files created by Zato, we saw that some of them were never closed. We are not sure why they are not being closed even after the requests have been processed.

Before invoking 200 HTTP requests, the open file count for the Zato server was around 300. While the requests were being processed it spiked to about 500, but afterwards it never returned to 300 and settled at around 350. Because of this, open files keep piling up and we eventually receive the error above.

Is there any known issue in the internal processing that would prevent open files from being closed after requests are processed?

Thanks,
Sai Bhargav

None that I am aware of - it is the first time I hear about it.

The open files that you observe keep growing - can you say what kind of files these are? Do you have any more details?

Hi @dsuch,
Based on our analysis, we believe the open files are sockets created when an HTTP request is initiated.

Linux commands used to check the open files (a per-process cross-check in Python is sketched after the commands):

  1. watch -c "lsof -u zato | grep gunicorn | wc -l" (for all the runs below)
  2. lsof -u zato -a -d 0-999 | wc -l
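
For completeness, the same numbers can be cross-checked per worker process from Python, roughly like this (a sketch that assumes psutil is installed; matching on "gunicorn" mirrors the grep in the lsof command above):

```python
# Cross-check of the lsof counts, assuming psutil is available.
# Matching on 'gunicorn' mirrors the grep in the lsof command above.
import psutil

for proc in psutil.process_iter(['pid', 'name', 'username']):
    if proc.info['username'] == 'zato' and 'gunicorn' in (proc.info['name'] or ''):
        try:
            print('pid=%s fds=%s sockets=%s files=%s' % (
                proc.info['pid'],
                proc.num_fds(),               # all open descriptors
                len(proc.connections()),      # sockets only
                len(proc.open_files())))      # regular files only
        except psutil.AccessDenied:
            continue
```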

We monitored the open file count for two days:
Day 1: the count started at ~395, peaked at 670 and then dropped to ~490 for 200 requests.
Day 2: the count started at ~490, peaked at 670 and then dropped to ~498 for 200 requests.

Please note that these requests took more time to process.

We tried to replicate this on our Vagrant machines by changing the default Zato timeout and using a sample service, both with and without a sleep time.
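
The sample service was roughly along these lines (a sketch only; the sleep value, the third-party URL and whether the HTTP call is made at all varied from run to run):

```python
# -*- coding: utf-8 -*-

# Sketch of the sample service used in the runs below - the sleep time and
# the third-party URL are placeholders that changed per run.
from time import sleep

import requests

from zato.server.service import Service

class OpenFilesTestService(Service):
    def handle(self):

        # Optional third-party execution, enabled only in some runs
        response = requests.get('https://example.com/api')
        self.logger.info('Third-party status: %s', response.status_code)

        # Varied per run: 0, 20, 60, 250, 300 or 3600 seconds
        sleep(20)

        self.response.payload = 'done'
```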

Run 1:

Sample service file with no sleep time and the default Zato timeout of 15 sec.

Invoked 100 requests in sequence.

The open file count increased from 454 to 455 and dropped back to 454.

Run 2:

Sample service file with no sleep time, third-party execution and the default Zato timeout of 15 sec.

Invoked 100 requests in sequence.

The open file count increased from 454 to 456 and dropped back to 454.

Run 3:

Sample service file with a sleep time of 20 sec and the default Zato timeout of 15 sec.

Invoked 100 requests in sequence.

The open file count increased from 454 to 458 and dropped back to 456.

Run 4:

Sample service file with a sleep time of 20 sec and the default Zato timeout of 15 sec.

Invoked 100 requests in sequence.

The open file count increased from 456 to 460 and dropped back to 456.

Run 5:

Restarted Zato to change the default timeout in the config.

Sample service file with a sleep time of 20 sec, third-party execution and default timeout of 5 sec.

Invoked 100 requests in sequence.

Zato threw gateway timeout errors because request processing took longer than 5 seconds. So there was parallel execution of requests to some extent.

The open file count increased from 454 to 460 and dropped back to 454.

Run 6:

Sample service file with a sleep time of 20 sec and a default timeout of 5 sec.

Invoked 100 requests in sequence.

Zato threw gateway timeout errors because request processing took longer than 5 seconds. So there was parallel execution of requests to some extent.

The open file count increased from 454 to 459 and dropped back to 454.

Run 7:

Restarted Zato to change the default timeout in the config.

Sample service file with a sleep time of 20 sec and default timeout of 1 sec.

Invoked 200 requests in sequence.

Zato threw gateway timeout errors because request processing took longer than the configured timeout. So there was parallel execution of requests to some extent.

The open file count increased from 454 to 474 and dropped back to 454.

Run 8:

Sample service file with a sleep time of 60 sec and default timeout of 1 sec.

Invoked 200 requests in sequence.

Zato threw gateway timeout errors because request processing took longer than the configured timeout. So there was parallel execution of requests to some extent.

The open file count increased from 454 to 514 and dropped back to 455.

Run 9:

Sample service file with a sleep time of 60 sec along with third-party requests and default timeout of 1 sec.

Invoked 200 requests in sequence.

Zato threw gateway timeout errors because request processing took longer than the configured timeout. So there was parallel execution of requests to some extent.

The open file count increased from 455 to 517 and dropped back to 455.

Run 10:

Sample service file with a sleep time of 250 sec along with third-party requests and default timeout of 1 sec.

Invoked 200 requests in sequence.

Zato threw gateway timeout errors because request processing took longer than the configured timeout. So there was parallel execution of requests to some extent.

The open file count increased from 456 to 517 and dropped back to 457.

The increase of 1 in the file count happened when the service file was uploaded, but the count never came back to the original value.

Run 11:

Sample service file with a sleep time of 250 sec along with third-party requests looping 386 times and default timeout of 1 sec.

Invoked 200 requests in sequence.

Zato threw gateway timeout errors because request processing took longer than the configured timeout. So there was parallel execution of requests to some extent.

The open file count increased from 457 to 860 and dropped back to 458.

Run 12:

Sample service file with a sleep time of 300 sec along with third-party requests looping 386 times and default timeout of 1 sec.

Invoked 200 requests in sequence.

Zato threw gateway timeout errors because request processing took longer than the configured timeout. So there was parallel execution of requests to some extent.

The open file count increased from 458 to 890 and dropped back to 459.

Run 13:

Sample service file with a sleep time of 1 hr along with third-party requests looping 386 times and default timeout of 1 sec.

Invoked 200 requests in sequence.

Zato threw gateway timeout errors because request processing took longer than the configured timeout. So there was parallel execution of requests to some extent.

The open file count increased from 456 to 900 and dropped back to 456.

Run 14:

Sample service file with a sleep time of 1 hr along with third-party requests looping 386 times, with outgoing SQL connections and default timeout of 1 sec.

Invoked 200 requests in sequence.

Zato threw gateway timeout errors because request processing took longer than the configured timeout. So there was parallel execution of requests to some extent.

The open file count increased from 456 to 920 and dropped back to 459.

Run 15:
Sample service file with a sleep time of 1 hr, third-party requests looping 386 times, outgoing SQL connections, outgoing Odoo connections (sleep within the connection) and a default timeout of 1 sec.

Invoked 200 requests in sequence.

Zato threw gateway timeout errors because request processing took longer than the configured timeout. So there was parallel execution of requests to some extent.

The open file count increased from 456 to 920 and never returned to the value we started with.
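
For run 15, the service was roughly shaped like this (again only a sketch; the connection names are placeholders and the exact attribute names used for the outgoing Odoo wrapper are our assumption):

```python
# -*- coding: utf-8 -*-

# Sketch of the run-15 style service. 'CRM DB' and 'Odoo ERP' are placeholder
# connection names; the outgoing Odoo wrapper API shown here is an assumption.
from contextlib import closing
from time import sleep

import requests

from zato.server.service import Service

class Run15Service(Service):
    def handle(self):

        # Third-party requests in a loop
        for _ in range(386):
            requests.get('https://example.com/api')

        # Outgoing SQL connection - the session is closed explicitly
        with closing(self.outgoing.sql.get('CRM DB').session()) as session:
            session.execute('SELECT 1')

        # Outgoing Odoo connection, with the sleep inside the connection block
        with self.outgoing.odoo.get('Odoo ERP').conn.client() as client:
            sleep(3600)
            client.get_model('res.partner').search([])

        self.response.payload = 'done'
```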

From all the above runs, it looks like the piling up of open files is not consistent.

We are not sure whether this can be fixed and are looking forward to your reply. We are happy to provide any further details if needed.

Thanks,
Sai Bhargav

I do not understand this part, "So there was parallel execution of requests to some extent.", which is repeated a few times. Can you tell me what you mean by it?

I will not have time to analyze all the scenarios, particularly without sample code, configuration and access to an environment where it can be reproduced.

Of all these cases, run 15 really stands out but my first question is - why do you have a service that requires one hour to complete? What is it doing?

Another point: I am not clear on what the purpose of having short timeouts on the load-balancer is if you have long-running services. In run 15, for instance, the service runs for at least one hour but the load-balancer's timeout is one second. Why are the two not in sync?

I understand that you are seeing something that should not be there, but you are asking for a substantial investigation through a community forum. It looks like a job for several days and I cannot do it through the forum; ongoing projects and commercial support take priority.

Hi, @sai_bhargav! Could you please provide example code of the services being called, and how they are called? From your post I take it you are connecting to Odoo and to an SQL database, but little to nothing else.