How to handle timeouts


#1

Hello everyone.
I am looking a way to kill a service when it has exceeded given timeout in seconds.

So basically situation is this: I have created HTTP/JSON API by using Zato’s HTTP channels and services. These services are accessing our PostgreSQL database (SQLAlchemy) and return the result to requesting client.
Generally this setup works fine, but if the request against PostgreSQL takes too long, the following error 504 (Gateway Time-out, The server didn’t respond in time.) is returned.

Main issue here is not that the status 504 is returned, but that the service which was invoked over HTTP channel is never killed in these sitautions.
This consumes server resources for nothing, as the result is never returned to the client (In worser situation, the client could keep on requesting forwever).

For testing solutions and verifying the issue, I’ve created a small service called LoopTest. Running the service takes 100~ seconds to finish and it prints steps while processing.
Services print/debug output is visible for me, as I*m using only one server in foreground mode (zato start server1/ --fg).

from zato.server.service import Service

class LoopTest(Service):

	def handle_GET(self):
		""" URL: /xxxx/tests/v1/looptest/
		""" 
     
		for i in range(1,100):
			print("Looping i:"+str(i))
			time.sleep(1)

So after this long ramble, the actual question is following: What should I do about this?

For now, my only idea is to use a “proxy” service, which will invoke the actual API service with invoke_by_impl_name method (seems to have support for timeouts) or by other means.

Early thanks for any responses.

Ps.
Some people on the forum might suggest optimizing the query and the database and they would be right.
However I’m afraid that there will be similliar issues in the future, which I would not notice until its too late.


#2

Hi @liuz,

I understand your question and I have a partial answer to it followed by a longer explanation.

First, you can reconfigure the load-balancer and make it wait longer for replies from servers - by default it waits up 15000 ms (15 seconds) for responses and then it returns 504, which is what you encounter.

If you know that sometimes it will take up to 120 seconds you can set it accordingly:

https://zato.io/docs/web-admin/load-balancer/gui.html

The value needed is in “Timeouts (ms)” section, position “server”.

But if we tackle your question in a broader manner we can restate it to ‘How do we interrupt any arbitrary Python code’ to which the answer is that it is impossible to achieve just like that, unless there is cooperation from the code in question.

For instance, in your example of time.sleep - it’s not about any feature lacking in Zato, it is simply impossible to stop such arbitrary code from running (short of using signals or brittle Python VM hacks).

This means that for anything to be interruptible, we need to have support from that piece.

In the case of SQL connections, I’d like to suggest checking out extra arguments passed to SQLAlchemy.

When you create an outgoing SQL connection’s definition there is a field called Extra, you can use it to provide key=value parameters, each in its own line, and they are passed to SQLAlchemy’s create_engine method.

If for your database it is possible to specify timeouts, and SQLAlchemy supports it, then this is the place to configure it.

Another approach would be to use the load-balancer itself to stand in front of the database - I haven’t used it before but it sounds feasible. The idea is that instead of invoking the DB directly you would invoke the load-balancer which would act as a reverse proxy with a timeout between itself and DB. Say, if timeout > 5 s occurs, you get a timeout error in your service and can return an error to the client - but note that if it is an INSERT or UPDATE you would still not know if it succeeded, though again, I think it won’t be an issue in your case because you don’t want to wait anyway so it’s probably only SELECTs.

Finally, to refer to the part about using a proxy service - yes, this would work but it’s possible that it is not needed if you check out the approaches I mentioned above plus it still would not interrupt the service being invoked, it would continue to run even if the proxy one returned.

Regards.


#3

Hi @liuz,

I had an issue which maybe can help you think about solutions regarding timeouts.

My environment is composed of an scheduler periodically starting a service, which in turn depends on remote SFTP and FTP calls, moving files from place to place. I use the redis for storing transient state between each call, so it’s very important in my case to never have 2 service calls running at the same time.

With the locking mechanism provided by zato, I can ensure a way of the second invocation to know the first one is still ongoing and stopping it, which works fine. But since my service can run for several minutes (when a lot of files is put in the remote SFTP at the same time) I got a state where the timeout configured in the lock was not big enough and the execution continued even though the lock was already gone. So, like @dsuch stated, I needed to make my service aware of this and detecting when the timeout was reached and either secure the lock again or stop the current execution there and let the next one continue working.

I imagine you can use the same architecture in your example if it’s composed of several database calls, so you are in control of continuing the execution after a certain time is passed (the same as your timeout configuration) or not.

If your problem is related to a single call which is long, then I believe I cannot explain it better than the previous post.

Hope it helps!


#4

@rtrind - this is not related to the original question but would it be possible for you to describe fuller how you are processing the static files?

I’m asking because I’m interested in adding support for this kind of transfer to Zato.

In 3.0, there will be two processing models:

  • Services
  • Publish/subscribe

Pub/sub will make use of services and services will be able to participate in pub/sub too but in principle, they can be used independently.

I’m also thinking about adding file transfer - it’s a model seen in practice so it would make sense to have it as a built-in feature.

In fact, with pub/sub already static files found in selected directories can be published to topics from which subscribers get them, possibly after transformations, but I think there could be room for something more specific to file transfer alone which is why I’m gathering use-cases before setting out to implementation.

This won’t happen for 3.0 almost certainly but it would be good to discuss now - please feel free to open a new thread if you prefer it.

Thanks.


Zato future file transfer processing model discussion
#5

Let’s keep this thread for the OP discussion. I created a new topic for discussion (Zato future file transfer processing model discussion).

[]'s


#6

Greetings,

I followed @dsuch’s advice and fixed the problem temporally by increasing the timeout value. However I’m still looking solutions for avoiding resource hauling. After looking for some sort timeout setting for the SQLAlchemy, we did not find anything (expect for connecting timeout) . I will look into the load-balancer solution for our database among other things.

@rtrind I’m not sure if I can use your solution for this scenario, but our setup is storing files as well, in this scenario your explanation in the other thread is going to be helpful. Especially the part related to situation, where you have 15~20k files in a folder. This is something, which we need handle carefully.

Thank you both for responses.


#7

Hello @liuz,

did you try to set statement_timeout for your query per PostgreSQL’s docs?

https://www.postgresql.org/docs/current/static/runtime-config-client.html

It seems that you need to run this SET statement right before your actual query.

If you can try it out and report back that it works, I will see about incorporating it for Zato 3.0, it seems a useful feature.

Thanks.


#8

Hi @dsuch,

Thanks for the information, we compeletly missed this possibility and it seems to be very nice solution.
Soon as I made some research related to statement_timeout, I realized that I could define separated outgoing connection for services that need more strict control. However I was not able to do this via editing the connection settings.

I’m not sure, but it seems that the Extra field does not allow required information (answer for question [1]) to be set, as it restricts the input to key=value input.

However i was able to create service for testing the statement_timeout manually [2] with session.execute and with help of tested pg_sleep(ms).

 class TimeoutTest(Service):
	 """ 
	 """
 
	 def handle(self):
		 """ Testing timeout 
		 """
		 with closing(self.outgoing.sql.get(xxxxxxxxxx).session()) as session:
			 print("Setting timeout")
			 result = session.execute("SET statement_timeout TO 1000")
			 print("Starting slow query")
			 try:
				 result = session.execute("SELECT pg_sleep(10)")
				 print("Finished")
			 except Exception as ex: 
				 print("Exception occured as expected:"+str(ex))
 
 
		 print("Closing down")
 

Setting timeout
2017-12-01 15:04:23,812 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
2017-12-01 15:04:23,812 - INFO - 28110:Dummy-1500 - sqlalchemy.engine.base.Engine:109 - BEGIN (implicit)
2017-12-01 15:04:23,813 INFO sqlalchemy.engine.base.Engine SET statement_timeout TO 9000
2017-12-01 15:04:23,813 - INFO - 28110:Dummy-1500 - sqlalchemy.engine.base.Engine:109 - SET statement_timeout TO 9000
2017-12-01 15:04:23,814 INFO sqlalchemy.engine.base.Engine ()
2017-12-01 15:04:23,814 - INFO - 28110:Dummy-1500 - sqlalchemy.engine.base.Engine:109 - ()
Starting slow query
2017-12-01 15:04:23,822 INFO sqlalchemy.engine.base.Engine SELECT pg_sleep(10)
2017-12-01 15:04:23,822 - INFO - 28110:Dummy-1500 - sqlalchemy.engine.base.Engine:109 - SELECT pg_sleep(10)
2017-12-01 15:04:23,822 INFO sqlalchemy.engine.base.Engine ()
2017-12-01 15:04:23,822 - INFO - 28110:Dummy-1500 - sqlalchemy.engine.base.Engine:109 - ()
Exception occured as expected:(ProgrammingError) (u'ERROR', u'57014', u'canceling statement due to statement timeout', u'postgres.c', u'2971', u'ProcessInterrupts', u'', u'') u'SELECT pg_sleep(10)' ()
2017-12-01 15:04:32,826 INFO sqlalchemy.engine.base.Engine ROLLBACK
2017-12-01 15:04:32,826 - INFO - 28110:Dummy-1500 - sqlalchemy.engine.base.Engine:109 - ROLLBACK
Closing down

However I’m not sure how long the statement_timeout is valid as my knowledge about session pooling is limited.
For example, if one service has setted timeout value, does it affect compeletly different service?
(However i’m guessing that this could be avoided by creating separated connection, for avoiding possible problems)

Links

[1] Creating engine with statement_timeout:

[2] Setting statement_timeout by execute:


#9

You are right that it cannot be set on the level of Extra parameters because Extra are passed not to individual sessions but to SQLAlchemy’s engine.

Here is how it works - in SQLAlchemy there are engines, which are basically entry points to a given database.

An engine will create a connection pool and while it creates it, it may accept extra parameters, which is what the Extra field is for.

Such a connection pool is shared by all services in a given Zato server process.

Now, each service may request an SQL session from connection pool - a session is tied to that one particular service that requested it and only for the duration of the with closing(...) as session: block. A session is simply a series of modifications to SQL objects (either ORM or direct SQL statements).

But to answer your question - will setting statement_timeout impact this one SQLAlchemy session or other sessions, since we are discussing a new possible feature, I don’t know myself yet and would like to ask you to check it with another service running concurrently and making use of the same outgoing connection.

Based on your findings I would know how to proceed for 3.0 - whether statemenet_timeout can be used for each session separately or, if not, whether it should be a new attribute of SQL outgoing connections.

Thanks.


#10

Sure, I will look into this and write a post into this thread after I get some results. If I’m lucky I might even have rather simple solution to use in this version of the Zato. :slight_smile:

About new attribute
Depending on how large amount of work is required and when the 3.0 is released it might be nice to have this as new attribute in outgoing connections. It might be even possible to use both of the features in combination, for example. By setting the default timeout value for the connection in general and using more specific timeouts within a single service.


#11

Greetings everyone,

I have now made some testing with “statement_timeout” parameter. As usual I started by writing some simple services. I’m not going to give detailed information about these services as what I found out, made them kind of inessential in this case.

SET command within a transaction
When i was testing services I never had a situation where “statement_timeout” value would remain within a session. While doing invokes, following question came to my mind: “If you use SET command, are they affected by rollbacks?”. I tested this with psql-client and indeed they do. It was also written into the documentation: https://www.postgresql.org/docs/8.3/static/sql-set.html

If SET (or equivalently SET SESSION) is issued within a transaction that is later aborted, the effects of the SET command disappear when the transaction is rolled back. Once the surrounding transaction is committed, the effects will persist until the end of the session, unless overridden by another SET.

What next?
As I stated before this result made the testing kind of pointless as the rollback was removing possibility of session having the statement_timeout value being set. I think that this behavior is not really an issue, as long it remains constant. However, there might be a connection to previous thread:

If the rollback behavior changes at some point, I’ll try to test this again, as I’m not completely sure about SQLAlchemy’s way of handling the sessions and I am interested to see what will happen. As “temporal” solution, I’m planning to create separated outgoing connection for the services where the “statement_timeout” is used, just to be safe.


#12

I gathered more information about interrupting SQL queries in PostgreSQL and here is the current situation:

  • PG as such lets users cancel queries in progress on the protocol level, so it’s doable in principle

  • There is also a PG function called pg_cancel_backend which lets one cancel a query (not the whole connection, also known as backend, only an individual query, despite the name)

  • This function needs to know the backend’s PID - this means that on Zato level it’s too late for us to run this function, it would require support from the underlying PG library and/or SQLAlchemy

  • This function also needs to be executed from another connection, not the same one that runs the query, so again, it needs cooperation from library/SQLAlchemy

  • All of the above applies to PostgreSQL’s protocol v3.0 (which is what PostgreSQL 9.5 uses) - there are some discussions about protocol v4.0 which could possibly allow in-band query cancellation, i.e. by the same connection that runs the query

All told, the individual pieces seem to be available, so the functionality as such to interrupt SQL queries in PostgreSQL sounds feasible, but it needs to be done below Zato - in a given SQL library or, possibly, in SQLAlchemy too, and in Zato it would be just a single parameter such as timeout=10 that we would have.


#13

Greetings,
In my previous reply I said, that I could not test “SET transaction_timeout XXXX” command, as the rollback was made always. However a while ago I had a facepalm moment, when I realised that I could test this simply by using a commit… So after testing I have come to a following conclusion:

If this command is used within Zato’s service and commit was made, the state will be staying until the session is dead. Meaning that depending on your session pool and usage, you might have a situation where different service will use the session where the timeout has been set.

After playing around with an idea of using “RESET transaction_timeout” command i found out that you can simply add LOCAL into the original command. And the timeout would stay within the current transaction.
Example:. “SET LOCAL transaction_timeout xxxx”. However if this solution is used, you must be aware that every commit or rollback will reset this value back to default.


#14

Hi @liuz,

we all spent quite a bit of time on this matter - now that you have it working, perhaps you could wrap it up with a self-contained service that exhibits the functionality? That would be a great reference point in case of similar questions in the future.

Thanks.


#15

Greetings,

I’m placing this example here as @dsuch suggested.

from zato.server.service import Service
from sqlalchemy.exc import ProgrammingError
import traceback, json

def response(service_ref, state, status_code=200):
	""" Creates response for the HTTP request
		Parameters:
			service_ref:
				Reference to the service instance
			state:
				State of request as string
			status:
				HTTP status code to be returned

	"""
	service_ref.response.content_type = "application/json"
	service_ref.response.payload = json.dumps({"state":state})
	service_ref.response.status_code = status_code


class TimeoutExample(Service):
	""" Simple service for demonstration purposes
	"""

	def handle_GET(self):
		""" Example HTTP GET handler
		    Sets local timeout and calls pg_sleep for simulating a long query.
		"""

		# Get SQL session from outgoing connections
		connection_id = "XXXXXXXXXX"
		session = self.outgoing.sql.get(connection_id).session()

		# Note statement_timeout takes milliseconds
		timeout = 2000 # Time in milliseconds
		sleep_time = 3 # Time in seconds


		# Setting the timeout
		result = session.execute("SET LOCAL statement_timeout TO {}".format(timeout))

		try: # Simulating long query with pg_sleep
			result = session.execute("SELECT pg_sleep({})".format(sleep_time))
			return response(self, "Finished")

		except ProgrammingError as ex:
			# Check if the exception is caused by cancelled query, check following URL for code 57014
			# (https://www.postgresql.org/docs/8.2/static/errcodes-appendix.html

			if int(ex.orig[1]) == 57014:
				# Until more accurate HTTP response code is found
				return response(self, "Query timeout", 504)

			else: # Handle other SQLAlchemy ProgrammingErrors
				self.logger.critical("Unexpected exception, details:"+str(traceback.format_exc()))
				return response(self, "Unknown error occured", 500)

		except Exception:
			return response(self, "Unknown error occured", 500)
			self.logger.critical("Unexpected exception occured, details:"+str(traceback.format_exc()))

Disclaimer

  • Use at your own risk
  • This solution works only within the transaction, if commit or rollback is called. The statement_timeout must be set again.

Ps.
I’m not using the example’s pattern myself. Instead I have created a base class containing this functionality for keeping actual services more clean and for having easier control of the timeout.