Scheduler fires same service twice


#1

Hi,
We are having some issues where a couple of scheduled services now seem to do the same job on both of the servers running in our small cluster.

This popped up after a cluster restart yesterday. It is as if the scheduler is tasking both servers to execute the same service at the same time, in effect doing the same job twice and causing a bit of a mess.

Here is how it used to look in an older log file:

2017-10-20 15:37:25,861 - INFO - 10502:Dummy-6876 - dist-core-pdf-production.poll-result-folder:22 - ----------- START ----------
2017-10-20 15:37:25,861 - INFO - 10502:Dummy-6876 - dist-core-pdf-production.poll-result-folder:22 - ----------- END ----------

And that was it until the next schedule cycle.

Here is what it looks like after the cluster restart:

2017-11-05 00:40:00,012 - INFO - 24086:Dummy-3168 - dist-core-pdf-production.poll-result-folder:22 - ----------- START ----------
2017-11-05 00:40:00,014 - INFO - 24086:Dummy-3168 - dist-core-pdf-production.poll-result-folder:22 - ----------- END ----------
2017-11-05 00:40:00,014 - INFO - 24087:Dummy-1555 - dist-core-pdf-production.poll-result-folder:22 - ----------- START ----------
2017-11-05 00:40:00,015 - INFO - 24087:Dummy-1555 - dist-core-pdf-production.poll-result-folder:22 - ----------- END ----------

And here is another excerpt, also from after the cluster restart but earlier than the example above. Here the two instances of the service are stealing files from each other, causing a bit of panic:

2017-11-04 23:43:00,007 - INFO - 24108:Dummy-958 - dist-core-pdf-production.poll-result-folder:22 - ----------- START ----------
2017-11-04 23:43:00,008 - INFO - 24107:Dummy-1908 - dist-core-pdf-production.poll-result-folder:22 - 1 pdf result file(s) up for import. Sending to result->
2017-11-04 23:43:00,009 - INFO - 24108:Dummy-958 - dist-core-pdf-production.poll-result-folder:22 - 1 pdf result file(s) up for import. Sending to result->
2017-11-04 23:43:00,561 - INFO - 24107:Dummy-1908 - dist-core-pdf-production.poll-result-folder:22 - [Errno 2] No such file or directory: u'/fileexchange/dist/pdf/Pdf_bundle_Queue_6_2017-11-04_F061_.zip'
2017-11-04 23:43:00,563 - INFO - 24107:Dummy-1908 - dist-core-pdf-production.poll-result-folder:22 - Error unzipping pdf file bundle

We are running Zato 2.0.7 at the moment. I’ve read the posts about this happening in versions < 2.0.4 and that it was fixed there.

Anyone got a fix for this apart from cluster restart and prayers?

Thanks


#2

Don’t restart the whole cluster. Instead do this:

  1. Stop one server
  2. Wait 120 seconds
  3. Confirm in logs that the other server’s scheduled jobs run
  4. Start the stopped server back up

To be precise, the time you need to wait must be greater than server.conf’s singleton.connector_server_keep_alive_job_time * singleton.grace_time_multiplier, which by default is 30 * 3 = 90 seconds, so 120 seconds will cover it.
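
For reference, this is roughly what the relevant part of server.conf looks like with the default values mentioned above - treat it as a sketch, since any other keys in the same stanza are omitted here:

    [singleton]
    connector_server_keep_alive_job_time=30
    grace_time_multiplier=3

If you have changed either value, wait longer than their product instead of the default 90 seconds.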


#3

No luck with stopping and restarting one of the servers.

It goes straight back to firing scheduled services twice.

2017-11-05 17:27:00,011 - INFO - 24108:Dummy-4383 - dist-core-pdf-production.produce-worklist:22 - ----------- START ----------
2017-11-05 17:27:00,055 - INFO - 24108:Dummy-4383 - dist-core-pdf-production.produce-worklist:22 - ----------- END ----------
2017-11-05 17:27:00,059 - INFO - 24108:Dummy-4384 - dist-core-pdf-production.produce-worklist:22 - ----------- START ----------
2017-11-05 17:27:00,085 - INFO - 24108:Dummy-4384 - dist-core-pdf-production.produce-worklist:22 - ----------- END ----------
2017-11-05 17:27:08,882 - INFO - 24108:Dummy-4386 - dist-core-document.store:22 - ----------- START ----------
2017-11-05 17:27:08,912 - INFO - 24108:Dummy-4386 - dist-core-document.store:22 - ----------- END ----------
2017-11-05 17:27:26,153 - INFO - 24107:Dummy-11172 - dist-core-document.store:22 - ----------- START ----------
2017-11-05 17:27:26,237 - INFO - 24107:Dummy-11172 - dist-core-document.store:22 - ----------- END ----------


#4

Restart the cluster in this case, but bring the servers back up individually, 120 seconds apart, as above.


#5

Also, if you don’t want to restart the whole cluster, repeat the previous procedure with the servers reversed: the one you stopped first keeps running this time, while the other one is stopped and then started.


#6

Yeah, I’ve just done it reversed. As soon as the second server boots up, the double service firing starts again.

I’ll reverse it again and give it another try. If that fails I’ll take the whole cluster down and bring it back up with only one server.


#7

This is definitely happening because of a lack of consensus between the Zato servers about which one should execute a job, so that executions are not repeated.

Another thread is also related to this.

The solution I have used before was to run an isolated Zato server using the same Redis server but with a different db index, in order to control which Zato server (running exclusively for jobs) executes a specific job and to prevent any other Zato server from executing the same job. With this approach you just need to distribute your Zato jobs “manually”/“externally”, writing each job’s config to the Zato server that will be in charge of executing it, by talking directly to that isolated Zato server, which is not part of the cluster of Zato service servers.


#8

@aekroft @badsequel Another point is that in 3.0 the scheduler runs under its own daemon, without a need for any synchronization.


#9

Well, I’m kinda new to this beast called Zato, so running isolated servers, custom configs etc. is way above my head at the moment. Our cluster has its roots in quickstart and usually works well for our purposes.

If I cannot get these two servers to cooperate today I am leaving one of them down for the time being. We shall see.

And I am looking forward to when 3.0 and its documentation are out.

Anyway, continuing my struggle here. I’ll post my final result for today in a bit.


#10

Forgot to update this a week ago.

My findings were that restarting individual servers changed nothing. As soon as the second server came up scheduled services started running twice.

A cluster restart fixed the issue last week, but today - out of the blue - this problem came back. We are now running on a single server.


#11

Would implementing distributed locks be a good way to handle this issue?

https://zato.io/docs/progguide/dist-locks.html


#12

I don’t have the double execution issue in my environment, but in my case there is a risk of a service running for longer than expected and, in some cases, the scheduler firing the next execution before the first one has finished. Implementing the lock seems to be a good way to avoid any problems until you understand the root cause, imo.
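
In case it helps, here is a minimal sketch of what such a lock could look like inside a service, based on the dist-locks page linked in the previous post. The class name, lock name and timing values are only illustrative, not taken from anyone’s actual code:

    # -*- coding: utf-8 -*-
    from zato.server.service import Service

    class PollResultFolder(Service):
        """ Illustrative skeleton only - guards the body with a cluster-wide lock. """
        def handle(self):
            # Only one invocation across the whole cluster can hold this lock at
            # a time. 'expires' is how long the lock may be held at most,
            # 'timeout' is how long to wait trying to acquire it.
            with self.lock('poll-result-folder', expires=120, timeout=1):
                self.logger.info('----------- START ----------')
                # ... the actual polling / import logic would go here ...
                self.logger.info('----------- END ----------')

This does not address the root cause (both servers are still scheduled), it only keeps the two invocations from working on the same files at the same time.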


#13

Hi @badsequel,

yes, distributed locks are what is needed, but this has to be done in the core, not in your services. The root cause lies somewhere before your code runs.

That said, you could apply the locks just like @rtrind said as a workaround.

Unfortunately, I’m 100% sure I won’t have time this week to look into it any further, and I’m not sure about the next one either.


#14

Hello

I work with OP.

We don’t really want to hack too much before 3.0 comes out, so we are currently thinking about trying two methods. 1: Run a single server with 4 workers instead. 2: Expose the services over an HTTP channel and run them through the OS’s cron scheduler; this would at least go through the load balancer and only fire a service once.
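
For what it’s worth, method 2 can be as small as one crontab line calling the channel through the load balancer. Everything below is an assumption - the URL path belongs to a hypothetical channel and the port is only the quickstart load balancer default, so adjust both to your own setup:

    # Hypothetical crontab entry: fire the polling service every 5 minutes
    # through the load balancer, so only one server handles each request.
    */5 * * * * curl -s http://localhost:11223/dist/pdf/poll-result-folder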

We tried some tests with locks, however it seemed like even though we used timeout=0, expires=175 and let the locked service run for 60 seconds with a sleep, the second service (which was supposed to never run because timeout=0) ran when the first service ended after 60 seconds, or after 175 seconds. So it seemed to us that it worked much more like a queue. During the test we invoked the service twice through two open Zato web dashboards.

We were unable to catch any exceptions at all.

cheers