Zato future file transfer processing model discussion


#1

This is a new post to discuss this reply in another thread:

My main usage of Zato is a mini-workflow for file pre-processing and queue management. The workflow is like this:

  1. An external provider places files on a remote SFTP server;
  2. My Service1 periodically polls this folder using an external library (today it’s ssh2-python) and moves files to an internal folder based on the origin folder and type of file, using the Zato FTP feature;
  3. My Service2 periodically polls the folders populated by the previous service and converts each file to another format;
  4. My Service3 periodically polls for every file already processed by Services 1 and 2 and adds each one to a queue that feeds an external system;
  5. My Service4 runs periodically and manages the queue created by Service3, moving files to be processed to the external system’s folder and checking whether files from previous executions are done processing, so they can be removed from the queue.

None of the serviced folders are local to the Zato machines, so I always used a polling system to find files, keep track of which files were already processed and so on (I also found no way of using something like inotify remotely). The polling step looks roughly like the sketch below.
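
For reference, this is approximately what Service1’s polling does - host, credentials and paths here are placeholders, and the actual move/convert logic is elided:

```python
import socket

from ssh2.session import Session

def list_remote_files(host, user, password, path):
    """ Return the names of all files in one remote SFTP directory. """
    sock = socket.create_connection((host, 22))

    session = Session()
    session.handshake(sock)
    session.userauth_password(user, password)

    sftp = session.sftp_init()
    files = []

    with sftp.opendir(path) as dir_handle:
        for size, name, attrs in dir_handle.readdir():
            name = name.decode('utf8')
            if name not in ('.', '..'):
                files.append(name)

    return files

# Invoked periodically, e.g. once a minute from a scheduled job
new_files = list_remote_files('sftp.example.com', 'user', 'secret', '/incoming')
```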

If Zato could provide a way to notify a service or a pub/sub system of new remote files directly, it would be an awesome addition. A cleaner, integrated remote file system library would also be good. The current version of pyfs (at least in 2.0.7) is problematic: it does not support SFTP natively and the FTP implementation does a poor job of managing operations in large folders. As soon as you approach 15-20k files in a folder, each simple operation takes far longer than normal (which made me implement an external purge system to clean up the end folders and avoid larger issues).

I failed at upgrading pyfs without breaking Zato, paramiko is not gevent-friendly, and ssh2-python was the closest I got to a stable system, but if I use it in all my services (1-4), we start having gunicorn timeouts and restarts, and eventually the cluster stops working, needing a full restart. To make my system stable, I used ssh2-python only in Service1 (the one where the customer required SFTP) and the built-in Zato FTP support for all the rest. It’s not ideal (since this library is faster) but it’s working.

Thanks for the question and if you need more clarification, let me know.


#2

Thanks @rtrind, let me explain what it could look like in Zato 3.0, assuming that you were using local files only (I know you are not, I will refer to that as well).

You didn’t say what sort of business data it is, so purely for illustration purposes I will assume these are invoices. You also didn’t say what protocol the external system uses; let’s say it is HTTP (but please confirm what it is).

  • A local directory /data/files/invoices is configured as input source for topic my.invoices
  • An external HTTP system subscribes to topic my.invoices providing an HTTP callback address to use for each message published to that topic
  • An inotify background task runs for that directory, listening for new files
  • For each file stored in this directory, inotify notifies Zato and gives it the file (contents or path) on input
  • By default, Zato simply delivers the file to subscriber(s)
  • But, if you configure it so, a hook service of yours can run - before a message is published to a topic or before a message is delivered to a subscriber - which means that you can change the contents of data once for all subscribers or for each subscriber separately. In fact, you can run hook services in both situations if needed.
  • Pub/sub itself takes care of deliveries, retries, keeping track of what was or was not sent already, and making sure messages are delivered in the same order as they were published - there are also other options, like using timestamps embedded in messages or deliveries by message priority

This, I think, would be the whole of it if your messages were available locally. Most of it would be a matter of configuring topic names and directories in web-admin, plus the custom hook service to transform messages from one format to another - a sketch of such a hook follows.
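
Purely as an illustration of the hook idea - this is not the final 3.0 API - such a service could look more or less like this:

```python
from zato.server.service import Service

class InvoiceHook(Service):
    """ Runs before a message is published to topic my.invoices,
    transforming its contents once for all subscribers.
    """
    def handle(self):
        data = self.request.payload

        # E.g. re-map the provider's XML to what subscribers expect;
        # a real transformation would be more involved than this.
        self.response.payload = data.replace('<invoice>', '<Invoice>')
```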

But since these are not local files, let me bring up another idea, for a related feature, that I discussed with a user. In that situation, there was a need to expose Windows binaries as REST, basically .exe as a service.

We came up with an idea of creating an ExeAgent, a standalone program, distributed separately from core Zato, that would accept remote REST requests, run a binary and return output from that binary in a REST response.

With files, we could make it work in a similar manner, but here we would have a FileAgent - a separate program, running on Windows as a service and on Linux under systemd/supervisord or standalone, that would listen for changes in directories and report all new files to pub/sub topics of choice.

That is, the direction would be reversed - instead of pulling data from SFTP/FTP, we would have an agent running on the remote resource’s system, pushing messages to Zato pub/sub. A minimal sketch of such an agent is below.
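
Here is a rough sketch of what the agent could amount to, assuming the watchdog library for portable file events - the pub/sub URL and topic name are placeholders, not an existing endpoint:

```python
import requests
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

# Placeholder address of a hypothetical pub/sub REST channel
PUBSUB_URL = 'http://zato.example.com:11223/pubsub/topic/my.invoices'

class PublishNewFiles(FileSystemEventHandler):
    def on_created(self, event):
        if event.is_directory:
            return
        # Push each new file's contents to the pub/sub topic
        with open(event.src_path, 'rb') as f:
            requests.post(PUBSUB_URL, data=f.read())

if __name__ == '__main__':
    observer = Observer()
    observer.schedule(PublishNewFiles(), '/data/files/invoices', recursive=False)
    observer.start()
    observer.join()
```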

Naturally, one cannot always do it - sometimes one works with a business partner who will only grant one remote access to an FTP. But in the case of all systems under one’s direct control, such an agent could do the trick I think.

Since it would not be distributed in the core Zato package, this agent can be made available after 3.0 is released - I just want to cut down on the number of features that 3.0 has because I want to release it as soon as it is practical to have it out.

Eventually, we would need logic to listen for changes in remote FTP/SFTP resources - scanning them periodically and keeping track manually (in core Zato) of what changed, then firing notifications to pub/sub - but the above could be the first steps, I think? Would that cover your needs, do you see room for improvement, or do you perhaps have other ideas?

Thanks.


#3

In summary, all your suggestions seem to be aligned with my use cases.

Extra information about my scenarios:

  • All files in the workflow are simple XML files with public TV scheduling data (I cannot give any more detail than this, to avoid any NDA issues);
  • I have no control over the external system which supplies the data to my input SFTP, nor over the machine where the SFTP server itself resides;
  • The other external system (which consumes the files after Service4) also polls a folder to start its procedure, and based on the extension of a file I know which phase the process is in. After it finishes, I know the result because the file is renamed with an extra “.success” or “.failed” extension (the latter usually accompanied by a “.error” file describing what went wrong), so there is no HTTP on either side of the workflow. I poll a REST API of this system to check for other information, but not for the file workflow (a small illustration of how I read those extensions follows this list).
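
Concretely, something like this is enough to tell the phases apart (paths are made up):

```python
import os

def check_result(path):
    """ Map the extension protocol above to a (status, error) pair. """
    base, ext = os.path.splitext(path)
    if ext == '.success':
        return 'done', None
    if ext == '.failed':
        # A companion '.error' file usually describes what went wrong
        error_path = base + '.error'
        if os.path.exists(error_path):
            with open(error_path) as f:
                return 'failed', f.read()
        return 'failed', None
    return 'processing', None

status, error = check_result('/outgoing/schedule-0412.xml.failed')
```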

Now let me ramble about things based on your points.

I can see your workflow for local files being helpful, since I could use it for the internal communication between services (so the only cases not using it would be the input to Service1 and the output from Service4). In other words, if Service1 called Service2 sending the file directly through the Zato architecture, there would be no need for me to control this using folders anymore (at least internally).

The “Remote Inotify as a service” idea could be extremely helpful. For example, when I output the files in Service4, the destination machine is ours to manage (also on RHEL), so we could easily have a plugin capturing the file rename and sending this info to Zato. I see no problem with this agent not being in Zato core, I even prefer it like this (since in my case, this machine won’t have a Zato server on it).

The only thing missing from our discussion is the input external SFTP. This one is supplied by the customer and, even though we have SSH access to it, we cannot install anything there directly, so the input procedure would not be covered here. But we can always be creative. For example, we can rsync the entire folder to a local repository and, since this local machine is managed by us, have the remote inotify Zato plugin working there to inform us of new files. I never dug into whether rsync messes up any notifications when synchronizing remotely, but it’s something we could test (although I don’t like the inefficiency of this, initially).

Your other idea, of simply having something like a Zato outgoing configuration for file polling over SFTP/FTP and sending this info to pub/sub, would also help me greatly (and keep the services generally as they are now).

In other words, I find your architecture really elegant in providing a microservice bus layer and keeping the separation between the processes and the configuration. The only point where I had to deviate from it was basically the remote file access, controlling all of this with my own code. Maybe I am not a huge demographic, but a file-based workflow seems really common in business scenarios, so I think an implementation in any of the directions you indicated would help me a lot in having a more Zato-centric design.

Thanks A LOT for all this discussion!


#4

Hi @rtrind,

there won’t be SFTP in 3.0 but, as far as FTP goes, would it be helpful if there were FTP notifications, periodically checking a set of directories and calling a service of choice, giving it on input the changes since the last time?

  • Added files
  • Modified files (changes to size)
  • Deleted files

Ultimately, a file transfer solution would be smart enough to understand that if at one point there was a file called my-data.xml and then there is my-data.xml.success, this means an event in the workflow, but in 3.0 it would instead report that ‘my-data.xml was deleted’ and ‘my-data.xml.success is a new file’. Under the hood this would boil down to diffing periodic snapshots of each directory, as in the sketch below.
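
A stdlib-only sketch of that snapshot diff - server details are placeholders and the real implementation would differ:

```python
from ftplib import FTP

def snapshot(host, user, password, path):
    """ Return a name -> size mapping for one remote directory. """
    ftp = FTP(host)
    ftp.login(user, password)
    ftp.cwd(path)
    files = {name: int(facts.get('size', 0))
             for name, facts in ftp.mlsd()
             if facts.get('type') == 'file'}
    ftp.quit()
    return files

def diff(previous, current):
    """ Compute added / modified / deleted files between two snapshots. """
    added = sorted(set(current) - set(previous))
    deleted = sorted(set(previous) - set(current))
    modified = sorted(name for name in current
        if name in previous and current[name] != previous[name])
    return added, modified, deleted
```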

Can you tell me how often you check for changes to files? How many times a minute?

Thanks.


#5

Most of the services run once a minute and process all the changes. Only one of them runs every 15 seconds, so you can see I have a lot of polling happening all the time.

Your suggestion would probably be good enough (even in the 3.0 implementation) to use at least in my Service4.

My general message to you is: do think about the use cases I’m giving you, but you do not need to rush any implementation to support them in 3.0 (for me, at least). My current architecture is working right now with no major setbacks (since I restricted the scope of ssh2-python to the Service1 access only). Official SFTP support in Zato would be more useful right now, so I would know I’m using a gevent-safe library and avoiding gunicorn timeouts.

If Zato itself could poll over FTP, it would save me the trouble of doing it in some of my services, but the architecture would remain much the same (I hardly ever poll the same folder in more than one service, so there are no performance gains here); it would just be more elegant. The inotify-based remote plugin seems like a more robust approach for my use cases, avoiding a lot of useless round trips to check for new, renamed or missing files, and running strictly when there is a need to run. But as a product owner, maybe you don’t prefer the plugin idea (since it’s not integrated into Zato directly).

Maybe, instead of implementing a specific SFTP remote inotify plugin, you could just leave a generic opening in Zato (an API or something) so any user could implement their own plugins for specific use cases. I would gladly contribute this “remote inotify” plugin to Zato.

This would allow me to have a remote inotify system listening for changes to folders on a controlled machine and triggering calls to Zato to wake up the corresponding service (one call per file, or per batch of changes, maybe). I’m not sure what the best approach to file transfer would be - can I send the file contents to Zato, or just the path, keeping the file access handled by the service? The sketch below shows the latter option - and the answer would shed some light on whether proper SFTP support would still be relevant in this scenario.
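
For example, the “just the path” option could be as simple as this - the channel URL and JSON shape are only my guesses at what such an API could accept:

```python
import json

import requests

def notify_zato(path, event_type):
    """ Tell Zato that something happened to a file; the service on the
    Zato side would then fetch the file itself if it needs the contents.
    """
    requests.post(
        'http://zato.example.com:11223/file-events',
        data=json.dumps({'path': path, 'event': event_type}),
        headers={'Content-Type': 'application/json'})

notify_zato('/outgoing/schedule-0412.xml.success', 'renamed')
```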

I am rambling already, so I will stop here. Thanks again for the discussion.


#6

Thanks @rtrind, this is all good.

I’m certainly not rushing anything and for now this is only an attempt to collect use cases - the analysis phase. There are dozens of new features ready for 3.0 and I want to put out the release as soon as possible.

I actually prefer for the remote FileAgent to be distributed separately because it will have to work with Windows systems, e.g. it will listen for changes to files on Windows, and we don’t have Zato on Windows for now. There will be a version of Zato for Windows and OS X for development purposes, but that is a couple of months of work and won’t be ready soon.

The idea of plugins has been discussed a few times and this will be done after 3.0 in two phases, first a way to create plugins for HTTP/REST-based APIs (including generation out of WSDL/WADL/RAML/Swagger) and then plugins for any networking code. I have it mostly worked out and designed, it’s only a matter of time and priorities.

Right now the focus has been on publish/subscribe - it took months and months of work but it is literally 99% done, and it will be a completely new integration model for Zato, enabling it to handle even more integration processes and use cases.