(Migrated) Zato as Data Warehouse

(This message has been automatically imported from the retired mailing list)

I'm evaluating Zato as a means of exposing data from multiple sources.
Seems like a great fit for what we need. However, I have some basic
questions that I'm struggling with. For example, I have data points
I want to expose from the following CSV file:
http://www.irs.gov/pub/irs-soi/11in20me.xls. For obvious reasons, in
creating a channel for that data I'd prefer not to have to transform
the CSV file with each request. I figured there must be a way to do an
initial transformation and store that data in ZATO for future use. Am
I missing something here? Am I missing the point? Should I not be
expecting ZATO to warehouse any API data points?

Thanks for any help you can provide.

On 12/18/2013 04:14 PM, leveille wrote:

For obvious reasons, in creating a channel for that data I’d prefer not to have
to transform the CSV file with each request. I figured there must be a way to
do an initial transformation, and store that data in ZATO for future use.

Hello,

as we have briefly discussed on IRC, please find below sample
PoC-like code of what you've described.

https://zato.io/support/zato-discuss/dw-cache/dw-cache.py
https://zato.io/support/zato-discuss/dw-cache/export.json

Basically, what you need in your case is to:

  • Create a service to connect to a remote server, fetch data and store
    it in Redis
  • Create a scheduler job so this service is executed periodically
  • Create another service that will expose the already cached data super
    fast from Redis

The code above does just that:

  • CacheData connects to an external data source (here it is simply a CSV
    file I uploaded to http://tutorial.zato.io/data.csv)

  • The file is parsed and turned into a Python dictionary of dictionaries

  • Each dictionary is stored under a separate Redis key

  • GetData is a service exposed through a channel. The service expects a
    list of keys to return data for. For each key given on input, the output
    list of dictionaries is populated with the corresponding data read from
    Redis. Serialization to JSON/XML/SOAP is automatic so you don't have to
    code it. (A minimal sketch of both services follows this list.)
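
For reference, here is a minimal sketch of what the two services could look
like. It is not the original dw-cache.py - the outgoing connection name
'IRS CSV', the Redis key prefix 'irs:data:' and the 'name' column used as
the row key are assumptions for illustration only:

# -*- coding: utf-8 -*-

# A sketch only, not the original dw-cache.py. The outgoing connection name,
# the key prefix and the 'name' column are illustrative assumptions.

import csv
from json import dumps

from zato.server.service import Service

KEY_PREFIX = 'irs:data:'  # hypothetical Redis key prefix

class CacheData(Service):
    """ Fetches the CSV file, turns it into a dictionary of dictionaries
    and stores each one under its own Redis key. Meant to be invoked
    periodically from a scheduler job.
    """
    def handle(self):

        # Fetch the raw CSV through an outgoing plain HTTP connection
        response = self.outgoing.plain_http['IRS CSV'].conn.get(self.cid)

        # Each row becomes a dictionary stored under its own key
        for row in csv.DictReader(response.text.splitlines()):
            name = row.pop('name')  # hypothetical name of the key column
            self.kvdb.conn.hmset(KEY_PREFIX + name, row)

        self.logger.info('IRS data cached')

class GetData(Service):
    """ Returns cached data for each key given on input, e.g.
    {"key": ["Joint returns"]} -> {"response": [{"data": {...}}]}
    """
    def handle(self):
        out = []
        for key in self.request.payload['key']:
            out.append({'data': self.kvdb.conn.hgetall(KEY_PREFIX + key)})

        # Returning JSON directly here - the original most likely used
        # SimpleIO, which is what makes JSON/XML/SOAP serialization automatic.
        self.response.payload = dumps({'response': out})

With a channel pointing at GetData and a scheduler job invoking CacheData
periodically, requests are served straight from Redis without touching the
CSV file at all.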

The export.json file is an enmasse export file - that is, you can click
everything together in the GUI but you can also export it to an external
file, which is what I have done for your convenience.
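
For instance, assuming the server is installed in ./server1 (the path is
just an example), importing everything with enmasse could look like this:

$ zato enmasse ./server1 --import --input ./export.json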

After deploying the services and running enmasse, everything can already be
accessed by external systems; here it is with curl:

$ curl localhost:11223/get-irs-data -d '{"key":["Joint returns"]}'
{"response": [{"data": {"under_25k": "38660", "all_returns": "256241",
"under_1": "3113"}}]}

$ curl localhost:11223/get-irs-data -d '{"key":["Joint returns", "Paid
preparer\u0027s signature"]}'

{"response": [{"data": {"under_25k": "38660", "all_returns": "256241",
"under_1": "3113"}}, {"data": {"under_25k": "106568", "all_returns":
"309047", "under_1": "6966"}}]}

Naturally, this is only sample code prepared in an hour or so; more
features could be added, like:

  • Don't override old data in the cache (see the snippet after this list)
  • Don’t return a list of dictionaries, use a plain dictionary instead
  • Combine data from several data sources
  • Parse Excel instead of CSV
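
As a sketch of the first item, the caching loop could write a row only when
its key is not in Redis yet - assuming the same key prefix and a redis-py
connection such as self.kvdb.conn:

def cache_if_absent(conn, prefix, name, row):
    """ Store a row under prefix + name only if that key does not exist
    yet, so previously cached data is never overridden.
    """
    key = prefix + name
    if not conn.exists(key):
        conn.hmset(key, row)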

But hopefully, this should get you started :slight_smile:

Here are some links to features of Zato used:

Note that in order to run this code you need the latest version from
master as it depends on parts that have been added in recent weeks.
