Schedule Data Kiosk Queries
Learn how to schedule Data Kiosk queries.
Use this guide to learn how to automatically make regular calls to createQuery while avoiding redundant data requests.
Note
A full code sample for this scheduling mechanism is published in our Samples Repo on GitHub.
Overview of Data Kiosk
Using Data Kiosk involves the following steps:
- Subscribe to the notification: Subscribe an SQS queue to the DATA_KIOSK_QUERY_PROCESSING_FINISHED notification, which notifies you when data processing is complete.
- Create the query: Submit a GraphQL query using the createQuery operation.
- Retrieve the document: Use dataDocumentId or errorDocumentId to retrieve document details with the getDocument operation. These attributes are in the notification payload. If you receive errorDocumentId, you can find the reason for the failure in the document.
- Store the data: Retrieve and store the JSONL file from the documentUrl for further processing and access.
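The retrieval and storage steps above can be sketched in Python. This is a minimal illustration, not the published sample: it assumes you have already extracted the part of the notification payload that carries dataDocumentId or errorDocumentId into a flat dict, and that you have already downloaded the JSONL text from documentUrl. The function names select_document_id and parse_jsonl are hypothetical.

```python
import json
from typing import List, Optional, Tuple


def select_document_id(payload: dict) -> Optional[Tuple[str, str]]:
    """Decide which document to request with getDocument.

    `payload` is assumed to be a flat dict holding the attributes named in
    this guide. Returns ("error", id) or ("data", id), or None when neither
    attribute is present, so the caller can decide how to handle that case.
    """
    error_id = payload.get("errorDocumentId")
    if error_id:
        return ("error", error_id)
    data_id = payload.get("dataDocumentId")
    if data_id:
        return ("data", data_id)
    return None


def parse_jsonl(text: str) -> List[dict]:
    """Parse JSONL text (one JSON object per line), skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

Checking errorDocumentId first means a failed query is never mistaken for a successful one.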
Note
For more information on Data Kiosk workflows, refer to:
Tutorial: Build a scheduling mechanism for Data Kiosk
Learn how to make automatic and regular queries to Data Kiosk, while avoiding redundant calls.
Step 1. Define schedule parameters
Determine the start date and the rate at which you want to make queries.
Note
If the query start date is in the past and you want the scheduler to backfill, create queries as often as possible until you're querying the present day. This ensures that your data remains current. For example, if the start date is one month ago and you want information from each day, adjust the solution so that the scheduler quickly runs queries for each day over the past month until it reaches the current day.
If you do not require backfilling, set the start date to the present or a future date to avoid the scheduler creating past queries.
Common frequencies include daily, weekly, or monthly. The choice of rate depends on how often your data updates. For example, if your data updates daily, a daily frequency is appropriate. Do not query too frequently. If a dataset updates daily, an hourly query would be redundant.
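As an illustration of backfilling, the following Python sketch lists every complete query window between a past start date and the current day. It assumes that end dates are inclusive and that a window is only queried once it has fully elapsed. The function name backfill_windows and its parameters are hypothetical.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def backfill_windows(
    start: date, today: date, step_days: int = 1
) -> Iterator[Tuple[date, date]]:
    """Yield (window_start, window_end) pairs, both inclusive.

    Only windows that end before `today` are produced, so the scheduler
    never requests a period that is still in progress. A start date that
    is today or in the future yields nothing, which disables backfilling.
    """
    if step_days < 1:
        raise ValueError("step_days must be at least 1")
    step = timedelta(days=step_days)
    cursor = start
    while cursor + step <= today:
        yield cursor, cursor + step - timedelta(days=1)
        cursor += step
```

With a daily rate, a start date of 2024-01-01 and a current date of 2024-01-04 produce three windows, one for each of the first three days of January.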
Step 2. Adjust query parameters
Create a function that shifts the start and end dates in the query based on your desired rate so that each query retrieves new data. For example, if the schedule rate is daily, the function should adjust the start and end dates by one day for each new query.
Warning!
Different datasets use different attribute key names and have different data reload periods. Before you update or shift the start and end dates in a query, confirm which attribute keys each dataset uses for them.
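A date-shifting function might look like the following Python sketch. The dataset name example_dataset, the key names in DATE_KEYS, and the query text in QUERY_TEMPLATE are placeholders chosen for illustration, not real Data Kiosk schema values. Replace them with the keys documented for the dataset that you query.

```python
from datetime import date, timedelta
from string import Template
from typing import Tuple

# Hypothetical mapping from dataset name to its (start key, end key).
DATE_KEYS = {"example_dataset": ("startDate", "endDate")}

# Hypothetical query text; a real query follows the dataset's own schema.
QUERY_TEMPLATE = Template(
    'query { exampleDataset($start_key: "$start", $end_key: "$end") { value } }'
)


def shift_window(start: date, end: date, rate_days: int) -> Tuple[date, date]:
    """Move both dates forward by the schedule rate, expressed in days."""
    delta = timedelta(days=rate_days)
    return start + delta, end + delta


def render_query(dataset: str, start: date, end: date) -> str:
    """Fill the template using the date key names for the given dataset."""
    start_key, end_key = DATE_KEYS[dataset]
    return QUERY_TEMPLATE.substitute(
        start_key=start_key,
        end_key=end_key,
        start=start.isoformat(),
        end=end.isoformat(),
    )
```

This sketch shifts by a fixed number of days, so a monthly rate would need calendar-aware arithmetic instead of timedelta.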
Step 3. Create the query
To automatically submit queries, use an event scheduler such as Amazon EventBridge. With EventBridge, you can set up recurring tasks using cron expressions or rate expressions (for example, every five minutes, hourly, or daily). Configure EventBridge to automatically adjust query dates and make the calls to createQuery at defined intervals to continuously and efficiently retrieve data.
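For reference, the following Python sketch builds EventBridge schedule expression strings for the frequencies mentioned in this guide. The helper names are hypothetical, and the monthly cron expression assumes that you want the query to run at 00:00 UTC on the first day of each month.

```python
def rate_expression(value: int, unit: str) -> str:
    """Build an EventBridge rate expression such as "rate(5 minutes)".

    EventBridge expects a singular unit when the value is 1 and a plural
    unit otherwise.
    """
    if unit not in ("minute", "hour", "day"):
        raise ValueError("unit must be 'minute', 'hour', or 'day'")
    if value < 1:
        raise ValueError("value must be a positive integer")
    suffix = "" if value == 1 else "s"
    return f"rate({value} {unit}{suffix})"


def schedule_expression(frequency: str) -> str:
    """Map a schedule frequency to an EventBridge schedule expression."""
    if frequency == "daily":
        return rate_expression(1, "day")
    if frequency == "weekly":
        return rate_expression(7, "day")
    if frequency == "monthly":
        # Fields: minutes hours day-of-month month day-of-week year.
        return "cron(0 0 1 * ? *)"
    raise ValueError(f"unsupported frequency: {frequency}")
```

You can supply the resulting string as the schedule expression when you create the EventBridge rule, for example through the ScheduleExpression parameter of the PutRule operation.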
Step 4. Save schedule information
Store schedule information such as start dates, rates, and associated queries in a database. Make sure that you can cancel or delete schedules as needed, and keep logs of queries for auditing and reporting purposes.
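One possible way to store this information is sketched below using SQLite from the Python standard library. The table and column names are hypothetical, and any database works equally well. The sketch cancels a schedule by deactivating it rather than deleting it, so the query log remains available for auditing.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS schedules (
    id INTEGER PRIMARY KEY,
    query_template TEXT NOT NULL,
    start_date TEXT NOT NULL,
    rate_days INTEGER NOT NULL,
    active INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE IF NOT EXISTS query_log (
    id INTEGER PRIMARY KEY,
    schedule_id INTEGER NOT NULL REFERENCES schedules(id),
    window_start TEXT NOT NULL,
    window_end TEXT NOT NULL,
    status TEXT NOT NULL,
    UNIQUE (schedule_id, window_start, window_end)
);
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the tables if they do not exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_schedule(conn, query_template: str, start_date: str, rate_days: int) -> int:
    """Insert a schedule and return its ID."""
    cur = conn.execute(
        "INSERT INTO schedules (query_template, start_date, rate_days) VALUES (?, ?, ?)",
        (query_template, start_date, rate_days),
    )
    conn.commit()
    return cur.lastrowid


def cancel_schedule(conn, schedule_id: int) -> None:
    """Deactivate a schedule while keeping its log entries."""
    conn.execute("UPDATE schedules SET active = 0 WHERE id = ?", (schedule_id,))
    conn.commit()


def log_query(conn, schedule_id: int, window_start: str, window_end: str, status: str) -> bool:
    """Record a query. Returns False if this window was already logged."""
    try:
        conn.execute(
            "INSERT INTO query_log (schedule_id, window_start, window_end, status) "
            "VALUES (?, ?, ?, ?)",
            (schedule_id, window_start, window_end, status),
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        return False
```

The UNIQUE constraint on the schedule ID and date window gives the scheduler a simple guard against submitting the same window twice.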
Best practices and considerations
Avoid errors and handle redundancy. Some things to consider include:
- Avoid redundancy: Validate that queries request unique data. Ensure that there is no overlap with previously stored data timestamps, and don't create multiple schedules for the same query. Query creation fails if the previous query is still running.
- Log errors: Don't allow a failed query to stop the entire scheduling mechanism. Implement comprehensive logging to capture errors during query submission and data retrieval. Process the error documents that Data Kiosk returns and fix the errors that they contain.
- Retry: If a query is throttled by the concurrent query limitation, incorporate retry mechanisms that use exponential backoff.
- Monitor and alert: Monitor in real time and set error alerts for failed queries, exceeded rate limits, or prolonged processing times. Quickly identify and respond to issues.
For details about different errors and how to fix them, refer to Handling processing errors.
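The retry recommendation can be sketched in Python as follows. ThrottledError is a hypothetical exception that stands in for whatever your client raises when a query is throttled, and the submit callable is a placeholder for your own call to createQuery. The random jitter is an addition to plain exponential backoff, included so that several schedules do not retry at the same moment.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottledError(Exception):
    """Hypothetical error raised by your client when a query is throttled."""


def with_backoff(
    submit: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `submit`, retrying throttled attempts with exponential backoff.

    The delay ceiling doubles after each throttled attempt, up to
    `max_delay`. The actual wait is a random value between 0 and that
    ceiling. The last throttling error is re-raised when attempts run out.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return submit()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

Only throttling errors are retried here. Other failures propagate immediately so that they can be logged and investigated.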