How to best store and access most recent data?

Hi, my application involves storing the value of about 1,000,000 sensors as they change over time. Updates happen in the order of tens of thousands per second.

The database has only one simple table:

CREATE TABLE sensors (
  location SYMBOL capacity 256 CACHE,
  x SHORT,
  y SHORT,
  value FLOAT,
  timestamp TIMESTAMP,
  source LONG
) timestamp(timestamp) PARTITION BY WEEK WAL;

Each sensor is uniquely identified by the location, x and y values.

When the application starts, I need to read the most recent value recorded for each sensor. Some sensors might not have been updated for months. There’s also a use-case where I need to get the latest known values of multiple sensors within a specified time period.

Questions:

  • Is there a better schema for this use-case? Hopefully I have provided enough info to determine that.
  • Is this the best way to retrieve latest values at startup? select * from sensors LATEST ON timestamp PARTITION BY location, x, y
  • How do I archive older data (by potentially removing old partitions) without compromising the latest value for each sensor. For example, one sensor might not have updated in 6 months. If I delete any partitions older than, say, 4 months, I’ve lost all data for that sensor. Am I better off deleting data on a per-sensor basis, ensuring that only data older than the latest is removed?

Thanks.

Your schema seems solid. However, consider using VARCHAR instead of SYMBOL capacity 256 CACHE for the location column, as suggested in this GitHub issue.

To retrieve the latest values at startup, your query looks correct. The LATEST ON clause gives you the most recent row for each sensor.

Regarding data retention, you can use the ALTER TABLE DROP PARTITION statement to drop old partitions. However, as you mentioned, this might result in losing all data for sensors that haven’t been updated in a while. One strategy to handle this is to create a backup of your data before dropping old partitions, as suggested in the documentation.
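As a concrete sketch, partitions can be dropped either by name or with a predicate on the designated timestamp. The exact cutoff and partition names here are illustrative:

```sql
-- Drop all weekly partitions entirely older than 4 months.
-- DROP PARTITION is irreversible, so back up first.
ALTER TABLE sensors DROP PARTITION
WHERE timestamp < dateadd('M', -4, now());
```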

Another strategy is to create a separate table that stores the latest value for each sensor. This table can be updated each time a sensor value is updated. This way, you can safely drop old partitions from the main table without losing the latest sensor values. This strategy is similar to the one discussed in a Stack Overflow post, where separate tables were created for each sensor.

As always, delete is forever. Backup for peace of mind!

Thanks for the advice.

Could you please give a bit more detail on the “separate table” option mentioned in the second-last paragraph? If I understand the Stack Overflow post and translate it to my use-case, I would end up with nearly one million tables - which I cannot imagine is a feasible approach. However, I’m curious about the separate table that is “updated each time a sensor value is updated”. Does that mean a table that contains a single row for each sensor, being updated with each new sensor value? Does that not conflict with the advice to not do table updates (here)? I’m sure I’m misunderstanding something.

Thanks again.

Hi @leem

I think the suggestion is the following.

Create a separate WAL table with a dummy timestamp. Set this dummy timestamp as the designated timestamp, and store your ‘true’ timestamp in a separate column. Set deduplication keys on the dummy timestamp and sensor_id (or a similar key).
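A sketch of what that table could look like for your schema. The table and column names (sensors_latest, recorded_at, dummy_ts) are placeholders, and including x and y in the dedup keys assumes fixed-size key types are supported in your version:

```sql
-- "Latest values" cache table: dummy_ts holds a constant value so that
-- deduplication upserts on the sensor key; the real event time is kept
-- in recorded_at.
CREATE TABLE sensors_latest (
  location SYMBOL capacity 256 CACHE,
  x SHORT,
  y SHORT,
  value FLOAT,
  recorded_at TIMESTAMP,   -- the 'true' timestamp
  source LONG,
  dummy_ts TIMESTAMP       -- constant value, designated timestamp
) timestamp(dummy_ts) PARTITION BY YEAR WAL
DEDUP UPSERT KEYS(dummy_ts, location, x, y);
```

With a constant dummy_ts, each (location, x, y) combination keeps exactly one row, and every new insert for that key replaces the previous one.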

Before dropping an old partition, use INSERT INTO … SELECT to copy the latest row for each sensor in that partition into the second table. These rows will be deduplicated automatically, and deduplicated inserts are quite efficient compared with UPDATE.
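For example, assuming the cache table sketched above and an interval that covers the partition about to be dropped (both the interval and the constant dummy timestamp are illustrative):

```sql
-- Copy the newest row per sensor from the old partition into the cache
-- table; dedup keeps only one row per (location, x, y).
INSERT INTO sensors_latest
SELECT location, x, y, value, timestamp AS recorded_at, source,
       '1970-01-01T00:00:00.000000Z' AS dummy_ts  -- constant designated ts
FROM sensors
WHERE timestamp IN '2024-01'   -- interval covering the partition to drop
LATEST ON timestamp PARTITION BY location, x, y;
```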

Then drop the partition you no longer require.

When you need to query the data, simply query both tables and combine the results based on the latest timestamp (i.e. using max()). We can help you prepare the correct queries to do this.
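One possible shape for that combined query, again assuming the illustrative sensors_latest table and column names from above; the exact join syntax may need adjusting for your QuestDB version:

```sql
-- Take the newest live row per sensor, union with the cache table, then
-- keep whichever row has the later recorded_at per sensor.
WITH combined AS (
  SELECT location, x, y, value, timestamp AS recorded_at
  FROM sensors LATEST ON timestamp PARTITION BY location, x, y
  UNION ALL
  SELECT location, x, y, value, recorded_at
  FROM sensors_latest
)
SELECT c.location, c.x, c.y, c.value, c.recorded_at
FROM combined c
JOIN (
  SELECT location, x, y, max(recorded_at) AS recorded_at
  FROM combined
) m ON c.location = m.location AND c.x = m.x AND c.y = m.y
   AND c.recorded_at = m.recorded_at;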

We will shortly be releasing deduplication for variable-sized types. This will allow you to use a VARCHAR for sensor_id instead of a SYMBOL or LONG, should you need to.

We are also working on materialised views, which I think are on the roadmap for Q4. This will resolve your problem long term and remove these manual steps: you would simply define a view based on the LATEST BY query, and the database would maintain it for you.

Thanks for the clarification. I think I understand the suggestion now. Will try it out.

I have one more suggestion to make. LATEST ON is optimized for single-symbol lookups. The reason is that when you use PARTITION BY symbol_column, QuestDB knows how many distinct values that symbol has, so it scans the column by decreasing timestamp until it has found at least one row for each possible symbol value, and stops scanning at that point. When you PARTITION BY multiple columns, there are many possible combinations of those columns, so QuestDB will just do a full scan.

With this in mind, I am thinking you could have a symbol column which is the concatenation of location, x, and y. I am assuming you still want to query by location, but you don’t need to query by x or y independently, so you would have a location column and a sensor_id or full_location column. Those would take 32 bits each, which is the same storage as two shorts (16 bits each). If you need to query by x or y, then my proposal would take 32 extra bits per row, which might or might not be a problem for you.
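A sketch of that variant, with illustrative table and column names; the sensor_id value would be built client-side (or via concat()) at ingestion time:

```sql
-- Single SYMBOL key so LATEST ON can take the fast single-symbol path.
CREATE TABLE sensors2 (
  location SYMBOL capacity 256 CACHE,
  sensor_id SYMBOL,       -- e.g. concat(location, ':', x, ':', y)
  value FLOAT,
  timestamp TIMESTAMP,
  source LONG
) timestamp(timestamp) PARTITION BY WEEK WAL;

-- Startup query: scans backwards only until every sensor_id is seen once.
SELECT * FROM sensors2 LATEST ON timestamp PARTITION BY sensor_id;
```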

Another good suggestion. However, in my case, the “location” is an area that contains multiple sensors in a grid, with the position of each sensor designated by the x/y values. There are use cases where I need to get the latest values for a subset of sensors within a location, so I still need to be able to search/query by x and y values.

I’ve implemented the earlier suggestions (effectively using a cache table to keep the latest values only) and it’s working well. I guess the only (small) downside is two writes for each value – one to the regular “history” table and one to the “cache” table. Unless there’s a way around that, where I can make one write API call to the database and have it configured to write the data to two different tables? Unfortunately the data sent is not exactly the same because of the additional dummy timestamp as described in the original suggestion.

Thanks again for your help.

This issue was recently discussed internally, and after our initial materialised views release, we may look at improving LATEST ON behaviour and removing the need to do this manually. I’m glad you have a workaround for now, and hopefully this will be simpler in future.

Thanks for your feedback!