I’ve got a table that’s receiving sensor data; it has about 80 columns and 200M rows. I ingest the data from CSV files using a Python script, with a Postgres configuration. I have some new data that I want to store in this same table. The new data is from the same sensor as before, just with a slightly different sensor configuration. The columns overlap (no new column names in the new data; it’s a subset of the existing columns). The timestamps are unique, and the values in the columns I’ve selected for deduplication are also unique. Nevertheless, when I try to ingest this new data into the existing table, it does not appear. I can ingest it into a brand new table, but I cannot append rows to the existing table with this new(ish) sensor data. Where should I be looking to debug this?
An excerpt of my insertion code:
import pandas as pd
from questdb.ingress import Sender

# filename, cycle_name, collection, date, db_version and conf are defined earlier in the script
df = pd.read_csv(filename, skipinitialspace=True)

# Parse the Unix-epoch 'Timestamp' column, then drop the original
ts = pd.to_datetime(df['Timestamp'], origin='unix', unit='s')
df = df.drop("Timestamp", axis=1)

# Replace dots in column names with underscores
newcols = [c.replace('.', '_') for c in df.columns]
df.columns = newcols

# Add metadata columns and the parsed timestamp
df.insert(0, "cycle", cycle_name)
df.insert(0, "collection", collection)
df.insert(0, "date", date)
df.insert(0, "ts", ts)
df.insert(0, "db_version", db_version)

# Drop columns that are entirely empty in this file
df.dropna(axis=1, how='all', inplace=True)

with Sender.from_conf(conf) as sender:
    sender.dataframe(df, table_name="sensorA", at="ts")
Please can you try running wal_tables() and check if the table has been suspended?
If it is not suspended, please check the writerTxn and sequencerTxn numbers. If writerTxn is lower than sequencerTxn, then there is data queued which has been committed, but not applied to the table.
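For example, something like this (using the table name from your ingest script; the second statement is only needed if a table does show as suspended):
-- Shows each WAL table with its suspended flag, writerTxn and sequencerTxn
SELECT * FROM wal_tables();
-- Only needed if a table is reported as suspended
ALTER TABLE sensorA RESUME WAL;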
There are no tables suspended, but indeed two of them have sequencerTxn > writerTxn. No E or C logs, here’s what I found:
2024-07-10T11:55:04.670658Z I i.q.c.w.WalWriter open 'sensorA~101'
2024-07-10T11:55:04.708206Z I i.q.c.w.WalWriter opened WAL segment [path='/home/.questdb/db/sensorA~101/wal42/0/_event.i']
2024-07-10T11:55:05.239224Z I i.q.c.w.WalWriter committed data block [wal=/home/.questdb/db/sensorA~101/wal42/0, segmentTxn=0, seqTxn=38097, rowLo=0, roHi=9631, minTimestamp=2024-06-27T23:54:38.579798Z, maxTimestamp=2024-06-27T23:55:33.429046Z]
2024-07-10T11:55:05.803678Z I i.q.c.w.WalWriter committed data block [wal=/home/.questdb/db/sensorA~101/wal42/0, segmentTxn=1, seqTxn=38098, rowLo=9631, roHi=25232, minTimestamp=2024-06-27T23:55:33.429973Z, maxTimestamp=2024-06-27T23:59:18.581292Z]
The timestamps shown match up to the data I’m working with. Maybe it’s interesting to note that the code block I showed above gets run 12 times, each for a different ‘cycle’. I only get two logs saying ‘committed data block’. I’m missing 4 of the 12 cycles in the table (the new cycle type).
Thanks for the info. Since the table is not suspended, it should make progress. Is the writerTxn increasing and catching up ‘eventually’?
Are you certain all of the data is being flushed? Please try manually flushing the sender after you have submitted your data via the dataframe or row methods.
Also, what does your deployment look like? Are you hitting any limits on IOPS or file descriptors? It’s possible that transactions are stalling due to storage limitations.
I have two tables that are not suspended but whose writerTxn is not catching up to sequencerTxn. I’ve restarted a few times; nothing. ALTER TABLE … RESUME WAL doesn’t have any effect. Nor does using flush() (to be honest, that’s never solved anything for me; the automated flush seems to do the job).
I don’t think it’s a deployment issue, it’s just a few thousand rows, and I can commit them to a new table, just not this existing table.
Is there another way to fix the writer vs sequencer problem?
This is odd. So for example, the sensorB table is not progressing past 74446 despite not being suspended?
It is possible to submit data faster than it can be applied to the table; this happens especially when data is sent or committed out of order (O3). To handle O3 data, older partitions must be split and re-written, which is slower than simply appending. When this happens, there will be some delay before committed data becomes readable in the table.
Are you sending rows that are historical, i.e. with timestamps earlier than the most recent in the table?
If it is not suspended, but also not progressing at all, that’s very odd.
Re: flushing, it may not be necessary when you use the dataframe API versus the row API. However, it’s always best to flush at the end to be sure all data has been sent. By default, flushing occurs every 75k rows or every 1 second. Note that these thresholds are only evaluated when the next row is sent, so the 1-second interval is not a periodic timer.
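A quick way to check is to compare the timestamps you are sending against the newest timestamp already in the table, e.g. (using the ts column and table name from your script):
-- Rows older than this value will be written out-of-order (O3)
SELECT max(ts) FROM sensorA;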
Correct, tables sensorA and sensorB have been in that state for a bit of time now.
Yes, the data I’m trying to insert is older than the newest data, though only slightly: the table currently goes through June 29, and I’m trying to add data from June 27 onward. There are some new types (new values within existing columns) that weren’t accounted for in my ingest script when I first ran it for the end of June. So I’m trying to go back and get that new data inserted. If there is not a good fix for that, then I can always start with June 30 as the date when the new types appear. Is that what you suggest, or should I try to work with the partitioning to get that data committed?
It shouldn’t be a problem to backfill columns like that.
Please could you send the full logs so I can have a look for some clues. You can either attach them here, or send them to support@questdb.io.
Also, it’s worth reviewing your deployment to make sure you aren’t hitting any limits. O3 writes can temporarily require more CPU and RAM, so it’s worth checking the monitoring on the box.
[… backchanneling with some logs and info shared…]
The non-ingested rows appear to be a symptom of the file sizes getting too large because of the way I’m partitioning the data. From the log file (at startup of the service):
2024-07-11T19:12:38.784830Z A server-main fs.file-max checked [limit=1048576]
2024-07-11T19:12:38.784942Z A server-main vm.max_map_count limit is too low [limit=65530] (SYSTEM COULD BE UNSTABLE)
2024-07-11T19:12:38.784984Z A server-main make sure to increase fs.file-max and vm.max_map_count limits: Capacity planning | QuestDB
The table in question is partitioned by month. Using SHOW PARTITIONS FROM sensorA, I see that the partition sizes have grown over time:
| index | partitionBy | name | minTimestamp | maxTimestamp | numRows | diskSize | diskSizeHuman | readOnly | active | attached | detached | attachable |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MONTH | 1970-01 | 1970-01-01T00:02:50.000000Z | 1970-01-02T17:59:14.000000Z | 483392 | 268443648 | 256.0 MiB | FALSE | FALSE | TRUE | FALSE | FALSE |
| 1 | MONTH | 2024-02 | 2024-02-09T22:27:12.000000Z | 2024-02-21T12:12:55.000000Z | 352313 | 194293760 | 185.3 MiB | FALSE | FALSE | TRUE | FALSE | FALSE |
| 2 | MONTH | 2024-03 | 2024-03-11T20:55:35.000000Z | 2024-03-30T12:37:30.046689Z | 51296250 | 34484932608 | 32.1 GiB | FALSE | FALSE | TRUE | FALSE | FALSE |
| 3 | MONTH | 2024-04 | 2024-04-01T00:03:59.119940Z | 2024-04-30T23:59:38.038635Z | 80949585 | 55013153462 | 51.2 GiB | FALSE | FALSE | TRUE | FALSE | FALSE |
| 4 | MONTH | 2024-05 | 2024-05-01T00:04:54.399281Z | 2024-05-16T00:44:55.963012Z | 55210141 | 37665046528 | 35.1 GiB | FALSE | FALSE | TRUE | FALSE | FALSE |
| 5 | MONTH | 2024-06 | 2024-06-01T00:02:52.069818Z | 2024-06-26T17:41:50.637777Z | 81007045 | 55384002560 | 51.6 GiB | FALSE | FALSE | TRUE | FALSE | FALSE |
| 6 | MONTH | 2024-06-26T174150-637778 | 2024-06-26T17:41:50.638778Z | 2024-06-26T19:24:10.434047Z | 422862 | 289214464 | 275.8 MiB | FALSE | FALSE | TRUE | FALSE | FALSE |
| 7 | MONTH | 2024-06-26T192410-434048 | 2024-06-26T19:24:10.434731Z | 2024-06-26T20:24:00.654056Z | 162334 | 111149056 | 106.0 MiB | FALSE | FALSE | TRUE | FALSE | FALSE |
| 8 | MONTH | 2024-06-26T202400-654057 | 2024-06-26T20:24:00.654822Z | 2024-06-27T18:15:05.862907Z | 1560698 | 1066938368 | 1017.5 MiB | FALSE | FALSE | TRUE | FALSE | FALSE |
| 9 | MONTH | 2024-06-27T181505-862908 | 2024-06-27T18:15:05.863907Z | 2024-06-27T20:25:05.203004Z | 525036 | 359383040 | 342.7 MiB | FALSE | FALSE | TRUE | FALSE | FALSE |
So I’ve got ~50 GiB partitions for April and June, and now the June data is getting split off into smaller partitions to accommodate it, and this is not resulting in good ingest or retrieval.
So the question now is, how to fix? The information in Time Partitions | QuestDB is not so clear to me. How can I re-partition the existing table and make it by day going forward?
Unfortunately, there’s no in-place means to re-partition the table.
The solution is to create another table with the new partitioning. Then use INSERT INTO new_table SELECT * FROM old_table to copy the data across.
This will insert data in batches of 1 million by default, so it should not run out of memory during the copy. I’d recommend not inserting other data into the new table during this time, so that it completes ASAP.
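If you want more control over memory use during the copy, I believe the batch size can also be set explicitly on the statement; a sketch below, worth double-checking against the current INSERT docs:
-- Copy in explicit batches of 1 million rows (new_table/old_table as above)
INSERT batch 1000000 INTO new_table
SELECT * FROM old_table;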
Ok. If I have code that relies on that sensorA table being there, is the right thing to do to make the new table, then rename or delete the old table, and then rename the new table back to sensorA?
tl;dr I would use a second table, copy it over. Delete the original table. Make a new table with the correct name, and copy it back. This will ensure you have a clean start, and your table name matches the folder structure. Context below!
It’s possible to rename a table with RENAME TABLE foo TO bah. Let’s say table foo had folder foo~15. The folder will not be renamed on disk, so you’ll have a table bah with folder foo~15.
You can then make a new table with CREATE TABLE foo, and you’ll have a folder called foo~16.
Suffice it to say that this can be confusing, especially if you rely on the folder structure for any monitoring (bypassing commands like table_partitions()). We also have some bugs related to this when using the CREATE TABLE IN VOLUME feature, so it’s best to avoid this approach if possible at the moment.
EDIT: For reference, it’s probably enough to rename the table first, then copy into a correctly named table (a sketch of that variant follows the commands below). But for safety, a clean start might be best!
-- Column names/types below are placeholders; ts stands for the designated timestamp column
CREATE TABLE sensorA_copy (columnA DOUBLE, columnB DOUBLE, columnC SYMBOL, ts TIMESTAMP) TIMESTAMP(ts) PARTITION BY MONTH;
INSERT INTO sensorA_copy SELECT * FROM sensorA;   -- still monthly partitioned
DROP TABLE sensorA;
CREATE TABLE sensorA (columnA DOUBLE, columnB DOUBLE, columnC SYMBOL, ts TIMESTAMP) TIMESTAMP(ts) PARTITION BY DAY;
INSERT INTO sensorA SELECT * FROM sensorA_copy;
DROP TABLE sensorA_copy;
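For completeness, the rename-first route mentioned in the EDIT above would look roughly like this (column list is again a placeholder):
RENAME TABLE sensorA TO sensorA_old;
CREATE TABLE sensorA (columnA DOUBLE, columnB DOUBLE, columnC SYMBOL, ts TIMESTAMP) TIMESTAMP(ts) PARTITION BY DAY;
INSERT INTO sensorA SELECT * FROM sensorA_old;
DROP TABLE sensorA_old;
-- Note: the folder on disk for sensorA_old keeps its original sensorA~N name, as described above.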