Only the ILP Apply Worker is at 100% CPU, and there are no errors or special messages in the QuestDB log or the system log. When this symptom occurs, inserts no longer proceed and data is lost.
Hi @allen ,
How are you submitting data? High CPU usage is usually caused by lots of small, often out-of-order, commits.
Your table can become suspended. However, successful inserts should not result in data loss. If you are using ILP/HTTP, you should receive an error on submission if the data has not been committed.
To see the ‘lost’ data, you may need to unsuspend your table: ALTER TABLE RESUME WAL | QuestDB
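For example, you can issue that statement through the REST `/exec` endpoint. A minimal sketch using Python's requests library (`my_table` is a placeholder for your actual table name):

```python
import requests

# Resume WAL apply for a suspended table via QuestDB's REST /exec endpoint.
# 'my_table' is a placeholder; substitute your real table name.
resp = requests.get(
    "http://localhost:9000/exec",
    params={"query": "ALTER TABLE my_table RESUME WAL;"},
)
resp.raise_for_status()
print(resp.json())
```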
I am having a similar issue that I am chasing currently. We are using ILP over MQTT and, to your point, there are times when hundreds of small ILP entries arrive.
We are using Telegraf with MQTT. The ILP arrives via TCP at tcp://127.0.0.1:9009.
I was looking to be able to add some logging/metrics to try and find the issue at hand.
I upgraded from 7.4.2 to 8.1.4 now as well to see if it makes any difference.
Based on what you mentioned about out-of-order commits, would it make sense to not add timestamps to the ILP lines, so that the database assigns default timestamps as the data arrives?
Alternatively, I am considering using RabbitMQ instead of MQTT, since with a RabbitMQ queue message order is (more) guaranteed.
What would be the best way to debug the high CPU issue?
Any other thoughts on this would be appreciated.
Hey @Andre_Vermeulen ,
Chances are that your MQTT packets are disordered and unbatched, leading to lots of frequent, small, out-of-order commits. Optimal performance is achieved by sending mostly ordered data, in larger batches.
Your best bet is to upgrade and swap to ILP/HTTP if possible, i.e. send to localhost:9000/write.
By default, the ILP/HTTP clients batch in groups of 75k rows, or flush when more than one second has elapsed since the buffer was last flushed.
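For instance, with the Python client those thresholds can be set in the config string. A sketch (the values shown are the defaults; the table and column names are made up):

```python
from questdb.ingress import Sender, TimestampNanos

# auto_flush_rows / auto_flush_interval (ms) mirror the defaults described above:
# flush every 75k rows, or when 1 second has passed since the last flush.
conf = "http::addr=localhost:9000;auto_flush_rows=75000;auto_flush_interval=1000;"

with Sender.from_conf(conf) as sender:
    sender.row(
        "sensors",                      # hypothetical table name
        symbols={"device": "dev-1"},    # hypothetical symbol column
        columns={"temperature": 21.5},  # hypothetical double column
        at=TimestampNanos.now(),
    )
    # Rows are flushed automatically; a final flush also happens on close.
```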
It may also make sense to have an MQTT client subscribe to multiple topics and batch the data yourself using the ILP client libraries (see the sketch below). This lets you control the sending behaviour, for example by filtering messages for different tables into their own individual buffers and transactions, so you can ensure that either all of the data is written to a table or none of it is.
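A rough sketch of that pattern, assuming paho-mqtt 1.x and the QuestDB Python client (the topic names, table names, and payload shape are all made up):

```python
import json
import paho.mqtt.client as mqtt
from questdb.ingress import Sender, TimestampNanos

CONF = "http::addr=localhost:9000;"

# Map MQTT topics to target tables (hypothetical names).
TOPIC_TO_TABLE = {
    "plant/temperature": "temperature",
    "plant/pressure": "pressure",
}

def on_message(client, userdata, msg):
    table = TOPIC_TO_TABLE.get(msg.topic)
    if table is None:
        return
    payload = json.loads(msg.payload)
    # One sender (and thus one buffer) per table keeps each flush
    # single-table: all of its rows commit together, or none do.
    sender = userdata[table]
    sender.row(
        table,
        symbols={"device": payload["device"]},
        columns={"value": float(payload["value"])},
        at=TimestampNanos.now(),
    )

senders = {t: Sender.from_conf(CONF) for t in TOPIC_TO_TABLE.values()}
for s in senders.values():
    s.establish()

client = mqtt.Client(userdata=senders)
client.on_message = on_message
client.connect("127.0.0.1", 1883)
client.subscribe([(topic, 0) for topic in TOPIC_TO_TABLE])
client.loop_forever()
```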
In terms of whether or not to include a timestamp, that depends on your use case. You could choose to use an ingestion timestamp, i.e. server-assigned, and then copy the data into other tables sorted by a different designated timestamp. We will be releasing materialised views soon, which will help automate this kind of workflow.
This is an area of continuing improvement, so future versions will hopefully handle frequent, small transactions more efficiently. Either way, it’d be best to avoid them where possible!
Hi @nwoolmer - thanks for the reply it makes sense.
As far as sending ILP lines goes: should they be sent one by one (an HTTP POST for each), or can one send a batch? Reading the documentation, it seems to be line by line.
Sending ILP to localhost:9000/write doesn’t seem to be a documented feature yet. I will try it though.
Send a batch!
We have open-source ILP clients which help build batches and send them to the ILP/HTTP endpoint, so either use them or refer to them for context: InfluxDB Line Protocol Overview | QuestDB
The endpoint may not be in the REST listing as most people go via the clients.
The clients batch by 75k rows or 1 second intervals by default, and this is configurable.
So just to clarify then, I can do an HTTP POST to localhost:9000/write containing ILP lines separated by newlines?
Yep, UTF-8 encoded.
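For example, with Python's requests (the table and column names are made up):

```python
import requests

# Two ILP lines for the same table, newline-separated, UTF-8 encoded.
# Omitting the trailing timestamp field lets the server assign one on arrival.
body = (
    "sensors,device=dev-1 temperature=21.5\n"
    "sensors,device=dev-2 temperature=19.8\n"
)

resp = requests.post(
    "http://localhost:9000/write",
    data=body.encode("utf-8"),
)
resp.raise_for_status()  # a non-2xx status means the batch was rejected
```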
FWIW, we also support /api/v2/write, which works the same but matches the InfluxDB API, so Telegraf and other tools can use this endpoint instead of /write.
Also, just a note on transactionality.
In QuestDB, transactions are per-table. If you submit a batch of rows in the endpoint that are entirely for the same table, then they will either all be committed to WAL (and later applied to the table), or none of them will be.
If you submit mixed batches, the data is split and committed to each table separately. But if there is an error on one table, you may end up with writes to one table and not the other.
So in an ideal world, each batch is for a single table, and you use deduplication on the table for idempotent retries.
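A sketch of that retry pattern with the Python client (it assumes the table was created with DEDUP UPSERT KEYS including the designated timestamp, so re-sending the same rows is safe; the table name, row shape, and retry policy are illustrative):

```python
import time
from questdb.ingress import Sender, IngressError, TimestampNanos

CONF = "http::addr=localhost:9000;"

def send_batch(rows):
    # Each call builds and flushes a single-table batch. On failure the whole
    # batch is re-sent; deduplication on the table makes the retry idempotent,
    # so rows that were already applied are not duplicated.
    for attempt in range(3):
        try:
            with Sender.from_conf(CONF) as sender:
                for r in rows:
                    sender.row(
                        "sensors",  # hypothetical table
                        symbols={"device": r["device"]},
                        columns={"temperature": r["temperature"]},
                        # Reuse the row's own timestamp so retries dedup cleanly.
                        at=TimestampNanos(r["ts_nanos"]),
                    )
            return
        except IngressError:
            time.sleep(2 ** attempt)  # simple backoff before retrying
    raise RuntimeError("batch failed after retries")

send_batch([{"device": "dev-1", "temperature": 21.5,
             "ts_nanos": 1700000000000000000}])
```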
Thanks for your feedback @nwoolmer - I have now changed my code to send ILP/HTTP in batches, each table in its own batch.
I will keep an eye on the CPU, and then incrementally add more load if things look good CPU-wise.