Trouble with Java heap fine-tuning

I’m currently struggling with some queries which require more heap than the default value. My setup is a dedicated EC2 server. The ExecStart in questdb.service is:

```
ExecStart=/usr/bin/java \
--add-exports java.base/jdk.internal.math=io.questdb \
-p {{questdb_jar}} \
-m io.questdb/io.questdb.ServerMain \
-Xmx16g \
-DQuestDB-Runtime-66535 \
-ea -Dnoebug \
-XX:+UnlockExperimentalVMOptions \
-XX:+AlwaysPreTouch \
-XX:+UseParallelGC \
-XX:-MaxFDLimit \
-d {{questdb_dir}}
```

But questdb_memory_jvm_max (Prometheus) shows ~8GB. I’ve enabled swap on the system just in case, but that’s unrelated to my question. I’ve tried to set

ram.usage.limit.percent=0
ram.usage.limit.bytes=21474836480 (20GB)

But with no luck; I’m still getting the “Java heap space” OutOfMemoryError.

Am I doing something wrong? Why is -Xmx16g ignored?

Hi @archmag,

-Xmx should not be ignored. This is strange.

Can you share:

  1. Your exact QuestDB version
  2. The SQL and the output of EXPLAIN
  3. The full stacktrace of the OOM Exception

Usually QuestDB is not demanding on the JVM heap; most of the memory is allocated outside of the JVM heap.

  1. 8.0.3
  2. SQL:

```
WITH h AS (
    SELECT facility, name, start_ts, end_ts
    FROM heats
    WHERE start_ts BETWEEN '2024-07-19 04:13:00' AND '2024-07-19 05:00:00'
      AND name IN ('VIGB-19Jul24@00_13_00', 'CNSRC-19Jul24@00_40_00', 'BAYB-18Jul24@23_59_20')
), s AS (
    SELECT
        facility,
        tower_symbols,
        ts,
        sm.flow_to_customer,
        flow_from_customer_scfm,
        good_gas_sp_scfm,
        good_gas_contract_scfm,
        good_gas_internal_scfm,
        process_flow_from_customer_scfm,
        recovered_from_customer_process_scfm
    FROM sm
    WHERE ts BETWEEN '2024-07-19 04:13:00' AND '2024-07-19 05:00:00' AND tower_symbols IS NOT NULL
)
SELECT
    h.name,
    h.facility,
    MIN(s.ts) AS start_ts,
    MAX(s.ts) AS end_ts,
    ROUND(SUM(s.flow_to_customer)) AS flow_to_customer,
    ROUND(SUM(s.flow_from_customer_scfm)) AS flow_from_customer_scfm,
    ROUND(SUM(s.good_gas_sp_scfm)) AS good_gas_sp_scfm,
    ROUND(SUM(s.good_gas_contract_scfm)) AS good_gas_contract_scfm,
    ROUND(SUM(s.good_gas_internal_scfm)) AS good_gas_internal_scfm,
    ROUND(SUM(s.process_flow_from_customer_scfm)) AS process_flow_from_customer_scfm,
    ROUND(SUM(s.recovered_from_customer_process_scfm)) AS recovered_from_customer_process_scfm
FROM s INNER JOIN h ON s.facility = h.facility AND s.ts BETWEEN h.start_ts AND h.end_ts
GROUP BY 1, 2
ORDER BY 3 DESC;
```
EXPLAIN:

```
QUERY PLAN
Sort light
  keys: [start_ts desc]
    VirtualRecord
      functions: [name,facility,start_ts,end_ts,round(SUM),round(SUM1),round(SUM2),round(SUM3),round(SUM4),round(SUM5),round(SUM6)]
        GroupBy vectorized: false
          keys: [name,facility]
          values: [min(ts),max(ts),sum(flow_to_customer),sum(flow_from_customer_scfm),sum(good_gas_sp_scfm),sum(good_gas_contract_scfm),sum(good_gas_internal_scfm),sum(process_flow_from_customer_scfm),sum(recovered_from_customer_process_scfm)]
            SelectedRecord
                Filter filter: s.ts between h.start_ts and h.end_ts
                    Hash Join Light
                      condition: h.facility=s.facility
                        SelectedRecord
                            Async Filter workers: 8
                              filter: tower_symbols is not null
                                DataFrame
                                    Row forward scan
                                    Interval forward scan on: sm
                                      intervals: [("2024-07-19T04:13:00.000000Z","2024-07-19T05:00:00.000000Z")]
                        Hash
                            Async Filter workers: 8
                              filter: name in [VIGB-19Jul24@00_13_00,CNSRC-19Jul24@00_40_00,BAYB-18Jul24@23_59_20]
                                DataFrame
                                    Row forward scan
                                    Interval forward scan on: heats
                                      intervals: [("2024-07-19T04:13:00.000000Z","2024-07-19T05:00:00.000000Z")]
```

  3. Stack trace is very strange:

```
2024-07-19T05:04:01.879098Z E i.q.c.h.HttpConnectionContext internal error [fd=130, e=`
java.lang.OutOfMemoryError: Java heap space
`]
```

It takes time for this issue to arise. The query itself works; after ~12h it raises the exception. I’ve tried Java 21 and Java 17. Still no luck in addressing it (except a restart via cron).

Do I see a missing final \ on the second line above?

(this might be a formatting issue with the community forum, just to be extra sure)

1 Like

Formatting issue. I’ll edit the original post.

1 Like

Okay, it looks like there might be a memory leak. Are you willing to share a heap dump? You should know that a heap dump might contain your private data, so you should be comfortable with this. If so, it’s best to run QuestDB with this JVM flag: -XX:+HeapDumpOnOutOfMemoryError. This generates a heap dump on OutOfMemoryError. Then we can inspect the heap dump.
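
If you go that route, these are the standard JVM flags; the dump path below is illustrative, not taken from your setup:

```
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/lib/questdb/questdb-oom.hprof
```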

If you are not comfortable sharing a heap dump, there are other options:

  1. You could enable verbose GC logs. This generates a text file showing garbage collector activity. This could tell us if it’s a slow leak (memory creeping up) or something more sudden.
  2. Use jmap to show a class histogram. This is less invasive than a heap dump, and it can still provide an indication of what’s leaking (if anything at all). A sketch of both options follows below.
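
A minimal sketch of both; the log path and PID are placeholders, and the -Xlog flag assumes JDK 9+ unified logging:

```
# 1) Verbose GC logging: add this as a JVM option (unified logging syntax, JDK 9+)
-Xlog:gc*:file=/var/log/questdb/gc.log:time,uptime,level,tags

# 2) Class histogram of the running QuestDB process (replace <pid> with the JVM's PID)
jmap -histo:live <pid> | head -n 30
```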

One more question: Is it throwing OOM after 12 hours regardless of the Xmx parameter?

You can also share logs etc. through a private chat we can open with you, or by sending them to support@questdb.io, if you don’t want them on the forum!

I’ll dump the heap. The issue is not tied to a fixed 12h interval; it rather depends on user activity (a higher query rate means more problems).

2 Likes

I’ve found the root cause of the issue. Please take a look:

So when pg.select.cache.enabled=true, the memory is constantly filling up (until there is nothing left to operate with). I’m not sure which query contributes the most. We have various queries; some of them use parameters and some of them are just hardcoded strings.
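
For reference, a sketch of how the cache can be turned off, either via server.conf or the equivalent QDB_ environment variable if you configure QuestDB that way:

```
# server.conf
pg.select.cache.enabled=false

# or, as an environment variable
QDB_PG_SELECT_CACHE_ENABLED=false
```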

Let me know if you want a GitHub issue with a proper description for it. One follow-up question I have: are there any parameters that would forcefully evict the entire cache after some period of time?

Cheers and thanks for assistance!

Hello, this is helpful. It indicates that something is retaining memory. A heap dump would help to precisely pinpoint the root cause.

I consulted with my colleagues, and we have a theory that the query plan cache is retaining memory because the query includes in-memory sorting. This allocates memory which is then kept in the query cache for further use.

There is a function flush_query_cache() to flush query caches. You can execute the SQL command SELECT flush_query_cache(); and observe the effects on memory consumption. If the memory decreases, then the hypothesis above is likely correct.

Another thing to try is to use PostgreSQL prepared statements. I do not know your use case, but most applications use the same structure of SQL queries and only change the parameters (such as values in the WHERE clause). If you hard-code these parameters in each SQL text, then from the database’s perspective all these queries are unique; they will have unique plans and will be cached separately. If you use prepared statements, the database creates a plan once and then keeps reusing it when you execute the same query with different parameters. This could also mitigate the problem.
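
For illustration, a minimal JDBC sketch of the parameterized approach. The connection details, credentials, and query are illustrative (it only assumes the sm table and columns from your post), not a drop-in for your application:

```
import java.sql.*;

public class PreparedQueryExample {
    public static void main(String[] args) throws SQLException {
        // QuestDB speaks the PGWire protocol; host, port, and credentials here are the defaults, adjust as needed
        String url = "jdbc:postgresql://localhost:8812/qdb";
        try (Connection conn = DriverManager.getConnection(url, "admin", "quest");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT ts, flow_to_customer FROM sm WHERE ts BETWEEN ? AND ?")) {
            // The statement is parsed and planned once; only the bind values change per execution
            ps.setTimestamp(1, Timestamp.valueOf("2024-07-19 04:13:00"));
            ps.setTimestamp(2, Timestamp.valueOf("2024-07-19 05:00:00"));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getTimestamp(1) + " " + rs.getDouble(2));
                }
            }
        }
    }
}
```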

2 Likes

Sadly, flush_query_cache didn’t work, and I’m reluctant to keep experimenting on the production environment. I don’t have enough load on the staging environment to simulate the same behaviour. Turning off the cache works for me.

Regarding prepared statements, we’re working on migrating the queries, but our use case is a bit different. We have a lot of columns which are created dynamically, and most queries are generated dynamically (with a dynamic list of columns). Anyway, I’ll try to migrate the “normal” queries, but those with a dynamically created column list won’t be migrated, as a column list is not a parameter.

Once again thanks for assistance

Hope you already found this out, but the order of parameters to the java command matters here; see fix(core): fix ignored JVM arguments in AMI by petr-tichy · Pull Request #4258 · questdb/questdb · GitHub.
Long story short, everything after -m is passed to the main class and is not applied as a Java option.
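
In other words, with the unit file from the first post, -Xmx16g and the -XX flags end up as arguments to ServerMain rather than JVM options. A sketch of the corrected ordering, keeping the same flags and template variables but moving all JVM options before -p/-m:

```
ExecStart=/usr/bin/java \
--add-exports java.base/jdk.internal.math=io.questdb \
-Xmx16g \
-DQuestDB-Runtime-66535 \
-ea -Dnoebug \
-XX:+UnlockExperimentalVMOptions \
-XX:+AlwaysPreTouch \
-XX:+UseParallelGC \
-XX:-MaxFDLimit \
-p {{questdb_jar}} \
-m io.questdb/io.questdb.ServerMain \
-d {{questdb_dir}}
```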

1 Like

Thanks a lot! I don’t know how I missed it. java --help explicitly shows the usage.