[VL] Memory Leak possibly TableScan #9456

nimesh1601 opened this issue Apr 28, 2025 · 8 comments
Labels
bug (Something isn't working), triage

Comments

@nimesh1601
nimesh1601 commented Apr 28, 2025

Backend

VL (Velox)

Bug description

There appears to be a memory leak in TableScan.
The Velox team responded here: Velox Issue

Sample query
WITH

cdl_summary AS (
SELECT
user_id,
CASE
WHEN <TRACKING_LABEL_COL> = 'store_front' THEN 'storefront'
ELSE <TRACKING_LABEL_COL>
END AS source,

SUM(CASE WHEN name IN (<IMPRESSION_EVENTS>)
     AND datestr BETWEEN :START_7D   AND :END_DATE THEN 1 ELSE 0 END) AS impression_count_7d,
SUM(CASE WHEN name IN (<IMPRESSION_EVENTS>)
     AND datestr BETWEEN :START_14D  AND :END_DATE THEN 1 ELSE 0 END) AS impression_count_14d,
SUM(CASE WHEN name IN (<IMPRESSION_EVENTS>)
     AND datestr BETWEEN :START_28D  AND :END_DATE THEN 1 ELSE 0 END) AS impression_count_28d,
SUM(CASE WHEN name IN (<IMPRESSION_EVENTS>)
     AND datestr BETWEEN :START_56D  AND :END_DATE THEN 1 ELSE 0 END) AS impression_count_56d,
SUM(CASE WHEN name IN (<IMPRESSION_EVENTS>)
     AND datestr BETWEEN :START_112D AND :END_DATE THEN 1 ELSE 0 END) AS impression_count_112d,
 
SUM(CASE WHEN name IN (<CLICK_EVENTS>)
     AND datestr BETWEEN :START_7D   AND :END_DATE THEN 1 ELSE 0 END) AS click_count_7d,
SUM(CASE WHEN name IN (<CLICK_EVENTS>)
     AND datestr BETWEEN :START_14D  AND :END_DATE THEN 1 ELSE 0 END) AS click_count_14d,
SUM(CASE WHEN name IN (<CLICK_EVENTS>)
     AND datestr BETWEEN :START_28D  AND :END_DATE THEN 1 ELSE 0 END) AS click_count_28d,
SUM(CASE WHEN name IN (<CLICK_EVENTS>)
     AND datestr BETWEEN :START_56D  AND :END_DATE THEN 1 ELSE 0 END) AS click_count_56d,
SUM(CASE WHEN name IN (<CLICK_EVENTS>)
     AND datestr BETWEEN :START_112D AND :END_DATE THEN 1 ELSE 0 END) AS click_count_112d,
 
SUM(CASE WHEN name IN (<ORDER_EVENTS>)
     AND datestr BETWEEN :START_7D   AND :END_DATE THEN 1 ELSE 0 END) AS order_count_7d,
SUM(CASE WHEN name IN (<ORDER_EVENTS>)
     AND datestr BETWEEN :START_14D  AND :END_DATE THEN 1 ELSE 0 END) AS order_count_14d,
SUM(CASE WHEN name IN (<ORDER_EVENTS>)
     AND datestr BETWEEN :START_28D  AND :END_DATE THEN 1 ELSE 0 END) AS order_count_28d,
SUM(CASE WHEN name IN (<ORDER_EVENTS>)
     AND datestr BETWEEN :START_56D  AND :END_DATE THEN 1 ELSE 0 END) AS order_count_56d,
SUM(CASE WHEN name IN (<ORDER_EVENTS>)
     AND datestr BETWEEN :START_112D AND :END_DATE THEN 1 ELSE 0 END) AS order_count_112d,
 
AVG(CASE 
      WHEN name = 'marketplace_scrolled'
       AND datestr BETWEEN :START_7D AND :END_DATE THEN 0
      WHEN name IN ('feed_item_card_scrolled','feed_item_dish_card_scrolled')
       AND datestr BETWEEN :START_7D AND :END_DATE
       THEN feed.display_item_position
      ELSE NULL
    END) AS avg_scroll_depth_7d,
 
 
MAX(CASE WHEN name IN (<INTERACTION_EVENTS>)
         AND datestr BETWEEN :START_112D AND :END_DATE
     THEN epoch_ms ELSE NULL END) AS latest_interaction_time

FROM <SCHEMA_CDL>.<TABLE_CDL> cdl
WHERE TRUE
AND datestr BETWEEN :START_112D AND :END_DATE
AND name IN (<IMPRESSION_EVENTS>, <CLICK_EVENTS>, <ORDER_EVENTS>)
AND COALESCE(<SESSION_ID_COL>, user_id) IS NOT NULL
AND is_first_event = TRUE
AND user_id IS NOT NULL AND user_id <> ''
AND <FEED_CONTEXT_COL> IN ('home','vertical','allstores','all_stores')

GROUP BY 1,2
),

xlb_summary AS (
SELECT
user_id,
CASE
WHEN <TRACKING_LABEL_COL> = 'store_front' THEN 'storefront'
ELSE <TRACKING_LABEL_COL>
END AS source,

MAX(CASE WHEN name IN (<INTERACTION_EVENTS>)
         AND datestr BETWEEN :START_112D AND :END_DATE
     THEN epoch_ms ELSE NULL END) AS latest_interaction_time

FROM <SCHEMA_XLB>.<TABLE_XLB> xlb
WHERE TRUE

GROUP BY 1,2
),

data_summary AS (
SELECT
COALESCE(cdl.user_id, xlb.user_id) AS user_id,
COALESCE(cdl.source, xlb.source) AS source,
COALESCE(cdl.impression_count_7d, 0)
+ COALESCE(xlb.impression_count_7d, 0) AS impression_count_7d,

GREATEST(cdl.latest_interaction_time, xlb.latest_interaction_time)
  AS latest_interaction_time

FROM cdl_summary cdl
FULL OUTER JOIN xlb_summary xlb
ON cdl.user_id = xlb.user_id
AND cdl.source = xlb.source
),

summary_agg AS (
SELECT
user_id,
SUM(impression_count_7d) AS total_impression_7d,

SUM(order_count_112d)     AS total_order_112d

FROM data_summary
GROUP BY 1
)

INSERT OVERWRITE TABLE <SCHEMA_TARGET>.<TABLE_TARGET>
PARTITION (datestr)
SELECT
CONCAT(d.user_id, '|', d.source) AS uuid,
d.user_id,
d.source,
d.impression_count_7d,

agg.total_impression_7d,

d.latest_interaction_time,
:END_DATE AS datestr
FROM data_summary d
JOIN summary_agg agg
ON d.user_id = agg.user_id
WHERE d.source IS NOT NULL;

Gluten version

No response

Spark version

None

Spark configurations

NA

System information

NA

Relevant logs

E20250427 14:32:18.128706   176 VeloxMemoryManager.cc:401] Failed to release Velox memory manager after 43350ms as there are still outstanding memory resources. 
E20250427 14:32:18.128832   176 MemoryPool.cpp:435] [MEM] Memory leak (Used memory): Memory Pool[op.0.0.0.TableScan LEAF root[root] parent[node.0] MALLOC track-usage thread-safe]<unlimited max capacity unlimited capacity used 276.00KB available 748.00KB reservation [used 276.00KB, reserved 1.00MB, min 0B] counters [allocs 114189, frees 114184, reserves 0, releases 1, collisions 0])>
E20250427 14:32:18.128959   176 Exceptions.h:66] Line: cpp/velox/memory/VeloxMemoryManager.cc:102, Function:removePool, Expression: pool->reservedBytes() == 0 (1048576 vs. 0), Source: RUNTIME, ErrorCode: INVALID_STATE
terminate called after throwing an instance of 'facebook::velox::VeloxRuntimeError'
  what():  Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: (1048576 vs. 0)
Retriable: False
Expression: pool->reservedBytes() == 0
Function: removePool
File: cpp/velox/memory/VeloxMemoryManager.cc
Line: 102
Stack trace:
# 0  _ZN8facebook5velox7process10StackTraceC1Ei
# 1  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
# 2  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEvRKNS1_18VeloxCheckFailArgsET0_
# 3  _ZN6gluten20ListenableArbitrator10removePoolEPN8facebook5velox6memory10MemoryPoolE
# 4  _ZN8facebook5velox6memory13MemoryManager8dropPoolEPNS1_10MemoryPoolE
# 5  _ZN8facebook5velox6memory14MemoryPoolImplD2Ev
# 6  _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE24_M_release_last_use_coldEv
# 7  _ZN6gluten18VeloxMemoryManagerD1Ev
# 8  _ZN6gluten18VeloxMemoryManagerD0Ev
# 9  _ZN6gluten13MemoryManager7releaseEPS0_
# 10 Java_org_apache_gluten_memory_NativeMemoryManagerJniWrapper_release
# 11 0x00007fc63ca15074
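
For anyone reading the log above, a minimal Python sketch that pulls the key numbers out of the `Memory leak (Used memory)` line. It only parses the text as it appears in this report; the function names and the returned dictionary keys are made up for illustration and are not part of Velox or Gluten.

```python
import re

# The pool description copied from the "Memory leak (Used memory)" log line above.
LEAK_LINE = (
    "Memory Pool[op.0.0.0.TableScan LEAF root[root] parent[node.0] MALLOC "
    "track-usage thread-safe]<unlimited max capacity unlimited capacity "
    "used 276.00KB available 748.00KB reservation "
    "[used 276.00KB, reserved 1.00MB, min 0B] "
    "counters [allocs 114189, frees 114184, reserves 0, releases 1, collisions 0])>"
)

# Binary units: the same log prints "reserved 1.00MB" and "1048576 vs. 0".
UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}


def to_bytes(text):
    """Convert a size such as '276.00KB' to an integer byte count."""
    match = re.fullmatch(r"([\d.]+)(B|KB|MB|GB)", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(float(match.group(1)) * UNITS[match.group(2)])


def summarize_leak(line):
    """Extract the pool name, unfreed allocation count, and byte totals (hypothetical helper)."""
    pool = re.search(r"Memory Pool\[(\S+)", line).group(1)
    allocs = int(re.search(r"allocs (\d+)", line).group(1))
    frees = int(re.search(r"frees (\d+)", line).group(1))
    used = re.search(r"reservation \[used ([\d.]+[KMG]?B)", line).group(1)
    reserved = re.search(r"reserved ([\d.]+[KMG]?B)", line).group(1)
    return {
        "pool": pool,
        "outstanding_allocs": allocs - frees,
        "used_bytes": to_bytes(used),
        "reserved_bytes": to_bytes(reserved),
    }


print(summarize_leak(LEAK_LINE))
```

On this log line the sketch reports 5 allocations that were never freed (114189 allocs minus 114184 frees) and a 1048576-byte reservation, which is the same value the `pool->reservedBytes() == 0` check fails on.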
@nimesh1601 nimesh1601 added the bug (Something isn't working) and triage labels Apr 28, 2025
@nimesh1601
Author

@zhouyuan any clue?

@PHILO-HE PHILO-HE changed the title Memory Leak possibly TableScan [VL] Memory Leak possibly TableScan Apr 29, 2025
@zhztheplayer
Member

@nimesh1601 Can you try setting spark.gluten.sql.columnar.backend.velox.IOThreads=0 and see if the error goes away?
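
One way to apply this setting is to pass it at launch with `spark-submit --conf`. The sketch below only assembles the command line; the application name `my_job.py` and the helper `build_spark_submit` are placeholders, and passing the key at launch is an assumption, since the thread does not say whether it can be changed on a running session.

```python
# Workaround setting suggested in this thread.
WORKAROUND = {"spark.gluten.sql.columnar.backend.velox.IOThreads": "0"}


def build_spark_submit(app, extra_conf):
    """Return a spark-submit argument list with one --conf pair per setting (hypothetical helper)."""
    cmd = ["spark-submit"]
    for key, value in sorted(extra_conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)  # placeholder application; substitute the real job
    return cmd


print(" ".join(build_spark_submit("my_job.py", WORKAROUND)))
```

The resulting list can be passed to `subprocess.run`, or the printed line can be copied into a shell.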

@nimesh1601
Author

@zhztheplayer Thanks for your suggestion. I tried the given configuration and it worked; I didn't get any failures. But wouldn't it impact performance?

@nimesh1601
Author

I also tried running the same application with the new logs you have added, but couldn't see them

@zhztheplayer
Member

zhztheplayer commented May 1, 2025

I also tried running the same application with the new logs you have added, but couldn't see them

Got it, thanks for trying.

but wouldn't it impact performance ?

It depends on the workload. Perhaps you could run some tests yourself in your environment?
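
A rough way to run such a comparison is to time the same job with and without the workaround and compare median wall-clock times. This is only a sketch: `time_command` and `compare` are hypothetical helpers, and the commands in the demo are no-op placeholders that would need to be replaced with the real `spark-submit` invocations for each variant.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, repeats=3):
    """Run cmd `repeats` times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)  # raises if the job fails
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(variants, repeats=3):
    """Time each labelled command and return {label: median_seconds}."""
    return {label: time_command(cmd, repeats) for label, cmd in variants.items()}


# Placeholder commands so the sketch runs anywhere; replace both with the
# actual job launched once with default settings and once with IOThreads=0.
noop = [sys.executable, "-c", "pass"]
for label, seconds in compare({"default": noop, "io_threads_0": noop}, repeats=1).items():
    print(f"{label}: {seconds:.3f}s")
```

Wall-clock time on a shared cluster is noisy, so more repeats and a representative input size give a fairer picture than a single run.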

We thought the memory leak issues related to the IO threads had already been fixed by facebookincubator/velox#12701 facebookincubator/velox#8181, but it appears they still exist somehow. We'll investigate. Feel free to help if you're interested, since it's reproducible on your end.

@nimesh1601
Author

We thought the memory leak issues related to the IO threads should already be fixed by facebookincubator/velox#12701, but now it appears they still exist somehow. We'd investigate on it. Feel free to help if interested, since it's repeatable from your end.

I am also trying a few things to fix this and will post updates here. It would be great if you could tag me in any further discussion of this issue; I will be happy to help test any fixes.

@zhztheplayer
Member

We thought the memory leak issues related to the IO threads should already be fixed by facebookincubator/velox#12701, but now it appears they still exist somehow. We'd investigate on it. Feel free to help if interested, since it's repeatable from your end.

I am also trying a few things out to fix this, and I will post the updates for the same. It will be great if you can tag me in the further discussion on this issue, and I will be happy to help in trying any fixes.

Sure. The Velox PR I referred to was wrong; I have updated it inline.

@FelixYBW
Contributor

FelixYBW commented May 11, 2025

This should be the reason, and the PR has no follow-up. @rui-mo do we?

facebookincubator/velox#8205

I remember there is a quick fix.
facebookincubator/velox#8205 (comment)

3 participants