Message-Id: <D2CEBE80-E29A-4D63-8028-4F41A1F8B7B4@linux.ibm.com>
Date: Wed, 17 Sep 2025 11:28:52 +0530
From: Venkat <venkat88@...ux.ibm.com>
To: Julian Sun <sunjunchao@...edance.com>
Cc: tj@...nel.org, akpm@...ux-foundation.org, stable@...r.kernel.org,
songmuchun@...edance.com, shakeelb@...gle.com, hannes@...xchg.org,
roman.gushchin@...ux.dev, mhocko@...e.com,
linuxppc-dev <linuxppc-dev@...ts.ozlabs.org>, riteshh@...ux.ibm.com,
ojaswin@...ux.ibm.com, linux-fsdevel@...r.kernel.org,
linux-xfs@...r.kernel.org, LKML <linux-kernel@...r.kernel.org>,
Madhavan Srinivasan <maddy@...ux.ibm.com>,
Linux Next Mailing List <linux-next@...r.kernel.org>,
cgroups@...r.kernel.org, linux-mm@...r.kernel.org
Subject: Re: [linux-next20250911]Kernel OOPs while running generic/256 on
Pmem device
> On 15 Sep 2025, at 11:47 PM, Julian Sun <sunjunchao@...edance.com> wrote:
>
> Hi,
>
> On Mon, Sep 15, 2025 at 10:20 PM Venkat <venkat88@...ux.ibm.com> wrote:
>>
>>
>>
>>> On 13 Sep 2025, at 8:18 AM, Julian Sun <sunjunchao@...edance.com> wrote:
>>>
>>> Hi,
>>>
>>> Does this fix make sense to you?
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index d0dfaa0ccaba..ed24dcece56a 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -3945,9 +3945,10 @@ static void mem_cgroup_css_free(struct
>>> cgroup_subsys_state *css)
>>> * Not necessary to wait for wb completion which might
>>> cause task hung,
>>> * only used to free resources. See
>>> memcg_cgwb_waitq_callback_fn().
>>> */
>>> - __add_wait_queue_entry_tail(wait->done.waitq, &wait->wq_entry);
>>> if (atomic_dec_and_test(&wait->done.cnt))
>>> - wake_up_all(wait->done.waitq);
>>> + kfree(wait);
>>> + else
>>> + __add_wait_queue_entry_tail(wait->done.waitq,
>>> &wait->wq_entry);;
>>> }
>>> #endif
>>> if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
>>
>> Hello,
>>
>> Thanks for the fix. This is fixing the reported issue.
>
> Thanks for your testing and feedback.
>>
>> While sending out the patch please add below tag as well.
>>
>> Tested-by: Venkat Rao Bagalkote <venkat88@...ux.ibm.com>
>
> Sure. That's how it should be.
>
> Could you please try again with the following patch? The previous one
> might have caused a memory leak and had race conditions. I can’t
> reproduce it locally...
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 80257dba30f8..35da16928599 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3940,6 +3940,7 @@ static void mem_cgroup_css_free(struct
> cgroup_subsys_state *css)
> int __maybe_unused i;
>
> #ifdef CONFIG_CGROUP_WRITEBACK
> + spin_lock(&memcg_cgwb_frn_waitq.lock);
> for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
> struct cgwb_frn_wait *wait = memcg->cgwb_frn[i].wait;
>
> @@ -3948,9 +3949,12 @@ static void mem_cgroup_css_free(struct
> cgroup_subsys_state *css)
> * only used to free resources. See
> memcg_cgwb_waitq_callback_fn().
> */
> __add_wait_queue_entry_tail(wait->done.waitq, &wait->wq_entry);
> - if (atomic_dec_and_test(&wait->done.cnt))
> - wake_up_all(wait->done.waitq);
> + if (atomic_dec_and_test(&wait->done.cnt)) {
> + list_del(&wait->wq_entry.entry);
> + kfree(wait);
> + }
> }
> + spin_unlock(&memcg_cgwb_frn_waitq.lock);
> #endif
> if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
> static_branch_dec(&memcg_sockets_enabled_key);
>
Hello,
I tried this patch on two of my CI nodes, and the tests passed. The reported issue is fixed with this patch.
Regards,
Venkat.
>>
>> Regards,
>> Venkat.
>>>
>>> On Fri, Sep 12, 2025 at 8:33 PM Venkat <venkat88@...ux.ibm.com> wrote:
>>>>
>>>>
>>>>
>>>>> On 12 Sep 2025, at 10:51 AM, Venkat Rao Bagalkote <venkat88@...ux.ibm.com> wrote:
>>>>>
>>>>> Greetings!!!
>>>>>
>>>>>
>>>>> IBM CI has reported a kernel crash, while running generic/256 test case on pmem device from xfstests suite on linux-next20250911 kernel.
>>>>>
>>>>>
>>>>> xfstests: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
>>>>>
>>>>> local.config:
>>>>>
>>>>> [xfs_dax]
>>>>> export RECREATE_TEST_DEV=true
>>>>> export TEST_DEV=/dev/pmem0
>>>>> export TEST_DIR=/mnt/test_pmem
>>>>> export SCRATCH_DEV=/dev/pmem0.1
>>>>> export SCRATCH_MNT=/mnt/scratch_pmem
>>>>> export MKFS_OPTIONS="-m reflink=0 -b size=65536 -s size=512"
>>>>> export FSTYP=xfs
>>>>> export MOUNT_OPTIONS="-o dax"
>>>>>
>>>>>
>>>>> Test case: generic/256
>>>>>
>>>>>
>>>>> Traces:
>>>>>
>>>>>
>>>>> [ 163.371929] ------------[ cut here ]------------
>>>>> [ 163.371936] kernel BUG at lib/list_debug.c:29!
>>>>> [ 163.371946] Oops: Exception in kernel mode, sig: 5 [#1]
>>>>> [ 163.371954] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=8192 NUMA pSeries
>>>>> [ 163.371965] Modules linked in: xfs nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack bonding tls nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables nfnetlink sunrpc pseries_rng vmx_crypto dax_pmem fuse ext4 crc16 mbcache jbd2 nd_pmem papr_scm sd_mod libnvdimm sg ibmvscsi ibmveth scsi_transport_srp pseries_wdt
>>>>> [ 163.372127] CPU: 22 UID: 0 PID: 130 Comm: kworker/22:0 Kdump: loaded Not tainted 6.17.0-rc5-next-20250911 #1 VOLUNTARY
>>>>> [ 163.372142] Hardware name: IBM,9080-HEX Power11 (architected) 0x820200 0xf000007 of:IBM,FW1110.01 (NH1110_069) hv:phyp pSeries
>>>>> [ 163.372155] Workqueue: cgroup_free css_free_rwork_fn
>>>>> [ 163.372169] NIP: c000000000d051d4 LR: c000000000d051d0 CTR: 0000000000000000
>>>>> [ 163.372176] REGS: c00000000ba079b0 TRAP: 0700 Not tainted (6.17.0-rc5-next-20250911)
>>>>> [ 163.372183] MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 28000000 XER: 00000006
>>>>> [ 163.372214] CFAR: c0000000002bae9c IRQMASK: 0
>>>>> [ 163.372214] GPR00: c000000000d051d0 c00000000ba07c50 c00000000230a600 0000000000000075
>>>>> [ 163.372214] GPR04: 0000000000000004 0000000000000001 c000000000507e2c 0000000000000001
>>>>> [ 163.372214] GPR08: c000000d0cb87d13 0000000000000000 0000000000000000 a80e000000000000
>>>>> [ 163.372214] GPR12: c00e0001a1970fa2 c000000d0ddec700 c000000000208e58 c000000107b5e190
>>>>> [ 163.372214] GPR16: c00000000d3e5d08 c00000000b71cf78 c00000000d3e5d05 c00000000b71cf30
>>>>> [ 163.372214] GPR20: c00000000b71cf08 c00000000b71cf10 c000000019f58588 c000000004704bc8
>>>>> [ 163.372214] GPR24: c000000107b5e100 c000000004704bd0 0000000000000003 c000000004704bd0
>>>>> [ 163.372214] GPR28: c000000004704bc8 c000000019f585a8 c000000019f53da8 c000000004704bc8
>>>>> [ 163.372315] NIP [c000000000d051d4] __list_add_valid_or_report+0x124/0x188
>>>>> [ 163.372326] LR [c000000000d051d0] __list_add_valid_or_report+0x120/0x188
>>>>> [ 163.372335] Call Trace:
>>>>> [ 163.372339] [c00000000ba07c50] [c000000000d051d0] __list_add_valid_or_report+0x120/0x188 (unreliable)
>>>>> [ 163.372352] [c00000000ba07ce0] [c000000000834280] mem_cgroup_css_free+0xa0/0x27c
>>>>> [ 163.372363] [c00000000ba07d50] [c0000000003ba198] css_free_rwork_fn+0xd0/0x59c
>>>>> [ 163.372374] [c00000000ba07da0] [c0000000001f5d60] process_one_work+0x41c/0x89c
>>>>> [ 163.372385] [c00000000ba07eb0] [c0000000001f76c0] worker_thread+0x558/0x848
>>>>> [ 163.372394] [c00000000ba07f80] [c000000000209038] kthread+0x1e8/0x230
>>>>> [ 163.372406] [c00000000ba07fe0] [c00000000000ded8] start_kernel_thread+0x14/0x18
>>>>> [ 163.372416] Code: 4b9b1099 60000000 7f63db78 4bae8245 60000000 e8bf0008 3c62ff88 7fe6fb78 7fc4f378 38637d40 4b5b5c89 60000000 <0fe00000> 60000000 60000000 7f83e378
>>>>> [ 163.372453] ---[ end trace 0000000000000000 ]---
>>>>> [ 163.380581] pstore: backend (nvram) writing error (-1)
>>>>> [ 163.380593]
>>>>>
>>>>>
>>>>> If you happen to fix this issue, please add below tag.
>>>>>
>>>>>
>>>>> Reported-by: Venkat Rao Bagalkote <venkat88@...ux.ibm.com>
>>>>>
>>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> Venkat.
>>>>>
>>>>>
>>>>
>>>> After reverting the below commit, issue is not seen.
>>>>
>>>> commit 61bbf51e75df1a94cf6736e311cb96aeb79826a8
>>>> Author: Julian Sun <sunjunchao@...edance.com>
>>>> Date: Thu Aug 28 04:45:57 2025 +0800
>>>>
>>>> memcg: don't wait writeback completion when release memcg
>>>> Recently, we encountered the following hung task:
>>>> INFO: task kworker/4:1:1334558 blocked for more than 1720 seconds.
>>>> [Wed Jul 30 17:47:45 2025] Workqueue: cgroup_destroy css_free_rwork_fn
>>>> [Wed Jul 30 17:47:45 2025] Call Trace:
>>>> [Wed Jul 30 17:47:45 2025] __schedule+0x934/0xe10
>>>> [Wed Jul 30 17:47:45 2025] ? complete+0x3b/0x50
>>>> [Wed Jul 30 17:47:45 2025] ? _cond_resched+0x15/0x30
>>>> [Wed Jul 30 17:47:45 2025] schedule+0x40/0xb0
>>>> [Wed Jul 30 17:47:45 2025] wb_wait_for_completion+0x52/0x80
>>>> [Wed Jul 30 17:47:45 2025] ? finish_wait+0x80/0x80
>>>> [Wed Jul 30 17:47:45 2025] mem_cgroup_css_free+0x22/0x1b0
>>>> [Wed Jul 30 17:47:45 2025] css_free_rwork_fn+0x42/0x380
>>>> [Wed Jul 30 17:47:45 2025] process_one_work+0x1a2/0x360
>>>> [Wed Jul 30 17:47:45 2025] worker_thread+0x30/0x390
>>>> [Wed Jul 30 17:47:45 2025] ? create_worker+0x1a0/0x1a0
>>>> [Wed Jul 30 17:47:45 2025] kthread+0x110/0x130
>>>> [Wed Jul 30 17:47:45 2025] ? __kthread_cancel_work+0x40/0x40
>>>> [Wed Jul 30 17:47:45 2025] ret_from_fork+0x1f/0x30
>>>> The direct cause is that memcg spends a long time waiting for dirty page
>>>> writeback of foreign memcgs during release.
>>>> The root causes are:
>>>> a. The wb may have multiple writeback tasks, containing millions
>>>> of dirty pages, as shown below:
>>>> >>> for work in list_for_each_entry("struct wb_writeback_work", \
>>>> wb.work_list.address_of_(), "list"):
>>>> ... print(work.nr_pages, work.reason, hex(work))
>>>> ...
>>>> 900628 WB_REASON_FOREIGN_FLUSH 0xffff969e8d956b40
>>>> 1116521 WB_REASON_FOREIGN_FLUSH 0xffff9698332a9540
>>>> 1275228 WB_REASON_FOREIGN_FLUSH 0xffff969d9b444bc0
>>>> 1099673 WB_REASON_FOREIGN_FLUSH 0xffff969f0954d6c0
>>>> 1351522 WB_REASON_FOREIGN_FLUSH 0xffff969e76713340
>>>> 2567437 WB_REASON_FOREIGN_FLUSH 0xffff9694ae208400
>>>> 2954033 WB_REASON_FOREIGN_FLUSH 0xffff96a22d62cbc0
>>>> 3008860 WB_REASON_FOREIGN_FLUSH 0xffff969eee8ce3c0
>>>> 3337932 WB_REASON_FOREIGN_FLUSH 0xffff9695b45156c0
>>>> 3348916 WB_REASON_FOREIGN_FLUSH 0xffff96a22c7a4f40
>>>> 3345363 WB_REASON_FOREIGN_FLUSH 0xffff969e5d872800
>>>> 3333581 WB_REASON_FOREIGN_FLUSH 0xffff969efd0f4600
>>>> 3382225 WB_REASON_FOREIGN_FLUSH 0xffff969e770edcc0
>>>> 3418770 WB_REASON_FOREIGN_FLUSH 0xffff96a252ceea40
>>>> 3387648 WB_REASON_FOREIGN_FLUSH 0xffff96a3bda86340
>>>> 3385420 WB_REASON_FOREIGN_FLUSH 0xffff969efc6eb280
>>>> 3418730 WB_REASON_FOREIGN_FLUSH 0xffff96a348ab1040
>>>> 3426155 WB_REASON_FOREIGN_FLUSH 0xffff969d90beac00
>>>> 3397995 WB_REASON_FOREIGN_FLUSH 0xffff96a2d7288800
>>>> 3293095 WB_REASON_FOREIGN_FLUSH 0xffff969dab423240
>>>> 3293595 WB_REASON_FOREIGN_FLUSH 0xffff969c765ff400
>>>> 3199511 WB_REASON_FOREIGN_FLUSH 0xffff969a72d5e680
>>>> 3085016 WB_REASON_FOREIGN_FLUSH 0xffff969f0455e000
>>>> 3035712 WB_REASON_FOREIGN_FLUSH 0xffff969d9bbf4b00
>>>> b. The writeback might severely throttled by wbt, with a speed
>>>> possibly less than 100kb/s, leading to a very long writeback time.
>>>> >>> wb.write_bandwidth
>>>> (unsigned long)24
>>>> >>> wb.write_bandwidth
>>>> (unsigned long)13
>>>> The wb_wait_for_completion() here is probably only used to prevent
>>>> use-after-free. Therefore, we manage 'done' separately and automatically
>>>> free it.
>>>> This allows us to remove wb_wait_for_completion() while preventing the
>>>> use-after-free issue.
>>>> Fixes: 97b27821b485 ("writeback, memcg: Implement foreign dirty flushing")
>>>> Signed-off-by: Julian Sun <sunjunchao@...edance.com>
>>>> Acked-by: Tejun Heo <tj@...nel.org>
>>>> Cc: Michal Hocko <mhocko@...e.com>
>>>> Cc: Roman Gushchin <roman.gushchin@...ux.dev>
>>>> Cc: Johannes Weiner <hannes@...xchg.org>
>>>> Cc: Shakeel Butt <shakeelb@...gle.com>
>>>> Cc: Muchun Song <songmuchun@...edance.com>
>>>> Cc: <stable@...r.kernel.org>
>>>> Signed-off-by: Andrew Morton <akpm@...ux-foundation.org>
>>>>
>>>> Regards,
>>>> Venkat.
>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Julian Sun <sunjunchao@...edance.com>
>>
>
> Thanks,
> --
> Julian Sun <sunjunchao@...edance.com>