Message-ID: <aDWN9mP3wPQnwo4m@gmail.com>
Date: Tue, 27 May 2025 03:03:34 -0700
From: Breno Leitao <leitao@...ian.org>
To: Shakeel Butt <shakeel.butt@...ux.dev>
Cc: Johannes Weiner <hannes@...xchg.org>, Michal Hocko <mhocko@...nel.org>,
Roman Gushchin <roman.gushchin@...ux.dev>,
Muchun Song <muchun.song@...ux.dev>,
Andrew Morton <akpm@...ux-foundation.org>,
Chen Ridong <chenridong@...wei.com>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Michal Hocko <mhocko@...e.com>, cgroups@...r.kernel.org,
linux-mm@...ck.org, linux-kernel@...r.kernel.org,
kernel-team@...a.com, Michael van der Westhuizen <rmikey@...a.com>,
Usama Arif <usamaarif642@...il.com>,
Pavel Begunkov <asml.silence@...il.com>,
Rik van Riel <riel@...riel.com>
Subject: Re: [PATCH] memcg: Always call cond_resched() after fn()
Hello Shakeel,
On Fri, May 23, 2025 at 11:21:39AM -0700, Shakeel Butt wrote:
> On Fri, May 23, 2025 at 10:21:06AM -0700, Breno Leitao wrote:
> > I am seeing a soft lockup on certain machine types when a cgroup
> > OOMs. This happens because killing the processes on those machines
> > can be very slow, which causes the soft lockup and RCU stalls. It
> > usually happens when the cgroup has MANY processes and
> > memory.oom.group is set.
> >
> > Example I am seeing in real production:
> >
> > [462012.244552] Memory cgroup out of memory: Killed process 3370438 (crosvm) ....
> > ....
> > [462037.318059] Memory cgroup out of memory: Killed process 4171372 (adb) ....
> > [462037.348314] watchdog: BUG: soft lockup - CPU#64 stuck for 26s! [stat_manager-ag:1618982]
> > ....
> >
>
> > A quick look at why this is so slow suggests it is related to the
> > serial console flush on certain machine types. In all the crashes I
> > saw, the target CPU was in console_flush_all().
> >
> > In the case above, there are thousands of processes in the cgroup,
> > and the CPU soft locks up before the iteration count reaches the 1024
> > limit in the code (which would call cond_resched()). So, calling
> > cond_resched() only once every 1024 iterations is not sufficient.
> >
> > Remove the counter-based conditional rescheduling logic and call
> > cond_resched() unconditionally after each task iteration, once fn()
> > has been called. This avoids the lockup regardless of how slow fn() is.
> >
> > Cc: Michael van der Westhuizen <rmikey@...a.com>
> > Cc: Usama Arif <usamaarif642@...il.com>
> > Cc: Pavel Begunkov <asml.silence@...il.com>
> > Suggested-by: Rik van Riel <riel@...riel.com>
> > Signed-off-by: Breno Leitao <leitao@...ian.org>
> > Fixes: 46576834291869457 ("memcg: fix soft lockup in the OOM process")
>
> Can you share the call stack? From the above, I think it is coming
> from oom_kill_memcg_member().
Sure, this is what I see at crash time:
[73963.996160] Memory cgroup out of memory: Killed process 177737 (adb) total-vm:24896kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:320kB oom_score_adj:0
[73964.026146] Memory cgroup out of memory: Killed process 177738 (sh) total-vm:8064kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:320kB oom_score_adj:0
[73964.055784] Memory cgroup out of memory: Killed process 177739 (adb) total-vm:24896kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:320kB oom_score_adj:0
[73964.085773] Memory cgroup out of memory: Killed process 177740 (sh) total-vm:8064kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:320kB oom_score_adj:0
[73964.115468] Memory cgroup out of memory: Killed process 177742 (adb) total-vm:24896kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:320kB oom_score_adj:0
[73964.145375] watchdog: BUG: soft lockup - CPU#20 stuck for 26s! [node-linux-arm6:159076]
[73964.145376] CPU#20 Utilization every 4s during lockup:
[73964.145377] #1: 4% system, 0% softirq, 0% hardirq, 0% idle
[73964.145379] #2: 4% system, 0% softirq, 0% hardirq, 0% idle
[73964.145380] #3: 4% system, 0% softirq, 0% hardirq, 0% idle
[73964.145380] #4: 4% system, 0% softirq, 0% hardirq, 0% idle
[73964.145381] #5: 6% system, 0% softirq, 0% hardirq, 0% idle
[73964.145382] Modules linked in: vhost_vsock(E) ghes_edac(E) bpf_preload(E) tls(E) tcp_diag(E) inet_diag(E) sch_fq(E) act_gact(E) cls_bpf(E) ipmi_ssif(E) ipmi_devintf(E) crct10dif_ce(E) sm3_ce(E) sm3(E) sha3_ce(E) nvidia_cspmu(E) sha512_ce(E) sha512_arm64(E) arm_smmuv3_pmu(E) arm_cspmu_module(E) coresight_trbe(E) arm_spe_pmu(E) coresight_stm(E) coresight_tmc(E) ipmi_msghandler(E) coresight_etm4x(E) stm_core(E) coresight_funnel(E) spi_tegra210_quad(E) coresight(E) cppc_cpufreq(E) sch_fq_codel(E) vhost_net(E) tun(E) vhost(E) vhost_iotlb(E) tap(E) mpls_gso(E) mpls_iptunnel(E) mpls_router(E) fou(E) acpi_power_meter(E) loop(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) autofs4(E) efivarfs(E)
[73964.145407] CPU: 20 UID: 0 PID: 159076 Comm: node-linux-arm6 Kdump: loaded Tainted: G E
[73964.145409] Tainted: [E]=UNSIGNED_MODULE
[73964.145410] Hardware name: .....
[73964.145411] pstate: 63401009 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
[73964.145412] pc : console_flush_all+0x3bc/0x550
[73964.145418] lr : console_flush_all+0x3b4/0x550
[73964.145419] sp : ffff80014baceff0
[73964.145420] x29: ffff80014bacf040 x28: 0000000000000001 x27: ffff800082d2d008
[73964.145421] x26: 0000000000000000 x25: 0000000000000000 x24: ffff00017bd00000
[73964.145422] x23: 0000000000000000 x22: 0000000000000084 x21: ffff80008253a718
[73964.145423] x20: ffff800082526b08 x19: ffff80014bacf0ac x18: 0000000000000018
[73964.145424] x17: 0000000000000058 x16: 00000000ffffffff x15: 0000000000003638
[73964.145425] x14: 0000000000000001 x13: 65636f7270206465 x12: 6c6c694b203a7972
[73964.145427] x11: ffff800083086000 x10: ffff80008308625c x9 : 00000000ffffffff
[73964.145428] x8 : 0000000000000000 x7 : 205d383634353131 x6 : 312e34363933375b
[73964.145429] x5 : ffff8000830864f7 x4 : ffff80014bacee7f x3 : ffff8000803ce390
[73964.145430] x2 : 00000000000000aa x1 : 0000000000000000 x0 : 0000000000000000
[73964.145432] Call trace:
[73964.145433] console_flush_all+0x3bc/0x550 (P)
[73964.145434] console_unlock+0x90/0x188
[73964.145437] vprintk_emit+0x3c8/0x560
[73964.145439] vprintk_default+0x3c/0x50
[73964.145441] vprintk+0x2c/0x40
[73964.145443] _printk+0x50/0x68
[73964.145444] __oom_kill_process+0x36c/0x5f8
[73964.145445] oom_kill_memcg_member+0x54/0xb8
[73964.145447] mem_cgroup_scan_tasks+0xa4/0x190
[73964.145449] oom_kill_process+0x124/0x290
[73964.145450] out_of_memory+0x194/0x4b8
[73964.145451] mem_cgroup_out_of_memory+0xcc/0x110
[73964.145452] __mem_cgroup_charge+0x5d8/0x9e0
[73964.145454] filemap_add_folio+0x44/0xe0
[73964.145456] alloc_extent_buffer+0x2a8/0xaa8
[73964.145458] read_block_for_search+0x204/0x308
[73964.145460] btrfs_search_slot+0x5bc/0x998
[73964.145463] btrfs_lookup_file_extent+0x44/0x58
[73964.145465] btrfs_get_extent+0x130/0x900
[73964.145467] btrfs_do_readpage+0x2d8/0x798
[73964.145469] btrfs_readahead+0x64/0x198
[73964.145470] read_pages+0x58/0x370
[73964.145471] page_cache_ra_unbounded+0x218/0x260
[73964.145473] page_cache_ra_order+0x2d0/0x338
[73964.145474] filemap_fault+0x418/0xe68
[73964.145475] __do_fault+0xb0/0x270
[73964.145476] do_pte_missing+0x73c/0x1148
[73964.145477] handle_mm_fault+0x2c8/0xeb8
[73964.145478] do_translation_fault+0x250/0x820
[73964.145479] do_mem_abort+0x40/0xc8
[73964.145481] el0_ia+0x60/0x100
[73964.145483] el0t_64_sync_handler+0xe8/0x100
[73964.145484] el0t_64_sync+0x168/0x170
[73964.145486] Kernel panic - not syncing: softlockup: hung tasks
[73964.145487] CPU: 20 UID: 0 PID: 159076 Comm: node-linux-arm6 Kdump: loaded Tainted: G EL
[73964.145489] Tainted: [E]=UNSIGNED_MODULE, [L]=SOFTLOCKUP
[73964.145490] Call trace:
[73964.145491] show_stack+0x1c/0x30 (C)
[73964.145492] dump_stack_lvl+0x38/0x80
[73964.145494] panic+0x11c/0x370
[73964.145496] watchdog_timer_fn+0x5c0/0x968
[73964.145498] __hrtimer_run_queues+0x100/0x310
[73964.145500] hrtimer_interrupt+0x110/0x400
[73964.145502] arch_timer_handler_phys+0x34/0x50
[73964.145503] handle_percpu_devid_irq+0x88/0x1c0
[73964.145505] generic_handle_domain_irq+0x48/0x80
[73964.145506] gic_handle_irq+0x4c/0x108
[73964.145507] call_on_irq_stack+0x24/0x30
[73964.145509] do_interrupt_handler+0x50/0x78
[73964.145511] el1_interrupt+0x30/0x48
[73964.145512] el1h_64_irq_handler+0x14/0x20
[73964.145513] el1h_64_irq+0x6c/0x70
[73964.145514] console_flush_all+0x3bc/0x550 (P)
[73964.145515] console_unlock+0x90/0x188
[73964.145517] vprintk_emit+0x3c8/0x560
[73964.145517] vprintk_default+0x3c/0x50
[73964.145519] vprintk+0x2c/0x40
[73964.145520] _printk+0x50/0x68
[73964.145521] __oom_kill_process+0x36c/0x5f8
[73964.145522] oom_kill_memcg_member+0x54/0xb8
[73964.145524] mem_cgroup_scan_tasks+0xa4/0x190
[73964.145525] oom_kill_process+0x124/0x290
[73964.145526] out_of_memory+0x194/0x4b8
[73964.145527] mem_cgroup_out_of_memory+0xcc/0x110
[73964.145528] __mem_cgroup_charge+0x5d8/0x9e0
[73964.145530] filemap_add_folio+0x44/0xe0
[73964.145531] alloc_extent_buffer+0x2a8/0xaa8
[73964.145532] read_block_for_search+0x204/0x308
[73964.145534] btrfs_search_slot+0x5bc/0x998
[73964.145536] btrfs_lookup_file_extent+0x44/0x58
[73964.145538] btrfs_get_extent+0x130/0x900
[73964.145540] btrfs_do_readpage+0x2d8/0x798
[73964.145542] btrfs_readahead+0x64/0x198
[73964.145544] read_pages+0x58/0x370
[73964.145545] page_cache_ra_unbounded+0x218/0x260
[73964.145546] page_cache_ra_order+0x2d0/0x338
[73964.145547] filemap_fault+0x418/0xe68
[73964.145548] __do_fault+0xb0/0x270
[73964.145549] do_pte_missing+0x73c/0x1148
[73964.145549] handle_mm_fault+0x2c8/0xeb8
[73964.145550] do_translation_fault+0x250/0x820
[73964.145552] do_mem_abort+0x40/0xc8
[73964.145554] el0_ia+0x60/0x100
[73964.145556] el0t_64_sync_handler+0xe8/0x100
[73964.145557] el0t_64_sync+0x168/0x170
[73964.145560] SMP: stopping secondary CPUs
[73964.145803] Starting crashdump kernel...
[73964.145804] Bye!
> Have you tried making __oom_kill_process
> less chatty? Basically, instead of dumping to the serial console
> directly, use a local buffer and then dump it once it is full.
I am not sure I follow you here. __oom_kill_process() does the following:
static void __oom_kill_process(struct task_struct *victim, const char *message)
{
        ...
        pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB oom_score_adj:%hd\n",
               ...);
}
Would you print into a local buffer, and then flush it all at once
(with a single pr_err())?
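
Something like the hypothetical sketch below (untested; the helper name,
buffer size and flush threshold are made up, and locking plus the final
flush are omitted)?

/*
 * Hypothetical sketch: accumulate the per-victim lines in a local
 * buffer and emit them with a single pr_err(), instead of paying for
 * one serial console flush per killed process.
 */
static void oom_report_victim(struct task_struct *victim, const char *message)
{
        static char buf[1024];
        static size_t len;

        len += scnprintf(buf + len, sizeof(buf) - len,
                         "%s: Killed process %d (%s)\n",
                         message, task_pid_nr(victim), victim->comm);

        /* Flush once the buffer is nearly full. */
        if (len > sizeof(buf) - 160) {
                pr_err("%s", buf);
                len = 0;
        }
}
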
> Anyways, that would be a bit more involved and until then this seems
> fine to me.
Agreed. Thanks for the review.