Message-ID: <aYxDkkDI4mk3r011@hyeyoo>
Date: Wed, 11 Feb 2026 17:53:38 +0900
From: Harry Yoo <harry.yoo@...cle.com>
To: Dev Jain <dev.jain@....com>
Cc: Shakeel Butt <shakeel.butt@...ux.dev>,
Andrew Morton <akpm@...ux-foundation.org>,
Johannes Weiner <hannes@...xchg.org>, Michal Hocko <mhocko@...nel.org>,
Roman Gushchin <roman.gushchin@...ux.dev>,
Muchun Song <muchun.song@...ux.dev>, Qi Zheng <qi.zheng@...ux.dev>,
Vlastimil Babka <vbabka@...e.cz>, linux-mm@...ck.org,
cgroups@...r.kernel.org, linux-kernel@...r.kernel.org,
Meta kernel team <kernel-team@...a.com>
Subject: Re: [PATCH 1/4] memcg: use mod_node_page_state to update stats
On Wed, Feb 11, 2026 at 01:07:40PM +0530, Dev Jain wrote:
>
> On 10/02/26 9:59 pm, Shakeel Butt wrote:
> > On Tue, Feb 10, 2026 at 01:08:49PM +0530, Dev Jain wrote:
> > [...]
> >>> Oh, so it is an arm64-specific issue. I tested on an x86-64 machine and it solves
> >>> the little regression it had before. So, on arm64 all this_cpu ops, i.e. the ones
> >>> without the double underscore, use LL/SC instructions.
> >>>
> >>> Need more thought on this.
> >>>
> >>>>> Also can you confirm whether my analysis of the regression was correct?
> >>>>> Because if it was, then this diff looks wrong - AFAIU preempt_disable()
> >>>>> won't stop an irq handler from interrupting the execution, so this
> >>>>> will introduce a bug for code paths running in irq context.
> >>>>>
> >>>> I was worried about the correctness too, but this_cpu_add() is safe
> >>>> against IRQs and so the stat will be _eventually_ consistent?
> >>>>
> >>>> Ofc it's so confusing! Maybe I'm the one confused.
> >>> Yeah, there is no issue with the proposed patch as it makes the function
> >>> re-entrant safe.
> >> Ah yes, this_cpu_add() does the addition in one shot, without a separate read-modify-write.
> >>
> >> I am still puzzled whether the original patch was a bug fix or an optimization.
> > The original patch was a cleanup patch. The memcg stats update functions
> > were already irq/nmi safe without disabling irqs and that patch did the
> > same for the numa stats. Though it seems like that is causing a regression
> > on arm64, as this_cpu* ops are expensive there.
> >
> >> The patch description says that the node stat update uses an irq-unsafe interface.
> >> Therefore, we had foo() calling __foo() wrapped in local_irq_save/restore. But
> >> there were code paths which called __foo() directly - so your patch fixes a bug, right?
> > No, those places were already disabling irqs and should be fine.
>
> Please correct me if I am missing something here. Simply putting an
> if (!irqs_disabled()) -> dump_stack() in __lruvec_stat_mod_folio, before
> calling __mod_node_page_state, reveals:
>
> [ 6.486375] Call trace:
> [ 6.486376] show_stack+0x20/0x38 (C)
> [ 6.486379] dump_stack_lvl+0x74/0x90
> [ 6.486382] dump_stack+0x18/0x28
> [ 6.486383] __lruvec_stat_mod_folio+0x160/0x180
> [ 6.486385] folio_add_file_rmap_ptes+0x128/0x480
> [ 6.486388] set_pte_range+0xe8/0x320
> [ 6.486389] finish_fault+0x260/0x508
> [ 6.486390] do_fault+0x2d0/0x598
> [ 6.486391] __handle_mm_fault+0x398/0xb60
> [ 6.486393] handle_mm_fault+0x15c/0x298
> [ 6.486394] __get_user_pages+0x204/0xb88
> [ 6.486395] populate_vma_page_range+0xbc/0x1b8
> [ 6.486396] __mm_populate+0xcc/0x1e0
> [ 6.486397] __arm64_sys_mlockall+0x1d4/0x1f8
> [ 6.486398] invoke_syscall+0x50/0x120
> [ 6.486399] el0_svc_common.constprop.0+0x48/0xf0
> [ 6.486400] do_el0_svc+0x24/0x38
> [ 6.486400] el0_svc+0x34/0xf0
> [ 6.486402] el0t_64_sync_handler+0xa0/0xe8
> [ 6.486404] el0t_64_sync+0x198/0x1a0
>
> Indeed, finish_fault() takes the PTL spinlock without disabling IRQs.
That indeed looks incorrect to me.
I was assuming __foo() is always called with IRQs disabled!
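
To make the hazard concrete, here is a simplified model (illustrative only,
not the actual mm/ code; the counter name is hypothetical) of why the
underscored variant relies on the caller disabling IRQs: the per-CPU update
is an open-coded read-modify-write, so an interrupt between the load and the
store can lose an update:

	/* Simplified model of a __this_cpu-style stat update, not real code */
	static DEFINE_PER_CPU(long, nr_file_mapped);	/* hypothetical counter */

	static void node_stat_add_unsafe(long delta)
	{
		long v;

		v = __this_cpu_read(nr_file_mapped);		/* load  */
		/*
		 * If an interrupt fires here and its handler also updates
		 * nr_file_mapped, that update is overwritten below: we store
		 * back the stale value read above plus delta.
		 */
		__this_cpu_write(nr_file_mapped, v + delta);	/* store */
	}

With the call chain you showed (fault path under the PTE lock, IRQs enabled),
that window is reachable, which matches your dump_stack() output.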
> > I am working on adding batched stats update functionality in the hope
> > that it will fix the regression.
>
> Thanks! FYI, I have zeroed in on the issue: it is preempt_disable(). Dropping this
> from _pcpu_protect_return solves the regression.
That's interesting, why is the cost of preempt disable/enable so high?
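
For reference, my rough mental model of the difference (a simplified sketch,
not the literal expansions in arch/x86/include/asm/percpu.h and
arch/arm64/include/asm/percpu.h; atomic_long_add() below is only a stand-in
for the generated asm, and the counter is hypothetical) looks like this:

	/*
	 * x86-64: this_cpu_add() is a single segment-relative instruction,
	 * e.g. "addq $1, %gs:var", which is already safe against preemption
	 * and interrupts on the local CPU.
	 *
	 * arm64: address generation and the atomic are separate instructions,
	 * so the whole sequence is bracketed by preempt_disable()/enable():
	 */
	static DEFINE_PER_CPU(long, var);	/* hypothetical counter */

	static inline void model_this_cpu_add(long val)
	{
		long *ptr;

		preempt_disable();		/* keep addr and op on one CPU */
		ptr = raw_cpu_ptr(&var);	/* per-CPU address generation  */
		/* real code: LSE ldadd, or an ldxr/stxr (LL/SC) retry loop  */
		atomic_long_add(val, (atomic_long_t *)ptr);	/* stand-in   */
		preempt_enable();		/* count dec + resched check   */
	}

So each this_cpu_add() on arm64 carries at least the preempt count
manipulation and the resched check in preempt_enable() on top of the atomic
itself.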
> Unlike x86, arm64 does a preempt_disable()
> when doing this_cpu_*. On a cursory look it seems like this is unnecessary - since we
> do preempt_enable() immediately after reading the pointer, CPU migration is
> possible anyway, so there is nothing to be gained by reading the pcpu pointer with
> preemption disabled. I am investigating whether we can simply drop this in general.
Let me quote an old email from Mark Rutland [1]:
> We also thought that initially, but there's a subtle race that can
> occur, and so we added code to disable preemption in commit:
>
> f3eab7184ddcd486 ("arm64: percpu: Make this_cpu accessors pre-empt safe")
>
> The problem on arm64 is that our atomics take a single base register,
> and we have to generate the percpu address with separate instructions
> from the atomic itself. That means we can get preempted between address
> generation and the atomic, which is problematic for sequences like:
>
>   // Thread-A                    // Thread-B
>
>   this_cpu_add(var)
>                                  local_irq_disable(flags)
>                                  ...
>                                  v = __this_cpu_read(var);
>                                  v = some_function(v);
>                                  __this_cpu_write(var, v);
>                                  ...
>                                  local_irq_restore(flags)
>
> ... which can unexpectedly race as:
>
>
>   // Thread-A                    // Thread-B
>
>   < generate CPU X addr >
>   < preempted >
>
>                                  < scheduled on CPU X >
>                                  local_irq_disable(flags);
>                                  v = __this_cpu_read(var);
>
>   < scheduled on CPU Y >
>   < add to CPU X's var >
>                                  v = some_function(v);
>                                  __this_cpu_write(var, v);
>                                  local_irq_restore(flags);
>
> ... and hence we lose an update to a percpu variable.
... so, removing preempt disable _in general_ is probably not a good idea.
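
To spell out what dropping it would mean, here is the same sketch without
the protection (again illustrative only, with atomic_long_add() standing in
for the asm and `var` being the hypothetical per-CPU counter from the sketch
above):

	static inline void model_this_cpu_add_unprotected(long val)
	{
		long *ptr;

		ptr = raw_cpu_ptr(&var);	/* CPU X's copy */
		/*
		 * We can be migrated here. If we are moved to CPU Y, the
		 * atomic below still lands on CPU X's copy and races with a
		 * __this_cpu_read()/__this_cpu_write() sequence running on
		 * CPU X under local_irq_save() - exactly the interleaving
		 * quoted above.
		 */
		atomic_long_add(val, (atomic_long_t *)ptr);	/* stand-in */
	}

The atomic itself is still fine; it is the address that can go stale.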
[1] https://lore.kernel.org/all/20190311164837.GD24275@lakrids.cambridge.arm.com
--
Cheers,
Harry / Hyeonggon