Message-ID: <12b84e27-f9b1-477e-8e56-4b7c6727e86b@oracle.com>
Date: Thu, 3 Jul 2025 16:35:42 -0700
From: Libo Chen <libo.chen@...cle.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: "Chen, Yu C" <yu.c.chen@...el.com>, Michal Hocko <mhocko@...e.com>,
Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Tim Chen <tim.c.chen@...el.com>, linux-kernel@...r.kernel.org,
Jirka Hladky <jhladky@...hat.com>,
Srikanth Aithal <Srikanth.Aithal@....com>,
Suneeth D <Suneeth.D@....com>
Subject: Re: [PATCH] sched/numa: Fix NULL pointer access to mm_struct durng
task swap
On 7/3/25 07:18, Peter Zijlstra wrote:
> On Thu, Jul 03, 2025 at 06:57:04AM -0700, Libo Chen wrote:
>>
>>
>> On 7/3/25 05:36, Peter Zijlstra wrote:
>>> On Thu, Jul 03, 2025 at 05:20:47AM -0700, Libo Chen wrote:
>>>
>>>> I agree. The other parts, schedstat and vmstat, are still quite helpful.
>>>> Also tracepoints are more expensive than counters once enabled, I think
>>>> that's too much for just counting numbers.
>>>
>>> I'm not generally a fan of eBPF, but supposedly it is really good for
>>> stuff like this.
>>>
>>
>> Yeah but not nearly as good as, for example, __schedstat_inc(var) which
>> probably only takes a few CPU cycles if var is in the right place. eBPF
>> is gonna take a whole bunch of sequences to even get to updating an eBPF
>> map which itself is much more expensive than __schedstat_inc(var).
>>
>> For one, __migrate_swap_task() happens when dst node is fully busy (most
>> likely src node is full as well), so the overhead of ebpf could be quite
>> noticeable.
>
> But that overhead is only paid if you actually care about the numbers;
> most people don't.
>
> We already stick static branches in many of the accounting paths --
> because we know they hurt.
>
> But look at this:
>
> __schedstat_inc(p->stats.numa_task_swapped);
> count_vm_numa_event(NUMA_TASK_SWAP);
> count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
>
> that is _3_ different counters, 3 cachelines touched. For what?
>
> Would not a single:
>
> trace_numa_task_swap_tp(p);
>
> be much saner? It translates into a single no-op; no lines touched. Only
> when someone wants the numbers do we attach to the tracepoint and start
> collecting things.
>
> Is the collecting more expensive; maybe. But the rest of us will be
> better off, no?

Probably not as bad as you may think. Systems with a single NUMA node or
with NUMA balancing disabled (which will be most machines) won't be
affected by this at all. task_numa_migrate() is also ratelimited, so it
doesn't get touched nearly as often as most other scheduler events.

If this were on a really hot and critical path that most of us have to
take, such as wakeup, I wouldn't argue with you at all. I don't want to be
too persistent here; it's fine to use eBPF with the existing tracepoints. I
just think this is convenient and doesn't really hurt those who don't care
about these numbers.
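
To make the trade-off in this thread concrete, here is a minimal userspace
sketch in C. It is not kernel code: struct counters, account_always(),
struct toy_tracepoint and the tp_*() helpers are all hypothetical names
invented for illustration. The always-on path writes three separate
counters on every call, while the toy tracepoint does no accounting at all
until a probe is attached. A real kernel tracepoint goes further and
patches the call site to a no-op when disabled, whereas this sketch still
reads one pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the three counters named in the thread. */
struct counters {
	unsigned long schedstat_swapped; /* stands in for p->stats.numa_task_swapped */
	unsigned long vm_event;          /* stands in for count_vm_numa_event() */
	unsigned long memcg_event;       /* stands in for count_memcg_event_mm() */
};

/* Always-on accounting: every call writes three separate locations. */
static void account_always(struct counters *c)
{
	c->schedstat_swapped++;
	c->vm_event++;
	c->memcg_event++;
}

/*
 * Toy tracepoint: nothing is counted unless a probe is attached.
 * Assumption: a plain NULL check replaces the kernel's static-key
 * code patching, so the disabled path here is cheap but not free.
 */
typedef void (*probe_fn)(void *data);

struct toy_tracepoint {
	probe_fn probe;
	void *data;
};

static void tp_attach(struct toy_tracepoint *tp, probe_fn fn, void *data)
{
	assert(fn != NULL);
	tp->probe = fn;
	tp->data = data;
}

static void tp_detach(struct toy_tracepoint *tp)
{
	tp->probe = NULL;
	tp->data = NULL;
}

/* Call site: does nothing when no probe is attached. */
static void tp_fire(struct toy_tracepoint *tp)
{
	if (tp->probe)
		tp->probe(tp->data);
}

/* Example probe: the interested party pays for its own counter. */
static void count_probe(void *data)
{
	unsigned long *n = data;

	(*n)++;
}
```

In this sketch the cost of counting moves to whoever calls tp_attach();
callers that never attach only pay for the pointer check in tp_fire().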