Message-ID: <CAPjX3FfDTdUvMCDJCP8tAeNeaYSWj9mSsrMmE=VP0kWAdJTSVQ@mail.gmail.com>
Date: Fri, 22 Nov 2024 18:33:31 +0100
From: Daniel Vacek <neelx@...e.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>, 
	Vincent Guittot <vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>, 
	Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, 
	Valentin Schneider <vschneid@...hat.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] sched/fair: properly serialize the cfs_rq h_load calculation

On Fri, Nov 22, 2024 at 4:42 PM Peter Zijlstra <peterz@...radead.org> wrote:
>
> On Fri, Nov 22, 2024 at 04:28:55PM +0100, Daniel Vacek wrote:
> > Make sure the given cfs_rq's h_load is always correctly updated. This
> > prevents a race between more CPUs eventually updating the same hierarchy
> > of h_load_next return pointers.
>
> Is there an actual problem observed?

Well, that depends. Do we care about a correct (exact) load
calculation every time?
If it's not a big deal, we may just drop this patch.
I am not sure what real problems (if any) this can cause. I have not
observed any that I'm aware of. Actually I should have labeled this
[RFC], but I forgot :-/

This is called from `try_to_wake_up` => `select_task_rq_fair`.
If two (or more) CPUs race to wake up *different* tasks on the same rq
(I mean the same target CPU) and each ends up in `task_h_load()`, they
may get a wrong result if the tasks are in different cgroups. Wrong in
the sense that `cfs_rq->h_load` may not be up to date, so the old
value is used for all but one of the racing cgroups (cfs_rqs).
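
To make the interleaving concrete, here is a minimal user-space sketch
of the race. It is a simplified Python model, not the kernel code: the
classes `CfsRq` and `Entity` and the helpers `make_group`, `walk_up`,
`walk_down` and `simulate` are made up for illustration, and the
two-phase split (record `h_load_next` pointers on the way up, then
follow them on the way down) is my reading of how the h_load update
works. The load numbers are arbitrary.

```python
class CfsRq:
    """Simplified stand-in for a per-CPU cfs_rq (hypothetical model)."""
    def __init__(self, load_avg, se=None):
        self.load_avg = load_avg
        self.se = se              # group entity representing this rq in its parent
        self.h_load = 0
        self.h_load_next = None   # "return pointer" used by the downward walk
        self.last_update = -1     # jiffy of the last h_load update


class Entity:
    """Simplified stand-in for a group sched_entity."""
    def __init__(self, load_avg, parent_rq):
        self.load_avg = load_avg
        self.parent_rq = parent_rq  # cfs_rq this entity is queued on
        self.group_rq = None        # cfs_rq owned by this entity's group


def make_group(parent_rq, load):
    se = Entity(load, parent_rq)
    se.group_rq = CfsRq(load, se)
    return se.group_rq


def walk_up(cfs_rq, now):
    """Phase 1: leave h_load_next pointers from the root down to cfs_rq."""
    if cfs_rq.last_update == now:
        return None               # already updated in this jiffy
    cfs_rq.h_load_next = None
    top, se = cfs_rq, cfs_rq.se
    while se is not None:
        top = se.parent_rq
        top.h_load_next = se      # may overwrite another CPU's pointer
        if top.last_update == now:
            break
        se = top.se
    if se is None:                # reached the root: refresh its h_load
        top.h_load = top.load_avg
        top.last_update = now
    return top


def walk_down(top, now):
    """Phase 2: follow h_load_next pointers, scaling h_load at each level."""
    cfs_rq = top
    while cfs_rq is not None and cfs_rq.h_load_next is not None:
        se = cfs_rq.h_load_next
        load = cfs_rq.h_load * se.load_avg // (cfs_rq.load_avg + 1)
        cfs_rq = se.group_rq
        cfs_rq.h_load = load
        cfs_rq.last_update = now


def simulate(interleaved):
    root = CfsRq(1023)
    a, b = make_group(root, 512), make_group(root, 256)
    for rq in (a, b):                       # jiffy 1: no race
        walk_down(walk_up(rq, 1), 1)
    a.se.load_avg = 256                     # A's share shrinks afterwards
    if interleaved:                         # jiffy 2: two CPUs race
        top_a = walk_up(a, 2)               # CPU 1: root.h_load_next = A
        top_b = walk_up(b, 2)               # CPU 2: overwrites it with B
        walk_down(top_a, 2)                 # CPU 1 ends up updating B
        walk_down(top_b, 2)                 # CPU 2 updates B again
    else:
        for rq in (a, b):
            walk_down(walk_up(rq, 2), 2)
    return a, b
```

In this model `simulate(True)` leaves group A with its jiffy-1 `h_load`
while group B is refreshed twice, whereas `simulate(False)` refreshes
both. A's `last_update` is not advanced in the racy case, so a later
`walk_up(a, 2)` / `walk_down()` pair recomputes it.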

I could detect the race collisions almost every minute on my lightly
loaded laptop (using bpftrace, which admittedly widened the window a
bit, but the race can certainly happen). Though I am not sure whether
it's a big deal.
The `cfs_rq->h_load` will get updated the next time the race does not
happen, so very likely on the very next call.
And we may be perfectly fine using the old value from time to time.
The question is whether we are fine with that or not. I guess we are
and this patch can be dropped, right?

It almost looks like the function is deliberately designed this way:
we do not really care about unlikely failures, because the worst that
can happen is that a slightly older value is kept in `h_load`. It may
not even differ much from the correct value, I guess, and it will
(most) definitely get fixed/updated the next time.

If that is really the intention of the current design, let's just drop
this patch.

I understand that this adds another lock to the scheduler, which
always needs careful consideration. On the other hand, the race is
limited to once per jiffy for a given CPU; otherwise the condition
bails out early. By the nature of this race, contention should be
unlikely most of the time. In that respect I was considering just
using the rq lock, but a dedicated one actually looked simpler to me
after all. The scope of the lock is also clear this way: it serves
only this one purpose, which we may not need or care about after
all.
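
As a rough illustration of why contention should be rare, the
once-per-jiffy bail-out combined with a dedicated lock can be sketched
like this. This is a user-space Python analogy and not the patch
itself: the `HLoadCache` class, its `compute` callback and the re-check
under the lock are assumptions made for the example.

```python
import threading


class HLoadCache:
    """Hypothetical model: a value refreshed at most once per jiffy."""

    def __init__(self, compute):
        self._compute = compute          # stands in for the hierarchy walk
        self._lock = threading.Lock()    # dedicated lock, one purpose only
        self.value = 0
        self.last_update = -1
        self.recomputed = 0              # counts slow-path refreshes

    def get(self, now):
        # Fast path: already refreshed in this jiffy, no lock taken.
        if self.last_update == now:
            return self.value
        # Slow path: serialize the refresh; re-check in case another
        # thread finished it while we were waiting for the lock.
        with self._lock:
            if self.last_update != now:
                self.value = self._compute()
                self.last_update = now
                self.recomputed += 1
            return self.value
```

Only callers that arrive before the first refresh of a given jiffy
ever touch the lock; everyone else returns from the fast path.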

Hence I'm wondering what your opinion on this is.
Would we benefit from a guaranteed correct calculation every time in
exchange for a little overhead?
Also, can you suggest a stress test, benchmark, or any workload that
heavily exercises task wakeups so that I can try to quantify the
added overhead?

--nX