Message-ID: <0f8dec04-db47-d043-694f-601baa2ea615@redhat.com>
Date: Fri, 27 Jan 2023 14:09:01 -0500
From: Waiman Long <longman@...hat.com>
To: Peter Zijlstra <peterz@...radead.org>,
Will Deacon <will@...nel.org>
Cc: Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
Tejun Heo <tj@...nel.org>, Zefan Li <lizefan.x@...edance.com>,
Johannes Weiner <hannes@...xchg.org>,
linux-kernel@...r.kernel.org,
Linus Torvalds <torvalds@...ux-foundation.org>,
Lai Jiangshan <jiangshanlai@...il.com>, qperret@...gle.com,
Tejun Heo <tj@...nel.org>
Subject: Re: [PATCH v10 2/5] sched: Use user_cpus_ptr for saving user provided
cpumask in sched_setaffinity()

On 1/27/23 13:36, Peter Zijlstra wrote:
> On Tue, Jan 17, 2023 at 04:08:26PM +0000, Will Deacon wrote:
>> Hi Waiman,
>>
>> On Thu, Sep 22, 2022 at 02:00:38PM -0400, Waiman Long wrote:
>>> The user_cpus_ptr field is added by commit b90ca8badbd1 ("sched:
>>> Introduce task_struct::user_cpus_ptr to track requested affinity"). It
>>> is currently used only by arm64 arch due to possible asymmetric CPU
>>> setup. This patch extends its usage to save user provided cpumask
>>> when sched_setaffinity() is called for all arches. With this patch
>>> applied, user_cpus_ptr, once allocated after a successful call to
>>> sched_setaffinity(), will only be freed when the task exits.
>>>
>>> Since user_cpus_ptr is supposed to be used for "requested
>>> affinity", there is actually no point to save current cpu affinity in
>>> restrict_cpus_allowed_ptr() if sched_setaffinity() has never been called.
>>> Modify the logic to set user_cpus_ptr only in sched_setaffinity() and use
>>> it in restrict_cpus_allowed_ptr() and relax_compatible_cpus_allowed_ptr()
>>> if defined but not changing it.
>>>
>>> This will be some changes in behavior for arm64 systems with asymmetric
>>> CPUs in some corner cases. For instance, if sched_setaffinity()
>>> has never been called and there is a cpuset change before
>>> relax_compatible_cpus_allowed_ptr() is called, its subsequent call will
>>> follow what the cpuset allows but not what the previous cpu affinity
>>> setting allows.
>>>
>>> Signed-off-by: Waiman Long <longman@...hat.com>
>>> ---
>>> kernel/sched/core.c | 82 ++++++++++++++++++++------------------------
>>> kernel/sched/sched.h | 7 ++++
>>> 2 files changed, 44 insertions(+), 45 deletions(-)
>> We've tracked this down as the cause of an arm64 regression in Android and I've
>> reproduced the issue with mainline.
>>
>> Basically, if an arm64 system is booted with "allow_mismatched_32bit_el0" on
>> the command-line, then the arch code will (amongst other things) call
>> force_compatible_cpus_allowed_ptr() and relax_compatible_cpus_allowed_ptr()
>> when exec()'ing a 32-bit or a 64-bit task respectively.
>>
>> If you consider a system where everything is 64-bit but the cmdline option
>> above is present, then the call to relax_compatible_cpus_allowed_ptr() isn't
>> expected to do anything in this case, and the old code made sure of that:
>>
>>> @@ -3055,30 +3032,21 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask);
>>>
>>>  /*
>>>   * Restore the affinity of a task @p which was previously restricted by a
>>> - * call to force_compatible_cpus_allowed_ptr(). This will clear (and free)
>>> - * @p->user_cpus_ptr.
>>> + * call to force_compatible_cpus_allowed_ptr().
>>>   *
>>>   * It is the caller's responsibility to serialise this with any calls to
>>>   * force_compatible_cpus_allowed_ptr(@p).
>>>   */
>>>  void relax_compatible_cpus_allowed_ptr(struct task_struct *p)
>>>  {
>>> -	struct cpumask *user_mask = p->user_cpus_ptr;
>>> -	unsigned long flags;
>>> +	int ret;
>>>
>>>  	/*
>>> -	 * Try to restore the old affinity mask. If this fails, then
>>> -	 * we free the mask explicitly to avoid it being inherited across
>>> -	 * a subsequent fork().
>>> +	 * Try to restore the old affinity mask with __sched_setaffinity().
>>> +	 * Cpuset masking will be done there too.
>>>  	 */
>>> -	if (!user_mask || !__sched_setaffinity(p, user_mask))
>>> -		return;
>> ... since it returned early here if '!user_mask' ...
>>
>>> -
>>> -	raw_spin_lock_irqsave(&p->pi_lock, flags);
>>> -	user_mask = clear_user_cpus_ptr(p);
>>> -	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
>>> -
>>> -	kfree(user_mask);
>>> +	ret = __sched_setaffinity(p, task_user_cpus(p));
>>> +	WARN_ON_ONCE(ret);
>> ... however, now we end up going down into __sched_setaffinity() with
>> task_user_cpus(p) giving us the 'cpu_possible_mask'! This can lead to a mixture
>> of WARN_ON()s and incorrect affinity masks (for example, a newly exec'd task
>> ends up with the affinity mask of the online CPUs at the point of exec() and is
>> unable to run on anything onlined later).
>>
>> I've had a crack at fixing the code above to restore the old behaviour, and it
>> seems to work for my basic tests (still pending confirmation from others):
> This seems to cure things... cpuset is insane and insists on limiting
> things to online CPUs for no real reason. It is perfectly fine to have
> offline CPUs in the allowed mask (in fact, that's the default
> behaviour).
>
> With this on and "relax_compatible_cpus_allowed_ptr(current);" added to
> the exec() path things seem to work as expected for me.
>
> I'll clean up and post properly tomorrow (I think there's a simpler
> version hiding in there)...
>
> ---
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index a29c0b13706b..7a63416a46f3 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -498,19 +498,33 @@ static inline bool partition_is_populated(struct cpuset *cs,
>   *
>   * Call with callback_lock or cpuset_rwsem held.
>   */
> -static void guarantee_online_cpus(struct task_struct *tsk,
> -				  struct cpumask *pmask)
> +static void guarantee_cs_cpus(struct task_struct *tsk, struct cpumask *pmask, bool online)
>  {
> -	const struct cpumask *possible_mask = task_cpu_possible_mask(tsk);
> +	const struct cpumask *task_possible_mask = task_cpu_possible_mask(tsk);
> +	const struct cpumask *possible_mask = cpu_possible_mask;
> +	const struct cpumask *cs_cpus;
>  	struct cpuset *cs;
>
> -	if (WARN_ON(!cpumask_and(pmask, possible_mask, cpu_online_mask)))
> -		cpumask_copy(pmask, cpu_online_mask);
> +	if (online)
> +		possible_mask = cpu_online_mask;
> +
> +	if (WARN_ON(!cpumask_and(pmask, task_possible_mask, possible_mask)))
> +		cpumask_copy(pmask, possible_mask);
>
>  	rcu_read_lock();
>  	cs = task_cs(tsk);
>
> -	while (!cpumask_intersects(cs->effective_cpus, pmask)) {
> +	if (!parent_cs(cs)) {
> +		cs_cpus = cpu_possible_mask;
> +		if (online)
> +			cs_cpus = cpu_online_mask;
> +	} else {
> +		cs_cpus = cs->cpus_allowed;
> +		if (online)
> +			cs_cpus = cs->effective_cpus;

Using cpus_allowed directly may not be the right thing to do in the
case of cgroup v2. In v2, cpus_allowed starts out empty and
effective_cpus is inherited from the parent. So we may have to walk up
the cpuset hierarchy to arrive at the proper cpus_allowed to use. We may
need another helper to do that.

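To illustrate the kind of helper meant here, below is a rough user-space
sketch, not kernel code. The struct, the function name and the use of a
plain 64-bit word as a CPU mask are all made up for illustration; a real
version would operate on struct cpuset and struct cpumask under the
appropriate locking. It walks towards the root until it finds a cpuset
with a non-empty cpus_allowed, and falls back to the possible mask at the
root, mirroring the !parent_cs(cs) branch in the hunk above for the
non-online case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for struct cpuset (hypothetical); one bit per CPU. */
struct mock_cpuset {
	struct mock_cpuset *parent;	/* NULL for the root cpuset */
	uint64_t cpus_allowed;		/* 0 == empty, the cgroup v2 default */
};

/*
 * Hypothetical helper: return the cpus_allowed of the nearest cpuset,
 * starting at @cs and walking towards the root, whose mask is non-empty.
 * If the walk reaches the root, return @possible instead.
 */
uint64_t mock_nearest_cpus_allowed(const struct mock_cpuset *cs,
				   uint64_t possible)
{
	assert(cs != NULL);

	/* Skip cpusets that never had cpus_allowed configured. */
	while (cs->parent && cs->cpus_allowed == 0)
		cs = cs->parent;

	/* The root is not restricted by its own cpus_allowed. */
	if (!cs->parent)
		return possible;

	return cs->cpus_allowed;
}
```

With a root, a middle cpuset allowing CPUs 0-3 and an unconfigured leaf
below it, the leaf resolves to the middle cpuset's mask, while a leaf
directly under the root resolves to the possible mask.
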
Cheers,
Longman