Message-ID: <jhja6vmxthb.mognet@arm.com>
Date: Thu, 12 Nov 2020 19:31:12 +0000
From: Valentin Schneider <valentin.schneider@....com>
To: Qian Cai <cai@...hat.com>
Cc: Peter Zijlstra <peterz@...radead.org>, tglx@...utronix.de,
mingo@...nel.org, linux-kernel@...r.kernel.org,
bigeasy@...utronix.de, qais.yousef@....com, swood@...hat.com,
juri.lelli@...hat.com, vincent.guittot@...aro.org,
dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, bristot@...hat.com, vincent.donnefort@....com,
tj@...nel.org, ouwen210@...mail.com
Subject: Re: [PATCH v4 10/19] sched: Fix migrate_disable() vs set_cpus_allowed_ptr()
On 12/11/20 18:01, Qian Cai wrote:
> On Thu, 2020-11-12 at 17:26 +0000, Valentin Schneider wrote:
>> On 12/11/20 16:38, Qian Cai wrote:
>> > Some syscall fuzzing from an unprivileged user started to trigger the
>> > trace below once this commit first appeared in linux-next today. Does it
>> > ring any bells?
>> >
>>
>> What's the .config? I'm interested in
>> CONFIG_PREEMPT
>> CONFIG_PREEMPT_RT
>> CONFIG_SMP
>
> https://cailca.coding.net/public/linux/mm/git/files/master/arm64.config
>
> # CONFIG_PREEMPT is not set
> CONFIG_SMP=y
>
So that's CONFIG_PREEMPT_NONE=y
> Also, I have just been able to reproduce this on powerpc.
>
[...]
>
> [12065.101987][ T1310] __switch_to (arch/arm64/kernel/process.c:580)
> [12065.106227][ T1310] __schedule (kernel/sched/core.c:4272 kernel/sched/core.c:5019)
> [12065.110505][ T1310] schedule (./arch/arm64/include/asm/current.h:19 (discriminator 1) ./arch/arm64/include/asm/preempt.h:53 (discriminator 1) kernel/sched/core.c:5099 (discriminator 1))
> [12065.114562][ T1310] schedule_timeout (kernel/time/timer.c:1848)
> [12065.119275][ T1310] wait_for_completion (kernel/sched/completion.c:85 kernel/sched/completion.c:106 kernel/sched/completion.c:117 kernel/sched/completion.c:138)
> [12065.124257][ T1310] affine_move_task (./include/linux/instrumented.h:101 ./include/asm-generic/atomic-instrumented.h:220 ./include/linux/refcount.h:272 ./include/linux/refcount.h:315 ./include/linux/refcount.h:333 kernel/sched/core.c:2263)
> [12065.129013][ T1310] __set_cpus_allowed_ptr (kernel/sched/core.c:2353)
> [12065.134248][ T1310] sched_setaffinity (kernel/sched/core.c:6460)
> [12065.139088][ T1310] __arm64_sys_sched_setaffinity (kernel/sched/core.c:6511 kernel/sched/core.c:6500 kernel/sched/core.c:6500)
> [12065.144972][ T1310] do_el0_svc (arch/arm64/kernel/syscall.c:36 arch/arm64/kernel/syscall.c:48 arch/arm64/kernel/syscall.c:159 arch/arm64/kernel/syscall.c:205)
> [12065.149165][ T1310] el0_sync_handler (arch/arm64/kernel/entry-common.c:236 arch/arm64/kernel/entry-common.c:254)
> [12065.153876][ T1310] el0_sync (arch/arm64/kernel/entry.S:741)
>
Thanks!

One thing I don't get: that trace shows refcount_dec_and_test()
(kernel/sched/core.c:2263) happening before the wait_for_completion(),
which isn't the case in the powerpc trace below.
> == powerpc ==
> [18060.020301][ T676] [c000200014227670] [c000000000a6d1e8] __func__.5350+0x1220e0/0x181338 unreliable
> [18060.020333][ T676] [c000200014227850] [c00000000001a278] __switch_to (arch/powerpc/kernel/process.c:1273)
> [18060.020351][ T676] [c0002000142278c0] [c0000000008f3e94] __schedule (kernel/sched/core.c:4269 kernel/sched/core.c:5019)
> [18060.020377][ T676] [c000200014227990] [c0000000008f4638] schedule (./include/asm-generic/preempt.h:59 (discriminator 1) kernel/sched/core.c:5099 (discriminator 1))
> [18060.020394][ T676] [c0002000142279c0] [c0000000008fbd34] schedule_timeout (kernel/time/timer.c:1847)
> [18060.020420][ T676] [c000200014227ac0] [c0000000008f6398] wait_for_completion (kernel/sched/completion.c:85 kernel/sched/completion.c:106 kernel/sched/completion.c:117 kernel/sched/completion.c:138)
> [18060.020455][ T676] [c000200014227b30] [c000000000100fd4] affine_move_task (kernel/sched/core.c:2261)
> [18060.020481][ T676] [c000200014227c90] [c000000000101444] __set_cpus_allowed_ptr (kernel/sched/core.c:2353)
> [18060.020507][ T676] [c000200014227d00] [c000000000106eac] sched_setaffinity (kernel/sched/core.c:6460)
> [18060.020533][ T676] [c000200014227d70] [c000000000107134] sys_sched_setaffinity (kernel/sched/core.c:6511 kernel/sched/core.c:6500)
> [18060.020559][ T676] [c000200014227dc0] [c00000000002a6d8] system_call_exception (arch/powerpc/kernel/syscall_64.c:111)
> [18060.020585][ T676] [c000200014227e20] [c00000000000d0a8] system_call_common (arch/powerpc/kernel/entry_64.S:302)
I take back what I said in that previous email; we could have gone through
the task_running() stop_one_cpu() and *then* hit the

  wait_for_completion(&pending->done);

and that is actually the only case that makes sense to me here: the
!task_running() path does the completion *before* waiting (that kernel has
no way to make a task Migration Disabled, so is_migration_disabled() is
always false there).
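For reference, the two endings being compared look roughly like this in
affine_move_task() (simplified excerpt from the patch, trimmed to the
relevant bits):

  if (task_running(rq, p) || p->state == TASK_WAKING) {
          task_rq_unlock(rq, p, rf);
          stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
  } else {
          if (!is_migration_disabled(p)) {
                  if (task_on_rq_queued(p))
                          rq = move_queued_task(rq, rf, p, dest_cpu);
                  p->migration_pending = NULL;
                  complete = true;
          }
          task_rq_unlock(rq, p, rf);

          if (complete)
                  complete_all(&pending->done); /* completes before waiting */
  }

  wait_for_completion(&pending->done);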
I think what is happening here is:

  affine_move_task()
      // task_running() case
      stop_one_cpu()
      wait_for_completion(&pending->done);
and this is !PREEMPT, so the stopper can very well hit:

  migration_cpu_stop()
      // Task moved between unlocks and scheduling the stopper
      task_rq(p) != rq &&
      // task_running() case
      dest_cpu >= 0

  => no complete_all(), ever :(
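Spelled out against the branch structure of migration_cpu_stop() (heavily
trimmed sketch, just to show where the completion is missed):

  if (task_rq(p) == rq) {
          /* migrate @p and/or mark @pending complete */
  } else if (dest_cpu < 0) {
          /* the migrate_enable() rq-mismatch case */
  }
  /*
   * task_rq(p) != rq && dest_cpu >= 0 matches neither branch, so we
   * fall through with complete == false...
   */
  out:
          task_rq_unlock(rq, p, &rf);

          if (complete)
                  complete_all(&pending->done); /* ...and never get here */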
This is an annoying case, because we didn't have to worry about it before:
an rq mismatch meant the task was fine, since ->cpus_mask had already been
updated by then.
With migrate_disable(), we have to chase after the bloody task because
we have to preempt it to get a stable is_migration_disabled() reading. It
could have been Migration Disabled, got some pending installed, got out of
Migration Disabled, moved around, gone Migration Disabled again and got
some more pending before we get to run the stopper :(
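The reading is only stable from the stopper because ->migration_disabled
is only ever written by the task itself, protected by nothing stronger
than preempt_disable() - roughly, from the earlier patches in this series
(trimmed):

  void migrate_disable(void)
  {
          struct task_struct *p = current;

          /* Nested migrate_disable(); just bump the count */
          if (p->migration_disabled) {
                  p->migration_disabled++;
                  return;
          }

          preempt_disable();
          this_rq()->nr_pinned++;
          p->migration_disabled = 1;
          preempt_enable();
  }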
a) Do you also get this on CONFIG_PREEMPT=y?
b) Could you try the below?
---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 02076e6d3792..fad0a8e62aca 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1923,7 +1923,7 @@ static int migration_cpu_stop(void *data)
                else
                        p->wake_cpu = dest_cpu;

-       } else if (dest_cpu < 0) {
+       } else if (dest_cpu < 0 || pending) {
                /*
                 * This happens when we get migrated between migrate_enable()'s
                 * preempt_enable() and scheduling the stopper task. At that
@@ -1933,6 +1933,17 @@ static int migration_cpu_stop(void *data)
                 * more likely.
                 */

+               /*
+                * The task moved before the stopper got to run. We're holding
+                * ->pi_lock, so the allowed mask is stable - if it got
+                * somewhere allowed, we're done.
+                */
+               if (pending && cpumask_test_cpu(task_cpu(p), p->cpus_ptr)) {
+                       p->migration_pending = NULL;
+                       complete = true;
+                       goto out;
+               }
+
                /*
                 * When this was migrate_enable() but we no longer have an
                 * @pending, a concurrent SCA 'fixed' things and we should be
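FWIW, if anyone wants to poke at this without the fuzzer, the load is
essentially one always-running thread having its affinity thrashed from
outside. A hypothetical sketch of that shape (mine, not Qian's actual
reproducer):

  /* affinity-thrash.c: spin one thread so affine_move_task() takes the
   * task_running() path, while main() thrashes its affinity. */
  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>
  #include <stdlib.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static volatile pid_t victim_tid;

  static void *spin(void *arg)
  {
          victim_tid = syscall(SYS_gettid);
          for (;;)
                  ;       /* stay running on some CPU */
          return NULL;
  }

  int main(void)
  {
          long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
          pthread_t tid;
          cpu_set_t set;

          pthread_create(&tid, NULL, spin, NULL);
          while (!victim_tid)
                  ;       /* wait for the spinner to publish its tid */

          for (;;) {
                  CPU_ZERO(&set);
                  CPU_SET(rand() % ncpus, &set);
                  /* each call can end up in wait_for_completion();
                   * a hang here is the bug */
                  sched_setaffinity(victim_tid, sizeof(set), &set);
          }
  }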