lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2ddaf5e732c48f74a3758484869c81195adbc47a.camel@linux.intel.com>
Date:   Wed, 05 Jan 2022 11:19:04 -0800
From:   Tim Chen <tim.c.chen@...ux.intel.com>
To:     Dongli Zhang <dongli.zhang@...cle.com>, x86@...nel.org
Cc:     tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
        dave.hansen@...ux.intel.com, hpa@...or.com, peterz@...radead.org,
        rafael.j.wysocki@...el.com, vkuznets@...hat.com,
        alison.schofield@...el.com, boris.ostrovsky@...cle.com,
        joe.jin@...cle.com, longpeng2@...wei.com, bigeasy@...utronix.de,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH RESEND 1/1] x86/smpboot: check cpu_initialized_mask
 first after returning from schedule()

On Thu, 2021-12-23 at 13:03 -0800, Dongli Zhang wrote:
> To online a new CPU, the master CPU signals the secondary and waits
> for at
> most 10 seconds until cpu_initialized_mask is set by the secondary
> CPU.
> There is a case that the master CPU fails the online when it takes
> 10+
> seconds for schedule() to return (although the cpu_initialized_mask
> is
> already set by the secondary CPU).
> 
> 1. The master CPU signals the secondary CPU in do_boot_cpu().
> 
> 2. As the cpu_initialized_mask is still not set, the master CPU
> reschedules
> via schedule().
> 
> 3. Suppose it takes 10+ seconds until schedule() returns due to
> performance
> issue. The secondary CPU sets the cpu_initialized_mask during those
> 10+
> seconds.

The patch seems reasonable. But do you know what other task got run
and caused such long delay (>10 sec), preventing switch
back to in the master CPU?  It seems like there could be some problem
causing the long delay. It is better if we understand the root cause
of that.

Tim

> 
> 4. Once the schedule() at the master CPU returns, although the
> cpu_initialized_mask is set, the time_before(jiffies, timeout) fails.
> As a
> result, the master CPU regards this as failure.
> 
> [   51.983296] smpboot: do_boot_cpu failed(-1) to wakeup CPU#4
> 
> 5. Since the secondary CPU is stuck at state CPU_UP_PREPARE, any
> further
> online operation will fail by cpu_check_up_prepare(), unless the
> below
> patch set is applied.
> 
> https://lore.kernel.org/all/20211206152034.2150770-1-bigeasy@linutronix.de/
> 
> This issue is resolved by always first checking whether the secondary
> CPU
> has set cpu_initialized_mask.
> 
> Cc: Longpeng(Mike) <longpeng2@...wei.com>
> Cc: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
> Cc: Joe Jin <joe.jin@...cle.com>
> Signed-off-by: Dongli Zhang <dongli.zhang@...cle.com>
> ---
> To resend by Cc linux-kernel@...r.kernel.org as well.
> 
>  arch/x86/kernel/smpboot.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 617012f4619f..faad4fcf67eb 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1145,7 +1145,7 @@ static int do_boot_cpu(int apicid, int cpu,
> struct task_struct *idle,
>  		 */
>  		boot_error = -1;
>  		timeout = jiffies + 10*HZ;
> -		while (time_before(jiffies, timeout)) {
> +		while (true) {
>  			if (cpumask_test_cpu(cpu,
> cpu_initialized_mask)) {
>  				/*
>  				 * Tell AP to proceed with
> initialization
> @@ -1154,6 +1154,10 @@ static int do_boot_cpu(int apicid, int cpu,
> struct task_struct *idle,
>  				boot_error = 0;
>  				break;
>  			}
> +
> +			if (time_after_eq(jiffies, timeout))
> +				break;
> +
>  			schedule();
>  		}
>  	}

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ