Message-ID: <20140604152155.08e15821@nial.usersys.redhat.com>
Date:	Wed, 4 Jun 2014 15:21:55 +0200
From:	Igor Mammedov <imammedo@...hat.com>
To:	linux-kernel@...r.kernel.org
Cc:	tglx@...utronix.de, mingo@...hat.com, x86@...nel.org
Subject: Re: [PATCH v5 0/4] x86: fix hang when AP bringup is too slow

On Mon,  5 May 2014 22:49:47 +0200
Igor Mammedov <imammedo@...hat.com> wrote:

> changes since v4:
>  * merge "[PATCH v4 1/5] x86: fix list corruption on CPU hotplug"
>    and "[PATCH v4 2/5] x86: fix memory corruption in acpi_unmap_lsapic()"
>    together
>  * "x86: initialize secondary CPU only if master CPU will wait for it":
>     - add 10 seconds timeout description into commit message
>     - add smp_mb() after clearing cpu_initialized_mask
> 
> changes since v3:
>  * put simple bugfixes first
>  * move common part of syncing with master CPU in cpu_init()
>    for x32/64 variant into helper function
>  * cpu_init(): WARN_ON if cpu_initialized_mask is set
>  * fix panic on CPU unplug, caused by erroneous removal
>    of "pr->dev = dev;" in drivers/acpi/acpi_processor.c
> 
> --
> A hang is observed on virtual machines during CPU hotplug,
> especially in big guests with many CPUs. (It happens more
> often if the host is over-committed.)
> 
> The hang happens because the master CPU times out while waiting
> for the AP to boot and 'cancels' the CPU online operation, assuming
> the AP is not functional. The AP may nevertheless keep running
> and later cause various hangs or panics in the running kernel,
> which assumes that the AP is offline.
> 
> This is an alternative approach that, instead of canceling
> in-progress AP bringup (https://lkml.org/lkml/2014/3/6/257),
> removes the timeouts so that AP bringup is not affected by
> poor timing, and syncs the AP with the master CPU at early
> startup, making sure that the AP won't run wild if the master
> CPU doesn't expect it to come online.
> 
> The series also fixes 3 bugs found while testing the CPU bringup
> failure case.

Since the 3.16 merge window is open now: ping.

> --
> Below is a detailed description of the more frequently occurring hang:
> ---
> The master CPU may time out before cpu_callin_mask is set and cancel
> booting the CPU, but the CPU being onlined still continues to boot,
> sets cpu_active_mask (CPU_STARTING notifiers) and spins in
> check_tsc_sync_target() waiting for the master CPU to arrive. A
> following attempt to online another CPU hangs in stop_machine,
> initiated from here:
> smp_callin ->
>   smp_store_cpu_info ->
>     identify_secondary_cpu ->
>       mtrr_ap_init -> set_mtrr_from_inactive_cpu
> 
> stop_machine waits for completion of stop_work on all CPUs in
> cpu_active_mask, including the failed CPU that spins in
> check_tsc_sync_target().
> 
> Igor Mammedov (4):
>   x86: fix list/memory corruption on CPU hotplug
>   acpi_processor: do not mark present at boot but not onlined CPU as
>     onlined
>   x86: log error on secondary CPU wakeup failure at ERR level
>   x86: initialize secondary CPU only if master CPU will wait for it
> 
>  arch/x86/kernel/cpu/common.c  |   27 ++++++----
>  arch/x86/kernel/smpboot.c     |  104 +++++++++++++----------------------------
>  drivers/acpi/acpi_processor.c |    1 -
>  3 files changed, 48 insertions(+), 84 deletions(-)
> 
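
For readers of the archive, the early-startup sync described in the
cover letter can be illustrated roughly as follows. This is only a
single-threaded userspace sketch in C, not code from the series: the
masks "announced" and "released" and the three helper functions are
hypothetical names, and a poll count stands in for whatever wait the
master CPU really uses. The idea shown is that the AP does nothing
beyond announcing itself until the master CPU has explicitly released
it, so a master CPU that gave up never ends up with an AP running
behind its back.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical one-bit-per-CPU masks (not the kernel's own masks). */
static atomic_ulong announced;  /* set by an AP early in its startup  */
static atomic_ulong released;   /* set by the master CPU for that AP  */

/* AP side: tell the master CPU that this AP has started executing. */
static void ap_announce(unsigned int cpu)
{
    atomic_fetch_or(&announced, 1UL << cpu);
}

/* AP side: true once the master CPU has released this AP.
 * A real AP would spin on this before doing any further setup. */
static bool ap_may_proceed(unsigned int cpu)
{
    return (atomic_load(&released) & (1UL << cpu)) != 0;
}

/* Master side: poll up to max_polls times for the AP's announcement.
 * Only if the AP has shown up is it released; otherwise the AP stays
 * parked in ap_may_proceed() and touches no shared kernel state. */
static bool master_try_release(unsigned int cpu, unsigned int max_polls)
{
    for (unsigned int i = 0; i < max_polls; i++) {
        if (atomic_load(&announced) & (1UL << cpu)) {
            atomic_fetch_or(&released, 1UL << cpu);
            return true;
        }
    }
    return false;
}
```

In this model a late AP is harmless: if master_try_release() returns
false, ap_may_proceed() keeps returning false for that CPU, which is
the property the series aims for.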
