lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <878tkjn7uo.fsf@linux.vnet.ibm.com>
Date:   Thu, 22 Jun 2017 16:24:47 -0300
From:   Thiago Jung Bauermann <bauerman@...ux.vnet.ibm.com>
To:     Michael Ellerman <mpe@...erman.id.au>
Cc:     linuxppc-dev@...ts.ozlabs.org, linux-kernel@...r.kernel.org,
        John Allen <jallen@...ux.vnet.ibm.com>,
        Michael Bringmann <mwb@...ux.vnet.ibm.com>,
        Nathan Fontenot <nfont@...ux.vnet.ibm.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [PATCH] powerpc: Only obtain cpu_hotplug_lock if called by rtasd


Michael Ellerman <mpe@...erman.id.au> writes:

> Thiago Jung Bauermann <bauerman@...ux.vnet.ibm.com> writes:
>
>> Michael Ellerman <mpe@...erman.id.au> writes:
>>> Thiago Jung Bauermann <bauerman@...ux.vnet.ibm.com> writes:
>>>
>>>> Calling arch_update_cpu_topology from a CPU hotplug state machine callback
>>>> hits a deadlock because the function tries to get a read lock on
>>>> cpu_hotplug_lock while the state machine still holds a write lock on it.
>>>>
>>>> Since all callers of arch_update_cpu_topology except rtasd already hold
>>>> cpu_hotplug_lock, this patch changes the function to use
>>>> stop_machine_cpuslocked and creates a separate function for rtasd which
>>>> still tries to obtain the lock.
>>>>
>>>> Michael Bringmann investigated the bug and provided a detailed analysis
>>>> of the deadlock on this previous RFC for an alternate solution:
>>>>
>>>> https://patchwork.ozlabs.org/patch/771293/
>>>
>>> Do we know when this broke? Or has it never worked?
>>
>> It's been broken since at least v4.4, I think. I don't know about
>> earlier versions.
>
> OK.
>
> Just to be clear, this is happening on a 4.12-rcX system with no other
> patches?
>
> The code in arch_update_cpu_topology() has used stop_machine() since 
> 30c05350c39d ("powerpc/pseries: Use stop machine to update cpu maps")
> which went into v3.10, about 4 years ago.
>
> Prior to that it used get/put_online_cpus(), since 9eff1a38407c
> ("powerpc/pseries: Poll VPA for topology changes and update NUMA maps"),
> which was 2.6.38 in 2010.
>
> I wouldn't rule out the possibility it's been broken for 7 years, but I
> wonder if something else has changed to cause it to break.
>
> We really need to work it out before we backport anything.

Michael Bringmann provided this information:

    We need at least one patch to show the issue in the latest 4.12
    codebase:

    [PATCH V6 2/2] powerpc/numa: Update CPU topology when VPHN enabled
    https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-June/159341.html

    Reason: Prior to this patch we were not exercising the PowerPC VPHN
    hcall nor the associated path through the kernel/sched code that
    encounters the problem. All CPUs, whether present at boot or
    hot-added, are added to node 0 without this patch. This
    representation of the topology is incorrect in many/most cases.

    cpu_hotplug_begin() + get_online_cpus() have potentially been broken
    since the implementation of multithreading for _cpu_up() and
    _cpu_down().

    Reason: 'cpu_hotplug.active_writer' check in get_online_cpus() is
    dependent upon the nested routines that call get_online_cpus() to
    execute in the same thread as the one that invokes
    'cpu_hotplug_begin'.

    PowerPC's version of arch_update_cpu_topology() has used
    stop_machine() or get_online_cpus() for years, since at least 2011.

    Practically speaking, until the recent patch, in the 4.12 codebase
    PowerPC CPUs are being added only to nodes that were online at boot,
    and the topology did not change enough to trigger the paths through
    'stop_machine'.

    I can't say for certain about the earlier code bases where
    'arch_update_cpu_topology' used 'get_online_cpus'/'put_online_cpus'
    directly.

>>> Should it go to stable? (can't in its current form AFAICS)
>>
>> It's not hard to backport both this patch and commit fe5595c07400
>> ("stop_machine: Provide stop_machine_cpuslocked()") from branch
>> smp/hotplug in tip.git for stable.
>
> Yeah but it's not really my business backporting that unfortunately.

Sorry, I wasn't clear. I was offering to provide backported patches for
the relevant stable branches.

Though that will only be necessary if we also backport the topology
fixes as well.

-- 
Thiago Jung Bauermann
IBM Linux Technology Center

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ