linux-kernel - Re: [PATCH] workqueue: Ensure that cpumask set for pools created after boot

Message-Id: <736f7f6e-8d47-eaea-acc6-8ed75014a287@linux.vnet.ibm.com>
Date:   Mon, 12 Jun 2017 09:47:31 -0500
From:   Michael Bringmann <mwb@...ux.vnet.ibm.com>
To:     Tejun Heo <tj@...nel.org>
Cc:     Lai Jiangshan <jiangshanlai@...il.com>,
        linux-kernel@...r.kernel.org,
        Nathan Fontenot <nfont@...ux.vnet.ibm.com>
Subject: Re: [PATCH] workqueue: Ensure that cpumask set for pools created
 after boot

Hello:

On 06/06/2017 01:09 PM, Tejun Heo wrote:
> Hello,
> 
> On Tue, Jun 06, 2017 at 11:18:36AM -0500, Michael Bringmann wrote:
>> On 05/25/2017 10:30 AM, Michael Bringmann wrote:
>>> I will try that patch shortly.  I also updated my patch to be conditional
>>> on whether the pool's cpumask attribute was empty.  You should have received
>>> V2 of that patch by now.
>>
>> Let's try this again.
>>
>> The hotplug problem goes away with the changes that you provided earlier, and
> 
> So, that means we're ending up in situations where NUMA online is a
> proper superset of NUMA possible.
> 
>> shown in the patch below.  I kept this change to 'get_unbound_pool' as a
>> just-in-case measure to explain the crash in the event that it occurs again:
>>
>>     if (!cpumask_weight(pool->attrs->cpumask))
>>         cpumask_copy(pool->attrs->cpumask, cpumask_of(smp_processor_id()));
>>
>> I could also insert 
>>
>>     BUG_ON(!cpumask_weight(pool->attrs->cpumask));
>>
>> at that place, but I really prefer not to crash the system if there is a workaround.
> 
> I'm not sure because it doesn't make any logical sense and it's not
> right in terms of correctness.  The above would be able to enable CPUs
> which are explicitly excluded from a workqueue.  The only fallback
> which makes sense is falling back to the default pwq.

What would that look like?  Are you sure that would always be valid in a
system that is hot-adding and hot-removing CPUs?
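
To make the question concrete, here is a minimal userspace sketch of the
fallback described above. It is a model and not the workqueue code: mask_t,
calc_node_mask() and its parameters are hypothetical names, and CPUs are
limited to 64 so that a mask fits in one integer. When the CPUs a workqueue
wants on a node do not intersect the node's possible mask, the sketch hands
back the workqueue's own mask and reports that the default pwq should be
used, instead of inventing a CPU that the workqueue excluded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model: one bit per CPU, at most 64 CPUs. */
typedef uint64_t mask_t;

/*
 * Decide which CPU mask a per-node pool would use.
 *
 * Returns true when a node-specific mask was stored in *out, and false
 * when the caller should fall back to the default pwq (in which case
 * *out is simply the mask the workqueue asked for).
 */
static bool calc_node_mask(mask_t wanted, mask_t node_online,
                           mask_t node_possible, int cpu_going_down,
                           mask_t *out)
{
    mask_t m;

    assert(out != NULL);
    assert(cpu_going_down < 64);

    /* Does the node have any online CPUs that the workqueue wants? */
    m = wanted & node_online;
    if (cpu_going_down >= 0)
        m &= ~((mask_t)1 << cpu_going_down);
    if (m == 0)
        goto use_default;

    /* Restrict to the node's possible CPUs. */
    m = wanted & node_possible;
    if (m == 0)
        /* Online is not within possible: never return an empty mask. */
        goto use_default;

    *out = m;
    return m != wanted;

use_default:
    *out = wanted;
    return false;
}
```

In this model the empty-mask case can only arise when a node's online CPUs
are not all contained in its possible mask, which is the situation under
discussion in this thread.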

>>> Can you please post the messages with the debug patch from the prev
>>> thread?  In fact, let's please continue on that thread.  I'm having a
>>> hard time following what's going wrong with the code.
>>
>> Are these the failure logs that you requested?
>>
>>
>> Red Hat Enterprise Linux Server 7.3 (Maipo)
>> Kernel 4.12.0-rc1.wi91275_debug_03.ppc64le+ on an ppc64le
>>
>> ltcalpine2-lp20 login: root
>> Password: 
>> Last login: Wed May 24 18:45:40 from oc1554177480.austin.ibm.com
>> [root@...alpine2-lp20 ~]# numactl -H
>> available: 2 nodes (0,6)
>> node 0 cpus:
>> node 0 size: 0 MB
>> node 0 free: 0 MB
>> node 6 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
>> node 6 size: 19858 MB
>> node 6 free: 16920 MB
>> node distances:
>> node   0   6 
>>   0:  10  40 
>>   6:  40  10 
>> [root@...alpine2-lp20 ~]# numactl -H
>> available: 2 nodes (0,6)
>> node 0 cpus:
>> node 0 size: 0 MB
>> node 0 free: 0 MB
>> node 6 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
>> node 6 size: 19858 MB
>> node 6 free: 16362 MB
>> node distances:
>> node   0   6 
>>   0:  10  40 
>>   6:  40  10 
>> [root@...alpine2-lp20 ~]# [  321.310943] workqueue:get_unbound_pool has empty cpumask for pool attrs
>> [  321.310961] ------------[ cut here ]------------
>> [  321.310997] WARNING: CPU: 184 PID: 13201 at kernel/workqueue.c:3375 alloc_unbound_pwq+0x5c0/0x5e0
>> [  321.311005] Modules linked in: rpadlpar_io rpaphp dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag sg pseries_rng ghash_generic gf128mul xts vmx_crypto binfmt_misc ip_tables xfs libcrc32c sd_mod ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod
>> [  321.311097] CPU: 184 PID: 13201 Comm: cpuhp/184 Not tainted 4.12.0-rc1.wi91275_debug_03.ppc64le+ #8
>> [  321.311106] task: c000000408961080 task.stack: c000000406394000
>> [  321.311113] NIP: c000000000116c80 LR: c000000000116c7c CTR: 0000000000000000
>> [  321.311121] REGS: c0000004063977b0 TRAP: 0700   Not tainted  (4.12.0-rc1.wi91275_debug_03.ppc64le+)
>> [  321.311128] MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE>
>> [  321.311150]   CR: 28000082  XER: 00000000
>> [  321.311159] CFAR: c000000000a2dc80 SOFTE: 1 
>> [  321.311159] GPR00: c000000000116c7c c000000406397a30 c0000000013ae900 000000000000003b 
>> [  321.311159] GPR04: c000000408961a38 0000000000000006 00000000a49e41e5 ffffffffa4a5a483 
>> [  321.311159] GPR08: 00000000000062cc 0000000000000000 0000000000000000 c000000408961a38 
>> [  321.311159] GPR12: 0000000000000000 c00000000fb38c00 c00000000011e858 c00000040a902ac0 
>> [  321.311159] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
>> [  321.311159] GPR20: c000000406394000 0000000000000002 c000000406394000 0000000000000000 
>> [  321.311159] GPR24: c000000405075400 c000000404fc0000 0000000000000110 c0000000015a4c88 
>> [  321.311159] GPR28: 0000000000000000 c0000004fe256000 c0000004fe256008 c0000004fe052800 
>> [  321.311290] NIP [c000000000116c80] alloc_unbound_pwq+0x5c0/0x5e0
>> [  321.311298] LR [c000000000116c7c] alloc_unbound_pwq+0x5bc/0x5e0
>> [  321.311305] Call Trace:
>> [  321.311310] [c000000406397a30] [c000000000116c7c] alloc_unbound_pwq+0x5bc/0x5e0 (unreliable)
>> [  321.311323] [c000000406397ad0] [c000000000116e30] wq_update_unbound_numa+0x190/0x270
>> [  321.311334] [c000000406397b60] [c000000000118eb0] workqueue_offline_cpu+0xe0/0x130
>> [  321.311345] [c000000406397bf0] [c0000000000e9f20] cpuhp_invoke_callback+0x240/0xcd0
>> [  321.311355] [c000000406397cb0] [c0000000000eab28] cpuhp_down_callbacks+0x78/0xf0
>> [  321.311365] [c000000406397d00] [c0000000000eae6c] cpuhp_thread_fun+0x18c/0x1a0
>> [  321.311376] [c000000406397d30] [c0000000001251cc] smpboot_thread_fn+0x2fc/0x3b0
>> [  321.311386] [c000000406397dc0] [c00000000011e9c0] kthread+0x170/0x1b0
>> [  321.311397] [c000000406397e30] [c00000000000b4f4] ret_from_kernel_thread+0x5c/0x68
>> [  321.311406] Instruction dump:
>> [  321.311413] 3d42fff0 892ac565 2f890000 40fefd98 39200001 3c62ff89 3c82ff6c 3863d590 
>> [  321.311437] 38847cb0 992ac565 48916fc9 60000000 <0fe00000> 4bfffd70 60000000 60420000 
> 
> The only way offlining can lead to this failure is when wq numa
> possible cpu mask is a proper subset of the matching online mask.  Can
> you please print out the numa online cpu and wq_numa_possible_cpumask
> masks and verify that online stays within the possible for each node?
> If not, the ppc arch init code needs to be updated so that cpu <->
> node binding is established for all possible cpus on boot.  Note that
> this isn't a requirement coming solely from wq.  All node affine (thus
> percpu) allocations depend on that.

The ppc arch init code already records all nodes used by the CPUs visible in
the device-tree at boot time into the possible and online node bindings.  The
problem here occurs when we hot-add new CPUs to the powerpc system -- they may
require nodes that are mentioned by the VPHN hcall, but which were not used
at boot time.

I will run a test that dumps these masks later this week to try to provide
the information that you are interested in.
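
The check itself is simple to state. As a rough userspace sketch (again a
model and not kernel code; mask_t, mask_subset() and first_bad_node() are
hypothetical names, with one 64-bit mask per node), the test would look for
the first node whose online CPUs are not all within its possible CPUs:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model: one bit per CPU, at most 64 CPUs. */
typedef uint64_t mask_t;

/* True when every CPU set in a is also set in b. */
static bool mask_subset(mask_t a, mask_t b)
{
    return (a & ~b) == 0;
}

/*
 * Return the index of the first node whose online mask is not contained
 * in its possible mask, or -1 when every node passes the check.
 */
static int first_bad_node(const mask_t *online, const mask_t *possible,
                          size_t nr_nodes)
{
    assert(online != NULL && possible != NULL);

    for (size_t n = 0; n < nr_nodes; n++) {
        if (!mask_subset(online[n], possible[n]))
            return (int)n;
    }
    return -1;
}
```

A result other than -1 would correspond to the case described above, where
online does not stay within possible for some node.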

Right now we are having a discussion on another thread as to how to properly
set the possible node mask at boot given that there is no mechanism to hot-add
nodes to the system.  The latest idea appears to be adding another property
or two to define the maximum number of nodes that should be added to the
possible / online node masks to allow for dynamic growth after boot.

> 
> Thanks.
> 

Thanks.

-- 
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:       (512) 466-0650
mwb@...ux.vnet.ibm.com