netdev - Re: [EXT] [PATCH] bnx2x: Prevent load reordering in tx completion processing

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <247dddf2-9a46-cb02-f7a6-228274ac81d6@linux.vnet.ibm.com>
Date:   Thu, 18 Jul 2019 13:26:37 -0500
From:   Brian King <brking@...ux.vnet.ibm.com>
To:     Manish Chopra <manishc@...vell.com>,
        GR-everest-linux-l2 <GR-everest-linux-l2@...vell.com>
Cc:     Sudarsana Reddy Kalluru <skalluru@...vell.com>,
        Ariel Elior <aelior@...vell.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [EXT] [PATCH] bnx2x: Prevent load reordering in tx completion
 processing

On 7/18/19 5:12 AM, Manish Chopra wrote:
>> -----Original Message-----
>> From: Brian King <brking@...ux.vnet.ibm.com>
>> Sent: Tuesday, July 16, 2019 3:12 AM
>> To: GR-everest-linux-l2 <GR-everest-linux-l2@...vell.com>
>> Cc: Sudarsana Reddy Kalluru <skalluru@...vell.com>; Ariel Elior
>> <aelior@...vell.com>; netdev@...r.kernel.org; Brian King
>> <brking@...ux.vnet.ibm.com>
>> Subject: [EXT] [PATCH] bnx2x: Prevent load reordering in tx completion
>> processing
>>
>> External Email
>>
>> ----------------------------------------------------------------------
>> This patch fixes an issue seen on Power systems with bnx2x which results in
>> the skb is NULL WARN_ON in bnx2x_free_tx_pkt firing due to the skb pointer
>> getting loaded in bnx2x_free_tx_pkt prior to the hw_cons load in
>> bnx2x_tx_int. Adding a read memory barrier resolves the issue.
>>
>> Signed-off-by: Brian King <brking@...ux.vnet.ibm.com>
>> ---
>>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
>> b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
>> index 656ed80..e2be5a6 100644
>> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
>> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
>> @@ -285,6 +285,9 @@ int bnx2x_tx_int(struct bnx2x *bp, struct
>> bnx2x_fp_txdata *txdata)
>>  	hw_cons = le16_to_cpu(*txdata->tx_cons_sb);
>>  	sw_cons = txdata->tx_pkt_cons;
>>
>> +	/* Ensure subsequent loads occur after hw_cons */
>> +	smp_rmb();
>> +
>>  	while (sw_cons != hw_cons) {
>>  		u16 pkt_cons;
>>
>> --
>> 1.8.3.1
> 
> Could you please explain a bit in detail what could have caused skb to NULL exactly ?
> Curious that if skb would have been NULL for some reason it did not cause NULL pointer dereference in bnx2x_free_tx_pkt() on below call -
> 
> prefetch(&skb->end);
> 
> Which is prior to the said WARN_ON(!skb) in bnx2x_free_tx_pkt().

Right. In this case, that would end up passing an invalid address to prefetch. On a
Power processor, that turns into a dcbt instruction (data cache block touch), which
is a hint to the process that the subsequent code may access that data cache block.
Passing an invalid address to dcbt causes no harm. I just built a userspace program
to validate that doing something very similar to what is happening here, and the dcbt
executed with no errors, no segfault to the userspace process.

This is the scenario I think is occurring. 

CPU[0]
bnx2x_start_xmit
[1] tx_buf->skb = skb; /* store skb pointer */
[2] ...
[3] wmb(); 
[4] DOORBELL_RELAXED

CPU[1]
bnx2x_tx_int
[5]  hw_cons = le16_to_cpu(*txdata->tx_cons_sb);
[6]  sw_cons = txdata->tx_pkt_cons;
[7]  while (sw_cons != hw_cons) {
[8]  pkt_cons = TX_BD(sw_cons);
[9]  bnx2x_free_tx_pkt
[10]  tx_buf = &txdata->tx_buf_ring[pkt_cons];
[11]  skb = tx_buf->skb;
[12]  prefetch(&skb->end);
[13]  ...
[14]  WARN_ON(!skb);

On CPU0 we are in the process of sending a TX buffer to the adapter. We have a wmb at [3]
to ensure that all stores to cacheable storage are coherent with respect to the
non-cacheable store at [4] to tell the adapter about the new TX buffer.

On CPU1, we have a potential race condition if the processor is aggressively reordering
loads. If we find ourselves in bnx2x_tx_int, while still in bn2x_start_xmit on CPU0,
its possible the processor could begin speculatively executing well into [11]. Since there
is no read barrier of any sort to tell the processor that the load of the skb pointer
cannot happen until the load of hw_cons has occurred, we could end up speculatively
loading the skb pointer before the write in [1] has completed with respect to CPU1. 

Adding the smp_rmb between 6 and 7 ensures that the load at 5 occurs prior to the
load at 11.

This was reproducing consistently on a 16 socket Power 9 system built of four, four socket
nodes, with 10 bnx2x adapters and 26TB of memory. The NUMA effects can get to be exaggerated
a bit on these systems. With this patch, the system runs without issues. We've now been able to
run the workload multiple times with no issues.

Thanks,

Brian

-- 
Brian King
Power Linux I/O
IBM Linux Technology Center