netdev - Re: [PATCH 3/3] net: macb: add missing barriers when reading buffers

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ff275d8f-6f13-7b78-1136-b4a5a18046ba@microchip.com>
Date:   Mon, 10 Dec 2018 10:34:31 +0000
From:   <Claudiu.Beznea@...rochip.com>
To:     <anssi.hannula@...wise.fi>
CC:     <Nicolas.Ferre@...rochip.com>, <davem@...emloft.net>,
        <netdev@...r.kernel.org>
Subject: Re: [PATCH 3/3] net: macb: add missing barriers when reading buffers



On 07.12.2018 14:00, Anssi Hannula wrote:
> On 6.12.2018 16:14, Claudiu.Beznea@...rochip.com wrote:
>> Hi Anssi,
> 
> Hi!
> 
>> On 05.12.2018 16:00, Anssi Hannula wrote:
>>> On 5.12.2018 14:37, Claudiu.Beznea@...rochip.com wrote:
>>>> On 30.11.2018 20:21, Anssi Hannula wrote:
>>>>> When reading buffer descriptors on RX or on TX completion, an
>>>>> RX_USED/TX_USED bit is checked first to ensure that the descriptor has
>>>>> been populated. However, there are no memory barriers to ensure that the
>>>>> data protected by the RX_USED/TX_USED bit is up-to-date with respect to
>>>>> that bit.
>>>>>
>>>>> Fix that by adding DMA read memory barriers on those paths.
>>>>>
>>>>> I did not observe any actual issues caused by these being missing,
>>>>> though.
>>>>>
>>>>> Tested on a ZynqMP based system.
>>>>>
>>>>> Signed-off-by: Anssi Hannula <anssi.hannula@...wise.fi>
>>>>> Fixes: 89e5785fc8a6 ("[PATCH] Atmel MACB ethernet driver")
>>>>> Cc: Nicolas Ferre <nicolas.ferre@...rochip.com>
>>>>> ---
>>>>>  drivers/net/ethernet/cadence/macb_main.c | 20 ++++++++++++++++----
>>>>>  1 file changed, 16 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
>>>>> index 430b7a0f5436..c93baa8621d5 100644
>>>>> --- a/drivers/net/ethernet/cadence/macb_main.c
>>>>> +++ b/drivers/net/ethernet/cadence/macb_main.c
>>>>> @@ -861,6 +861,11 @@ static void macb_tx_interrupt(struct macb_queue *queue)
>>>>>  
>>>>>  			/* First, update TX stats if needed */
>>>>>  			if (skb) {
>>>>> +				/* Ensure all of desc is at least as up-to-date
>>>>> +				 * as ctrl (TX_USED bit)
>>>>> +				 */
>>>>> +				dma_rmb();
>>>>> +
>>>> Is this necessary? Wouldn't previous rmb() take care of this? At this time
>>>> data specific to this descriptor was completed. The TX descriptors for next
>>>> data to be send is updated under a locked spinlock.
>>> The previous rmb() is before the TX_USED check, so my understanding is
>>> that the following could happen in theory:
>> We are using this IP on and ARM architecture, so, with regards to rmb(), I
>> understand from [1] that dsb completes when:
>> "All explicit memory accesses before this instruction complete.
>> All Cache, Branch predictor and TLB maintenance operations before this
>> instruction complete."
>>
>>> 1. rmb().
>> According to [1] this should end after all previous instructions (loads,
>> stores) ends.
>>
>>> 2. Reads are reordered so that TX timestamp is read first - no barriers
>>> are crossed.
>> But, as per [1], no onward instruction will be reached until all
>> instruction prior to dsb ends, so, after rmb() all descriptor's members
>> should be updated, right?
> 
> The descriptor that triggered the TX interrupt should be visible now, yes.
> However, the controller may be writing to any other descriptors at the
> same time as the loop is processing through them, as there are multiple
> TX buffers.

Yes, but there is the "rmb()" after "desc = macb_tx_desc(queue, tail);" at
the beginning of loop intended for that... In the 2nd loop you are
operating with the same descriptor which was read in the first loop.

> 
>>> 3. HW writes timestamp and sets TX_USED (or they become visible).
>> I expect hardware to set TX_USED and timestamp before raising TX complete
>> interrupt. If so, there should be no on-flight updates of this descriptor,
>> right? Hardware raised a TX_USED bit read interrupt when it reads a
>> descriptor like this and hangs TX.
> 
> For the first iteration of the loop, that is correct - there should be
> no in-flight writes from controller as it already raised the interrupt.
> However, the following iterations of the loop process descriptors that
> may or may not have the interrupt raised yet, and therefore may still
> have in-flight writes.

I expect the hardware to give up ownership of the descriptor just at the
end of TX operation, so that the word containing TX_USED to be the last one
updated. If you reads the descriptor and TX_USED is not there the first
loop breaks, queue->tx_tail is updated with the last processed descriptor
in the queue (see "queue->tx_tail = tail;") then the next interrupt will
start processing with this last one descriptor.

> 
>>> 4. Code checks TX_USED.
>>> 5. Code operates on timestamp that is actually garbage.
>>>
>>> I'm not 100% sure that there isn't some lighter/cleaner way to do this
>>> than dma_rmb(), though.
>> If you still think this scenario could happen why not calling a dsb in
>> gem_ptp_do_timestamp(). I feel like that is a proper place to call it.
> 
> OK, I will move it there. Unless we arrive at a conclusion that it is
> unnecessary altogether, of course :)
> 
>> Moreover, there is bit 32 of desc->ctrl which tells you if a valid
>> timestamp was placed in the descriptor. But, again, I expect the timestamp
>> and TX_USED to be set by hardware before raising TX complete interrupt.
> 
> Yes, but since my concern is that without barriers in between,
> desc->ctrl might be read after ts_1/ts_2, so that bit might be seen as
> set even though ts_1 is not yet an actual timestamp. And per above, all
> this may occur before the TX complete interrupt is raised for the
> descriptor in question.

If so, why not placing the barrier when reading timestamps? From my point
of view the place of "dma_rmb()" in this patch doesn't guarantees that
ts1/ts2 were read after the execution of barrier (correct me if I'm wrong).

> 
> I agree that this TX case seems somewhat unlikely to be triggered in
> real life at least at present, though, as the ts_1/ts_2 read is so far
> after the desc->ctrl read, and behind function calls, so unlikely to be
> reordered before desc->ctrl read.
> But the similar RX cases below seem more problematic as the racy loads
> in question are right after each other in code.
> 
>> [1]
>> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHGHHIE.html
>>
>>>>>  				if (gem_ptp_do_txstamp(queue, skb, desc) == 0) {
>>>>>  					/* skb now belongs to timestamp buffer
>>>>>  					 * and will be removed later
>>>>> @@ -1000,12 +1005,16 @@ static int gem_rx(struct macb_queue *queue, int budget)
>>>>>  		rmb();
>>>>>  
>>>>>  		rxused = (desc->addr & MACB_BIT(RX_USED)) ? true : false;
>>>>> -		addr = macb_get_addr(bp, desc);
>>>>> -		ctrl = desc->ctrl;
>>>>>  
>>>>>  		if (!rxused)
>>>>>  			break;
>>>>>  
>>>>> +		/* Ensure other data is at least as up-to-date as rxused */
>>>>> +		dma_rmb();
>>>> Same here, wouldn't previous rmb() should do this job?
>>> The scenario I'm concerned about here (and in the last hunk) is:
>>>
>>> 1. rmb().
>>> 2. ctrl is read (i.e. ctrl read reordered before addr read).
>> Same here with regards to [1]. All prior loads, stores should be finished
>> when dsb ends.
> 
> Yes, but the problematic loads are *after* the dsb.
> 
>>> 3. HW updates to ctrl and addr become visible.
>>> 4. RX_USED check.
>>> 5. code operates on garbage ctrl.
>> If this is happen then the data will be read on next interrupt.
> 
> As long as the RX_USED bit is set, the code in gem_rx() will either drop
> the frame or process it, depending on if it seems valid or not.

Yes, I wanted to say that if the RX_USED is not set when first reading ctrl
then the frame will not be processed at that moment but on the next interrupt.

It will
> not be processed again on next interrupt.
> 
> I'll try to rephrase what I meant:
> 
> 1. rmb(). This does nothing for loads that occur after it.
> 2. ctrl is loaded. It does not contain valid data yet.
> 3. HW receives a new frame and writes ctrl, addr, and they become
> visible to processor.
> 4. addr is loaded. It contains the RX_USED=1 as written in step 3.
> 5. Code does the RX_USED check, it sees 1 and continues processing the
> frame.
> 6. Code operates on ctrl, but as it was loaded before step 3, it
> contains old data.
> 
> The old ctrl may e.g. cause the frame to be dropped due to the
> RX_SOF/RX_EOF check.

Yes, agree, but placing dma_rmb() before "ctrl = desc->ctrl" will not solve
this case, right? What you want is to have ctrl loaded after you execute
next instruction which refers to it. According to [1] you will have to
place the dma_rmb() after the "ctrl = desc->ctrl" instruction to ensure
next time you refer to it, it contains the proper value.

> 
> 
>>
>> dma_rmb() is a dmb. According to [1]:
>> "Data Memory Barrier acts as a memory barrier. It ensures that all explicit
>> memory accesses that appear in program order before the DMB instruction are
>> observed before any explicit memory accesses that appear in program order
>> after the DMB instruction. It does not affect the ordering of any other
>> instructions executing on the processor."
>>
>> and your code is:
>>
>> 		/* Ensure other data is at least as up-to-date as rxused */
>> 		dma_rmb();
>>
>> 		addr = macb_get_addr(bp, desc);
>> 		ctrl = desc->ctrl;
>>
>> I understand from this that you want to wait for instructions prior to
>> dma_rmb() to be finished?
> 
> I want the load of "desc->addr" on the "rxused = (desc->addr &
> MACB_BIT(RX_USED)) ? true : false;" line to be observed before the later
> loads, such as "desc->ctrl".
> 
> 
>>> I think it may be OK to move the earlier rmb() outside the loop so that
>>> there is an rmb() only before and after the RX loop, as I don't at least
>>> immediately see any hard requirement to do it on each loop pass (unlike
>>> the added dma_rmb()). But my intent was to fix issues instead of
>>> optimization so I didn't look too closely into that.
>> But you said you did not see any issues with the code as it was previously.
> 
> I meant that I have not observed any issues in practice (though I
> haven't really tried to), but I do see theoretical issues which this
> patch addresses.
> 
> Especially the issues addressed by the two RX hunks seem entirely
> possible - the gem_rx() one requires the compiler to just reorder the
> two adjacent loads, and the macb_rx() one does not even require any
> reordering - desc->ctrl is read before desc->addr in the code.

In gem_rx() I agree that a barrier might be necessary, but placing it after
desc->addr will not guarantee you that the load of addr and ctrl will be
consistent. According to [1] the barrier should be after reading of addr
and ctrl.

Thank you,
Claudiu Beznea

> 
>> Thank you,
>> Claudiu Beznea
>>
>>>>> +
>>>>> +		addr = macb_get_addr(bp, desc);
>>>>> +		ctrl = desc->ctrl;
>>>>> +
>>>>>  		queue->rx_tail++;
>>>>>  		count++;
>>>>>  
>>>>> @@ -1180,11 +1189,14 @@ static int macb_rx(struct macb_queue *queue, int budget)
>>>>>  		/* Make hw descriptor updates visible to CPU */
>>>>>  		rmb();
>>>>>  
>>>>> -		ctrl = desc->ctrl;
>>>>> -
>>>>>  		if (!(desc->addr & MACB_BIT(RX_USED)))
>>>>>  			break;
>>>>>  
>>>>> +		/* Ensure other data is at least as up-to-date as addr */
>>>>> +		dma_rmb();
>>>> Ditto
>>>>
>>>>> +
>>>>> +		ctrl = desc->ctrl;
>>>>> +
>>>>>  		if (ctrl & MACB_BIT(RX_SOF)) {
>>>>>  			if (first_frag != -1)
>>>>>  				discard_partial_frame(queue, first_frag, tail);
>>>>>
>