[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <9306d73e-38c3-94ad-5b08-4955ef5aee0f@microchip.com>
Date: Thu, 13 Dec 2018 10:48:55 +0000
From: <Claudiu.Beznea@...rochip.com>
To: <anssi.hannula@...wise.fi>
CC: <Nicolas.Ferre@...rochip.com>, <davem@...emloft.net>,
<netdev@...r.kernel.org>
Subject: Re: [PATCH 3/3] net: macb: add missing barriers when reading buffers
On 12.12.2018 13:27, Anssi Hannula wrote:
> On 12.12.2018 12:58, Claudiu.Beznea@...rochip.com wrote:
>>
>> On 11.12.2018 15:21, Anssi Hannula wrote:
>>> On 10.12.2018 12:34, Claudiu.Beznea@...rochip.com wrote:
>>>> On 07.12.2018 14:00, Anssi Hannula wrote:
>>>>> On 6.12.2018 16:14, Claudiu.Beznea@...rochip.com wrote:
>>>>>> Hi Anssi,
>>>>> Hi!
>>>>>
>>>>>> On 05.12.2018 16:00, Anssi Hannula wrote:
>>>>>>> On 5.12.2018 14:37, Claudiu.Beznea@...rochip.com wrote:
>>>>>>>> On 30.11.2018 20:21, Anssi Hannula wrote:
>>>>>>>>> When reading buffer descriptors on RX or on TX completion, an
>>>>>>>>> RX_USED/TX_USED bit is checked first to ensure that the descriptor has
>>>>>>>>> been populated. However, there are no memory barriers to ensure that the
>>>>>>>>> data protected by the RX_USED/TX_USED bit is up-to-date with respect to
>>>>>>>>> that bit.
>>>>>>>>>
>>>>>>>>> Fix that by adding DMA read memory barriers on those paths.
>>>>>>>>>
>>>>>>>>> I did not observe any actual issues caused by these being missing,
>>>>>>>>> though.
>>>>>>>>>
>>>>>>>>> Tested on a ZynqMP based system.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Anssi Hannula <anssi.hannula@...wise.fi>
>>>>>>>>> Fixes: 89e5785fc8a6 ("[PATCH] Atmel MACB ethernet driver")
>>>>>>>>> Cc: Nicolas Ferre <nicolas.ferre@...rochip.com>
>>>>>>>>> ---
>>>>>>>>> drivers/net/ethernet/cadence/macb_main.c | 20 ++++++++++++++++----
>>>>>>>>> 1 file changed, 16 insertions(+), 4 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
>>>>>>>>> index 430b7a0f5436..c93baa8621d5 100644
>>>>>>>>> --- a/drivers/net/ethernet/cadence/macb_main.c
>>>>>>>>> +++ b/drivers/net/ethernet/cadence/macb_main.c
>>>>>>>>> @@ -861,6 +861,11 @@ static void macb_tx_interrupt(struct macb_queue *queue)
>>>>>>>>>
>>>>>>>>> /* First, update TX stats if needed */
>>>>>>>>> if (skb) {
>>>>>>>>> + /* Ensure all of desc is at least as up-to-date
>>>>>>>>> + * as ctrl (TX_USED bit)
>>>>>>>>> + */
>>>>>>>>> + dma_rmb();
>>>>>>>>> +
>>>>>>>> Is this necessary? Wouldn't previous rmb() take care of this? At this time
>>>>>>>> data specific to this descriptor was completed. The TX descriptors for next
>>>>>>>> data to be send is updated under a locked spinlock.
>>>>>>> The previous rmb() is before the TX_USED check, so my understanding is
>>>>>>> that the following could happen in theory:
>>>>>> We are using this IP on and ARM architecture, so, with regards to rmb(), I
>>>>>> understand from [1] that dsb completes when:
>>>>>> "All explicit memory accesses before this instruction complete.
>>>>>> All Cache, Branch predictor and TLB maintenance operations before this
>>>>>> instruction complete."
>>>>>>
>>>>>>> 1. rmb().
>>>>>> According to [1] this should end after all previous instructions (loads,
>>>>>> stores) ends.
>>>>>>
>>>>>>> 2. Reads are reordered so that TX timestamp is read first - no barriers
>>>>>>> are crossed.
>>>>>> But, as per [1], no onward instruction will be reached until all
>>>>>> instruction prior to dsb ends, so, after rmb() all descriptor's members
>>>>>> should be updated, right?
>>>>> The descriptor that triggered the TX interrupt should be visible now, yes.
>>>>> However, the controller may be writing to any other descriptors at the
>>>>> same time as the loop is processing through them, as there are multiple
>>>>> TX buffers.
>>>> Yes, but there is the "rmb()" after "desc = macb_tx_desc(queue, tail);" at
>>>> the beginning of loop intended for that... In the 2nd loop you are
>>>> operating with the same descriptor which was read in the first loop.
>>> I'm concerned about the 2nd iteration of the first for loop in
>>> macb_tx_interrupt().
>>> That operates on a different descriptor due to tail++, and that
>>> different descriptor may have in-flight writes from controller as the
>>> interrupt may or may not already be raised for it.
>> I agree with that, there may be in-flights updates, but since ownership bit
>> should be updated at the end of the TX (I expect at this moment to be
>> updated all the descriptor's parts, including timestamps), if you read a
>> descriptor and it hasn't been treated by the hardware the descriptor is own
>> by the hardware and there is this check after ctrl is read:
>>
>> if (!(ctrl & MACB_BIT(TX_USED)))
>> break;
>>
>> Next time a new tx interrupt will be raised this descriptor would be also
>> treated.
>>
>> In case the above instruction doesn't break then it means the descriptor
>> was transmitted. And since ownership should be updated at the end of the
>> TX, this means no in-flights updates should be done on this descriptor, so,
>> when you will read the timestamps these should be the good ones (even it is
>> the first one, the 3rd, and so on).
>
> Agreed with above, the only problem is if the ts1/ts2 load is reordered
> before ctrl load.
>
>> Moreover, this new barrier barrier is in the 2nd loop, you're placing a
>> memory barrier after every step in the 2nd loop for the same descriptor
>> read in the 1st loop (it was already loaded in a processor register).
>
> The barrier is inside "if (skb)", and skb is only set for the last frame
Agree, this escaped me.
> as indicated by the code at the end of the loop:
>
> /* skb is set only for the last buffer of the frame.
> * WARNING: at this point skb has been freed by
> * macb_tx_unmap().
> */
> if (skb)
> break;
>
>
>> To avoid reordering of ctrl and ts1/ts2 loads wouldn't be enough to have
>> memory barrier after:
>> ctrl = desc->ctrl;
>
> Yes, it would.
> My v2 moved it into the other direction, into the ts code, though, which
> has the added benefit of no barrier when timestamps are disabled (which
> IIRC is the default).
> But I can move it to just after ctrl if you so prefer.
It's ok in gem_ptp_txstamp().
>
>>> Or am I missing something?
>>>
>>>>>>> 3. HW writes timestamp and sets TX_USED (or they become visible).
>>>>>> I expect hardware to set TX_USED and timestamp before raising TX complete
>>>>>> interrupt. If so, there should be no on-flight updates of this descriptor,
>>>>>> right? Hardware raised a TX_USED bit read interrupt when it reads a
>>>>>> descriptor like this and hangs TX.
>>>>> For the first iteration of the loop, that is correct - there should be
>>>>> no in-flight writes from controller as it already raised the interrupt.
>>>>> However, the following iterations of the loop process descriptors that
>>>>> may or may not have the interrupt raised yet, and therefore may still
>>>>> have in-flight writes.
>>>> I expect the hardware to give up ownership of the descriptor just at the
>>>> end of TX operation, so that the word containing TX_USED to be the last one
>>>> updated. If you reads the descriptor and TX_USED is not there the first
>>>> loop breaks, queue->tx_tail is updated with the last processed descriptor
>>>> in the queue (see "queue->tx_tail = tail;") then the next interrupt will
>>>> start processing with this last one descriptor.
>>> Agreed, that is how it works. Issues only occur if we somehow operate on
>>> data before ownership was transferred.
>> Could this happens? Shouldn't this:
>> if (!(ctrl & MACB_BIT(TX_USED)))
>> break;
>>
>> assure that this could not happen and the fact that the ownership is passed
>> to CPU at the end of descriptor update?
>
> I don't see how that prevents desc->ctrl and desc->ts_1/ts_2 reordering,
> which is the only issue that I can see.
> By "data" above I meant ts_1/ts_2, AFAICS there is nothing else we read
> from the descriptors aside from ctrl itself.
>
>> If load reordering b/w desc->ctrl and desc->ts_1/ts_2 wouldn't be enough to
>> have a barrier after
>> ctrl = desc->ctrl instruction?
>
> Yes, per above.
>
>>>>>>> 4. Code checks TX_USED.
>>>>>>> 5. Code operates on timestamp that is actually garbage.
>>>>>>>
>>>>>>> I'm not 100% sure that there isn't some lighter/cleaner way to do this
>>>>>>> than dma_rmb(), though.
>>>>>> If you still think this scenario could happen why not calling a dsb in
>>>>>> gem_ptp_do_timestamp(). I feel like that is a proper place to call it.
>>>>> OK, I will move it there. Unless we arrive at a conclusion that it is
>>>>> unnecessary altogether, of course :)
>>>>>
>>>>>> Moreover, there is bit 32 of desc->ctrl which tells you if a valid
>>>>>> timestamp was placed in the descriptor. But, again, I expect the timestamp
>>>>>> and TX_USED to be set by hardware before raising TX complete interrupt.
>>>>> Yes, but since my concern is that without barriers in between,
>>>>> desc->ctrl might be read after ts_1/ts_2, so that bit might be seen as
>>>>> set even though ts_1 is not yet an actual timestamp. And per above, all
>>>>> this may occur before the TX complete interrupt is raised for the
>>>>> descriptor in question.
>>>> If so, why not placing the barrier when reading timestamps? From my point
>>>> of view the place of "dma_rmb()" in this patch doesn't guarantees that
>>>> ts1/ts2 were read after the execution of barrier (correct me if I'm wrong).
>>> If the timestamp ts1/ts2 reads are done after dma_rmb(), and dev->ctrl
>>> is read before the barrier, my understanding is that they are guaranteed
>>> to be ordered so that dev->ctrl is read before ts1/ts2.
>>>
>>> Per [1], "[DMB] ensures that all explicit memory accesses that appear in
>>> program order before the |DMB| instruction are observed before any
>>> explicit memory accesses that appear in program order after the |DMB|
>>> instruction".
>>> And the compiler barrier ("memory" asm clobber) included within
>>> dma_rmb() ensures that in program order, ctrl is loaded before the
>>> barrier, and ts1/ts2 after it.
>>>
>>> But as mentioned, it makes sense to have it closer to the ts1/ts2 reads
>>> so I'll make that change.
>>>
>>> [1]
>>> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHGHHIE.html
>>>
>>>
>>>>> I agree that this TX case seems somewhat unlikely to be triggered in
>>>>> real life at least at present, though, as the ts_1/ts_2 read is so far
>>>>> after the desc->ctrl read, and behind function calls, so unlikely to be
>>>>> reordered before desc->ctrl read.
>>>>> But the similar RX cases below seem more problematic as the racy loads
>>>>> in question are right after each other in code.
>>>>>
> [...]
>
Powered by blists - more mailing lists