netdev - Re: [PATCH 0/1] Alternative, restart tx after tx used bit read

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8074717519a77fdc5af6c86965beeb5ba0e07d61.camel@calian.com>
Date:   Wed, 12 Apr 2023 16:17:14 +0000
From:   Robert Hancock <robert.hancock@...ian.com>
To:     "ingo.rohloff@...terbach.com" <ingo.rohloff@...terbach.com>,
        "tomas.melin@...sala.com" <tomas.melin@...sala.com>
CC:     "Nicolas.Ferre@...rochip.com" <Nicolas.Ferre@...rochip.com>,
        "claudiu.beznea@...rochip.com" <claudiu.beznea@...rochip.com>,
        "davem@...emloft.net" <davem@...emloft.net>,
        "kuba@...nel.org" <kuba@...nel.org>,
        "harini.katakam@...inx.com" <harini.katakam@...inx.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [PATCH 0/1] Alternative, restart tx after tx used bit read

On Wed, 2023-04-12 at 09:27 +0300, Tomas Melin wrote:
> Hi,
> On 08/04/2023 00:33, Ingo Rohloff wrote:
> > [You don't often get email from ingo.rohloff@...terbach.com. Learn
> > why this is important at
> > https://urldefense.com/v3/__https://aka.ms/LearnAboutSenderIdentification__;!!IOGos0k!iFvgajE8vnJRZmpTVg9e4dOxJyPnniaWpfcD4DPEWlcvL2TH5ATnrbPJEzOmuYtxPdatijpc4zTDtStxJ-VSTcAhXw$
> >   ]
> > 
> > I am sorry; this is a long E-Mail.
> > 
> > I am referring to this problem:
> > 
> > Robert Hancock wrote:
> > > On Wed, 2022-03-23 at 08:43 -0700, Jakub Kicinski wrote:
> > > > On Wed, 23 Mar 2022 10:08:20 +0200 Tomas Melin wrote:
> > > > > > From: Claudiu Beznea <claudiu.beznea@...rochip.com>
> > > > > > 
> > > > > > On some platforms (currently detected only on SAMA5D4) TX
> > > > > > might stuck
> > > > > > even the pachets are still present in DMA memories and TX
> > > > > > start was
> > > > > > issued for them.
> > > > > > ...
> > > > > On Xilinx Zynq the above change can cause infinite interrupt
> > > > > loop
> > > > > leading to CPU stall.  Seems timing/load needs to be
> > > > > appropriate for
> > > > > this to happen, and currently with 1G ethernet this can be
> > > > > triggered
> > > > > normally within minutes when running stress tests on the
> > > > > network
> > > > > interface.
> > > > > ...
> > > > Which kernel version are you using?  Robert has been working on
> > > > macb +
> > > > Zynq recently, adding him to CC.
> > > ...
> > > I haven't looked at the TX ring descriptor and register setup on
> > > this core
> > > in that much detail, but the fact the controller gets into this
> > > "TX used
> > > bit read" state in the first place seems unusual.  I'm wondering
> > > if
> > > something is being done in the wrong order or if we are missing a
> > > memory
> > > barrier etc?
> > 
> > I am developing on a ZynqMP (Ultrascale+) SoC from AMD/Xilinx.
> > I have seen the same issue before commit 4298388574dae6168 ("net:
> > macb:
> > restart tx after tx used bit read")
> 
> Since you mention before that commit this triggered, have you been
> able
> to reproduce the problem with that commit applied?
> 
> > 
> > The scenario which sometimes triggers it for me:
> > 
> > I have an application running on the PC.
> > The application sends a short command (via TCP) to the ZynqMP.
> > The ZynqMP answers with a long stream of bytes via TCP
> > (around 230KiB).
> > The PC knows the amount of data and waits to receive the data
> > completely.
> > The PC gets stuck, because the last TCP segment of the transfer
> > gets
> > stuck in the ZynqMP and is not transmitted.
> > You can re-trigger the TX Ring by pinging the ZynqMP:
> > The Ping answer will re-trigger the TX ring, which in turn will
> > also
> > then send the stuck IP/TCP packet.
> > 
> > Unfortunately triggering this problem seems to be hard; at least I
> > am
> > not able to reproduce it easily.
> > 
> > So: If anyone has a more reliable way to trigger the problem,
> > please tell me.
> > This is to check if my proposed alternative works under all
> > circumstances.
> > 
> > I have an alternate implementation, which does not require to turn
> > on
> > the "TX USED BIT READ" (TUBR) interrupt.
> > The reason why I think this alternative might be better is, because
> > I
> > believe the TUBR interrupt happens at the wrong time; so I am not
> > sure
> > that the current implementation works reliably.
> > 
> > Analysis:
> > Commit 404cd086f29e867f ("net: macb: Allocate valid memory for TX
> > and RX BD
> > prefetch") mentions:
> > 
> >     GEM version in ZynqMP and most versions greater than r1p07
> > supports
> >     TX and RX BD prefetch. The number of BDs that can be prefetched
> > is a
> >     HW configurable parameter. For ZynqMP, this parameter is 4.
> > 
> > I think what happens is this:
> > Example Scenario (SW == linux kernel, HW == cadence ethernet IP).
> > 1) SW has written TX descriptors 0..7
> > 2) HW is currently transmitting TX descriptor 6.
> >    HW has already prefetched TX descriptors 6,7,8,9.
> > 3) SW writes TX descriptor 8 (clearing TX_USED)
> > 4) SW writes the TSTART bit.
> >    HW ignores this, because it is still transmitting.
> > 5) HW transmits TX descriptor 7.
> > 6) HW reaches descriptor 8; because this descriptor
> >    has already been prefetched, HW sees a non-active
> >    descriptor (TX_USED set) and stops transmitting.
> > 
> > From debugging the code it seems that the TUBR interrupt happens,
> > when
> > a descriptor is prefetched, which has a TX_USED bit set, which is
> > before
> > it is processed by the rest of the hardware:
> > When looking at the end of a transfer it seems I get a TUBR
> > interrupt,
> > followed by some more TX COMPLETE interrupts.
> > 
> I recall that the documentation for the TUBR is rather sparse, so to
> be
> sure about the semantics how this is supposed to work, internal
> documentation would indeed be valuable.

It would be nice if Cadence would just make the documentation for this
core fully public.

CCing Harini from Xilinx who added some previous workarounds related to
descriptor prefetch. Do you have some more visibility into how this
mechanism is supposed to work? Given that the driver is potentially
modifying descriptors to add new entries right in front of the
hardware's current position in the ring, I'm not sure how it can be
avoided that it ends up processing stale descriptor values and ending
up in this TXUBR state, which it seems may sometimes not be handled
properly. Maybe if there was a way to invalidate the prefetched entries
after modifying the TX ring, but that might still be racy. If the
performance impact is not that much, it might be easiest to just
disable TX descriptor prefetch if there was some way to do so.


> 
> Thanks,
> Tomas
> 
> 
> > Additionally that means at the time the TUBR interrupt happens, it
> > is too early to write the TSTART bit again, because the hardware is
> > still actively transmitting.
> > 
> > The alternative I implemented is to check in macb_tx_complete() if
> > 
> > 1) The TX Queue is non-empty (there are pending TX descriptors)
> > 2) The hardware indicates that it is not transmitting any more
> > 
> > If this situation is detected, the TSTART bit will be written to
> > restart the TX ring.
> > 
> > I know for sure, that I hit the code path, which restarts the
> > transmission in macb_tx_complete(); that's why I believe the
> > "Example Scenario" I described above is correct.
> > 
> > I am still not sure if what I implemented is enough:
> > macb_tx_complete() should at least see all completed TX
> > descriptors.
> > I still believe there is a (very short) time window in which there
> > might be a race:
> > 1) HW completes TX descriptor 7 and sets the TX_USED bit
> >    in TX descriptor 7.
> >    TX descriptor 8 was prefetched with a set TX_USED bit.
> > 2) SW sees that TX descriptor 7 is completed
> >    (TX_USED bit now is set).
> > 3) SW sees that there still is a pending TX descriptor 8.
> > 4) SW checks if the TGO bit is still set, which it is.
> >    So the SW does nothing at this point.
> > 5) HW processes the prefetched,set TX_USED bit in
> >    TX descriptor 8 and stops transmission (clearing the TGO bit).
> > 
> > I am not sure if it is guaranteed that 5) cannot happen after 4). 
> > If 5)
> > happens after 4) as described above, then the controller still gets
> > stuck.
> > The only idea I can come up with, is to re-check the TGO bit
> > a second time a little bit later, but I am not sure how to
> > implement this.
> > 
> > Is there anyone who has access to hardware documentation, which
> > sheds some light onto the way the descriptor prefetching works?
> > 
> > so long
> >   Ingo
> > 
> > 
> > Ingo Rohloff (1):
> >   net: macb: A different way to restart a stuck TX descriptor ring.
> > 
> >  drivers/net/ethernet/cadence/macb.h      |  1 -
> >  drivers/net/ethernet/cadence/macb_main.c | 67 +++++++++-----------
> > ----
> >  2 files changed, 24 insertions(+), 44 deletions(-)
> > 
> > --
> > 2.17.1
> > 

-- 
Robert Hancock
Senior Hardware Designer, Calian Advanced Technologies
www.calian.com