netdev - Re: tg3 (5720) PTP sync problems

Message-ID: <CACKFLikGdN9XPtWk-fdrzxdcD=+bv-GHBvfVfSpJzHY7hrW39g@mail.gmail.com>
Date:   Fri, 16 Sep 2022 11:45:56 -0700
From:   Michael Chan <michael.chan@...adcom.com>
To:     Simon White <Simon.White@...visolutions.com>
Cc:     "davem@...emloft.net" <davem@...emloft.net>,
        "richardcochran@...il.com" <richardcochran@...il.com>,
        Stephen Hill <Stephen.Hill@...visolutions.com>,
        Netdev <netdev@...r.kernel.org>,
        Pavan Chebbi <pavan.chebbi@...adcom.com>
Subject: Re: tg3 (5720) PTP sync problems

CCing netdev instead of lkml and converting to plain-text email.

On Fri, Sep 16, 2022 at 8:54 AM Simon White
<Simon.White@...visolutions.com> wrote:
>
> In a running setup, PTP sync problems were observed when the server providing the PTP grandmaster performed other high-load network transmissions.  The PTP slaves could experience sync errors in the tens of milliseconds.

Thanks for reporting the issue.  One of my colleagues will look into this.

>
>
>
> Simplifying the setup and test conditions to two servers (Dell R7527 dual socket servers with 64 core Milans) utilising iperf, we were able to replicate the problem.  Multiple TX rings were tried, where only the PTP traffic was given its own TX ring and set to use a high priority; however, that made no difference.  Examination of the problem led to the following code:
>
>
>
> static void tg3_tx(struct tg3_napi *tnapi)
> {
> [snip]
>                 if (tnapi->tx_ring[sw_idx].len_flags & TXD_FLAG_HWTSTAMP) {
>                         struct skb_shared_hwtstamps timestamp;
>                         u64 hwclock = tr32(TG3_TX_TSTAMP_LSB);
>                         hwclock |= (u64)tr32(TG3_TX_TSTAMP_MSB) << 32;
>
>                         tg3_hwclock_to_timestamp(tp, hwclock, &timestamp);
>
>                         skb_tstamp_tx(skb, &timestamp);
>                 }
>
>
>
> This assumes that the timestamp will have been updated by the time this descriptor in the tx ring has been marked as consumed.  We observe that when the interface is under TX load this no longer holds true.  Changing tg3_start_xmit to record the timestamp where TXD_FLAG_HWTSTAMP is set, and spinning in the above code until the timestamp has updated, appears to address the PTP delay calculation.  A patch covering the change described has been attached for reference, but I am not suggesting it as the solution to the problem.
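
The workaround described above (snapshot the TX timestamp register at transmit time, then poll at completion until the value changes) can be sketched in user-space C. This is only an illustration under stated assumptions, not the attached patch and not tg3 code: `struct mock_hw`, `mock_read_tx_tstamp()` and `wait_for_tx_tstamp()` are hypothetical names, and the mock register stands in for the `tr32(TG3_TX_TSTAMP_LSB)` / `tr32(TG3_TX_TSTAMP_MSB)` reads in the driver. The poll is bounded so that a timestamp that never arrives cannot stall the completion path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the 64-bit TX timestamp register. */
struct mock_hw {
	uint64_t tx_tstamp;      /* value currently visible in the register */
	uint64_t pending_tstamp; /* value the hardware will latch later */
	unsigned int latch_after; /* register reads until the latch; 0 = never */
};

/* Simulated register read: the new timestamp appears after a delay. */
static uint64_t mock_read_tx_tstamp(struct mock_hw *hw)
{
	if (hw->latch_after > 0 && --hw->latch_after == 0)
		hw->tx_tstamp = hw->pending_tstamp;
	return hw->tx_tstamp;
}

/*
 * Poll until the register differs from the value recorded at transmit
 * time (pre_xmit), giving up after max_polls reads.
 * Returns true and stores the fresh timestamp in *out on success;
 * returns false and leaves *out untouched on timeout.
 */
static bool wait_for_tx_tstamp(struct mock_hw *hw, uint64_t pre_xmit,
			       unsigned int max_polls, uint64_t *out)
{
	assert(hw != NULL && out != NULL);

	for (unsigned int i = 0; i < max_polls; i++) {
		uint64_t now = mock_read_tx_tstamp(hw);

		if (now != pre_xmit) {
			*out = now;
			return true;
		}
	}
	return false;
}
```

Comparing against a single pre-transmit snapshot only works if at most one timestamped packet is outstanding at a time; the sketch makes that assumption and does not attempt to handle overlapping timestamp requests.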
>
>
>
> Adding printks to record the duration of the spin loop showed that the timestamp could take around 150us to update after the descriptor was marked as consumed.  One can speculate how this could come about from Figure 30 (Transmit Flow Diagram) on page 132 of the BCM5718 Family Programmer’s Reference Guide (broadcom.com); however, could it be confirmed whether the assumption the tg3.c code makes is correct?
>
>
>
> Part:
>
>
>
> [   24.311626] tg3 0000:e1:00.0 eth0: Tigon3 [partno(BCM95720) rev 5720000] (PCI Express) MAC address xxxxxx
>
> [   24.311630] tg3 0000:e1:00.0 eth0: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
>
>
>
> Kind Regards,
>
> Simon White
