netdev - RE: bnxt_en: Incorrect tx timestamp report

Message-ID: <CO1PR11MB5089FF56F6991F88F5E4E8A8D6AC2@CO1PR11MB5089.namprd11.prod.outlook.com>
Date: Tue, 1 Apr 2025 20:17:29 +0000
From: "Keller, Jacob E" <jacob.e.keller@...el.com>
To: Pavan Chebbi <pavan.chebbi@...adcom.com>, Kamil Zaripov
	<zaripov-kamil@...ide.ai>
CC: Vadim Fedorenko <vadim.fedorenko@...ux.dev>, Michael Chan
	<michael.chan@...adcom.com>, Linux Netdev List <netdev@...r.kernel.org>
Subject: RE: bnxt_en: Incorrect tx timestamp report



> -----Original Message-----
> From: Pavan Chebbi <pavan.chebbi@...adcom.com>
> Sent: Thursday, March 27, 2025 6:17 AM
> To: Kamil Zaripov <zaripov-kamil@...ide.ai>
> Cc: Vadim Fedorenko <vadim.fedorenko@...ux.dev>; Michael Chan
> <michael.chan@...adcom.com>; Keller, Jacob E <jacob.e.keller@...el.com>;
> Linux Netdev List <netdev@...r.kernel.org>
> Subject: Re: bnxt_en: Incorrect tx timestamp report
> 
> On Wed, Mar 26, 2025 at 7:20 PM Kamil Zaripov <zaripov-kamil@...ide.ai> wrote:
> >
> >
> >
> > > On 25 Mar 2025, at 12:41, Vadim Fedorenko <vadim.fedorenko@...ux.dev>
> wrote:
> > >
> > > On 25/03/2025 10:13, Kamil Zaripov wrote:
> > >>
> > >> I guess I don’t understand how does it work. Am I right that if userspace
> program changes frequency of PHC devices 0,1,2,3 (one for each port present in
> NIC) driver will send PHC frequency change 4 times but firmware will drop 3 of
> these frequency change commands and will pick up only one? How can I
> understand which PHC will actually represent adjustable clock and which one is
> phony?
> > >
> > > It can be any of PHC devices, mostly the first to try to adjust will be used.
> >
> > I believe that randomly selecting one of the PHC clock to control actual PHC in
> NIC and directing commands received on other clocks to the /dev/null is quite
> unexpected behavior for the userspace applications.
> >
> > >> Another thing that I cannot understand is so-called RTC and non-RTC mode.
> Is there any documentation that describes it? Or specific parts of the driver that
change its behavior for RTC and non-RTC mode?
> > >
> > > Generally, non-RTC means free-running HW PHC clock with timecounter
> > > adjustment on top of it. With RTC mode every adjfine() call tries to
> > > adjust HW configuration to change the slope of PHC.
> >
> > Just to clarify:
> >
> > Am I right that in RTC mode:
> > 1.1. All 64 bits of the PHC counter are stored on the NIC (both the “readable” 0–
> 47 bits and the higher 48–63 bits).
> In both RTC and non-RTC modes, the driver will use the lower 48b from
> HW as cycles to feed to the timecounter that driver has mapped to the
> PHC.
> 
> > 1.2. When userspace attempts to change the PHC counter value (using adjtime
> or settime), these changes are propagated to the NIC via the
> PORT_MAC_CFG_REQ_ENABLES_PTP_ADJ_PHASE and
> FUNC_PTP_CFG_REQ_ENABLES_PTP_SET_TIME requests.
> True.
> 
> > 1.3. If one port of a four-port NIC is updated, the change is propagated to all
> other ports via the
> ASYNC_EVENT_CMPL_PHC_UPDATE_EVENT_DATA1_FLAGS_PHC_RTC_UPDATE
> event. As a result, all four instances of the bnxt_en driver receive the event with
> the high 48–63 bits of the counter in payload. They then asynchronously read the
> 0–47 bits and update the timecounter struct’s nsec field.
> Not true in the latest Firmware.
> 
> > 1.4. If we ignore the bug related to unsynchronized reading of the higher (48–
> 63) and lower (0–47) bits of the PHC counter, the time across each timecounter
> instance should remain in sync.
> Well, no. It won't be very accurate. We designed non-RTC mode for such
> use cases. But yes, your use case is not exactly what non-RTC caters
> for.
> 
> > 1.5. When userspace calls adjfine, it triggers the
> PORT_MAC_CFG_REQ_ENABLES_PTP_FREQ_ADJ_PPB request, causing the PHC
> tick rate to change.
> Correct. But only the first ever port that made the freq adj will
> continue to make further freq adjustments. This was a policy decision,
> not exactly random. There is an option in our tools to see which is
> the interface that is currently making freq adjustments.
> 
> >
> > In non-RTC mode:
> > 2.1. Only the lower 0–47 bits are stored on the NIC. The higher 48–63 bits are
> stored only in the timecounter struct.
> > 2.2. When userspace tries to change the PHC counter via adjtime or settime, the
> change is reflected only in the timecounter struct.
> Correct.
> 
> > 2.3. Each timecounter instance may have its own nsec field value, potentially
> leading to different timestamps read from /dev/ptp[0-3].
> Basically each of the timecounters is independent.
> 
> > 2.4. When userspace calls adjfine, it only modifies the mul field in the
cyclecounter struct, which means no real change occurs to the PHC tick rate on the
> hardware.
> Correct.
> 
> >
> > And about issue in general:
> > 3.1. Firmware versions 230+ operate in non-RTC mode in all environments.
> No, the driver makes the choice of when to shift to non-RTC from RTC.
> Currently this happens only in the multi-host environment, where each
> port is used to synchronize a different Linux system clock.
> But 230+ version has the change that will not track the rollover in
> FW, and the
> ASYNC_EVENT_CMPL_PHC_UPDATE_EVENT_DATA1_FLAGS_PHC_RTC_UPDATE
> event is deprecated.
> 
> > 3.2. Firmware version 224 uses RTC mode because older driver versions were
> not designed to track overflows (the higher 48–63 bits of the PHC counter) on the
> driver side.
> >
> >
> > >>> The latest driver handles the rollover on its own and we don't need the
> firmware to tell us.
> > >>> I checked with the firmware team and I gather that the version you are using
> is very old.
> > >>> Firmware version 230.x onwards, you should not receive this event for
> rollovers.
> > >>> Is it possible for you to update the firmware? Do you have access to a more
> recent (230+) firmware?
> > >> Yes, I can update firmware if you can tell where can I find the latest firmware
> and the update instructions?
> > >
> > > Broadcom's web site has pretty easy support portal with NIC firmware
> > > publicly available. Current version is 232 and it has all the
> > > improvements Pavan mentioned.
> >
> > Yes, I have found the "Broadcom BCM57xx Fwupg Tools" archive with some
precompiled binaries for x86_64 platform. The problem is that our hosts are
aarch64 and use Nix as a package manager, so it will take some time to make it
work in our setup. I just hoped that there is a firmware binary itself that I can pass
to ethtool --flash.
> >
> >
> >
> > > On 25 Mar 2025, at 14:24, Pavan Chebbi <pavan.chebbi@...adcom.com>
> wrote:
> > >
> > >>> Yes, I can update firmware if you can tell where can I find the latest firmware
> and the update instructions?
> > >>>
> > >>
> > >> Broadcom's web site has pretty easy support portal with NIC firmware
> > >> publicly available. Current version is 232 and it has all the
> > >> improvements Pavan mentioned.
> > >>
> > > Thanks Vadim for chiming in. I guess you answered all of Kamil's questions.
> >
> > Yes, thank you for help. Without your explanation, I would have spent a lot
> more time understanding it on my own.
> >
> > > I am curious about Kamil's use case of running PTP on 4 ports (in a
> > > single host?) which seem to be using RTC mode.
> > > Like Vadim pointed out earlier, this cannot be an accurate config
> > > given we run a shared PHC.
> > > Can Kamil give details of his configuration?
> >
> > I have a system equipped with a BCM57502 NIC that functions as a PTP
> grandmaster in a small local network. Four PTP clients — each connected to one
> of the NIC’s four ports — synchronize their time with the grandmaster using the
> PTP L2P2P protocol. To support this configuration, I run four ptp4l instances (one
> for each port) and a single phc2sys daemon to synchronize system time and PHC
> time by adjusting the PHC. Because the bnxt_en driver reports different PHC
> device indexes for each NIC port, the phc2sys daemon treats each PHC device as
> independent and adjusts their times separately.
> >
> If you are using Broadcom NIC, and have only one system time to
> update, I don't see why we should have 4 PTP clients. Just one
> instance of ptp4l running on one of the ports and one phc2sys is going
> to be valid (and is sufficient?)
> I am thinking out loud, the phc2sys daemon could be picking up all the
> available clocks, but I think that needs to be modified, unless we
> decide to stop exposing multiple clocks for the same PHC in our
> design.
> Of course, I am not sure if you have a requirement of 4 GMs to sync with.
> 
> > We also have a similar setup with a different network card, the Intel E810-C,
> which has four ports as well. However, its ice driver exposes only one PHC device
> and probably read PHC counter in a different way. I do not remember similar
> issues with this setup.
> >
>  I think on the Intel NIC, this problem itself would not arise,
> because you will run only 1 client each of ptp4l and phc2sys, right?
> But I am not sure how you can run 4 GMs on Intel NIC if you are
> running that.

You can run one ptp4l instance connected to all 4 ports as a boundary clock. If you try to run separate instances of ptp4l on each port, you'll run into issues with each port trying to synchronize, unless you explicitly configure each ptp4l instance to be source only so that it never goes into the sink/slave state.
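
For anyone following the non-RTC discussion quoted above (points 2.1 to 2.4), here is a rough Python sketch of how a software timecounter can extend a free-running 48-bit hardware counter. It is only an illustration under simplifying assumptions, not the bnxt_en or kernel implementation: the class name, the mult/shift defaults and the method names are made up for the example, and the fractional-nanosecond accumulation that a real timecounter would need is left out.

```python
MASK_48 = (1 << 48) - 1  # the hardware exposes only the low 48 bits


class SoftTimeCounter:
    """Hypothetical software clock layered on a 48-bit cycle counter."""

    def __init__(self, start_cycles, mult=1, shift=0):
        self.cycle_last = start_cycles & MASK_48
        self.nsec = 0          # lives only in software, never in the NIC
        self.mult = mult       # frequency adjustment changes this, not the HW rate
        self.shift = shift

    def read(self, hw_cycles):
        # Masking the difference makes a single 48-bit rollover transparent,
        # provided the counter is read at least once per wrap period.
        delta = (hw_cycles - self.cycle_last) & MASK_48
        self.cycle_last = hw_cycles & MASK_48
        self.nsec += (delta * self.mult) >> self.shift
        return self.nsec

    def adjtime(self, offset_ns):
        # A phase step touches only the software state.
        self.nsec += offset_ns


# Two ports sharing one hardware counter but keeping separate software state.
port_a = SoftTimeCounter(start_cycles=MASK_48 - 9)
port_b = SoftTimeCounter(start_cycles=MASK_48 - 9)
port_a.adjtime(1000)           # only port A is stepped
time_a = port_a.read(5)        # hardware counter has wrapped past zero
time_b = port_b.read(5)
```

Because each instance carries its own nsec and mult, stepping or slewing one of them leaves the others untouched, which is why the per-port clocks can drift apart even though they read the same hardware counter.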