netdev - RE: [PATCH iwl-net] igc: Ensure PTM request is completed before timeout has started

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID:
 <PR0P264MB44642D545FA8C5701F318DA2C8A22@PR0P264MB4464.FRAP264.PROD.OUTLOOK.COM>
Date: Tue, 16 Jul 2024 08:13:20 +0000
From: Rodrigo CADORE CATALDO <rodrigo.cadore@...coustics.com>
To: Vinicius Costa Gomes <vinicius.gomes@...el.com>, Rodrigo Cataldo via B4
 Relay <devnull+rodrigo.cadore.l-acoustics.com@...nel.org>, Jesse Brandeburg
	<jesse.brandeburg@...el.com>, Tony Nguyen <anthony.l.nguyen@...el.com>,
	"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
	Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, Richard
 Cochran <richardcochran@...il.com>, Kurt Kanzenbach
	<kurt.kanzenbach@...utronix.de>, "Christopher S. Hall"
	<christopher.s.hall@...el.com>
CC: "intel-wired-lan@...ts.osuosl.org" <intel-wired-lan@...ts.osuosl.org>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH iwl-net] igc: Ensure PTM request is completed before
 timeout has started

Vinicius Costa Gomes <vinicius.gomes@...el.com> writes: 

> From: Vinicius Costa Gomes <vinicius.gomes@...el.com>
> 
> Rodrigo Cataldo via B4 Relay
> <devnull+rodrigo.cadore.l-acoustics.com@...nel.org> writes:
> 
> > From: Rodrigo Cataldo <rodrigo.cadore@...coustics.com>
> >
> > When a PTM is requested via wr32(IGC_PTM_STAT), the operation may only
> > be completed by the next read operation (flush). Unfortunately, the next
> > read operation in the PTM request loop happens after we have already
> > started evaluating the response timeout.
> >
> > Thus, the following behavior has been observed::
> >
> >   phc2sys-1655  [010]   103.233752: funcgraph_entry:                    |
> igc_ptp_getcrosststamp() {
> >   phc2sys-1655  [010]   103.233754: funcgraph_entry:                    |
> igc_phc_get_syncdevice_time() {
> >   phc2sys-1655  [010]   103.233755: funcgraph_entry:                    |      igc_rd32()
> {
> >   phc2sys-1655  [010]   103.233931: preempt_disable:
> caller=irq_enter_rcu+0x14 parent=irq_enter_rcu+0x14
> >   phc2sys-1655  [010]   103.233932: local_timer_entry: vector=236
> >   phc2sys-1655  [010]   103.233932: hrtimer_cancel:
> hrtimer=0xffff8edeef526118
> >   phc2sys-1655  [010]   103.233933: hrtimer_expire_entry:
> hrtimer=0xffff8edeef526118 now=103200127876
> function=tick_nohz_handler/0x0
> >
> >   ... tick handler ...
> >
> >   phc2sys-1655  [010]   103.233971: funcgraph_exit:       !  215.559 us |      }
> >   phc2sys-1655  [010]   103.233972: funcgraph_entry:                    |      igc_rd32()
> {
> >   phc2sys-1655  [010]   103.234135: funcgraph_exit:       !  164.370 us |      }
> >   phc2sys-1655  [010]   103.234136: funcgraph_entry:         1.942 us   |
> igc_rd32();
> >   phc2sys-1655  [010]   103.234147: console:              igc 0000:03:00.0 enp3s0:
> Timeout reading IGC_PTM_STAT register
> >
> > Based on the (simplified) code::
> >
> >       ctrl = rd32(IGC_PTM_CTRL);
> >         /* simplified: multiple writes here */
> >       wr32(IGC_PTM_STAT, IGC_PTM_STAT_VALID);
> >
> >       err = readx_poll_timeout(rd32, IGC_PTM_STAT, stat,
> >                                stat, IGC_PTM_STAT_SLEEP,
> >                                IGC_PTM_STAT_TIMEOUT);
> >       if (err < 0) {
> >               netdev_err(adapter->netdev, "Timeout reading IGC_PTM_STAT
> register\n");
> >               return err;
> >       }
> >
> > Where readx_poll_timeout() starts the timeout evaluation before calling
> > the rd32() parameter (rd32() is a macro for igc_rd32()).
> >
> > In the trace shown, the read operation of readx_poll_timeout() (second
> > igc_rd32()) took so long that the timeout (IGC_PTM_STAT_VALID) has expired
> > and no sleep has been performed.
> >
> > With this patch, a write flush is added (which is an additional
> > igc_rd32() in practice) that can wait for the write before the timeout
> > is evaluated::
> >
> >   phc2sys-1615  [010]    74.517954: funcgraph_entry:                    |
> igc_ptp_getcrosststamp() {
> >   phc2sys-1615  [010]    74.517956: funcgraph_entry:                    |
> igc_phc_get_syncdevicetime() {
> >   phc2sys-1615  [010]    74.517957: funcgraph_entry:                    |      igc_rd32() {
> >   phc2sys-1615  [010]    74.518127: preempt_disable:
> caller=irq_enter_rcu+0x14 parent=irq_enter_rcu+0x14
> >   phc2sys-1615  [010]    74.518128: local_timer_entry: vector=236
> >   phc2sys-1615  [010]    74.518128: hrtimer_cancel:
> hrtimer=0xffff96466f526118
> >   phc2sys-1615  [010]    74.518128: hrtimer_expire_entry:
> hrtimer=0xffff96466f526118 now=74484007229 function=tick_nohz_handler/0x0
> >
> >   ... tick handler ...
> >
> >   phc2sys-1615  [010]    74.518180: funcgraph_exit:       !  222.282 us |      }
> >   phc2sys-1615  [010]    74.518181: funcgraph_entry:                    |      igc_rd32() {
> >   phc2sys-1615  [010]    74.518349: funcgraph_exit:       !  168.160 us |      }
> >   phc2sys-1615  [010]    74.518349: funcgraph_entry:         1.970 us   |
> igc_rd32();
> >   phc2sys-1615  [010]    74.518352: hrtimer_init: hrtimer=0xffffa6f9413a3940
> clockid=CLOCK_MONOTONIC mode=0x0
> >   phc2sys-1615  [010]    74.518352: preempt_disable:
> caller=_raw_spin_lock_irqsave+0x28 parent=hrtimer_start_range_ns+0x56
> >   phc2sys-1615  [010]    74.518353: hrtimer_start: hrtimer=0xffffa6f9413a3940
> function=hrtimer_wakeup/0x0 expires=74484232878 softexpires=74484231878
> >
> >   .. hrtimer setup and return ...
> >
> >   kworker/10:1-242   [010]    74.518382: sched_switch: kworker/10:1:242 [120]
> W ==> phc2sys:1615 [120]
> >   phc2sys-1615  [010]    74.518383: preempt_enable: caller=schedule+0x36
> parent=schedule+0x36
> >   phc2sys-1615  [010]    74.518384: funcgraph_entry:      !  100.088 us |
> igc_rd32();
> >   phc2sys-1615  [010]    74.518484: funcgraph_entry:         1.958 us   |
> igc_rd32();
> >   phc2sys-1615  [010]    74.518488: funcgraph_entry:         2.019 us   |
> igc_rd32();
> >   phc2sys-1615  [010]    74.518490: funcgraph_entry:         1.956 us   |
> igc_rd32();
> >   phc2sys-1615  [010]    74.518492: funcgraph_entry:         1.980 us   |
> igc_rd32();
> >   phc2sys-1615  [010]    74.518494: funcgraph_exit:       !  539.386 us |    }
> >
> > Now the sleep is called as expected, and the operation succeeds.
> > Therefore, regardless of how long it will take for the write to be
> > completed, we will poll+sleep at least for the time specified in
> > IGC_PTM_STAT_TIMEOUT.
> >
> > Fixes: a90ec8483732 ("igc: Add support for PTP getcrosststamp()")
> > Signed-off-by: Rodrigo Cataldo <rodrigo.cadore@...coustics.com>
> > ---
> > I have been searching for the proper solution for this PTM issue for a long
> > time. The issue was observed on a 13700 (Raptor Lake). We also use a 8500
> > (Coffee Lake) that is much less susceptible for this issue, but still
> > happens. Both are using I225-LM cards.
> >
> > For a 24 hours test, idle system, I have observed with 13700::
> >
> >       number of timeouts in 86400 seconds: 2509
> >
> > The same test on a 8500::
> >
> >         number of timeouts in 86400 seconds: 9
> >
> > Where one PTM request is sent per second. Test was done this script::
> >
> >   record_multiple_timeout_param()
> >   {
> >       local taskset_cpu=$1
> >       local cur_limit=$((SECONDS + LIMIT_SECONDS ))
> >       local timeouts=0
> >
> >       while [ $SECONDS -lt $cur_limit ]
> >       do
> >               REL_TO=$((cur_limit - SECONDS))
> >
> >               timeout $REL_TO taskset $taskset_cpu \
> >                       phc2sys -c $ITF_NAME -s CLOCK_REALTIME -O 37 -m 1>/dev/null
> >               if [ $? -eq 255 ]; then
> >                       timeouts=$((timeouts + 1))
> >               fi
> >       done
> >       printf "\tnumber of timeouts in %s seconds: %d\n" $LIMIT_SECONDS
> $timeouts
> >   }
> >
> >   record_multiple_timeout_param $NON_ISOLCPU_MASK
> >
> > Firmware version for the cards::
> >
> >   # lshw -class network -json | jq '.[0].product,.[0].configuration.firmware'
> >   "Ethernet Controller I225-LM"
> >   "1057:8754"
> >
> >   # lshw -class network -json | jq '.[2].product,.[2].configuration.firmware'
> >   "Ethernet Controller I225-LM"
> >   "1057:8754
> >
> > A couple of attempts were made that did not lead to solving the
> > issue (disabling ASPM in kernel and boot, using periodic tick), and a couple
> > of solutions that worked but that were subpar:
> >
> > 1. The issue was not observed for a phc2sys(8) running on a fully
> >    isolated nohz_full core. We do not have the luxury of dedicating a a
> >    core for it.
> 
> This one is interesting. Was it because the isolated CPU never got to
> sleep and readx_poll_timeout() became a busy loop? I am trying to fit
> this one on my mental model.

Hello Vinicius! thank you for the reply.
Actually, in the isolation case, it is the opposite! the CPU always sleeps.
Which is good in this scenario.. More specifically, this is because the 
request always arrives 'quickly' in the device with the rd32() after 
readx_poll_timeout(). As long as the rd32() is 'quick', you have time to sleep.

This 'quickly' reference is about the 'power-saving' hypothesis I mention
below. Since the CPU is isolated, it is never interrupted, so I do not observe
the 'slow' rd32() in readx_poll_timeout().

In the end, the idea of this patch is to insert the write flush for avoiding this 
'quickly' or 'slowly' rd32() inside readx_poll_timeout().

> 
> > 2. Bumping the IGC_PTM_STAT_TIMEOUT value. Other machines may need
> >    another value though.
> 
> This one is not horrible as well. Some value like 400us comes to mind.
> 
> Your proposed fix looks fine as well, I was thinking a bit about it, and
> even if it's more a like a timing issue than "the absolutely correct
> fix", there are little chances for bad side effects.
> 
> If there aren't any other ideas, I am fine with this:
> 
> Acked-by: Vinicius Costa Gomes <vinicius.gomes@...el.com>

Thank you for the Ack!
I can test with a 400us. It should work fine.

(I talked with some of your intel colleagues before sending the patch, and there is 
supposed to be a patch response here with (another?) PTM solution. We will see :)

> 
> > 3. Retry (multiple times) readx_poll_timeout() in case of failure. This may
> >    significantly increase the function latency, because each
> >    readx_poll_timeout() can take more than 100 us.
> > 4. Disabling preemption during the PTM request. Horrible.
> >
> 
> I wonder if you tried moving the I225 to the graphics card PCIe slot, or
> is it an onboard card? IIRC there are different PCIe controllers for the
> graphics slot vs. the other ones, at least on Coffee Lake.
> 

I was only able to come to the lab yesterday, hence why I did not answer before.

It is true, I was not using the graphic card slot. So, I changed from the PCIe root
1c.4/1c.7 to 01 (for both machines)

I'm running the 24-hour test again.  Here are the results

For Raptor Lake (13700), PCIe root 01 (width x16, 32GT/s)::
    # dmesg | grep "Timeout reading IGC_PTM_STAT register" | wc -l
    514
    $ ps -eo user,pid,lstart,comm | grep 1797
    root        1797 Mon Jul 15 11:32:03 2024 ptm_timeout.sh
    $ date 
    Tue Jul 16 10:02:48 AM CEST 2024

514 timeouts for ~23h. The issue persists.

For Coffee Lake (8500), I was surprised to see this::
    # dmesg | grep PTM
    [    0.412210] pci 0000:00:1c.0: PTM enabled (root), 4ns granularity
    [    0.413769] pci 0000:00:1c.7: PTM enabled (root), 4ns granularity
    [    0.415250] pci 0000:00:1d.0: PTM enabled (root), 4ns granularity
    [    1.353468] igc 0000:01:00.0: PCIe PTM not supported by PCIe bus/controller

It seems PCIe root 01 does not support PTM.

There is something else I want to mention. We aim to support redundancy mode
for AVB, where we will likely use two PCIe slots; so, one of them would not use the
graphic card slot.

> > For the Coffee Lake machine, the issue tends to be avoided because the
> > read does not take so long. Here is basically the same trace using the
> > Cofee Lake machine::
> >
> >   phc2sys-1204  [002]  1778.220288: funcgraph_entry:                    |
> igc_ptp_getcrosststamp() {
> >   phc2sys-1204  [002]  1778.220290: funcgraph_entry:                    |
> igc_phc_get_syncdevicetime() {
> >   phc2sys-1204  [002]  1778.220291: funcgraph_entry:                    |      igc_rd32()
> {
> >   phc2sys-1204  [002]  1778.220373: preempt_disable:
> caller=irq_enter_rcu+0x14 parent=irq_enter_rcu+0x14
> >   phc2sys-1204  [002]  1778.220374: local_timer_entry: vector=236
> >   phc2sys-1204  [002]  1778.220375: hrtimer_cancel:
> hrtimer=0xffff894027326118
> >   phc2sys-1204  [002]  1778.220376: hrtimer_expire_entry:
> hrtimer=0xffff894027326118 now=1778228034802
> function=tick_nohz_handler/0x0
> >
> >   ... tick handler ...
> >
> >   phc2sys-1204  [002]  1778.220411: funcgraph_exit:       !  119.843 us |      }
> >   phc2sys-1204  [002]  1778.220412: funcgraph_entry:                    |      igc_rd32()
> {
> >   phc2sys-1204  [002]  1778.220492: funcgraph_exit:       + 80.094 us   |      }
> >   phc2sys-1204  [002]  1778.220493: funcgraph_entry:        2.951 us    |
> igc_rd32();
> >   phc2sys-1204  [002]  1778.220497: hrtimer_init: hrtimer=0xffffa504c0d83aa0
> clockid=CLOCK_MONOTONIC mode=0x0
> >   phc2sys-1204  [002]  1778.220498: preempt_disable:
> caller=_raw_spin_lock_irqsave+0x28 parent=hrtimer_start_range_ns+0x56
> >   phc2sys-1204  [002]  1778.220499: hrtimer_start:
> hrtimer=0xffffa504c0d83aa0 function=hrtimer_wakeup/0x0
> expires=1778228158866 softexpires=1778228157866
> >
> >   ... timer setup ....
> >
> >   phc2sys-1204  [002]  1778.220509: preempt_enable:
> caller=_raw_spin_unlock_irqrestore+0x2b parent=hrtimer_start_range_ns+0x12d
> >   phc2sys-1204  [002]  1778.220511: funcgraph_entry:        7.338 us   |
> igc_rd32();
> >   phc2sys-1204  [002]  1778.220519: funcgraph_entry:        2.769 us   |
> igc_rd32();
> >   phc2sys-1204  [002]  1778.220522: funcgraph_entry:        2.798 us   |
> igc_rd32();
> >   phc2sys-1204  [002]  1778.220525: funcgraph_entry:        2.736 us   |
> igc_rd32();
> >   phc2sys-1204  [002]  1778.220529: funcgraph_entry:        2.750 us   |
> igc_rd32();
> >   phc2sys-1204  [002]  1778.220532: funcgraph_exit:       !  242.656 us |    }
> >
> > For both machines, I observed that the first igc_rd32() after an idle
> > period (more than 10us) tends to take significantly more time. I assume
> > this is a hardware power-saving technique, but I could not find any
> > mention in the manuals. This is very easily observable with an idle
> > system running phc2sys, since it will request only once every second.
> >
> > This is the typical trace of the operation::
> >
> >   phc2sys-1204  [002]  1749.209397: funcgraph_entry:                   |
> igc_ptp_getcrosststamp() {
> >   phc2sys-1204  [002]  1749.209398: funcgraph_entry:                   |
> igc_phc_get_syncdevicetime() {
> >   phc2sys-1204  [002]  1749.209400: funcgraph_entry:      + 81.491 us  |
> igc_rd32();
> >   phc2sys-1204  [002]  1749.209482: funcgraph_entry:        3.691 us   |
> igc_rd32();
> >   phc2sys-1204  [002]  1749.209487: funcgraph_entry:        2.942 us   |
> igc_rd32();
> >   phc2sys-1204  [002]  1749.209490: hrtimer_init: hrtimer=0xffffa504c0d83a00
> clockid=CLOCK_MONOTONIC mode=0x0
> >   phc2sys-1204  [002]  1749.209491: preempt_disable:
> caller=_raw_spin_lock_irqsave+0x28 parent=hrtimer_start_range_ns+0x56
> >   ... timer setup and it goes on like before ...
> >
> > The preemption needs to happen for this issue, so that this power-saving
> > mode is triggered, making the igc_rd32 'slow enough' so that all of the
> > timeout is consumed before the card has time to answer.
> >
> > I believe flushing the write solves the issue definitely, since the
> > write should be completed before the timeout has started. So that, even
> > if the timeout is consumed by a slow read operation, the write has been
> > received before and the card had time to process the request.
> > ---
> >  drivers/net/ethernet/intel/igc/igc_ptp.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/intel/igc/igc_ptp.c
> b/drivers/net/ethernet/intel/igc/igc_ptp.c
> > index 1bb026232efc..d7269e4f1a21 100644
> > --- a/drivers/net/ethernet/intel/igc/igc_ptp.c
> > +++ b/drivers/net/ethernet/intel/igc/igc_ptp.c
> > @@ -1005,6 +1005,10 @@ static int igc_phc_get_syncdevicetime(ktime_t
> *device,
> >                * VALID bit.
> >                */
> >               wr32(IGC_PTM_STAT, IGC_PTM_STAT_VALID);
> > +             /* Ensure the hardware receives the ptm request before the
> > +              * response timeout starts.
> > +              */
> > +             wrfl();
> >
> >               err = readx_poll_timeout(rd32, IGC_PTM_STAT, stat,
> >                                        stat, IGC_PTM_STAT_SLEEP,
> >
> > ---
> > base-commit: 0005b2dc43f96b93fc5b0850d7ca3f7aeac9129c
> > change-id: 20240705-igc-flush-ptm-request-before-timeout-6-10-f6e02c96f6d4
> >
> > Best regards,
> > --
> > Rodrigo Cataldo <rodrigo.cadore@...coustics.com>
> >
> >
> 
> 
> Cheers,
> --
> Vinicius