netdev - Re: [PATCH net 0/4][pull request] igb: fix igb_msix_other() handling for PREEMPT

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <mrw3tpwsravsaibkcpptdkko3ff6qtk6w6ernqvjisk4l7owok@q6hmxkzcdkey>
Date: Wed, 5 Feb 2025 17:08:35 -0300
From: Wander Lairson Costa <wander@...hat.com>
To: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc: Tony Nguyen <anthony.l.nguyen@...el.com>, davem@...emloft.net, 
	kuba@...nel.org, pabeni@...hat.com, edumazet@...gle.com, andrew+netdev@...n.ch, 
	netdev@...r.kernel.org, rostedt@...dmis.org, clrkwllms@...nel.org, jgarzik@...hat.com, 
	yuma@...hat.com, linux-rt-devel@...ts.linux.dev
Subject: Re: [PATCH net 0/4][pull request] igb: fix igb_msix_other() handling
 for PREEMPT_RT

On Wed, Feb 05, 2025 at 10:48:18AM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-02-04 09:52:36 [-0800], Tony Nguyen wrote:
> > Wander Lairson Costa says:
> > 
> > This is the second attempt at fixing the behavior of igb_msix_other()
> > for PREEMPT_RT. The previous attempt [1] was reverted [2] following
> > concerns raised by Sebastian [3].
> 
> I still prefer a solution where we don't have the ifdef in the driver. I
> was presented two traces but I didn't get why it works in once case but
> not in the other. Maybe it was too obvious.

Copying the traces here for further explanation. Both cases are for
PREEMPT_RT.

Failure case:

kworker/-86      0...1    85.381866: function:                   igbvf_reset
kworker/-86      0...2    85.381866: function:                      e1000_reset_hw_vf
kworker/-86      0...2    85.381867: function:                         e1000_check_for_rst_vf
kworker/-86      0...2    85.381868: function:                         e1000_write_posted_mbx
kworker/-86      0...2    85.381868: function:                            e1000_write_mbx_vf
kworker/-86      0...2    85.381870: function:                            e1000_check_for_ack_vf // repeats for 2000 lines
...
kworker/-86      0.N.2    86.393782: function:                         e1000_read_posted_mbx
kworker/-86      0.N.2    86.398606: function:                      e1000_init_hw_vf
kworker/-86      0.N.2    86.398606: function:                         e1000_rar_set_vf
kworker/-86      0.N.2    86.398606: function:                            e1000_write_posted_mbx
irq/65-e-1287    0d..1    86.398609: function:             igb_msix_other
irq/65-e-1287    0d..1    86.398609: function:                igb_rd32
irq/65-e-1287    0d..2    86.398610: function:                igb_check_for_rst
irq/65-e-1287    0d..2    86.398610: function:                igb_check_for_rst_pf
irq/65-e-1287    0d..2    86.398610: function:                   igb_rd32
irq/65-e-1287    0d..2    86.398611: function:                igb_check_for_msg
irq/65-e-1287    0d..2    86.398611: function:                igb_check_for_msg_pf
irq/65-e-1287    0d..2    86.398611: function:                   igb_rd32
irq/65-e-1287    0d..2    86.398612: function:                igb_rcv_msg_from_vf
irq/65-e-1287    0d..2    86.398612: function:                   igb_read_mbx
irq/65-e-1287    0d..2    86.398612: function:                   igb_read_mbx_pf
irq/65-e-1287    0d..2    86.398612: function:                      igb_obtain_mbx_lock_pf
irq/65-e-1287    0d..2    86.398612: function:                         igb_rd32

In the above trace, observe that the ISR igb_msix_other() is only
scheduled after e1000_write_posted_mbx() fails due to a timeout.
The interrupt handler should run during the looping calls to
e1000_check_for_ack_vf(), but it is not scheduled because
preemption is disabled.

Note that e1000_check_for_ack_vf() is called 2000 times before
it finally times out.

Sucessful case:

      ip-5603    0...1  1884.710747: function:             igbvf_reset
      ip-5603    0...2  1884.710754: function:                e1000_reset_hw_vf
      ip-5603    0...2  1884.710755: function:                   e1000_check_for_rst_vf
      ip-5603    0...2  1884.710756: function:                   e1000_write_posted_mbx
      ip-5603    0...2  1884.710756: function:                      e1000_write_mbx_vf
      ip-5603    0...2  1884.710758: function:                      e1000_check_for_ack_vf
      ip-5603    0d.h2  1884.710760: function:             igb_msix_other
      ip-5603    0d.h2  1884.710760: function:                igb_rd32
      ip-5603    0d.h3  1884.710761: function:                igb_check_for_rst
      ip-5603    0d.h3  1884.710761: function:                igb_check_for_rst_pf
      ip-5603    0d.h3  1884.710761: function:                   igb_rd32
      ip-5603    0d.h3  1884.710762: function:                igb_check_for_msg
      ip-5603    0d.h3  1884.710762: function:                igb_check_for_msg_pf
      ip-5603    0d.h3  1884.710762: function:                   igb_rd32
      ip-5603    0d.h3  1884.710763: function:                igb_rcv_msg_from_vf
      ip-5603    0d.h3  1884.710763: function:                   igb_read_mbx
      ip-5603    0d.h3  1884.710763: function:                   igb_read_mbx_pf
      ip-5603    0d.h3  1884.710763: function:                      igb_obtain_mbx_lock_pf
      ip-5603    0d.h3  1884.710763: function:                         igb_rd32

Since we forced the interrupt context for igb_msix_other(), it now
runs immediately while the driver checks for an acknowledgment via
e1000_check_for_ack_vf().

> In the mean time:
> 
> igb_msg_task_irq_safe()
> -> vfs_raw_spin_lock_irqsave() // raw_spinlock_t
> -> igb_vf_reset_event()
>   -> igb_vf_reset()
>     -> igb_set_rx_mode()
>       -> igb_write_mc_addr_list()
>          -> mta_list = kcalloc(netdev_mc_count(netdev), 6, GFP_ATOMIC); // kaboom?

Perhaps the solution is to preallocate this buffer, if possible.
Doing so would significantly simplify the patch. However, this
would require knowing when the multicast (mc) count changes.
I attempted to identify this but have not succeeded so far.

> 
> By explicitly disabling preemption or using a raw_spinlock_t you need to
> pay attention not to do anything that might lead to unbounded loops
> (like iterating over many lists, polling on a bit for ages, …) and
> paying attention that the whole API underneath that it is not doing that
> is allowed to.

I unsure if I understood what you are trying to say.

> 
> Sebastian
>