Message-ID: <20250206115914.VfzGTwD8@linutronix.de>
Date: Thu, 6 Feb 2025 12:59:14 +0100
From: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
To: Wander Lairson Costa <wander@...hat.com>
Cc: Tony Nguyen <anthony.l.nguyen@...el.com>, davem@...emloft.net,
kuba@...nel.org, pabeni@...hat.com, edumazet@...gle.com,
andrew+netdev@...n.ch, netdev@...r.kernel.org, rostedt@...dmis.org,
clrkwllms@...nel.org, jgarzik@...hat.com, yuma@...hat.com,
linux-rt-devel@...ts.linux.dev
Subject: Re: [PATCH net 0/4][pull request] igb: fix igb_msix_other() handling
for PREEMPT_RT
On 2025-02-05 17:08:35 [-0300], Wander Lairson Costa wrote:
> On Wed, Feb 05, 2025 at 10:48:18AM +0100, Sebastian Andrzej Siewior wrote:
> > On 2025-02-04 09:52:36 [-0800], Tony Nguyen wrote:
> > > Wander Lairson Costa says:
> > >
> > > This is the second attempt at fixing the behavior of igb_msix_other()
> > > for PREEMPT_RT. The previous attempt [1] was reverted [2] following
> > > concerns raised by Sebastian [3].
> >
> > I still prefer a solution where we don't have the ifdef in the driver. I
> > was presented with two traces, but I didn't get why it works in one case
> > and not in the other. Maybe it was too obvious.
>
> Copying the traces here for further explanation. Both cases are for
> PREEMPT_RT.
>
> Failure case:
>
> kworker/-86 0...1 85.381866: function: igbvf_reset
> kworker/-86 0...2 85.381866: function: e1000_reset_hw_vf
> kworker/-86 0...2 85.381867: function: e1000_check_for_rst_vf
> kworker/-86 0...2 85.381868: function: e1000_write_posted_mbx
> kworker/-86 0...2 85.381868: function: e1000_write_mbx_vf
> kworker/-86 0...2 85.381870: function: e1000_check_for_ack_vf // repeats for 2000 lines
So it repeats because it waits for the bit. It waits for the interrupt.
> ...
> kworker/-86 0.N.2 86.393782: function: e1000_read_posted_mbx
Is this 2 the migrate-disable or preempt-disable counter? Because you
should get preempted based on that N (the need-resched flag).
> kworker/-86 0.N.2 86.398606: function: e1000_init_hw_vf
> kworker/-86 0.N.2 86.398606: function: e1000_rar_set_vf
> kworker/-86 0.N.2 86.398606: function: e1000_write_posted_mbx
> irq/65-e-1287 0d..1 86.398609: function: igb_msix_other
So the kworker leaves and immediately the interrupt gets on the CPU.
> irq/65-e-1287 0d..1 86.398609: function: igb_rd32
> irq/65-e-1287 0d..2 86.398610: function: igb_check_for_rst
> irq/65-e-1287 0d..2 86.398610: function: igb_check_for_rst_pf
> irq/65-e-1287 0d..2 86.398610: function: igb_rd32
> irq/65-e-1287 0d..2 86.398611: function: igb_check_for_msg
> irq/65-e-1287 0d..2 86.398611: function: igb_check_for_msg_pf
> irq/65-e-1287 0d..2 86.398611: function: igb_rd32
> irq/65-e-1287 0d..2 86.398612: function: igb_rcv_msg_from_vf
> irq/65-e-1287 0d..2 86.398612: function: igb_read_mbx
> irq/65-e-1287 0d..2 86.398612: function: igb_read_mbx_pf
> irq/65-e-1287 0d..2 86.398612: function: igb_obtain_mbx_lock_pf
> irq/65-e-1287 0d..2 86.398612: function: igb_rd32
>
> In the above trace, observe that the ISR igb_msix_other() is only
> scheduled after e1000_write_posted_mbx() fails due to a timeout.
> The interrupt handler should run during the looping calls to
> e1000_check_for_ack_vf(), but it is not scheduled because
> preemption is disabled.
What disables preemption? On PREEMPT_RT the spin_lock() does not disable
preemption. You shouldn't spin that long. When was the interrupt
scheduled? _Why_ is the interrupt delayed that long?
> Note that e1000_check_for_ack_vf() is called 2000 times before
> it finally times out.
Exactly.
> Successful case:
>
> ip-5603 0...1 1884.710747: function: igbvf_reset
> ip-5603 0...2 1884.710754: function: e1000_reset_hw_vf
> ip-5603 0...2 1884.710755: function: e1000_check_for_rst_vf
> ip-5603 0...2 1884.710756: function: e1000_write_posted_mbx
> ip-5603 0...2 1884.710756: function: e1000_write_mbx_vf
> ip-5603 0...2 1884.710758: function: e1000_check_for_ack_vf
> ip-5603 0d.h2 1884.710760: function: igb_msix_other
> ip-5603 0d.h2 1884.710760: function: igb_rd32
> ip-5603 0d.h3 1884.710761: function: igb_check_for_rst
> ip-5603 0d.h3 1884.710761: function: igb_check_for_rst_pf
> ip-5603 0d.h3 1884.710761: function: igb_rd32
> ip-5603 0d.h3 1884.710762: function: igb_check_for_msg
> ip-5603 0d.h3 1884.710762: function: igb_check_for_msg_pf
> ip-5603 0d.h3 1884.710762: function: igb_rd32
> ip-5603 0d.h3 1884.710763: function: igb_rcv_msg_from_vf
> ip-5603 0d.h3 1884.710763: function: igb_read_mbx
> ip-5603 0d.h3 1884.710763: function: igb_read_mbx_pf
> ip-5603 0d.h3 1884.710763: function: igb_obtain_mbx_lock_pf
> ip-5603 0d.h3 1884.710763: function: igb_rd32
>
> Since we forced the interrupt context for igb_msix_other(), it now
> runs immediately while the driver checks for an acknowledgment via
> e1000_check_for_ack_vf().
Is this still RT or non-RT? I'm asking because igbvf_reset() is invoked
in ip's context and not in a worker. Also igb_msix_other() runs with an
'h' flag, so it is not threaded.
I have a theory of my own, mind testing
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index d368b753a4675..6fe37b8001c36 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -912,7 +912,7 @@ static int igb_request_msix(struct igb_adapter *adapter)
struct net_device *netdev = adapter->netdev;
int i, err = 0, vector = 0, free_vector = 0;
- err = request_irq(adapter->msix_entries[vector].vector,
+ err = request_threaded_irq(adapter->msix_entries[vector].vector, NULL,
igb_msix_other, 0, netdev->name, adapter);
if (err)
goto err_out;
just to see if it solves the problem?
>
> > In the mean time:
> >
> > igb_msg_task_irq_safe()
> > -> vfs_raw_spin_lock_irqsave() // raw_spinlock_t
> > -> igb_vf_reset_event()
> > -> igb_vf_reset()
> > -> igb_set_rx_mode()
> > -> igb_write_mc_addr_list()
> > -> mta_list = kcalloc(netdev_mc_count(netdev), 6, GFP_ATOMIC); // kaboom?
>
> Perhaps the solution is to preallocate this buffer, if possible.
> Doing so would significantly simplify the patch. However, this
> would require knowing when the multicast (mc) count changes.
> I attempted to identify this but have not succeeded so far.
>
> >
> > By explicitly disabling preemption or using a raw_spinlock_t you need to
> > pay attention not to do anything that might lead to unbounded loops
> > (like iterating over many lists, polling on a bit for ages, …) and
> > paying attention that the whole API underneath that it is not doing that
> > is allowed to.
>
> I'm unsure if I understood what you are trying to say.
The moment you start disabling preemption or using raw_spinlock_t, you
need to start worrying about everything underneath, everything within
that section. If you keep using spinlock_t, you don't have to worry
*that* much about whether *this* will break PREEMPT_RT, whether it is
okay to allocate memory or to call this function.
Or: if netdev_mc_count() returns 1, you loop once later and it costs
you 1us. If it returns 100, you loop 100 times, and how much additional
time does that cost?
Sebastian