Message-ID: <zy3irjybyc32hnow3ckhkfsrtfm5nev44aeovinlkkfc6tyyjv@gcblibp5ng3o>
Date: Wed, 12 Feb 2025 08:56:47 -0300
From: Wander Lairson Costa <wander@...hat.com>
To: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc: Tony Nguyen <anthony.l.nguyen@...el.com>, davem@...emloft.net, 
	kuba@...nel.org, pabeni@...hat.com, edumazet@...gle.com, andrew+netdev@...n.ch, 
	netdev@...r.kernel.org, rostedt@...dmis.org, clrkwllms@...nel.org, jgarzik@...hat.com, 
	yuma@...hat.com, linux-rt-devel@...ts.linux.dev
Subject: Re: [PATCH net 0/4][pull request] igb: fix igb_msix_other() handling
 for PREEMPT_RT

On Thu, Feb 06, 2025 at 12:59:14PM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-02-05 17:08:35 [-0300], Wander Lairson Costa wrote:
> > On Wed, Feb 05, 2025 at 10:48:18AM +0100, Sebastian Andrzej Siewior wrote:
> > > On 2025-02-04 09:52:36 [-0800], Tony Nguyen wrote:
> > > > Wander Lairson Costa says:
> > > > 
> > > > This is the second attempt at fixing the behavior of igb_msix_other()
> > > > for PREEMPT_RT. The previous attempt [1] was reverted [2] following
> > > > concerns raised by Sebastian [3].
> > > 
> > > I still prefer a solution where we don't have the ifdef in the driver. I
> > > was presented two traces, but I didn't get why it works in one case but
> > > not in the other. Maybe it was too obvious.
> > 
> > Copying the traces here for further explanation. Both cases are for
> > PREEMPT_RT.
> > 
> > Failure case:
> > 
> > kworker/-86      0...1    85.381866: function:                   igbvf_reset
> > kworker/-86      0...2    85.381866: function:                      e1000_reset_hw_vf
> > kworker/-86      0...2    85.381867: function:                         e1000_check_for_rst_vf
> > kworker/-86      0...2    85.381868: function:                         e1000_write_posted_mbx
> > kworker/-86      0...2    85.381868: function:                            e1000_write_mbx_vf
> > kworker/-86      0...2    85.381870: function:                            e1000_check_for_ack_vf // repeats for 2000 lines
> 
> So it repeats because it waits for the bit. It waits for the interrupts.
> 
> > ...
> > kworker/-86      0.N.2    86.393782: function:                         e1000_read_posted_mbx
> 
> Is this 2 the migrate-disable or preempt-disable counter? Because you
> should get preempted based on that N.
> 
> > kworker/-86      0.N.2    86.398606: function:                      e1000_init_hw_vf
> > kworker/-86      0.N.2    86.398606: function:                         e1000_rar_set_vf
> > kworker/-86      0.N.2    86.398606: function:                            e1000_write_posted_mbx
> > irq/65-e-1287    0d..1    86.398609: function:             igb_msix_other
> 
> So the kworker leaves and immediately the interrupt gets on the CPU.
> 
> > irq/65-e-1287    0d..1    86.398609: function:                igb_rd32
> > irq/65-e-1287    0d..2    86.398610: function:                igb_check_for_rst
> > irq/65-e-1287    0d..2    86.398610: function:                igb_check_for_rst_pf
> > irq/65-e-1287    0d..2    86.398610: function:                   igb_rd32
> > irq/65-e-1287    0d..2    86.398611: function:                igb_check_for_msg
> > irq/65-e-1287    0d..2    86.398611: function:                igb_check_for_msg_pf
> > irq/65-e-1287    0d..2    86.398611: function:                   igb_rd32
> > irq/65-e-1287    0d..2    86.398612: function:                igb_rcv_msg_from_vf
> > irq/65-e-1287    0d..2    86.398612: function:                   igb_read_mbx
> > irq/65-e-1287    0d..2    86.398612: function:                   igb_read_mbx_pf
> > irq/65-e-1287    0d..2    86.398612: function:                      igb_obtain_mbx_lock_pf
> > irq/65-e-1287    0d..2    86.398612: function:                         igb_rd32
> > 
> > In the above trace, observe that the ISR igb_msix_other() is only
> > scheduled after e1000_write_posted_mbx() fails due to a timeout.
> > The interrupt handler should run during the looping calls to
> > e1000_check_for_ack_vf(), but it is not scheduled because
> > preemption is disabled.
> 
> What disables preemption? On PREEMPT_RT the spin_lock() does not disable
> preemption. You shouldn't spin that long. When was the interrupt scheduled?
> _Why_ is the interrupt delayed that long?
> 
When I was using trace-cmd report -l, it omitted some fields, one of them
being preempt-lazy-depth (which was new to me), and it seems this is what
is affecting the interrupt. It comes from here [1]. I had the logs, but the
machine went under maintenance before I could save them. Once it comes
back, I can grab them and post them here.

[1] https://elixir.bootlin.com/linux/v6.13.2/source/drivers/net/ethernet/intel/igbvf/netdev.c#L1522
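
For reference, here is a minimal tracefs sketch of what I intend to capture
once the machine is back (a sketch only, assuming the usual
/sys/kernel/tracing layout; which extra latency-format columns appear
depends on the kernel config), so counters like preempt-lazy-depth and
migrate-disable are visible directly in the raw trace instead of being
dropped by trace-cmd report -l:

    # minimal sketch, assuming tracefs is mounted at /sys/kernel/tracing
    cd /sys/kernel/tracing
    echo 0 > tracing_on
    echo function > current_tracer
    # limit tracing to the reset/mailbox paths seen in the traces above
    echo 'igbvf_* e1000_* igb_*' > set_ftrace_filter
    # ask for the latency-format columns (irqs-off, need-resched, preempt
    # depth; RT kernels may add the lazy/migrate-disable counters)
    echo 1 > options/latency-format
    echo 1 > tracing_on
    # ... reproduce the failure, then save the raw trace:
    cat trace > /tmp/igb-trace.txt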

> > Note that e1000_check_for_ack_vf() is called 2000 times before
> > it finally times out.
> 
> Exactly.
> 
> > Successful case:
> > 
> >       ip-5603    0...1  1884.710747: function:             igbvf_reset
> >       ip-5603    0...2  1884.710754: function:                e1000_reset_hw_vf
> >       ip-5603    0...2  1884.710755: function:                   e1000_check_for_rst_vf
> >       ip-5603    0...2  1884.710756: function:                   e1000_write_posted_mbx
> >       ip-5603    0...2  1884.710756: function:                      e1000_write_mbx_vf
> >       ip-5603    0...2  1884.710758: function:                      e1000_check_for_ack_vf
> >       ip-5603    0d.h2  1884.710760: function:             igb_msix_other
> >       ip-5603    0d.h2  1884.710760: function:                igb_rd32
> >       ip-5603    0d.h3  1884.710761: function:                igb_check_for_rst
> >       ip-5603    0d.h3  1884.710761: function:                igb_check_for_rst_pf
> >       ip-5603    0d.h3  1884.710761: function:                   igb_rd32
> >       ip-5603    0d.h3  1884.710762: function:                igb_check_for_msg
> >       ip-5603    0d.h3  1884.710762: function:                igb_check_for_msg_pf
> >       ip-5603    0d.h3  1884.710762: function:                   igb_rd32
> >       ip-5603    0d.h3  1884.710763: function:                igb_rcv_msg_from_vf
> >       ip-5603    0d.h3  1884.710763: function:                   igb_read_mbx
> >       ip-5603    0d.h3  1884.710763: function:                   igb_read_mbx_pf
> >       ip-5603    0d.h3  1884.710763: function:                      igb_obtain_mbx_lock_pf
> >       ip-5603    0d.h3  1884.710763: function:                         igb_rd32
> > 
> > Since we forced the interrupt context for igb_msix_other(), it now
> > runs immediately while the driver checks for an acknowledgment via
> > e1000_check_for_ack_vf().
> 
> Is this still RT or non-RT? I'm asking because igbvf_reset() is invoked
> in ip's context and not in a worker. Also igb_msix_other() runs with an
> 'h', so it is not threaded.
> 
> I have a theory of my own, mind testing the following
> 
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
> index d368b753a4675..6fe37b8001c36 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -912,7 +912,7 @@ static int igb_request_msix(struct igb_adapter *adapter)
>  	struct net_device *netdev = adapter->netdev;
>  	int i, err = 0, vector = 0, free_vector = 0;
>  
> -	err = request_irq(adapter->msix_entries[vector].vector,
> +	err = request_threaded_irq(adapter->msix_entries[vector].vector, NULL,
>  			  igb_msix_other, 0, netdev->name, adapter);
>  	if (err)
>  		goto err_out;
> 
> just to see if it solves the problem?
> 
I have two test cases:

1) Boot the machine with nr_cpus=1. The driver continuously reports the "PF
still resetting" message. With your patch, this issue is gone.

2) Run the following script:

    ipaddr_vlan=3
    nic_test=ens14f0
    vf=${nic_test}v0
    # The main testing steps:
    while true; do
        ip link set ${nic_test} mtu 1500
        ip link set ${vf} mtu 1500
        ip link set $vf up
        # 3. set vlan and ip for VF
        ip link set ${nic_test} vf 0 vlan ${ipaddr_vlan}
        ip addr add 172.30.${ipaddr_vlan}.1/24 dev ${vf}
        ip addr add 2021:db8:${ipaddr_vlan}::1/64 dev ${vf}
        # 4. check the link state for VF and PF
        ip link show ${nic_test}
        if ! ip link show $vf | grep 'state UP'; then
            echo 'Error found'
            break
        fi
        ip link set $vf down
    done

This one eventually fails. It is the first time that one test works while
the other fails; so far, it has been all or nothing. I haven't had time yet
to investigate why this happens.
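
When I retest, I will use a small variation of the loop (a sketch only; the
same ip commands as above, just with an iteration counter and a timestamp so
the eventual failure can be lined up with the trace):

    i=0
    while true; do
        i=$((i + 1))
        # ... same mtu/up/vlan/addr commands as in the script above ...
        ip link show ${nic_test}
        if ! ip link show $vf | grep 'state UP'; then
            echo "Error found on iteration $i at $(date +%s.%N)"
            dmesg | tail -n 20
            break
        fi
        ip link set $vf down
    done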

> > 
> > > In the mean time:
> > > 
> > > igb_msg_task_irq_safe()
> > > -> vfs_raw_spin_lock_irqsave() // raw_spinlock_t
> > > -> igb_vf_reset_event()
> > >   -> igb_vf_reset()
> > >     -> igb_set_rx_mode()
> > >       -> igb_write_mc_addr_list()
> > >          -> mta_list = kcalloc(netdev_mc_count(netdev), 6, GFP_ATOMIC); // kaboom?
> > 
> > Perhaps the solution is to preallocate this buffer, if possible.
> > Doing so would significantly simplify the patch. However, this
> > would require knowing when the multicast (mc) count changes.
> > I attempted to identify this but have not succeeded so far.
> > 
> > > 
> > > By explicitly disabling preemption or using a raw_spinlock_t you need to
> > > pay attention not to do anything that might lead to unbounded loops
> > > (like iterating over many lists, polling on a bit for ages, …) and to
> > > pay attention that the whole API underneath is not doing anything that
> > > is not allowed either.
> > 
> > I am unsure if I understood what you are trying to say.
> 
> The moment you start disabling preemption / using raw_spinlock_t you need
> to start thinking about everything underneath / everything within this
> section. Whereas if you keep using spinlock_t you don't have to worry
> *that* much about whether *this* will break PREEMPT_RT, whether or not it
> is okay to allocate memory or to call this function because it might
> break RT.
> Or: if netdev_mc_count() returns 1 you loop once later and it costs you
> 1us. If it returns 100, you loop 100 times and it costs how much
> additional time?
> 
> Sebastian
> 

