linux-kernel - Re: [PATCH iwl-net 0/4] igb: fix igb_msix_other() handling for PREEMPT

Message-ID: <CAAq0SUnoS45Fctkzj4t4OxT=9qm9Bg8zu79=S3DUL_jcoLbC-A@mail.gmail.com>
Date: Thu, 9 Jan 2025 13:46:47 -0300
From: Wander Lairson Costa <wander@...hat.com>
To: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc: Tony Nguyen <anthony.l.nguyen@...el.com>, 
	Przemek Kitszel <przemyslaw.kitszel@...el.com>, Andrew Lunn <andrew+netdev@...n.ch>, 
	"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, 
	Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, 
	Clark Williams <clrkwllms@...nel.org>, Steven Rostedt <rostedt@...dmis.org>, 
	Jeff Garzik <jgarzik@...hat.com>, Auke Kok <auke-jan.h.kok@...el.com>, 
	"moderated list:INTEL ETHERNET DRIVERS" <intel-wired-lan@...ts.osuosl.org>, 
	"open list:NETWORKING DRIVERS" <netdev@...r.kernel.org>, open list <linux-kernel@...r.kernel.org>, 
	"open list:Real-time Linux (PREEMPT_RT):Keyword:PREEMPT_RT" <linux-rt-devel@...ts.linux.dev>
Subject: Re: [PATCH iwl-net 0/4] igb: fix igb_msix_other() handling for PREEMPT_RT

On Wed, Jan 8, 2025 at 7:25 AM Sebastian Andrzej Siewior
<bigeasy@...utronix.de> wrote:
>
> On 2025-01-07 15:52:47 [-0300], Wander Lairson Costa wrote:
> > On Tue, Jan 07, 2025 at 02:51:06PM +0100, Sebastian Andrzej Siewior wrote:
> > > On 2024-12-04 08:42:23 [-0300], Wander Lairson Costa wrote:
> > > > This is the second attempt at fixing the behavior of igb_msix_other()
> > > > for PREEMPT_RT. The previous attempt [1] was reverted [2] following
> > > > concerns raised by Sebastian [3].
> > > >
> > > > The initial approach proposed converting vfs_lock to a raw_spinlock,
> > > > a minor change intended to make it safe. However, it became evident
> > > > that igb_rcv_msg_from_vf() invokes kcalloc with GFP_ATOMIC,
> > > > which is unsafe in interrupt context on PREEMPT_RT systems.
> > > >
> > > > To address this, the solution involves splitting igb_msg_task()
> > > > into two parts:
> > > >
> > > >     * One part invoked from the IRQ context.
> > > >     * Another part called from the threaded interrupt handler.
> > > >
> > > > To accommodate this, vfs_lock has been restructured into a double
> > > > lock: a spinlock_t and a raw_spinlock_t. In the revised design:
> > > >
> > > >     * igb_disable_sriov() locks both spinlocks.
> > > >     * Each part of igb_msg_task() locks the appropriate spinlock for
> > > >     its execution context.
> > >
> > > - Is this limited to PREEMPT_RT or does it also occur on PREEMPT systems
> > >   with threadirqs? And if this is PREEMPT_RT only, why?
> >
> > PREEMPT systems configured to use threadirqs should be affected as well,
> > although I have never tested this configuration. Honestly, until now I wasn't
> > aware that a non-PREEMPT_RT kernel could run with threaded IRQs by default.
>
> If the issue is indeed the use of threaded interrupts, then the fix
> should not be limited to PREEMPT_RT.
>
Although I was not aware of this scenario, the patch should work for it as well,
as I am forcing the handler to run in interrupt context. I will test it to confirm.

> > > - What causes the failure? I see you reworked into two parts to behave
> > >   similar to what happens without threaded interrupts. There is still no
> > >   explanation for it. Is there a timing limit or was there another
> > >   register operation which removed the mailbox message?
> > >
> >
> > I explained the root cause of the issue in the last commit. Maybe I should
> > have added the explanation to the cover letter as well.  Anyway, here is a
> > partial verbatim copy of it:
> >
> > "During testing of SR-IOV, Red Hat QE encountered an issue where the
> > ip link up command intermittently fails for the igbvf interfaces when
> > using the PREEMPT_RT variant. Investigation revealed that
> > e1000_write_posted_mbx returns an error due to the lack of an ACK
> > from e1000_poll_for_ack.
>
> Would that ACK have arrived if it had polled longer?
>
No, the interrupt wouldn't be serviced while polling.
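
As a toy illustration of why a longer timeout does not help, here is a
user-space model (not driver code; every name in it is made up for the
example). It assumes the ACK can only appear once the interrupt handler
has run, and that the handler cannot run while the poller is
non-preemptible:

```c
#include <stdbool.h>

/* Hypothetical stand-in for the mailbox state; not the driver's struct. */
struct mbx_model {
	bool ack;              /* set only by the modeled interrupt handler */
	int  irq_runs_at_step; /* poll step at which the handler wants to run */
};

/* The handler only gets the CPU if the poller can be preempted. */
void model_maybe_run_irq(struct mbx_model *m, int step, bool preemptible)
{
	if (preemptible && m->irq_runs_at_step == step)
		m->ack = true;
}

/* Bounded poll: returns 0 once the ACK is seen, -1 on timeout. */
int model_poll_for_ack(struct mbx_model *m, int max_steps, bool preemptible)
{
	for (int step = 0; step < max_steps; step++) {
		model_maybe_run_irq(m, step, preemptible);
		if (m->ack)
			return 0;
	}
	return -1;
}
```

With preemptible set to false, the model times out no matter how large
max_steps is, which is the behaviour I was describing.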

> > The underlying issue arises from the fact that IRQs are threaded by
> > default under PREEMPT_RT. While the exact hardware details are not
> > available, it appears that the IRQ handled by igb_msix_other must
> > be processed before e1000_poll_for_ack times out. However,
> > e1000_write_posted_mbx is called with preemption disabled, leading
> > to a scenario where the IRQ is serviced only after the failure of
> > e1000_write_posted_mbx."
>
> Where is this disabled preemption coming from? This should be one of the
> ops.write_posted() calls, right? I've been looking around and don't see
> anything obvious.

I don't remember if I found the answer by looking at the code or by
looking at the ftrace flags.
I am currently on sick leave with covid. I can check it when I come back.

> Couldn't you wait for an event instead of polling?
>
> > The call chain from igb_msg_task():
> >
> > igb_msg_task
> >       igb_rcv_msg_from_vf
> >               igb_set_vf_multicasts
> >                       igb_set_rx_mode
> >                               igb_write_mc_addr_list
> >                                       kcalloc
> >
> > This cannot happen from interrupt context under PREEMPT_RT, so this part of
> > the interrupt handler is deferred to a threaded IRQ handler.
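
The split described above can be sketched in user-space C. This is only
a model under stated assumptions, not the patch itself: all names are
hypothetical, the hard-IRQ part is assumed to do only the
non-allocating work (here, acknowledging the message), and calloc()
stands in for the allocation at the bottom of the call chain.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Modeled return codes; the names only mimic the kernel's convention. */
enum model_irqreturn { MODEL_IRQ_HANDLED, MODEL_IRQ_WAKE_THREAD };

/* Hypothetical per-VF state shared by both halves of the handler. */
struct model_vf_state {
	bool ack_sent;           /* done in the hard-IRQ part */
	bool msg_pending;        /* work handed to the threaded part */
	size_t mc_count;         /* number of multicast entries requested */
	unsigned char *mta_list; /* allocated only in the threaded part */
};

/* Hard-IRQ part: no allocation, nothing that may sleep. */
enum model_irqreturn model_hardirq_part(struct model_vf_state *s,
					bool needs_alloc)
{
	s->ack_sent = true;
	if (!needs_alloc)
		return MODEL_IRQ_HANDLED;
	s->msg_pending = true;
	return MODEL_IRQ_WAKE_THREAD;
}

/* Threaded part: allowed to allocate. Returns 0 on success, -1 on failure. */
int model_thread_part(struct model_vf_state *s)
{
	if (!s->msg_pending)
		return 0;
	free(s->mta_list);
	s->mta_list = NULL;
	s->msg_pending = false;
	if (s->mc_count == 0)
		return 0;
	s->mta_list = calloc(s->mc_count, 6); /* 6 bytes per MAC address */
	return s->mta_list ? 0 : -1;
}
```

In this model the two halves touch different fields, which is roughly
why each half can take its own lock while a teardown path would have to
take both.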
> >
> > > > Cheers,
> > > > Wander
>
> Sebastian
>