lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAA85sZsq4JnatO0TjhN=o6S4adq7pQDC8A5dtRrBKeY9ry0NfQ@mail.gmail.com>
Date: Thu, 27 Feb 2025 15:31:13 +0100
From: Ian Kumlien <ian.kumlien@...il.com>
To: Nikolay Aleksandrov <razor@...ckwall.org>
Cc: Jakub Kicinski <kuba@...nel.org>, 
	Linux Kernel Network Developers <netdev@...r.kernel.org>
Subject: Re: [6.12.15][be2net?] Voluntary context switch within RCU read-side
 critical section!

On Wed, Feb 26, 2025 at 11:28 PM Ian Kumlien <ian.kumlien@...il.com> wrote:
>
> On Wed, Feb 26, 2025 at 2:11 PM Nikolay Aleksandrov <razor@...ckwall.org> wrote:
> >
> > On 2/26/25 14:26, Ian Kumlien wrote:
> > > On Wed, Feb 26, 2025 at 1:00 PM Nikolay Aleksandrov <razor@...ckwall.org> wrote:
> > >>
> > >> On 2/26/25 13:52, Ian Kumlien wrote:
> > >>> On Wed, Feb 26, 2025 at 11:33 AM Nikolay Aleksandrov
> > >>> <razor@...ckwall.org> wrote:
> > >>>>
> > >>>> On 2/26/25 11:55, Ian Kumlien wrote:
> > >>>>> On Wed, Feb 26, 2025 at 10:24 AM Ian Kumlien <ian.kumlien@...il.com> wrote:
> > >>>>>>
> > >>>>>> On Wed, Feb 26, 2025 at 2:05 AM Jakub Kicinski <kuba@...nel.org> wrote:
> > >>>>>>>
> > >>>>>>> On Tue, 25 Feb 2025 11:13:47 +0100 Ian Kumlien wrote:
> > >>>>>>>> Same thing happens in 6.13.4, FYI
> > >>>>>>>
> > >>>>>>> Could you do a minor bisection? Does it not happen with 6.11?
> > >>>>>>> Nothing jumps out at quick look.
> > >>>>>>
> > >>>>>> I have to admint that i haven't been tracking it too closely until it
> > >>>>>> turned out to be an issue
> > >>>>>> (makes network traffic over wireguard, through that node very slow)
> > >>>>>>
> > >>>>>> But i'm pretty sure it was ok in early 6.12.x - I'll try to do a bisect though
> > >>>>>> (it's a gw to reach a internal server network in the basement, so not
> > >>>>>> the best setup for this)
> > >>>>>
> > >>>>> Since i'm at work i decided to check if i could find all the boot
> > >>>>> logs, which is actually done nicely by systemd
> > >>>>> first known bad: 6.11.7-300.fc41.x86_64
> > >>>>> last known ok: 6.11.6-200.fc40.x86_64
> > >>>>>
> > >>>>> Narrows the field for a bisect at least, =)
> > >>>>>
> > >>>>
> > >>>> Saw bridge, took a look. :)
> > >>>>
> > >>>> I think there are multiple issues with benet's be_ndo_bridge_getlink()
> > >>>> because it calls be_cmd_get_hsw_config() which can sleep in multiple
> > >>>> places, e.g. the most obvious is the mutex_lock() in the beginning of
> > >>>> be_cmd_get_hsw_config(), then we have the call trace here which is:
> > >>>> be_cmd_get_hsw_config -> be_mcc_notify_wait -> be_mcc_wait_compl -> usleep_range()
> > >>>>
> > >>>> Maybe you updated some tool that calls down that path along with the kernel and system
> > >>>> so you started seeing it in Fedora 41?
> > >>>
> > >>> Could be but it's pretty barebones
> > >>>
> > >>>> IMO this has been problematic for a very long time, but obviously it depends on the
> > >>>> chip type. Could you share your benet chip type to confirm the path?
> > >>>
> > >>> I don't know how to find the actual chip information but it's identified as:
> > >>> Emulex Corporation OneConnect NIC (Skyhawk) (rev 10)
> > >>>
> > >>
> > >> Good, that confirms it. The skyhawk chip falls in the "else" of the block in
> > >> be_ndo_bridge_getlink() which calls be_cmd_get_hsw_config().
> > >>
> > >>>> For the blamed commit I'd go with:
> > >>>>  commit b71724147e73
> > >>>>  Author: Sathya Perla <sathya.perla@...adcom.com>
> > >>>>  Date:   Wed Jul 27 05:26:18 2016 -0400
> > >>>>
> > >>>>      be2net: replace polling with sleeping in the FW completion path
> > >>>>
> > >>>> This one changed the udelay() (which is safe) to usleep_range() and the spinlock
> > >>>> to a mutex.
> > >>>
> > >>> So, first try will be to try without that patch then, =)
> > >>>
> > >>
> > >> That would be a good try, yes. It is not a straight-forward revert though since a lot
> > >> of changes have happened since that commit. Let me know if you need help with that,
> > >> I can prepare the revert to test.
> > >
> > > Yeah, looked at the size of it and... well... I dunno if i'd have the time =)
> > >
> >
> > Can you try the attached patch?
> > It is on top of net-next (but also applies to Linus' tree):
> >  git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git
> >
> > It partially reverts the mentioned commit above (only mutex -> spinlock and usleep -> udelay)
> > because the commit does many more things.
> >
> > Also +CC original patch author which I forgot to do.
>
> Thanks, built and installed but it refuses to boot it - will have to
> check during the weekend...
> (boots the latest fedora version even if this one is the selected one
> according to grubby)

So, saw that 6.13.5 was released so, fetched that, applied the patch
and no more RCU issues in dmesg

Will check more on the suspected performance bit as well when i get
home later tonight

I also understand Sathya Perla's motivation in saving power on this
but things around it have been changed
and it no longer works as intended....

> > Thanks,
> >  Nik
> >

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ