lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Fri, 13 Nov 2020 14:32:36 -0800
From:   Jacob Keller <jacob.e.keller@...el.com>
To:     Jakub Kicinski <kuba@...nel.org>
Cc:     netdev@...r.kernel.org, Jiri Pirko <jiri@...dia.com>,
        Michael Chan <michael.chan@...adcom.com>,
        Shannon Nelson <snelson@...sando.io>,
        Saeed Mahameed <saeedm@...dia.com>,
        Boris Pismenny <borisp@...dia.com>,
        Bin Luo <luobin9@...wei.com>
Subject: devlink userspace process appears stuck (was: Re: [net-next] devlink:
 move request_firmware out of driver)



On 11/13/2020 1:34 PM, Jacob Keller wrote:
> Well, at least with ice, the experience is pretty bad. I tried out with
> a garbage file name on one of my test systems. This was on a slightly
> older kernel without this patch applied, and the device had a pending
> update that had not yet been finalized with a reset:
> 
> $ devlink dev flash pci/0000:af:00.0 file garbage_file_does_not_exist
> Canceling previous pending update
> 
> 
> The update looks like it got stuck, but actually it failed. Somehow the
> extack error over the socket didn't get handled by the application very
> well. Something buggy in the forked process probably.
> 
> I do get this in the dmesg though:
> 
> Nov 13 13:12:57 jekeller-stp-glorfindel kernel: ice 0000:af:00.0: Direct
> firmware load for garbage_file_does_not_exist failed with error -2
> 

I think I figured out what is going on here, but I'm not sure what the
best solution is.

in userspace devlink.c:3410, the condition for exiting the while loop
that monitors the flash update process is:

(!ctx.flash_done || (ctx.not_first && !ctx.received_end))

This condition means keep looping until flash is done *OR* we've
received a message but have not yet received the end.

In the ice driver implementation, we perform a check for a pending flash
update first, which could trigger a cancellation that causes us to send
back a "cancelling previous pending flash update" status message, which
was sent *before* the devlink_flash_update_begin_notify(). Then, after
this we request the firmware object, which fails, and results in an
error code being reported back..

However, we will never send either the begin or end notification at this
point. Thus, the devlink userspace application will never quit, and
won't display the extack message.

This occurs because we sent a status notify message before we actually
sent a "begin notify". I think the ordering was done because of trying
to avoid having a complicated cleanup logic.

In some sense, this is a bug in the ice driver.. but in another sense
this is the devlink application being too strict about the requirements
on ordering of these messages..

I guess one method if we really want to remain strict is moving the
"begin" and "end" notifications outside of the driver into core so that
it always sends a begin before calling the .flash_update handler, and
always sends an end before exiting.

I guess we could simply relax the restriction on "not first" so that
we'll always exit in the case where the forked process has quit on us,
even if we haven't received a proper flash end notification...

Thoughts?

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ