netdev - devlink userspace process appears stuck (was: Re: [net-next] devlink: move request

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <6352e9d3-02af-721e-3a54-ef99a666be29@intel.com>
Date:   Fri, 13 Nov 2020 14:32:36 -0800
From:   Jacob Keller <jacob.e.keller@...el.com>
To:     Jakub Kicinski <kuba@...nel.org>
Cc:     netdev@...r.kernel.org, Jiri Pirko <jiri@...dia.com>,
        Michael Chan <michael.chan@...adcom.com>,
        Shannon Nelson <snelson@...sando.io>,
        Saeed Mahameed <saeedm@...dia.com>,
        Boris Pismenny <borisp@...dia.com>,
        Bin Luo <luobin9@...wei.com>
Subject: devlink userspace process appears stuck (was: Re: [net-next] devlink:
 move request_firmware out of driver)

On 11/13/2020 1:34 PM, Jacob Keller wrote:
> Well, at least with ice, the experience is pretty bad. I tried out with
> a garbage file name on one of my test systems. This was on a slightly
> older kernel without this patch applied, and the device had a pending
> update that had not yet been finalized with a reset:
> 
> $ devlink dev flash pci/0000:af:00.0 file garbage_file_does_not_exist
> Canceling previous pending update
> 
> 
> The update looks like it got stuck, but actually it failed. Somehow the
> extack error over the socket didn't get handled by the application very
> well. Something buggy in the forked process probably.
> 
> I do get this in the dmesg though:
> 
> Nov 13 13:12:57 jekeller-stp-glorfindel kernel: ice 0000:af:00.0: Direct
> firmware load for garbage_file_does_not_exist failed with error -2
> 

I think I figured out what is going on here, but I'm not sure what the
best solution is.

in userspace devlink.c:3410, the condition for exiting the while loop
that monitors the flash update process is:

(!ctx.flash_done || (ctx.not_first && !ctx.received_end))

This condition means keep looping until flash is done *OR* we've
received a message but have not yet received the end.

In the ice driver implementation, we perform a check for a pending flash
update first, which could trigger a cancellation that causes us to send
back a "cancelling previous pending flash update" status message, which
was sent *before* the devlink_flash_update_begin_notify(). Then, after
this we request the firmware object, which fails, and results in an
error code being reported back..

However, we will never send either the begin or end notification at this
point. Thus, the devlink userspace application will never quit, and
won't display the extack message.

This occurs because we sent a status notify message before we actually
sent a "begin notify". I think the ordering was done because of trying
to avoid having a complicated cleanup logic.

In some sense, this is a bug in the ice driver.. but in another sense
this is the devlink application being too strict about the requirements
on ordering of these messages..

I guess one method if we really want to remain strict is moving the
"begin" and "end" notifications outside of the driver into core so that
it always sends a begin before calling the .flash_update handler, and
always sends an end before exiting.

I guess we could simply relax the restriction on "not first" so that
we'll always exit in the case where the forked process has quit on us,
even if we haven't received a proper flash end notification...

Thoughts?