netdev - Re: devlink userspace process appears stuck (was: Re: [net-next] devlink: move request

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <baf44b88-156f-7b34-5e8d-9fe3bc2e2c40@intel.com>
Date:   Fri, 13 Nov 2020 14:51:36 -0800
From:   Jacob Keller <jacob.e.keller@...el.com>
To:     Jakub Kicinski <kuba@...nel.org>
Cc:     netdev@...r.kernel.org, Jiri Pirko <jiri@...dia.com>,
        Michael Chan <michael.chan@...adcom.com>,
        Shannon Nelson <snelson@...sando.io>,
        Saeed Mahameed <saeedm@...dia.com>,
        Boris Pismenny <borisp@...dia.com>,
        Bin Luo <luobin9@...wei.com>
Subject: Re: devlink userspace process appears stuck (was: Re: [net-next]
 devlink: move request_firmware out of driver)



On 11/13/2020 2:32 PM, Jacob Keller wrote:
> 
> 
> On 11/13/2020 1:34 PM, Jacob Keller wrote:
>> Well, at least with ice, the experience is pretty bad. I tried out with
>> a garbage file name on one of my test systems. This was on a slightly
>> older kernel without this patch applied, and the device had a pending
>> update that had not yet been finalized with a reset:
>>
>> $ devlink dev flash pci/0000:af:00.0 file garbage_file_does_not_exist
>> Canceling previous pending update
>>
>>
>> The update looks like it got stuck, but actually it failed. Somehow the
>> extack error over the socket didn't get handled by the application very
>> well. Something buggy in the forked process probably.
>>
>> I do get this in the dmesg though:
>>
>> Nov 13 13:12:57 jekeller-stp-glorfindel kernel: ice 0000:af:00.0: Direct
>> firmware load for garbage_file_does_not_exist failed with error -2
>>
> 
> I think I figured out what is going on here, but I'm not sure what the
> best solution is.
> 
> in userspace devlink.c:3410, the condition for exiting the while loop
> that monitors the flash update process is:
> 
> (!ctx.flash_done || (ctx.not_first && !ctx.received_end))
> 

FWIW changing this to

(!ctx.flash_done && !ctx.received_end)

works for this problem, but I suspect that the original condition
intended to try and catch the case where flash update has exited but we
haven't yet processed all of the status messages?

I mean in some sense we could just wait for !ctx.flash_done only. Then
we'd always loop until the initial request exits.

There's a slight issue with the netlink extack message not being
displayed on its own line, but I think that just needs us to add a
pr_out("\n") somewhere to fix it.


> This condition means keep looping until flash is done *OR* we've
> received a message but have not yet received the end.
> 
> In the ice driver implementation, we perform a check for a pending flash
> update first, which could trigger a cancellation that causes us to send
> back a "cancelling previous pending flash update" status message, which
> was sent *before* the devlink_flash_update_begin_notify(). Then, after
> this we request the firmware object, which fails, and results in an
> error code being reported back..
> 
> However, we will never send either the begin or end notification at this
> point. Thus, the devlink userspace application will never quit, and
> won't display the extack message.
> 
> This occurs because we sent a status notify message before we actually
> sent a "begin notify". I think the ordering was done because of trying
> to avoid having a complicated cleanup logic.
> 
> In some sense, this is a bug in the ice driver.. but in another sense
> this is the devlink application being too strict about the requirements
> on ordering of these messages..
> 
> I guess one method if we really want to remain strict is moving the
> "begin" and "end" notifications outside of the driver into core so that
> it always sends a begin before calling the .flash_update handler, and
> always sends an end before exiting.
> 
> I guess we could simply relax the restriction on "not first" so that
> we'll always exit in the case where the forked process has quit on us,
> even if we haven't received a proper flash end notification...
> 
> Thoughts?
>