netdev - Re: [EXT] Re: [PATCH] qed: avoid spin loops in _qed_mcp_cmd_and

Message-ID: <YXq6tTWTdiSPM/wr@cork>
Date:   Thu, 28 Oct 2021 07:59:01 -0700
From:   Jörn Engel <joern@...estorage.com>
To:     Ariel Elior <aelior@...vell.com>
Cc:     Caleb Sander <csander@...estorage.com>,
        Eric Dumazet <eric.dumazet@...il.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "GR-everest-linux-l2@...vell.com" <GR-everest-linux-l2@...vell.com>
Subject: Re: [EXT] Re: [PATCH] qed: avoid spin loops in
 _qed_mcp_cmd_and_union()

On Thu, Oct 28, 2021 at 05:47:10AM +0000, Ariel Elior wrote:
>
> Indeed this function sends messages to the management FW, and may
> be invoked both from atomic contexts and from non atomic ones.
> CAN_SLEEP indicated whether it is permissible in the context from which
> it was invoked to sleep.

That is a rather unfortunate pattern.  I understand the desire for code
reuse, but the result is often udelay() loops that can spin for seconds.
With unresponsive firmware you tend to hit the timeout every time and
incur the maximum latency.
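A minimal userspace sketch of the pattern under discussion, for illustration only: one polling helper whose behaviour is selected by a CAN_SLEEP-style flag, either busy-waiting (as udelay() would) or sleeping (as usleep_range()/msleep() would). The names poll_fw(), struct fake_fw and the simulated firmware are hypothetical and not taken from the qed driver; clock_gettime() and nanosleep() stand in for the kernel delay primitives.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical simulated firmware mailbox: reports "done" on the
 * reads_until_done-th read. A negative value models unresponsive
 * firmware that never answers. */
struct fake_fw {
    int reads_until_done;
    int reads;
};

static bool fw_done(struct fake_fw *fw)
{
    fw->reads++;
    return fw->reads_until_done >= 0 && fw->reads >= fw->reads_until_done;
}

static long long now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
}

/* Busy-wait, like udelay(): the CPU does nothing else meanwhile. */
static void spin_delay_us(long delay_us)
{
    long long end = now_us() + delay_us;
    while (now_us() < end)
        ; /* spin */
}

/* Sleep, like usleep_range()/msleep(): other threads may run. */
static void sleep_delay_us(long delay_us)
{
    struct timespec ts = { .tv_sec = delay_us / 1000000,
                           .tv_nsec = (delay_us % 1000000) * 1000 };
    nanosleep(&ts, NULL);
}

/* One helper, two behaviours, selected by a CAN_SLEEP-style flag.
 * Returns 0 if the firmware answered, -1 on timeout. */
static int poll_fw(struct fake_fw *fw, bool can_sleep, long delay_us,
                   int max_iters)
{
    assert(delay_us > 0 && max_iters > 0);
    for (int i = 0; i < max_iters; i++) {
        if (fw_done(fw))
            return 0;
        if (can_sleep)
            sleep_delay_us(delay_us);
        else
            spin_delay_us(delay_us);
    }
    return -1;
}
```

With firmware that never answers, both modes run for the full max_iters * delay_us; the flag only decides whether the CPU is usable by anyone else during that time.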

Since the scheduler is blocked on the local CPU for the duration of the
spin loop, and won't even bother migrating high-priority threads away
(the assumption being that the current thread will not loop for long),
the result can be pretty bad for latency-sensitive code.  Essentially,
you cannot guarantee any latency below the timeout of those loops.
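To put a number on that bound, a worked example with purely hypothetical values (not taken from the qed driver): a loop in atomic context that retries N times with a busy-wait of t_delay per retry can hold the CPU for up to

```latex
T_{\max} = N \cdot t_{\text{delay}} = 10^{5} \times 10\,\mu\text{s} = 10^{6}\,\mu\text{s} = 1\,\text{s}
```

and with unresponsive firmware that worst case is reached on every call.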

Having a flag or some other means to switch between sleeping and
spinning would help to reduce the odds.  Avoiding calls from atomic
contexts would help even more.  Ideally I would like to remove all
such calls.  The only legitimate exceptions should be those dealing
with high-volume packet RX/TX, and those should never involve
long-running loops.  Anything else can be handled from a kworker or
similar.  If a 1s loop is acceptable, waiting a few ms for the
scheduler must also be acceptable.
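A simplified userspace analogue of deferring to a kworker, assuming POSIX threads: the caller that must not sleep only hands the command off and returns, while a worker thread does the polling and may sleep between reads. In the kernel this would be a work item queued with schedule_work() rather than a thread per request; struct deferred_cmd, submit_cmd() and the simulated firmware are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

enum { CMD_PENDING = 1, CMD_DONE = 0, CMD_TIMEOUT = -1 };

/* Hypothetical deferred firmware command. reads_until_done < 0 models
 * firmware that never answers. Only the worker thread touches `reads`
 * until it has been joined. */
struct deferred_cmd {
    int reads_until_done;
    int reads;
    int timeout_ms;
    atomic_int status;
    pthread_t worker;
};

/* Worker context: allowed to sleep between polls, so the CPU stays
 * available to other threads while the firmware is slow. */
static void *cmd_worker(void *arg)
{
    struct deferred_cmd *cmd = arg;
    const struct timespec one_ms = { .tv_sec = 0, .tv_nsec = 1000000L };

    for (int waited = 0; waited < cmd->timeout_ms; waited++) {
        cmd->reads++;
        if (cmd->reads_until_done >= 0 &&
            cmd->reads >= cmd->reads_until_done) {
            atomic_store(&cmd->status, CMD_DONE);
            return NULL;
        }
        nanosleep(&one_ms, NULL); /* sleep ~1 ms instead of spinning */
    }
    atomic_store(&cmd->status, CMD_TIMEOUT);
    return NULL;
}

/* Called from the path that must not wait: hand the command off and
 * return immediately; completion is observed later via cmd->status. */
static int submit_cmd(struct deferred_cmd *cmd)
{
    assert(cmd->timeout_ms > 0);
    atomic_store(&cmd->status, CMD_PENDING);
    return pthread_create(&cmd->worker, NULL, cmd_worker, cmd);
}
```

The caller later checks cmd->status, or waits for completion from a context that is allowed to sleep. Spawning a thread per command keeps the sketch short; a real design would reuse a long-lived worker, which is what a kworker provides.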

Jörn

--
If a problem has a hardware solution, and a software solution,
do it in software.
-- Arnd Bergmann