netdev - Re: [PATCH net] atlantic: fix deadlock at aq_nic


Message-ID: <CACT4oucrz5gDazdAF3BpEJX8XTRestZjiLAOxSHGAxSqf_o+LQ@mail.gmail.com>
Date:   Mon, 17 Oct 2022 09:22:58 +0200
From:   Íñigo Huguet <ihuguet@...hat.com>
To:     Andrew Lunn <andrew@...n.ch>
Cc:     irusskikh@...vell.com, dbogdanov@...vell.com, davem@...emloft.net,
        edumazet@...gle.com, kuba@...nel.org, pabeni@...hat.com,
        netdev@...r.kernel.org, Li Liang <liali@...hat.com>
Subject: Re: [PATCH net] atlantic: fix deadlock at aq_nic_stop

On Sat, Oct 15, 2022 at 5:10 PM Andrew Lunn <andrew@...n.ch> wrote:
>
> > > Maybe the lock needs to be moved closer to what actually needs to be
> > > protect? What is it protecting?
> >
> > It's protecting the operations of aq_macsec_enable and aq_macsec_work.
> > The locking was closer to them, but the idea of this patch is to move
> > the locking to an earlier moment so, in the case we need to abort, do
> > it before changing anything.
>
> aq_check_txsa_expiration() seems to be one of the issues? At least,
> the lock is taken before and released afterwards. So what in
> aq_check_txsa_expiration() requires the lock?

Basically everything in the file aq_macsec.c seems to be implicitly
protected by rtnl_lock. One group of functions consists of the
callbacks in `struct macsec_ops aq_macsec_ops`, which are responsible
for configuring macsec offload and are all called under rtnl_lock. The
rest of the functions in the file are called from ethtool, also
protected by rtnl_lock.

And part of the problem is that many of these operations are firmware
and/or phy configurations, and I don't have documentation on how they
work. Despite this, it seems reasonable to think that they need to be
protected by a lock.

> I don't like the use of rtnl_trylock(). It suggests the basic design is
> wrong, or overly complex, and so probably not working correctly.
>
> https://blog.ffwll.ch/2022/07/locking-engineering.html
>
> Please try to identify what is being protected. If it is driver
> internal state, could it be replaced with a driver mutex, rather than
> RTNL? Or is it network stack as a whole state, which really does
> require RTNL? If so, how do other drivers deal with this problem? Is
> it specific to MACSEC? Does MACSEC have a design problem?

I already considered this possibility but discarded it because, as I
said above, everything else is already legitimately protected by
rtnl_lock.

The only alternative I can think of is to add a driver-only mutex
(let's call it aq_macsec_mutex), as you say, and every time that
macsec offload is to be changed both rtnl_lock and aq_macsec_mutex
would be taken. It's true that aq_macsec_mutex wouldn't be much
contended because rtnl_lock almost always needs to be acquired first.
From the workqueue and the threaded irq there wouldn't be any deadlock
because they only take aq_macsec_mutex, while ndo_stop only holds
rtnl_lock. It would also allow putting the locking close to what it
protects.
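
To make the proposed ordering concrete, here is a rough userspace
sketch, not driver code. It uses pthread mutexes as stand-ins:
rtnl_mutex plays the role of rtnl_lock, and aq_macsec_mutex is the
hypothetical driver mutex named above. The function names
(macsec_configure, macsec_work, nic_stop) and the state variables are
made up for illustration, and pthread_join stands in for waiting on
the work item while stopping the device.

```c
#include <pthread.h>

/* Stand-in for the kernel's RTNL lock (assumption: userspace model only). */
static pthread_mutex_t rtnl_mutex = PTHREAD_MUTEX_INITIALIZER;
/* Hypothetical driver-only mutex protecting macsec state. */
static pthread_mutex_t aq_macsec_mutex = PTHREAD_MUTEX_INITIALIZER;

static int macsec_state; /* hypothetical driver-internal macsec state */
static int work_runs;    /* counts how many times the worker body ran */

/* Config path (macsec_ops callbacks, ethtool): RTNL first, then the
 * driver mutex, always in that order. */
static void macsec_configure(int new_state)
{
	pthread_mutex_lock(&rtnl_mutex);
	pthread_mutex_lock(&aq_macsec_mutex);
	macsec_state = new_state;
	pthread_mutex_unlock(&aq_macsec_mutex);
	pthread_mutex_unlock(&rtnl_mutex);
}

/* Workqueue / threaded irq path: takes only the driver mutex and never
 * touches RTNL, so it cannot block on a stop path that holds RTNL. */
static void *macsec_work(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&aq_macsec_mutex);
	work_runs++;
	pthread_mutex_unlock(&aq_macsec_mutex);
	return NULL;
}

/* Stop path: holds only RTNL while waiting for the worker to finish.
 * The worker is started here, under RTNL, purely to force the overlap
 * that would deadlock if the worker also needed RTNL. */
static int nic_stop(void)
{
	pthread_t worker;

	pthread_mutex_lock(&rtnl_mutex);
	if (pthread_create(&worker, NULL, macsec_work, NULL) != 0) {
		pthread_mutex_unlock(&rtnl_mutex);
		return -1;
	}
	pthread_join(worker, NULL); /* completes: worker never needs RTNL */
	pthread_mutex_unlock(&rtnl_mutex);
	return 0;
}
```

In this model nic_stop() returns even though it waits for the worker
with the RTNL stand-in held, because the worker only ever takes
aq_macsec_mutex. If macsec_work took rtnl_mutex instead, the join would
never return, which is the shape of the original deadlock.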

I thought that this solution would be a bit overkill, but maybe it's
less overkill than the one I chose. If you're OK with this, I can
prepare a v2.

--
Íñigo Huguet