Message-ID: <d9a3720d-6f19-5214-463c-99debcccf9d6@nvidia.com>
Date:   Sat, 25 Sep 2021 07:31:51 -0700
From:   Roopa Prabhu <roopa@...dia.com>
To:     Nikolay Aleksandrov <nikolay@...dia.com>,
        Vladimir Oltean <vladimir.oltean@....com>,
        Florian Fainelli <f.fainelli@...il.com>
CC:     Andrew Lunn <andrew@...n.ch>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [RFC PATCH 0/4] Faster ndo_fdb_dump for drivers with shared FDB


On 9/24/21 3:03 AM, Nikolay Aleksandrov wrote:
> On 24/09/2021 01:49, Vladimir Oltean wrote:
>>
[snip]
>> We don't, that's the premise, you didn't miss anything there. I mean in
>> the switchdev world, I see that an interrupt is the primary mechanism,
>> and we do have DSA switches which are theoretically capable of notifying
>> of new addresses, but not through interrupts but over Ethernet, of
>> course. Ocelot for example supports sending "learn frames" to the CPU -
>> these are just like flooded frames, except that a "flooded" frame is one
>> with unknown MAC DA, and a "learn" frame is one with unknown MAC SA.
>> I've been thinking whether it's worth adding support for learn frames
>> coming from Ocelot/Felix in DSA, and it doesn't really look like it.
>> Anyway, when the DSA tagging protocol receives a "learn" frame, it needs
>> to consume_skb() it, then trigger some sort of switchdev atomic notifier
>> for that MAC SA (SWITCHDEV_FDB_ADD_TO_BRIDGE), but sadly that is only
>> the beginning of a long series of complications, because we also need to
>> know when the hardware has forgotten it, and it doesn't look like
>> "forget" frames are a thing. So because the risk of having an address
>> expire in hardware while it is still present in software is kind of
>> high, the only option left is to make the hardware entry static, and
>> (a) delete it manually when the software entry expires
>> (b) set up a second alert which triggers a MAC SA has been received on a
>>      port other than the destination port which is pointed towards by an
>>      existing FDB entry. In short, "station migration alert". Because the
>>      FDB entry is static, we need to migrate it by hand, in software.
>> So it all looks kind of clunky. Whereas what we do now is basically
>> assume that the amount of frames with unknown MAC DA reaching the CPU
>> via flooding is more or less equal and equally distributed with the
>> frames with unknown MAC SA reaching the CPU. I have not yet encountered
>> a use case where the two are tragically different, in a way that could
>> be solved only with SWITCHDEV_FDB_ADD_TO_BRIDGE events and in no other way.
>>
>>
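To make the sequence above concrete, the learn-frame handling being
described would boil down to roughly the sketch below (the struct field
names follow include/net/switchdev.h, but the function, its arguments and
how brport_dev/vid get recovered from the extraction header are all made
up for illustration):

	/* hypothetical handler, called by the tagging protocol once it has
	 * classified an skb as an Ocelot "learn" frame
	 */
	static void example_handle_learn_frame(struct sk_buff *skb,
					       struct net_device *brport_dev,
					       u16 vid)
	{
		struct switchdev_notifier_fdb_info info = {
			.addr = eth_hdr(skb)->h_source,
			.vid = vid,
			.offloaded = true,
		};

		/* tell the bridge about the MAC SA the hardware just learned */
		call_switchdev_notifiers(SWITCHDEV_FDB_ADD_TO_BRIDGE,
					 brport_dev, &info.info, NULL);

		/* the learn frame itself carries nothing useful for the stack */
		consume_skb(skb);
	}
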
>> Anyway, huge digression, the idea was that DSA doesn't synchronize FDBs
>> and that is the reason in the first place why we have an .ndo_fdb_dump
>> implementation. Theoretically if all hardware drivers were perfect,
>> you'd only have the .ndo_fdb_dump implementation done for the bridge,
>> vxlan, things like that. So that is why I asked Roopa whether hacking up
>> rtnl_fdb_dump in this way, transforming it into a state machine even
>> more complicated than it already was, is acceptable. We aren't the
>> primary use case of it, I think.
>>
> Hi Vladimir,
> I glanced over the patches and the obvious question that comes first is have you
> tried pushing all of this complexity closer to the driver which needs it?
> I mean rtnl_fdb_dump() can already "resume" and passes all the necessary resume indices
> to ndo_fdb_dump(), so it seems to me that all of this can be hidden away. I doubt
> there will be many users overall, so it would be nice to avoid adding the complexity
> as you put it and supporting it in the core. I'd imagine a hacked driver would simply cache
> the dump for some time while needed (that's important to define well, more below) and just
> return it for the next couple of devices which share it upon request, basically you'd have the
> same type of solution you've done here, just have the details hidden in the layer which needs it.
>
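If I'm reading the suggestion right, the driver-only version would look
roughly like the sketch below, with every foo_* identifier being
hypothetical and the interesting part, when to free the cache, being
exactly the "finish" question discussed in the next paragraph:

	static int foo_ndo_fdb_dump(struct sk_buff *skb,
				    struct netlink_callback *cb,
				    struct net_device *dev,
				    struct net_device *filter_dev,
				    int *idx)
	{
		struct foo_port *port = netdev_priv(dev);
		struct foo_switch *sw = port->sw;
		struct foo_fdb_entry *entry;
		int err = 0, i = 0;

		/* the first port being dumped walks the shared hardware FDB
		 * once and caches it, the sibling ports just reuse the cache
		 */
		if (list_empty(&sw->fdb_cache)) {
			err = foo_fdb_cache_populate(sw);
			if (err)
				return err;
		}

		list_for_each_entry(entry, &sw->fdb_cache, list) {
			if (!foo_fdb_entry_on_port(entry, port))
				continue;

			/* honour the resume index rtnl_fdb_dump() passes in */
			if (i < *idx) {
				i++;
				continue;
			}

			err = foo_fdb_fill_one(skb, cb, dev, entry);
			if (err)
				break;
			i++;
		}

		*idx = i;

		return err;
	}
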
> Now the hard part seems to be figuring out when to finish in this case. Prepare should be a simple
> check if a shared fdb list is populated, finish would need to be inferred. One way to do that is
> based on a transaction/dump id which is tracked for each shared device and the last dump. Another
> is if you just pass a new argument to ndo_fdb_dump() if it's dumping a single device or doing a
> full dump, since that's the case that's difficult to differentiate. If you're doing a single
> dump - obviously you do a normal fdb dump without caching, if you're doing a full dump (all devices)
> then you need to check if the list is populated, if it is and this is the last device you need to free
> the cached shared list (finish phase). That would make the core change very small and would push the
> complexity to be maintained where it's needed. Actually you have the netlink_callback struct passed
> down to ndo_fdb_dump() so probably that can be used to infer if you're doing a single or multi-device
> dump already and there can be 0 core changes.
>
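For what it's worth, cb->nlh is the original RTM_GETNEIGH request, so
(modulo the legacy case where the header is a struct ifinfomsg rather
than a struct ndmsg) something like the sketch below might already be
enough to tell the two cases apart with zero core changes;
foo_fdb_dump_uncached() is hypothetical:

	/* inside the driver's ndo_fdb_dump(): a non-zero ifindex in the
	 * request means rtnl_fdb_dump() is filtering the dump to a single
	 * device, in which case caching isn't worth the trouble
	 */
	const struct ndmsg *ndm = nlmsg_data(cb->nlh);

	if (ndm->ndm_ifindex)
		return foo_fdb_dump_uncached(skb, cb, dev, filter_dev, idx);
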
> Also one current downside with this set is the multiple walks over the net device list for a single dump,
> especially for setups with thousands of net devices. Actually, I might be missing something, but what
> happens if a dump is terminated before it's finished? Looks like the finish phase will never be reached.
> That would be a problem for both solutions, you might have to add a netlink ->done() callback anyway.

+1, the series conceptually looks good, but the core fdb dump is already
unnecessarily complex. Moving this to the driver, with the indicators
passed as flags, is preferred (you could possibly also mark a port as
designated for shared fdb operations, again in the driver).
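
One way to read the designated-port idea, again purely hypothetical and
entirely driver-private (the flag would be set on exactly one port per
switch, e.g. at probe time):

	struct foo_port {
		struct foo_switch	*sw;
		/* the one port per switch that drives the shared-FDB
		 * bookkeeping: it populates and, eventually, frees the
		 * cached dump, the other ports only consume it
		 */
		bool			shared_fdb_designated;
	};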

