netdev - Re: [PATCH 1/6] bridge: learn dst metadata in FDB

Message-ID: <d4265b8d-93bb-630f-8d39-5eb0a3a259b8@cumulusnetworks.com>
Date:   Thu, 17 Aug 2017 14:51:12 +0300
From:   Nikolay Aleksandrov <nikolay@...ulusnetworks.com>
To:     David Lamparter <equinox@...c24.net>
Cc:     netdev@...r.kernel.org, amine.kherbouche@...nd.com,
        roopa@...ulusnetworks.com, stephen@...workplumber.org,
        "bridge@...ts.linux-foundation.org" 
        <bridge@...ts.linux-foundation.org>
Subject: Re: [PATCH 1/6] bridge: learn dst metadata in FDB

On 17/08/17 14:39, Nikolay Aleksandrov wrote:
> On 17/08/17 14:03, David Lamparter wrote:
>> Hi Nikolay,
>>
>>
>> On Wed, Aug 16, 2017 at 11:38:06PM +0300, Nikolay Aleksandrov wrote:
>>> On 16/08/17 20:01, David Lamparter wrote:
>>>> This implements holding dst metadata information in the bridge layer,
>>>> but only for unicast entries in the MAC table.  Multicast is still left
>>>> to design and implement.
>>>
>>> Sorry but I do not agree with this change, adding a special case for
>>> VPLS in the bridge code
>>
>> I don't think this is specific to VPLS at all, though you're right that
>> VPLS is the only user currently.
>>
>>> and hitting the fast path for everyone in a few different places for a
>>> feature that the majority will not use does not sound acceptable to
>>> me. We've been trying hard to optimize it, trying to avoid additional
>>> cache lines, removing tests and keeping special cases to a minimum. 
>>
>> skb->dst is on the same cacheline as skb->len.
>> fdb->md_dst is on the same cacheline as fdb->dst.
>> Both will be 0 in a lot of cases, so this should be two null checks on
>> data that is hot in the cache.  Are you sure this is an actual problem?
>>
> 
> Sure - no, I haven't benchmarked it, but I don't see skb->len being on
> the same cache line as _skb_refdst assuming 64 byte cache lines.

I should have been clearer: that obviously depends on the kernel config, but
for them to be on the same line you need to disable one of conntrack,
bridge_netfilter or xfrm, and these are almost always enabled (at least in
all major distributions).
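
To make the layout argument concrete, here is a minimal userspace sketch of the check. The two structs are hypothetical mocks with invented offsets, not the real struct sk_buff; "optional" merely stands in for config-dependent members, and the 64-byte line size is the assumption stated above. For a real kernel build, a tool such as pahole run against the built vmlinux would give the actual offsets.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64	/* assumption: 64-byte cache lines */

/* Hypothetical mock: optional members compiled out. NOT the real sk_buff. */
struct mock_skb_small {
	char     head[40];	/* placeholder for whatever precedes the dst pointer */
	uint64_t refdst;	/* stands in for _skb_refdst */
	uint32_t len;		/* stands in for skb->len */
};

/* Hypothetical mock: three config-dependent members sit between the two. */
struct mock_skb_full {
	char     head[40];
	uint64_t refdst;
	uint64_t optional[3];	/* stand-ins for config-dependent pointers */
	uint32_t len;
};

/* Two offsets share a cache line if they fall in the same 64-byte block. */
static bool same_cache_line(size_t off_a, size_t off_b)
{
	return off_a / CACHE_LINE_SIZE == off_b / CACHE_LINE_SIZE;
}

static bool small_shares_line(void)
{
	return same_cache_line(offsetof(struct mock_skb_small, refdst),
			       offsetof(struct mock_skb_small, len));
}

static bool full_shares_line(void)
{
	return same_cache_line(offsetof(struct mock_skb_full, refdst),
			       offsetof(struct mock_skb_full, len));
}

/* In these mocks the extra members push len onto the next line. */
static void check_mock_layouts(void)
{
	assert(small_shares_line());
	assert(!full_shares_line());
}
```

The point is only that a few extra pointers between the two members are enough to split them across lines; whether that happens in a given kernel depends on its config.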

> But again any special cases, in my opinion, should be handled on their own,
> it is both about the fast path and the code complexity that they bring in.
> 
>>> I understand that you want to use the fdb tables and avoid
>>> duplication, but this is not worth it. There're other similar use
>>> cases and they have their own private fdb tables, that way the user
>>> can opt out and is much cleaner and separated.
>>
>> Sure, this can be done.  I think it's a noticeable performance penalty
>> to have the entire fdb copied (multiple times for H-VPLS even), but I
>> understand that it's preferable to have the normal cases faster in
>> exchange.  As the previous paragraph notes, I still wonder if that hit
>> to the normal case exists though.
>>
>> I will leave this to Amine, he's paid to work on VPLS while I'm doing
>> this for fun ;)
>>
>> There is however another concern I have here.  As I noted in my
>> introductory mail, I'm working on the bridge MDB making similar changes.
>> And there's actually strong use cases for this in both VPLS and the
>> 802.11 code (though I'm not sure I can code the latter one up, it's
>> related to rate control and this is seriously complicated - the goal is
>> to select a multicast rate based on the now-known receiving STAs).
>>
>> I really hope you're not suggesting the entire MDB with IPv4 & IPv6
>> snooping be duplicated into both VPLS and mac80211?
>>
> 
> Code can always be shared if there are more users, no need to stuff everything in
> the bridge, but I'm not that familiar with this case, once patches are out I can
> comment further.
> 
>>> As you've noted this is only an RFC so I will not point out every issue, but there seems
>>> to be a major problem with br_fdb_update(), note that it runs without any locks except RCU.
>>
>> Right, Thanks! ... I only thought about concurrent access, forgetting
>> about concurrent modification...  I'll replace it with an xchg I think.
>> (No need for a lock that way)
> 
> I think you can still lose references to a dst that way: what if someone changes the
> dst you read before the xchg, and you then xchg it?
> 
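
The interleaving in question can be replayed in a small userspace sketch using C11 atomics. Everything here is a hypothetical mock (mock_dst, mock_fdb, the update helpers), not bridge code, and it only models the reference counting: it says nothing about when an object may actually be freed under RCU. It contrasts releasing a pointer that was read earlier with releasing the pointer that the exchange itself returns.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for a dst entry. */
struct mock_dst { atomic_int refcnt; };

/* Hypothetical FDB entry holding one reference in md_dst. */
struct mock_fdb { _Atomic(struct mock_dst *) md_dst; };

static void mock_hold(struct mock_dst *d)
{
	if (d)
		atomic_fetch_add(&d->refcnt, 1);
}

static void mock_release(struct mock_dst *d)
{
	if (d) {
		int prev = atomic_fetch_sub(&d->refcnt, 1);
		assert(prev > 0);
	}
}

/* Racy variant: drops the reference of a pointer read before the swap. */
static void update_racy(struct mock_fdb *f, struct mock_dst *seen_old,
			struct mock_dst *new_dst)
{
	mock_hold(new_dst);
	atomic_exchange(&f->md_dst, new_dst);
	mock_release(seen_old);	/* wrong if the slot changed since the read */
}

/* Variant that releases whatever the exchange actually displaced. */
static void update_xchg(struct mock_fdb *f, struct mock_dst *new_dst)
{
	struct mock_dst *old;

	mock_hold(new_dst);
	old = atomic_exchange(&f->md_dst, new_dst);
	mock_release(old);
}

/*
 * Replays the interleaving single-threaded: writer 1 reads the slot,
 * writer 2 swaps in b, then writer 1 swaps in c. Returns b's refcount.
 */
static int replay_interleaving(int racy)
{
	struct mock_dst a, b, c;
	struct mock_fdb f;
	struct mock_dst *seen;

	atomic_init(&a.refcnt, 1);	/* each object starts with its creator's ref */
	atomic_init(&b.refcnt, 1);
	atomic_init(&c.refcnt, 1);
	mock_hold(&a);			/* the FDB slot's reference on a */
	atomic_init(&f.md_dst, &a);

	seen = atomic_load(&f.md_dst);	/* writer 1 reads a */
	update_xchg(&f, &b);		/* writer 2 replaces a with b */
	if (racy)
		update_racy(&f, seen, &c);	/* releases a again, never b */
	else
		update_xchg(&f, &c);		/* releases b, which it displaced */

	return atomic_load(&b.refcnt);
}
```

In the racy replay b keeps a reference that nothing will ever drop (and a is released twice); in the other replay b ends with only its creator's reference.
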
>>
>> That said, now that I think about it, the skb_dst_set_noref() in the
>> following chunk is probably not safe if the VPLS device has a qdisc on
>> it.  I was relying on the fact that br_dev_xmit is holding RCU there,
>> but if the SKB is queued, md_dst might go away in the meantime...
>> ... probably need to change this to dst_hold() + skb_dst_set()...
>>
>>
>> -David
>>
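
The qdisc concern above comes down to a borrowed versus an owned reference. A minimal sketch, again with hypothetical mocks (mock_dst, mock_skb and the mock_set_* helpers are invented for illustration and are not the kernel helpers named in the mail): the packet is queued, the FDB entry drops its reference, and the function returns the refcount the packet would see at dequeue time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for the metadata dst. */
struct mock_dst { int refcnt; };

/* Hypothetical packet that either borrows or owns its dst reference. */
struct mock_skb {
	struct mock_dst *dst;
	bool dst_owned;
};

/* Borrow: store the pointer without taking a reference. */
static void mock_set_noref(struct mock_skb *skb, struct mock_dst *d)
{
	skb->dst = d;
	skb->dst_owned = false;
}

/* Own: take a reference first, then store the pointer. */
static void mock_set_held(struct mock_skb *skb, struct mock_dst *d)
{
	d->refcnt++;
	skb->dst = d;
	skb->dst_owned = true;
}

/*
 * Models a packet that is queued past the end of the read-side section
 * while the FDB entry gives up its reference in the meantime.
 * Returns the refcount observed when the packet is finally dequeued.
 */
static int refcnt_at_dequeue(bool take_ref)
{
	struct mock_dst md = { .refcnt = 1 };	/* the FDB entry's reference */
	struct mock_skb skb = { 0 };

	if (take_ref)
		mock_set_held(&skb, &md);
	else
		mock_set_noref(&skb, &md);

	/* Packet sits in a queue; the FDB entry is updated and drops its ref. */
	md.refcnt--;

	assert(skb.dst == &md);	/* the packet still points at the object */
	return md.refcnt;
}
```

A result of 0 means the object would already be eligible for freeing while the queued packet still points at it; with the reference taken up front the count stays at 1 until the packet itself is released.
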
>>>> diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
>>>> index 861ae2a165f4..534cacf02f8d 100644
>>>> --- a/net/bridge/br_device.c
>>>> +++ b/net/bridge/br_device.c
>>>> @@ -81,6 +82,9 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>  		else
>>>>  			br_flood(br, skb, BR_PKT_MULTICAST, false, true);
>>>>  	} else if ((dst = br_fdb_find_rcu(br, dest, vid)) != NULL) {
>>>> +		struct dst_entry *md_dst = rcu_dereference(dst->md_dst);
>>>> +		if (md_dst)
>>>> +			skb_dst_set_noref(skb, md_dst);
>>>>  		br_forward(dst->dst, skb, false, true);
>>>>  	} else {
>>>>  		br_flood(br, skb, BR_PKT_UNICAST, false, true);
>>
>>
>>>> diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
>>>> index a79b648aac88..0751fcb89699 100644
>>>> @@ -567,10 +579,15 @@ void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source,
>>>>  					source->dev->name, addr, vid);
>>>>  		} else {
>>>>  			unsigned long now = jiffies;
>>>> +			struct dst_entry *ref_md = rcu_access_pointer(fdb->md_dst);
>>>>  
>>>>  			/* fastpath: update of existing entry */
>>>> -			if (unlikely(source != fdb->dst)) {
>>>> +			if (unlikely(source != fdb->dst ||
>>>> +			    dst_metadata_cmp(md_dst, ref_md))) {
>>>>  				fdb->dst = source;
>>>> +				dst_release(ref_md);
>>>> +				rcu_assign_pointer(fdb->md_dst,
>>>> +						dst_clone(md_dst));
>>>>  				fdb_modified = true;
>>>>  				/* Take over HW learned entry */
>>>>  				if (unlikely(fdb->added_by_external_learn))
>