Message-ID: <34828458-2334-d553-2d9d-808bb6111dd3@ziu.info>
Date: Sat, 14 Jul 2018 17:48:46 +0200
From: Michal Soltys <soltys@....info>
To: Mahesh Bandewar (महेश बंडेवार) <maheshb@...gle.com>
Cc: Jay Vosburgh <jay.vosburgh@...onical.com>,
Chonggang Li <chonggangli@...gle.com>,
linux-netdev <netdev@...r.kernel.org>,
David Miller <davem@...emloft.net>
Subject: Re: [BUG] bonded interfaces drop bpdu (stp) frames
On 07/13/2018 02:15 AM, Mahesh Bandewar (महेश बंडेवार) wrote:
> On Thu, Jul 12, 2018 at 4:14 PM, Michal Soltys <soltys@....info> wrote:
>> On 2018-07-13 00:03, Jay Vosburgh wrote:
>>> Mahesh Bandewar (महेश बंडेवार) wrote:
>>>
>>>> On Thu, Jul 12, 2018 at 11:03 AM, Jay Vosburgh
>>>> <jay.vosburgh@...onical.com> wrote:
>>>>> Michal Soltys <soltys@....info> wrote:
>>>>>
>>>>>> On 07/12/2018 04:51 PM, Jay Vosburgh wrote:
>>>>>>> Mahesh Bandewar (महेश बंडेवार) wrote:
>>>>>>>
>>>>>>>> On Wed, Jul 11, 2018 at 3:23 PM, Michal Soltys <soltys@....info> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> As weird as that sounds, this is what I observed today after bumping
>>>>>>>>> the kernel version. I have a setup where 2 bonds are attached to a linux
>>>>>>>>> bridge and physically connected to two switches doing MSTP (the linux
>>>>>>>>> bridge is just passing the MSTP frames through).
>>>>>>>>>
>>>>>>>>> Initially I suspected some changes related to the bridge code - but a
>>>>>>>>> quick peek at the code showed nothing suspicious - and the part of it
>>>>>>>>> that explicitly passes stp frames if stp is not enabled has seen few
>>>>>>>>> changes (e.g. the per-port group_fwd_mask added recently). Furthermore -
>>>>>>>>> if regular non-bonded interfaces are attached, everything works fine.
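
(For context, the bridge-side handling I was referring to boils down to
roughly the following. This is a standalone, paraphrased sketch of the
behaviour, not the actual net/bridge/br_input.c code - the names and the
exact rules here are my own simplification:)

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum action { FORWARD, CONSUME, DROP };

/* What a linux bridge port does with a frame sent to one of the reserved
 * 01:80:c2:00:00:0X group addresses, given whether STP is enabled on the
 * bridge and the (bridge-wide or per-port) group_fwd_mask. */
static enum action link_local_action(const uint8_t dst[6], bool stp_enabled,
                                     uint16_t group_fwd_mask)
{
        /* not one of the reserved addresses: normal forwarding path */
        if (!(dst[0] == 0x01 && dst[1] == 0x80 && dst[2] == 0xc2 &&
              dst[3] == 0x00 && dst[4] == 0x00 && (dst[5] & 0xf0) == 0x00))
                return FORWARD;

        switch (dst[5]) {
        case 0x00:      /* STP/RSTP/MSTP BPDUs */
                /* with STP off the bridge behaves like a shared segment */
                return stp_enabled ? CONSUME : FORWARD;
        case 0x01:      /* 802.3x pause frames */
                return DROP;
        default:        /* PAE (0x03), LLDP (0x0e), ... */
                return (group_fwd_mask & (1u << dst[5])) ? FORWARD : CONSUME;
        }
}

int main(void)
{
        const uint8_t bpdu[6] = { 0x01, 0x80, 0xc2, 0x00, 0x00, 0x00 };
        const uint8_t lldp[6] = { 0x01, 0x80, 0xc2, 0x00, 0x00, 0x0e };

        printf("BPDU with stp off      -> %d\n",
               link_local_action(bpdu, false, 0x0000));
        printf("LLDP with mask 0x4000  -> %d\n",
               link_local_action(lldp, true, 0x4000));
        return 0;
}
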
>>>>>>>>>
>>>>>>>>> Just to be sure I detached the bond (802.3ad mode) and checked it with a
>>>>>>>>> simple tcpdump (ether proto \\stp) - and indeed no hello packets were
>>>>>>>>> there (while they were present just fine on the active enslaved
>>>>>>>>> interface, or on the bond device in earlier kernels).
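
(The same check can be done without tcpdump using a small raw-socket
listener along these lines - my own sketch, matching on the
01:80:c2:00:00:00 bridge group address rather than on the LLC SAP that the
"ether proto \\stp" filter uses, which is close enough for hello packets;
it needs root / CAP_NET_RAW:)

#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(int argc, char **argv)
{
        const unsigned char bpdu_dst[ETH_ALEN] = {
                0x01, 0x80, 0xc2, 0x00, 0x00, 0x00
        };
        struct sockaddr_ll sll = { .sll_family = AF_PACKET,
                                   .sll_protocol = htons(ETH_P_ALL) };
        unsigned char frame[2048];
        int fd, seen = 0;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <interface>\n", argv[0]);
                return 1;
        }
        sll.sll_ifindex = if_nametoindex(argv[1]);

        fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0 || bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
                perror("socket/bind");
                return 1;
        }

        while (seen < 5) {
                ssize_t len = recv(fd, frame, sizeof(frame), 0);

                /* destination MAC is the first 6 bytes of the frame */
                if (len >= ETH_ALEN && !memcmp(frame, bpdu_dst, ETH_ALEN))
                        printf("BPDU %d seen on %s\n", ++seen, argv[1]);
        }
        return 0;
}

Running it against the active slave and against the bond device shows the
difference described above (on the affected kernels only the slave reports
hits).
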
>>>>>>>>>
>>>>>>>>> If time permits I'll bisect tomorrow to pinpoint the commit, but from
>>>>>>>>> today's quick tests - 4.9.x is working fine, while 4.16.16 (tested on
>>>>>>>>> debian) and 4.17.3 (tested on archlinux) are failing.
>>>>>>>>>
>>>>>>>>> Unless this is already a known issue (or you have any suggestions as to
>>>>>>>>> what could be responsible).
>>>>>>>>>
>>>>>>>> I believe these are link-local-multicast messages, and some time back a
>>>>>>>> change went in to not pass those frames to the bonding master. This
>>>>>>>> could be a side effect of that.
>>>>>>>
>>>>>>> Mahesh, I suspect you're thinking of:
>>>>>>>
>>>>>>> commit b89f04c61efe3b7756434d693b9203cc0cce002e
>>>>>>> Author: Chonggang Li <chonggangli@...gle.com>
>>>>>>> Date: Sun Apr 16 12:02:18 2017 -0700
>>>>>>>
>>>>>>> bonding: deliver link-local packets with skb->dev set to link that packets arrived on
>>>>>>>
>>>>>>> Michal, are you able to revert this patch and test?
>>>>>>>
>>>>>>> -J
>>>>>>>
>>>>>>> ---
>>>>>>> -Jay Vosburgh, jay.vosburgh@...onical.com
>>>>>>>
>>>>>>
>>>>>>
>>>>>> Just tested - yes, reverting that patch solves the issues.
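
(For anyone skimming the thread: going by the commit subject rather than an
audit of drivers/net/bonding/bond_main.c, the effect of that change on a
bridge sitting on top of a bond can be modelled like this - the helper names
below are mine:)

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* 01:80:c2:00:00:00 .. 01:80:c2:00:00:0f */
static bool is_link_local_group(const uint8_t *dst)
{
        static const uint8_t base[5] = { 0x01, 0x80, 0xc2, 0x00, 0x00 };

        return !memcmp(dst, base, sizeof(base)) && (dst[5] & 0xf0) == 0x00;
}

/* Which device an incoming frame appears to arrive on after the bonding
 * receive path: previously every accepted frame was re-pointed at the bond
 * device; with the patch, link-local group frames keep the slave as
 * skb->dev, so a bridge (or tcpdump) attached to the bond never sees them. */
static const char *rx_dev(const uint8_t *dst, const char *slave,
                          const char *bond, bool patch_applied)
{
        if (patch_applied && is_link_local_group(dst))
                return slave;
        return bond;
}

int main(void)
{
        const uint8_t bpdu[6] = { 0x01, 0x80, 0xc2, 0x00, 0x00, 0x00 };

        printf("pre-patch : BPDU delivered on %s\n",
               rx_dev(bpdu, "eth0", "bond0", false));
        printf("post-patch: BPDU delivered on %s\n",
               rx_dev(bpdu, "eth0", "bond0", true));
        return 0;
}

Which is consistent with the revert bringing the hello packets back on the
bond.
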
>>>>>
>>>>> Chonggang,
>>>>>
>>>>> Reading the changelog in your commit referenced above, I'm not
>>>>> entirely sure what actual problem it is fixing. Could you elaborate?
>>>>>
>>>>> As the patch appears to cause a regression, it needs to be
>>>>> either fixed or reverted.
>>>>>
>>>>> Mahesh, you signed-off on it as well, perhaps you also have some
>>>>> context?
>>>>>
>>>>
>>>> I think the original idea behind it was to pass the LLDPDUs to the
>>>> stack on the interface that they came in on, since this is considered to
>>>> be link-local traffic and passing it to the bond-master would lose its
>>>> "linklocal-ness". This is true for LLDP, and if you change the skb->dev
>>>> of the packet, then you don't know which slave link it came in on
>>>> (from the LLDP consumer's perspective).
>>>>
>>>> I don't know much about STP, but trunking two links and aggregating
>>>> this link info through the bond-master seems wrong. Just like with LLDP,
>>>> you are losing info specific to a link, and the decision derived from
>>>> that info could be wrong.
>>>>
>>>> Having said that, we determine "linklocal-ness" by looking at L2, and
>>>> the bond-master shares this with its slaves. So it does seem fair to pass
>>>> those frames to the bonding-master, but at the same time link-local
>>>> traffic is supposed to be limited to the physical link (LLDP/STP/LACP
>>>> etc.). Your thoughts?
>>>
>>> I agree the whole thing sounds kind of weird, but I'm curious as
>>> to what Michal's actual use case is; he presumably has some practical
>>> use for this, since he noticed that the behavior changed.
>>>
>>
>> The whole "link-local" term is a bit, I don't know - at this point it
>> feels like too many things were thrown into a single bag and it got
>> somewhat confusing (bpdu, lldp, pause frames, lacp, pae, the qinq multicast
>> that afaik has its own address) - I added some examples in another reply
>> I wrote at the same time as you were typing this one =)
>>
>>> Michal, you mentioned MSTP and using 802.3ad (LACP) mode; how
>>> does that combination work rationally given that the bond might send and
>>> receive traffic across multiple slaves? Or does the switch side bundle
>>> the ports together into a single logical interface for MSTP purposes?
>>> On the TX side, I think the bond will likely balance all STP frames to
>>> just one slave.
>>>
>>
>> The basic concept - two "main" switches with "important" machines
>> connected to them. If one switch dies, everything keeps working. With
>> no unused ports and so on.
>>
>> In more details:
>>
>> Originally I was trying the MSTP daemon (on the "important" machines),
>> which seems quite well and completely written, but it cannot really work
>> correctly - as afaik you can't put a port (in the linux bridge context)
>> in a different forwarding/blocking/etc. state per region - in other
>> words, per group of vlans (or mstpd didn't know how to do that, or it
>> wasn't implemented - I didn't look too deep back then, though my
>> interest resurfaced in recent days).
>>
>> So that option was out of the question. But any switch, real or not,
>> /must/ pass bpdu frames if it doesn't interpret them.
>
> I think it's the contrary! These (STP) frames are link-local-multicast
> frames, i.e. their dest-mac is a multicast mac (unlike the one that you
> would see on an interface). So a *real* switch would never pass those
> frames (because there won't be an entry in the CAM table) and would
> have to consume them. Just imagine a userspace LACP daemon
> receiving LACPDUs on the bonding master; it wouldn't make the right
> decisions about the slaves, and hence the health of the aggregator
> would be questionable. Also imagine LACP frames being forwarded by a
> *real* switch; things won't work as intended.
>
>
>> So instead of
>> having an active mstp participant, we have a passive linux bridge that
>> passes the frames, and the two real switches around it take care of mstp,
>> treating the linux box as a shared segment. The costs/priorities/etc. on
>> the real switches are set so that one bond handles two regions, and the
>> other bond handles the other two regions. If either of the real switches
>> dies or is taken down for e.g. a firmware update, the bond going to the
>> other switch handles all four regions (failover is of course not as fast
>> as with active rstp/mstp participation, but works quite well nonetheless -
>> around 10s after some tuning).
>>
>> We could have used RSTP for that purpose as well - but that, being all or
>> nothing in the context of per-port blocking/forwarding, would leave half
>> of the ports unused - and we wanted to avoid that (that's why MSTP was
>> created, after all).
>>
>> Instead of using 2 bonds (2 interfaces each) we could have just used 4
>> interfaces directly, one per region. But two of those regions see very
>> little traffic, so we paired a more active region with a less active one.
>>
>>> As for a resolution, presuming that Michal has some reasonable
>>> use case, I'm thinking along the lines of reverting the new (leave frame
>>> attached to slave) behavior for the general case and adding a special
>>> case for LLDP and friends to get the new behavior. I'd like to avoid
>>> adding any new options to bonding.
>>>
>>
>> My use case aside, this will cause issues for anyone attaching a bond
>> (instead of a direct interface or veth) to a bridge and doing something
>> more complex with it - whether related to stp or to selectively passing
>> e.g. lldp using the group_fwd_mask sysfs. Or having an LLDP daemon (e.g.
>> systemd-resolved, to not look far away) told to do LLDP on a bond device
>> (even in the most basic active-backup case) and remaining blind. Or
>> anything else that expects to see/pass those multicasts on/via a bonded
>> device (which is, after all, just a convenient way to create a virtual
>> interface out of real interfaces - in other words, it probably shouldn't
>> make any calls in this regard).
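
(To make the group_fwd_mask part concrete - assuming the per-port knob sits
at /sys/class/net/<bridge>/brif/<port>/group_fwd_mask as on recent kernels
and accepts a hex mask, with br0/bond0 below being placeholder names -
letting LLDP through a bridge port amounts to setting bit 0x0e, i.e. 0x4000:)

#include <stdio.h>

int main(void)
{
        /* placeholder names: bridge br0 with bond0 as one of its ports */
        const char *path = "/sys/class/net/br0/brif/bond0/group_fwd_mask";
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return 1;
        }
        fprintf(f, "0x4000\n");         /* 1 << 0x0e: forward LLDP frames */
        return fclose(f) != 0;
}

None of which helps if the frames never make it to the bridge port (the
bond) in the first place - which is the point here.
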
For the record - having peeked at LLDP's 802.1AB-2016 - it also states that
an LLDP agent must be able to operate on any/all physical interfaces as well
as on bonds (section 6.8, LLDP and Link Aggregation).
>
> A real switch cannot forward link-local multicast, and if it does, then
> I would consider it broken.
Well, if we dig into the details, even the standard (802.1Q-2014) makes a
distinction between the non-TPMR bridge, nearest bridge and nearest customer
bridge addresses - governing which of those multicasts should be relayed in
the "bigger" customer/provider cases. My point being, it's not just a simple
"don't forward" (section 8.6.3, among others).
As far as cheap / unmanaged real ones go, they generally pass the important
stuff they don't understand (at least all the ones I've had my hands on did
so). R/MSTP expects "shared segments" to pass BPDUs as well. The bridge code
in linux follows that (and allows more with the aforementioned knobs).
But we digress.
> Having said that, we are talking about
> the bonding-master, and as I mentioned earlier, it seems fair to pass
> those frames to the bonding master because we share L2 with the slaves;
> but we should pass the frames to the stack as they are (current
> behavior!) to maintain the "linklocal-ness", so both the master and the
> current slave get a copy (it's multicast after all).
>
To sum up - does that mean the patch will be reverted, or do you plan to
adjust it in some way? If the former, I can prep the revert patch, as I'm -
well - the one most interested in getting this regression sorted out.
Michal
> Thanks,
> --mahesh..
>