[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAF2d9jijrYxkrMpWM1OVpe6r+W5Je0ffvAF1Q2fgTiOxmnZTtg@mail.gmail.com>
Date: Thu, 12 Jul 2018 17:15:40 -0700
From: Mahesh Bandewar (महेश बंडेवार)
<maheshb@...gle.com>
To: Michal Soltys <soltys@....info>
Cc: Jay Vosburgh <jay.vosburgh@...onical.com>,
Chonggang Li <chonggangli@...gle.com>,
linux-netdev <netdev@...r.kernel.org>
Subject: Re: [BUG] bonded interfaces drop bpdu (stp) frames
On Thu, Jul 12, 2018 at 4:14 PM, Michal Soltys <soltys@....info> wrote:
> On 2018-07-13 00:03, Jay Vosburgh wrote:
>> Mahesh Bandewar (महेश बंडेवार) wrote:
>>
>>>On Thu, Jul 12, 2018 at 11:03 AM, Jay Vosburgh
>>><jay.vosburgh@...onical.com> wrote:
>>>> Michal Soltys <soltys@....info> wrote:
>>>>
>>>>>On 07/12/2018 04:51 PM, Jay Vosburgh wrote:
>>>>>> Mahesh Bandewar (महेश बंडेवार) wrote:
>>>>>>
>>>>>>> On Wed, Jul 11, 2018 at 3:23 PM, Michal Soltys <soltys@....info> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> As weird as that sounds, this is what I observed today after bumping
>>>>>>>> kernel version. I have a setup where 2 bonds are attached to linux
>>>>>>>> bridge and physically are connected to two switches doing MSTP (and
>>>>>>>> linux bridge is just passing them).
>>>>>>>>
>>>>>>>> Initially I suspected some changes related to bridge code - but quick
>>>>>>>> peek at the code showed nothing suspicious - and the part of it that
>>>>>>>> explicitly passes stp frames if stp is not enabled has seen little
>>>>>>>> changes (e.g. per-port group_fwd_mask added recently). Furthermore - if
>>>>>>>> regular non-bonded interfaces are attached everything works fine.
>>>>>>>>
>>>>>>>> Just to be sure I detached the bond (802.3ad mode) and checked it with
>>>>>>>> simple tcpdump (ether proto \\stp) - and indeed no hello packets were
>>>>>>>> there (with them being present just fine on active enslaved interface,
>>>>>>>> or on the bond device in earlier kernels).
>>>>>>>>
>>>>>>>> If time permits I'll bisect tommorow to pinpoint the commit, but from
>>>>>>>> quick todays test - 4.9.x is working fine, while 4.16.16 (tested on
>>>>>>>> debian) and 4.17.3 (tested on archlinux) are failing.
>>>>>>>>
>>>>>>>> Unless this is already a known issue (or you have any suggestions what
>>>>>>>> could be responsible).
>>>>>>>>
>>>>>>> I believe these are link-local-multicast messages and sometime back a
>>>>>>> change went into to not pass those frames to the bonding master. This
>>>>>>> could be the side effect of that.
>>>>>>
>>>>>> Mahesh, I suspect you're thinking of:
>>>>>>
>>>>>> commit b89f04c61efe3b7756434d693b9203cc0cce002e
>>>>>> Author: Chonggang Li <chonggangli@...gle.com>
>>>>>> Date: Sun Apr 16 12:02:18 2017 -0700
>>>>>>
>>>>>> bonding: deliver link-local packets with skb->dev set to link that packets arrived on
>>>>>>
>>>>>> Michal, are you able to revert this patch and test?
>>>>>>
>>>>>> -J
>>>>>>
>>>>>> ---
>>>>>> -Jay Vosburgh, jay.vosburgh@...onical.com
>>>>>>
>>>>>
>>>>>
>>>>>Just tested - yes, reverting that patch solves the issues.
>>>>
>>>> Chonggang,
>>>>
>>>> Reading the changelog in your commit referenced above, I'm not
>>>> entirely sure what actual problem it is fixing. Could you elaborate?
>>>>
>>>> As the patch appears to cause a regression, it needs to be
>>>> either fixed or reverted.
>>>>
>>>> Mahesh, you signed-off on it as well, perhaps you also have some
>>>> context?
>>>>
>>>
>>>I think the original idea behind it was to pass the LLDPDUs to the
>>>stack on the interface that they came on since this is considered to
>>>be link-local traffic and passing to bond-master would loose it's
>>>"linklocal-ness". This is true for LLDP and if you change the skb->dev
>>>of the packet, then you don't know which slave link it came on in
>>>(from LLDP consumer's perspective).
>>>
>>>I don't know much about STP but trunking two links and aggregating
>>>this link info through bond-master seems wrong. Just like LLDP, you
>>>are losing info specific to a link and the decision derived from that
>>>info could be wrong.
>>>
>>>Having said that, we determine "linklocal-ness" by looking at L2 and
>>>bondmaster shares this with lts slaves. So it does seem fair to pass
>>>those frames to the bonding-master but at the same time link-local
>>>traffic is supposed to be limited to the physical link (LLDP/STP/LACP
>>>etc). Your thoughts?
>>
>> I agree the whole thing sounds kind of weird, but I'm curious as
>> to what Michal's actual use case is; he presumably has some practical
>> use for this, since he noticed that the behavior changed.
>>
>
> The whole "link-local" term is a bit I don't know - at this point it
> feels like too many things were thrown into single bag and it got
> somewhat confusing (bpdu, lldp, pause frames, lacp, pae, qinq mulitcast
> that afaik has its own address) - I added some examples in another reply
> I did at the same time as you were typing this one =)
>
>> Michal, you mentioned MSTP and using 802.3ad (LACP) mode; how
>> does that combination work rationally given that the bond might send and
>> receive traffic across multiple slaves? Or does the switch side bundle
>> the ports together into a single logical interface for MSTP purposes?
>> On the TX side, I think the bond will likely balance all STP frames to
>> just one slave.
>>
>
> The basic concept - two "main" switches with "important" machines
> connected to those. One switch dies and everything keeps working. With
> no unused ports and so on.
>
> In more details:
>
> Originally I was trying MSTP daemon (on "important" machines) which
> seems quite well and completely coded, but cannot really work correctly
> - as afaik you can't put port (in linux bridge conext) in different
> forwarding/blocking/etc. state per-region - itow per group of vlans (or
> mstpd didn't know how to do that, or it wasn't implemented - I didn't
> look too deep back then, though my interest resurfaced in recent days).
>
> So that option was out of the question. But any switch, real or not,
> /must/ pass bpdu frames if it doesn't interpret them.
I think it's contrary! These (STP) frames are link-local-multicast
frames i.e. their dest-mac is multicast-mac (unlike the one that you
would see on an interface). So a *real* switch would never pass those
frames (because there wont be an entry in the CAM table) and would
have to consume those frames. Just imagine a userspace LACP-daemon
receiving LACPDUs on bonding master, it wouldn't make right decisions
about slaves and hence the health of the aggregator would be
questionable. Also imagine LACP frames being forwards by *real*
switch, things wont work as intended.
> So instead of
> having active mstp participant, we have passive linux bridge that passes
> the frames and the two real switches around that care of mstp, treating
> the linux as a shared segment. The costs/priorities/etc. on the real
> swtiches are set so one bond handles two regions, and the other bond
> handles other two regions. If any of the real switches dies or is taken
> down for e.g. firmware update - the bond going to the other switch
> handles all four regions (failover is of course not as fast as with
> active rstp/mstp participation, but works quite well none the less -
> around 10s after some tuning).
>
> We could have used RSTP for that purpose as well - but that being all or
> nothing in context of per-port blocking/forwarding, would leave half of
> the ports unused - and we wanted to avoid that (that's why MSTP was
> created after all).
>
> Instead of using 2 bonds (2 interfaces each) we could just use 4
> interfaces directly, one per region. But two of those regions see very
> little traffic, so we put more and less active regions in pairs.
>
>> As for a resolution, presuming that Michal has some reasonable
>> use case, I'm thinking along the lines of reverting the new (leave frame
>> attached to slave) behavior for the general case and adding a special
>> case for LLDP and friends to get the new behavior. I'd like to avoid
>> adding any new options to bonding.
>>
>
> My use case aside, this will cause issues for anyone attaching bond
> (instead of direct interface or veth) to a bridge and doing something
> more complex with it - whether related to stp or to selectively passing
> e.g. lldp using group_fwd_mask sysfs. Or having LLDP daemon (e.g.
> systemd-resolvd to not look far away) told to do LLDP on bond device
> (even most basic active-backup case) and remaining blind. Or anything
> else that expects to see/pass those multicasts on/via bonded device
> (which is just a convenient way to create virtual interface out of real
> interfaces after all - ITOW shouldn't probably make any calls in this
> regard).
A real switch cannot forward link-local-multicast and if it does, then
I would consider it as broken. Having said that we are taking about
the bonding-master and as I mentioned earlier, it seems fair to pass
those frames to bonding master because we share L2 with slaves but we
should pass the frames to the stack as they are (current behavior!) to
maintain the "linklocal-ness" so both master and the current-slave get
a copy (it's multicast after all).
Thanks,
--mahesh..
Powered by blists - more mailing lists