Message-Id: <B38CC11A-18E4-440F-8DE1-EF80C6C9A1DB@pjd.dev>
Date: Sun, 18 Dec 2022 19:16:54 -0800
From: Peter Delevoryas <peter@....dev>
To: Alexander Duyck <alexander.duyck@...il.com>
Cc: Peter Delevoryas <peter@....dev>, sam@...dozajonas.com,
davem@...emloft.net, edumazet@...gle.com, kuba@...nel.org,
pabeni@...hat.com, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] net/ncsi: Always use unicast source MAC address
> On Dec 17, 2022, at 12:57 PM, Alexander Duyck <alexander.duyck@...il.com> wrote:
>
> On Fri, Dec 16, 2022 at 8:20 PM Peter Delevoryas <peter@....dev> wrote:
>>
>>
>>
>>> On Dec 16, 2022, at 10:29 AM, Alexander Duyck <alexander.duyck@...il.com> wrote:
>>>
>>> On Thu, Dec 15, 2022 at 5:08 PM Peter Delevoryas <peter@....dev> wrote:
>>>>
>>>>
>>>>
>>>>> On Dec 13, 2022, at 8:41 AM, Alexander H Duyck <alexander.duyck@...il.com> wrote:
>>>>>
>>>>> On Mon, 2022-12-12 at 16:47 -0800, Peter Delevoryas wrote:
>
> <...>
>
>>>
>>>>> My main
>>>>> concern would be that the dev_addr is not initialized for those first
>>>>> few messages so you may be leaking information.
>>>>>
>>>>>> This might have the effect of causing the NIC to learn 2 MAC addresses from
>>>>>> an NC-SI link if the BMC uses OEM Get MAC Address commands to change its
>>>>>> initial MAC address, but it shouldn't really matter. Who knows if NIC's
>>>>>> even have MAC learning enabled from the out-of-band BMC link, lol.
>>>>>>
>>>>>> [1]: https://tinyurl.com/4933mhaj
>>>>>> [2]: https://tinyurl.com/mr3tyadb
>>>>>
>>>>> The thing is the OpenBMC approach initializes the value themselves to
>>>>> broadcast[3]. As a result the two code bases are essentially doing the
>>>>> same thing since mac_addr is defaulted to the broadcast address when
>>>>> the ncsi interface is registered.
>>>>
>>>> That’s a very good point, thanks for pointing that out, I hadn’t
>>>> even noticed that!
>>>>
>>>> Anyways, let me know what you think of the traces I added above.
>>>> Sorry for the delay, I’ve just been busy with some other stuff,
>>>> but I do really actually care about upstreaming this (and several
>>>> other NC-SI changes I’ll submit after this one, which are unrelated
>>>> but more useful).
>>>>
>>>> Thanks,
>>>> Peter
>>>
>>> So the NC-SI spec says any value can be used for the source MAC and
>>> that broadcast "may" be used. I would say there are some debugging
>>> advantages to using broadcast that will be obvious in a packet trace.
>>
>> Ehhhhh yeah I guess, but the ethertype is what I filter for. But sure,
>> a broadcast source MAC is pretty unique too.
>>
>>> I wonder if we couldn't look at doing something like requiring
>>> broadcast or LAA if the gma_flag isn't set.
>>
>> What is LAA? I’m out of the loop
>
> Locally administered MAC address[4]. Basically it is a MAC address
> that is generated locally such as your random MAC address. Assuming
> the other end of the NC-SI link is using a MAC address with a vendor
> OUI there should be no risk of collisions on a point-to-point link.
> Essentially if you wanted to you could probably just generate a random
> MAC address for the NCSI protocol and then use that in place of the
> broadcast address.
>
>> But also: aren’t we already using broadcast if the gma_flag isn’t set?
>>
>> -	if (nca->ndp->gma_flag == 1)
>> -		memcpy(eh->h_source, nca->ndp->ndev.dev->dev_addr, ETH_ALEN);
>> -	else
>> -		eth_broadcast_addr(eh->h_source);
>> +	memcpy(eh->h_source, nca->ndp->ndev.dev->dev_addr, ETH_ALEN);
>
> That I am not sure about. You were using this kernel without your
> patch right? With your patch it would make sense to see that behavior,
> but without I am not sure why you would see that address for any NC-SI
> commands before the gma_flag is set.
>
>>
>>> With that we could at
>>> least advertise that we don't expect this packet to be going out in a
>>> real network as we cannot guarantee the MAC is unique.
>>
>> Yeah, but it probably wouldn’t help my simulation scenario.
>>
>> I guess it sounds like this patch is not a good idea, which to be fair,
>> is totally reasonable.
>>
>> I can just add some iptables rules to tunnel these packets with a different
>> source MAC, or fix the multicast socket issue I was having. It’s really
>> not a big deal, and like you’re saying, we probably don’t want to make
>> it harder to maintain _forever_.
>
> Like I said before I would be good with either a Broadcast address OR
> a LAA address. The one thing we need to watch out for though is any
> sort of leak. One possible concern would be if for example you had 4
> ports using 4 different MAC addresses but one BMC. You don't want to
> accidentally leak the MAC address from one port onto the other one. With
> a LAA address if it were to leak and screw up ARP tables somewhere it
> wouldn't be a big deal since it isn't expected to be switched in the
> first place.
>
>> I would just suggest praying for the next guy that tries to test NC-SI
>> stuff with QEMU and finds out NC-SI traffic gets dropped by bridges.
>> I had to resort to reading the source code and printing stuff with
>> BPF to identify this. Maybe it’s more obvious to other people this wouldn’t
>> work though.
>
> Well it seems like NC-SI isn't meant to be bridged based on the fact
> that it is using a broadcast MAC address as a source. If nothing else
> I suppose you could try to work with the standards committee on that
> to see what can be done to make the protocol more portable.. :-)

Well, I started preparing some of my other patches to send, and while
digging up the history for that, I happened to notice this commit
completely by chance while browsing GitHub:

https://github.com/facebook/openbmc-linux/commit/933b5bd024d28f48a6359e6a9db631f778ba9ea7

[openbmc.quanta][PR] FBAL:Fixed NCSI can't work when import BR function

Summary:
As title.

Pull Request resolved: https://github.com/facebookexternal/openbmc.quanta/pull/1668
GitHub Author: Peter <peter.yin@...ntatw.com>

diff --git a/meta-aspeed/recipes-kernel/linux/files/linux-aspeed-5.0/net/bridge/br_input.c b/meta-aspeed/recipes-kernel/linux/files/linux-aspeed-5.0/net/bridge/br_input.c
index 5ea7e56119c1..8ef0b627f5ec 100644
--- a/meta-aspeed/recipes-kernel/linux/files/linux-aspeed-5.0/net/bridge/br_input.c
+++ b/meta-aspeed/recipes-kernel/linux/files/linux-aspeed-5.0/net/bridge/br_input.c
@@ -220,6 +220,9 @@ rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
 	if (unlikely(skb->pkt_type == PACKET_LOOPBACK))
 		return RX_HANDLER_PASS;
+	if (skb->protocol == cpu_to_be16(ETH_P_NCSI))
+		return RX_HANDLER_PASS;
+
 	if (!is_valid_ether_addr(eth_hdr(skb)->h_source))
 		goto drop;

This accomplishes the same thing I suggested in my patch, except
that it modifies the Linux bridge code instead of changing the NC-SI
packets’ source MAC address.
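
For context on why a broadcast source MAC gets dropped by the bridge: the kernel's is_valid_ether_addr() rejects any address that is multicast (which includes broadcast) or all zeros. Here is a minimal Python sketch of that check and of the ethertype bypass added by the hunk. The Python function names and the three-way "pass" / "drop" / "bridge" result are illustrative simplifications, not the kernel's actual code.

```python
# Simplified model of the source-address check in br_handle_frame().
# All names here are illustrative; only the bit tests mirror the kernel helpers.

ETH_P_NCSI = 0x88F8  # NC-SI ethertype, as seen in the tcpdump filter


def parse_mac(text: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    octets = bytes(int(part, 16) for part in text.split(":"))
    if len(octets) != 6:
        raise ValueError(f"expected 6 octets, got {len(octets)}")
    return octets


def is_multicast(mac: bytes) -> bool:
    # The least significant bit of the first octet is the group bit.
    # Broadcast (ff:ff:ff:ff:ff:ff) has it set, so it counts as multicast.
    return bool(mac[0] & 0x01)


def is_zero(mac: bytes) -> bool:
    return mac == bytes(6)


def is_valid_source(mac: bytes) -> bool:
    # Same rule as is_valid_ether_addr(): not multicast and not all-zero.
    return not is_multicast(mac) and not is_zero(mac)


def bridge_rx_decision(src: bytes, ethertype: int, ncsi_bypass: bool) -> str:
    """Return 'pass', 'drop', or 'bridge' for a received frame (simplified)."""
    if ncsi_bypass and ethertype == ETH_P_NCSI:
        return "pass"    # hand the frame back to the stack, skip the bridge
    if not is_valid_source(src):
        return "drop"    # e.g. broadcast used as a source address
    return "bridge"      # continue with normal bridge processing
```

With `ncsi_bypass=False`, an NC-SI frame whose source is ff:ff:ff:ff:ff:ff comes out as "drop"; with `ncsi_bypass=True` it comes out as "pass".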

To explain what I *think* this person was doing...

Meta has a system called Zion that’s described here:

https://engineering.fb.com/2019/03/14/data-center-engineering/accelerating-infrastructure/

It consists of two chassis, “Angel’s Landing” and “Emerald Pools”.
Together, it’s kinda like an Nvidia DGX A100 system, but with generic
PCIe switches and “OCP Accelerators”. There’s like an AMD GPU or an
Intel accelerator that can fit there. Maybe an A100 can fit too? I’m
not really completely clear on how it’s being used compared to Grand Teton,
announced at OCP 2022, which is even closer to the DGX architecture,
but yeah.

Angel’s Landing is 4 dual-socket boards stacked together, each board
with a BMC and NIC supporting NC-SI. I think in practice we reduced
this to 1-2 dual-socket boards, each with 2 NICs (presumably because
we don’t need that many CPUs but still need the network bandwidth).

Emerald Pools is a single board and 8 accelerator modules, and
the board has a BMC on it. To get network connectivity to the BMC,
there’s a USB link from Emerald Pools to one of the Angel’s Landing BMCs,
and the Angel’s Landing BMC bridges Emerald Pools traffic through
its NIC. If this doesn’t make sense, I think this is the whole setup
(omitting the device tree and some MAC filtering stuff):

On an Angel’s Landing BMC:

$ ip link add br0 type bridge
$ ip link set eth0 master br0
$ ip link set eth1 master br0
$ ip link set usb0 master br0

And on the Emerald Pools BMC, there’s just a usb net intf:

$ ifconfig
lo        ….
usb0      Link encap:Ethernet  HWaddr xxxxxxxxxxx
          inet6 addr: xxxxxx Scope:Link
          inet6 addr: xxxxxx Scope:Global
          inet6 addr: xxxxxx Scope:Global
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:999332 errors:0 dropped:0 overruns:0 frame:0
          TX packets:594253 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:211829527 (202.0 MiB)  TX bytes:150569888 (143.5 MiB)

Anyways, so then my question was: is Zion actually relying on NC-SI
packets traversing a bridge?

The Emerald Pools BMC doesn’t have NC-SI enabled at all, not even a
userspace daemon or utility of any kind.

NC-SI *is* enabled and used on the Angel’s Landing BMC, so I checked
to see if NC-SI packets traverse the bridge (in QEMU, I didn’t check on a
real system):

root@...-oob:~# tcpdump -i br0 -v "ether proto 0x88f8" &
[1] 12045
root@...-oob:~# [ 1434.520314] device br0 entered promiscuous mode
tcpdump: listening on br0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
ifconfig eth0 down
[ 1442.863305] br0: port 1(eth0) entered disabled state
root@...-oob:~# ifconfig eth0 up
[ 1445.978424] br0: port 1(eth0) entered blocking state
[ 1445.978743] br0: port 1(eth0) entered forwarding state
[ 1445.979131] 8021q: adding VLAN 0 to HW filter on device eth0
[ 1445.979814] ftgmac100 1e660000.ethernet eth0: NCSI: Handler for packet type 0x82 returned -19
root@...-oob:~# tcpdump -i eth0 -v "ether proto 0x88f8" &
[2] 12258
root@...-oob:~# tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
ifcon04:58:49.464810 fa:ce:b0:02:20:22 (oui Unknown) > Broadcast, ethertype Unknown (0x88f8), length 60:
	0x0000:  0001 0068 0a00 0000 0000 0000 0000 0000  ...h............
	0x0010:  ffff f597 0000 0000 0000 0000 0000 0000  ................
	0x0020:  0000 0000 0000 0000 0000 0000 0000       ..............
04:58:49.465099 Broadcast > Broadcast, ethertype Unknown (0x88f8), length 64:
	0x0000:  0001 0068 8a00 0010 0000 0000 0000 0000  ...h............
	0x0010:  0000 0000 0000 0001 0000 0000 0000 0000  ................
	0x0020:  ffff 7586 0000 0000 0000 0000 0000 d8cd  ..u.............
	0x0030:  c6bc                                     ..
04:58:49.471206 fa:ce:b0:02:20:22 (oui Unknown) > Broadcast, ethertype Unknown (0x88f8), length 60:
	0x0000:  0001 0069 1500 0000 0000 0000 0000 0000  ...i............
	0x0010:  ffff ea96 0000 0000 0000 0000 0000 0000  ................
	0x0020:  0000 0000 0000 0000 0000 0000 0000       ..............
04:58:49.471432 Broadcast > Broadcast, ethertype Unknown (0x88f8), length 78:
	0x0000:  0001 0069 9500 0028 0000 0000 0000 0000  ...i...(........
	0x0010:  0000 0000 f1f0 f000 0000 0000 0000 0000  ................
	0x0020:  0000 0000 0000 0000 0000 0000 0000 0000  ................
	0x0030:  0000 0000 0000 8119 fffd 0765 84e0 9fa4  ...........e....

So, I’m able to see packets on eth0, but so far I haven’t really seen
anything hitting the bridge. ¯\_(ツ)_/¯
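
To make the hex dumps above easier to read, here is a small Python sketch that decodes the first 8 bytes of each NC-SI payload. It assumes the field order of the kernel's struct ncsi_pkt_hdr (MC ID, header revision, reserved, IID, packet type, channel, 16-bit payload length) and that responses use the command type plus 0x80. The class and function names are made up for illustration.

```python
import struct
from typing import NamedTuple


class NcsiHeader(NamedTuple):
    mc_id: int          # management controller ID
    revision: int       # header revision
    iid: int            # instance ID, echoed in the response
    pkt_type: int       # command or response type
    channel: int        # package/channel ID
    payload_len: int    # payload length in bytes
    is_response: bool   # assumed: top bit of the type marks a response


def parse_ncsi_header(payload: bytes) -> NcsiHeader:
    """Decode the leading 8 bytes of an NC-SI packet (after the Ethernet header)."""
    if len(payload) < 8:
        raise ValueError("need at least 8 bytes of NC-SI header")
    # Network byte order: six single bytes followed by a 16-bit length.
    mc_id, revision, _reserved, iid, pkt_type, channel, length = struct.unpack(
        "!BBBBBBH", payload[:8]
    )
    return NcsiHeader(
        mc_id, revision, iid, pkt_type, channel, length, bool(pkt_type & 0x80)
    )


if __name__ == "__main__":
    # First 8 bytes of the four packets in the trace.
    for raw in ("000100680a000000", "000100688a000010",
                "0001006915000000", "0001006995000028"):
        print(parse_ncsi_header(bytes.fromhex(raw)))
```

Read this way, the trace is two command/response pairs: type 0x0a with IID 0x68 answered by 0x8a, and type 0x15 with IID 0x69 answered by 0x95. In the kernel's ncsi-pkt.h, 0x0a is Get Link Status and 0x15 is Get Version ID.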

Perhaps if there’s some cross-interface NC-SI traffic (eth0 <-> eth1), then
yes, this would occur. But I don’t know why that would even happen. Wouldn’t
regular NC-SI failover or bonding (eth0, eth1) be the actual solution? Not sure.

The original commit was very vague, so perhaps I’ll follow up with
the author and reviewer to see if this patch was actually necessary.

>
> [4]: https://macaddress.io/faq/what-are-a-universal-address-and-a-local-administered-address
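
As a footnote to the LAA suggestion quoted above: a locally administered unicast address is one where bit 0x02 of the first octet is set and bit 0x01 (the group bit) is clear. Here is a minimal Python sketch of generating one, in the same spirit as the kernel's eth_random_addr(). The function names are illustrative only.

```python
import os


def random_laa() -> bytes:
    """Return a random locally administered, unicast MAC address."""
    mac = bytearray(os.urandom(6))
    mac[0] &= 0xFE  # clear the group bit, so the address is unicast
    mac[0] |= 0x02  # set the locally administered bit
    return bytes(mac)


def is_locally_administered(mac: bytes) -> bool:
    return bool(mac[0] & 0x02)


def format_mac(mac: bytes) -> str:
    return ":".join(f"{octet:02x}" for octet in mac)


if __name__ == "__main__":
    print(format_mac(random_laa()))
```

By this test the fa:ce:b0:02:20:22 address in the trace is itself locally administered, since 0xfa has bit 0x02 set and bit 0x01 clear.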