[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <b82de9da-da16-c0de-6e93-460b53179b0f@gmail.com>
Date: Fri, 7 Apr 2023 17:17:37 +0900
From: Taehee Yoo <ap420073@...il.com>
To: wangyufen <wangyufen@...wei.com>,
Eric Dumazet <edumazet@...gle.com>,
Dmitry Vyukov <dvyukov@...gle.com>,
Jarod Wilson <jarod@...hat.com>
Cc: syzbot <syzbot+60748c96cf5c6df8e581@...kaller.appspotmail.com>,
Jiří Pírko <jiri@...nulli.us>,
davem@...emloft.net, kuba@...nel.org, linux-kernel@...r.kernel.org,
netdev@...r.kernel.org, pabeni@...hat.com,
syzkaller-bugs@...glegroups.com,
Cong Wang <xiyou.wangcong@...il.com>
Subject: Re: [syzbot] kernel panic: kernel stack overflow
Hi wangyufen,
On 2023. 4. 7. 오후 4:22, wangyufen wrote:
>
>
> 在 2022/10/21 19:08, Taehee Yoo 写道:
>> Hi,
>>
>> 2022. 10. 14. 오전 12:00에 Taehee Yoo 이(가) 쓴 글:
>> > Hi,
>> >
>> > On 10/12/22 21:19, Eric Dumazet wrote:
>> > > On Wed, Oct 12, 2022 at 12:53 AM Dmitry Vyukov
<dvyukov@...gle.com>
>> > wrote:
>> > >>
>> > >> On Wed, 12 Oct 2022 at 09:48, syzbot
>> > >> <syzbot+60748c96cf5c6df8e581@...kaller.appspotmail.com> wrote:
>> > >>>
>> > >>> Hello,
>> > >>>
>> > >>> syzbot found the following issue on:
>> > >>>
>> > >>> HEAD commit: bbed346d5a96 Merge branch 'for-next/core' into
>> > for-kernelci
>> > >>> git tree:
>> > git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git
>> for-kernelci
>> > >>> console output:
>> > https://syzkaller.appspot.com/x/log.txt?x=14a03a2a880000
>> > >>> kernel config:
>> > https://syzkaller.appspot.com/x/.config?x=aae2d21e7dd80684
>> > >>> dashboard link:
>> > https://syzkaller.appspot.com/bug?extid=60748c96cf5c6df8e581
>> > >>> compiler: Debian clang version
>> > 13.0.1-++20220126092033+75e33f71c2da-1~exp1~20220126212112.63, GNU ld
>> > (GNU Binutils for Debian) 2.35.2
>> > >>> userspace arch: arm64
>> > >>>
>> > >>> Unfortunately, I don't have any reproducer for this issue yet.
>> > >>>
>> > >>> Downloadable assets:
>> > >>> disk image:
>> >
>>
https://storage.googleapis.com/syzbot-assets/11078f50b80b/disk-bbed346d.raw.xz
>> >
>> > >>> vmlinux:
>> >
>>
https://storage.googleapis.com/syzbot-assets/398e5f1e6c84/vmlinux-bbed346d.xz
>> >
>> > >>>
>> > >>> IMPORTANT: if you fix the issue, please add the following tag to
>> > the commit:
>> > >>> Reported-by:
>> syzbot+60748c96cf5c6df8e581@...kaller.appspotmail.com
>> > >>
>> > >> +Jiri
>> > >>
>> > >> It looks like the issue is with the team device. It seems to call
>> > >> itself infinitely.
>> > >> team_device_event was mentioned in stack overflow bugs in the
>> past:
>> > >>
>> >
>>
https://groups.google.com/g/syzkaller-bugs/search?q=%22team_device_event%22
>> > >>
>> > >
>> > >
>> > > Taehee Yoo, can you take a look ?
>> > >
>> > > Patch series of yours was supposed to limit max nest level to 8
>> > >
>> > >
>> >
>>
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=65921376425fc9c8b7ce647e1f7989f7cdf5dd70
>> >
>> > >
>> >
>> > I found a reproducer.
>> >
>> > #test_team.sh
>> > ip link add dummy0 type dummy
>> > ip link set dummy0 up
>> > for a1 in {0..1}
>> > do
>> > ip link add team$a1 type team
>> > for a2 in {0..1}
>> > do
>> > ip link add team$a1$a2 master team$a1 type team
>> > for a3 in {0..1}
>> > do
>> > ip link add team$a1$a2$a3 master team$a1$a2
>> > type team
>> > for a4 in {0..1}
>> > do
>> > ip link add team$a1$a2$a3$a4 master
>> > team$a1$a2$a3 type team
>> > for a5 in {0..1}
>> > do
>> > ip link add
>> team$a1$a2$a3$a4$a5
>> > master team$a1$a2$a3$a4 type team
>> > for a6 in {0..1}
>> > do
>> > ip link add
>> > team$a1$a2$a3$a4$a5$a6 master team$a1$a2$a3$a4$a5 type team
>> > ip link add
>> > macvlan$a1$a2$a3$a4$a5$a6 link dummy0 master team$a1$a2$a3$a4$a5$a6
>> type
>> > macvlan
>> > ip link set
>> > macvlan$a1$a2$a3$a4$a5$a6 up
>> > ip link set
>> > team$a1$a2$a3$a4$a5$a6 up
>> > done
>> > ip link set
>> team$a1$a2$a3$a4$a5 up
>> > done
>> > ip link set team$a1$a2$a3$a4 up
>> > done
>> > ip link set team$a1$a2$a3 up
>> > done
>> > ip link set team$a1$a2 up
>> > done
>> > ip link set team$a1 up
>> > done
>> >
>> > #test_ethtool.sh
>> > for a1 in {0..1}
>> > do
>> > ethtool -K team$a1 lro $1
>> > for a2 in {0..1}
>> > do
>> > ethtool -K team$a1$a2 lro $1
>> > for a3 in {0..1}
>> > do
>> > ethtool -K team$a1$a2$a3 lro $1
>> > for a4 in {0..1}
>> > do
>> > ethtool -K team$a1$a2$a3$a4 lro $1
>> > for a5 in {0..1}
>> > do
>> > ethtool -K
>> team$a1$a2$a3$a4$a5
>> > lro $1
>> > for a6 in {0..1}
>> > do
>> > ethtool -K
>> > team$a1$a2$a3$a4$a5$a6 lro $1
>> > ethtool -K
>> > macvlan$a1$a2$a3$a4$a5$a6 lro $1
>> > done
>> > done
>> > done
>> > done
>> > done
>> > done
>> >
>> > shell#1
>> > bash test_team.sh
>> > while :
>> > do
>> > bash test_ethtool.sh on
>> > done
>> > shell#2
>> > while :
>> > do
>> > bash test_ethtool.sh off
>> > done
>> >
>> > We can see a very similar call trace with the above reproducer.
>> > I think it is the same issue.
>> > Could you please test it?
>> >
>> > And, I found the fixed same issue too.
>> >
>>
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.0&id=dd912306ff008891c82cd9f63e8181e47a9cb2fb
>> >
>> >
>> https://groups.google.com/g/syzkaller-bugs/c/-5OV1OW-dS4/m/o2Oq6AYSAwAJ
>> >
>>
>> I found the root cause of this issue.
>>
>> This is simpler reproducer.
>>
>> ip link add team0 type team
>> ethtool -K team0 lro on
>> for i in {1..100}
>> do
>> ip link add team$i master team0 type team
>> ethtool -K team$i lro on
>> done
>>
>> ethtool -K team0 lro off
>>
>> The above graph is like below:
>> team0
>> |
>> +------+------+-----+-----+
>> | | | | |
>> team1 team2 team3 ... team100
>>
>> int __netdev_update_features(struct net_device *dev)
>> {
>> struct net_device *upper, *lower;
>> netdev_features_t features;
>> struct list_head *iter;
>> int err = -1;
>> ...
>> sync_lower:
>> /* some features must be disabled on lower devices when
disabled
>> * on an upper device (think: bonding master or bridge)
>> */
>> netdev_for_each_lower_dev(dev, lower, iter)
>> netdev_sync_lower_features(dev, lower, features);
>> ...
>>
>>
>> static void netdev_sync_lower_features(struct net_device *upper,
>> struct net_device *lower, netdev_features_t features)
>> {
>> netdev_features_t upper_disables = NETIF_F_UPPER_DISABLES;
>> netdev_features_t feature;
>> int feature_bit;
>>
>> for_each_netdev_feature(upper_disables, feature_bit) {
>> feature = __NETIF_F_BIT(feature_bit);
>> if (!(features & feature) && (lower->features &
>> feature)) {
>> netdev_dbg(upper, "Disabling feature %pNF on
>> lower dev %s.\n",
>> &feature, lower->name);
>> lower->wanted_features &= ~feature;
>> __netdev_update_features(lower);
>>
>> if (unlikely(lower->features & feature))
>> netdev_WARN(upper, "failed to disable
>> %pNF on %s!\n",
>> &feature, lower->name);
>> else
>>
netdev_features_change(lower);<-----HERE
>> }
>> }
>> }
>>
>> void netdev_features_change(struct net_device *dev)
>> {
>> call_netdevice_notifiers(NETDEV_FEAT_CHANGE, dev);
>> }
>>
>> The code looks like an iterator.
>> But it would work recursively because of notification.
>>
>> When team0's feature(LRO) is changed with <ethtool -K team0 lro off>",
>> __netdev_update_features(team0) is called.
>> __netdev_update_features(team0) internally sends NETDEV_FEAT_CHANGE
>> event to all lower interfaces(team1, team2, ... team100).
>> team1 will receive NETDEV_FEAT_CHANGE, and it sends NETDEV_FEAT_CHANGE
>> to the upper interface(team0).
>> team0 will receive NETDEV_FEAT_CHANGE again, and it sends
>> NETDEV_FEAT_CHANGE to the all lower interfaces(team1, team2, ...
>> team100).
>> (At this point, team1 flag was already set, so it will be skipped.)
>> team2 will receive NETDEV_FEAT_CHANGE, and it sends NETDEV_FEAT_CHANGE
>> to the upper interface(team0).
>> team0 will receive NETDEV_FEAT_CHANGE again again, and it sends
>> NETDEV_FEAT_CHANGE to the all lower interfaces(team1, team2, ...
>> team100).
>> (team1, team2 skipped.)
>> ...
>> So, if there are a few lower interfaces(roughly under 30 lower
>> interfaces), it anyway works even if internally works recursively.
>> But so many lower interfaces exist, stack overflow will occur.
>> This is the root cause of this issue.
>>
>> I think synchronization direction should be one way.
>> Up or Down.
>> It means that if the team0 interface can send the NETDEV_FEAT_CHANGE
>> notification event to the lower interface,
>> the lower interfaces should be disallowed to send NETDEV_FEAT_CHANGE
>> event to the upper interface.
>>
>> bonding has same issue.
>
> Excuse me, is there a fix for this issue? I had the same issue with the
> 5.10 version of the bonding.
It is not fixed, I will fix it.
I found the problem of this issue, but I couldn't find a good solution yet.
I think It would need relatively much time for fixing it.
Thanks!
Taehee Yoo
Powered by blists - more mailing lists