lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250504172159.7358-1-kuniyu@amazon.com>
Date: Sun, 4 May 2025 10:20:48 -0700
From: Kuniyuki Iwashima <kuniyu@...zon.com>
To: <edumazet@...gle.com>
CC: <davem@...emloft.net>, <dsahern@...nel.org>, <horms@...nel.org>,
	<kuba@...nel.org>, <kuni1840@...il.com>, <kuniyu@...zon.com>,
	<netdev@...r.kernel.org>, <pabeni@...hat.com>
Subject: Re: [PATCH v3 net-next 15/15] ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.

From: Eric Dumazet <edumazet@...gle.com>
Date: Sun, 4 May 2025 02:16:13 -0700
> On Thu, Apr 17, 2025 at 5:10 PM Kuniyuki Iwashima <kuniyu@...zon.com> wrote:
> >
> > Now we are ready to remove RTNL from SIOCADDRT and RTM_NEWROUTE.
> >
> > The remaining things to do are
> >
> >   1. pass false to lwtunnel_valid_encap_type_attr()
> >   2. use rcu_dereference_rtnl() in fib6_check_nexthop()
> >   3. place rcu_read_lock() before ip6_route_info_create_nh().
> >
> > Let's complete the RTNL-free conversion.
> >
> > When each CPU-X adds 100000 routes on table-X in a batch
> > concurrently on c7a.metal-48xl EC2 instance with 192 CPUs,
> >
> > without this series:
> >
> >   $ sudo ./route_test.sh
> >   ...
> >   added 19200000 routes (100000 routes * 192 tables).
> >   time elapsed: 191577 milliseconds.
> >
> > with this series:
> >
> >   $ sudo ./route_test.sh
> >   ...
> >   added 19200000 routes (100000 routes * 192 tables).
> >   time elapsed: 62854 milliseconds.
> >
> > I changed the number of routes in each table (1000 ~ 100000)
> > and consistently saw it finish 3x faster with this series.
> >
> > Note that now every caller of lwtunnel_valid_encap_type() passes
> > false as the last argument, and this can be removed later.
> >
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@...zon.com>
> > ---
> >  net/ipv4/nexthop.c |  4 ++--
> >  net/ipv6/route.c   | 18 ++++++++++++------
> >  2 files changed, 14 insertions(+), 8 deletions(-)
> >
> > diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
> > index 6ba6cb1340c1..823e4a783d2b 100644
> > --- a/net/ipv4/nexthop.c
> > +++ b/net/ipv4/nexthop.c
> > @@ -1556,12 +1556,12 @@ int fib6_check_nexthop(struct nexthop *nh, struct fib6_config *cfg,
> >         if (nh->is_group) {
> >                 struct nh_group *nhg;
> >
> > -               nhg = rtnl_dereference(nh->nh_grp);
> > +               nhg = rcu_dereference_rtnl(nh->nh_grp);
> 
> Or add a condition about table lock being held ?

I think in this context the caller is more of an rcu reader
than a ipv6 route writer.



> 
> >                 if (nhg->has_v4)
> >                         goto no_v4_nh;
> >                 is_fdb_nh = nhg->fdb_nh;
> >         } else {
> > -               nhi = rtnl_dereference(nh->nh_info);
> > +               nhi = rcu_dereference_rtnl(nh->nh_info);
> >                 if (nhi->family == AF_INET)
> >                         goto no_v4_nh;
> >                 is_fdb_nh = nhi->fdb_nh;
> > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > index c8c1c75268e3..bb46e724db73 100644
> > --- a/net/ipv6/route.c
> > +++ b/net/ipv6/route.c
> > @@ -3902,12 +3902,16 @@ int ip6_route_add(struct fib6_config *cfg, gfp_t gfp_flags,
> >         if (IS_ERR(rt))
> >                 return PTR_ERR(rt);
> >
> > +       rcu_read_lock();
> 
> This looks bogus to me (and syzbot)
> 
> Holding rcu_read_lock() from writers is almost always wrong.

Yes, but I added it as a reader of netdev and nexthop to guarantee
that they won't go away.


> 
> In this case, this prevents any GFP_KERNEL allocations to happen
> (among other things)

Oh, I completely missed this path, thanks!

Fortunately, it seems all ->build_state() except for
ip_tun_build_state() use GFP_ATOMIC.

So, simply changing GFP_KERNEL to GFP_ATOMIC is acceptable ?


lwtunnel_state_alloc
  - kzalloc(GFP_ATOMIC)

ip_tun_build_state
  - dst_cache_init(GFP_KERNEL)

ip6_tun_build_state / mpls_build_state / xfrmi_build_state
-> no alloc other than lwtunnel_state_alloc()

bpf_build_state
  - bpf_parse_prog
    - nla_memdup(GFP_ATOMIC)

ila_build_state / ioam6_build_state / rpl_build_state / seg6_build_state
  - dst_cache_init(GFP_ATOMIC)

seg6_local_build_state
  - seg6_end_dt4_build / seg6_end_dt6_build / seg6_end_dt46_build
    -> no alloc other than lwtunnel_state_alloc()


> 
> [ BUG: Invalid wait context ]
> 6.15.0-rc4-syzkaller-00746-g836b313a14a3 #0 Tainted: G W
> -----------------------------
> syz-executor234/5832 is trying to lock:
> ffffffff8e021688 (pcpu_alloc_mutex){+.+.}-{4:4}, at:
> pcpu_alloc_noprof+0x284/0x16b0 mm/percpu.c:1782
> other info that might help us debug this:
> context-{5:5}
> 1 lock held by syz-executor234/5832:
> #0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire
> include/linux/rcupdate.h:331 [inline]
> #0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock
> include/linux/rcupdate.h:841 [inline]
> #0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at:
> ip6_route_add+0x4d/0x2f0 net/ipv6/route.c:3913
> stack backtrace:
> CPU: 0 UID: 0 PID: 5832 Comm: syz-executor234 Tainted: G W
> 6.15.0-rc4-syzkaller-00746-g836b313a14a3 #0 PREEMPT(full)
> Tainted: [W]=WARN
> Hardware name: Google Google Compute Engine/Google Compute Engine,
> BIOS Google 04/29/2025
> Call Trace:
> <TASK>
> dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
> print_lock_invalid_wait_context kernel/locking/lockdep.c:4831 [inline]
> check_wait_context kernel/locking/lockdep.c:4903 [inline]
> __lock_acquire+0xbcf/0xd20 kernel/locking/lockdep.c:5185
> lock_acquire+0x120/0x360 kernel/locking/lockdep.c:5866
> __mutex_lock_common kernel/locking/mutex.c:601 [inline]
> __mutex_lock+0x182/0xe80 kernel/locking/mutex.c:746
> pcpu_alloc_noprof+0x284/0x16b0 mm/percpu.c:1782
> dst_cache_init+0x37/0xc0 net/core/dst_cache.c:145
> ip_tun_build_state+0x193/0x6b0 net/ipv4/ip_tunnel_core.c:687
> lwtunnel_build_state+0x381/0x4c0 net/core/lwtunnel.c:137
> fib_nh_common_init+0x129/0x460 net/ipv4/fib_semantics.c:635
> fib6_nh_init+0x15e4/0x2030 net/ipv6/route.c:3669
> ip6_route_info_create_nh+0x139/0x870 net/ipv6/route.c:3866
> ip6_route_add+0xf6/0x2f0 net/ipv6/route.c:3915
> inet6_rtm_newroute+0x284/0x1c50 net/ipv6/route.c:5732
> rtnetlink_rcv_msg+0x7cc/0xb70 net/core/rtnetlink.c:6955
> netlink_rcv_skb+0x219/0x490 net/netlink/af_netlink.c:2534
> netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline]
> netlink_unicast+0x758/0x8d0 net/netlink/af_netlink.c:1339
> netlink_sendmsg+0x805/0xb30 net/netlink/af_netlink.c:1883
> sock_sendmsg_nosec net/socket.c:712 [inline]
> __sock_sendmsg+0x219/0x270 net/socket.c:727
> ____sys_sendmsg+0x505/0x830 net/socket.c:2566
> ___sys_sendmsg+0x21f/0x2a0 net/socket.c:2620
> __sys_sendmsg net/socket.c:2652 [inline]
> __do_sys_sendmsg net/socket.c:2657 [inline]
> __se_sys_sendmsg net/socket.c:2655 [inline]
> __x64_sys_sendmsg+0x19b/0x260 net/socket.c:2655
> do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
> do_syscall_64+0xf6/0x210 arch/x86/entry/syscall_64.c:94

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ