Message-ID: <87v7uiexbj.fsf@cloudflare.com>
Date: Mon, 13 Jan 2025 13:02:24 +0100
From: Jakub Sitnicki <jakub@...udflare.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: netdev@...r.kernel.org,  kernel-team@...udflare.com
Subject: Re: [FYI] Input route ref count underflow since probably 6.6.22

On Tue, Jul 02, 2024 at 04:21 PM +02, Jakub Sitnicki wrote:
> On Mon, Jul 01, 2024 at 04:38 PM -07, Jakub Kicinski wrote:
>> On Fri, 28 Jun 2024 13:10:53 +0200 Jakub Sitnicki wrote:
>>> We've observed an unbalanced dst_release() on an input route in v6.6.y.
>>> First noticed in 6.6.22. Or at least that is how far back our logs go.
>>> 
>>> We have just started looking into it and don't have much context yet,
>>> except that:
>>> 
>>> 1. the issue is architecture agnostic, seen both on x86_64 and arm64;
>>> 2. the backtrace, we realize, doesn't point to the source of the problem;
>>>    it's just where the ref count underflow manifests itself;
>>> 3. while we have out-of-tree modules, they are for the crypto subsystem.
>>> 
>>> We will follow up as we collect more info on this, but we would
>>> appreciate any hints or pointers to potential suspects, if anything
>>> comes to mind.
>>
>> Hi! As luck would have it, the same crash landed on my desk.
>> Did you manage to narrow it down?
>
> Nothing yet. Will keep you posted.

Finally circling back to this to hunt it down...

We've tweaked the "imbalanced put()" warning to log the ref count, that
is, the result of the atomic_read(&ref->refcnt) load in
rcuref_put_slowpath():

---8<---
diff --git a/lib/rcuref.c b/lib/rcuref.c
index 5ec00a4a64d1..53006542e785 100644
--- a/lib/rcuref.c
+++ b/lib/rcuref.c
@@ -264,7 +264,8 @@ bool rcuref_put_slowpath(rcuref_t *ref)
         * put() operation is imbalanced. Warn, put the reference count back to
         * DEAD and tell the caller to not deconstruct the object.
         */
-       if (WARN_ONCE(cnt >= RCUREF_RELEASED, "rcuref - imbalanced put()")) {
+       if (WARN_ONCE(cnt >= RCUREF_RELEASED,
+                     "rcuref - imbalanced put(): refcnt=%#x", cnt)) {
                atomic_set(&ref->refcnt, RCUREF_DEAD);
                return false;
        }
--->8---

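For decoding the refcnt values below - the rcuref zone constants, copied
here for convenience from include/linux/rcuref.h (from memory, so
double-check against the tree):

/* Zone layout of the rcuref counter; anything at or above RCUREF_RELEASED
 * means no valid reference is held any more. */
#define RCUREF_ONEREF		0x00000000U
#define RCUREF_MAXREF		0x7FFFFFFFU
#define RCUREF_SATURATED	0xA0000000U
#define RCUREF_RELEASED		0xC0000000U
#define RCUREF_DEAD		0xE0000000U
#define RCUREF_NOREF		0xFFFFFFFFU
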
So far we're only seeing reports with cnt == RCUREF_DEAD, for example:

------------[ cut here ]------------
rcuref - imbalanced put(): refcnt=0xe0000000
WARNING: CPU: 105 PID: 173613 at lib/rcuref.c:267 rcuref_put_slowpath+0x6d/0x80
Modules linked in: overlay mptcp_diag xsk_diag raw_diag unix_diag af_packet_diag netlink_diag esp4 xt_hashlimit ip_set_hash_netport nft_compat nf_conntrack_netlink xfrm_interface xfrm6_tunnel sit nft_numgen nft_log nft_limit dummy ip_gre gre xfrm_user xfrm_algo mpls_gso mpls_iptunnel mpls_router fou6 ip6_tunnel tunnel6 ipip tunnel4 fou ip_tunnel ip6_udp_tunnel udp_tunnel cls_bpf nft_fwd_netdev nf_dup_netdev nft_ct nf_tables zstd zram zsmalloc sch_ingress tcp_diag udp_diag inet_diag veth tun tcp_bbr sch_fq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6table_mangle ip6table_raw ip6table_security ip6table_nat ip6_tables xt_limit xt_LOG nf_log_syslog ipt_REJECT nf_reject_ipv4 xt_multiport xt_tcpmss iptable_filter xt_length xt_TCPMSS xt_DSCP xt_bpf xt_NFLOG nfnetlink_log xt_connbytes xt_connlabel xt_statistic xt_connmark xt_conntrack iptable_mangle xt_mark xt_nat iptable_nat nf_nat xt_owner xt_set xt_comment xt_tcpudp xt_CT nf_conntrack nf_defrag_ipv6
 nf_defrag_ipv4 iptable_raw ip_set_hash_ip ip_set_hash_net ip_set raid0 md_mod essiv dm_crypt trusted asn1_encoder tee dm_mod dax nvme_fabrics 8021q garp mrp stp llc ipmi_ssif amd64_edac kvm_amd kvm irqbypass crc32_pclmul crc32c_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel mlx5_core rapl acpi_ipmi xhci_pci mlxfw nvme ipmi_si ipmi_devintf tls tiny_power_button xhci_hcd ccp i2c_piix4 nvme_core psample ipmi_msghandler button fuse nfnetlink efivarfs ip_tables x_tables bcmcrypt(O) crypto_simd cryptd [last unloaded: kheaders]
CPU: 105 PID: 173613 Comm: napi/iconduit-g Kdump: loaded Tainted: G           O       6.6.69-cloudflare-2025.1.1 #1
Hardware name: HYVE EDGE-METAL-GEN11/HS1811D_Lite, BIOS V0.11-sig 12/23/2022
RIP: 0010:rcuref_put_slowpath+0x6d/0x80
Code: eb da 80 3d 85 0c 37 02 00 74 0a c7 03 00 00 00 e0 31 c0 eb c7 89 c6 48 c7 c7 80 41 af b9 c6 05 69 0c 37 02 01 e8 83 46 9b ff <0f> 0b eb dd 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 90 90 90
RSP: 0018:ffffc90042bb7908 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff88c88d886f40 RCX: 0000000000000027
RDX: ffff88c81faa0788 RSI: 0000000000000001 RDI: ffff88c81faa0780
RBP: ffffc90042bb7988 R08: 0000000000000000 R09: ffffc90042bb7798
R10: ffff88e04f2cc1a8 R11: 0000000000000003 R12: ffff88c88d886f00
R13: ffffc90042bb7a98 R14: 0000000000000000 R15: 0000000039277c9f
FS:  0000000000000000(0000) GS:ffff88c81fa80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc4a9f48000 CR3: 000000306fd7a004 CR4: 0000000000770ee0
PKRU: 55555554
Call Trace:
 <TASK>
 ? rcuref_put_slowpath+0x6d/0x80
 ? __warn+0x81/0x130
 ? rcuref_put_slowpath+0x6d/0x80
 ? report_bug+0x16f/0x1a0
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? prb_read_valid+0x1b/0x30
 ? handle_bug+0x53/0x90
 ? exc_invalid_op+0x17/0x70
 ? asm_exc_invalid_op+0x1a/0x20
 ? rcuref_put_slowpath+0x6d/0x80
 ? rcuref_put_slowpath+0x6d/0x80
 dst_release+0x3d/0xa0
 rt_cache_route+0x6b/0xa0
 rt_set_nexthop.isra.0+0x16c/0x400
 ip_route_input_slow+0x886/0xbe0
 ? srso_alias_return_thunk+0x5/0xfbef5
 ip_route_input_noref+0x95/0xa0
 ip_rcv_finish_core.isra.0+0xc6/0x450
 ip_sublist_rcv+0xe5/0x380
 ? __pfx_ip_rcv_finish+0x10/0x10
 ip_list_rcv+0x165/0x190
 __netif_receive_skb_list_core+0x30f/0x340
 ? srso_alias_return_thunk+0x5/0xfbef5
 netif_receive_skb_list_internal+0x1bd/0x330
 napi_complete_done+0x72/0x200
 veth_poll+0xe4/0x1d0 [veth]
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? psi_group_change+0x177/0x3c0
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? __perf_event_task_sched_in+0x86/0x210
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? finish_task_switch.isra.0+0x85/0x280
 __napi_poll+0x2b/0x1b0
 bpf_trampoline_6442534135+0x7d/0x1000
 ? schedule+0x5e/0xd0
 __napi_poll+0x5/0x1b0
 napi_threaded_poll+0x22c/0x270
 ? __pfx_napi_threaded_poll+0x10/0x10
 kthread+0xe8/0x120
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x34/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1b/0x30
 </TASK>
---[ end trace 0000000000000000 ]---

... which would suggest that we're dealing with a race to release the
last reference! At least I don't see any other explanation for this
state. On paper it could look like this (wide text ahead):

refcnt                              CPU 0                                                                   CPU 1
0x00000000 (RCUREF_ONEREF)          dst_release(dst) {                                                      dst_release(dst) {
"                                     rcuref_put(&dst->__rcuref) {                                            rcuref_put(&dst->__rcuref) {
"                                       rcuref_put {                                                            rcuref_put {
"                                         preempt_disable                                                         preempt_disable
"                                         __rcuref_put {                                                          __rcuref_put {
"                                           atomic_add_negative_release(-1, &ref->refcnt)                           ...
0xFFFFFFFF (RCUREF_NOREF)                   rcuref_put_slowpath {                                                   ...
"                                             cnt = atomic_read(&ref->refcnt)                                       ...
"                                             ...                                                                   atomic_add_negative_release(-1, &ref->refcnt)
0xFFFFFFFE (RCUREF_NOREF-1)                   atomic_try_cmpxchg_release(&ref->refcnt, &cnt, RCUREF_DEAD)           ...
0xE0000000 (RCUREF_DEAD)                    } → false                                                               rcuref_put_slowpath {
"                                         } → false                                                                   cnt = atomic_read(&ref->refcnt)
"                                         preempt_enable                                                              WARN_ONCE(cnt >= RCUREF_RELEASED, "rcuref - imbalanced put()")
"                                       } → false                                                                     atomic_set(&ref->refcnt, RCUREF_DEAD)
"                                     } → false                                                                     } → false
"                                   }                                                                             } → false
"                                                                                                                 preempt_enable
"                                                                                                               } → false
"                                                                                                             } → false
"                                                                                                           }

That's the only new clue so far.

The plan is to add more logging to dump the skb that triggers it, or at
least to dig out the dst dev name. We need more hints.
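
For the record, something like this untested sketch is the kind of hack I
have in mind - dst_release() itself naming the device when it is about to
put a dst that no longer holds a reference. The rcuref_put() +
call_rcu_hurry() pair is what dst_release() already does in 6.6.y; the
rcuref_read() check and the pr_warn() are the hypothetical addition:

void dst_release(struct dst_entry *dst)
{
	if (dst) {
		/* Best-effort debug aid: if the count is already in the
		 * released/dead zone before our put, name the device so the
		 * warning can be tied to a netdev rather than a backtrace. */
		if (unlikely(!rcuref_read(&dst->__rcuref)))
			pr_warn("dst_release: extra put, dev=%s\n",
				dst->dev ? dst->dev->name : "(none)");
		if (rcuref_put(&dst->__rcuref))
			call_rcu_hurry(&dst->rcu_head, dst_destroy_rcu);
	}
}

It won't catch two puts racing neck and neck, but it should at least tell
us which device the offending dst belongs to.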

Side note - annoyingly, rcuref_put_slowpath is notrace, same as every
*.o under lib/ by default. That makes tracing the "imbalanced put()"
branch impossible when running w/o CONFIG_KPROBE_EVENTS_ON_NOTRACE.
I will see if I can lift that restriction.
