lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Fri, 22 Sep 2023 10:06:31 +0700
From: Bagas Sanjaya <bagasdotme@...il.com>
To: Martin Zaharinov <micron10@...il.com>
Cc: Eric Dumazet <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>,
	netdev <netdev@...r.kernel.org>, patchwork-bot+netdevbpf@...nel.org,
	Jakub Kicinski <kuba@...nel.org>,
	Stephen Hemminger <stephen@...workplumber.org>,
	kuba+netdrv@...nel.org, dsahern@...il.com,
	Florian Westphal <fw@...len.de>,
	Pablo Neira Ayuso <pablo@...filter.org>,
	Thorsten Leemhuis <regressions@...mhuis.info>,
	Wangyang Guo <wangyang.guo@...el.com>,
	Arjan Van De Ven <arjan.van.de.ven@...el.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Linux Regressions <regressions@...ts.linux.dev>
Subject: Re: Urgent Bug Report Kernel crash 6.5.2

On Thu, Sep 21, 2023 at 11:13:55AM +0300, Martin Zaharinov wrote:
> Hi Bagas,
> 
> 
> Its not easy to make this on production, have too many users on it.
> 
> i make checks and find with kernel 6.3.12-6.5.13 all is fine.
> on first machine that i have with kernel 6.4 and still work run kernel 6.4.2 and have problem.
> 
> in my investigation problem is start after migration to kernel 6.4.x 
> 
> in 6.4 kernel is add rcuref : 
> 
> https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.4 
> 
> commit bc9d3a9f2afca189a6ae40225b6985e3c775375e
> Author: Thomas Gleixner <tglx@...utronix.de>
> Date: Thu Mar 23 21:55:32 2023 +0100
> 
> net: dst: Switch to rcuref_t reference counting

Is it the culprit you look for? Had you done the bisection and it points
the culprit to that commit

> 
> Under high contention dst_entry::__refcnt becomes a significant bottleneck.
> 
> atomic_inc_not_zero() is implemented with a cmpxchg() loop, which goes into
> high retry rates on contention.
> 
> Switch the reference count to rcuref_t which results in a significant
> performance gain. Rename the reference count member to __rcuref to reflect
> the change.
> 
> The gain depends on the micro-architecture and the number of concurrent
> operations and has been measured in the range of +25% to +130% with a
> localhost memtier/memcached benchmark which amplifies the problem
> massively.
> 
> Running the memtier/memcached benchmark over a real (1Gb) network
> connection the conversion on top of the false sharing fix for struct
> dst_entry::__refcnt results in a total gain in the 2%-5% range over the
> upstream baseline.
> 
> Reported-by: Wangyang Guo <wangyang.guo@...el.com>
> Reported-by: Arjan Van De Ven <arjan.van.de.ven@...el.com>
> Signed-off-by: Thomas Gleixner <tglx@...utronix.de>
> Link: https://lore.kernel.org/r/20230307125538.989175656@linutronix.de
> Link: https://lore.kernel.org/r/20230323102800.215027837@linutronix.de
> Signed-off-by: Jakub Kicinski <kuba@...nel.org>
> 
> 
> and i think problem is here : 
> 
> --- a/net/core/dst.c
> +++ b/net/core/dst.c
> @@ -66,7 +66,7 @@ void dst_init(struct dst_entry *dst, str
> dst->tclassid = 0;
> #endif
> dst->lwtstate = NULL;
> - atomic_set(&dst->__refcnt, initial_ref);
> + rcuref_init(&dst->__refcnt, initial_ref);
> dst->__use = 0;
> dst->lastuse = jiffies;
> dst->flags = flags;
> @@ -162,31 +162,15 @@ EXPORT_SYMBOL(dst_dev_put);
> 
> void dst_release(struct dst_entry *dst)
> {
> - if (dst) {
> - int newrefcnt;
> -
> - newrefcnt = atomic_dec_return(&dst->__refcnt);
> - if (WARN_ONCE(newrefcnt < 0, "dst_release underflow"))
> - net_warn_ratelimited("%s: dst:%p refcnt:%d\n",
> - __func__, dst, newrefcnt);
> - if (!newrefcnt)
> - call_rcu_hurry(&dst->rcu_head, dst_destroy_rcu);
> - }
> + if (dst && rcuref_put(&dst->__refcnt))
> + call_rcu_hurry(&dst->rcu_head, dst_destroy_rcu);
> }
> EXPORT_SYMBOL(dst_release);
> 
> void dst_release_immediate(struct dst_entry *dst)
> {
> - if (dst) {
> - int newrefcnt;
> -
> - newrefcnt = atomic_dec_return(&dst->__refcnt);
> - if (WARN_ONCE(newrefcnt < 0, "dst_release_immediate underflow"))
> - net_warn_ratelimited("%s: dst:%p refcnt:%d\n",
> - __func__, dst, newrefcnt);
> - if (!newrefcnt)
> - dst_destroy(dst);
> - }
> + if (dst && rcuref_put(&dst->__refcnt))
> + dst_destroy(dst);
> }
> EXPORT_SYMBOL(dst_release_immediate);
> 
> 
> but this is my thinking
> 

What do you think that above causes your regression?

Confused...

[To Thorsten: I'm unsure if the reporter do the bisection and suddenly he found
the culprit commit. Should I add it to regzbot? I had dealt with this reporter
before when he reported nginx regression and he didn't respond with bisection
to the point that I had to mark it as inconclusive (see regzbot dashboard).
What advice can you provide to him?]

-- 
An old man doll... just what I always wanted! - Clara

Download attachment "signature.asc" of type "application/pgp-signature" (229 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ