lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <9ac5f4f6-36cc-cc6b-1220-f45db141656c@ssi.bg>
Date:   Mon, 30 Jan 2023 13:01:45 +0200 (EET)
From:   Julian Anastasov <ja@....bg>
To:     Zhang Changzhong <zhangchangzhong@...wei.com>
cc:     Network Development <netdev@...r.kernel.org>,
        open list <linux-kernel@...r.kernel.org>,
        "David S. Miller" <davem@...emloft.net>,
        Hideaki YOSHIFUJI <yoshfuji@...ux-ipv6.org>,
        David Ahern <dsahern@...nel.org>,
        Eric Dumazet <edumazet@...gle.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>,
        "Denis V. Lunev" <den@...nvz.org>,
        Nikolay Aleksandrov <razor@...ckwall.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        YueHaibing <yuehaibing@...wei.com>
Subject: Re: [Question] neighbor entry doesn't switch to the STALE state
 after the reachable timer expires


	Hello,

On Mon, 30 Jan 2023, Zhang Changzhong wrote:

> On 2023/1/30 3:43, Julian Anastasov wrote:
> > 
> >> Does anyone have a good idea to solve this problem? Or are there other scenarios that might cause
> >> such a neighbor entry?
> > 
> > 	Is the neigh entry modified somehow, for example,
> > with 'arp -s' or 'ip neigh change' ? Or is bond0 reconfigured
> > after initial setup? I mean, 4 days ago?>
> 
> So far, we haven't found any user-space program that modifies the neigh
> entry or bond0.

	Ouch, we do not need tools to hit the problem,
thanks to gc_thresh1.

> In fact, the neigh entry has been rarely used since initialization.
> 4 days ago, our machine just needed to download files from 172.16.1.18.
> However, the laddr has changed, and the neigh entry wrongly switched to
> NUD_REACHABLE state, causing the laddr to fail to update.

	Received ARP packets should be able to change
the address. But we do not refresh the entry because
its timer is scheduled days ahead.

> > 	Looking at __neigh_update, there are few cases that
> > can assign NUD_STALE without touching neigh->confirmed:
> > lladdr = neigh->ha should be called, NEIGH_UPDATE_F_ADMIN
> > should be provided. Later, as you explain, it can wrongly
> > switch to NUD_REACHABLE state for long time.
> > 
> > 	May be there should be some measures to keep
> > neigh->confirmed valid during admin modifications.
> > 
> 
> This problem can also occur if the neigh entry stays in NUD_STALE state
> for more than 99 days, even if it is not modified by the administrator.

	I see.

> > 	What is the kernel version?
> > 
> 
> We encountered this problem in 4.4 LTS, and the mainline doesn't seem
> to fix it yet.

	Yep, kernel version is irrelevant.

	Here is a change that you can comment/test but
I'm not sure how many days (100?) are needed :) Not tested.

: From: Julian Anastasov <ja@....bg>
Subject: [PATCH] neigh: make sure used and confirmed times are valid

Entries can linger without timer for days, thanks to
the gc_thresh1 limit. Later, on traffic, NUD_STALE entries
can switch to NUD_DELAY and start the timer which can see
confirmed time in the future causing switch to NUD_REACHABLE.
But then timer is started again based on the wrong
confirmed time, so days in the future. This is more
visible on 32-bit platforms.

Problem and the wrong state change reported by Zhang Changzhong:

172.16.1.18 dev bond0 lladdr 0a:0e:0f:01:12:01 ref 1 used 350521/15994171/350520 probes 4 REACHABLE

350520 seconds have elapsed since this entry was last updated, but it is
still in the REACHABLE state (base_reachable_time_ms is 30000),
preventing lladdr from being updated through probe.

Fix it by ensuring timer is started with valid used/confirmed
times. There are also places that need used/updated times to be
validated.

Reported-by: Zhang Changzhong <zhangchangzhong@...wei.com>
Signed-off-by: Julian Anastasov <ja@....bg>
---
 net/core/neighbour.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index a77a85e357e0..f063e8b8fb7d 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -269,7 +269,7 @@ static int neigh_forced_gc(struct neigh_table *tbl)
 			    (n->nud_state == NUD_NOARP) ||
 			    (tbl->is_multicast &&
 			     tbl->is_multicast(n->primary_key)) ||
-			    time_after(tref, n->updated))
+			    !time_in_range(n->updated, tref, jiffies))
 				remove = true;
 			write_unlock(&n->lock);
 
@@ -289,7 +289,13 @@ static int neigh_forced_gc(struct neigh_table *tbl)
 
 static void neigh_add_timer(struct neighbour *n, unsigned long when)
 {
+	unsigned long mint = jiffies - MAX_JIFFY_OFFSET + 86400 * HZ;
+
 	neigh_hold(n);
+	if (!time_in_range(n->confirmed, mint, jiffies))
+		n->confirmed = mint;
+	if (time_before(n->used, n->confirmed))
+		n->used = n->confirmed;
 	if (unlikely(mod_timer(&n->timer, when))) {
 		printk("NEIGH: BUG, double timer add, state is %x\n",
 		       n->nud_state);
@@ -982,12 +988,14 @@ static void neigh_periodic_work(struct work_struct *work)
 				goto next_elt;
 			}
 
-			if (time_before(n->used, n->confirmed))
+			if (time_before(n->used, n->confirmed) &&
+			    time_is_before_eq_jiffies(n->confirmed))
 				n->used = n->confirmed;
 
 			if (refcount_read(&n->refcnt) == 1 &&
 			    (state == NUD_FAILED ||
-			     time_after(jiffies, n->used + NEIGH_VAR(n->parms, GC_STALETIME)))) {
+			     !time_in_range_open(jiffies, n->used,
+						 n->used + NEIGH_VAR(n->parms, GC_STALETIME)))) {
 				*np = n->next;
 				neigh_mark_dead(n);
 				write_unlock(&n->lock);
-- 
2.39.1


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ