Message-ID: <99532c7f-161e-6d39-7680-ccc1f20349@ssi.bg>
Date: Sun, 29 Jan 2023 21:43:55 +0200 (EET)
From: Julian Anastasov <ja@....bg>
To: Zhang Changzhong <zhangchangzhong@...wei.com>
cc: Network Development <netdev@...r.kernel.org>,
open list <linux-kernel@...r.kernel.org>,
"David S. Miller" <davem@...emloft.net>,
Hideaki YOSHIFUJI <yoshfuji@...ux-ipv6.org>,
David Ahern <dsahern@...nel.org>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
"Denis V. Lunev" <den@...nvz.org>,
Nikolay Aleksandrov <razor@...ckwall.org>,
Daniel Borkmann <daniel@...earbox.net>,
YueHaibing <yuehaibing@...wei.com>
Subject: Re: [Question] neighbor entry doesn't switch to the STALE state
after the reachable timer expires
Hello,
On Sun, 29 Jan 2023, Zhang Changzhong wrote:
> Hi,
>
> We got the following weird neighbor cache entry on a machine that's been running for over a year:
> 172.16.1.18 dev bond0 lladdr 0a:0e:0f:01:12:01 ref 1 used 350521/15994171/350520 probes 4 REACHABLE
The confirmed time (15994171 seconds) is either 13 days in the future
or, more likely, 185 days in the past (very outdated). Anything above
99 days is invalid.
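To make the arithmetic explicit, here is a minimal user-space sketch. It assumes the 32-bit jiffies counter and CONFIG_HZ=250 that you mention further down; the helper names are made up for illustration and are not kernel functions.

```c
#include <assert.h>

#define SECONDS_PER_DAY 86400.0

/* Seconds until a 32-bit tick counter running at hz ticks/s wraps around. */
static double wrap_period_seconds(unsigned int hz)
{
    assert(hz > 0);
    return 4294967296.0 / hz;   /* 2^32 ticks */
}

/* Age in days if the reported value really lies in the past. */
static double age_days_behind(double age_seconds)
{
    return age_seconds / SECONDS_PER_DAY;
}

/* Days ahead of 'now' if the same value is read as a wrapped future time. */
static double age_days_ahead(double age_seconds, unsigned int hz)
{
    return (wrap_period_seconds(hz) - age_seconds) / SECONDS_PER_DAY;
}
```

With age_seconds = 15994171 and hz = 250 this gives about 185.1 days behind or about 13.7 days ahead, and half the wrap period is about 99.4 days, which is the limit for a valid comparison.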
> 350520 seconds have elapsed since this entry was last updated, but it is still in the REACHABLE
> state (base_reachable_time_ms is 30000), preventing lladdr from being updated through probe.
>
> After some analysis, we found a scenario that may cause such a neighbor entry:
>
>            Entry used             DELAY_PROBE_TIME expired
> NUD_STALE ------------> NUD_DELAY ------------------------> NUD_PROBE
>                             |
>                             | DELAY_PROBE_TIME not expired
>                             v
>                       NUD_REACHABLE
>
> The neigh_timer_handler() uses time_before_eq() to compare 'now' with 'neigh->confirmed +
> NEIGH_VAR(neigh->parms, DELAY_PROBE_TIME)', but time_before_eq() only works if delta < ULONG_MAX/2.
>
> This means that if an entry stays in the NUD_STALE state for more than ULONG_MAX/2 ticks, it enters
> the NUD_REACHABLE state directly when it is used again and cannot be switched to the NUD_STALE state
> (the timer is set too long).
>
> On 64-bit machines, ULONG_MAX/2 ticks are an extremely long time, but in my case (32-bit machine and
> kernel compiled with CONFIG_HZ=250), ULONG_MAX/2 ticks are about 99.42 days, which is possible in
> reality.
>
> Does anyone have a good idea to solve this problem? Or are there other scenarios that might cause
> such a neighbor entry?
Was the neigh entry modified somehow, for example,
with 'arp -s' or 'ip neigh change'? Or was bond0 reconfigured
after the initial setup? I mean, about 4 days ago?
Looking at __neigh_update, there are a few cases that
can assign NUD_STALE without touching neigh->confirmed:
the 'lladdr = neigh->ha' assignment has to be reached and
NEIGH_UPDATE_F_ADMIN has to be provided. Later, as you explain,
the entry can wrongly switch to the NUD_REACHABLE state for a long time.
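The wrong switch can be reproduced in user space with a small sketch that emulates time_before_eq() on 32-bit jiffies. Assumptions: HZ=250 as in your report, a DELAY_PROBE_TIME of 5 seconds (I believe this is the default, but check your sysctl), and helper names that are made up here. The wrapped difference is tested against INT32_MAX instead of being cast to a signed type, which gives the same result as the kernel macro without relying on implementation-defined conversion.

```c
#include <stdbool.h>
#include <stdint.h>

#define HZ                250u        /* CONFIG_HZ from the report */
#define DELAY_PROBE_TICKS (5u * HZ)   /* assumed 5 s DELAY_PROBE_TIME */

/* 32-bit emulation of time_after_eq(a, b): (a - b) read as signed is >= 0. */
static bool time_after_eq32(uint32_t a, uint32_t b)
{
    return (uint32_t)(a - b) <= (uint32_t)INT32_MAX;
}

/* time_before_eq(a, b) is time_after_eq(b, a). */
static bool time_before_eq32(uint32_t a, uint32_t b)
{
    return time_after_eq32(b, a);
}

/*
 * True when the NUD_DELAY check described above would pick NUD_REACHABLE,
 * i.e. when 'now' appears to be no later than confirmed + DELAY_PROBE_TIME.
 */
static bool delay_goes_reachable(uint32_t confirmed, uint32_t now)
{
    return time_before_eq32(now, (uint32_t)(confirmed + DELAY_PROBE_TICKS));
}
```

For a confirmation 2 seconds old the check is true, as intended. For one that is 60 seconds old it is false. For one that is 15994171 seconds old (3998542750 ticks, more than ULONG_MAX/2) it becomes true again, which is the bogus NUD_REACHABLE transition.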
Maybe there should be some measures to keep
neigh->confirmed valid during admin modifications.
What is the kernel version?
Regards
--
Julian Anastasov <ja@....bg>