Message-ID: <9ebd0210-a4bb-afda-8a4d-5041b8395d78@huawei.com>
Date:   Mon, 30 Jan 2023 11:19:50 +0800
From:   Zhang Changzhong <zhangchangzhong@...wei.com>
To:     Julian Anastasov <ja@....bg>
CC:     Network Development <netdev@...r.kernel.org>,
        open list <linux-kernel@...r.kernel.org>,
        "David S. Miller" <davem@...emloft.net>,
        Hideaki YOSHIFUJI <yoshfuji@...ux-ipv6.org>,
        David Ahern <dsahern@...nel.org>,
        Eric Dumazet <edumazet@...gle.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>,
        "Denis V. Lunev" <den@...nvz.org>,
        Nikolay Aleksandrov <razor@...ckwall.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        YueHaibing <yuehaibing@...wei.com>,
        Zhang Changzhong <zhangchangzhong@...wei.com>
Subject: Re: [Question] neighbor entry doesn't switch to the STALE state after
 the reachable timer expires

On 2023/1/30 3:43, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Sun, 29 Jan 2023, Zhang Changzhong wrote:
> 
>> Hi,
>>
>> We got the following weird neighbor cache entry on a machine that's been running for over a year:
>> 172.16.1.18 dev bond0 lladdr 0a:0e:0f:01:12:01 ref 1 used 350521/15994171/350520 probes 4 REACHABLE
> 
> 	confirmed time (15994171) is 13 days in the future, more likely
> 185 days behind (very outdated), anything above 99 days is invalid
> 
>> 350520 seconds have elapsed since this entry was last updated, but it is still in the REACHABLE
>> state (base_reachable_time_ms is 30000), preventing lladdr from being updated through probe.
>>
>> After some analysis, we found a scenario that may cause such a neighbor entry:
>>
>>           Entry used          	  DELAY_PROBE_TIME expired
>> NUD_STALE ------------> NUD_DELAY ------------------------> NUD_PROBE
>>                             |
>>                             | DELAY_PROBE_TIME not expired
>>                             v
>>                       NUD_REACHABLE
>>
>> neigh_timer_handler() uses time_before_eq() to compare 'now' with 'neigh->confirmed +
>> NEIGH_VAR(neigh->parms, DELAY_PROBE_TIME)', but time_before_eq() only works if delta < ULONG_MAX/2.
>>
>> This means that if an entry stays in the NUD_STALE state for more than ULONG_MAX/2 ticks, it enters
>> the NUD_REACHABLE state directly when it is used again and cannot be switched to the NUD_STALE state
>> (the timer is set too long).
>>
>> On 64-bit machines, ULONG_MAX/2 ticks are an extremely long time, but in my case (32-bit machine and
>> kernel compiled with CONFIG_HZ=250), ULONG_MAX/2 ticks are about 99.42 days, which is possible in
>> reality.
>>
>> Does anyone have a good idea to solve this problem? Or are there other scenarios that might cause
>> such a neighbor entry?
> 
> 	Is the neigh entry modified somehow, for example,
> with 'arp -s' or 'ip neigh change' ? Or is bond0 reconfigured
> after initial setup? I mean, 4 days ago?
>

So far, we haven't found any user-space program that modifies the neigh
entry or bond0.

In fact, the neigh entry has been rarely used since initialization.
4 days ago, our machine just needed to download files from 172.16.1.18.
However, the lladdr had changed, and the neigh entry wrongly switched to
the NUD_REACHABLE state, so the lladdr could not be updated.
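
For reference, the ages in the "used 350521/15994171/350520" field quoted
above can be cross-checked with a small user-space sketch. It assumes the
three values are the used/confirmed/updated ages in seconds and that the
kernel derives them from 32-bit jiffies at HZ=250; the helper names are
made up for illustration and are not kernel functions.

```c
#include <stdint.h>

#define HZ 250               /* CONFIG_HZ of the affected kernel */
#define SECS_PER_DAY 86400.0

/* Convert an age in seconds to ticks, truncated to 32 bits the way an
 * unsigned long would be on a 32-bit kernel. */
static uint32_t age_to_ticks(uint64_t secs)
{
	return (uint32_t)(secs * HZ);
}

/* Age in days when the tick delta is read as unsigned ("behind now"). */
static double unsigned_age_days(uint64_t secs)
{
	return (double)age_to_ticks(secs) / HZ / SECS_PER_DAY;
}

/* Age in days when the same delta is read as a signed 32-bit value.
 * A negative result means the timestamp appears to lie ahead of now. */
static double signed_age_days(uint64_t secs)
{
	double ticks = (double)age_to_ticks(secs);

	if (ticks >= 2147483648.0)	/* 2^31: sign bit set */
		ticks -= 4294967296.0;	/* 2^32 */
	return ticks / HZ / SECS_PER_DAY;
}
```

With these helpers, 15994171 comes out at roughly 185 days behind when
read as unsigned and roughly 13.7 days ahead when read as signed, while
350520 is about 4 days either way, which matches the download 4 days ago.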

> 	Looking at __neigh_update, there are few cases that
> can assign NUD_STALE without touching neigh->confirmed:
> lladdr = neigh->ha should be called, NEIGH_UPDATE_F_ADMIN
> should be provided. Later, as you explain, it can wrongly
> switch to NUD_REACHABLE state for long time.
> 
> 	Maybe there should be some measures to keep
> neigh->confirmed valid during admin modifications.
> 

This problem can also occur if the neigh entry stays in NUD_STALE state
for more than 99 days, even if it is not modified by the administrator.
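
To illustrate, here is a minimal user-space sketch of the NUD_DELAY check
with jiffies modelled as a 32-bit counter. It is not the kernel code:
time_before_eq32() and delay_goes_reachable() are hypothetical names, and
DELAY_PROBE_TIME is assumed to be the default of 5 seconds.

```c
#include <stdbool.h>
#include <stdint.h>

#define HZ 250                       /* CONFIG_HZ of the affected kernel */
#define DELAY_PROBE_TICKS (5 * HZ)   /* assumed default DELAY_PROBE_TIME */

/* 32-bit model of time_before_eq(a, b): true when a is at or before b.
 * Equivalent to (int32_t)(b - a) >= 0, so it is only meaningful while
 * the real distance between a and b is below 2^31 ticks. */
static bool time_before_eq32(uint32_t a, uint32_t b)
{
	uint32_t diff = b - a;		/* wraps modulo 2^32 */

	return diff < 0x80000000u;
}

/* Model of the NUD_DELAY branch: returns true when the entry would be
 * moved to NUD_REACHABLE, false when it would go on to NUD_PROBE. */
static bool delay_goes_reachable(uint32_t now, uint32_t confirmed)
{
	return time_before_eq32(now, confirmed + DELAY_PROBE_TICKS);
}

/* Length of the valid comparison window (2^31 ticks) in days. */
static double half_range_days(void)
{
	return 2147483648.0 / HZ / 86400.0;
}
```

In this model an entry confirmed 3 seconds ago is moved to NUD_REACHABLE
(as intended), one confirmed 10 seconds or 99 days ago goes on to
NUD_PROBE, and one confirmed 100 days ago is treated as recently
confirmed again and is moved to NUD_REACHABLE. half_range_days() gives
the roughly 99.42 days mentioned earlier.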

> 	What is the kernel version?
> 

We encountered this problem in 4.4 LTS, and mainline doesn't seem to
have fixed it yet.

Regards,
Changzhong Zhang
