Date:   Tue, 11 Jan 2022 01:29:25 +0000
From:   Dexuan Cui <decui@...rosoft.com>
To:     Stephen Hemminger <sthemmin@...rosoft.com>,
        "'netdev@...r.kernel.org'" <netdev@...r.kernel.org>,
        Steven French <Steven.French@...rosoft.com>,
        "linux-cifs@...r.kernel.org" <linux-cifs@...r.kernel.org>
CC:     Long Li <longli@...rosoft.com>,
        Haiyang Zhang <haiyangz@...rosoft.com>,
        "Michael Kelley (LINUX)" <mikelley@...rosoft.com>,
        "'linux-kernel@...r.kernel.org'" <linux-kernel@...r.kernel.org>
Subject: Find no outgoing routing table entry for CIFS reconnect?

Hi all,
I'm investigating a Linux networking issue: inside a Linux container, the
Linux network stack fails to find an outgoing routing table entry for the
CIFS module's TCP request; however, inside the same container, I can
connect to the same CIFS server with "telnet cifs-server 445"! Since the
kernel CIFS module and the userspace "telnet" program run in the same
container, they should share the same network namespace and hence the
same routing table, so it's unclear why the CIFS-initiated outgoing TCP
connect fails to find a routing table entry. Does anyone happen to know
of such a bug?
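
For comparison, the telnet check is essentially a bare TCP connect(), so a
minimal standalone test like the untested sketch below (the 10.10.166.38:445
address/port are taken from the trace output further down) makes it easy to
capture the exact errno from userspace next to the -ENETUNREACH the CIFS
mount reports:

/* Untested, illustrative sketch: a bare-bones equivalent of
 * "telnet cifs-server 445"; the address/port are from the trace below.
 */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
	struct sockaddr_in sa = { .sin_family = AF_INET, .sin_port = htons(445) };
	int fd;

	inet_pton(AF_INET, "10.10.166.38", &sa.sin_addr);

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0) {
		perror("socket");
		return 1;
	}

	/* A routing failure surfaces here as errno 101 (ENETUNREACH). */
	if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0)
		printf("connect failed: errno=%d (%s)\n", errno, strerror(errno));
	else
		printf("connect succeeded\n");

	close(fd);
	return 0;
}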

I'm unable to reproduce the issue at will, but from time to time a
container suddenly starts to hit it after working fine for several days:
the user complains that a mounted CIFS folder has become inaccessible
due to -ENETUNREACH (-101). Only a reboot works around the issue
temporarily, and it might re-occur later.

The VM kernel is 5.4.0-1064-azure [1], and I don't know whether mainline
has the issue or not. I debugged the issue using ftrace and bpftrace in a
VM/container that was showing it, and the -ENETUNREACH error happens
this way:

tcp_v4_connect
	ip_route_connect
		__ip_route_output_key
			ip_route_output_key_hash
				ip_route_output_key_hash_rcu
					fib_lookup


static inline int fib_lookup(struct net *net, const struct flowi4 *flp,
                             struct fib_result *res, unsigned int flags)
{
        struct fib_table *tb;
        int err = -ENETUNREACH;

        rcu_read_lock();

        tb = fib_get_table(net, RT_TABLE_MAIN);
        if (tb)
                err = fib_table_lookup(tb, flp, res, flags | FIB_LOOKUP_NOREF);

        if (err == -EAGAIN)
                err = -ENETUNREACH;

        rcu_read_unlock();

        return err;
}

The above fib_table_lookup() returned -EAGAIN (-11), which is converted
to -ENETUNREACH.

The code of fib_table_lookup() is complicated [1], and the pre-defined
tracepoint in the function doesn't reveal why the cifsd kernel thread
fails to find an outgoing routing table entry while the telnet program
can:

cifsd-4809 [001] .... 94040.997416: fib_table_lookup: table 254 oif 0 iif 1 proto 6 0.0.0.0/0 -> 10.10.166.38/445 tos 0 scope 0 flags 0 ==> dev - gw 0.0.0.0/:: err -11
telnet-4195 [003] .... 94041.005634: fib_table_lookup: table 254 oif 0 iif 1 proto 6 0.0.0.0/0 -> 10.10.166.38/445 tos 16 scope 0 flags 0 ==> dev eth0 gw 10.133.162.1/:: err 0
telnet-4195 [003] .... 94041.005638: fib_table_lookup: table 254 oif 0 iif 1 proto 6 10.133.162.32/0 -> 10.10.166.38/445 tos 16 scope 0 flags 0 ==> dev eth0 gw 10.133.162.1/:: err 0
telnet-4195 [003] .... 94041.005643: fib_table_lookup: table 254 oif 0 iif 1 proto 6 10.133.162.32/41670 -> 10.10.166.38/445 tos 16 scope 0 flags 0 ==> dev eth0 gw 10.133.162.1/:: err

I was trying to check the input parameters of the related functions using
bpftrace, but unfortunately I lost the repro when the VM was rebooted by
accident.

It would be great to have your insights while I'm waiting for a new repro.
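
When there's a new repro, besides bpftrace, a plain rtnetlink RTM_GETROUTE
query for the same destination (roughly what "ip route get" does) might show
whether a route lookup outside the CIFS path fails the same way. A rough,
untested sketch, with the destination address taken from the trace above:

/* Untested, illustrative sketch: a minimal rtnetlink RTM_GETROUTE query,
 * roughly what "ip route get 10.10.166.38" does.  A failed lookup comes
 * back as an NLMSG_ERROR with a negative errno (-101 == -ENETUNREACH
 * would match what cifsd sees); a successful one comes back as an
 * RTM_NEWROUTE reply.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

int main(void)
{
	struct {
		struct nlmsghdr nh;
		struct rtmsg    rt;
		char            attrs[64];
	} req;
	struct sockaddr_nl sa = { .nl_family = AF_NETLINK };
	struct nlmsghdr *nh;
	struct rtattr *rta;
	struct in_addr dst;
	char buf[4096];
	int fd, len;

	inet_pton(AF_INET, "10.10.166.38", &dst);

	memset(&req, 0, sizeof(req));
	req.nh.nlmsg_len   = NLMSG_LENGTH(sizeof(struct rtmsg));
	req.nh.nlmsg_type  = RTM_GETROUTE;
	req.nh.nlmsg_flags = NLM_F_REQUEST;
	req.rt.rtm_family  = AF_INET;
	req.rt.rtm_dst_len = 32;

	/* Append an RTA_DST attribute carrying the destination address. */
	rta = (struct rtattr *)((char *)&req + NLMSG_ALIGN(req.nh.nlmsg_len));
	rta->rta_type = RTA_DST;
	rta->rta_len  = RTA_LENGTH(sizeof(dst));
	memcpy(RTA_DATA(rta), &dst, sizeof(dst));
	req.nh.nlmsg_len = NLMSG_ALIGN(req.nh.nlmsg_len) + RTA_LENGTH(sizeof(dst));

	fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
	if (fd < 0) {
		perror("socket");
		return 1;
	}
	if (sendto(fd, &req, req.nh.nlmsg_len, 0,
		   (struct sockaddr *)&sa, sizeof(sa)) < 0) {
		perror("sendto");
		return 1;
	}
	len = recv(fd, buf, sizeof(buf), 0);
	if (len < 0) {
		perror("recv");
		return 1;
	}

	nh = (struct nlmsghdr *)buf;
	if (nh->nlmsg_type == NLMSG_ERROR) {
		struct nlmsgerr *e = NLMSG_DATA(nh);
		printf("route lookup failed: %d (%s)\n",
		       e->error, strerror(-e->error));
	} else {
		printf("route lookup succeeded (RTM_NEWROUTE reply)\n");
	}
	close(fd);
	return 0;
}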

Thanks!
-- Dexuan

[1] https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-azure/+git/bionic/tree/net/ipv4/fib_trie.c?h=Ubuntu-azure-5.4-5.4.0-1064.67_18.04.1#n1312
