Message-ID: <941d6192-58dd-802c-0338-192558a3d3ea@huaweicloud.com>
Date: Thu, 3 Aug 2023 18:05:38 +0800
From: Xu Kuohai <xukuohai@...weicloud.com>
To: John Fastabend <john.fastabend@...il.com>,
Xu Kuohai <xukuohai@...weicloud.com>, bpf@...r.kernel.org,
netdev@...r.kernel.org
Cc: Jakub Sitnicki <jakub@...udflare.com>,
"David S . Miller" <davem@...emloft.net>, Eric Dumazet
<edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>, Daniel Borkmann <daniel@...earbox.net>,
Alexei Starovoitov <ast@...nel.org>, Cong Wang <cong.wang@...edance.com>
Subject: Re: [PATCH bpf] bpf, sockmap: Fix NULL deref in sk_psock_backlog
On 8/2/2023 11:04 AM, John Fastabend wrote:
> Xu Kuohai wrote:
>> From: Xu Kuohai <xukuohai@...wei.com>
>>
>> sk_psock_backlog triggers a NULL dereference:
>>
>> BUG: kernel NULL pointer dereference, address: 000000000000000e
>> #PF: supervisor read access in kernel mode
>> #PF: error_code(0x0000) - not-present page
>> PGD 0 P4D 0
>> Oops: 0000 [#1] PREEMPT SMP PTI
>> CPU: 0 PID: 70 Comm: kworker/0:3 Not tainted 6.5.0-rc2-00585-gb11bbbe4c66e #26
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-p4
>> Workqueue: events sk_psock_backlog
>> RIP: 0010:0xffffffffc0205254
>> Code: 00 00 48 89 94 24 a0 00 00 00 41 5f 41 5e 41 5d 41 5c 5d 5b 41 5b 41 5a 41 59 41 50
>> RSP: 0018:ffffc90000acbcb8 EFLAGS: 00010246
>> RAX: ffffffff81c5ee10 RBX: ffff888018260000 RCX: 0000000000000001
>> RDX: 0000000000000003 RSI: ffffc90000acbd58 RDI: 0000000000000000
>> RBP: 0000000000000000 R08: 0000000000000003 R09: 0000000080100005
>> R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000003
>> R13: 0000000000000000 R14: 0000000000000021 R15: 0000000000000003
>> FS: 0000000000000000(0000) GS:ffff88803ea00000(0000) knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 000000000000000e CR3: 000000000b0de002 CR4: 0000000000170ef0
>> Call Trace:
>> <TASK>
>> ? __die+0x24/0x70
>> ? page_fault_oops+0x15d/0x480
>> ? fixup_exception+0x26/0x330
>> ? exc_page_fault+0x72/0x1d0
>> ? asm_exc_page_fault+0x26/0x30
>> ? __pfx_inet_sendmsg+0x10/0x10
>> ? 0xffffffffc0205254
>> ? inet_sendmsg+0x20/0x80
>> ? sock_sendmsg+0x8f/0xa0
>> ? __skb_send_sock+0x315/0x360
>> ? __pfx_sendmsg_unlocked+0x10/0x10
>> ? sk_psock_backlog+0xb4/0x300
>> ? process_one_work+0x292/0x560
>> ? worker_thread+0x53/0x3e0
>> ? __pfx_worker_thread+0x10/0x10
>> ? kthread+0x102/0x130
>> ? __pfx_kthread+0x10/0x10
>> ? ret_from_fork+0x34/0x50
>> ? __pfx_kthread+0x10/0x10
>> ? ret_from_fork_asm+0x1b/0x30
>> </TASK>
>>
>> The bug flow is as follows:
>>
>> thread 1 thread 2
>>
>> sk_psock_backlog sock_close
>> sk_psock_handle_skb __sock_release
>> __skb_send_sock inet_release
>> sendmsg_unlocked tcp_close
>> sock_sendmsg lock_sock
>> __tcp_close
>> release_sock
>> sock->sk = NULL // (1)
>> inet_sendmsg
>> sk = sock->sk // (2)
>> inet_send_prepare
>> inet_sk(sk)->inet_num // (3)
>
> We are doing a lot of hopping through calls here to find something we
> should already know. We know the psock we are sending on has a protocol
> of tcp, udp, ... and could call the send directly instead of walking
> back into the sk_socket and so on. For the tcp case we could simply
> call tcp_sendmsg(sk, msg, size).
>
Sorry, the fix in this patch is not correct:
1. Though it works for tcp, it fails for udp and unix sockets due to the
lack of a sendmsg_locked callback, which only exists for tcp.
2. inet_release sets sock->sk = NULL outside lock_sock, so lock_sock
cannot protect us from reading a NULL sock->sk.
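
For reference, the relevant parts of net/ipv4/af_inet.c look roughly
like this (hand-trimmed excerpt, so exact lines may differ by tree):

int inet_release(struct socket *sock)
{
	struct sock *sk = sock->sk;

	if (sk) {
		...
		sk->sk_prot->close(sk, timeout); /* tcp_close takes and drops lock_sock inside */
		sock->sk = NULL;                 /* (1) cleared with no lock_sock held */
	}
	return 0;
}

int inet_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
{
	struct sock *sk = sock->sk;              /* (2) may observe the NULL above */

	if (unlikely(inet_send_prepare(sk)))     /* (3) reads inet_sk(sk)->inet_num */
		return -EAGAIN;
	...
}

inet_num is skc_num at offset 0xe in struct sock_common, which matches
CR2 000000000000000e in the oops above.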
To fix it correctly, calling the tcp/udp/unix sendmsg directly without
touching sk_socket seems like a good idea; I'll try it. Thanks.
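
To make that concrete, a minimal sketch of the direct-send idea
(untested; the helper name is made up, it only handles linear skb data,
a real version would have to walk frags the way __skb_send_sock does,
and unix sockets would need separate handling since their sendmsg lives
in proto_ops rather than struct proto):

static int sk_psock_sendmsg_direct(struct sock *sk, struct sk_buff *skb,
				   u32 off, u32 len)
{
	struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
	struct kvec iov = {
		.iov_base = skb->data + off,
		.iov_len  = len,
	};

	iov_iter_kvec(&msg.msg_iter, ITER_SOURCE, &iov, 1, len);

	/* Call the protocol sendmsg on the sk the psock already holds,
	 * so sk->sk_socket / sock->sk is never touched at all.
	 */
	return sk->sk_prot->sendmsg(sk, &msg, len);
}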
> I haven't tried it yet, but I wonder if a lot of this logic gets
> easier to reason about if we have per-protocol backlog logic. It's
> just a hunch at this point though.
>
>>
>> sock->sk is set to NULL by thread 2 at time (1), then fetched by
>> thread 1 at time (2), and used by thread 1 to access memory at
>> time (3), resulting in NULL pointer dereference.
>>
>> To fix it, add lock_sock back on the egress path for sk_psock_handle_skb.
>>
>> Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
>> Signed-off-by: Xu Kuohai <xukuohai@...wei.com>
>> ---
>> net/core/skmsg.c | 44 ++++++++++++++++++++++++++++++++++----------
>> 1 file changed, 34 insertions(+), 10 deletions(-)
>>
>> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
>> index 7c2764beeb04..8b758c51aa0d 100644
>> --- a/net/core/skmsg.c
>> +++ b/net/core/skmsg.c
>> @@ -609,15 +609,42 @@ static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb
>> return err;
>> }
>>
>> +static int sk_psock_handle_ingress_skb(struct sk_psock *psock,
>> + struct sk_buff *skb,
>> + u32 off, u32 len)
>> +{
>> + if (sock_flag(psock->sk, SOCK_DEAD))
>> + return -EIO;
>
> We didn't previously have the SOCK_DEAD check on ingress, which was
> fine because we will come along and flush the ingress
> queue when the psock is being torn down. Adding it looks fine
> though, because __tcp_close flushes the sk_receive_queue
> and detaches the user from the socket, so we have no way
> to read the data anyway. This will then abort the backlog,
> which moves the psock destruct op along a bit faster.
>
>> + return sk_psock_skb_ingress(psock, skb, off, len);
>> +}
>> +
>> +static int sk_psock_handle_egress_skb(struct sk_psock *psock,
>> + struct sk_buff *skb,
>> + u32 off, u32 len)
>> +{
>> + int ret;
>> +
>> + lock_sock(psock->sk);
>> +
>> + if (sock_flag(psock->sk, SOCK_DEAD))
>> + ret = -EIO;
>
> OK, the sock_orphan() call from tcp_close's adjudge_to_death block will
> set the SOCK_DEAD flag and ensure we abort the send here. EIO then
> forces the backlog to abort. This looks correct to me.
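
Right. For reference, sock_orphan() in include/net/sock.h is roughly:

static inline void sock_orphan(struct sock *sk)
{
	write_lock_bh(&sk->sk_callback_lock);
	sock_set_flag(sk, SOCK_DEAD);
	sk_set_socket(sk, NULL);
	sk->sk_wq = NULL;
	write_unlock_bh(&sk->sk_callback_lock);
}

Since __tcp_close runs under lock_sock, checking SOCK_DEAD after taking
lock_sock in the egress helper serializes against it.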
>
>> + else if (!sock_writeable(psock->sk))
>> + ret = -EAGAIN;
>> + else
>> + ret = skb_send_sock_locked(psock->sk, skb, off, len);
>> +
>> + release_sock(psock->sk);
>> +
>> + return ret;
>> +}
>> +
>> static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb,
>> u32 off, u32 len, bool ingress)
>> {
>> - if (!ingress) {
>> - if (!sock_writeable(psock->sk))
>> - return -EAGAIN;
>> - return skb_send_sock(psock->sk, skb, off, len);
>> - }
>> - return sk_psock_skb_ingress(psock, skb, off, len);
>> + if (ingress)
>> + return sk_psock_handle_ingress_skb(psock, skb, off, len);
>> + else
>> + return sk_psock_handle_egress_skb(psock, skb, off, len);
>> }
>>
>> static void sk_psock_skb_state(struct sk_psock *psock,
>> @@ -660,10 +687,7 @@ static void sk_psock_backlog(struct work_struct *work)
>> ingress = skb_bpf_ingress(skb);
>> skb_bpf_redirect_clear(skb);
>> do {
>> - ret = -EIO;
>> - if (!sock_flag(psock->sk, SOCK_DEAD))
>> - ret = sk_psock_handle_skb(psock, skb, off,
>> - len, ingress);
>> + ret = sk_psock_handle_skb(psock, skb, off, len, ingress);
>> if (ret <= 0) {
>> if (ret == -EAGAIN) {
>> sk_psock_skb_state(psock, state, len, off);
>
> OK, LGTM, nice catch. I left my commentary above that helped as I
> reviewed it. I guess we need more stress testing along this path; all
> of our testing is on the ingress path at the moment. Do you happen to
> have something coded up that stress tests the redirect send paths?
>
Not yet; this bug was triggered by one of our HTTP pressure tests.
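
Something like the loop below might be a starting point for a dedicated
stress test (a rough, untested sketch; the sockmap/verdict-program
redirect setup that actually routes traffic through sk_psock_backlog is
omitted, so as written this only races send() against the peer close()):

#include <pthread.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static void *sender(void *arg)
{
	int fd = *(int *)arg;
	char buf[256];

	memset(buf, 'x', sizeof(buf));
	/* Keep the egress path busy; an error just means the close()
	 * below won the race, which is the point of the test.
	 */
	while (send(fd, buf, sizeof(buf), MSG_NOSIGNAL) > 0)
		;
	return NULL;
}

int main(void)
{
	for (int i = 0; i < 100000; i++) {
		int sv[2];
		pthread_t t;

		if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
			return 1;
		pthread_create(&t, NULL, sender, &sv[0]);
		usleep(100);	/* let some data queue up */
		close(sv[1]);	/* race the peer close against the send */
		pthread_join(t, NULL);
		close(sv[0]);
	}
	return 0;
}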
> Reviewed-by: John Fastabend <john.fastabend@...il.com>
>