Message-ID: <ea1ec50f-06ca-7771-f9f8-e45c13179733@gmail.com>
Date: Tue, 14 Mar 2023 17:02:40 +0900
From: Toshiaki Makita <toshiaki.makita1@...il.com>
To: Shawn Bohrer <sbohrer@...udflare.com>, lorenzo@...nel.org
Cc: Toke Høiland-Jørgensen <toke@...nel.org>,
makita.toshiaki@....ntt.co.jp, netdev@...r.kernel.org,
bpf@...r.kernel.org, kernel-team@...udflare.com
Subject: Re: KASAN veth use after free in XDP_REDIRECT
On 2023/03/09 7:33, Shawn Bohrer wrote:
> On Wed, Jan 25, 2023 at 11:07:32AM +0900, Toshiaki Makita wrote:
>> On 2023/01/25 10:54, Toke Høiland-Jørgensen wrote:
>>> Shawn Bohrer <sbohrer@...udflare.com> writes:
>>>
>>>> Hello,
>>>>
>>>> We've seen the following KASAN report on our systems when using
>>>> AF_XDP on a veth.
>>>>
>>>> KASAN report:
>>>>
>>>> BUG: KASAN: use-after-free in __xsk_rcv+0x18d/0x2c0
>>>> Read of size 78 at addr ffff888976250154 by task napi/iconduit-g/148640
>>>>
>>>> CPU: 5 PID: 148640 Comm: napi/iconduit-g Kdump: loaded Tainted: G O 6.1.4-cloudflare-kasan-2023.1.2 #1
>>>> Hardware name: Quanta Computer Inc. QuantaPlex T41S-2U/S2S-MB, BIOS S2S_3B10.03 06/21/2018
>>>> Call Trace:
>>>> <TASK>
>>>> dump_stack_lvl+0x34/0x48
>>>> print_report+0x170/0x473
>>>> ? __xsk_rcv+0x18d/0x2c0
>>>> kasan_report+0xad/0x130
>>>> ? __xsk_rcv+0x18d/0x2c0
>>>> kasan_check_range+0x149/0x1a0
>>>> memcpy+0x20/0x60
>>>> __xsk_rcv+0x18d/0x2c0
>>>> __xsk_map_redirect+0x1f3/0x490
>>>> ? veth_xdp_rcv_skb+0x89c/0x1ba0 [veth]
>>>> xdp_do_redirect+0x5ca/0xd60
>>>> veth_xdp_rcv_skb+0x935/0x1ba0 [veth]
>>>> ? __netif_receive_skb_list_core+0x671/0x920
>>>> ? veth_xdp+0x670/0x670 [veth]
>>>> veth_xdp_rcv+0x304/0xa20 [veth]
>>>> ? do_xdp_generic+0x150/0x150
>>>> ? veth_xdp_rcv_one+0xde0/0xde0 [veth]
>>>> ? _raw_spin_lock_bh+0xe0/0xe0
>>>> ? newidle_balance+0x887/0xe30
>>>> ? __perf_event_task_sched_in+0xdb/0x800
>>>> veth_poll+0x139/0x571 [veth]
>>>> ? veth_xdp_rcv+0xa20/0xa20 [veth]
>>>> ? _raw_spin_unlock+0x39/0x70
>>>> ? finish_task_switch.isra.0+0x17e/0x7d0
>>>> ? __switch_to+0x5cf/0x1070
>>>> ? __schedule+0x95b/0x2640
>>>> ? io_schedule_timeout+0x160/0x160
>>>> __napi_poll+0xa1/0x440
>>>> napi_threaded_poll+0x3d1/0x460
>>>> ? __napi_poll+0x440/0x440
>>>> ? __kthread_parkme+0xc6/0x1f0
>>>> ? __napi_poll+0x440/0x440
>>>> kthread+0x2a2/0x340
>>>> ? kthread_complete_and_exit+0x20/0x20
>>>> ret_from_fork+0x22/0x30
>>>> </TASK>
>>>>
>>>> Freed by task 148640:
>>>> kasan_save_stack+0x23/0x50
>>>> kasan_set_track+0x21/0x30
>>>> kasan_save_free_info+0x2a/0x40
>>>> ____kasan_slab_free+0x169/0x1d0
>>>> slab_free_freelist_hook+0xd2/0x190
>>>> __kmem_cache_free+0x1a1/0x2f0
>>>> skb_release_data+0x449/0x600
>>>> consume_skb+0x9f/0x1c0
>>>> veth_xdp_rcv_skb+0x89c/0x1ba0 [veth]
>>>> veth_xdp_rcv+0x304/0xa20 [veth]
>>>> veth_poll+0x139/0x571 [veth]
>>>> __napi_poll+0xa1/0x440
>>>> napi_threaded_poll+0x3d1/0x460
>>>> kthread+0x2a2/0x340
>>>> ret_from_fork+0x22/0x30
>>>>
>>>>
>>>> The buggy address belongs to the object at ffff888976250000
>>>> which belongs to the cache kmalloc-2k of size 2048
>>>> The buggy address is located 340 bytes inside of
>>>> 2048-byte region [ffff888976250000, ffff888976250800)
>>>>
>>>> The buggy address belongs to the physical page:
>>>> page:00000000ae18262a refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x976250
>>>> head:00000000ae18262a order:3 compound_mapcount:0 compound_pincount:0
>>>> flags: 0x2ffff800010200(slab|head|node=0|zone=2|lastcpupid=0x1ffff)
>>>> raw: 002ffff800010200 0000000000000000 dead000000000122 ffff88810004cf00
>>>> raw: 0000000000000000 0000000080080008 00000002ffffffff 0000000000000000
>>>> page dumped because: kasan: bad access detected
>>>>
>>>> Memory state around the buggy address:
>>>> ffff888976250000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>> ffff888976250080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>> ffff888976250100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>> ^
>>>> ffff888976250180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>> ffff888976250200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>
>>>>
>>>> If I understand the code correctly, it looks like an xdp_buff is
>>>> constructed pointing to the memory backed by an skb, but consume_skb()
>>>> is called while the xdp_buff is still in use.
>>>>
>>>> ```
>>>> 	case XDP_REDIRECT:
>>>> 		veth_xdp_get(&xdp);
>>>> 		consume_skb(skb);
>>>> 		xdp.rxq->mem = rq->xdp_mem;
>>>> 		if (xdp_do_redirect(rq->dev, &xdp, xdp_prog)) {
>>>> 			stats->rx_drops++;
>>>> 			goto err_xdp;
>>>> 		}
>>>> 		stats->xdp_redirect++;
>>>> 		rcu_read_unlock();
>>>> 		goto xdp_xmit;
>>>> ```
>>>>
>>>> It is worth noting that I think XDP_TX has the exact same problem.
>>>>
>>>> Again, assuming I understand the problem, one naive solution might be
>>>> to move the consume_skb() call after xdp_do_redirect(); see the sketch
>>>> after the next paragraph. I think this might work for
>>>> BPF_MAP_TYPE_XSKMAP, BPF_MAP_TYPE_DEVMAP, and BPF_MAP_TYPE_DEVMAP_HASH,
>>>> since those all seem to copy the xdp_buff to new memory. The copy
>>>> happens for XSKMAP in __xsk_rcv() and for the DEVMAP cases in
>>>> dev_map_enqueue_clone().
>>>>
>>>> However, it appears that for BPF_MAP_TYPE_CPUMAP the memory can live
>>>> much longer, possibly even past xdp_do_flush(). If that's right, I'm
>>>> not really sure where it would be safe to call consume_skb().
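>>>>
>>>> To illustrate, the naive reordering would look roughly like the
>>>> following. This is an untested sketch, not a patch; note that the
>>>> failure path would now also have to free the skb itself, which I do
>>>> here with kfree_skb():
>>>>
>>>> ```
>>>> 	case XDP_REDIRECT:
>>>> 		veth_xdp_get(&xdp);
>>>> 		xdp.rxq->mem = rq->xdp_mem;
>>>> 		if (xdp_do_redirect(rq->dev, &xdp, xdp_prog)) {
>>>> 			/* redirect failed; the skb is still ours to drop */
>>>> 			kfree_skb(skb);
>>>> 			stats->rx_drops++;
>>>> 			goto err_xdp;
>>>> 		}
>>>> 		/* free the skb only after the redirect target has had a
>>>> 		 * chance to copy the data out of it
>>>> 		 */
>>>> 		consume_skb(skb);
>>>> 		stats->xdp_redirect++;
>>>> 		rcu_read_unlock();
>>>> 		goto xdp_xmit;
>>>> ```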
>>>
>>> So the idea is that veth_xdp_get() does a
>>> get_page(virt_to_page(xdp->data)), where xdp->data in this case points
>>> to skb->head. This should keep the data page alive even if the skb
>>> surrounding it is freed by the call to consume_skb().
>>>
>>> However, because the skb->head in this case was allocated from a slab
>>> allocator, taking a page refcount is not enough to prevent it from being
>>> freed.
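>>>
>>> For reference, veth_xdp_get() looks roughly like this (paraphrased from
>>> drivers/net/veth.c in 6.1):
>>>
>>> ```
>>> static void veth_xdp_get(struct xdp_buff *xdp)
>>> {
>>> 	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
>>> 	int i;
>>>
>>> 	/* pin the page backing the linear data area */
>>> 	get_page(virt_to_page(xdp->data));
>>> 	if (likely(!xdp_buff_has_frags(xdp)))
>>> 		return;
>>>
>>> 	/* pin each fragment page as well */
>>> 	for (i = 0; i < sinfo->nr_frags; i++)
>>> 		__skb_frag_ref(&sinfo->frags[i]);
>>> }
>>> ```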
>>
>> Not sure why skb->head is kmalloced here.
>> The skb_head_is_locked() check in veth_convert_skb_to_xdp_buff() should
>> ensure that the skb head is a page fragment.
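>>
>> For context, skb_head_is_locked() is defined in include/linux/skbuff.h
>> as follows:
>>
>> ```
>> static inline bool skb_head_is_locked(const struct sk_buff *skb)
>> {
>> 	/* true when the head is not a page fragment (e.g. kmalloced)
>> 	 * or when the skb is cloned
>> 	 */
>> 	return !skb->head_frag || skb_cloned(skb);
>> }
>> ```
>>
>> A kmalloced head has head_frag unset, so such an skb should have been
>> copied into page fragments rather than converted in place.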
>
> I have a few more details here. We have some machines running 5.15
> kernels and some running 6.1 kernels. So far it appears this only
> happens on 6.1. We also have a couple of different network cards, but
> it appears that only the machines with Solarflare cards using the sfc
> driver hit the KASAN BUG.
>
> Commit 718a18a0c8a67f97781e40bdef7cdd055c430996 ("veth: Rework
> veth_xdp_rcv_skb in order to accept non-linear skb") reworked this path
> and added the veth_convert_skb_to_xdp_buff() call you mentioned.
> Importantly, it also added a call to pskb_expand_head(), which will
> kmalloc() the skb->head. This looks like a path that could be causing
> the KASAN BUG, though I have not yet confirmed it is the path we are
> hitting. The change first landed in 5.18, which might explain why we
> don't see it on 5.15.
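>
> If I'm reading veth_convert_skb_to_xdp_buff() right, the suspect path is
> roughly this (abbreviated from 6.1):
>
> ```
> if (skb_shared(skb) || skb_head_is_locked(skb) ||
>     skb_shinfo(skb)->nr_frags) {
> 	/* ... copy the skb into freshly allocated order-0 pages ... */
> } else if (skb_headroom(skb) < XDP_PACKET_HEADROOM &&
> 	   pskb_expand_head(skb, VETH_XDP_HEADROOM, 0, GFP_ATOMIC)) {
> 	goto drop;
> }
> ```
>
> The head_frag check runs before the expansion, but pskb_expand_head()
> replaces skb->head with a kmalloced buffer, so the page reference later
> taken by veth_xdp_get() no longer pins the data.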
I think you made a valid point. It looks like the pskb_expand_head() call is
incorrect here. At least I did not expect a kmalloced skb->data at this point
when I introduced veth XDP.
I also looked into the discussion on the suspected patch:
https://patchwork.kernel.org/project/netdevbpf/list/?series=&submitter=&state=*&q=rework+veth_xdp_rcv_skb&archive=both&delegate=
but there was no discussion of pskb_expand_head() or kmalloced data.
Lorenzo, what do you think?
Toshiaki Makita