Date:   Mon, 16 Aug 2021 12:12:01 +0200
From:   Marco Elver <elver@...gle.com>
To:     "Paul E. McKenney" <paulmck@...nel.org>,
        Boqun Feng <boqun.feng@...il.com>,
        Alan Stern <stern@...land.harvard.edu>,
        Andrea Parri <parri.andrea@...il.com>,
        Will Deacon <will@...nel.org>,
        Mark Rutland <mark.rutland@....com>
Cc:     Dmitry Vyukov <dvyukov@...gle.com>, kasan-dev@...glegroups.com,
        linux-kernel@...r.kernel.org
Subject: LKMM: Read dependencies of writes ordered by dma_wmb()?

Hello,

Commit c58a801701693 added a paragraph to the LKMM:

	+Although we said that plain accesses are not linked by the ppo
	+relation, they do contribute to it indirectly.  Namely, when there is
	+an address dependency from a marked load R to a plain store W,
	+followed by smp_wmb() and then a marked store W', the LKMM creates a
	+ppo link from R to W'.

This defines that certain _marked reads_ will also be ordered by
smp_wmb(), but other reads (especially plain reads!) will _never_ be
ordered by it. Is my understanding correct?
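
To make sure we're talking about the same pattern, here is a minimal
sketch of the rule as I read it (ptr and flag are just placeholder
names, not taken from any real code):

	int *p = READ_ONCE(ptr);	/* R: marked load */
	*p = 1;				/* W: plain store, address-dependent on R */
	smp_wmb();
	WRITE_ONCE(flag, 1);		/* W': marked store; ppo links R to W' */

whereas any other read, in particular a plain read, gets no ordering
from the smp_wmb() at all.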

I am asking because KCSAN is growing limited support for weak memory
modeling and memory barriers, and I'm trying to figure out if I'm seeing
a false positive or genuinely allowed race.

One caveat is that the case I'm trying to understand doesn't involve
just 2 CPUs, but also a device. For now, I'm assuming that dma_wmb() is
at least as strong as smp_wmb() with respect to other CPUs as well (but
my guess is that this assumption is already too strong).

The whole area of the memory model that covers talking to devices, and
devices talking back to CPUs, seems quite murky, and I need to confirm
whether I got it right or wrong. :-)

The report (explained below):

| assert no accesses to 0xffff8880077b5500 of 232 bytes by interrupt on cpu 1:
|  __cache_free mm/slab.c:3450 [inline]
|  kmem_cache_free+0x4b/0xe0 mm/slab.c:3740
|  kfree_skbmem net/core/skbuff.c:709 [inline]
|  __kfree_skb+0x145/0x190 net/core/skbuff.c:745
|  consume_skb+0x6d/0x190 net/core/skbuff.c:900
|  __dev_kfree_skb_any+0xb8/0xc0 net/core/dev.c:3195
|  dev_kfree_skb_any include/linux/netdevice.h:3979 [inline]
|  e1000_unmap_and_free_tx_resource drivers/net/ethernet/intel/e1000/e1000_main.c:1969 [inline]
|  e1000_clean_tx_irq drivers/net/ethernet/intel/e1000/e1000_main.c:3859 [inline]
|  e1000_clean+0x302/0x2080 drivers/net/ethernet/intel/e1000/e1000_main.c:3800
|  __napi_poll+0x81/0x430 net/core/dev.c:7019
|  napi_poll net/core/dev.c:7086 [inline]
|  net_rx_action+0x2cf/0x6b0 net/core/dev.c:7173
|  __do_softirq+0x12c/0x275 kernel/softirq.c:558
| [...]
| 
| read (reordered) to 0xffff8880077b5570 of 4 bytes by task 1985 on cpu 0:
|  skb_headlen include/linux/skbuff.h:2139 [inline]
|  e1000_tx_map drivers/net/ethernet/intel/e1000/e1000_main.c:2829 [inline]
|  e1000_xmit_frame+0x12fd/0x2720 drivers/net/ethernet/intel/e1000/e1000_main.c:3243
|  __netdev_start_xmit include/linux/netdevice.h:4944 [inline]
|  netdev_start_xmit include/linux/netdevice.h:4958 [inline]
|  xmit_one+0x103/0x2c0 net/core/dev.c:3658
|  dev_hard_start_xmit+0x70/0x130 net/core/dev.c:3674
|  sch_direct_xmit+0x1e5/0x600 net/sched/sch_generic.c:342
|  __dev_xmit_skb net/core/dev.c:3874 [inline]
|  __dev_queue_xmit+0xd26/0x1990 net/core/dev.c:4241
|  dev_queue_xmit+0x1d/0x30 net/core/dev.c:4306
| [...]
|   |
|   +-> reordered to: e1000_xmit_frame+0x2294/0x2720 drivers/net/ethernet/intel/e1000/e1000_main.c:3282

KCSAN is saying there is a potential use-after-free read of an skb,
because the read of 0xffff8880077b5570 may be delayed/reordered to run
later. If the memory were reallocated and reused concurrently, such a
read could return garbage data:

1.	The e1000 driver is being instructed to transmit in
	e1000_xmit_frame(). Here it uses the data in the skb in various
	places (e.g. in skb_headlen() above) to set up a new element in
	the ring buffer to be consumed by the device via DMA.

2.	Eventually it calls e1000_tx_queue(), which seems to publish the
	next entry into the ring buffer and finally calls dma_wmb().
	Until this point I see no other barriers (there is a writel(),
	but it doesn't always seem to be called).

3.	e1000_clean_tx_irq() is called on another CPU after transmit
	completes, and we know the device has consumed that entry from
	the ring buffer. At this point the driver then says that the
	associated skb can be kfree()'d.

4.	If I interpreted dma_wmb() (and smp_wmb()) right, plain reads
	may be reordered to after it, irrespective of whether a write
	that depended on such reads was ordered by the wmb(). This
	means the plain reads accessing the skb before the barrier may
	in fact happen concurrently with the kfree() of the skb if they
	are reordered past it, for example to the very end of
	e1000_xmit_frame() (line 3282), as KCSAN simulated in this case
	(see the condensed sketch after this list).
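
Condensed, the pattern I am worried about looks roughly like this (a
simplified sketch, not the actual driver code; names and details are
approximations):

	/* CPU 0: transmit path, e1000_xmit_frame() and callees */
	len = skb_headlen(skb);		/* plain read of skb data */
	desc->length = len;		/* plain store setting up the entry */
	dma_wmb();
	/* ... entry made visible to the device (writel() etc.) ... */

	/* CPU 1: e1000_clean_tx_irq(), after the device consumed the entry */
	dev_kfree_skb_any(skb);		/* frees the skb; races with CPU 0's
					 * plain read if that read is
					 * reordered past dma_wmb() */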

Is the above result allowed by the kernel's memory model?

In practice, my guess is that no compiler and architecture combination
would allow this today; or is there an arch where it could?

Thanks,
-- Marco
