Message-ID: <CAOZ5it3xiMwD4_HsgXR_7-ERTzoS+FG3W5Og4sKtgthFA7HsVQ@mail.gmail.com>
Date: Thu, 21 Nov 2024 10:31:11 -0700
From: Brian Johannesmeyer <bjohannesmeyer@...il.com>
To: Keith Busch <kbusch@...nel.org>
Cc: Christoph Hellwig <hch@...radead.org>, Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, linux-hardening@...r.kernel.org, 
	Raphael Isemann <teemperor@...il.com>, Cristiano Giuffrida <giuffrida@...vu.nl>, Herbert Bos <h.j.bos@...nl>, 
	Greg KH <gregkh@...uxfoundation.org>
Subject: Re: [RFC v2 0/2] dmapool: Mitigate device-controllable mem. corruption

> Here's my quick thoughts at this late hour, though I might have
> something else tomorrow. If the links are external to the dma addr
> being freed, then I think you need to change the entire alloc/free API
> to replace the dma_addr_t handle with a new type, like a tuple of
>
>   { dma_addr_t, priv_list_link }
>
> That should make it possible to preserve the low constant time to alloc
> and free in the critical section, which I think is a good thing to have.
> I found 170 uses of dma_pool_alloc, and 360 dma_pool_free in the kernel,
> so changing the API is no small task. :(

Indeed, an API change like this might be the only way to move the
metadata out of the DMA blocks while preserving dmapool's current
performance.
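
For concreteness, the handle could become a small struct pairing the
device address with a pool-private link, along these lines (the type
and field names below are purely illustrative assumptions on my part,
not a concrete proposal):

```c
#include <linux/types.h>	/* dma_addr_t, struct list_head */

/*
 * Illustrative sketch only: a handle that dma_pool_alloc() would hand
 * back instead of a bare dma_addr_t, carrying the free-list linkage
 * outside the DMA block itself. Names are hypothetical.
 */
struct dma_pool_handle {
	dma_addr_t dma;		/* device-visible address of the block */
	struct list_head link;	/* pool-private free-list linkage */
};
```

Every caller of `dma_pool_alloc()`/`dma_pool_free()` would then have to
keep this handle around instead of a raw `dma_addr_t`, which is exactly
the churn across the ~170 + ~360 call sites you counted.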

Regarding the performance hit of the submitted patches:
- *Allocations* remain constant time (`O(1)`), as they simply check
the `pool->next_block` pointer.
- *Frees* are no longer constant time. Previously, `dma_pool_free()`
converted a `vaddr` to its corresponding `block` with a simple type
cast, but it now calls `pool_find_block()`, which iterates over all of
the pool's pages (`O(n)`); a rough sketch follows this list.
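
To illustrate the second point, the free path now has to walk the
pool's page list to locate the page that owns the block, roughly like
this (a simplified sketch of code that would live in mm/dmapool.c,
where `struct dma_pool` and `struct dma_page` are defined; the helper
name and field layout here are assumptions, not the literal patch):

```c
#include <linux/list.h>

/*
 * Simplified sketch of the O(n) lookup on the free path: walk every
 * page in the pool until we find the one that contains vaddr; the
 * block metadata is then derived from that page. Field and helper
 * names are illustrative assumptions.
 */
static struct dma_page *pool_find_page(struct dma_pool *pool, void *vaddr)
{
	struct dma_page *page;

	list_for_each_entry(page, &pool->page_list, page_list) {
		if (vaddr >= page->vaddr &&
		    vaddr < page->vaddr + pool->allocation)
			return page;
	}
	return NULL;
}
```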

Therefore, optimizing `dma_pool_free()` is key. While restructuring
the API is the "best" solution, I implemented a compromise:
introducing a `struct maple_tree block_map` field in `struct dma_pool`
that stores the mapping from a `vaddr` to its corresponding `block`. A
maple tree isn't constant time, but it offers `O(log n)` lookups,
which is a significant improvement over the current `O(n)` iteration.
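
A rough sketch of what this could look like inside mm/dmapool.c
follows. The helper names, the GFP flag, and the exact insertion
points are assumptions for illustration, not the literal patch:

```c
#include <linux/maple_tree.h>

/*
 * Sketch only: struct dma_pool is assumed to gain
 *	struct maple_tree block_map;
 * initialised with mt_init(&pool->block_map) in dma_pool_create().
 * Helper names and the GFP flag are illustrative; a real version
 * would likely pass through the caller's mem_flags.
 */
static int pool_block_map_insert(struct dma_pool *pool, void *vaddr,
				 struct dma_block *block)
{
	/* index each block by its kernel virtual address */
	return mtree_store(&pool->block_map, (unsigned long)vaddr,
			   block, GFP_KERNEL);
}

static struct dma_block *pool_block_map_lookup(struct dma_pool *pool,
					       void *vaddr)
{
	/* O(log n) replacement for the O(n) page walk on the free path */
	return mtree_load(&pool->block_map, (unsigned long)vaddr);
}

static void pool_block_map_remove(struct dma_pool *pool, void *vaddr)
{
	mtree_erase(&pool->block_map, (unsigned long)vaddr);
}
```

With something like this in place, `dma_pool_free()` looks the block
up in the tree instead of walking the page list.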

Here are the performance results. I've already reported the first two
sets of numbers, but I'll repeat them here:

**With no patches applied:**
```
dmapool test: size:16   align:16   blocks:8192 time:11860
dmapool test: size:64   align:64   blocks:8192 time:11951
dmapool test: size:256  align:256  blocks:8192 time:12287
dmapool test: size:1024 align:1024 blocks:2048 time:3134
dmapool test: size:4096 align:4096 blocks:1024 time:1686
dmapool test: size:68   align:32   blocks:8192 time:12050
```

**With the submitted patches applied:**
```
dmapool test: size:16   align:16   blocks:8192 time:34432
dmapool test: size:64   align:64   blocks:8192 time:62262
dmapool test: size:256  align:256  blocks:8192 time:238137
dmapool test: size:1024 align:1024 blocks:2048 time:61386
dmapool test: size:4096 align:4096 blocks:1024 time:75342
dmapool test: size:68   align:32   blocks:8192 time:88243
```

**With the submitted patches applied AND using a maple tree to improve
the performance of vaddr-to-block translations:**
```
dmapool test: size:16   align:16   blocks:8192 time:43668
dmapool test: size:64   align:64   blocks:8192 time:44746
dmapool test: size:256  align:256  blocks:8192 time:45434
dmapool test: size:1024 align:1024 blocks:2048 time:11013
dmapool test: size:4096 align:4096 blocks:1024 time:5250
dmapool test: size:68   align:32   blocks:8192 time:45900
```

The maple tree optimization reduces the performance hit for most block
sizes, especially the larger ones. While performance is not fully back
to baseline, this offers a reasonable trade-off between protection,
runtime overhead, and ease of deployment (i.e., no API change
required).

If this approach looks acceptable, I can submit it as a V3 patch
series for further review and discussion.

Thanks,

Brian Johannesmeyer
