[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b039012c-1b04-40cb-a760-b1ef942fe23c@gmail.com>
Date: Thu, 22 Aug 2024 20:19:36 +0200
From: Christian Lamparter <chunkeey@...il.com>
To: Christoph Hellwig <hch@....de>,
LEROY Christophe <christophe.leroy2@...soprasteria.com>
Cc: Benjamin Herrenschmidt <benh@...nel.crashing.org>,
Paul Mackerras <paulus@...ba.org>, Michael Ellerman <mpe@...erman.id.au>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linuxppc-dev@...ts.ozlabs.org" <linuxppc-dev@...ts.ozlabs.org>,
Stan Johnson <userm57@...oo.com>, Finn Thain <fthain@...ux-m68k.org>
Subject: Re: [PATCH v2] powerpc: warn on emulation of dcbz instruction in
kernel mode
Hi!
Thank you all for your insightful insights :-) .
... and derp, used the wrong mail.
On 8/22/24 9:14 AM, Christoph Hellwig wrote:
> On Thu, Aug 22, 2024 at 06:39:33AM +0000, LEROY Christophe wrote:
>> powerpc has a magic instruction 'dcbz' which clears a full cacheline in
>> one go. It is far more efficient than a loop to store zeros, and since
>> 2015 memset(0) has been implemented with that instruction (commit
>> 5b2a32e80634 ("powerpc/32: memset(0): use cacheable_memzero"))
>>
>> But that instruction generates an alignment exception when used on
>> non-cached memory (whether it is RAM or not doesn't matter). It is then
>> emulated by the kernel but it of course leads to a serious performance
>> degradation, hence the warning added by commit cbe654c77961 ("powerpc:
>> warn on emulation of dcbz instruction in kernel mode"). Until now it
>> helped identify and fix use of memset() on IO memory.
>>
>> But if memset() is expected to be used with non-cached RAM, then I don't
>> know what to do. Any suggestion ?
>
> I'd suggest two things:
>
> 1) remove the warning. The use case is perfectly valid and everything
> using uncached memory is already slow, so people will just have to
> deal with it. Maybe offer a trace point instead if people care about
> it.
> 2) figure out a way to avoid this case in the dma-coherent allocator,
> which is probably the only case where it happens frequently
> (a few drivers also zero or re-zero coherent memory, but most of the
> time that is cargo cult programming and not actually needed)
I tested your patch below and got the next warning. This time from
dma_alloc_from_pool about a similar memset in:
https://elixir.bootlin.com/linux/v6.6.47/source/kernel/dma/pool.c#L261
it triggers for the sata_dwc_ex460 driver when it access the harddrive.
For anyone interested, please look below for the full warning splat.
(I switched to the MyBook Live, it's easier to disassemble+reassemble
than the MX60... but with similar results).
> For 2 I can think of two options:
>
> a) provide a arch hook for zeroing the dma memory that defaults to
> memset, but which powerpc can override
> a) figure out a way to clear the memory before marking it uncached
> if we can
>
> a) it obviously easier to verify, but b) is probably going to give
> way better performance.
>
> Below is an untested implementation of b) for dma-direct, I just need to
> find out if there is any architecture that requires the memory to be
> zeroed after it іt has been remapped. The iommu drivers might also
> need similar treatment.
>
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 4480a3cd92e087..66e94b32ab0081 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -275,6 +275,9 @@ void *dma_direct_alloc(struct device *dev, size_t size,
> if (force_dma_unencrypted(dev))
> prot = pgprot_decrypted(prot);
>
> + if (!PageHighMem(page))
> + memset(page_address(page), 0, size);
> +
> /* remove any dirty cache lines on the kernel alias */
> arch_dma_prep_coherent(page, size);
>
> @@ -283,14 +286,15 @@ void *dma_direct_alloc(struct device *dev, size_t size,
> __builtin_return_address(0));
> if (!ret)
> goto out_free_pages;
> + if (PageHighMem(page))
> + memset(ret, 0, size);
> } else {
> ret = page_address(page);
> if (dma_set_decrypted(dev, ret, size))
> goto out_leak_pages;
> + memset(ret, 0, size);
> }
>
> - memset(ret, 0, size);
> -
> if (set_uncached) {
> arch_dma_prep_coherent(page, size);
> ret = arch_dma_set_uncached(ret, size);
I added a counter in fix_alignment to count the number of traps
issued by the kernel code. From what I measured in a quick test,
With this patch, the count during boot went down from ~21000
(unmodified) to 1200 (patched)... but from what I can tell, it
still triggers for the dma_direct_alloc in emac.
if there are more patches, please. I can test them during the next
few days while I'm on holiday.
Cheers,
Christian
---
[ 2.596136] ------------[ cut here ]------------
[ 2.600759] WARNING: CPU: 0 PID: 24 at fix_alignment+0x17c/0x1b4
[ 2.606760] Modules linked in:
[ 2.609813] CPU: 0 PID: 24 Comm: kworker/u2:2 Not tainted 6.6.47 #0
[ 2.616057] Hardware name: MyBook Live APM821XX 0x12c41c83 PowerPC 44x Platform
[ 2.623337] Workqueue: events_unbound async_run_entry_fn
[ 2.628646] NIP: c0003190 LR: c0003070 CTR: c0014744
[ 2.633674] REGS: c11a94b0 TRAP: 0700 Not tainted (6.6.47)
[ 2.639402] MSR: 00021000 <CE,ME> CR: 22008808 XER: 20000000
[ 2.645312]
[ 2.645312] GPR00: c0008640 c11a95a0 c10de300 00000000 00000001 00000018 01f01cdf 00000004
[ 2.645312] GPR08: c0f30000 00000000 0147ae14 c11a9610 c0013abc 00000000 c004c1e4 00000002
[ 2.645312] GPR16: 00000000 00000001 00000000 00000000 d1000000 c0910000 c1182010 c0f50000
[ 2.645312] GPR24: c11a9774 c00804a8 c0f30000 c0910000 c0910000 00001000 c108e3a0 c11a9620
[ 2.680114] NIP [c0003190] fix_alignment+0x17c/0x1b4
[ 2.685065] LR [c0003070] fix_alignment+0x5c/0x1b4
[ 2.689844] Call Trace:
[ 2.692279] [c11a95a0] [c11a9734] 0xc11a9734 (unreliable)
[ 2.697670] [c11a95f0] [c0008640] alignment_exception+0xf0/0x164
[ 2.703666] [c11a9610] [c0000a30] Alignment+0x130/0x180
[ 2.708877] --- interrupt: 600 at memset+0x60/0xc0
[ 2.713672] NIP: c0013abc LR: c0081df8 CTR: 0000007f
[ 2.718700] REGS: c11a9620 TRAP: 0600 Not tainted (6.6.47)
[ 2.724420] MSR: 00021000 <CE,ME> CR: 42008808 XER: 20000000
[ 2.730330] DEAR: d1000020 ESR: 00000000
[ 2.730330] GPR00: 00000007 c11a9710 c10de300 d1000000 00000000 00001000 d100001c 00000004
[ 2.730330] GPR08: 00001000 0000007f 00000001 53c20800 28008808 00000000 c004c1e4 00000002
[ 2.730330] GPR16: 00000000 00000001 00000000 00000000 d1000000 c0910000 c1182010 c0f50000
[ 2.730330] GPR24: c11a9774 c00804a8 c0f30000 c0910000 c0910000 00001000 c108e3a0 000010a0
[ 2.767645] NIP [c0013abc] memset+0x60/0xc0
[ 2.771819] LR [c0081df8] dma_alloc_from_pool+0x118/0x204
[ 2.777201] --- interrupt: 600
[ 2.780242] [c11a9710] [c0081eb0] dma_alloc_from_pool+0x1d0/0x204 (unreliable)
[ 2.787449] [c11a9760] [c00807e0] dma_direct_alloc+0x90/0x2a4
[ 2.793178] [c11a97a0] [c007f850] dma_alloc_attrs+0xa8/0xf8
[ 2.798758] [c11a97e0] [c019d180] dma_pool_alloc+0x11c/0x2ac
[ 2.804409] [c11a9810] [c0434cec] dwc_desc_get+0x2c/0x98
[ 2.809722] [c11a9840] [c04355cc] dwc_prep_slave_sg+0x180/0x524
[ 2.815633] [c11a98b0] [c04d7938] sata_dwc_qc_issue+0x1d8/0x268
[ 2.821559] [c11a9920] [c04bbfb0] ata_qc_issue+0x174/0x2b0
[ 2.827046] [c11a9940] [c04c674c] __ata_scsi_queuecmd+0x200/0x4fc
[ 2.833129] [c11a9960] [c04c6a84] ata_scsi_queuecmd+0x3c/0x88
[ 2.838865] [c11a9980] [c04a7bb0] scsi_queue_rq+0x6fc/0xb40
[ 2.844429] [c11a99c0] [c03801e8] __blk_mq_issue_directly+0x40/0xe0
[ 2.850694] [c11a99f0] [c038462c] blk_mq_try_issue_directly+0xa8/0x10c
[ 2.857208] [c11a9a10] [c03854e4] blk_mq_submit_bio+0x5f4/0x67c
[ 2.863118] [c11a9a70] [c0374ef4] submit_bio_noacct_nocheck+0x210/0x2d0
[ 2.869728] [c11a9aa0] [c0206850] block_read_full_folio+0x1f4/0x3a4
[ 2.875983] [c11a9b10] [c01353d8] filemap_read_folio+0x40/0x240
[ 2.881910] [c11a9b60] [c01382d0] do_read_cache_folio+0xb4/0x228
[ 2.887906] [c11a9b90] [c039144c] read_part_sector+0x50/0x10c
[ 2.893643] [c11a9bb0] [c0392358] read_lba+0xb8/0x198
[ 2.898680] [c11a9bf0] [c0392854] efi_partition+0xd8/0xc30
[ 2.904150] [c11a9cc0] [c0390be4] bdev_disk_changed+0x2ac/0x710
[ 2.910051] [c11a9d30] [c036be5c] blkdev_get_whole+0xc4/0xd8
[ 2.915693] [c11a9d50] [c036cb7c] blkdev_get_by_dev.part.0+0x288/0x380
[ 2.922199] [c11a9d90] [c038d49c] disk_scan_partitions+0x90/0x150
[ 2.928272] [c11a9dc0] [c038d9ac] device_add_disk+0x450/0x484
[ 2.934001] [c11a9df0] [c04b7718] sd_probe+0x360/0x4fc
[ 2.939132] [c11a9e20] [c0479db0] really_probe+0x2a0/0x370
[ 2.944628] [c11a9e50] [c0479f10] __driver_probe_device+0x90/0x200
[ 2.950796] [c11a9e70] [c047a0c8] driver_probe_device+0x48/0x104
[ 2.956793] [c11a9ea0] [c047a268] __device_attach_driver+0xe4/0x130
[ 2.963048] [c11a9ec0] [c04772e0] bus_for_each_drv+0x88/0xe4
[ 2.968699] [c11a9ef0] [c047939c] __device_attach_async_helper+0xa0/0xe8
[ 2.975386] [c11a9f20] [c005096c] async_run_entry_fn+0x40/0x11c
[ 2.981296] [c11a9f40] [c0044b70] process_one_work+0x1bc/0x35c
[ 2.987111] [c11a9f70] [c0045088] worker_thread+0x378/0x4e8
[ 2.992667] [c11a9fc0] [c004c2c8] kthread+0xe4/0xe8
[ 2.997548] [c11a9ff0] [c000d210] start_kernel_thread+0x10/0x14
[ 3.003449] Code: 39400000 7d500ba6 4c00012c 2c090000 41a2fec4 83e1004c 3860fff2 38210050 4e800020 38a00000 3920fff2 4bffffd4 <0fe00000> 4bffff8c 80010054 7c0803a6
[ 3.018138] ---[ end trace 0000000000000000 ]---
Powered by blists - more mailing lists