Message-ID: <cf6a5ec7-b7cf-4ac6-91ed-5543477d4e30@broadcom.com>
Date: Tue, 11 Mar 2025 17:00:15 -0700
From: Justin Chen <justin.chen@...adcom.com>
To: Florian Fainelli <florian.fainelli@...adcom.com>,
 Robin Murphy <robin.murphy@....com>, linux-kernel@...r.kernel.org
Cc: bcm-kernel-feedback-list@...adcom.com, Jonathan Corbet <corbet@....net>,
 Christoph Hellwig <hch@....de>, Marek Szyprowski <m.szyprowski@...sung.com>,
 Steven Rostedt <rostedt@...dmis.org>, Masami Hiramatsu
 <mhiramat@...nel.org>, Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
 Petr Tesarik <petr@...arici.cz>, Michael Kelley <mhklinux@...look.com>,
 Lukas Bulwahn <lukas.bulwahn@...hat.com>,
 Bagas Sanjaya <bagasdotme@...il.com>, Eder Zulian <ezulian@...hat.com>,
 Sean Anderson <sean.anderson@...ux.dev>,
 "open list:DOCUMENTATION" <linux-doc@...r.kernel.org>,
 "open list:DMA MAPPING HELPERS" <iommu@...ts.linux.dev>,
 "open list:TRACING" <linux-trace-kernel@...r.kernel.org>
Subject: Re: [PATCH] swiotlb: Introduce DMA_ATTR_SKIP_DEVICE_SYNC



On 1/15/25 2:16 PM, Florian Fainelli wrote:
> On 1/15/25 12:12, Robin Murphy wrote:
>> On 2025-01-15 7:46 pm, Florian Fainelli wrote:
>>> From: Justin Chen <justin.chen@...adcom.com>
>>>
>>> Network device drivers' receive paths typically do the following:
>>>
>>> - dma_map_single(.., DMA_FROM_DEVICE)
>>> - dma_sync_single_for_cpu() to allow the CPU to inspect packet
>>>    descriptors
>>> - dma_unmap_single(.., DMA_FROM_DEVICE) when releasing the buffer
>>>
>>> Each of those operations incurs a copy from the original buffer to the
>>> TLB buffer, even if the device is known to be writing full buffers.
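The copy overhead described above can be modeled in plain, self-contained C. This is only an illustrative toy (the `rx_*` helpers and `struct mapping` below are stand-ins, not the kernel DMA API): each "sync" is a memcpy between the original buffer and the bounce (TLB) buffer, and when the device writes the full buffer, skipping the map-time copy drops the per-packet copy count from three to two.

```c
#include <assert.h>
#include <string.h>

/* Toy model of swiotlb bounce buffering for one RX buffer. The names
 * here (rx_map, rx_sync_for_cpu, rx_unmap) are illustrative stand-ins
 * for dma_map_single()/dma_sync_single_for_cpu()/dma_unmap_single(). */

enum { BUF_SZ = 64 };

struct mapping {
	unsigned char orig[BUF_SZ];   /* driver's buffer */
	unsigned char bounce[BUF_SZ]; /* swiotlb slot */
	int copies;                   /* CPU copies performed */
};

/* Map for DMA_FROM_DEVICE: swiotlb initializes the bounce buffer from
 * the original so unwritten bytes are not clobbered later; this is the
 * copy the proposed attribute would skip. */
static void rx_map(struct mapping *m, int skip_sync)
{
	m->copies = 0;
	if (!skip_sync) {
		memcpy(m->bounce, m->orig, BUF_SZ);
		m->copies++;
	}
}

/* Device DMA: writes the full buffer (the case the patch targets). */
static void device_write(struct mapping *m)
{
	memset(m->bounce, 0xab, BUF_SZ);
}

/* Sync for CPU: bounce -> orig so the CPU can inspect the packet. */
static void rx_sync_for_cpu(struct mapping *m)
{
	memcpy(m->orig, m->bounce, BUF_SZ);
	m->copies++;
}

/* Unmap: one final bounce -> orig copy when releasing the buffer. */
static void rx_unmap(struct mapping *m)
{
	memcpy(m->orig, m->bounce, BUF_SZ);
	m->copies++;
}

/* Run the three-step RX sequence; return the number of CPU copies. */
static int rx_path_copies(int skip_sync)
{
	struct mapping m;

	memset(m.orig, 0, BUF_SZ);
	rx_map(&m, skip_sync);
	device_write(&m);
	rx_sync_for_cpu(&m);
	rx_unmap(&m);
	assert(m.orig[0] == 0xab); /* full packet visible either way */
	return m.copies;
}
```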
>>>
>>> Add a DMA_ATTR_SKIP_DEVICE_SYNC flag which can be set by device drivers
>>> to skip the copy at dma_map_single() to speed up the RX path when the
>>> device is known to be doing full buffer writes.
>>>
>>> This has been seen to provide a 20% speedup for Wi-Fi RX throughput
>>> testing.
>>>
>>> Signed-off-by: Justin Chen <justin.chen@...adcom.com>
>>> [florian: commit message, add DMA-API attribute flag]
>>> Signed-off-by: Florian Fainelli <florian.fainelli@...adcom.com>
>>> ---
>>>   Documentation/core-api/dma-attributes.rst | 9 +++++++++
>>>   Documentation/core-api/swiotlb.rst        | 4 +++-
>>>   include/linux/dma-mapping.h               | 6 ++++++
>>>   include/trace/events/dma.h                | 3 ++-
>>>   kernel/dma/swiotlb.c                      | 8 ++++++++
>>>   5 files changed, 28 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/Documentation/core-api/dma-attributes.rst b/Documentation/core-api/dma-attributes.rst
>>> index 1887d92e8e92..ccd9c1891200 100644
>>> --- a/Documentation/core-api/dma-attributes.rst
>>> +++ b/Documentation/core-api/dma-attributes.rst
>>> @@ -130,3 +130,12 @@ accesses to DMA buffers in both privileged "supervisor" and unprivileged
>>>   subsystem that the buffer is fully accessible at the elevated privilege
>>>   level (and ideally inaccessible or at least read-only at the
>>>   lesser-privileged levels).
>>> +
>>> +DMA_ATTR_SKIP_DEVICE_SYNC
>>> +-------------------------
>>> +
>>> +Device drivers can set DMA_ATTR_SKIP_DEVICE_SYNC in order to avoid doing a copy
>>> +from the original buffer to the TLB buffer for dma_map_single() with a
>>> +DMA_FROM_DEVICE direction. This can be used to save an extra copy in a device
>>> +driver's data path when using swiotlb bounce buffering.
>>> +
>>> diff --git a/Documentation/core-api/swiotlb.rst b/Documentation/core-api/swiotlb.rst
>>> index 9e0fe027dd3b..3bc1f9ba67b2 100644
>>> --- a/Documentation/core-api/swiotlb.rst
>>> +++ b/Documentation/core-api/swiotlb.rst
>>> @@ -67,7 +67,9 @@ to the driver for programming into the device. If a DMA operation specifies
>>>   multiple memory buffer segments, a separate bounce buffer must be allocated for
>>>   each segment. swiotlb_tbl_map_single() always does a "sync" operation (i.e., a
>>>   CPU copy) to initialize the bounce buffer to match the contents of the original
>>> -buffer.
>>> +buffer, except if DMA_ATTR_SKIP_DEVICE_SYNC is specified and the direction is
>>> +DMA_FROM_DEVICE. This is a performance optimization that may not be suitable for
>>> +all platforms.
>>>   swiotlb_tbl_unmap_single() does the reverse. If the DMA operation might have
>>>   updated the bounce buffer memory and DMA_ATTR_SKIP_CPU_SYNC is not set, the
>>> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
>>> index b79925b1c433..bfdaa65f8e9d 100644
>>> --- a/include/linux/dma-mapping.h
>>> +++ b/include/linux/dma-mapping.h
>>> @@ -58,6 +58,12 @@
>>>    */
>>>   #define DMA_ATTR_PRIVILEGED        (1UL << 9)
>>> +/*
>>> + * DMA_ATTR_SKIP_DEVICE_SYNC: used to indicate that the buffer does not need to
>>> + * be synchronized to the device.
>>> + */
>>> +#define DMA_ATTR_SKIP_DEVICE_SYNC    (1UL << 10)
>>> +
>>>   /*
>>>    * A dma_addr_t can hold any valid DMA or bus address for the platform.  It can
>>>    * be given to a device to use as a DMA source or target.  It is specific to a
>>> diff --git a/include/trace/events/dma.h b/include/trace/events/dma.h
>>> index d8ddc27b6a7c..6eb8fd7e3515 100644
>>> --- a/include/trace/events/dma.h
>>> +++ b/include/trace/events/dma.h
>>> @@ -31,7 +31,8 @@ TRACE_DEFINE_ENUM(DMA_NONE);
>>>           { DMA_ATTR_FORCE_CONTIGUOUS, "FORCE_CONTIGUOUS" }, \
>>>           { DMA_ATTR_ALLOC_SINGLE_PAGES, "ALLOC_SINGLE_PAGES" }, \
>>>           { DMA_ATTR_NO_WARN, "NO_WARN" }, \
>>> -        { DMA_ATTR_PRIVILEGED, "PRIVILEGED" })
>>> +        { DMA_ATTR_PRIVILEGED, "PRIVILEGED" }, \
>>> +        { DMA_ATTR_SKIP_DEVICE_SYNC, "SKIP_DEVICE_SYNC" })
>>>   DECLARE_EVENT_CLASS(dma_map,
>>>       TP_PROTO(struct device *dev, phys_addr_t phys_addr, dma_addr_t dma_addr,
>>> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
>>> index abcf3fa63a56..8dab89bf5e33 100644
>>> --- a/kernel/dma/swiotlb.c
>>> +++ b/kernel/dma/swiotlb.c
>>> @@ -1435,8 +1435,16 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
>>>        * the original data, even if it's garbage, is necessary to match
>>>        * hardware behavior.  Use of swiotlb is supposed to be transparent,
>>>        * i.e. swiotlb must not corrupt memory by clobbering unwritten bytes.
>>> +     *
>>> +     * Setting DMA_ATTR_SKIP_DEVICE_SYNC overrides the behavior described
>>> +     * above and avoids the copy from the original buffer to the TLB
>>> +     * buffer.
>>>        */
>>> +    if (dir == DMA_FROM_DEVICE && (attrs & DMA_ATTR_SKIP_DEVICE_SYNC))
>>> +        goto out;
>>
>> Nope, we still need to initialise the SWIOTLB slot with *something*, 
>> or we're just reintroducing the same data leakage issue again. The 
>> whole deal with that was that the caller *did* expect the entire 
>> buffer to be written, but the device had an error, and thus the 
>> subsequent unmap bounced out whatever data was in SWIOTLB before.
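To make that failure mode concrete, here is a minimal user-space sketch (toy buffers and helper names, not the actual swiotlb code): when a reused slot is not reinitialized at map time and the device errors out before writing, the unmap-time bounce hands the previous occupant's bytes to the new caller.

```c
#include <assert.h>
#include <string.h>

/* Toy illustration of the data leakage concern: a shared bounce slot
 * still holds bytes from a previous mapping. If the map-time init copy
 * is skipped and the device never writes, the unmap-time bounce copies
 * that stale data into the new caller's buffer. */

enum { SLOT_SZ = 16 };

static unsigned char slot[SLOT_SZ]; /* reused bounce slot */

/* Returns nonzero if the caller's buffer ends up holding another
 * mapping's data after a failed DMA. */
static int leaks_previous_data(int skip_init_copy)
{
	unsigned char secret[SLOT_SZ], buf[SLOT_SZ];

	/* A previous user's data passed through the slot. */
	memset(secret, 0x5e, SLOT_SZ);
	memcpy(slot, secret, SLOT_SZ);

	/* New DMA_FROM_DEVICE mapping of a zeroed buffer. */
	memset(buf, 0, SLOT_SZ);
	if (!skip_init_copy)
		memcpy(slot, buf, SLOT_SZ); /* the copy being skipped */

	/* Device error: it never writes the slot. */

	/* Unmap bounces the slot contents back to the caller. */
	memcpy(buf, slot, SLOT_SZ);
	return buf[0] == 0x5e;
}
```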
> 
> Do you know how common that assumption is within drivers? Could that 
> behavior be hidden behind a flag, sort of like what is being done here?
> 
>>
>> A memset is hopefully at least a bit faster than a copy, so maybe 
>> there is still some life in this idea, but the semantic is not "skip 
>> sync", it's "I'm OK with this entire buffer getting scribbled even if my 
>> device ends up never touching it".
> 
> Sure, let me test with a memset(). Any suggestion on a name that carries 
> better semantics? I do agree that "skip sync" is not quite descriptive 
> enough.

I ran some more tests and saw minimal improvement from memset. I ran an 
isolated test using one 4k page buffer.

Average time per RX packet:
NO SWIOTLB: 2.111 us
SWIOTLB without bounce: 4.937 us
SWIOTLB with bounce: 6.631 us
SWIOTLB with memset: 6.506 us
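As a sanity check on the figures above (microseconds per RX packet), the truncated percentage reduction can be computed directly: bounce (6.631) versus no-bounce (4.937) is a 25% reduction, while memset (6.506) over bounce buys only about 1%.

```c
#include <assert.h>

/* Truncated percentage reduction going from `base` to `faster`, using
 * the per-packet times quoted above (microseconds). */
static int reduction_pct(double base, double faster)
{
	return (int)(100.0 * (base - faster) / base);
}
```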

I understand this introduces potential data leakage, but the 25% 
improvement is quite significant. The percentage will increase with 
bigger buffers. Is there any way we can find a compromise here? What if 
we hid this behind a swiotlb flag?

@@ -176,8 +176,9 @@ setup_io_tlb_npages(char *str)
                 swiotlb_force_bounce = true;
         else if (!strcmp(str, "noforce"))
                 swiotlb_force_disable = true;
+       else if (!strcmp(str, "leakDMAbuffer"))
+               swiotlb_leak_dma_buffer = true;

Thanks,
Justin


