Message-ID: <BYAPR11MB30957D871AF159CB7BB7F753D96C9@BYAPR11MB3095.namprd11.prod.outlook.com>
Date: Mon, 15 Mar 2021 20:04:36 +0000
From: "Chen, Mike Ximing" <mike.ximing.chen@...el.com>
To: "Williams, Dan J" <dan.j.williams@...el.com>
CC: Greg KH <gregkh@...uxfoundation.org>,
Netdev <netdev@...r.kernel.org>,
David Miller <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
"Arnd Bergmann" <arnd@...db.de>,
Pierre-Louis Bossart <pierre-louis.bossart@...ux.intel.com>,
"Brandeburg, Jesse" <jesse.brandeburg@...el.com>
Subject: RE: [PATCH v10 00/20] dlb: introduce DLB device driver
> From: Dan Williams <dan.j.williams@...el.com>
> On Fri, Mar 12, 2021 at 1:55 PM Chen, Mike Ximing <mike.ximing.chen@...el.com> wrote:
> >
> > In a brief description, Intel DLB is an accelerator that replaces
> > shared memory queuing systems. Large modern server-class CPUs, with
> > local caches for each core, tend to incur costly cache misses, cross
> > core snoops and contentions. The impact becomes noticeable at high
> > (messages/sec) rates, such as are seen in high throughput packet
> > processing and HPC applications. DLB is used in high rate pipelines
> > that require a variety of packet distribution & synchronization
> > schemes. It can be leveraged to accelerate user space libraries, such
> > as DPDK eventdev. It could show similar benefits in frameworks such as
> > PADATA in the Kernel - if the messaging rate is sufficiently high.
>
> Where is PADATA limited by distribution and synchronization overhead?
> It's meant for parallelizable work that has minimal communication between the work units; ordering is
> about its only synchronization overhead, not messaging. It's used for ipsec crypto and page init.
> Even potential future bulk-work usages that might benefit from PADATA, like md-raid, ksm, or kcopyd,
> do not have any messaging overhead.
>
In our PADATA investigation, the improvements come primarily from offloading ordering overhead:
parallel scheduling is offloaded to a DLB ordered (load-balanced) parallel queue, and
serialization (re-ordering) is offloaded to a DLB directed queue.
We see significant throughput increases in crypto tests using tcrypt. In our test configuration,
preliminary results show the DLB-accelerated case encrypting at 2.4x (packets/s) and decrypting
at 2.6x the rate of the unaccelerated case.
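To make the ordered-queue/directed-queue split above concrete, here is a minimal software
sketch of the semantics (not the DLB driver API or the actual PADATA patch): work items are
tagged with a sequence number at enqueue, processed by parallel workers in whatever order
they complete, and a serialization stage re-emits results in enqueue order. The function
and variable names are illustrative only.

```python
import heapq
import queue
import random
import threading

def reorder_stage(completions, total):
    """Serialization stage: emit results in original enqueue order,
    buffering completions that arrive out of order. This is the
    re-ordering work that, per the description above, a DLB directed
    queue would offload in hardware."""
    buf, out, next_seq = [], [], 0
    for _ in range(total):
        seq, val = completions.get()       # completions arrive in any order
        heapq.heappush(buf, (seq, val))
        while buf and buf[0][0] == next_seq:
            out.append(heapq.heappop(buf)[1])
            next_seq += 1
    return out

def demo():
    items = list(range(16))
    completions = queue.Queue()

    def worker(seq, item):
        # Stand-in for a unit of parallel crypto work; completion order
        # is not guaranteed to match enqueue order.
        completions.put((seq, item * item))

    threads = [threading.Thread(target=worker, args=(s, x))
               for s, x in enumerate(items)]
    random.shuffle(threads)                # force an arbitrary start order
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Results come out in enqueue order regardless of completion order.
    return reorder_stage(completions, len(items))

print(demo())
```

In hardware, both the distribution to workers and the sequence-number bookkeeping above
are done by the device, which is where the ordering-overhead savings in the tcrypt numbers
come from.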