Message-ID: <CH0PR01MB7153F90EA5FAD6C18D361CC4F2039@CH0PR01MB7153.prod.exchangelabs.com>
Date:   Mon, 28 Jun 2021 21:59:48 +0000
From:   "Marciniszyn, Mike" <mike.marciniszyn@...nelisnetworks.com>
To:     Jason Gunthorpe <jgg@...dia.com>,
        "Dalessandro, Dennis" <dennis.dalessandro@...nelisnetworks.com>
CC:     Leon Romanovsky <leon@...nel.org>,
        Doug Ledford <dledford@...hat.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-rdma@...r.kernel.org" <linux-rdma@...r.kernel.org>,
        "Pine, Kevin" <kevin.pine@...nelisnetworks.com>
Subject: RE: [PATCH rdma-next] RDMA/rdmavt: Decouple QP and SGE lists
 allocations

>
> Fine, but the main question is if you can use normal memory policy settings, not
> this.
>
> Jason

Our performance team has gotten some preliminary data on AMD platforms.

I prepared a kernel that, based on a module parameter, either allocates the QP on the "local" NUMA node (as currently done) or intentionally allocates it on the opposite socket. Our internal tests were then executed with progressively larger queue pair counts.
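
As a rough illustration of the node selection described above, here is a minimal user-space sketch. The function name, the `alloc_opposite` flag (standing in for the module parameter), and the rule used to pick the "opposite" node are all assumptions for illustration, not the actual test patch.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: choose the NUMA node for a QP allocation.
 * local_node     - node the device is attached to
 * nr_nodes       - number of NUMA nodes in the system
 * alloc_opposite - stands in for the test-only module parameter
 *
 * On a two-socket system the "opposite" node is simply the other one.
 * For more nodes this sketch picks the node halfway around, which is
 * an arbitrary choice made only for this example.
 */
int pick_qp_node(int local_node, int nr_nodes, bool alloc_opposite)
{
	assert(nr_nodes > 0);
	assert(local_node >= 0 && local_node < nr_nodes);

	if (!alloc_opposite || nr_nodes == 1)
		return local_node;
	return (local_node + nr_nodes / 2) % nr_nodes;
}
```

In a kernel build, the returned node would presumably be handed to a node-aware allocator such as kzalloc_node() when the QP is created.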

In the second case, with the intentionally opposite allocation on 64 core/socket AMD platforms, we are seeing latency degrade by ~6-7% and BW drop by ~13% on high queue count perftest runs.

SKX impact is minimal if any, but we need to look at legacy Intel chips that preceded SKX.   We are still reviewing the data and expanding the test to older chips.

Our theory is that the hfi1 receive interrupt processing is fetching cachelines across the sockets, which causes the slowdown. The receive processing is critical for hfi1 (and qib before that). This is a heavily tuned code path.

To answer some of the pending questions posed before: the mempolicy looks to be a process-relative control and does not apply to our QP allocation, where the struct rvt_qp is in the kernel. It certainly does not apply to kernel ULPs such as those created by, say, Lustre, ipoib, SRP, iSER, and NFS RDMA.

We do support comp_vector, but that distributes completion processing. Completions are triggered in our receive processing, but to a much lesser extent, depending on ULP choices and packet type. From a strategy standpoint, the code assumes that kernel receive interrupt processing is distributed either by irqbalance or by explicit user mode scripting, so that RC QP receive processing is spread across CPUs on the local socket.
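
To make the "spread across CPUs on the local socket" assumption concrete, here is a minimal sketch of a round-robin assignment. The function and parameter names are hypothetical; this is neither what irqbalance does internally nor anything taken from hfi1.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch: assign each receive interrupt to one CPU taken
 * round-robin from the list of CPUs on the device-local socket.
 * irq_cpu[i] receives the CPU chosen for the i-th interrupt.
 */
void spread_irqs_local(const int *local_cpus, size_t nr_local_cpus,
		       int *irq_cpu, size_t nr_irqs)
{
	assert(nr_local_cpus > 0);

	for (size_t i = 0; i < nr_irqs; i++)
		irq_cpu[i] = local_cpus[i % nr_local_cpus];
}
```

A user mode script following this scheme would then write each chosen CPU to /proc/irq/<N>/smp_affinity_list for the corresponding interrupt.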

Mike