Message-ID: <802b22da-199a-a724-972b-9bc0cabd43fb@huawei.com>
Date:   Wed, 1 Dec 2021 17:45:12 +0800
From:   Chengchang Tang <tangchengchang@...wei.com>
To:     <Brice.Goglin@...ia.fr>
CC:     <hwloc-devel@...ts.open-mpi.org>, <linux-kernel@...r.kernel.org>,
        <song.bao.hua@...ilicon.com>, <linuxarm@...wei.com>,
        "shenyang (M)" <shenyang39@...wei.com>,
        Jonathan Cameron <jonathan.cameron@...wei.com>,
        yangyicong <yangyicong@...wei.com>
Subject: [RFC] hwloc: Add support for exporting latency, bandwidth topology
 through calibration

Currently, hwloc exports hardware and network locality information so
that applications can query and set their affinity. In many scenarios,
however, the topology alone is not enough. For example, it does not
reflect the actual memory latency and bandwidth between different
scheduling domains. We would like hwloc to provide more detailed and
precise information about HW capabilities by adding several new
calibration tools, so that applications can make finer-grained placement
decisions and get higher performance out of the HW.

We mainly focus on exposing memory/bus bandwidth, cache coherence/bus
communication latency, etc. to users. This topology information has
neither a standard ACPI nor a dts interface to export it, but it can be
beneficial to user applications. Some examples:
1. the memory bandwidth when we spread tasks across multiple clusters
vs. gather them in one cluster
2. the memory bandwidth when we spread tasks across multiple NUMA
nodes vs. gather them in one NUMA node
3. the cache synchronization latency when we spread tasks across
multiple clusters vs. gather them in one cluster
4. the cache synchronization latency when we spread tasks across
multiple NUMA nodes vs. gather them in one NUMA node
5. bus bandwidth and congestion in a complex topology. For example, in
the topology below
node1 - node0 - node2 - node3
the bus between node0 and node2 might become a bottleneck, as
communication between node1 and node3 also depends on it.
NUMA distance can't describe this kind of complex bus topology at all.
6. I/O bandwidth and latency when we access I/O devices such as
accelerators, network devices and storage from the NUMA node the
devices belong to vs. from a different NUMA node.
...
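
To illustrate example 5, here is a minimal sketch of why a per-link
view says more than NUMA distance. Everything in it is an assumption
made for illustration: the link bandwidth values are made up, and
chain_pos, link_bw, crosses_link() and path_bandwidth() are
hypothetical names, not part of hwloc.

```c
#include <float.h>

#define NR_NODES 4
#define NR_LINKS (NR_NODES - 1)

/* Position of each NUMA node along the chain node1 - node0 - node2 - node3.
 * chain_pos[n] is the position of node n (node1 sits at position 0). */
static const int chain_pos[NR_NODES] = { 1, 0, 2, 3 };

/* link_bw[i] joins chain positions i and i+1. Made-up values in GB/s:
 * the node0 - node2 link (index 1) is assumed to be the slow one. */
static const double link_bw[NR_LINKS] = { 40.0, 20.0, 40.0 };

/* Nonzero if traffic between NUMA nodes src and dst traverses the link. */
int crosses_link(int src, int dst, int link)
{
    int lo = chain_pos[src], hi = chain_pos[dst];

    if (lo > hi) {
        int tmp = lo;
        lo = hi;
        hi = tmp;
    }
    return lo <= link && link < hi;
}

/* Bandwidth of the narrowest link on the path between src and dst.
 * Returns DBL_MAX when src == dst, since no link is traversed. */
double path_bandwidth(int src, int dst)
{
    double min_bw = DBL_MAX;
    int link;

    for (link = 0; link < NR_LINKS; link++)
        if (crosses_link(src, dst, link) && link_bw[link] < min_bw)
            min_bw = link_bw[link];
    return min_bw;
}
```

With these numbers, node1 <-> node3 and node0 <-> node2 traffic both
cross link 1, so they compete for the same 20 GB/s, which a distance
matrix alone would not reveal.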

If possible, we could also export more, such as IPC bandwidth and
latency (for example, for pipes), spinlock/mutex latency, etc. The
calibration tools will provide this data for different entities at
certain topology levels, so that applications can choose how to spread
or gather their threads according to this data.

The design of the calibration tool will be similar to netloc. Using
the calibration tool takes three steps.

The first step is to collect data about system bandwidth, latency, etc.
by running benchmark tests, since standard operating systems do not
provide this information. The raw data will be saved in files. This
step may need to be performed by a privileged user.
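
A minimal sketch of what one such benchmark could look like is given
below: it times repeated memcpy() calls and reports bytes per second.
measure_copy_bandwidth() is a hypothetical name. A real calibration run
would also bind the thread and the buffers to specific objects first
(for instance with hwloc_set_cpubind() and hwloc_alloc_membind()),
which is left out here to keep the example self-contained.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Approximate copy bandwidth in bytes per second, or -1.0 on failure.
 * Uses clock(), i.e. CPU time, which is adequate for a single thread. */
double measure_copy_bandwidth(size_t bytes, int reps)
{
    /* Calling memcpy through a volatile pointer keeps the compiler
     * from optimizing the copies away. */
    void *(*volatile copy)(void *, const void *, size_t) = memcpy;
    char *src, *dst;
    clock_t start, end;
    double secs;
    int r;

    if (bytes == 0 || reps <= 0)
        return -1.0;

    src = malloc(bytes);
    dst = malloc(bytes);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch both buffers up front so page faults are not timed. */
    memset(src, 1, bytes);
    memset(dst, 0, bytes);

    start = clock();
    for (r = 0; r < reps; r++)
        copy(dst, src, bytes);
    end = clock();

    free(src);
    free(dst);

    secs = (double)(end - start) / CLOCKS_PER_SEC;
    if (secs <= 0.0)
        return -1.0;  /* run too short for the timer resolution */
    return (double)bytes * reps / secs;
}
```

The buffer should be much larger than the last-level cache, otherwise
the result reflects cache bandwidth rather than memory bandwidth.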

The second step is to have the calibration tool convert the raw files
generated in the previous step into a file in a readable format.
No privileges are required for this step.

In the third step, applications obtain the calibration information of
the system through C APIs exposed by the calibration tool, and hwloc
commands can also be extended to show this new information.
The source of the calibration data is the readable file generated in the
second step. E.g. hwloc_get_mem_bandwidth(hwloc_topology_t topology,
unsigned idx1, unsigned idx2) could be used to get the memory bandwidth
between idx1 and idx2 in some topology type.
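
To make the third step more concrete, here is a minimal sketch of a
lookup backed by the readable file. None of it is existing hwloc API.
The file format (one "<idx1> <idx2> <bandwidth>" entry per line),
struct calib_table, calib_load() and calib_get_mem_bandwidth() are all
assumptions. The proposed hwloc_get_mem_bandwidth() could wrap a lookup
like this once the table is attached to a topology.

```c
#include <stdio.h>

#define CALIB_MAX_OBJS 16

/* Hypothetical in-memory form of the readable calibration file. */
struct calib_table {
    unsigned nr_objs;
    double mem_bw[CALIB_MAX_OBJS][CALIB_MAX_OBJS]; /* GB/s, < 0 = unknown */
};

/* Parse "<idx1> <idx2> <bandwidth>" entries from fp (assumed format).
 * Returns the number of entries read, or -1 on a malformed or
 * out-of-range entry. */
int calib_load(struct calib_table *t, FILE *fp)
{
    unsigned i, j;
    double bw;
    int n = 0, rc;

    t->nr_objs = 0;
    for (i = 0; i < CALIB_MAX_OBJS; i++)
        for (j = 0; j < CALIB_MAX_OBJS; j++)
            t->mem_bw[i][j] = -1.0;

    while ((rc = fscanf(fp, "%u %u %lf", &i, &j, &bw)) == 3) {
        if (i >= CALIB_MAX_OBJS || j >= CALIB_MAX_OBJS)
            return -1;
        t->mem_bw[i][j] = bw;
        if (i + 1 > t->nr_objs)
            t->nr_objs = i + 1;
        if (j + 1 > t->nr_objs)
            t->nr_objs = j + 1;
        n++;
    }
    return rc == EOF ? n : -1;
}

/* Memory bandwidth between idx1 and idx2, or -1.0 if not calibrated. */
double calib_get_mem_bandwidth(const struct calib_table *t,
                               unsigned idx1, unsigned idx2)
{
    if (idx1 >= CALIB_MAX_OBJS || idx2 >= CALIB_MAX_OBJS)
        return -1.0;
    return t->mem_bw[idx1][idx2];
}
```

Returning a negative value for missing pairs lets callers fall back to
plain topology information when no calibration data is available.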
