Message-ID: <340ff8ef-9ff5-7175-c234-4132bbdfc5f7@cybernetics.com>
Date: Tue, 7 Jun 2022 14:38:34 -0400
From: Tony Battersby <tonyb@...ernetics.com>
To: linux-mm@...ck.org, linux-kernel@...r.kernel.org
Cc: iommu@...ts.linux-foundation.org, kernel-team@...com,
Matthew Wilcox <willy@...radead.org>,
Keith Busch <kbusch@...nel.org>,
Andy Shevchenko <andy.shevchenko@...il.com>,
Robin Murphy <robin.murphy@....com>,
Tony Lindgren <tony@...mide.com>
Subject: [PATCH v6 00/11] mpt3sas and dmapool scalability
This patch series improves dmapool scalability by replacing linear scans
with red-black trees.
Note that Keith Busch is also working on improving dmapool scalability,
so for now I would recommend not merging my scalability patches until
Keith's approach can be evaluated. In the meantime, my patches can
serve as a benchmark comparison. I also have a number of cleanup
patches in my series that could be useful on their own.
Changes since v5:
1. inline pool_free_page() into dma_pool_destroy() to avoid adding
   unused code
2. convert scnprintf() to sysfs_emit()
3. avoid adding a hole in struct dma_pool
4. fix big O usage in description
References:
v5
https://lore.kernel.org/linux-mm/9b08ab7c-b80b-527d-9adf-7716b0868fbc@cybernetics.com/
Keith Busch's dmapool performance enhancements
https://lore.kernel.org/linux-mm/20220428202714.17630-1-kbusch@kernel.org/
Below is my original description of the motivation for these patches.
drivers/scsi/mpt3sas is running into a scalability problem with the
kernel's DMA pool implementation. With an LSI/Broadcom SAS 9300-8i
12Gb/s HBA and max_sgl_entries=256, during modprobe, mpt3sas does the
equivalent of:
    chain_dma_pool = dma_pool_create(size = 128);
    for (i = 0; i < 373959; i++)
    {
        dma_addr[i] = dma_pool_alloc(chain_dma_pool);
    }
And at rmmod, system shutdown, or system reboot, mpt3sas does the
equivalent of:
    for (i = 0; i < 373959; i++)
    {
        dma_pool_free(chain_dma_pool, dma_addr[i]);
    }
    dma_pool_destroy(chain_dma_pool);
With this usage, both dma_pool_alloc() and dma_pool_free() exhibit
O(n) complexity, although dma_pool_free() is much worse due to
implementation details. On my system, the dma_pool_free() loop above
takes about 9 seconds to run. Note that the problem was even worse
before commit 74522a92bbf0 ("scsi: mpt3sas: Optimize I/O memory
consumption in driver."), where the dma_pool_free() loop could take ~30
seconds.
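To make the complexity argument concrete, here is a minimal user-space
sketch (not the kernel's code) of the difference between finding the
page that owns an address by walking a list and finding it through an
ordered structure. The names sketch_page, find_page_linear() and
find_page_sorted() are hypothetical, the 4096-byte page size is an
assumption, and a binary search over a sorted array stands in for the
red-black tree used by the patches, since both give O(log n) lookups.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size, illustration only */

/* Hypothetical stand-in for a pool page: only its base address. */
struct sketch_page {
	uintptr_t base;
};

/* Linear scan: visit pages in order until one contains addr. O(n). */
static long find_page_linear(const struct sketch_page *pages, size_t n,
			     uintptr_t addr, size_t *steps)
{
	*steps = 0;
	for (size_t i = 0; i < n; i++) {
		(*steps)++;
		if (addr >= pages[i].base &&
		    addr - pages[i].base < SKETCH_PAGE_SIZE)
			return (long)i;
	}
	return -1;
}

/*
 * Ordered lookup: pages[] must be sorted by base address. Each step
 * halves the remaining range, so the cost is O(log n).
 */
static long find_page_sorted(const struct sketch_page *pages, size_t n,
			     uintptr_t addr, size_t *steps)
{
	size_t lo = 0, hi = n;

	*steps = 0;
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		(*steps)++;
		if (addr < pages[mid].base)
			hi = mid;		/* addr lies below this page */
		else if (addr - pages[mid].base >= SKETCH_PAGE_SIZE)
			lo = mid + 1;		/* addr lies above this page */
		else
			return (long)mid;	/* addr is inside this page */
	}
	return -1;
}
```

With 12064 pages sorted by address, looking up an address in the last
page takes 12064 steps in the linear version and at most 14 in the
ordered one. Repeating the lookup for every one of the 373959 frees is
what turns a per-call O(n) cost into seconds of run time.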
mpt3sas also has some other DMA pools, but chain_dma_pool is the only
one with so many allocations:
cat /sys/devices/pci0000:80/0000:80:07.0/0000:85:00.0/pools
(manually cleaned up column alignment)
poolinfo - 0.1
reply_post_free_array pool       1      21     192      1
reply_free pool                  1       1   41728      1
reply pool                       1       1 1335296      1
sense pool                       1       1  970272      1
chain pool                  373959  386048     128  12064
reply_post_free pool            12      12  166528     12
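As a quick sanity check on the chain pool row, the numbers can be
related to each other with a few lines of C. This is only a sketch: it
assumes the four numeric columns are allocated blocks, total blocks,
block size in bytes, and number of pages, in that order, and the
struct and helper names are hypothetical.

```c
#include <stddef.h>

/* One row of the pools listing, under the column order assumed above. */
struct pool_row {
	size_t allocated;	/* blocks currently handed out */
	size_t total;		/* blocks the pool can hold right now */
	size_t block_size;	/* bytes per block */
	size_t pages;		/* pages backing the pool */
};

/* Blocks carved out of each page (0 if the pool has no pages). */
static size_t blocks_per_page(const struct pool_row *r)
{
	return r->pages ? r->total / r->pages : 0;
}

/* Blocks that exist in the pool but are not currently allocated. */
static size_t unused_blocks(const struct pool_row *r)
{
	return r->total - r->allocated;
}

/* Bytes covered by all blocks in the pool, allocated or not. */
static size_t pool_bytes(const struct pool_row *r)
{
	return r->total * r->block_size;
}
```

For the chain pool row (373959, 386048, 128, 12064) this gives 32
blocks per page, 12089 unused blocks, and 49414144 bytes, roughly
47 MiB. The 32 blocks per page is what 128-byte blocks in a 4096-byte
page would yield, and the page count of 12064 is the n that the linear
scans have to walk.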
The patches in this series improve the scalability of the DMA pool
implementation, which significantly reduces the running time of the
DMA alloc/free loops. With the patches applied, "modprobe mpt3sas",
"rmmod mpt3sas", and system shutdown/reboot with mpt3sas loaded are
significantly faster. Here are some benchmarks (of DMA alloc/free
only, not the entire modprobe/rmmod):
dma_pool_create() + dma_pool_alloc() loop, size = 128, count = 373959
  original:         350 ms (  1x)
  dmapool patches:   18 ms ( 19x)

dma_pool_free() loop + dma_pool_destroy(), size = 128, count = 373959
  original:        8901 ms (  1x)
  dmapool patches:   19 ms (477x)