Message-ID: <6123b484-bf2c-49f0-a657-6085c7333b2e@t-8ch.de>
Date:   Tue, 10 May 2022 12:20:17 +0200
From:   Thomas Weißschuh <linux@...ssschuh.net>
To:     Christoph Hellwig <hch@....de>, g@...y.t-8ch.de
Cc:     Keith Busch <kbusch@...nel.org>, Jens Axboe <axboe@...com>,
        Sagi Grimberg <sagi@...mberg.me>, linux-kernel@...r.kernel.org,
        linux-nvme@...ts.infradead.org
Subject: Re: [PATCH] nvme-pci: fix host memory buffer allocation size

On 2022-05-10 09:03+0200, Christoph Hellwig wrote:
> On Thu, Apr 28, 2022 at 06:09:11PM +0200, Thomas Weißschuh wrote:
> > > > On my hardware we start with a chunk_size of 4MiB and just allocate
> > > > 8 (hmmaxd) * 4 = 32 MiB which is worse than 1 * 200MiB.
> > > 
> > > And that is because the hardware only has a limited set of descriptors.
> > 
> > Wouldn't it make more sense then to allocate as much memory as possible for
> > each descriptor that is available?
> > 
> > The comment in nvme_alloc_host_mem() tries to "start big".
> > But it actually starts with at most 4MiB.
> 
> Compared to what other operating systems offer, that is quite large.

Ok. I only looked at FreeBSD, which uses up to 5% of total memory per
device. [0]

> > And on devices that have hmminds > 4MiB the loop condition will never succeed
> > at all and HMB will not be used.
> > My fairly boring hardware already is at a hmminds of 3.3MiB.
> > 
> > > Is there any real problem you are fixing with this?  Do you actually
> > > see a performance difference on a relevant workload?
> > 
> > I don't have a concrete problem or performance issue.
> > During some debugging I stumbled in my kernel logs upon
> > "nvme nvme0: allocated 32 MiB host memory buffer"
> > and investigated why it was so low.
> 
> Until recently we could not even support these large sizes at all on
> typical x86 configs.  With my fairly recent change to allow vmap
> remapped iommu allocations on x86 we can do that now.  But if we
> unconditionally enabled it I'd be a little worried about using too
> much memory very easily.

This should still be limited by max_host_mem_size_mb, which defaults to 128MiB,
shouldn't it?
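
For reference, my reading of nvme_setup_host_mem() is roughly the following
(paraphrased from memory, so the exact code in drivers/nvme/host/pci.c may
differ between kernel versions):

	u64 max = (u64)max_host_mem_size_mb * SZ_1M;	/* 128MiB by default */
	u64 preferred = (u64)dev->ctrl.hmpre * 4096;
	u64 min = (u64)dev->ctrl.hmmin * 4096;

	if (min > max)
		return 0;		/* HMB is skipped entirely */
	if (preferred > max)
		preferred = max;	/* capped by the module parameter */

So even with larger chunks the total buffer would stay below the 128MiB
default unless the user raises max_host_mem_size_mb explicitly.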

> We could look into removing the min with
> PAGE_SIZE * MAX_ORDER_NR_PAGES to try to do larger segments for
> "segment challenged" controllers now that it could work on a lot
> of iommu enabled setups.  But I'd rather have a very good reason for
> that.

On my current setup (WD SN770 in a ThinkPad X1 Carbon Gen9) the NVMe controller
frequently stops responding. Switching from no scheduler to mq-deadline reduced
the frequency but did not eliminate the hangs.
Since switching to an HMB of 1 * 200MiB and back to no scheduler, this has not
happened anymore.
(But I'll need some more time to gain real confidence in this.)

Initially I assumed that PAGE_SIZE * MAX_ORDER_NR_PAGES was indeed meant as a
minimum for the DMA allocation.
As that is not the case, removing the min() entirely, instead of the max() I
proposed, would obviously be the correct thing to do.
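
To make this concrete, the chunk sizing I am talking about is roughly the
following (again paraphrased; the real nvme_alloc_host_mem() in
drivers/nvme/host/pci.c may differ in detail):

	u64 min_chunk = min_t(u64, preferred, PAGE_SIZE * MAX_ORDER_NR_PAGES);
	u64 hmminds = max_t(u32, dev->ctrl.hmminds * 4096, PAGE_SIZE * 2);
	u64 chunk_size;

	/* start big and work our way down */
	for (chunk_size = min_chunk; chunk_size >= hmminds; chunk_size /= 2) {
		if (!__nvme_alloc_host_mem(dev, preferred, chunk_size)) {
			if (!min || dev->host_mem_size >= min)
				return 0;
			nvme_free_host_mem(dev);
		}
	}

With 4KiB pages and MAX_ORDER_NR_PAGES == 1024 the min_t() caps the first
chunk at 4MiB. Dropping it would start the loop at preferred (200MiB here)
and work down from there, and it would also keep controllers with hmminds
above 4MiB from never entering the loop at all.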

[0] https://manpages.debian.org/testing/freebsd-manpages/nvme.4freebsd.en.html
