Message-ID: <20211213035412.GA24932@MiWiFi-R3L-srv>
Date: Mon, 13 Dec 2021 11:54:12 +0800
From: Baoquan He <bhe@...hat.com>
To: John Donnelly <John.p.donnelly@...cle.com>
Cc: linux-kernel@...r.kernel.org, tglx@...utronix.de, mingo@...hat.com,
bp@...en8.de, dave.hansen@...ux.intel.com, luto@...nel.org,
peterz@...radead.org, linux-mm@...ck.org,
akpm@...ux-foundation.org, hch@....de, robin.murphy@....com,
cl@...ux.com, penberg@...nel.org, rientjes@...gle.com,
iamjoonsoo.kim@....com, vbabka@...e.cz, m.szyprowski@...sung.com,
kexec@...ts.infradead.org, rppt@...ux.ibm.com
Subject: Re: [PATCH RESEND v2 0/5] Avoid requesting page from DMA zone when
no managed pages
On 12/06/21 at 10:03pm, John Donnelly wrote:
> On 12/6/21 9:16 PM, Baoquan He wrote:
> > Sorry, forgot adding x86 and x86/mm maintainers
>
> Hi,
>
> These commits need to be applied to Linux-5.15.0 (LTS) too, since it has the
> original regression:
>
> 1d659236fb43 ("dma-pool: scale the default DMA coherent pool
> size with memory capacity")
Yeah, Fixes and stable tags need to be added. Thanks for pointing that out.
As I said in the cover letter, this issue didn't occur until the commits
below were applied. So I will add 'Fixes: 6f599d84231f ("x86/kdump: Always
reserve the low 1M when the crashkernel option is specified")' to patches
4 and 5. Patches 1 and 2 are cleanups/improvements, not related to this issue.
1a6a9044b967 x86/setup: Remove CONFIG_X86_RESERVE_LOW and reservelow= options
23721c8e92f7 x86/crash: Remove crash_reserve_low_1M()
f1d4d47c5851 x86/setup: Always reserve the first 1M of RAM
7c321eb2b843 x86/kdump: Remove the backup region handling
6f599d84231f x86/kdump: Always reserve the low 1M when the crashkernel option is specified
>
> Maybe add "Fixes" to the other commits ?
>
>
> >
> > On 12/07/21 at 11:07am, Baoquan He wrote:
> > > ***Problem observed:
> > > > On x86_64, when a crash is triggered and the kdump kernel is entered, a
> > > > page allocation failure can always be seen.
> > >
> > > ---------------------------------
> > > DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
> > > swapper/0: page allocation failure: order:5, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
> > > CPU: 0 PID: 1 Comm: swapper/0
> > > Call Trace:
> > > dump_stack+0x7f/0xa1
> > > warn_alloc.cold+0x72/0xd6
> > > ......
> > > __alloc_pages+0x24d/0x2c0
> > > ......
> > > dma_atomic_pool_init+0xdb/0x176
> > > do_one_initcall+0x67/0x320
> > > ? rcu_read_lock_sched_held+0x3f/0x80
> > > kernel_init_freeable+0x290/0x2dc
> > > ? rest_init+0x24f/0x24f
> > > kernel_init+0xa/0x111
> > > ret_from_fork+0x22/0x30
> > > Mem-Info:
> > > ------------------------------------
> > >
> > > ***Root cause:
> > > > The current kernel assumes that the DMA zone must have managed pages and
> > > > tries to request pages from it if CONFIG_ZONE_DMA is enabled. However,
> > > > this is not always true. E.g. in the kdump kernel on x86_64, only the low
> > > > 1M is present and it is locked down at a very early stage of boot, so this
> > > > low 1M is never added to the buddy allocator to become managed pages of
> > > > the DMA zone. This exception always causes a page allocation failure when
> > > > a page is requested from the DMA zone.
> > >
> > > ***Investigation:
> > > > This failure has happened since the commits below were merged into Linus's tree.
> > > 1a6a9044b967 x86/setup: Remove CONFIG_X86_RESERVE_LOW and reservelow= options
> > > 23721c8e92f7 x86/crash: Remove crash_reserve_low_1M()
> > > f1d4d47c5851 x86/setup: Always reserve the first 1M of RAM
> > > 7c321eb2b843 x86/kdump: Remove the backup region handling
> > > 6f599d84231f x86/kdump: Always reserve the low 1M when the crashkernel option is specified
> > >
> > > > Before them, on x86_64, the low 640K area was reused by the kdump kernel.
> > > > So for kdump, the content of the low 640K area was copied into a backup
> > > > region for dumping before jumping into the kdump kernel. Then, except for
> > > > the firmware-reserved regions in [0, 640K], the remaining area was added
> > > > to the buddy allocator to become available managed pages of the DMA zone.
> > >
> > > > However, after the above commits were applied, in the kdump kernel on
> > > > x86_64 the low 1M is reserved by memblock but not released to the buddy
> > > > allocator. So any later page allocation from the DMA zone will fail.
> > >
> > > > This low 1M lockdown is needed because AMD SME encrypts memory, making
> > > > the old backup region mechanism impossible when switching into the kdump
> > > > kernel. An Intel engineer also mentioned that their TDX (Trust Domain
> > > > Extensions) support, which is under development in the kernel, needs the
> > > > low 1M locked down as well. So we can't simply revert the above commits to
> > > > fix the page allocation failure from the DMA zone, as someone suggested.
> > >
> > > ***Solution:
> > > > Currently, only the DMA atomic pool and dma-kmalloc are initialized with
> > > > GFP_DMA page allocations during bootup. So only initialize them when the
> > > > DMA zone has available managed pages; otherwise just skip the
> > > > initialization. Judging from testing and the code, skipping it does no
> > > > harm. In the kdump kernel on x86_64, the page allocation failure disappears.
> > >
> > > ***Further thinking
> > > > x86_64 consistently places [0, 16M] into ZONE_DMA and (16M, 4G] into
> > > > ZONE_DMA32 by default. The DMA zone covering the low 16M is there to take
> > > > care of antique ISA devices. In fact, a 64-bit system rarely needs
> > > > ZONE_DMA (the low 16M) to support almost extinct ISA devices. However,
> > > > some components treat DMA as a generic concept, e.g. kmalloc-dma: the
> > > > slab allocator initializes it for any later DMA-related buffer
> > > > allocation, not only for ISA DMA.
> > >
> > > > On arm64, even though both CONFIG_ZONE_DMA and CONFIG_ZONE_DMA32 are
> > > > enabled, ZONE_DMA covers the low 4G area and ZONE_DMA32 is empty. The
> > > > exception is specific platforms (e.g. 30-bit on Raspberry Pi 4), where
> > > > zone DMA covers the first 1G area and zone DMA32 covers the rest of the
> > > > 32-bit addressable memory.
> > >
> > > > I am wondering if we can also make the sizes of the DMA and DMA32 zones
> > > > dynamically adjusted, just as arm64 does. On x86_64, we could make zone
> > > > DMA cover the 32-bit addressable memory and leave zone DMA32 empty by
> > > > default. Once ISA_DMA_API is enabled, we would go back to making zone DMA
> > > > cover the low 16M area and zone DMA32 cover the rest of the 32-bit
> > > > addressable memory. (I am not familiar with ISA_DMA_API; will it require
> > > > 24-bit addressable memory when enabled?)
> > >
> > > Change history:
> > >
> > > v2 post:
> > > > https://lore.kernel.org/all/20210810094835.13402-1-bhe@redhat.com/T/#u
> > >
> > > v1 post:
> > > > https://lore.kernel.org/all/20210624052010.5676-1-bhe@redhat.com/T/#u
> > >
> > > v2->v2 RESEND:
> > > > John pinged to ask for a repost of this patchset. So fix a typo in the
> > > > subject of patch 3/5, and fix a build error caused by a mixed declaration
> > > > in patch 5/5. Both of them were found by John in his testing.
> > >
> > > v1->v2:
> > > > Change to check whether a managed DMA zone exists. If the DMA zone has
> > > > managed pages, go on to request pages from it for initialization.
> > > > Otherwise, just skip initializing the things that need pages from the
> > > > DMA zone.
> > >
> > > Baoquan He (5):
> > > docs: kernel-parameters: Update to reflect the current default size of
> > > atomic pool
> > > dma-pool: allow user to disable atomic pool
> > > mm_zone: add function to check if managed dma zone exists
> > > dma/pool: create dma atomic pool only if dma zone has managed pages
> > > mm/slub: do not create dma-kmalloc if no managed pages in DMA zone
> > >
> > > .../admin-guide/kernel-parameters.txt | 5 ++++-
> > > include/linux/mmzone.h | 21 +++++++++++++++++++
> > > kernel/dma/pool.c | 11 ++++++----
> > > mm/page_alloc.c | 11 ++++++++++
> > > mm/slab_common.c | 9 ++++++++
> > > 5 files changed, 52 insertions(+), 5 deletions(-)
> > >
> > > --
> > > 2.17.2
> > >
> >
>
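
For illustration only, the check described under "Solution" in the quoted
cover letter can be sketched in user-space C. This is not the code from the
patches: struct node, struct zone, dma_zone_has_managed_pages() and
init_dma_pool() are hypothetical, simplified stand-ins, and the real kernel
structures and helpers differ. The sketch only shows the idea of skipping
GFP_DMA-backed initialization when no node has managed pages in its DMA zone.

```c
#include <stdbool.h>

/* Hypothetical, simplified model; not the kernel's real definitions. */
enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, MAX_NR_ZONES };

struct zone {
	unsigned long managed_pages;	/* pages handed to the buddy allocator */
};

struct node {
	struct zone zones[MAX_NR_ZONES];
};

/* Return true if at least one node has managed pages in its DMA zone. */
static bool dma_zone_has_managed_pages(const struct node *nodes, int nr_nodes)
{
	for (int i = 0; i < nr_nodes; i++) {
		if (nodes[i].zones[ZONE_DMA].managed_pages > 0)
			return true;
	}
	return false;
}

/*
 * Hypothetical init hook: returns 1 if the DMA-backed pool would be set up,
 * 0 if it is skipped because the DMA zone has nothing to allocate from.
 */
static int init_dma_pool(const struct node *nodes, int nr_nodes)
{
	if (!dma_zone_has_managed_pages(nodes, nr_nodes))
		return 0;	/* skip quietly instead of failing the allocation */

	/* A GFP_DMA allocation for the pool would be attempted here. */
	return 1;
}
```

In this model, a kdump-like node whose DMA zone has zero managed pages makes
init_dma_pool() return 0, while a node with any managed DMA pages makes it
return 1.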