linux-kernel - Re: collision between ZONE_MOVABLE and memblock allocations

Message-ID: <9ef757dc-da4b-9fa1-de84-1328a74f18a7@redhat.com>
Date:   Wed, 26 Jul 2023 10:44:21 +0200
From:   David Hildenbrand <david@...hat.com>
To:     Ross Zwisler <zwisler@...gle.com>, Michal Hocko <mhocko@...e.com>
Cc:     linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        Mike Rapoport <rppt@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Matthew Wilcox <willy@...radead.org>,
        Mel Gorman <mgorman@...e.de>, Vlastimil Babka <vbabka@...e.cz>
Subject: Re: collision between ZONE_MOVABLE and memblock allocations

On 20.07.23 00:48, Ross Zwisler wrote:
> On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote:
>> On Tue 18-07-23 16:01:06, Ross Zwisler wrote:
>> [...]
>>> I do think that we need to fix this collision between ZONE_MOVABLE and memmap
>>> allocations, because this issue essentially makes the movablecore= kernel
>>> command line parameter useless in many cases, as the ZONE_MOVABLE region it
>>> creates will often actually be unmovable.
>>
>> movablecore is kinda hack and I would be more inclined to get rid of it
>> rather than build more into it. Could you be more specific about your
>> use case?
> 
> The problem that I'm trying to solve is that I'd like to be able to get kernel
> core dumps off machines (chromebooks) so that we can debug crashes.  Because
> the memory used by the crash kernel ("crashkernel=" kernel command line
> option) is consumed the entire time the machine is booted, there is a strong
> motivation to keep the crash kernel as small and as simple as possible.  To
> this end I'm trying to get away without SSD drivers, not having to worry about
> encryption on the SSDs, etc.

Okay, so you intend to keep the crashkernel area as small as possible.

> 
> So, the rough plan right now is:
> 
> 1) During boot set aside some memory that won't contain kernel allocations.
> I'm trying to do this now with ZONE_MOVABLE, but I'm open to better ways.
> 
> We set aside memory for a crash kernel & arm it so that the ZONE_MOVABLE
> region (or whatever non-kernel region) will be set aside as PMEM in the crash
> kernel.  This is done with the memmap=nn[KMG]!ss[KMG] kernel command line
> parameter passed to the crash kernel.
> 
> So, in my sample 4G VM system, I see:
> 
>    # lsmem --split ZONES --output-all
>    RANGE                                  SIZE  STATE REMOVABLE BLOCK NODE   ZONES
>    0x0000000000000000-0x0000000007ffffff  128M online       yes     0    0    None
>    0x0000000008000000-0x00000000bfffffff  2.9G online       yes  1-23    0   DMA32
>    0x0000000100000000-0x000000012fffffff  768M online       yes 32-37    0  Normal
>    0x0000000130000000-0x000000013fffffff  256M online       yes 38-39    0 Movable
>    
>    Memory block size:       128M
>    Total online memory:       4G
>    Total offline memory:      0B
> 
> so I'll pass "memmap=256M!0x130000000" to the crash kernel.
> 
> 2) When we hit a kernel crash, we know (hope?) that the PMEM region we've set
> aside only contains user data, which we don't want to store anyway.  

I raised that in a different context already, but such assumptions are not 
100% future-proof IMHO. For example, we might at some point be able to 
make user page tables movable and place them there.

But yes, most kernel data structures (which you care about) will 
probably never be movable and never end up on these regions.

> We make a
> filesystem in there, and create a kernel crash dump using 'makedumpfile':
> 
>    mkfs.ext4 /dev/pmem0
>    mount /dev/pmem0 /mnt
>    makedumpfile -c -d 31 /proc/vmcore /mnt/kdump
> 
> We then set up the next full kernel boot to also have this same PMEM region,
> using the same memmap kernel parameter.  We reboot back into a full kernel.
> 
> 3) The next full kernel will be a normal boot with a full networking stack,
> SSD drivers, disk encryption, etc.  We mount up our PMEM filesystem, pull out
> the kdump and either store it somewhere persistent or upload it somewhere.  We
> can then unmount the PMEM and reconfigure it back to system ram so that the
> live system isn't missing memory.
> 
>    ndctl create-namespace --reconfig=namespace0.0 -m devdax -f
>    daxctl reconfigure-device --mode=system-ram dax0.0
> 
> This is the flow I'm trying to support, and have mostly working in a VM,
> except up until now makedumpfile would crash because all the memblock
> structures it needed were in the PMEM area that I had just wiped out by making
> a new filesystem. :)
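
The memmap= value in step 1 of the quoted plan follows directly from the
Movable row of the lsmem output. Below is a minimal Python sketch of that
arithmetic. It assumes the lsmem column layout shown above and a single
contiguous Movable range. The function name memmap_arg_for_zone is
hypothetical and not part of any existing tool.

```python
SAMPLE_LSMEM = """\
RANGE                                  SIZE  STATE REMOVABLE BLOCK NODE   ZONES
0x0000000000000000-0x0000000007ffffff  128M online       yes     0    0    None
0x0000000008000000-0x00000000bfffffff  2.9G online       yes  1-23    0   DMA32
0x0000000100000000-0x000000012fffffff  768M online       yes 32-37    0  Normal
0x0000000130000000-0x000000013fffffff  256M online       yes 38-39    0 Movable
"""


def memmap_arg_for_zone(lsmem_output: str, zone: str = "Movable") -> str:
    """Build a memmap=nn!ss argument for the first range listed in `zone`.

    Assumption: the range is the first column and the zone name is the
    last column, as in the `lsmem --split ZONES --output-all` sample.
    """
    for line in lsmem_output.splitlines():
        fields = line.split()
        # Skip headers, blank lines, summary lines, and rows for other zones.
        if len(fields) < 2 or not fields[0].startswith("0x") or fields[-1] != zone:
            continue
        start_text, end_text = fields[0].split("-")
        start, end = int(start_text, 16), int(end_text, 16)
        size = end - start + 1  # lsmem prints inclusive end addresses
        mib, remainder = divmod(size, 1 << 20)
        if remainder:
            raise ValueError("range size is not a whole number of MiB")
        return f"memmap={mib}M!{start:#x}"
    raise ValueError(f"no range found for zone {zone!r}")


if __name__ == "__main__":
    print(memmap_arg_for_zone(SAMPLE_LSMEM))
```

For the sample system this produces the same "memmap=256M!0x130000000"
string that is passed to the crash kernel in the quoted plan.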


Thinking out loud (and remembering that some architectures relocate the 
crashkernel during kexec, if I am not mistaken), maybe the following would 
also work and possibly make your setup easier:

1) Don't reserve a crashkernel area in the traditional way, instead 
reserve that area using CMA. It can be used for MOVABLE allocations.

2) Let kexec load the crashkernel+initrd into ordinary memory only 
(consuming as much as you would need there).

3) On kexec, relocate the crashkernel+initrd into the CMA area 
(overwriting any movable data in there).

4) In makedumpfile, don't dump any memory that falls into the 
crashkernel area. It might already have been overwritten by the second 
kernel.
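
The filtering in step 4 amounts to subtracting one address range from the
list of ranges to dump. A rough Python sketch of that logic follows. It is
purely illustrative and is not how makedumpfile is implemented. The
exclude_region helper and the placement of the CMA-backed crashkernel area
at 0x130000000-0x13fffffff are assumptions made for the example.

```python
from typing import List, Tuple

# Half-open physical address range: [start, end)
Range = Tuple[int, int]


def exclude_region(ranges: List[Range], hole: Range) -> List[Range]:
    """Return `ranges` with every byte that overlaps `hole` removed."""
    hole_start, hole_end = hole
    kept: List[Range] = []
    for start, end in ranges:
        if end <= hole_start or start >= hole_end:
            # No overlap: keep the range unchanged.
            kept.append((start, end))
            continue
        if start < hole_start:
            # Keep the part below the excluded area.
            kept.append((start, hole_start))
        if end > hole_end:
            # Keep the part above the excluded area.
            kept.append((hole_end, end))
    return kept


if __name__ == "__main__":
    # RAM layout taken from the lsmem sample, written as half-open ranges.
    ram = [(0x0, 0xC0000000), (0x100000000, 0x140000000)]
    # Hypothetical crashkernel area that the second kernel may have overwritten.
    crash_area = (0x130000000, 0x140000000)
    for start, end in exclude_region(ram, crash_area):
        print(f"{start:#x}-{end - 1:#x}")
```

Everything outside the crashkernel area is still dumped. Only the region
that may already hold the second kernel and its initrd is skipped.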


Maybe that would allow you to make the crashkernel+initrd slightly 
bigger (to include SSD drivers etc.) and have a bigger crashkernel area, 
because while the crashkernel is armed it will only consume the 
crashkernel+initrd size and not the overall crashkernel area size.

If that makes any sense :)

-- 
Cheers,

David / dhildenb