Message-ID: <bc456c6b4b1ed51e568a37cf29b33d537e4bd94c.camel@infradead.org>
Date: Thu, 24 Jul 2025 11:25:46 +0200
From: David Woodhouse <dwmw2@...radead.org>
To: "Rafael J. Wysocki" <rafael@...nel.org>, Pavel Machek
 <pavel@...nel.org>,  linux-pm <linux-pm@...r.kernel.org>, Marc Zyngier
 <maz@...nel.org>,  linux-arm-kernel@...ts.infradead.org, "Saidi, Ali"
 <alisaidi@...zon.com>,  "oliver.upton" <oliver.upton@...ux.dev>, Joey Gouly
 <joey.gouly@....com>, Suzuki K Poulose <suzuki.poulose@....com>, Zenghui Yu
 <yuzenghui@...wei.com>, Catalin Marinas <catalin.marinas@....com>, Will
 Deacon <will@...nel.org>, linux-kernel <linux-kernel@...r.kernel.org>,
 "Heyne, Maximilian" <mheyne@...zon.de>,  Alexander Graf <graf@...zon.com>,
 "Stamatis, Ilias" <ilstam@...zon.com>
Subject: Re: Memory corruption after resume from hibernate with Arm GICv3 ITS

On Wed, 2025-07-23 at 12:04 +0200, David Woodhouse wrote:
> We have seen guests crashing when, after they resume from hibernate,
> the hypervisor serializes their state for live update or live
> migration.
> 
> The Arm Generic Interrupt Controller is a complicated beast, and it
> does scattershot DMA to little tables all across the guest's address
> space, without even living behind an IOMMU.
> 
> Rather than simply turning it off overall, the guest has to explicitly
> tear down *every* one of the individual tables which were previously
> configured, in order to ensure that the memory is no longer used.
> 
> KVM's implementation of the virtual GIC only uses this guest memory
> when asked to serialize its state. Instead of passing the information
> up to userspace as most KVM devices will do for serialization, KVM
> *only* supports scribbling it to guest memory.
> 
> So, when the transition from boot to resumed kernel leaves the vGIC
> pointing at the *wrong* addresses, that's why a subsequent LU/LM of
> that guest triggers the memory corruption by writing the KVM state to a
> guest address that the now-running kernel did *not* expect.
> 
> I tried this, just to get some more information:
> 
> --- a/drivers/irqchip/irq-gic-v3-its.c
> +++ b/drivers/irqchip/irq-gic-v3-its.c
> @@ -720,7 +720,7 @@ static struct its_collection *its_build_mapd_cmd(struct its_node *its,
>         its_encode_valid(cmd, desc->its_mapd_cmd.valid);
>  
>         its_fixup_cmd(cmd);
> -
> +       printk("%s dev 0x%x valid %d addr 0x%lx\n", __func__, desc->its_mapd_cmd.dev->device_id, desc->its_mapd_cmd.valid, itt_addr);
>         return NULL;
>  }
>  
> @@ -4996,10 +4996,15 @@ static int its_save_disable(void)
>         struct its_node *its;
>         int err = 0;
>  
> +       printk("%s\n", __func__);
>         raw_spin_lock(&its_lock);
>         list_for_each_entry(its, &its_nodes, entry) {
> +               struct its_device *its_dev;
>                 void __iomem *base;
>  
> +               list_for_each_entry(its_dev, &its->its_device_list, entry) {
> +                       its_send_mapd(its_dev, 0);
> +               }
>                 base = its->base;
>                 its->ctlr_save = readl_relaxed(base + GITS_CTLR);
>                 err = its_force_quiescent(base);
> @@ -5032,8 +5037,10 @@ static void its_restore_enable(void)
>         struct its_node *its;
>         int ret;
>  
> +       printk("%s\n", __func__);
>         raw_spin_lock(&its_lock);
>         list_for_each_entry(its, &its_nodes, entry) {
> +               struct its_device *its_dev;
>                 void __iomem *base;
>                 int i;
>  
> @@ -5083,6 +5090,10 @@ static void its_restore_enable(void)
>                 if (its->collections[smp_processor_id()].col_id <
>                     GITS_TYPER_HCC(gic_read_typer(base + GITS_TYPER)))
>                         its_cpu_init_collection(its);
> +
> +               list_for_each_entry(its_dev, &its->its_device_list, entry) {
> +                       its_send_mapd(its_dev, 1);
> +               }
>         }
>         raw_spin_unlock(&its_lock);
>  }
> 
> 
> Running on a suitable host with qemu, I reproduce with
>   # echo reboot > /sys/power/disk
>   # echo disk > /sys/power/state
> 
> Example qemu command line:
>  qemu-system-aarch64  -serial mon:stdio -M virt,gic-version=host -cpu max -enable-kvm -drive file=~/Fedora-Cloud-Base-Generic-42-1.1.aarch64.qcow2,id=nvm,if=none,snapshot=off,format=qcow2 -device nvme,drive=nvm,serial=1 -m 8g -nographic  -nic user,model=virtio -kernel vmlinuz-6.16.0-rc7-dirty  -initrd initramfs-6.16.0-rc7-dirty.img -append 'root=UUID=6c7b9058-d040-4047-a892-d2f1c7dee687 ro rootflags=subvol=root no_timer_check console=tty1 console=ttyAMA0,115200n8 systemd.firstboot=off rootflags=subvol=root no_console_suspend=1 resume_offset=366703 resume=/dev/nvme0n1p3' -trace gicv3_its\*
> 
> As the kernel boots up for the first time, it sends a normal MAPD command:
> 
> [    1.292956] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f010000
> 
> On hibernation, my newly added code unmaps and then *remaps* the same:
> 
> [root@...alhost ~]# echo disk > /sys/power/state
> [   42.118573] PM: hibernation: hibernation entry
> [   42.134574] Filesystems sync: 0.015 seconds
> [   42.134899] Freezing user space processes
> [   42.135566] Freezing user space processes completed (elapsed 0.000 seconds)
> [   42.136040] OOM killer disabled.
> [   42.136307] PM: hibernation: Preallocating image memory
> [   42.371141] PM: hibernation: Allocated 297401 pages for snapshot
> [   42.371163] PM: hibernation: Allocated 1189604 kbytes in 0.23 seconds (5172.19 MB/s)
> [   42.371170] Freezing remaining freezable tasks
> [   42.373465] Freezing remaining freezable tasks completed (elapsed 0.002 seconds)
> [   42.378350] Disabling non-boot CPUs ...
> [   42.378363] its_save_disable
> [   42.378363] its_build_mapd_cmd dev 0x10 valid 0 addr 0x10f010000
> [   42.378363] PM: hibernation: Creating image:
> [   42.378363] PM: hibernation: Need to copy 153098 pages
> [   42.378363] PM: hibernation: Image created (115354 pages copied, 37744 zero pages)
> [   42.378363] its_restore_enable
> [   42.378363] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f010000
> [   42.383601] nvme nvme0: 1/0/0 default/read/poll queues
> [   42.384411] nvme nvme0: Ignoring bogus Namespace Identifiers
> [   42.384924] hibernate: Hibernating on CPU 0 [mpidr:0x0]
> [   42.387742] PM: Using 1 thread(s) for lzo compression
> [   42.387748] PM: Compressing and saving image data (115654 pages)...
> [   42.387757] PM: Image saving progress:   0%
> [   43.485794] PM: Image saving progress:  10%
> [   44.739662] PM: Image saving progress:  20%
> [   46.617453] PM: Image saving progress:  30%
> [   48.437644] PM: Image saving progress:  40%
> [   49.857855] PM: Image saving progress:  50%
> [   52.156928] PM: Image saving progress:  60%
> [   53.344810] PM: Image saving progress:  70%
> [   54.472998] PM: Image saving progress:  80%
> [   55.083950] PM: Image saving progress:  90%
> [   56.406480] PM: Image saving progress: 100%
> [   56.407088] PM: Image saving done
> [   56.407100] PM: hibernation: Wrote 462616 kbytes in 14.01 seconds (33.02 MB/s)
> [   56.407106] PM: Image size after compression: 148041 kbytes
> [   56.408210] PM: S|
> [   56.642393] Flash device refused suspend due to active operation (state 20)
> [   56.642871] Flash device refused suspend due to active operation (state 20)
> [   56.643432] reboot: Restarting system
> [    0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd4f1]
> 
> Then the *boot* kernel comes up, does its own MAPD using a slightly different address:
> 
> [    1.270652] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f009000
> 
>  ... and then transfers control to the hibernated kernel, which again
> tries to unmap and remap the ITT at its original address due to my
> suspend/resume hack (which is clearly hooking the wrong thing, but is
> at least giving us useful information):
>
> Starting systemd-hibernate-resume.service - Resume from hibernation...
> [    1.391340] PM: hibernation: resume from hibernation
> [    1.391861] random: crng reseeded on system resumption
> [    1.391927] Freezing user space processes
> [    1.392984] Freezing user space processes completed (elapsed 0.001 seconds)
> [    1.393473] OOM killer disabled.
> [    1.393486] Freezing remaining freezable tasks
> [    1.395012] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
> [    1.400817] PM: Using 1 thread(s) for lzo decompression
> [    1.400832] PM: Loading and decompressing image data (115654 pages)...
> [    1.400836] hibernate: Hibernated on CPU 0 [mpidr:0x0]
> [    1.438621] PM: Image loading progress:   0%
> [    1.554623] PM: Image loading progress:  10%
> [    1.594714] PM: Image loading progress:  20%
> [    1.639317] PM: Image loading progress:  30%
> [    1.683055] PM: Image loading progress:  40%
> [    1.720726] PM: Image loading progress:  50%
> [    1.768878] PM: Image loading progress:  60%
> [    1.800203] PM: Image loading progress:  70%
> [    1.822833] PM: Image loading progress:  80%
> [    1.840985] PM: Image loading progress:  90%
> [    1.871253] PM: Image loading progress: 100%
> [    1.871611] PM: Image loading done
> [    1.871617] PM: hibernation: Read 462616 kbytes in 0.47 seconds (984.28 MB/s)
> [   42.378350] Disabling non-boot CPUs ...
> [   42.378363] its_save_disable
> [   42.378363] its_build_mapd_cmd dev 0x10 valid 0 addr 0x10f010000
> [   42.378363] PM: hibernation: Creating image:
> [   42.378363] PM: hibernation: Need to copy 153098 pages
> [   42.378363] hibernate: Restored 0 MTE pages
> [   42.378363] its_restore_enable
> [   42.378363] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f010000
> [   42.417445] OOM killer enabled.
> [   42.417455] Restarting tasks: Starting
> [   42.419915] nvme nvme0: 1/0/0 default/read/poll queues
> [   42.420407] Restarting tasks: Done
> [   42.420781] PM: hibernation: hibernation exit
> [   42.421149] nvme nvme0: Ignoring bogus Namespace Identifiers

Rafael points out that the resumed kernel isn't doing the unmap/remap
again; it's merely printing the *same* messages again from the printk
buffer.

Before writing the hibernate image, the kernel calls the suspend op:

[   42.378350] Disabling non-boot CPUs ...
[   42.378363] its_save_disable
[   42.378363] its_build_mapd_cmd dev 0x10 valid 0 addr 0x10f010000
[   42.378363] PM: hibernation: Creating image:

Those messages are stored in the printk buffer in the image. Then the
hibernating kernel calls the resume op, and writes the image:

[   42.378363] PM: hibernation: Image created (115354 pages copied, 37744 zero pages)
[   42.378363] its_restore_enable
[   42.378363] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f010000
[   42.383601] nvme nvme0: 1/0/0 default/read/poll queues
[   42.384411] nvme nvme0: Ignoring bogus Namespace Identifiers
[   42.384924] hibernate: Hibernating on CPU 0 [mpidr:0x0]
[   42.387742] PM: Using 1 thread(s) for lzo compression
[   42.387748] PM: Compressing and saving image data (115654 pages)...
[   42.387757] PM: Image saving progress:   0%
[   43.485794] PM: Image saving progress:  10%
...

Then the boot kernel comes up and maps an ITT:

[    1.270652] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f009000

The boot kernel never seems to *unmap* that ITT, because its suspend
method doesn't get called before control passes to the resumed image.

On resume, the hibernated kernel flushes the messages which were
already in its printk buffer to the serial port again, and then prints
these *new* messages...

[   42.378363] hibernate: Restored 0 MTE pages
[   42.378363] its_restore_enable
[   42.378363] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f010000
[   42.417445] OOM killer enabled.
[   42.417455] Restarting tasks: Starting
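
As an illustration only, here is a small userspace C sketch of how one
might separate the replayed printk lines from the genuinely new ones in
a captured console log. The function name first_new_line() and the two
sample arrays are hypothetical; the log lines themselves are copied from
the output above. The heuristic assumes that the first fresh message
after resume differs textually from everything printed before the
reboot. It deliberately stops at that first line rather than filtering
every line, because the timestamp stands still at 42.378363 and so the
later its_restore_enable/MAPD lines are byte-for-byte identical to the
pre-reboot ones even though they are new.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Console lines printed by the hibernating kernel before the reboot. */
static const char *const before_reboot[] = {
	"[   42.378350] Disabling non-boot CPUs ...",
	"[   42.378363] its_save_disable",
	"[   42.378363] its_build_mapd_cmd dev 0x10 valid 0 addr 0x10f010000",
	"[   42.378363] PM: hibernation: Creating image:",
	"[   42.378363] PM: hibernation: Need to copy 153098 pages",
	"[   42.378363] PM: hibernation: Image created (115354 pages copied, 37744 zero pages)",
	"[   42.378363] its_restore_enable",
	"[   42.378363] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f010000",
};

/* Console lines printed by the same kernel once the image is resumed. */
static const char *const after_resume[] = {
	"[   42.378350] Disabling non-boot CPUs ...",
	"[   42.378363] its_save_disable",
	"[   42.378363] its_build_mapd_cmd dev 0x10 valid 0 addr 0x10f010000",
	"[   42.378363] PM: hibernation: Creating image:",
	"[   42.378363] PM: hibernation: Need to copy 153098 pages",
	"[   42.378363] hibernate: Restored 0 MTE pages",
	"[   42.378363] its_restore_enable",
	"[   42.378363] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f010000",
};

/* True if 'line' appears verbatim among the first n entries of 'log'. */
static bool line_in_log(const char *const *log, size_t n, const char *line)
{
	for (size_t i = 0; i < n; i++) {
		if (strcmp(log[i], line) == 0)
			return true;
	}
	return false;
}

/*
 * Return the index of the first line in 'after' that was never printed
 * before the reboot. Lines before that index are treated as a replay of
 * the saved printk buffer; that line and everything following it is
 * treated as new output. Returns n_after if every line is a replay.
 */
static size_t first_new_line(const char *const *before, size_t n_before,
			     const char *const *after, size_t n_after)
{
	assert(before != NULL && after != NULL);

	for (size_t i = 0; i < n_after; i++) {
		if (!line_in_log(before, n_before, after[i]))
			return i;
	}
	return n_after;
}
```

With the sample data, the split falls on the "Restored 0 MTE pages"
line, which matches the reading above: the MAPD with valid 0 is a
replay, while the MAPD with valid 1 that follows is a real command
issued by the resume op.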

So the hibernated kernel seems to be doing the right thing in both the
suspend and resume phases, but it looks like the *boot* kernel doesn't
call the suspend method before transitioning; is that intentional? I
think we *should* unmap all the ITTs from the boot kernel.

At least for the vGIC, when the hibernated image resumes it will
*change* the mapping for every device that it knows about, but there's
a *possibility* that the boot kernel might have set up one that the
hibernated kernel didn't know about (if a new PCI device exists now?).
And I'm not sure what the real hardware will do if it gets a subsequent
MAPD without the previous mapping having been unmapped.
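
To make the concern concrete, here is a userspace toy model in C; it is
not kernel code and not a description of how any real ITS behaves. All
names (toy_its, toy_mapd, toy_resume_scenario, ...) are hypothetical, as
are device 0x20 and its ITT address 0x10f00a000; device 0x10 and its two
addresses are the ones from the logs above. The model assumes that a
MAPD with valid=1 simply overwrites an existing entry, which is the
vGIC behaviour described above and explicitly NOT something known about
real hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_DEVS 64 /* hypothetical table size, model only */

/* Toy device table entry: where the (v)ITS believes a device's ITT lives. */
struct toy_dte {
	bool valid;
	uint64_t itt_addr;
};

struct toy_its {
	struct toy_dte dev[TOY_MAX_DEVS];
};

/* What the currently running kernel believes about one device. */
struct toy_expect {
	uint32_t dev_id;
	uint64_t itt_addr;
};

/*
 * Model of MAPD: valid=1 points the entry at itt_addr (overwriting any
 * earlier mapping, which is an assumption), valid=0 tears it down.
 */
static void toy_mapd(struct toy_its *its, uint32_t dev_id, bool valid,
		     uint64_t itt_addr)
{
	assert(dev_id < TOY_MAX_DEVS);
	its->dev[dev_id].valid = valid;
	its->dev[dev_id].itt_addr = valid ? itt_addr : 0;
}

/* Count valid entries that match nothing the running kernel knows about. */
static size_t toy_count_stale(const struct toy_its *its,
			      const struct toy_expect *known, size_t n_known)
{
	size_t stale = 0;

	for (uint32_t id = 0; id < TOY_MAX_DEVS; id++) {
		bool expected = false;

		if (!its->dev[id].valid)
			continue;
		for (size_t i = 0; i < n_known; i++) {
			if (known[i].dev_id == id &&
			    known[i].itt_addr == its->dev[id].itt_addr)
				expected = true;
		}
		if (!expected)
			stale++;
	}
	return stale;
}

/* Replay boot kernel -> resumed image; return stale mappings left behind. */
static size_t toy_resume_scenario(bool extra_dev, bool boot_kernel_unmaps)
{
	struct toy_its its = { 0 };
	const struct toy_expect hibernated[] = { { 0x10, 0x10f010000ULL } };

	/* Boot kernel maps its own ITTs. */
	toy_mapd(&its, 0x10, true, 0x10f009000ULL);
	if (extra_dev)
		toy_mapd(&its, 0x20, true, 0x10f00a000ULL); /* made-up device */

	/* Proposed behaviour: boot kernel unmaps before handing over. */
	if (boot_kernel_unmaps) {
		toy_mapd(&its, 0x10, false, 0);
		toy_mapd(&its, 0x20, false, 0);
	}

	/* Hibernated kernel's resume op remaps only what it knows about. */
	toy_mapd(&its, 0x10, true, 0x10f010000ULL);

	return toy_count_stale(&its, hibernated, 1);
}
```

In this model, toy_resume_scenario(false, false) leaves nothing stale
because the resume op overwrites the only mapping, while
toy_resume_scenario(true, false) leaves the made-up device 0x20 pointing
at memory the resumed kernel never allocated as an ITT. Unmapping from
the boot kernel, toy_resume_scenario(true, true), removes that case.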


