Message-ID: <CAJZ5v0iF7xAF105byp4j777Aks8KDKAh0-hJyfzkUFq5pm-JVQ@mail.gmail.com>
Date: Thu, 24 Jul 2025 11:51:42 +0200
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: David Woodhouse <dwmw2@...radead.org>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>, Pavel Machek <pavel@...nel.org>, 
	linux-pm <linux-pm@...r.kernel.org>, Marc Zyngier <maz@...nel.org>, 
	linux-arm-kernel@...ts.infradead.org, "Saidi, Ali" <alisaidi@...zon.com>, 
	"oliver.upton" <oliver.upton@...ux.dev>, Joey Gouly <joey.gouly@....com>, 
	Suzuki K Poulose <suzuki.poulose@....com>, Zenghui Yu <yuzenghui@...wei.com>, 
	Catalin Marinas <catalin.marinas@....com>, Will Deacon <will@...nel.org>, 
	linux-kernel <linux-kernel@...r.kernel.org>, "Heyne, Maximilian" <mheyne@...zon.de>, 
	Alexander Graf <graf@...zon.com>, "Stamatis, Ilias" <ilstam@...zon.com>
Subject: Re: Memory corruption after resume from hibernate with Arm GICv3 ITS

On Thu, Jul 24, 2025 at 11:26 AM David Woodhouse <dwmw2@...radead.org> wrote:
>
> On Wed, 2025-07-23 at 12:04 +0200, David Woodhouse wrote:
> > We have seen guests crashing when, after they resume from hibernate,
> > the hypervisor serializes their state for live update or live
> > migration.
> >
> > The Arm Generic Interrupt Controller is a complicated beast, and it
> > does scattershot DMA to little tables all across the guest's address
> > space, without even living behind an IOMMU.
> >
> > Rather than simply turning it off overall, the guest has to explicitly
> > tear down *every* one of the individual tables which were previously
> > configured, in order to ensure that the memory is no longer used.
> >
> > KVM's implementation of the virtual GIC only uses this guest memory
> > when asked to serialize its state. Instead of passing the information
> > up to userspace as most KVM devices will do for serialization, KVM
> > *only* supports scribbling it to guest memory.
> >
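> > For reference, that serialization is triggered from userspace with a
> > single device-attribute ioctl on the ITS device fd. A minimal sketch
> > (its_fd here is a hypothetical fd from KVM_CREATE_DEVICE; setup and
> > error handling omitted):
> >
> > #include <sys/ioctl.h>
> > #include <linux/kvm.h>   /* on arm64 this pulls in the vGIC/ITS attrs */
> >
> > static int its_save_tables(int its_fd)
> > {
> >         struct kvm_device_attr attr = {
> >                 .group = KVM_DEV_ARM_VGIC_GRP_CTRL,
> >                 .attr  = KVM_DEV_ARM_ITS_SAVE_TABLES,
> >         };
> >
> >         /* KVM walks the device/collection tables and the per-device
> >          * ITTs, and writes them to whatever guest addresses the vGIC
> >          * currently holds. */
> >         return ioctl(its_fd, KVM_SET_DEVICE_ATTR, &attr);
> > }
> >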
> > So the transition from the boot kernel to the resumed kernel leaves
> > the vGIC pointing at the *wrong* addresses, which is why a subsequent
> > LU/LM of that guest triggers the memory corruption: KVM writes its
> > state to a guest address that the now-running kernel did *not* expect.
> >
> > I tried this, just to get some more information:
> >
> > --- a/drivers/irqchip/irq-gic-v3-its.c
> > +++ b/drivers/irqchip/irq-gic-v3-its.c
> > @@ -720,7 +720,7 @@ static struct its_collection *its_build_mapd_cmd(struct its_node *its,
> >         its_encode_valid(cmd, desc->its_mapd_cmd.valid);
> >
> >         its_fixup_cmd(cmd);
> > -
> > +       printk("%s dev 0x%x valid %d addr 0x%lx\n", __func__, desc->its_mapd_cmd.dev->device_id, desc->its_mapd_cmd.valid, itt_addr);
> >         return NULL;
> >  }
> >
> > @@ -4996,10 +4996,15 @@ static int its_save_disable(void)
> >         struct its_node *its;
> >         int err = 0;
> >
> > +       printk("%s\n", __func__);
> >         raw_spin_lock(&its_lock);
> >         list_for_each_entry(its, &its_nodes, entry) {
> > +               struct its_device *its_dev;
> >                 void __iomem *base;
> >
> > +               list_for_each_entry(its_dev, &its->its_device_list, entry) {
> > +                       its_send_mapd(its_dev, 0);
> > +               }
> >                 base = its->base;
> >                 its->ctlr_save = readl_relaxed(base + GITS_CTLR);
> >                 err = its_force_quiescent(base);
> > @@ -5032,8 +5037,10 @@ static void its_restore_enable(void)
> >         struct its_node *its;
> >         int ret;
> >
> > +       printk("%s\n", __func__);
> >         raw_spin_lock(&its_lock);
> >         list_for_each_entry(its, &its_nodes, entry) {
> > +               struct its_device *its_dev;
> >                 void __iomem *base;
> >                 int i;
> >
> > @@ -5083,6 +5090,10 @@ static void its_restore_enable(void)
> >                 if (its->collections[smp_processor_id()].col_id <
> >                     GITS_TYPER_HCC(gic_read_typer(base + GITS_TYPER)))
> >                         its_cpu_init_collection(its);
> > +
> > +               list_for_each_entry(its_dev, &its->its_device_list, entry) {
> > +                       its_send_mapd(its_dev, 1);
> > +               }
> >         }
> >         raw_spin_unlock(&its_lock);
> >  }
> >
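> > (For context: the two functions hooked above are the ITS syscore
> > callbacks, registered in irq-gic-v3-its.c roughly as
> >
> >         static struct syscore_ops its_syscore_ops = {
> >                 .suspend = its_save_disable,
> >                 .resume = its_restore_enable,
> >         };
> >
> > so the new printks fire around syscore_suspend()/syscore_resume().)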
> >
> > Running on a suitable host with qemu, I reproduce with
> >   # echo reboot > /sys/power/disk
> >   # echo disk > /sys/power/state
> >
> > Example qemu command line:
> >  qemu-system-aarch64  -serial mon:stdio -M virt,gic-version=host -cpu max -enable-kvm -drive file=~/Fedora-Cloud-Base-Generic-42-1.1.aarch64.qcow2,id=nvm,if=none,snapshot=off,format=qcow2 -device nvme,drive=nvm,serial=1 -m 8g -nographic  -nic user,model=virtio -kernel vmlinuz-6.16.0-rc7-dirty  -initrd initramfs-6.16.0-rc7-dirty.img -append 'root=UUID=6c7b9058-d040-4047-a892-d2f1c7dee687 ro rootflags=subvol=root no_timer_check console=tty1 console=ttyAMA0,115200n8 systemd.firstboot=off rootflags=subvol=root no_console_suspend=1 resume_offset=366703 resume=/dev/nvme0n1p3' -trace gicv3_its\*
> >
> > As the kernel boots up for the first time, it sends a normal MAPD command:
> >
> > [    1.292956] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f010000
> >
> > On hibernation, my newly added code unmaps and then *remaps* the same:
> >
> > [root@...alhost ~]# echo disk > /sys/power/state
> > [   42.118573] PM: hibernation: hibernation entry
> > [   42.134574] Filesystems sync: 0.015 seconds
> > [   42.134899] Freezing user space processes
> > [   42.135566] Freezing user space processes completed (elapsed 0.000 seconds)
> > [   42.136040] OOM killer disabled.
> > [   42.136307] PM: hibernation: Preallocating image memory
> > [   42.371141] PM: hibernation: Allocated 297401 pages for snapshot
> > [   42.371163] PM: hibernation: Allocated 1189604 kbytes in 0.23 seconds (5172.19 MB/s)
> > [   42.371170] Freezing remaining freezable tasks
> > [   42.373465] Freezing remaining freezable tasks completed (elapsed 0.002 seconds)
> > [   42.378350] Disabling non-boot CPUs ...
> > [   42.378363] its_save_disable
> > [   42.378363] its_build_mapd_cmd dev 0x10 valid 0 addr 0x10f010000
> > [   42.378363] PM: hibernation: Creating image:
> > [   42.378363] PM: hibernation: Need to copy 153098 pages
> > [   42.378363] PM: hibernation: Image created (115354 pages copied, 37744 zero pages)
> > [   42.378363] its_restore_enable
> > [   42.378363] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f010000
> > [   42.383601] nvme nvme0: 1/0/0 default/read/poll queues
> > [   42.384411] nvme nvme0: Ignoring bogus Namespace Identifiers
> > [   42.384924] hibernate: Hibernating on CPU 0 [mpidr:0x0]
> > [   42.387742] PM: Using 1 thread(s) for lzo compression
> > [   42.387748] PM: Compressing and saving image data (115654 pages)...
> > [   42.387757] PM: Image saving progress:   0%
> > [   43.485794] PM: Image saving progress:  10%
> > [   44.739662] PM: Image saving progress:  20%
> > [   46.617453] PM: Image saving progress:  30%
> > [   48.437644] PM: Image saving progress:  40%
> > [   49.857855] PM: Image saving progress:  50%
> > [   52.156928] PM: Image saving progress:  60%
> > [   53.344810] PM: Image saving progress:  70%
> > [   54.472998] PM: Image saving progress:  80%
> > [   55.083950] PM: Image saving progress:  90%
> > [   56.406480] PM: Image saving progress: 100%
> > [   56.407088] PM: Image saving done
> > [   56.407100] PM: hibernation: Wrote 462616 kbytes in 14.01 seconds (33.02 MB/s)
> > [   56.407106] PM: Image size after compression: 148041 kbytes
> > [   56.408210] PM: S|
> > [   56.642393] Flash device refused suspend due to active operation (state 20)
> > [   56.642871] Flash device refused suspend due to active operation (state 20)
> > [   56.643432] reboot: Restarting system
> > [    0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd4f1]
> >
> > Then the *boot* kernel comes up, does its own MAPD using a slightly different address:
> >
> > [    1.270652] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f009000
> >
> >  ... and then transfers control to the hibernated kernel, which again
> > tries to unmap and remap the ITT at its original address due to my
> > suspend/resume hack (which is clearly hooking the wrong thing, but is
> > at least giving us useful information):
> >
> > Starting systemd-hibernate-resume.service - Resume from hibernation...
> > [    1.391340] PM: hibernation: resume from hibernation
> > [    1.391861] random: crng reseeded on system resumption
> > [    1.391927] Freezing user space processes
> > [    1.392984] Freezing user space processes completed (elapsed 0.001 seconds)
> > [    1.393473] OOM killer disabled.
> > [    1.393486] Freezing remaining freezable tasks
> > [    1.395012] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
> > [    1.400817] PM: Using 1 thread(s) for lzo decompression
> > [    1.400832] PM: Loading and decompressing image data (115654 pages)...
> > [    1.400836] hibernate: Hibernated on CPU 0 [mpidr:0x0]
> > [    1.438621] PM: Image loading progress:   0%
> > [    1.554623] PM: Image loading progress:  10%
> > [    1.594714] PM: Image loading progress:  20%
> > [    1.639317] PM: Image loading progress:  30%
> > [    1.683055] PM: Image loading progress:  40%
> > [    1.720726] PM: Image loading progress:  50%
> > [    1.768878] PM: Image loading progress:  60%
> > [    1.800203] PM: Image loading progress:  70%
> > [    1.822833] PM: Image loading progress:  80%
> > [    1.840985] PM: Image loading progress:  90%
> > [    1.871253] PM: Image loading progress: 100%
> > [    1.871611] PM: Image loading done
> > [    1.871617] PM: hibernation: Read 462616 kbytes in 0.47 seconds (984.28 MB/s)
> > [   42.378350] Disabling non-boot CPUs ...
> > [   42.378363] its_save_disable
> > [   42.378363] its_build_mapd_cmd dev 0x10 valid 0 addr 0x10f010000
> > [   42.378363] PM: hibernation: Creating image:
> > [   42.378363] PM: hibernation: Need to copy 153098 pages
> > [   42.378363] hibernate: Restored 0 MTE pages
> > [   42.378363] its_restore_enable
> > [   42.378363] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f010000
> > [   42.417445] OOM killer enabled.
> > [   42.417455] Restarting tasks: Starting
> > [   42.419915] nvme nvme0: 1/0/0 default/read/poll queues
> > [   42.420407] Restarting tasks: Done
> > [   42.420781] PM: hibernation: hibernation exit
> > [   42.421149] nvme nvme0: Ignoring bogus Namespace Identifiers
>
> Rafael points out that the resumed kernel isn't doing the unmap/remap
> again; it's merely printing the *same* messages again from the printk
> buffer.
>
> Before writing the hibernate image, the kernel calls the suspend op:
>
> [   42.378350] Disabling non-boot CPUs ...
> [   42.378363] its_save_disable
> [   42.378363] its_build_mapd_cmd dev 0x10 valid 0 addr 0x10f010000
> [   42.378363] PM: hibernation: Creating image:
>
> Those messages are stored in the printk buffer in the image. Then the
> hibernating kernel calls the resume op, and writes the image:
>
> [   42.378363] PM: hibernation: Image created (115354 pages copied, 37744 zero pages)
> [   42.378363] its_restore_enable
> [   42.378363] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f010000
> [   42.383601] nvme nvme0: 1/0/0 default/read/poll queues
> [   42.384411] nvme nvme0: Ignoring bogus Namespace Identifiers
> [   42.384924] hibernate: Hibernating on CPU 0 [mpidr:0x0]
> [   42.387742] PM: Using 1 thread(s) for lzo compression
> [   42.387748] PM: Compressing and saving image data (115654 pages)...
> [   42.387757] PM: Image saving progress:   0%
> [   43.485794] PM: Image saving progress:  10%
> ...
>
> Then the boot kernel comes up and maps an ITT:
>
> [    1.270652] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f009000
>
> The boot kernel never seems to *unmap* that because the suspend method
> doesn't get called before resuming the image.
>
> On resume, the previous kernel flushes the messages which were in its
> printk buffer to the serial port again, and then prints these *new*
> messages...
>
> [   42.378363] hibernate: Restored 0 MTE pages
> [   42.378363] its_restore_enable
> [   42.378363] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f010000
> [   42.417445] OOM killer enabled.
> [   42.417455] Restarting tasks: Starting
>
> So the hibernated kernel seems to be doing the right thing in both
> suspend and resume phases, but it looks like the *boot* kernel doesn't
> call the suspend method before transitioning;

No, it does this, but the messages are missing from the log.

The last message you see from the boot/restore kernel is about loading
the image; a lot of stuff happens afterwards.

This message:

[    1.871617] PM: hibernation: Read 462616 kbytes in 0.47 seconds (984.28 MB/s)

is printed by load_compressed_image() which gets called by
swsusp_read(), which is invoked by load_image_and_restore().

It is successful, so hibernation_restore() gets called and it does
quite a bit of work, including calling resume_target_kernel(), which
among other things calls syscore_suspend(), from where your messages
should be printed if I'm not mistaken.
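
That is, roughly (condensed, error paths and locking omitted; names as
in current mainline):

  load_image_and_restore()
    swsusp_read()                        /* "Read 462616 kbytes ..." */
    hibernation_restore()
      resume_target_kernel()
        dpm_suspend_end(PMSG_QUIESCE)
        suspend_disable_secondary_cpus() /* "Disabling non-boot CPUs ..." */
        syscore_suspend()                /* -> its_save_disable() */
        swsusp_arch_resume()             /* jump into the restored image */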

I have no idea why those messages don't get into the log (that would
happen if your boot kernel were different from the image kernel and it
didn't actually print them).

> is that intentional? I think we *should* unmap all the ITTs from the boot kernel.

Yes, it's better to unmap them, even though ->

> At least for the vGIC, when the hibernated image resumes it will
> *change* the mapping for every device that it knows about, but there's
> a *possibility* that the boot kernel might have set up one that the
> hibernated kernel didn't know about (if a new PCI device exists now?).

-> HW configuration is not supposed to change across hibernation/restore.

> And I'm not sure what the real hardware will do if it gets a subsequent
> MAPD without the previous one being unmapped.
