[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <8f771516-86f3-6724-7b2c-22cc23933075@nvidia.com>
Date: Wed, 8 May 2024 23:02:50 -0400
From: Feng Liu <feliu@...dia.com>
To: Catherine Redfield <catherine.redfield@...onical.com>,
Jason Wang <jasowang@...hat.com>
Cc: Joseph Salisbury <joseph.salisbury@...onical.com>, parav@...dia.com,
jiri@...dia.com, mst@...hat.com, yishaih@...dia.com,
alex.williamson@...hat.com, xuanzhuo@...ux.alibaba.com,
virtualization@...ts.linux.dev, linux-kernel@...r.kernel.org,
Francis Ginther <francis.ginther@...onical.com>,
John Cabaj <john.cabaj@...onical.com>,
Ankush Pathak <ankush.pathak@...onical.com>,
Chlo Smith <chloe.smith@...onical.com>
Subject: Re: [REGRESSION][v6.8-rc1] virtio-pci: Introduce admin virtqueue
On 2024-05-08 a.m.7:18, Catherine Redfield wrote:
> *External email: Use caution opening links or attachments*
>
>
> On a VM with the GCP kernel (where we first identified the problem), I see:
>
> 1. The full kernel log from `journalctl --system > kernlog` attached.
> The specific suspend section is here:
>
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
> systemd[1]: Reached target sleep.target - Sleep.
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
> systemd[1]: Starting systemd-suspend.service - System Suspend...
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
> systemd-sleep[1413]: Performing sleep operation 'suspend'...
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
> kernel: PM: suspend entry (deep)
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
> kernel: Filesystems sync: 0.008 seconds
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
> kernel: Freezing user space processes
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
> kernel: Freezing user space processes completed (elapsed 0.001 seconds)
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
> kernel: OOM killer disabled.
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
> kernel: Freezing remaining freezable tasks
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
> kernel: Freezing remaining freezable tasks completed (elapsed 0.000 seconds)
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
> kernel: printk: Suspending console(s) (use no_console_suspend to debug)
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
> kernel: port 00:03:0.0: PM: dpm_run_callback():
> pm_runtime_force_suspend+0x0/0x130 returns -16
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
> kernel: port 00:03:0.0: PM: failed to suspend: error -16
Thanks Joesph and Catherine's help.
Hi,
I have alreay synced up with Cananical guys offline about this issue.
I can run "suspend/resume" sucessfully on my local server and VM.
And "PM: failed to suspend: error -16" looks like not cause by my
previous virtio patch ( fd27ef6b44be ("virtio-pci: Introduce admin
virtqueue")) which only modified "virtio_device_freeze" about "suspend"
action.
So I have provide the my steps and debug patch to Joesph and Catherine.
I will also sync up the information here, as follow:
I have read the qemu code and find a way to trigger "suspend/resume" on
my setup, and add some debug message in the latest kerenel
My setps are:
1. QEMU cmdline add following
...
-global PIIX4_PM.disable_s3=0 \
-global PIIX4_PM.disable_s4=1 \
...
-netdev type=tap,ifname=tap0,id=hostnet0,script=no,downscript=no \
-device
virtio-net-pci,netdev=hostnet0,id=net0,mac=$SSH_MAC,bus=pci.0,addr=0x3 \
.....
2. In the VM, run "systemctl suspend" to PM suspend the VM into memory
3. In qemu hmp shell, run "system_wakeup" to resume the VM again
My VM configuration:
NIC: 1 virtio nic emulated by QEMU
OS: Ubuntu 22.04.4 LTS
kernel: latest kernel, 6.9-rc7: ee5b455b0ada (kernel2/net-next-virito,
kernel2/master, master) Merge tag 'slab-for-6.9-rc7-fixes' of
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab)
I add some debug message on the latest kernel, and do above steps to
trigger "suspen/resume". Everything of VM is OK, VM could suspend/resume
successfully.
Follwing is the kernel log:
----------------------------------------------------------------------------
.......
May 6 15:59:52 feliu-vm kernel: [ 43.446737] PM: suspend entry (deep)
May 6 16:00:04 feliu-vm kernel: [ 43.467640] Filesystems sync: 0.020
seconds
May 6 16:00:04 feliu-vm kernel: [ 43.467923] Freezing user space
processes
May 6 16:00:04 feliu-vm kernel: [ 43.470294] Freezing user space
processes completed (elapsed 0.002 seconds)
May 6 16:00:04 feliu-vm kernel: [ 43.470299] OOM killer disabled.
May 6 16:00:04 feliu-vm kernel: [ 43.470301] Freezing remaining
freezable tasks
May 6 16:00:04 feliu-vm kernel: [ 43.471482] Freezing remaining
freezable tasks completed (elapsed 0.001 seconds)
May 6 16:00:04 feliu-vm kernel: [ 43.471495] printk: Suspending
console(s) (use no_console_suspend to debug)
May 6 16:00:04 feliu-vm kernel: [ 43.474034] virtio_net virtio0:
godeng virtio device freeze
May 6 16:00:04 feliu-vm kernel: [ 43.475714] virtio_net virtio0 ens3:
godfeng virtnet_freeze done
May 6 16:00:04 feliu-vm kernel: [ 43.475717] virtio_net virtio0:
godfeng VIRTIO_F_ADMIN_VQ not enabled
May 6 16:00:04 feliu-vm kernel: [ 43.475719] virtio_net virtio0:
godeng virtio device freeze done
.......
May 6 16:00:04 feliu-vm kernel: [ 43.535382] smpboot: CPU 1 is now
offline
May 6 16:00:04 feliu-vm kernel: [ 43.537283] IRQ fixup: irq 1 move in
progress, old vector 32
May 6 16:00:04 feliu-vm kernel: [ 43.538504] smpboot: CPU 2 is now
offline
May 6 16:00:04 feliu-vm kernel: [ 43.541392] smpboot: CPU 3 is now
offline
.....
May 6 16:00:04 feliu-vm kernel: [ 54.973285] smpboot: Booting Node 0
Processor 15 APIC 0xf
May 6 16:00:04 feliu-vm kernel: [ 54.975190] CPU15 is up
May 6 16:00:04 feliu-vm kernel: [ 54.976011] ACPI: PM: Waking up from
system sleep state S3
May 6 16:00:04 feliu-vm kernel: [ 54.986071] virtio_net virtio0:
godeng virtio device restore
May 6 16:00:04 feliu-vm kernel: [ 54.987563] virtio_net virtio0 ens3:
godfeng virtnet_restore done
May 6 16:00:04 feliu-vm kernel: [ 54.987635] virtio_net virtio0:
godfeng: virtio device restore done
.....
May 6 16:00:04 feliu-vm kernel: [ 55.307221] ata8: SATA link down
(SStatus 0 SControl 300)
May 6 16:00:04 feliu-vm kernel: [ 55.442048] OOM killer enabled.
May 6 16:00:04 feliu-vm kernel: [ 55.442051] Restarting tasks ... done.
May 6 16:00:04 feliu-vm kernel: [ 55.443576] random: crng reseeded on
system resumption
May 6 16:00:04 feliu-vm kernel: [ 55.443582] PM: suspend exit
----------------------------------------------------------------------------
Attachment is the full kernel log. I think maybe it is some configration
error.
Thanks
Feng
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
> kernel: sd 0:0:1:0: [sda] Synchronizing SCSI cache
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
> kernel: PM: Some devices failed to suspend, or early wake event detected
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
> kernel: OOM killer enabled.
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
> kernel: Restarting tasks ... done.
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
> kernel: random: crng reseeded on system resumption
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
> kernel: PM: suspend exit
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
> kernel: PM: suspend entry (s2idle)
> -- Boot 61828bc938b44fc68a8aeedc16a23a9d --
> May 08 11:09:03 localhost kernel: Linux version 6.8.0-1007-gcp
> (buildd@...02-amd64-079) (x86_64-linux-gnu-gcc-13 (Ubuntu
> 13.2.0-23ubuntu4) 13.2.0, GNU ld (GNU Binutils for Ubuntu) 2.42)
> #7-Ubuntu SMP Sat Apr 20 00:58:31 UTC 2024 (Ubuntu 6.8.0-1007.7-gcp 6.8.1)
> May 08 11:09:03 localhost kernel: Command line:
> BOOT_IMAGE=/vmlinuz-6.8.0-1007-gcp
> root=PARTUUID=7a949935-6bf2-4cae-b404-803c95163572 ro
> console=ttyS0,115200 panic=-1
>
> 2. The features the devices has:
>
> catred@...nel-test-202405080702:~$ cat
> /sys/bus/virtio/devices/virtio0/features
> 0110000000000000000000000000010000000000000000000000000000000000
> catred@...nel-test-202405080702:~$ cat
> /sys/bus/virtio/devices/virtio1/features
> 1110010110011001110000100000010000000000000000000000000000000000
> catred@...nel-test-202405080702:~$ cat
> /sys/bus/virtio/devices/virtio2/features
> 1110000000000000000000000000000000000000000000000000000000000000
> catred@...nel-test-202405080702:~$ cat
> /sys/bus/virtio/devices/virtio3/features
> 0000000000000000000000000000000000000000000000000000000000000000
>
> Catherine
>
> On Tue, May 7, 2024 at 11:34 PM Jason Wang <jasowang@...hat.com
> <mailto:jasowang@...hat.com>> wrote:
>
> On Sat, May 4, 2024 at 2:10 AM Joseph Salisbury
> <joseph.salisbury@...onical.com
> <mailto:joseph.salisbury@...onical.com>> wrote:
> >
> > Hi Feng,
> >
> > During testing, a kernel bug was identified with the suspend/resume
> > functionality on instances running in a public cloud [0]. This
> bug is a
> > regression introduced in v6.8-rc1. After a kernel bisect, the
> following
> > commit was identified as the cause of the regression:
> >
> > fd27ef6b44be ("virtio-pci: Introduce admin virtqueue")
>
> Have a quick glance at the patch it seems it should not damage the
> freeze/restore as it should behave as in the past.
>
> But I found something interesting:
>
> 1) assumes 1 admin vq which is not what spec said
> 2) special function for admin virtqueue during freeze/restore, but it
> doesn't do anything special than del_vq()
> 3) lack real users but I guess e.g the destroy_avq() needs to be
> synchronized with the one that is using admin virtqueue
>
> >
> > I was hoping to get your feedback, since you are the patch author. Do
> > you think gathering any additional data will help diagnose this
> issue?
>
> Yes, please show us
>
> 1) the kernel log here.
> 2) the features that the device has like
> /sys/bus/virtio/devices/virtio0/features
>
> > This commit is depended upon by other virtio commits, so a revert
> test
> > is not really straight forward without reverting all the
> dependencies.
> > Any ideas you have would be greatly appreciated.
>
> Thanks
>
> >
> >
> > Thanks,
> >
> > Joe
> >
> > http://pad.lv/2063315 <http://pad.lv/2063315>
> >
>
View attachment "kern.log" of type "text/plain" (14287 bytes)
Powered by blists - more mailing lists