Message-ID: <8f771516-86f3-6724-7b2c-22cc23933075@nvidia.com>
Date: Wed, 8 May 2024 23:02:50 -0400
From: Feng Liu <feliu@...dia.com>
To: Catherine Redfield <catherine.redfield@...onical.com>,
 Jason Wang <jasowang@...hat.com>
Cc: Joseph Salisbury <joseph.salisbury@...onical.com>, parav@...dia.com,
 jiri@...dia.com, mst@...hat.com, yishaih@...dia.com,
 alex.williamson@...hat.com, xuanzhuo@...ux.alibaba.com,
 virtualization@...ts.linux.dev, linux-kernel@...r.kernel.org,
 Francis Ginther <francis.ginther@...onical.com>,
 John Cabaj <john.cabaj@...onical.com>,
 Ankush Pathak <ankush.pathak@...onical.com>,
 Chloe Smith <chloe.smith@...onical.com>
Subject: Re: [REGRESSION][v6.8-rc1] virtio-pci: Introduce admin virtqueue


On 2024-05-08 a.m.7:18, Catherine Redfield wrote:
> On a VM with the GCP kernel (where we first identified the problem), I see:
> 
> 1. The full kernel log from `journalctl --system > kernlog` attached.  
> The specific suspend section is here:
> 
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal 
> systemd[1]: Reached target sleep.target - Sleep.
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal 
> systemd[1]: Starting systemd-suspend.service - System Suspend...
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal 
> systemd-sleep[1413]: Performing sleep operation 'suspend'...
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal 
> kernel: PM: suspend entry (deep)
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal 
> kernel: Filesystems sync: 0.008 seconds
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal 
> kernel: Freezing user space processes
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal 
> kernel: Freezing user space processes completed (elapsed 0.001 seconds)
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal 
> kernel: OOM killer disabled.
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal 
> kernel: Freezing remaining freezable tasks
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal 
> kernel: Freezing remaining freezable tasks completed (elapsed 0.000 seconds)
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal 
> kernel: printk: Suspending console(s) (use no_console_suspend to debug)
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal 
> kernel: port 00:03:0.0: PM: dpm_run_callback(): 
> pm_runtime_force_suspend+0x0/0x130 returns -16
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal 
> kernel: port 00:03:0.0: PM: failed to suspend: error -16

Thanks to Joseph and Catherine for their help.

Hi,

I have already synced up with the Canonical folks offline about this issue.

I can run suspend/resume successfully on my local server and VM.
Also, "PM: failed to suspend: error -16" does not look like it is caused
by my previous virtio patch (fd27ef6b44be ("virtio-pci: Introduce admin
virtqueue")), which, on the suspend path, only modified
virtio_device_freeze().
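
For readers following the error code: on Linux, errno 16 is EBUSY, so
the line quoted above says that the device named "port 00:03:0.0"
returned "busy" from its suspend callback. Below is a minimal Python
sketch that pulls the device name and symbolic errno out of such a line.
The helper name decode_pm_failure is made up for this illustration, and
the log line is the one from the report, joined onto a single line.

```python
import errno
import os
import re

# Log line copied from the quoted report, with the mail wrapping undone.
LINE = "kernel: port 00:03:0.0: PM: failed to suspend: error -16"


def decode_pm_failure(line):
    """Return (device, errno_name, description), or None if no match."""
    m = re.search(r"kernel: (.+?): PM: failed to suspend: error -(\d+)", line)
    if m is None:
        return None
    code = int(m.group(2))
    # errno.errorcode maps the number to its symbolic name on this platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return m.group(1), name, os.strerror(code)


print(decode_pm_failure(LINE))
```

The device name in the tuple is what the PM core printed; which driver
owns that device is not something this sketch can tell.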

So I have provided my steps and debug patch to Joseph and Catherine.
I am also sharing the information here, as follows:

I read the QEMU code and found a way to trigger suspend/resume on my
setup, and I added some debug messages to the latest kernel.

My steps are:
1. Add the following to the QEMU command line:
...
-global PIIX4_PM.disable_s3=0 \
-global PIIX4_PM.disable_s4=1 \
...
-netdev type=tap,ifname=tap0,id=hostnet0,script=no,downscript=no \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=$SSH_MAC,bus=pci.0,addr=0x3 \
.....

2. In the VM, run "systemctl suspend" to suspend the VM to RAM (S3).
3. In the QEMU HMP shell, run "system_wakeup" to resume the VM.
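
Before step 2 it can help to confirm that the guest actually offers S3.
The kernel lists the available suspend-to-RAM variants in
/sys/power/mem_sleep, with the selected one in brackets; "deep" is the
variant that corresponds to ACPI S3 and matches the "PM: suspend entry
(deep)" lines in the logs. Below is a small sketch, assuming a Linux
guest that has this sysfs file (parse_mem_sleep is an illustrative
name, not an existing tool).

```python
from pathlib import Path


def parse_mem_sleep(text):
    """Split the contents of /sys/power/mem_sleep into (available, selected)."""
    available, selected = [], None
    for token in text.split():
        # The currently selected variant is printed in square brackets.
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            selected = token
        available.append(token)
    return available, selected


if __name__ == "__main__":
    try:
        text = Path("/sys/power/mem_sleep").read_text()
    except OSError:
        text = None  # not a Linux system, or the file is not readable
    if text is None:
        print("cannot read /sys/power/mem_sleep on this system")
    else:
        modes, current = parse_mem_sleep(text)
        print("available:", modes, "selected:", current)
        print("S3 (deep) offered:", "deep" in modes)
```

If "deep" is missing from the list, "systemctl suspend" would be
expected to fall back to another variant such as s2idle, so the S3 path
would not be exercised.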

My VM configuration:
NIC:     1 virtio NIC emulated by QEMU
OS:      Ubuntu 22.04.4 LTS
kernel:  latest kernel, 6.9-rc7: ee5b455b0ada (kernel2/net-next-virito, 
kernel2/master, master) Merge tag 'slab-for-6.9-rc7-fixes' of 
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab)


I added some debug messages to the latest kernel and followed the steps
above to trigger suspend/resume. Everything in the VM is OK; the VM
suspends and resumes successfully.
The following is the kernel log:
----------------------------------------------------------------------------
.......
May  6 15:59:52 feliu-vm kernel: [   43.446737] PM: suspend entry (deep)
May  6 16:00:04 feliu-vm kernel: [   43.467640] Filesystems sync: 0.020 
seconds
May  6 16:00:04 feliu-vm kernel: [   43.467923] Freezing user space 
processes
May  6 16:00:04 feliu-vm kernel: [   43.470294] Freezing user space 
processes completed (elapsed 0.002 seconds)
May  6 16:00:04 feliu-vm kernel: [   43.470299] OOM killer disabled.
May  6 16:00:04 feliu-vm kernel: [   43.470301] Freezing remaining 
freezable tasks
May  6 16:00:04 feliu-vm kernel: [   43.471482] Freezing remaining 
freezable tasks completed (elapsed 0.001 seconds)
May  6 16:00:04 feliu-vm kernel: [   43.471495] printk: Suspending 
console(s) (use no_console_suspend to debug)
May  6 16:00:04 feliu-vm kernel: [   43.474034] virtio_net virtio0: 
godeng virtio device freeze
May  6 16:00:04 feliu-vm kernel: [   43.475714] virtio_net virtio0 ens3: 
godfeng virtnet_freeze done
May  6 16:00:04 feliu-vm kernel: [   43.475717] virtio_net virtio0: 
godfeng VIRTIO_F_ADMIN_VQ not enabled
May  6 16:00:04 feliu-vm kernel: [   43.475719] virtio_net virtio0: 
godeng virtio device freeze done
.......
May  6 16:00:04 feliu-vm kernel: [   43.535382] smpboot: CPU 1 is now 
offline
May  6 16:00:04 feliu-vm kernel: [   43.537283] IRQ fixup: irq 1 move in 
progress, old vector 32
May  6 16:00:04 feliu-vm kernel: [   43.538504] smpboot: CPU 2 is now 
offline
May  6 16:00:04 feliu-vm kernel: [   43.541392] smpboot: CPU 3 is now 
offline

.....

May  6 16:00:04 feliu-vm kernel: [   54.973285] smpboot: Booting Node 0 
Processor 15 APIC 0xf
May  6 16:00:04 feliu-vm kernel: [   54.975190] CPU15 is up
May  6 16:00:04 feliu-vm kernel: [   54.976011] ACPI: PM: Waking up from 
system sleep state S3
May  6 16:00:04 feliu-vm kernel: [   54.986071] virtio_net virtio0: 
godeng virtio device restore
May  6 16:00:04 feliu-vm kernel: [   54.987563] virtio_net virtio0 ens3: 
godfeng virtnet_restore done
May  6 16:00:04 feliu-vm kernel: [   54.987635] virtio_net virtio0: 
godfeng: virtio device restore done
.....
May  6 16:00:04 feliu-vm kernel: [   55.307221] ata8: SATA link down 
(SStatus 0 SControl 300)
May  6 16:00:04 feliu-vm kernel: [   55.442048] OOM killer enabled.
May  6 16:00:04 feliu-vm kernel: [   55.442051] Restarting tasks ... done.
May  6 16:00:04 feliu-vm kernel: [   55.443576] random: crng reseeded on 
system resumption
May  6 16:00:04 feliu-vm kernel: [   55.443582] PM: suspend exit

----------------------------------------------------------------------------
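
To put numbers on the virtio part of the log above, the kernel
timestamps can be subtracted directly. The sketch below is illustrative
only: it embeds four of the debug lines with the mail wrapping undone,
and the helper names timestamps_us and elapsed_us are made up here.

```python
import re

# Four debug lines from the log above, with the mail wrapping undone.
LOG = """\
May  6 16:00:04 feliu-vm kernel: [   43.474034] virtio_net virtio0: godeng virtio device freeze
May  6 16:00:04 feliu-vm kernel: [   43.475719] virtio_net virtio0: godeng virtio device freeze done
May  6 16:00:04 feliu-vm kernel: [   54.986071] virtio_net virtio0: godeng virtio device restore
May  6 16:00:04 feliu-vm kernel: [   54.987635] virtio_net virtio0: godfeng: virtio device restore done
"""

# Matches "[   43.474034] message" and captures seconds, microseconds, message.
STAMP = re.compile(r"kernel: \[\s*(\d+)\.(\d{6})\] (.*)$")


def timestamps_us(log):
    """Map each message to its kernel timestamp in microseconds."""
    out = {}
    for line in log.splitlines():
        m = STAMP.search(line)
        if m:
            out[m.group(3)] = int(m.group(1)) * 1_000_000 + int(m.group(2))
    return out


def elapsed_us(log, start_suffix, end_suffix):
    """Microseconds between the messages ending in the two suffixes."""
    ts = timestamps_us(log)
    start = next(t for msg, t in ts.items() if msg.endswith(start_suffix))
    end = next(t for msg, t in ts.items() if msg.endswith(end_suffix))
    return end - start


print("freeze :", elapsed_us(LOG, "virtio device freeze",
                             "virtio device freeze done"), "us")
print("restore:", elapsed_us(LOG, "virtio device restore",
                             "virtio device restore done"), "us")
```

In this excerpt both the freeze and the restore of virtio0 complete in
under 2 ms.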

The full kernel log is attached. I think this may be a configuration
error.


Thanks
Feng


> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal 
> kernel: sd 0:0:1:0: [sda] Synchronizing SCSI cache
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal 
> kernel: PM: Some devices failed to suspend, or early wake event detected
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal 
> kernel: OOM killer enabled.
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal 
> kernel: Restarting tasks ... done.
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal 
> kernel: random: crng reseeded on system resumption
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal 
> kernel: PM: suspend exit
> May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal 
> kernel: PM: suspend entry (s2idle)
> -- Boot 61828bc938b44fc68a8aeedc16a23a9d --
> May 08 11:09:03 localhost kernel: Linux version 6.8.0-1007-gcp 
> (buildd@...02-amd64-079) (x86_64-linux-gnu-gcc-13 (Ubuntu 
> 13.2.0-23ubuntu4) 13.2.0, GNU ld (GNU Binutils for Ubuntu) 2.42) 
> #7-Ubuntu SMP Sat Apr 20 00:58:31 UTC 2024 (Ubuntu 6.8.0-1007.7-gcp 6.8.1)
> May 08 11:09:03 localhost kernel: Command line: 
> BOOT_IMAGE=/vmlinuz-6.8.0-1007-gcp 
> root=PARTUUID=7a949935-6bf2-4cae-b404-803c95163572 ro 
> console=ttyS0,115200 panic=-1
> 
> 2. The features the devices has:
> 
> catred@...nel-test-202405080702:~$ cat 
> /sys/bus/virtio/devices/virtio0/features
> 0110000000000000000000000000010000000000000000000000000000000000
> catred@...nel-test-202405080702:~$ cat 
> /sys/bus/virtio/devices/virtio1/features
> 1110010110011001110000100000010000000000000000000000000000000000
> catred@...nel-test-202405080702:~$ cat 
> /sys/bus/virtio/devices/virtio2/features
> 1110000000000000000000000000000000000000000000000000000000000000
> catred@...nel-test-202405080702:~$ cat 
> /sys/bus/virtio/devices/virtio3/features
> 0000000000000000000000000000000000000000000000000000000000000000
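
The feature strings above can be decoded bit by bit. Below is a minimal
sketch, assuming that the sysfs file prints bit 0 as the leftmost
character (as far as I can tell, that is how drivers/virtio/virtio.c
formats it) and that VIRTIO_F_ADMIN_VQ is bit 41 in
include/uapi/linux/virtio_config.h. The helper name set_bits is made up
for this illustration.

```python
# Assumed bit number of VIRTIO_F_ADMIN_VQ (include/uapi/linux/virtio_config.h).
VIRTIO_F_ADMIN_VQ = 41

# Feature strings exactly as reported above.
REPORTED = {
    "virtio0": "0110000000000000000000000000010000000000000000000000000000000000",
    "virtio1": "1110010110011001110000100000010000000000000000000000000000000000",
    "virtio2": "1110000000000000000000000000000000000000000000000000000000000000",
    "virtio3": "0000000000000000000000000000000000000000000000000000000000000000",
}


def set_bits(features):
    """Return indices of the '1' characters; the leftmost character is bit 0."""
    return [i for i, ch in enumerate(features.strip()) if ch == "1"]


for name, features in REPORTED.items():
    bits = set_bits(features)
    print(name, bits, "admin vq:", VIRTIO_F_ADMIN_VQ in bits)
```

Under those assumptions, none of the four devices has the admin
virtqueue feature bit set, which would match the "VIRTIO_F_ADMIN_VQ not
enabled" debug message seen on my setup.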
> 
> Catherine
> 
> On Tue, May 7, 2024 at 11:34 PM Jason Wang <jasowang@...hat.com 
> <mailto:jasowang@...hat.com>> wrote:
> 
>     On Sat, May 4, 2024 at 2:10 AM Joseph Salisbury
>     <joseph.salisbury@...onical.com
>     <mailto:joseph.salisbury@...onical.com>> wrote:
>      >
>      > Hi Feng,
>      >
>      > During testing, a kernel bug was identified with the suspend/resume
>      > functionality on instances running in a public cloud [0].  This
>     bug is a
>      > regression introduced in v6.8-rc1.  After a kernel bisect, the
>     following
>      > commit was identified as the cause of the regression:
>      >
>      >         fd27ef6b44be  ("virtio-pci: Introduce admin virtqueue")
> 
>     Have a quick glance at the patch it seems it should not damage the
>     freeze/restore as it should behave as in the past.
> 
>     But I found something interesting:
> 
>     1) assumes 1 admin vq which is not what spec said
>     2) special function for admin virtqueue during freeze/restore, but it
>     doesn't do anything special than del_vq()
>     3) lack real users but I guess e.g the destroy_avq() needs to be
>     synchronized with the one that is using admin virtqueue
> 
>      >
>      > I was hoping to get your feedback, since you are the patch author. Do
>      > you think gathering any additional data will help diagnose this
>     issue?
> 
>     Yes, please show us
> 
>     1) the kernel log here.
>     2) the features that the device has like
>     /sys/bus/virtio/devices/virtio0/features
> 
>      > This commit is depended upon by other virtio commits, so a revert
>     test
>      > is not really straight forward without reverting all the
>     dependencies.
>      > Any ideas you have would be greatly appreciated.
> 
>     Thanks
> 
>      >
>      >
>      > Thanks,
>      >
>      > Joe
>      >
>      > http://pad.lv/2063315 <http://pad.lv/2063315>
>      >
> 
View attachment "kern.log" of type "text/plain" (14287 bytes)
