Message-ID: <20220612164002.GA1242564@paulmck-ThinkPad-P17-Gen-1>
Date: Sun, 12 Jun 2022 09:40:02 -0700
From: "Paul E. McKenney" <paulmck@...nel.org>
To: "zhangfei.gao@...mail.com" <zhangfei.gao@...mail.com>
Cc: Zhangfei Gao <zhangfei.gao@...aro.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
rcu@...r.kernel.org, Lai Jiangshan <jiangshanlai@...il.com>,
Josh Triplett <josh@...htriplett.org>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Matthew Wilcox <willy@...radead.org>,
Shameerali Kolothum Thodi
<shameerali.kolothum.thodi@...wei.com>, mtosatti@...hat.com,
pbonzini@...hat.com, sheng.yang@...el.com
Subject: Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking
readers from consuming CPU) cause qemu boot slow
On Sun, Jun 12, 2022 at 09:20:29AM -0700, Paul E. McKenney wrote:
> On Sun, Jun 12, 2022 at 10:59:30PM +0800, zhangfei.gao@...mail.com wrote:
> > Hi, Paul
> >
> > > On 2022/6/12 9:36 PM, Paul E. McKenney wrote:
> > > On Sun, Jun 12, 2022 at 03:40:30PM +0800, zhangfei.gao@...mail.com wrote:
> > > > Hi, Paul
> > > >
> > > > > On 2022/6/12 12:59 AM, Paul E. McKenney wrote:
> > > > > On Sun, Jun 12, 2022 at 12:32:59AM +0800, Zhangfei Gao wrote:
> > > > > > Hi, Paul
> > > > > >
> > > > > > When verifying qemu with acpi rmr feature on v5.19-rc1, the guest kernel
> > > > > > stuck for several minutes.
> > > > > Stuck for several minutes but then continues normally? Or stuck for
> > > > > several minutes before you kill qemu?
> > > > The qemu boot is stuck for several minutes, then the guest boots up
> > > > normally, just slower.
> > > > > And I have to ask... What happened without the ACPI RMR feature?
> > > > Without ACPI, qemu boots quickly without getting stuck.
> > > > build/aarch64-softmmu/qemu-system-aarch64 -machine
> > > > virt,gic-version=3,iommu=smmuv3 \
> > > > -enable-kvm -cpu host -m 1024 \
> > > > -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
> > > > "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off"
> > > >
> > > > With acpi=force and -bios QEMU_EFI.fd added, qemu boot is stuck for several minutes.
> > > >
> > > > By the way, my hardware platform is aarch64.
> > > Thank you for the information! The problem is excessive delay rather
> > > than a hang, and it is configuration-dependent. Good to know!
> > >
> > > > Only this change solves the stuck issue.
> > > >
> > > > --- a/kernel/rcu/srcutree.c
> > > > +++ b/kernel/rcu/srcutree.c
> > > > @@ -524,6 +524,10 @@ static unsigned long srcu_get_delay(struct srcu_struct *ssp)
> > > >  {
> > > >  	unsigned long jbase = SRCU_INTERVAL;
> > > >  
> > > > +	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
> > > > +			 READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> > > > +		return 0;
> > > > +	return SRCU_INTERVAL;
> > > I am glad that you have a workaround for this problem, but this change
> > > would re-introduce the problem that commit 282d8998e997 ("srcu: Prevent
> > > expedited GPs and blocking readers from consuming CPU") was intended
> > > to fix. For one example, your change can prevent kernel live patching
> > > from applying a patch. So something else is needed.
> > Understood, this was just to debug where the issue is.
> > >
> > > Does changing the value of SRCU_MAX_INTERVAL to (say) 3 decrease the delay
> > > significantly? (This is not a fix, either, but instead a debug check.)
> > No, it makes no difference.
>
> OK, that indicates that you have a very large number of invocations
> of synchronize_srcu() or synchronize_srcu_expedited() instead of only
> a few that take a very long time each.
>
> > > Your change always returns zero if another SRCU grace period is needed.
> > > Let's look at the callers of srcu_get_delay():
> > >
> > > o cleanup_srcu_struct() uses it to check whether there is an
> > > expedited grace period pending, leaking the srcu_struct if so.
> > > This should not affect boot delay. (Unless you are invoking
> > > init_srcu_struct() and cleanup_srcu_struct() really really
> > > often.)
> > >
> > > o srcu_gp_end() uses it to determine whether or not to allow
> > > a one-jiffy delay before invoking callbacks at the end of
> > > a grace period.
> > >
> > > o srcu_funnel_gp_start() uses it to determine whether or not to
> > > allow a one-jiffy delay before starting the process of checking
> > > for the end of an SRCU grace period.
> > >
> > > o try_check_zero() uses it to add an additional short delay
> > > (instead of a long delay) between checks of reader state.
> > >
> > > o process_srcu() uses it to calculate the long delay between
> > > checks of reader state.
> > >
> > > These add one-jiffy delays, except for process_srcu(), which adds a delay
> > > of up to 10 jiffies. Even given HZ=100 (as opposed to the HZ=1000 that
> > > I normally use), this requires thousands of such delays to add up to the
> > > several minutes that you are seeing. (In theory, the delays could also
> > > be due to SRCU readers, except that in that case adjusting timeouts in
> > > the grace-period processing would not make things go faster.)
> > >
> > > So, does acpi=force & -bios QEMU_EFI.fd add SRCU grace periods? If so,
> > > it would be very good to make sure that this code is using SRCU efficiently.
> > > One way to check would be to put a printk() into synchronize_srcu(),
> > > though maintaining a counter and printing (say) every 1000th invocation
> > > might be easier on the console output.
> >
> > good idea.
>
> Glad you like it. ;-)
>
> > > > > > And on 5.18, there is no such problem.
> > > > > >
> > > > > > After reverting this patch, the issue is solved.
> > > > > > Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from
> > > > > > consuming CPU)
> > > > > >
> > > > > >
> > > > > > qemu cmd:
> > > > > > build/aarch64-softmmu/qemu-system-aarch64 -machine
> > > > > > virt,gic-version=3,iommu=smmuv3 \
> > > > > > -enable-kvm -cpu host -m 1024 \
> > > > > > -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
> > > > > > "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off acpi=force" \
> > > > > > -bios QEMU_EFI.fd
> > > > > >
> > > > > > log:
> > > > > > InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B 7AA4D040
> > > > > > add-symbol-file /home/linaro/work/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC48/AARCH64/NetworkPkg/IScsiDxe/IScsiDxe/DEBUG/IScsiDxe.dll
> > > > > > 0x75459000
> > > > > > Loading driver at 0x00075458000 EntryPoint=0x00075459058 IScsiDxe.efi
> > > > > > InstallProtocolInterface: BC62157E-3E33-4FEC-9920-2D3B36D750DF 7AA4DE98
> > > > > > ProtectUefiImageCommon - 0x7AA4D040
> > > > > > - 0x0000000075458000 - 0x000000000003F000
> > > > > > SetUefiImageMemoryAttributes - 0x0000000075458000 - 0x0000000000001000
> > > > > > (0x0000000000004008)
> > > > > > SetUefiImageMemoryAttributes - 0x0000000075459000 - 0x000000000003B000
> > > > > > (0x0000000000020008)
> > > > > > SetUefiImageMemoryAttributes - 0x0000000075494000 - 0x0000000000003000
> > > > > > (0x0000000000004008)
> > > > > > InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952C8
> > > > > > InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358
> > > > > > InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370
> > > > > > InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952F8
> > > > > > InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358
> > > > > > InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370
> > > > > > InstallProtocolInterface: 59324945-EC44-4C0D-B1CD-9DB139DF070C 75495348
> > > > > > InstallProtocolInterface: 09576E91-6D3F-11D2-8E39-00A0C969723B 754953E8
> > > > > > InstallProtocolInterface: 330D4706-F2A0-4E4F-A369-B66FA8D54385 7AA4D728
> > > > > >
> > > > > >
> > > > > > Not sure whether this has already been reported or solved.
> > > > > This is the first I have heard of it, so thank you for reporting it.
> > > > >
> > > > > Do you have a way of collecting something like sysrq-t output?
> > > > Do you mean "echo t > /proc/sysrq-trigger"?
> > > > There is too much output and the kernel dump does not stop.
> > > OK. What other tools do you have to work out what is happening during
> > > temporary hangs such as this one?
> > >
> > > The question to be answered: "Is there usually at least one task waiting
> > > in synchronize_srcu() during these hangs, and if so, which srcu_struct
> > > is passed to those synchronize_srcu() calls?"
> >
> > As you suggested, I added a print in __synchronize_srcu(), printing once every 1000 calls.
> >
> > With acpi=force & -bios QEMU_EFI.fd
> >
> > When qemu is stuck at
> > InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B 7AA4D040
> > add-symbol-file /home/linaro/work/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC48/AARCH64/NetworkPkg/IScsiDxe/IScsiDxe/DEBUG/IScsiDxe.dll
> > 0x75459000
> >
> > The print in __synchronize_srcu went from 0 to 9001.
>
> Now that is what I call a large number of calls!
>
> > [ 94.271350] gzf __synchronize_srcu loop=1001
> > ....
> >
> > [ 222.621659] __synchronize_srcu loop=9001
> > [ 222.621664] CPU: 96 PID: 2294 Comm: qemu-system-aar Not tainted
> > 5.19.0-rc1-15071-g697f40b5235f-dirty #615
> > [ 222.621666] Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDD, BIOS
> > 2280-V2 CS V5.B133.01 03/25/2021
> > [ 222.621667] Call trace:
> > [ 222.621668] dump_backtrace+0xe4/0xf0
> > [ 222.621670] show_stack+0x20/0x70
> > [ 222.621672] dump_stack_lvl+0x8c/0xb8
> > [ 222.621674] dump_stack+0x18/0x34
> > [ 222.621676] __synchronize_srcu+0x120/0x128
> > [ 222.621678] synchronize_srcu_expedited+0x2c/0x40
> > [ 222.621680] kvm_swap_active_memslots+0x130/0x198
> > [ 222.621683] kvm_activate_memslot+0x40/0x68
> > [ 222.621684] kvm_set_memslot+0x184/0x3b0
> > [ 222.621686] __kvm_set_memory_region+0x288/0x438
> > [ 222.621688] kvm_set_memory_region+0x3c/0x60
>
> This is KVM setting up one mapping in your IORT RMR memory, correct?
> (As in I/O remapping table reserved memory regions.)
>
> > [ 222.621689] kvm_vm_ioctl+0x5a0/0x13e0
>
> And this ioctl() is qemu telling KVM to do so, correct?
>
> > [ 222.621691] __arm64_sys_ioctl+0xb0/0xf8
> > [ 222.621693] invoke_syscall+0x4c/0x110
> > [ 222.621696] el0_svc_common.constprop.0+0x68/0x128
> > [ 222.621698] do_el0_svc+0x34/0xc0
> > [ 222.621701] el0_svc+0x30/0x98
> > [ 222.621704] el0t_64_sync_handler+0xb8/0xc0
> > [ 222.621706] el0t_64_sync+0x18c/0x190
> >
> >
> > Without acpi=force, there is no print at all (with the same print-once-per-1000-calls instrumentation).
>
> OK, this certainly explains the slowdown, both from adding the IORT RMR
> and from the commit that you bisected to. The answers to a few questions
> might help illuminate a path towards a fix:
>
> Do these reserved memory regions really need to be allocated separately?
> (For example, are they really all non-contiguous? If not, that is, if
> there are a lot of contiguous memory regions, could you sort the IORT
> by address and do one ioctl() for each set of contiguous memory regions?)
>
> Are all of these reserved memory regions set up before init is spawned?
>
> Are all of these reserved memory regions set up while there is only a
> single vCPU up and running?
>
> Is the SRCU grace period really needed in this case? (I freely confess
> to not being all that familiar with KVM.)
Oh, and there was a similar many-requests problem with networking many
years ago.  This was solved by adding a new syscall/ioctl()/whatever
that permitted many requests to be presented to the kernel with a single
system call.

Could a new ioctl() be introduced that requested a large number
of these memory regions in one go so as to make each call to
synchronize_srcu_expedited() cover a useful fraction of your 9000+
requests?  Adding a few of the KVM guys on CC for their thoughts.
Thanx, Paul