Message-ID: <20130529132144.GF5931@amt.cnet>
Date:	Wed, 29 May 2013 10:21:44 -0300
From:	Marcelo Tosatti <mtosatti@...hat.com>
To:	Xiao Guangrong <xiaoguangrong@...ux.vnet.ibm.com>
Cc:	gleb@...hat.com, avi.kivity@...il.com, pbonzini@...hat.com,
	linux-kernel@...r.kernel.org, kvm@...r.kernel.org
Subject: Re: [PATCH v7 04/11] KVM: MMU: zap pages in batch

On Wed, May 29, 2013 at 09:09:09PM +0800, Xiao Guangrong wrote:
> On 05/29/2013 07:11 PM, Marcelo Tosatti wrote:
> > On Tue, May 28, 2013 at 11:02:09PM +0800, Xiao Guangrong wrote:
> >> On 05/28/2013 08:18 AM, Marcelo Tosatti wrote:
> >>> On Mon, May 27, 2013 at 10:20:12AM +0800, Xiao Guangrong wrote:
> >>>> On 05/25/2013 04:34 AM, Marcelo Tosatti wrote:
> >>>>> On Thu, May 23, 2013 at 03:55:53AM +0800, Xiao Guangrong wrote:
> >>>>>> Zap at least 10 pages before releasing mmu-lock to reduce the overload
> >>>>>> caused by reacquiring the lock
> >>>>>>
> >>>>>> After the patch, kvm_zap_obsolete_pages can make forward progress anyway,
> >>>>>> so update the comments
> >>>>>>
> >>>>>> [ It improves kernel building 0.6% ~ 1% ]
> >>>>>
> >>>>> Can you please describe the overload in more detail? Under what scenario
> >>>>> is kernel building improved?
> >>>>
> >>>> Yes.
> >>>>
> >>>> The scenario is that we build a kernel while repeatedly reading the PCI ROM,
> >>>> once every second.
> >>>>
> >>>> [
> >>>>    echo 1 > /sys/bus/pci/devices/0000\:00\:03.0/rom
> >>>>    cat /sys/bus/pci/devices/0000\:00\:03.0/rom > /dev/null
> >>>> ]
> >>>
> >>> Can't see why it reflects real world scenario (or a real world
> >>> scenario with same characteristics regarding kvm_mmu_zap_all vs faults)?
> >>>
> >>> Point is, it would be good to understand why this change
> >>> is improving performance. What are the cases where kvm_mmu_zap_all
> >>> breaks out due to (need_resched || spin_needbreak) with zapped
> >>> < 10?
> >>
> >> When the guest reads the ROM, qemu sets up the memory to map the device's
> >> firmware, which is why kvm_mmu_zap_all can be called in this scenario.
> >>
> >> The reasons why it hurts performance are:
> >> 1): Qemu uses a global io-lock to sync all vcpus, so the io-lock is held
> >>     while we do kvm_mmu_zap_all(). If kvm_mmu_zap_all() is not efficient, all
> >>     other vcpus need to wait a long time to do I/O.
> >>
> >> 2): kvm_mmu_zap_all() is triggered in vcpu context, so it can block IPI
> >>     requests from other vcpus.
> >>
> >> Is it enough?
> > 
> > That is no problem. The problem is why you chose "10" as the minimum number of
> > pages to zap before considering reschedule. I would expect the need to
> 
> Well, my description above explained why batch-zapping is needed - we do
> not want the vcpu to spend a lot of time zapping all pages, because that
> hurts the other running vcpus.
> 
> But why the batch page number is "10"... I cannot answer this; I just guessed
> that '10' keeps the vcpu from spending a long time in zap_all_pages without
> starving other users of the mmu-lock. "10" is a speculative value and I am not
> sure it is the best value, but at least I think it can work.
> 
> > reschedule to be rare enough that one kvm_mmu_zap_all instance (between
> > schedule in and schedule out) is able to release no less than a
> > thousand pages.
> 
> Unfortunately, no.
> 
> This is what I replied to Gleb in his mail, where he raised the question of
> why "collapse tlb flush is needed":
> 
> ======
> It seems no.
> Since we have reloaded the mmu before zapping the obsolete pages, the mmu-lock
> is easily contended. I did a simple trace:
> 
> +       int num = 0;
>  restart:
>         list_for_each_entry_safe_reverse(sp, node,
>               &kvm->arch.active_mmu_pages, link) {
> @@ -4265,6 +4265,7 @@ restart:
>                 if (batch >= BATCH_ZAP_PAGES &&
>                       cond_resched_lock(&kvm->mmu_lock)) {
>                         batch = 0;
> +                       num++;
>                         goto restart;
>                 }
> 
> @@ -4277,6 +4278,7 @@ restart:
>          * may use the pages.
>          */
>         kvm_mmu_commit_zap_page(kvm, &invalid_list);
> +       printk("lock-break: %d.\n", num);
>  }
> 
> I read the PCI ROM while doing a kernel build in the guest, which
> has 1G memory and 4 vcpus with ept enabled. This is a normal
> workload and a normal configuration.
> 
> # dmesg
> [ 2338.759099] lock-break: 8.
> [ 2339.732442] lock-break: 5.
> [ 2340.904446] lock-break: 3.
> [ 2342.513514] lock-break: 3.
> [ 2343.452229] lock-break: 3.
> [ 2344.981599] lock-break: 4.
> 
> Basically, we need to break many times.
> ======
> 
> You can see we need to break 3 times to zap all pages even though we zap
> 10 pages per batch. Obviously it would need to break more times without
> batch-zapping.

Yes, but this is not a real scenario, nor does it even describe a real
scenario, as far as I know.

Are you sure this minimum-batching-before-considering-reschedule is needed
even after the obsolete pages optimization?

I fail to see why.
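
For readers following the thread, the effect under discussion can be
sketched with a small user-space model. This is not the kernel code:
simulate_zap, struct zap_stats and the contend_every parameter are
hypothetical names, and the model assumes that a waiter asks for the lock
after every contend_every zapped pages and keeps waiting until the lock is
dropped. It only counts how often the lock would be dropped for a given
minimum batch size; it ignores the list restart and the cost of each break.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model of batched zapping; not kernel code. */
struct zap_stats {
	size_t zapped;       /* pages zapped in total */
	size_t lock_breaks;  /* times the simulated lock was dropped */
};

/*
 * Zap total_pages pages. A simulated waiter requests the lock after every
 * contend_every zapped pages (0 means the lock is never contended) and the
 * request stays pending until the lock is dropped. The lock is dropped only
 * when a request is pending and at least min_batch pages have been zapped
 * since the previous drop, mirroring the "batch >= BATCH_ZAP_PAGES" check
 * in the quoted diff.
 */
struct zap_stats simulate_zap(size_t total_pages, size_t min_batch,
			      size_t contend_every)
{
	struct zap_stats st = { 0, 0 };
	size_t batch = 0;
	int pending = 0;

	while (st.zapped < total_pages) {
		if (batch >= min_batch && pending) {
			/* Drop the lock, let the waiter run, start a new batch. */
			pending = 0;
			batch = 0;
			st.lock_breaks++;
			continue;
		}

		/* Zap one page while holding the lock. */
		st.zapped++;
		batch++;

		if (contend_every != 0 && st.zapped % contend_every == 0)
			pending = 1;
	}

	assert(st.zapped == total_pages);
	return st;
}
```

Under these assumptions, simulate_zap(100, 10, 1) drops the lock 9 times,
while simulate_zap(100, 1, 1) drops it 99 times. How often the lock is
really contended in practice is exactly the open question in this thread.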
