Date:	Mon, 20 Feb 2012 13:38:28 +0530
From:	Nikunj A Dadhania <nikunj@...ux.vnet.ibm.com>
To:	Ingo Molnar <mingo@...e.hu>, Avi Kivity <avi@...hat.com>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	Rik van Riel <riel@...hat.com>, linux-kernel@...r.kernel.org,
	vatsa@...ux.vnet.ibm.com, bharata@...ux.vnet.ibm.com
Subject: Re: [RFC PATCH 0/4] Gang scheduling in CFS

On Thu, 5 Jan 2012 10:10:59 +0100, Ingo Molnar <mingo@...e.hu> wrote:
> 
> * Avi Kivity <avi@...hat.com> wrote:
> 
> > > So why wait for non-running vcpus at all? That is, why not 
> > > paravirt the TLB flush such that the invalidate marks the 
> > > non-running VCPU's state so that on resume it will first 
> > > flush its TLBs. That way you don't have to wake it up and 
> > > wait for it to invalidate its TLBs.
> > 
> > That's what Xen does, but it's tricky.  For example 
> > get_user_pages_fast() depends on the IPI to hold off page 
> > freeing, if we paravirt it we have to take that into 
> > consideration.
> > 
> > > Or am I like totally missing the point (I am after all 
> > > reading the thread backwards and I haven't yet fully paged 
> > > the kernel stuff back into my brain).
> > 
> > You aren't, and I bet those kernel pages are unswappable 
> > anyway.
> > 
> > > I guess tagging remote VCPU state like that might be 
> > > somewhat tricky.. but it seems worth considering, the whole 
> > > wake and wait for flush thing seems daft.
> > 
> > It's nasty, but then so is paravirt.  It's hard to get right, 
> > and it has a tendency to cause performance regressions as 
> > hardware improves.
> 
> Here it would massively improve performance - without regressing 
> the scheduler code massively.
> 
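(For reference, the deferred-flush idea discussed above would amount to
something like the sketch below. This is a toy illustration only, not
Xen's actual implementation; all names are made up, and the races, plus
gup_fast's reliance on the IPI, are exactly the tricky part Avi points
out.)

#include <linux/atomic.h>
#include <linux/types.h>
#include <asm/tlbflush.h>

struct vcpu_flush_state {
	atomic_t flush_pending;	/* set by the flusher, consumed on resume */
};

/*
 * Called by the flushing cpu for each target vcpu.  Returns true if the
 * caller should still send the usual flush IPI, false if the flush was
 * deferred because the vcpu is not currently running.
 */
static bool defer_or_ipi(struct vcpu_flush_state *v, bool vcpu_running)
{
	if (vcpu_running)
		return true;
	atomic_set(&v->flush_pending, 1);
	return false;
}

/* Run just before the vcpu executes guest code again. */
static void vcpu_resume_flush(struct vcpu_flush_state *v)
{
	if (atomic_xchg(&v->flush_pending, 0))
		local_flush_tlb();
}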
I tried an experiment with flush_tlb_others_ipi(). It depends on Raghu's
"kvm : Paravirt-spinlock support for KVM guests" series
(https://lkml.org/lkml/2012/1/14/66), which adds a new hypercall for
kicking another vcpu out of halt. The idea: instead of busy-waiting for
remote vcpus (which may be descheduled) to acknowledge the flush, the
sender spins for a bounded number of iterations and then halts; the last
remote cpu to clear its bit in flush_cpumask kicks the sender out of halt
via that hypercall.

  Here are the results on non-PLE hardware, running the ebizzy
  workload inside the VMs. The table shows the ebizzy score in
  records/sec.

  8-CPU Intel Xeon, HT disabled, 64-bit VM (8 vcpus, 1G RAM)

  +--------+------------+------------+-------------+
  |        |  baseline  |   gang     |   pv_flush  |
  +--------+------------+------------+-------------+
  |   2VM  |   3979.50  |   8818.00  |   11002.50  |
  |   4VM  |   1817.50  |   6236.50  |    6196.75  |
  |   8VM  |    922.12  |   4043.00  |    4001.38  |
  +--------+------------+------------+-------------+

I will be posting the results for PLE hardware as well.

Here is the patch; it still needs to be hooked up with pv_mmu_ops. So, for now:

Not-yet-Signed-off-by: Nikunj A Dadhania <nikunj@...ux.vnet.ibm.com>

Index: linux-tip-f4ab688-pv/arch/x86/mm/tlb.c
===================================================================
--- linux-tip-f4ab688-pv.orig/arch/x86/mm/tlb.c	2012-02-14 18:26:21.000000000 +0800
+++ linux-tip-f4ab688-pv/arch/x86/mm/tlb.c	2012-02-20 15:23:10.242576314 +0800
@@ -43,6 +43,7 @@ union smp_flush_state {
 		struct mm_struct *flush_mm;
 		unsigned long flush_va;
 		raw_spinlock_t tlbstate_lock;
+		int sender_cpu;
 		DECLARE_BITMAP(flush_cpumask, NR_CPUS);
 	};
 	char pad[INTERNODE_CACHE_BYTES];
@@ -116,6 +117,9 @@ EXPORT_SYMBOL_GPL(leave_mm);
  *
  * Interrupts are disabled.
  */
+#ifdef CONFIG_PARAVIRT_FLUSH_TLB
+extern void kvm_kick_cpu(int cpu);
+#endif
 
 /*
  * FIXME: use of asmlinkage is not consistent.  On x86_64 it's noop
@@ -166,6 +170,10 @@ out:
 	smp_mb__before_clear_bit();
 	cpumask_clear_cpu(cpu, to_cpumask(f->flush_cpumask));
 	smp_mb__after_clear_bit();
+#ifdef CONFIG_PARAVIRT_FLUSH_TLB
+	if (cpumask_empty(to_cpumask(f->flush_cpumask)))
+		kvm_kick_cpu(f->sender_cpu);
+#endif
 	inc_irq_stat(irq_tlb_count);
 }
 
@@ -184,7 +192,10 @@ static void flush_tlb_others_ipi(const s
 
 	f->flush_mm = mm;
 	f->flush_va = va;
+	f->sender_cpu = smp_processor_id();
 	if (cpumask_andnot(to_cpumask(f->flush_cpumask), cpumask, cpumask_of(smp_processor_id()))) {
+		int loop = 1024;
+
 		/*
 		 * We have to send the IPI only to
 		 * CPUs affected.
@@ -192,8 +203,15 @@ static void flush_tlb_others_ipi(const s
 		apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
 			      INVALIDATE_TLB_VECTOR_START + sender);
 
+#ifdef CONFIG_PARAVIRT_FLUSH_TLB
+		while (!cpumask_empty(to_cpumask(f->flush_cpumask)) && --loop)
+			cpu_relax();
+		if (!loop && !cpumask_empty(to_cpumask(f->flush_cpumask)))
+			halt();
+#else
 		while (!cpumask_empty(to_cpumask(f->flush_cpumask)))
 			cpu_relax();
+#endif
 	}
 
 	f->flush_mm = NULL;
Index: linux-tip-f4ab688-pv/arch/x86/kernel/kvm.c
===================================================================
--- linux-tip-f4ab688-pv.orig/arch/x86/kernel/kvm.c	2012-02-14 18:26:55.000000000 +0800
+++ linux-tip-f4ab688-pv/arch/x86/kernel/kvm.c	2012-02-14 18:26:55.178450933 +0800
@@ -653,16 +653,17 @@ out:
 PV_CALLEE_SAVE_REGS_THUNK(kvm_lock_spinning);
 
 /* Kick a cpu by its apicid*/
-static inline void kvm_kick_cpu(int apicid)
+void kvm_kick_cpu(int cpu)
 {
+	int apicid = per_cpu(x86_cpu_to_apicid, cpu);
 	kvm_hypercall1(KVM_HC_KICK_CPU, apicid);
 }
+EXPORT_SYMBOL_GPL(kvm_kick_cpu);
 
 /* Kick vcpu waiting on @lock->head to reach value @ticket */
 static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
 {
 	int cpu;
-	int apicid;
 
 	add_stats(RELEASED_SLOW, 1);
 
@@ -671,8 +672,7 @@ static void kvm_unlock_kick(struct arch_
 		if (ACCESS_ONCE(w->lock) == lock &&
 		    ACCESS_ONCE(w->want) == ticket) {
 			add_stats(RELEASED_SLOW_KICKED, 1);
-			apicid = per_cpu(x86_cpu_to_apicid, cpu);
-			kvm_kick_cpu(apicid);
+			kvm_kick_cpu(cpu);
 			break;
 		}
 	}
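
As noted above, the pv_mmu_ops hookup is still missing. For illustration
only, it might end up looking roughly like the sketch below; the
kvm_flush_tlb_others() and kvm_setup_pv_tlb_flush() names are
placeholders, the kvm_para_available() gate is just an assumption, and
none of this is part of the patch:

#include <linux/init.h>
#include <linux/cpumask.h>
#include <linux/mm_types.h>
#include <asm/paravirt.h>
#include <asm/tlbflush.h>
#include <asm/kvm_para.h>

static void kvm_flush_tlb_others(const struct cpumask *cpumask,
				 struct mm_struct *mm, unsigned long va)
{
	/*
	 * Would carry the bounded-spin plus halt() wait from the patch,
	 * relying on the last responder to issue KVM_HC_KICK_CPU; shown
	 * here as a plain delegation to the native path.
	 */
	native_flush_tlb_others(cpumask, mm, va);
}

static void __init kvm_setup_pv_tlb_flush(void)
{
	if (kvm_para_available())
		pv_mmu_ops.flush_tlb_others = kvm_flush_tlb_others;
}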
