linux-kernel - Re: [PATCH v4 3/7] x86/flush_tlb: try flush_tlb_single one by one in flush_tlb

Date:	Thu, 10 May 2012 09:53:25 +0200
From:	Borislav Petkov <bp@...64.org>
To:	Alex Shi <alex.shi@...el.com>
Cc:	rob@...dley.net, tglx@...utronix.de, mingo@...hat.com,
	hpa@...or.com, arnd@...db.de, rostedt@...dmis.org,
	fweisbec@...il.com, jeremy@...p.org, gregkh@...uxfoundation.org,
	riel@...hat.com, luto@....edu, avi@...hat.com, len.brown@...el.com,
	dhowells@...hat.com, fenghua.yu@...el.com, ak@...ux.intel.com,
	cpw@....com, steiner@....com, akpm@...ux-foundation.org,
	penberg@...nel.org, hughd@...gle.com, rientjes@...gle.com,
	kosaki.motohiro@...fujitsu.com, n-horiguchi@...jp.nec.com,
	paul.gortmaker@...driver.com, trenn@...e.de, tj@...nel.org,
	oleg@...hat.com, axboe@...nel.dk, a.p.zijlstra@...llo.nl,
	kamezawa.hiroyu@...fujitsu.com, viro@...iv.linux.org.uk,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4 3/7] x86/flush_tlb: try flush_tlb_single one by one in
 flush_tlb_range

On Thu, May 10, 2012 at 01:00:09PM +0800, Alex Shi wrote:
> x86 has no instruction-level support for flush_tlb_range. Currently,
> flush_tlb_range is implemented by flushing the whole TLB. That is not
> the best solution for all scenarios. In fact, if we use 'invlpg' to
> flush only a few entries from the TLB, later accesses can still hit the
> remaining TLB entries, which gains performance.
> 
> But the 'invlpg' instruction is expensive. Its execution time is
> comparable to a cr3 rewrite, and even a bit higher on SNB CPUs.
> 
> So, on a CPU with 512 4KB TLB entries, the balance point is at:
> 	(512 - X) * 100ns(assumed TLB refill cost) =
> 		X(TLB flush entries) * 100ns(assumed invlpg cost)
> 
> Here, X is 256, that is 1/2 of 512 entries.
> 
> But with the mysterious CPU pre-fetcher and page miss handler unit, the
> assumed TLB refill cost is far lower than 100ns for sequential access. And
> 2 HT siblings in one core make memory access even faster if they are
> accessing the same memory. So, in the patch, I only make the change when
> the number of target entries is less than 1/16 of all active TLB entries.
> Actually, I have no data to support the fraction '1/16', so any
> suggestions are welcome.
> 
> As to hugetlb, I guess that due to the smaller page table and fewer active
> TLB entries, I didn't see any benefit in my benchmark, so no optimization
> for now.
> 
> My macro benchmark shows that in ideal scenarios, read performance
> improves by 70 percent. And in the worst scenario, read/write
> performance is similar to the unpatched 3.4-rc4 kernel.
> 
> Here is the reading data on my 2P * 4cores *HT NHM EP machine, with THP
> 'always':
> 
> multi-thread testing, the '-t' parameter is the thread number:
>                           with patch   unpatched 3.4-rc4
> ./mprotect -t 1           14ns         24ns
> ./mprotect -t 2           13ns         22ns
> ./mprotect -t 4           12ns         19ns
> ./mprotect -t 8           14ns         16ns
> ./mprotect -t 16          28ns         26ns
> ./mprotect -t 32          54ns         51ns
> ./mprotect -t 128         200ns        199ns
> 
> Single process with sequential flushing and memory accessing:
> 
>                                      with patch   unpatched 3.4-rc4
> ./mprotect                           7ns          11ns
> ./mprotect -p 4096  -l 8 -n 10240    21ns         21ns
> 
> I also tried other benchmarks on Intel core2/NHM/SNB EP and NHM EX machines.
> No clear performance change on specjbb2005 with openjdk, and oltp reading.
> 
> Signed-off-by: Alex Shi <alex.shi@...el.com>
> ---

[ … ]

> +#define FLUSHALL_BAR	16
> +
> +void flush_tlb_range(struct vm_area_struct *vma,
> +				   unsigned long start, unsigned long end)
> +{
> +	struct mm_struct *mm;
> +
> +	if (!cpu_has_invlpg || vma->vm_flags & VM_HUGETLB) {
> +		flush_tlb_mm(vma->vm_mm);
> +		return;
> +	}
> +
> +	preempt_disable();
> +	mm = vma->vm_mm;
> +	if (current->active_mm == mm) {
> +		if (current->mm) {
> +			unsigned long addr, vmflag = vma->vm_flags;
> +			unsigned act_entries, tlb_entries = 0;
> +
> +			if (vmflag & VM_EXEC)
> +				tlb_entries = tlb_lli_4k[ENTRIES];
> +			else
> +				tlb_entries = tlb_lld_4k[ENTRIES];
> +
> +			act_entries = tlb_entries > mm->total_vm ?
> +					mm->total_vm : tlb_entries;

Ok, question:

we're comparing TLB size with the amount of pages mapped by this mm
struct. AFAICT, that doesn't mean that all those mapped pages do have
respective entries in the TLB, does it?

If so, then the actual entries number is kinda inaccurate, no? We don't
really know how many TLB entries actually belong to this mm struct. Or am I
missing something?
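
To make the question concrete, here is a minimal userspace sketch of the
clamp above. It is not kernel code, and active_entries_bound() is a made-up
name; it assumes total_vm is counted in pages. As far as I can tell, the
result is only an upper bound on how many entries this mm could hold in the
TLB, not a count of the entries it actually holds.

```c
#include <assert.h>

/*
 * Userspace sketch, not kernel code. active_entries_bound() is a
 * hypothetical helper mirroring the act_entries computation in the
 * quoted patch: min(tlb_entries, mm->total_vm).
 */
static unsigned long active_entries_bound(unsigned long tlb_entries,
					  unsigned long total_vm_pages)
{
	unsigned long bound;

	/* Same clamp as in the patch. */
	bound = tlb_entries > total_vm_pages ? total_vm_pages : tlb_entries;

	/*
	 * The value can never exceed either input, but nothing here says
	 * how many of those pages are really cached in the TLB right now.
	 */
	assert(bound <= tlb_entries && bound <= total_vm_pages);
	return bound;
}
```

So an mm with 100000 mapped pages on a 512-entry TLB yields 512, even if
only a handful of its translations are currently resident.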

> +			if ((end - start)/PAGE_SIZE > act_entries/FLUSHALL_BAR)

Oh, in a later patch you do this:

+                       if ((end - start) >> PAGE_SHIFT >
+                                       act_entries >> tlb_flushall_factor)

and tlb_flushall_factor is 5 or 6, whereas the division by 16
(FLUSHALL_BAR) was a >> 4. So, are we to assume that the threshold is
not 1/16 but actually 1/32 or even 1/64 of the active TLB entries,
above which a full TLB flush makes sense, and one-by-one flushing below?
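
For comparison, a small userspace sketch of how the shift changes the
cut-off. The helper names are made up and this is not the kernel code; it
only reproduces the "range in pages > act_entries >> factor" test from the
two hunks above.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Userspace sketch with hypothetical names. Returns the number of pages
 * above which the quoted check falls back to a full TLB flush.
 */
static unsigned long flushall_threshold(unsigned long act_entries,
					unsigned int shift)
{
	/* Guard against an undefined shift width. */
	assert(shift < sizeof(unsigned long) * CHAR_BIT);
	return act_entries >> shift;
}

/* true: flush the whole TLB; false: invalidate page by page. */
static bool wants_full_flush(unsigned long range_pages,
			     unsigned long act_entries,
			     unsigned int shift)
{
	return range_pages > flushall_threshold(act_entries, shift);
}
```

With 512 active entries, a shift of 4 gives a threshold of 32 pages, 5
gives 16 and 6 gives 8, so a larger factor switches to the full flush
sooner, not later.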

> +				local_flush_tlb();
> +			else {
> +				for (addr = start; addr <= end;
> +						addr += PAGE_SIZE)
> +					__flush_tlb_single(addr);
> +
> +				if (cpumask_any_but(mm_cpumask(mm),
> +					smp_processor_id()) < nr_cpu_ids)
> +					flush_tlb_others(mm_cpumask(mm), mm,
> +								start, end);
> +				preempt_enable();
> +				return;
> +			}
> +		} else {
> +			leave_mm(smp_processor_id());
> +		}
> +	}
> +	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
> +		flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL);
>  	preempt_enable();

Thanks.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551