linux-kernel - Re: [PATCH 0/4] Fix ebizzy performance regression due to X86 TLB range flush v2

Message-ID: <20131217095523.GX11295@suse.de>
Date:	Tue, 17 Dec 2013 09:55:23 +0000
From:	Mel Gorman <mgorman@...e.de>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	"H. Peter Anvin" <hpa@...or.com>, Alex Shi <alex.shi@...aro.org>,
	Ingo Molnar <mingo@...nel.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Fengguang Wu <fengguang.wu@...el.com>,
	Linux-X86 <x86@...nel.org>, Linux-MM <linux-mm@...ck.org>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 0/4] Fix ebizzy performance regression due to X86 TLB
 range flush v2

On Mon, Dec 16, 2013 at 09:17:35AM -0800, Linus Torvalds wrote:
> On Mon, Dec 16, 2013 at 2:39 AM, Mel Gorman <mgorman@...e.de> wrote:
> >
> > First was Alex's microbenchmark from https://lkml.org/lkml/2012/5/17/59
> > and ran it for a range of thread numbers, 320 iterations per thread with
> > random number of entries to flush. Results are from two machines
> 
> There's something wrong with that benchmark, it sometimes gets stuck,

It's not a thread-safe benchmark. The parent unmapping thread can finish
before the children start, and it then loops forever.

> and the profile numbers are just random (and mostly in user space).
> 

Yep, that's why, when I used it, I ran a large number of iterations with a
semi-randomised number of entries, trying to knock some sense out of it.
I was hoping that the Intel folk might come back with more details on
what their testing methodology was.

> I think you mentioned fixing a bug in it, mind pointing at the fixed benchmark?
> 

Ugh, I'm embarrassed by this. I did not properly fix the benchmark, just
bodged around the part that can lock up. Patch is below. Actual testing was
run using mmtests with the configs/config-global-dhp__tlbflush-performance
configuration file using something like this:

# build boot kernel 1
./run-mmtests.sh --run-monitor --config configs/config-global-dhp__tlbflush-performance test-kernel-1
# build boot kernel 2
./run-mmtests.sh --run-monitor --config configs/config-global-dhp__tlbflush-performance test-kernel-2
cd work/log
../../compare-kernels.sh

> Looking at the kernel footprint, it seems to depend on what parameters
> you ran that benchmark with. Under certain loads, it seems to spend
> most of the time in clearing pages and in the page allocation ("-t 8
> -n 320"). And in other loads, it hits smp_call_function_many() and the
> TLB flushers ("-t 8 -n 8"). So exactly what parameters did you use?
> 

A range of parameters. The test effectively does this

TLBFLUSH_MAX_ENTRIES=256
# Pseudocode: loop over the thread counts under test, setting NR_THREADS
for_each_thread_count
	for ITERATION in `seq 1 320`; do
		# Select a range of entries to randomly select from. This is to ensure
		# an evenish spread of entries to be tested
		NR_SECTION=$((ITERATION%8))
		RANGE=$((TLBFLUSH_MAX_ENTRIES/8))
		THIS_MIN_ENTRIES=$((RANGE*NR_SECTION+1))
		THIS_MAX_ENTRIES=$((THIS_MIN_ENTRIES+RANGE))

		NR_ENTRIES=$((THIS_MIN_ENTRIES+(RANDOM%RANGE)))
		# Defensive clamp; the modulo already keeps NR_ENTRIES below
		# THIS_MAX_ENTRIES
		if [ $NR_ENTRIES -gt $THIS_MAX_ENTRIES ]; then
			NR_ENTRIES=$THIS_MAX_ENTRIES
		fi

		RESULT=`tlbflush -n $NR_ENTRIES -t $NR_THREADS 2>&1`
	done
done

It splits the values for nr_entries (-n switch) into 8 segments and randomly
selects values within them. This results in noise but ensures the test hits
the best, average and worst cases for TLB range flushing. Writing this,
I realise I should have made TLBFLUSH_MAX_ENTRIES 512 to hit the original
shift values. The original mail indicated that this test was run once for a
very limited number of threads and entries, and I really hope this is not
what actually happened to tune that shift value.
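
For illustration, the segment arithmetic can be checked on its own. This is
a minimal sketch, not the mmtests code: segment_bounds and pick_entries are
made-up helper names, and NR_SECTIONS simply names the constant 8 used above.

```shell
#!/bin/bash
# Illustration only: segment_bounds and pick_entries are hypothetical
# helpers, not taken from mmtests.
TLBFLUSH_MAX_ENTRIES=256
NR_SECTIONS=8

# Print "min max" for the segment selected by an iteration number,
# using the same arithmetic as the test loop.
segment_bounds() {
	sb_section=$(($1 % NR_SECTIONS))
	sb_range=$((TLBFLUSH_MAX_ENTRIES / NR_SECTIONS))
	sb_min=$((sb_range * sb_section + 1))
	echo "$sb_min $((sb_min + sb_range))"
}

# Pick a random -n value inside that segment.
pick_entries() {
	set -- $(segment_bounds "$1")
	pe_range=$((TLBFLUSH_MAX_ENTRIES / NR_SECTIONS))
	echo $(($1 + RANDOM % pe_range))
}

for iteration in 1 2 8; do
	echo "iteration $iteration: bounds $(segment_bounds "$iteration"), -n $(pick_entries "$iteration")"
done
```

With 256 entries and 8 sections each segment is 32 wide, so the values
picked fall in 1..32, 33..64 and so on up to 225..256. The upper bound
printed is one past the largest value that can be picked.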

> Because we've had things that change those two things (and they are
> totally independent).
> 

Indeed, and tuning on specifics would be a bad idea -- which is why my
testing took a randomised selection of ranges to test with and a large
number of iterations.

> And does anything stand out in the profiles of ebizzy? For example, in
> between 3.4.x and 3.11, we've converted the anon_vma locking from a
> mutex to a rwsem, and we know that caused several issues, possibly
> causing unfairness. There are other potential sources of unfairness.
> It would be good to perhaps bisect things at least *somewhat*, because
> *so* much has changed in 3.4 to 3.11 that it's impossible to guess.
> 

I'll check. Right now, the machines are still occupied running bisections
which are still finding bugs. Once they have found the obvious stuff, I'll use
profiles to identify what's left. FWIW, I would be surprised if ebizzy was
affected by the anon_vma locking. I do not think the threads are operating
within the same VMAs in a manner that would contend on those locks. If there
is a lock being contended, it's going to be on mmap_sem for creating mappings
just slightly larger than MMAP_THRESHOLD.  Guessing though, not proven.

This is the bodge that stops Alex's benchmark locking up. It's the wrong way
to fix a problem like this. I was not even convinced this benchmark was useful
to begin with and was unmotivated to spend time on fixing it up properly.

--- tlbflush.c.orig	2013-12-15 11:05:08.813821030 +0000
+++ tlbflush.c	2013-12-15 11:04:46.504926426 +0000
@@ -67,13 +67,17 @@
 	char x;
 	int i, k;
 	int randn[PAGE_SIZE];
+	int count = 0;
 	
 	for (i=0;i<PAGE_SIZE; i++)
 		randn[i] = rand();
 
 	actimes = malloc(sizeof(long));
 
-	while (*threadstart == 0 )
+	while (*threadstart == 0) {
+		if (++count > 1000000)
+			break;
 		usleep(1);
+	}
 
 	if (d->rw == 0)
@@ -180,6 +181,7 @@
 	threadstart = malloc(sizeof(int));
 	*threadstart = 0;
 	data.readp = &p; data.startaddr = startaddr; data.rw = rw; data.loop = l;
+	sleep(1);
 	for (i=0; i< t; i++)
 		if(pthread_create(&pid[i], NULL, accessmm, &data))
 			perror("pthread create");
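
The patch bounds the wait inside the benchmark itself. A test harness could
also bound each invocation from the outside so that a stuck run is reported
rather than stalling everything. The following is only a sketch under that
assumption: run_guarded is a made-up name, the 60 second limit is arbitrary,
and it relies on GNU coreutils timeout, which exits with status 124 when the
limit is hit.

```shell
#!/bin/sh
# Hypothetical wrapper, not part of mmtests or the original benchmark.
# Runs a command under GNU coreutils `timeout` and reports a hang
# instead of blocking the whole test run.
run_guarded() {
	limit=$1
	shift
	timeout "$limit" "$@"
	status=$?
	if [ "$status" -eq 124 ]; then
		echo "timed out after ${limit}s: $*" >&2
	fi
	return "$status"
}

# Example (arbitrary limit): at most 60 seconds per tlbflush invocation.
# run_guarded 60 ./tlbflush -n 64 -t 8
```

This does not fix the race either; it only keeps a hung iteration from
blocking the rest of the run.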

