linux-kernel - Re: [RFC PATCH 0/3] restartable sequences benchmarks

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:	Thu, 22 Oct 2015 12:11:42 -0700
From:	Andy Lutomirski <luto@...capital.net>
To:	Dave Watson <davejwatson@...com>
Cc:	kernel-team@...com,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Linux API <linux-api@...r.kernel.org>,
	Paul Turner <pjt@...gle.com>,
	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Subject: Re: [RFC PATCH 0/3] restartable sequences benchmarks

On Thu, Oct 22, 2015 at 11:06 AM, Dave Watson <davejwatson@...com> wrote:
> We've been testing out restartable sequences + malloc changes for use
> at Facebook.  Below are some test results, as well as some possible
> changes based on Paul Turner's original patches

Thanks!  I'll stare at this some time between now and Kernel Summit.

>
> https://lkml.org/lkml/2015/6/24/665
>
> I ran one service with several permutations of various mallocs.  The
> service is CPU-bound, and hits the allocator quite hard.  Requests/s
> are held constant at the source, so we use cpu idle time and latency
> as an indicator of service quality. These are average numbers over
> several hours.  Machines were dual E5-2660, total 16 cores +
> hyperthreading.  This service has ~400 total threads, 70-90 of which
> are doing work at any particular time.
>
>                                    RSS CPUIDLE LATENCYMS
> jemalloc 4.0.0                     31G   33%     390
> jemalloc + this patch              25G   33%     390
> jemalloc + this patch using lsl    25G   30%     420
> jemalloc + PT's rseq patch         25G   32%     405
> glibc malloc 2.20                  27G   30%     420
> tcmalloc gperftools trunk (2.2)    21G   30%     480

Slightly confused.  This is showing a space efficiency improvement but
not a performance improvement?  Is the idea that percpu free lists are
more space efficient than per-thread free lists?

>
> jemalloc rseq patch used for testing:
> https://github.com/djwatson/jemalloc
>
> lsl test - using lsl segment limit to get cpu (i.e. inlined vdso
> getcpu on x86) instead of using the thread caching as in this patch.
> There has been some suggestions to add the thread-cached getcpu()
> feature separately.  It does seem to move the needle in a real service
> by about ~3% to have a thread-cached getcpu vs. not.  I don't think we
> can use restartable sequences in production without a faster getcpu.

If nothing else, I'd like to replace the thread-cached getcpu thing
with percpu gsbase, at least on x86.  That doesn't necessarily have to
be exclusive with restartable sequences.

>
> GS-segment / migration only tests
>
> There's been some interest in seeing if we can do this with only gs
> segment, here's some numbers for those.  This doesn't have to be gs,
> it could just be a migration signal sent to userspace as well, the
> same approaches would apply.
>
> GS patch: https://lkml.org/lkml/2014/9/13/59
>
>                                    RSS CPUIDLE LATENCYMS
> jemalloc 4.0.0                     31G   33%     390
> jemalloc + percpu locking          25G   25%     420
> jemalloc + preempt lock / signal   25G   32%     415

Neat!

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/