Date:   Thu, 22 Jun 2017 10:47:32 +0200
From:   Ingo Molnar <mingo@...nel.org>
To:     Paolo Abeni <pabeni@...hat.com>
Cc:     x86@...nel.org, Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>,
        "H. Peter Anvin" <hpa@...or.com>,
        Al Viro <viro@...iv.linux.org.uk>,
        Kees Cook <keescook@...omium.org>,
        Hannes Frederic Sowa <hannes@...essinduktion.org>,
        linux-kernel@...r.kernel.org,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Andy Lutomirski <luto@...nel.org>
Subject: Re: [PATCH] x86/uaccess: use unrolled string copy for short strings


* Paolo Abeni <pabeni@...hat.com> wrote:

> The 'rep' prefix carries a significant setup cost; as a result, for
> short strings, copies using unrolled loops are faster than even an
> optimized string copy using the 'rep' variant.
> 
> This change updates __copy_user_generic() to use the unrolled
> version for small lengths. The threshold for a short string - 64
> bytes - was selected empirically as the largest value that still
> ensures a measurable gain.
> 
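[For context, a minimal user-space analogue of an "unrolled" copy is sketched
below; it only illustrates the technique. The kernel's
copy_user_generic_unrolled() is hand-written assembly with user-access
exception handling, while the 'rep' variants let the CPU's microcoded string
move do the whole loop, which is fast for large copies but pays a fixed
startup cost.]

#include <stddef.h>
#include <string.h>

/*
 * Copy four machine words per iteration instead of relying on a single
 * 'rep movs' instruction; memcpy() handles the sub-word tail. Assumes
 * both pointers are suitably aligned for word-sized accesses.
 */
void copy_unrolled(void *dst, const void *src, size_t len)
{
	unsigned long *d = dst;
	const unsigned long *s = src;

	while (len >= 4 * sizeof(unsigned long)) {
		d[0] = s[0];
		d[1] = s[1];
		d[2] = s[2];
		d[3] = s[3];
		d += 4;
		s += 4;
		len -= 4 * sizeof(unsigned long);
	}
	memcpy(d, s, len);	/* tail */
}
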
> A micro-benchmark of __copy_from_user() with different lengths shows
> the following:
> 
> string len	vanilla		patched		delta
> bytes		ticks		ticks		ticks(%)
> 
> 0		58		26		32(55%)
> 1		49		29		20(40%)
> 2		49		31		18(36%)
> 3		49		32		17(34%)
> 4		50		34		16(32%)
> 5		49		35		14(28%)
> 6		49		36		13(26%)
> 7		49		38		11(22%)
> 8		50		31		19(38%)
> 9		51		33		18(35%)
> 10		52		36		16(30%)
> 11		52		37		15(28%)
> 12		52		38		14(26%)
> 13		52		40		12(23%)
> 14		52		41		11(21%)
> 15		52		42		10(19%)
> 16		51		34		17(33%)
> 17		51		35		16(31%)
> 18		52		37		15(28%)
> 19		51		38		13(25%)
> 20		52		39		13(25%)
> 21		52		40		12(23%)
> 22		51		42		9(17%)
> 23		51		46		5(9%)
> 24		52		35		17(32%)
> 25		52		37		15(28%)
> 26		52		38		14(26%)
> 27		52		39		13(25%)
> 28		52		40		12(23%)
> 29		53		42		11(20%)
> 30		52		43		9(17%)
> 31		52		44		8(15%)
> 32		51		36		15(29%)
> 33		51		38		13(25%)
> 34		51		39		12(23%)
> 35		51		41		10(19%)
> 36		52		41		11(21%)
> 37		52		43		9(17%)
> 38		51		44		7(13%)
> 39		52		46		6(11%)
> 40		51		37		14(27%)
> 41		50		38		12(24%)
> 42		50		39		11(22%)
> 43		50		40		10(20%)
> 44		50		42		8(16%)
> 45		50		43		7(14%)
> 46		50		43		7(14%)
> 47		50		45		5(10%)
> 48		50		37		13(26%)
> 49		49		38		11(22%)
> 50		50		40		10(20%)
> 51		50		42		8(16%)
> 52		50		42		8(16%)
> 53		49		46		3(6%)
> 54		50		46		4(8%)
> 55		49		48		1(2%)
> 56		50		39		11(22%)
> 57		50		40		10(20%)
> 58		49		42		7(14%)
> 59		50		42		8(16%)
> 60		50		46		4(8%)
> 61		50		47		3(6%)
> 62		50		48		2(4%)
> 63		50		48		2(4%)
> 64		51		38		13(25%)
> 
> Above 64 bytes the gain fades away.
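
[The patch description does not show the harness; purely as a sketch of how
per-length timings like the ones above might be collected, here is a
hypothetical debugfs module that times a single copy_from_user() per write().
The names, buffer size and single-shot timing are my own; a real benchmark
would repeat and average many copies per length and pin the task to one CPU.]

#include <linux/debugfs.h>
#include <linux/module.h>
#include <linux/timex.h>
#include <linux/uaccess.h>

static struct dentry *cfu_dir;
static char cfu_kbuf[4096];

/* Time one copy_from_user() of whatever user space writes to the file. */
static ssize_t cfu_bench_write(struct file *file, const char __user *ubuf,
			       size_t len, loff_t *ppos)
{
	cycles_t t0, t1;
	unsigned long uncopied;

	if (len > sizeof(cfu_kbuf))
		len = sizeof(cfu_kbuf);

	t0 = get_cycles();
	uncopied = copy_from_user(cfu_kbuf, ubuf, len);
	t1 = get_cycles();

	pr_info("copy_from_user: len=%zu cycles=%llu uncopied=%lu\n",
		len, (unsigned long long)(t1 - t0), uncopied);

	return len - uncopied;
}

static const struct file_operations cfu_bench_fops = {
	.owner = THIS_MODULE,
	.write = cfu_bench_write,
};

static int __init cfu_bench_init(void)
{
	cfu_dir = debugfs_create_dir("cfu_bench", NULL);
	debugfs_create_file("run", 0200, cfu_dir, NULL, &cfu_bench_fops);
	return 0;
}

static void __exit cfu_bench_exit(void)
{
	debugfs_remove_recursive(cfu_dir);
}

module_init(cfu_bench_init);
module_exit(cfu_bench_exit);
MODULE_LICENSE("GPL");

[Writing buffers of each length to /sys/kernel/debug/cfu_bench/run, e.g. with
dd bs=<len> count=1, then reading dmesg would give one sample per length.]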
> 
> Very similar values were collected for __copy_to_user().
> UDP receive performance under a small-packet flood using recvfrom()
> increases by ~5%.
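
[As for the recvfrom() figure, a flood receiver of roughly this shape (the
port number and 10-second window are arbitrary choices of mine) exercises the
small-packet receive path when paired with any UDP packet generator:]

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	struct sockaddr_in addr;
	char buf[64];			/* small packets only */
	unsigned long long pkts = 0;
	struct timespec start, now;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(9999);	/* arbitrary test port */

	if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("socket/bind");
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		if (recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL) > 0)
			pkts++;
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while (now.tv_sec - start.tv_sec < 10);

	printf("%llu packets received in ~10s\n", pkts);
	close(fd);
	return 0;
}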

What CPU model(s) were used for the performance testing, and was it tested on 
several different types of CPUs?

Please add a comment here:

+       if (len <= 64)
+               return copy_user_generic_unrolled(to, from, len);
+

... because it's not obvious at all that this is a performance optimization, not a 
correctness issue. Also explain that '64' is a number that we got from performance 
measurements.
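
[Something along these lines would do; the comment wording is mine and the
surrounding context is inferred from the patch description rather than taken
from the actual diff:]

	/*
	 * Performance tweak, not a correctness requirement: for short
	 * copies the fixed setup cost of 'rep movs' dominates, so the
	 * open-coded unrolled copy wins. The 64-byte cutoff was chosen
	 * empirically as the largest length that still shows a
	 * measurable gain in __copy_from_user()/__copy_to_user()
	 * microbenchmarks.
	 */
	if (len <= 64)
		return copy_user_generic_unrolled(to, from, len);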

But in general I like the change - as long as it was measured on reasonably modern 
x86 CPUs. I.e. it should not regress on modern Intel or AMD CPUs.

Thanks,

	Ingo
