Date:	Fri, 6 Jan 2012 12:05:20 +0100
From:	Ingo Molnar <mingo@...e.hu>
To:	Jan Beulich <JBeulich@...e.com>
Cc:	tglx@...utronix.de, hpa@...or.com, linux-kernel@...r.kernel.org,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH] x86-64: fix memset() to support sizes of 4Gb and above


* Jan Beulich <JBeulich@...e.com> wrote:

> While currently there doesn't appear to be any reachable 
> in-tree case where such large memory blocks may be passed to 
> memset() (alloc_bootmem() being the primary non-reachable one, 
> as it gets called with suitably large sizes in FLATMEM 
> configurations), we have recently hit the problem a second 
> time in our Xen kernels. Rather than working around it a 
> second time, prevent others from falling into the same trap by 
> fixing this long-standing limitation.
> 
> Signed-off-by: Jan Beulich <jbeulich@...e.com>
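
To make the limitation concrete, here's a minimal user-space
sketch (an illustration of the failure mode, not code from the
patch) of what happens once a memset() length gets narrowed to
32 bits somewhere along the way:

  #include <stdio.h>

  int main(void)
  {
          unsigned long long len = 1ULL << 32;    /* 4 GiB */

          /* what a 32-bit register sees of that length: */
          unsigned int narrowed = (unsigned int)len;

          printf("requested: %llu bytes\n", len);     /* 4294967296 */
          printf("effective: %u bytes\n", narrowed);  /* 0 - nothing
                                                         gets set */
          return 0;
  }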

Have you checked the before/after size of the hotpath?

The patch suggests that it got shorter by 3 instructions:

> -	movl %edx,%r8d
> -	andl $7,%r8d
> -	movl %edx,%ecx
> -	shrl $3,%ecx
> +	movq %rdx,%rcx
> +	andl $7,%edx
> +	shrq $3,%rcx

[...]

>  	movq %rdi,%r10
> -	movq %rdx,%r11
>  

[...]

> -	movl	%r11d,%ecx
> -	andl	$7,%ecx
> +	andl	$7,%edx

Is that quick impression correct? I have not tried building or 
measuring it.
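
If I read the new prologue right, its C equivalent is roughly
the following (an illustration only, not code from the patch;
it assumes a 64-bit size_t, as on x86-64):

  #include <stdio.h>
  #include <stddef.h>

  int main(void)
  {
          size_t len = (1ULL << 32) + 12;    /* 4 GiB + 12 bytes */

          /* old prologue (shrl): count from the low 32 bits only */
          size_t qwords_old = (unsigned int)len >> 3;   /* 1 */

          /* new prologue (shrq): count from the full 64-bit length */
          size_t qwords_new = len >> 3;                 /* 536870913 */

          /* the tail count (andl $7) never cared about bits 32+ */
          unsigned int tail = len & 7;                  /* 4 */

          printf("old: %zu qwords + %u tail bytes\n", qwords_old, tail);
          printf("new: %zu qwords + %u tail bytes\n", qwords_new, tail);
          return 0;
  }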

Also, note that we have some cool instrumentation tech upstream: 
lately we've added a way to measure the performance of the 
*kernel*'s memcpy routine from user-space, using perf bench:

  $ perf bench mem memcpy -r x86
  # Running mem/memcpy benchmark...
  Unknown routine:x86
  Available routines...
	default ... Default memcpy() provided by glibc
	x86-64-unrolled ... unrolled memcpy() in arch/x86/lib/memcpy_64.S

  $ perf bench mem memcpy -r x86-64-unrolled
  # Running mem/memcpy benchmark...
  # Copying 1MB Bytes ...

       1.888902 GB/Sec
      15.024038 GB/Sec (with prefault)

This builds the routine from arch/x86/lib/memcpy_64.S and measures 
its bandwidth in both the cache-cold and the cache-hot case.
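
For reference, the core of such a bandwidth measurement is simple 
enough to sketch in plain user-space C; here with memset(), since 
that's the routine we'd want to compare (a sketch of the idea 
only, not perf's actual implementation):

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <time.h>

  #define LEN (1 << 20)   /* 1 MB, like the run above */

  static double now(void)
  {
          struct timespec ts;

          clock_gettime(CLOCK_MONOTONIC, &ts);
          return ts.tv_sec + ts.tv_nsec / 1e9;
  }

  int main(void)
  {
          char *buf = malloc(LEN);
          double t;

          /* cold: first touch, page faults included in the timing */
          t = now();
          memset(buf, 1, LEN);
          printf("cold:     %f GB/sec\n", LEN / (now() - t) / 1e9);

          /* hot: the buffer is already faulted in and cached */
          t = now();
          memset(buf, 1, LEN);
          printf("prefault: %f GB/sec\n", LEN / (now() - t) / 1e9);

          free(buf);
          return 0;
  }

(Link with -lrt on older glibc for clock_gettime().)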

It would be nice to add support for arch/x86/lib/memset_64.S as 
well and look at its before/after performance. In user-space we 
can take much more accurate measurements of this kind:

 $ perf stat --repeat 1000 -e instructions -e cycles perf bench mem memcpy -r x86-64-unrolled >/dev/null

 Performance counter stats for 'perf bench mem memcpy -r x86-64-unrolled' (1000 runs):

         4,924,378 instructions              #    0.84  insns per cycle          ( +-  0.03% )
         5,892,603 cycles                    #    0.000 GHz                      ( +-  0.06% )

       0.002208000 seconds time elapsed                                          ( +-  0.10% )

Note how the confidence interval of the measurement is in the 
0.03% range; thus even a single-cycle change in overhead caused 
by your patch should be measurable. Keep in mind that you'll need 
to rebuild perf with the different memset routines.

For example, the kernel's memcpy routine is slightly faster than 
glibc's:

  $ perf stat --repeat 1000 -e instructions -e cycles perf bench mem memcpy -r default >/dev/null

 Performance counter stats for 'perf bench mem memcpy -r default' (1000 runs):

         4,927,173 instructions              #    0.83  insns per cycle          ( +-  0.03% )
         5,928,168 cycles                    #    0.000 GHz                      ( +-  0.06% )

       0.002157349 seconds time elapsed                                          ( +-  0.10% )
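
Concretely, comparing the two runs: 5,928,168 - 5,892,603 = 
35,565 cycles, i.e. the run with the unrolled routine used about 
0.6% fewer cycles, roughly ten times the +- 0.06% measurement 
noise, so the difference is small but real.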

If such measurements all suggest equal or better performance, 
and if there's no erratum in current CPUs that would make 4G 
string operations dangerous [which your research suggests should 
be fine], I have no objection in principle to this patch.

I'd not be surprised (at all) if other OSes did larger-than-4GB 
memsets in certain circumstances.

Thanks,

	Ingo
