Date:	Thu, 17 Apr 2008 15:33:02 +0200
From:	"Alexander van Heukelum" <heukelum@...tmail.fm>
To:	"Andi Kleen" <andi@...stfloor.org>
Cc:	"Ingo Molnar" <mingo@...e.hu>, linux-kernel@...r.kernel.org
Subject: Re: [v2.6.26] what's brewing in x86.git for v2.6.26

On Thu, 17 Apr 2008 12:51:09 +0200, "Andi Kleen" <andi@...stfloor.org>
said:
> I think a realistic benchmark would be by running a real kernel
> and profiling the input values of the bitmap functions and then
> testing these cases.
> 
> I actually started that when I complained last time by writing
> a systemtap script for this that generates a histogram, but for some
> reason systemtap couldn't tap all bitmap functions in my kernel and
> missed some completely and I ran out of time tracking that down.
> 
> My gut feeling is the only interesting cases are cpumask/nodemask sized
> (which can be one word, two words but now upto 8 words on a NR_CPU=4096
> x86 kernel) and then 4k sized ext3/reiser/etc. block bitmaps.
>
> > The generic version is out-of-line,
> > while the private implementation of i386 was inlined: this causes a
> > regression for very small bitmaps. However, if the bitmap size is
> > a constant and fits a long integer, the updated generic code should
> > inline an optimized version, like x86_64 currently does it.
> 
> Yes it should probably. cpumask walks are relatively common.

Hi,

The version that is in x86#testing _will_ do this optimization. For
a 32-CPU SMP configuration on x86_64 this results in:

<__first_cpu>:
    mov    $0x20,%edx   (inlined...)
    mov    $0x100000000,%rax
    or     (%rdi),%rax
    bsf    %rax,%rax    (... find_first_bit)
    cmp    $0x20,%eax   (superfluous paranoia...)
    cmovg  %edx,%eax    (... for broken find_first_bit)
    retq   

and something similar for __next_cpu.

> I remember profiling mysql some time ago which did bad overscheduling
> due to dumb locking. Funny was that the mask walking in the scheduler
> actually stood out. No, i don't claim extreme overscheduling is an
> interesting case to optimize for, but then there are more realistic
> workloads which also do a lot of context switching.
> 
> BTW if you do generic work on this: one reason the generated code for
> for_each_cpu etc. is so ugly is that the code has checks for
> find_next_bit returning >= max size. If you can generalize the
> code enough to make sure no arch does that anymore these checks
> could be eliminated.

for_each_cpu code looks fine:

    mov    $cpumapaddress,%rdi
    callq  <__first_cpu>
    jmp    end_of_body
start_of_body:
    ...
end_of_body:
    mov    $cpumapaddress,%edi  ($mapaddress often cached in register)
    callq  <__next_cpu>
    cmp    $0x1f,%eax
    jle    start_of_body

On the other hand, it would be nice to change __first_cpu and
__next_cpu into inline functions. If all implementations of
find_first_bit and find_next_bit reliably returned max_size when
no bits are set, that would be safe to do. The generic one does
return max_size.

Greetings,
    Alexander

> -Andi
-- 
  Alexander van Heukelum
  heukelum@...tmail.fm


