Message-ID: <604427e00812041427j7f1c8118p48b1b5b577143703@mail.gmail.com>
Date:	Thu, 4 Dec 2008 14:27:54 -0800
From:	Ying Han <yinghan@...gle.com>
To:	Török Edwin <edwintorok@...il.com>
Cc:	Nick Piggin <npiggin@...e.de>, Mike Waychison <mikew@...gle.com>,
	Ingo Molnar <mingo@...e.hu>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org, akpm <akpm@...ux-foundation.org>,
	David Rientjes <rientjes@...gle.com>,
	Rohit Seth <rohitseth@...gle.com>,
	Hugh Dickins <hugh@...itas.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	"H. Peter Anvin" <hpa@...or.com>
Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

I am trying your test program (scalability) in-house, but somehow I got a
different result than the one you saw. I created 8 files, each 1 GB in size,
on separate drives (to avoid the disturbance of disk-seek latency). I got
these numbers without applying the patch, based on 2.6.26. May I ask how
to reproduce the mmap issue you mentioned?

8 CPU
read_worker
1 threads Real time: 101.058262 s (since task start)
2 threads Real time: 50.670456 s (since task start)
4 threads Real time: 25.904657 s (since task start)
8 threads Real time: 20.090677 s (since task start)
--------------------------------------------------------------------------------
mmap_worker
1 threads Real time: 101.340662 s (since task start)
2 threads Real time: 51.484646 s (since task start)
4 threads Real time: 28.414534 s (since task start)
8 threads Real time: 21.785818 s (since task start)
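
For reference, here is a minimal sketch of the kind of read()-vs-mmap()
benchmark being compared above (a hypothetical reconstruction, not the
actual test program; one worker thread per file, and the names, buffer
size, and page-touch stride are assumptions -- timing would be done
externally, e.g. with time(1)):

/*
 * Sketch of a read_worker / mmap_worker comparison: each thread sums
 * one file, either with read() into a private buffer or by touching
 * every page of an mmap()ed mapping.
 * Build: gcc -O2 -pthread bench.c
 * Run:   ./a.out read|mmap file1 [file2 ...]   (one thread per file)
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static int use_mmap;

static void *worker(void *arg)
{
	const char *path = arg;
	struct stat st;
	unsigned long sum = 0;
	int fd = open(path, O_RDONLY);

	if (fd < 0 || fstat(fd, &st) < 0) {
		perror(path);
		return NULL;
	}

	if (use_mmap) {
		/* mmap_worker: fault in every page, read one byte per page */
		unsigned char *p = mmap(NULL, st.st_size, PROT_READ,
					MAP_PRIVATE, fd, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			close(fd);
			return NULL;
		}
		for (off_t off = 0; off < st.st_size; off += 4096)
			sum += p[off];
		munmap(p, st.st_size);
	} else {
		/* read_worker: plain read() into a per-thread buffer */
		unsigned char *buf = malloc(1 << 20);
		ssize_t n;

		while ((n = read(fd, buf, 1 << 20)) > 0)
			sum += buf[0];
		free(buf);
	}
	close(fd);
	printf("%s: sum %lu\n", path, sum);
	return NULL;
}

int main(int argc, char **argv)
{
	if (argc < 3) {
		fprintf(stderr, "usage: %s read|mmap file...\n", argv[0]);
		return 1;
	}
	use_mmap = !strcmp(argv[1], "mmap");

	int nthreads = argc - 2;
	pthread_t tid[nthreads];

	for (int i = 0; i < nthreads; i++)
		pthread_create(&tid[i], NULL, worker, argv[i + 2]);
	for (int i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);
	return 0;
}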

--Ying

On Thu, Nov 27, 2008 at 4:52 AM, Török Edwin <edwintorok@...il.com> wrote:
> On 2008-11-27 14:39, Nick Piggin wrote:
>> On Thu, Nov 27, 2008 at 02:21:16PM +0200, Török Edwin wrote:
>>
>>> On 2008-11-27 14:03, Nick Piggin wrote:
>>>
>>>>> Running my testcase shows no significant performance difference. What am
>>>>> I doing wrong?
>>>>>
>>>>>
>>>>
>>>> Software may just be doing a lot of mmap/munmap activity. Threads +
>>>> mmap are never going to be pretty, because they always involve
>>>> broadcasting TLB flushes to other cores... Software writers shouldn't
>>>> be scared of using processes (possibly with some shared memory).
>>>>
>>>>
>>> It would be interesting to compare the performance of a threaded clamd,
>>> and of a clamd that uses multiple processes.
>>> Distributing tasks will be a bit more tricky, since it would need to use
>>> IPC, instead of mutexes and condition variables.
>>>
>>
>> Yes, although you could use PTHREAD_PROCESS_SHARED pthread mutexes on
>> the shared memory I believe (having never tried it myself).
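
(For illustration, a minimal sketch of the PTHREAD_PROCESS_SHARED idea
suggested above: a process-shared mutex placed in a MAP_SHARED anonymous
mapping and used by parent and child after fork(). This is an assumed
example, not code from the thread.)

/*
 * Sketch: process-shared pthread mutex in shared memory (assumed
 * example).  Parent and child both increment a counter under the lock.
 * Build: gcc -O2 -pthread pshared.c
 */
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

struct shared {
	pthread_mutex_t lock;
	long counter;
};

int main(void)
{
	/* Memory visible to both processes after fork(). */
	struct shared *sh = mmap(NULL, sizeof(*sh), PROT_READ | PROT_WRITE,
				 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (sh == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Mark the mutex as usable across processes. */
	pthread_mutexattr_t attr;
	pthread_mutexattr_init(&attr);
	pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
	pthread_mutex_init(&sh->lock, &attr);
	sh->counter = 0;

	pid_t pid = fork();
	if (pid < 0) {
		perror("fork");
		return 1;
	}
	for (int i = 0; i < 100000; i++) {
		pthread_mutex_lock(&sh->lock);
		sh->counter++;
		pthread_mutex_unlock(&sh->lock);
	}
	if (pid == 0)
		return 0;		/* child is done */
	waitpid(pid, NULL, 0);
	printf("counter = %ld (expected 200000)\n", sh->counter);
	munmap(sh, sizeof(*sh));
	return 0;
}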
>>
>>
>>
>>>> Actually, a lot of things get faster (like malloc, or file descriptor
>>>> operations) because locks aren't needed.
>>>>
>>>> Despite common perception, processes are actually much *faster* than
>>>> threads when doing common operations like these. They are slightly slower
>>>> sometimes with things like creation and exit, or context switching, but
>>>> if you're doing huge numbers of those operations, then it is unlikely
>>>> to be a performance critical app... :)
>>>>
>>>>
>>> How about distributing tasks to a set of worker threads: is the
>>> overhead of using IPC instead of mutexes/condition variables
>>> acceptable?
>>>
>>
>> It is really going to depend on a lot of things: what is involved in
>> distributing tasks, how many cores there are, the cache/TLB
>> architecture of the system it is running on, etc.
>>
>> You want to distribute as much work as possible while touching as
>> little memory as possible, in general.
>>
>> But if you're distributing threads over cores, and shared caches are
>> physically tagged (which I think all x86 CPUs are), then you should
>> be able to have multiple processes operate on shared memory just as
>> efficiently as multiple threads I think.
>>
>> And then you also get the advantages of reduced contention on other
>> shared locks and resources.
>>
>
> Thanks for the tips, but let's get back to the original question:
> why don't I see any performance improvement with the fault-retry patches?
>
> My testcase only compares reading files with mmap vs. reading files
> with read, with different numbers of threads.
> Leaving aside other reasons why mmap is slower, there should be some
> speedup when running 4 threads vs. 1 thread, but:
>
> 1 thread: read: 27.18, 28.76
> 1 thread: mmap: 25.45, 25.24
> 2 thread: read: 16.03, 15.66
> 2 thread: mmap: 22.20, 20.99
> 4 thread: read: 9.15, 9.12
> 4 thread: mmap: 20.38, 20.47
>
> The speed of 4 threads is about the same as for 2 threads with mmap, yet
> with read it scales nicely.
> And the patch doesn't seem to improve scalability.
> How can I find out if the patch works as expected? [i.e. verify that
> faults are actually retried, and that they don't keep the semaphore locked]
>
>> OK, I'll see if I can find them (am overseas at the moment, and I suspect
>> they are stranded on some stationary rust back home, but I might be able
>> to find them on the web).
>
> Ok.
>
> Best regards,
> --Edwin
>
