Message-Id: <20121210.095929.249234826.d.hatayama@jp.fujitsu.com>
Date: Mon, 10 Dec 2012 09:59:29 +0900 (JST)
From: HATAYAMA Daisuke <d.hatayama@...fujitsu.com>
To: cpw@....com
Cc: vgoyal@...hat.com, kexec@...ts.infradead.org, ptesarik@...e.cz,
linux-kernel@...r.kernel.org, kumagai-atsushi@....nes.nec.co.jp,
""@jp.fujitsu.com
Subject: Re: [PATCH] makedumpfile: request the kernel do page scans
From: Cliff Wickman <cpw@....com>
Subject: Re: [PATCH] makedumpfile: request the kernel do page scans
Date: Mon, 19 Nov 2012 12:07:10 -0600
> On Fri, Nov 16, 2012 at 03:39:44PM -0500, Vivek Goyal wrote:
>> On Thu, Nov 15, 2012 at 04:52:40PM -0600, Cliff Wickman wrote:
>> >
>> > Gentlemen,
>> >
>> > I know this is rather late to the game, given all the recent work to speed
>> > up makedumpfile and reduce the memory that it consumes.
>> > But I've been experimenting with asking the kernel to scan the page tables
>> > instead of reading all those page structures through /proc/vmcore.
>> >
>> > The results are rather dramatic -- if they weren't I would not presume to
>> > suggest such a radical path.
>> > On a small, idle UV system: about 4 sec. versus about 40 sec.
>> > On an 8TB UV the unnecessary page scan alone takes 4 minutes, vs. about
>> > 200 minutes through /proc/vmcore.
>> >
>> > I have not compared it to your version 1.5.1, so I don't know if your recent
>> > work provides similar speedups.
>>
>> I guess try 1.5.1-rc. IIUC, we had the logic of going through page tables,
>> but that required a single bitmap to be present, and in a constrained
>> memory environment we will not have that.
>>
>> That's when the idea came up to scan a portion of the struct page range,
>> filter it, dump it, and then move on to the next range.
>>
>> Even after 1.5.1-rc, if the difference is this dramatic, that means we are
>> not doing something right in makedumpfile and it needs to be
>> fixed/optimized.
>>
>> But moving the logic to the kernel does not make much sense to me at this
>> point in time until and unless there is a good explanation of why
>> user space can't do a good job of what the kernel is doing.
>
> I tested a patch in which makedumpfile does nothing but scan all the
> page structures using /proc/vmcore. It is simply reading each consecutive
> range of page structures in readmem() chunks of 512 structures. And doing
> nothing more than accumulating a hash total of the 'flags' field in each
> page (for a sanity check). On my test machine there are 6 blocks of page
> structures, totaling 12 million structures. This takes 31.1 'units of time'
> (I won't say seconds, as the speed of the clock seems to be way too fast in
> the crash kernel). If I increase the buffer size to 5120 structures: 31.0 units.
> At 51200 structures: 30.9. So buffer size has virtually no effect.
>
> I also requested the kernel to do the same thing. Each of the 6 requests
> asks the kernel to scan a range of page structures and accumulate a hash
> total of the 'flags' field. (And also copy a 10000-element pfn list back
> to user space, to test that such copies don't add significant overhead.)
> And the 12 million pages are scanned in 1.6 'units of time'.
>
> If I compare the time for actual page scanning (unnecessary pages and
> free pages) through /proc/vmcore vs. requesting the kernel to do the
> scanning: 40 units vs. 3.8 units.
>
> My conclusion is that makedumpfile's page scanning procedure is overwhelmingly
> dominated by the overhead of copying page structures through /proc/vmcore,
> which is about 20x slower than having the kernel access the pages.
I have not tested your patch set on the machine with 2TB of memory due to
a reservation problem, but I have already tested it on my local machine
with 32GB and saw a big performance improvement.
I applied your patch set on top of makedumpfile v1.5.1-rc and added an
option, -N, that skips dumping pages so that only the page-scanning part
is measured. With this, scanning pages in user space took about 25
seconds, while scanning pages in kernel space took about 1 second.
During the execution I profiled it using perf record/report; the results
are attached.
From this we can see that the current makedumpfile spends a large part of
its execution time in ioremap and related processing; copy_to_user
accounted for less than 2% of the whole run.
Looking into the code around the read method of /proc/vmcore, its call
stack breaks down as follows:

  read_vmcore
    read_from_oldmem
      copy_oldmem_page
copy_oldmem_page reads the 1st kernel's memory *per page* using
ioremap_cache, and after each read it immediately unmaps the remapped
address with iounmap.
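
For reference, the per-page behaviour looks roughly like the sketch
below. This is a simplified paraphrase of the x86 copy_oldmem_page()
from memory, not a verbatim copy of the kernel source:

#include <linux/errno.h>
#include <linux/io.h>
#include <linux/mm.h>
#include <linux/string.h>
#include <linux/uaccess.h>

/*
 * Simplified sketch (paraphrased, not verbatim) of the x86
 * copy_oldmem_page(): every read of the old kernel's memory maps
 * exactly one page with ioremap_cache and unmaps it again right away.
 */
ssize_t copy_oldmem_page(unsigned long pfn, char *buf, size_t csize,
			 unsigned long offset, int userbuf)
{
	void *vaddr;

	if (!csize)
		return 0;

	/* map one page of the crashed kernel's memory ... */
	vaddr = ioremap_cache(pfn << PAGE_SHIFT, PAGE_SIZE);
	if (!vaddr)
		return -ENOMEM;

	if (userbuf) {
		if (copy_to_user(buf, vaddr + offset, csize)) {
			iounmap(vaddr);
			return -EFAULT;
		}
	} else {
		memcpy(buf, vaddr + offset, csize);
	}

	/* ... and unmap it again immediately */
	iounmap(vaddr);
	return csize;
}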
Because ioremap/iounmap is called *per page*, the number of ioremap calls
over a whole scan stays the same no matter how large makedumpfile's cache
is. This is consistent with Cliff's observation that increasing
makedumpfile's 512-entry buffer made no measurable difference.
I think the first step in addressing this issue is to introduce a kind of
cache in the read_vmcore path to reduce the number of ioremap/iounmap
calls; a rough sketch of the idea is below. Porting the scanning logic
into kernel space should be considered only if that turns out not to be
enough.
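
To make the idea concrete, here is a rough, untested sketch. All names
below (oldmem_cache, oldmem_cache_map, OLDMEM_CACHE_PAGES) are made up
for illustration and do not exist in the kernel; locking, tearing the
mapping down at the end of the dump, and the uncached-mapping case are
deliberately left out:

#include <linux/io.h>
#include <linux/mm.h>
#include <linux/types.h>

/* Illustrative window size; the real value would need benchmarking. */
#define OLDMEM_CACHE_PAGES	256UL

/* One-slot cache of the most recently ioremapped window of old memory. */
static struct {
	unsigned long start_pfn;	/* first pfn of the mapped window */
	void *vaddr;			/* base address of the mapping */
} oldmem_cache;

/*
 * Hypothetical helper that the read_from_oldmem/copy_oldmem_page path
 * could use instead of calling ioremap_cache per page: remap only when
 * the requested pfn falls outside the currently mapped window.
 */
static void *oldmem_cache_map(unsigned long pfn)
{
	unsigned long base = pfn & ~(OLDMEM_CACHE_PAGES - 1);

	if (!oldmem_cache.vaddr ||
	    pfn < oldmem_cache.start_pfn ||
	    pfn >= oldmem_cache.start_pfn + OLDMEM_CACHE_PAGES) {
		if (oldmem_cache.vaddr)
			iounmap(oldmem_cache.vaddr);
		oldmem_cache.vaddr = ioremap_cache(
			(resource_size_t)base << PAGE_SHIFT,
			OLDMEM_CACHE_PAGES * PAGE_SIZE);
		oldmem_cache.start_pfn = base;
	}

	if (!oldmem_cache.vaddr)
		return NULL;

	return oldmem_cache.vaddr +
		((pfn - oldmem_cache.start_pfn) << PAGE_SHIFT);
}

With sequential reads of consecutive page structures this would turn one
ioremap/iounmap pair per page into one per window, so most reads would be
a plain memcpy from an already-mapped region.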
Thanks.
HATAYAMA, Daisuke
Attachments:
  makedumpfile-output-user.log    (Text/Plain, 28308 bytes)
  makedumpfile-output-kernel.txt  (Text/Plain, 26135 bytes)
  perf_report_user.txt            (Text/Plain, 47878 bytes)
  perf_report_kernel.txt          (Text/Plain, 12368 bytes)