[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <4BD6EE18.4090909@linux.vnet.ibm.com>
Date: Tue, 27 Apr 2010 16:00:56 +0200
From: Christian Ehrhardt <ehrhardt@...ux.vnet.ibm.com>
To: Rik van Riel <riel@...hat.com>, Mel Gorman <mel@....ul.ie>
CC: KOSAKI Motohiro <kosaki.motohiro@...il.com>,
Johannes Weiner <hannes@...xchg.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Nick Piggin <npiggin@...e.de>, gregkh@...ell.com,
linux-mm@...ck.org, Chris Mason <chris.mason@...cle.com>,
Jens Axboe <jens.axboe@...cle.com>,
linux-kernel@...r.kernel.org, Corrado Zoccolo <czoccolo@...il.com>
Subject: Re: Subject: [PATCH][RFC] mm: make working set portion that is protected
tunable v2
Rik van Riel wrote:
> On 04/26/2010 08:43 AM, Christian Ehrhardt wrote:
>
>>>> This patch creates a knob to help users that have workloads suffering
>>>> from the
>>>> fix 1:1 active inactive ratio brought into the kernel by "56e49d21
>>>> vmscan:
>>>> evict use-once pages first".
>>>> It also provides the tuning mechanisms for other users that want an
>>>> even bigger
>>>> working set to be protected.
>>>
>>> We certainly need no knob. because typical desktop users use various
>>> application,
>>> various workload. then, the knob doesn't help them.
>>
>> Briefly - We had discussed non desktop scenarios where like a day load
>> that builds up the working set to 50% and a nightly backup job which
>> then is unable to use that protected 50% when sequentially reading a lot
>> of disks and due to that doesn't finish before morning.
>
> This is a red herring. A backup touches all of the
> data once, so it does not need a lot of page cache
> and will not "not finish before morning" due to the
> working set being protected.
>
> You're going to have to come up with a more realistic
> scenario than that.
I completely agree that a backup case is read once and therefore doesn't
benefit from caching itself, but you know my scenario from the thread
where this patch emerged from.
="Parallel iozone sequential read - resembling the classic backup case
(read once + sequential)."
While caching isn't helping the classic way, by having data in cache
ready on the next access it is still used transparently as the system
is reading ahead into page cache to assist the sequentially reading
process.
Yes it doesn't happen with direct IO and some, but unfortunately not
all backup tools use DIO. Additionally not all backup jobs have a whole
night, and this can really be a decision maker if you can quickly pump
out your 100 TB main database in 10 or 20 minutes.
So here comes the problem, due to the 50% preserved I assume it comes
into trouble allocating that page cache memory in time. So much that it
even slows down the load - meaning long enough to let the application
completely consume the data already read and then still letting it wait.
More about that below.
Now IMHO this feels comparable to a classic backup job, and by loosing
60% Throughput (more than a Gb/s) is seems neither red nor smells like
fish to me.
>> I personally just don't feel too good knowing that 50% of my memory
>> might hang around unused for many hours while they could be of some use.
>> I absolutely agree with the old intention and see how the patch helped
>> with the latency issue Elladan brought up in the past - but it just
>> looks way too aggressive to protect it "forever" for some server use
>> cases.
>
> So far we have seen exactly one workload where it helps
> to reduce the size of the active file list, and that is
> not due to any need for caching more inactive pages.
>
> On the contrary, it is because ALL OF THE INACTIVE PAGES
> are in flight to disk, all under IO at the same time.
Ok this time I think I got your point much better - sorry for
being confused.
Discard my patch, but I'd really like to clarify and verify your
assumption in conjunction with my findings and would be happy
if you can help me with that.
As mentioned the case that suffers from the 50% memory protected is
iozone read - so it would be "in flight FROM disk", but I guess that
it is not important if it is from or to right ?
Effectively I have two read cases, one with caches dropped which then
has almost full memory for page cache in the read case. And the other
one with a few writes before filling up the protected 50% leading to a
read case with only half of the memory for page cache.
Now if I really got you right this time the issue is caused by the
fact that the parallel read ahead on all 16 disks creates so much I/O
in flight that the 128M (=50% that are left) are not enough.
>From the past we know that the time lost for the -60% Throughput was
spent in a loop around direct_reclaim&congestion_wait trying to get the
memory for the page cache reads - would you consider it possible that
we now run into a scenario splitting the memory like this?:
- 50% active file protected
- a lot of the other half related to I/O that is currently
in flight from the disk -> not free-able too?
- almost nothing to free when allocating for the next read to page
cache (can only take pages above low watermark) -> waiting
I updated my old counter patch, that I used to verify the old issue were
we spent so much time in a full timeout of congestion wait. Thanks to
Mel this was fixed (I have his watermark wait patch applied), but I
assume having 50% protected I just run into the shortened wait more
often or wait longer for watermarks to still be an issue (due to 50%
not free-able).
See the patch inlined at the end of the mail for details what/how
it is exactly counted.
As before the scenario is iozone on 16 disks in parallel with 1 iozone
child per disk.
I ran:
- write, write, write, read -> bad case
- drop cache, read -> good case
Read throughput still drops by ~60% comparing good to bad case.
Here are the numbers I got for those two cases by my counters and
meminfo:
Value Initial state Write 1 Write 2 Write 3 Read after writes (bad) Read after DC (good)
watermark_wait_duration (ns) 0 9,902,333,643 12,288,444,574 24,197,098,221 317,175,021,553 35,002,926,894
watermark_wait 0 24102 26708 35285 29720 15515
pages_direct_reclaim 0 59195 65010 86777 90883 66672
failed_pages_direct_reclaim 0 24144 26768 35343 29733 15525
failed_pages_direct_reclaim_but_progress 0 24144 26768 35343 29733 15525
MemTotal: 248912 248912 248912 248912 248912 248912
MemFree: 185732 4868 5028 3780 3064 7136
Buffers: 536 33588 65660 84296 81868 32072
Cached: 9480 145252 111672 93736 98424 149724
Active: 11052 43920 76032 89084 87780 38024
Inactive: 6860 142628 108980 96528 100280 151572
Active(anon): 5092 4452 4428 4364 4516 4492
Inactive(anon): 6480 6608 6604 6604 6604 6604
Active(file): 5960 39468 71604 84720 83264 33532
Inactive(file): 380 136020 102376 89924 93676 144968
Unevictable: 3952 3952 3952 3952 3952 3952
Real Time passed in seconds 48.83 49.38 50.35 40.62 22.61
AVG wait time waitduration/# 410,851 460,104 685,762 10,672,107 2,256,070 => x5 longer waits in avg
-52.20% bad case runs about twice as often into waits
These numbers seem to point toward my assumption, that the 50% preserved
cause the system to be unable to find memory fast enough.
Happening twice as often to run into the wait after a direct_reclaim
that made progress, but not finding a free page.
And then in average waiting about 5 times longer to get things freed up
enough reaching the watermark and get woken up.
####
Eventually I'd also really like to completely understand why the active
file pages grow when I execute the same iozone write load three times.
They effectively write the same files in the same directories without
being a journaling file system (The effect can be seen in the table
above as well).
If one of these write runs would use more than ~30M active file pages
they would be allocated and afterwards protected, but they aren't.
Then after the second run I see ~60M active file pages.
As mentioned before I would assume that it either just reuses what is
in memory from the first run, or if it really uses new stuff then the
time has come to throw the old away.
Therefore I would assume that it should never get much more after the
first run as long as they are essentially doing the same.
Does someone already know or has a good assumption what might be
growing in these buffers?
Is there a good interface to check what is buffered and protected atm?
> Caching has absolutely nothing to do with the regression
> you ran into.
As mentioned above not by means of "having it in the cache for another
fast access" yes.
But maybe by "not getting memory for reads into page cache fast enough".
--
GrĂ¼sse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance
#### patch for the counters shown in table above ######
Subject: [PATCH][DEBUGONLY] mm: track allocation waits
From: Christian Ehrhardt <ehrhardt@...ux.vnet.ibm.com>
This patch adds some debug counters to track how often a system runs into
waits after direct reclaim (happens in case of did_some_progress & !page)
and how much time it spends there waiting.
#for debugging only#
Signed-off-by: Christian Ehrhardt <ehrhardt@...ux.vnet.ibm.com>
---
[diffstat]
include/linux/sysctl.h | 1
kernel/sysctl.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 17 ++++++++++++++
3 files changed, 75 insertions(+)
[diff]
diff -Naur linux-2.6.32.11-0.3.99.6.626e022.orig/include/linux/sysctl.h linux-2.6.32.11-0.3.99.6.626e022/include/linux/sysctl.h
--- linux-2.6.32.11-0.3.99.6.626e022.orig/include/linux/sysctl.h 2010-04-27 12:01:54.000000000 +0200
+++ linux-2.6.32.11-0.3.99.6.626e022/include/linux/sysctl.h 2010-04-27 12:03:56.000000000 +0200
@@ -68,6 +68,7 @@
CTL_BUS=8, /* Busses */
CTL_ABI=9, /* Binary emulation */
CTL_CPU=10, /* CPU stuff (speed scaling, etc) */
+ CTL_PERF=11, /* Performance counters and timer sums for debugging */
CTL_XEN=123, /* Xen info and control */
CTL_ARLAN=254, /* arlan wireless driver */
CTL_S390DBF=5677, /* s390 debug */
diff -Naur linux-2.6.32.11-0.3.99.6.626e022.orig/kernel/sysctl.c linux-2.6.32.11-0.3.99.6.626e022/kernel/sysctl.c
--- linux-2.6.32.11-0.3.99.6.626e022.orig/kernel/sysctl.c 2010-04-27 14:26:04.000000000 +0200
+++ linux-2.6.32.11-0.3.99.6.626e022/kernel/sysctl.c 2010-04-27 15:44:54.000000000 +0200
@@ -183,6 +183,7 @@
.default_set.list = LIST_HEAD_INIT(root_table_header.ctl_entry),
};
+static struct ctl_table perf_table[];
static struct ctl_table kern_table[];
static struct ctl_table vm_table[];
static struct ctl_table fs_table[];
@@ -236,6 +237,13 @@
.mode = 0555,
.child = dev_table,
},
+ {
+ .ctl_name = CTL_PERF,
+ .procname = "perf",
+ .mode = 0555,
+ .child = perf_table,
+ },
+
/*
* NOTE: do not add new entries to this table unless you have read
* Documentation/sysctl/ctl_unnumbered.txt
@@ -254,6 +262,55 @@
static int max_sched_shares_ratelimit = NSEC_PER_SEC; /* 1 second */
#endif
+extern unsigned long perf_count_watermark_wait;
+extern unsigned long perf_count_pages_direct_reclaim;
+extern unsigned long perf_count_failed_pages_direct_reclaim;
+extern unsigned long perf_count_failed_pages_direct_reclaim_but_progress;
+extern unsigned long perf_count_watermark_wait_duration;
+static struct ctl_table perf_table[] = {
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "perf_count_watermark_wait_duration",
+ .data = &perf_count_watermark_wait_duration,
+ .mode = 0666,
+ .maxlen = sizeof(unsigned long),
+ .proc_handler = &proc_doulongvec_minmax,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "perf_count_watermark_wait",
+ .data = &perf_count_watermark_wait,
+ .mode = 0666,
+ .maxlen = sizeof(unsigned long),
+ .proc_handler = &proc_doulongvec_minmax,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "perf_count_pages_direct_reclaim",
+ .data = &perf_count_pages_direct_reclaim,
+ .maxlen = sizeof(unsigned long),
+ .mode = 0666,
+ .proc_handler = &proc_doulongvec_minmax,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "perf_count_failed_pages_direct_reclaim",
+ .data = &perf_count_failed_pages_direct_reclaim,
+ .maxlen = sizeof(unsigned long),
+ .mode = 0666,
+ .proc_handler = &proc_doulongvec_minmax,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "perf_count_failed_pages_direct_reclaim_but_progress",
+ .data = &perf_count_failed_pages_direct_reclaim_but_progress,
+ .maxlen = sizeof(unsigned long),
+ .mode = 0666,
+ .proc_handler = &proc_doulongvec_minmax,
+ },
+ { .ctl_name = 0 }
+};
+
static struct ctl_table kern_table[] = {
{
.ctl_name = CTL_UNNUMBERED,
diff -Naur linux-2.6.32.11-0.3.99.6.626e022.orig/mm/page_alloc.c linux-2.6.32.11-0.3.99.6.626e022/mm/page_alloc.c
--- linux-2.6.32.11-0.3.99.6.626e022.orig/mm/page_alloc.c 2010-04-27 12:01:55.000000000 +0200
+++ linux-2.6.32.11-0.3.99.6.626e022/mm/page_alloc.c 2010-04-27 14:06:40.000000000 +0200
@@ -191,6 +191,7 @@
wake_up_interruptible(&watermark_wq);
}
+unsigned long perf_count_watermark_wait = 0;
/**
* watermark_wait - Wait for watermark to go above low
* @timeout: Wait until watermark is reached or this timeout is reached
@@ -202,6 +203,7 @@
long ret;
DEFINE_WAIT(wait);
+ perf_count_watermark_wait++;
prepare_to_wait(&watermark_wq, &wait, TASK_INTERRUPTIBLE);
/*
@@ -1725,6 +1727,10 @@
return page;
}
+unsigned long perf_count_pages_direct_reclaim = 0;
+unsigned long perf_count_failed_pages_direct_reclaim = 0;
+unsigned long perf_count_failed_pages_direct_reclaim_but_progress = 0;
+
/* The really slow allocator path where we enter direct reclaim */
static inline struct page *
__alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
@@ -1761,6 +1767,13 @@
zonelist, high_zoneidx,
alloc_flags, preferred_zone,
migratetype);
+
+ perf_count_pages_direct_reclaim++;
+ if (!page)
+ perf_count_failed_pages_direct_reclaim++;
+ if (!page && *did_some_progress)
+ perf_count_failed_pages_direct_reclaim_but_progress++;
+
return page;
}
@@ -1841,6 +1854,7 @@
return alloc_flags;
}
+unsigned long perf_count_watermark_wait_duration = 0;
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -1961,8 +1975,11 @@
/* Check if we should retry the allocation */
pages_reclaimed += did_some_progress;
if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
+ unsigned long t1;
/* Too much pressure, back off a bit at let reclaimers do work */
+ t1 = get_clock();
watermark_wait(HZ/50);
+ perf_count_watermark_wait_duration += ((get_clock() - t1) * 125) >> 9;
goto rebalance;
}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists