linux-kernel - Re: [PATCH 4/5] mm: compaction: Determine if dirty pages can be migreated without blocking within ->migratepage

Message-ID: <4ED2A28E.2070206@redhat.com>
Date:	Sun, 27 Nov 2011 15:50:22 -0500
From:	Rik van Riel <riel@...hat.com>
To:	Mel Gorman <mgorman@...e.de>
CC:	Andrea Arcangeli <aarcange@...hat.com>,
	Linux-MM <linux-mm@...ck.org>,
	Minchan Kim <minchan.kim@...il.com>, Jan Kara <jack@...e.cz>,
	Andy Isaacson <adi@...apodia.org>,
	Johannes Weiner <jweiner@...hat.com>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 4/5] mm: compaction: Determine if dirty pages can be migreated
 without blocking within ->migratepage

On 11/24/2011 07:21 AM, Mel Gorman wrote:
> On Thu, Nov 24, 2011 at 02:19:43AM +0100, Andrea Arcangeli wrote:

>> But funny thing grow_dev_page already sets __GFP_MOVABLE. That's
>> pretty weird and it's probably source of a few not movable pages in
>> the movable block. But then many bh are movable... most of them are,
>> it's just the superblock that isn't.
>>
>> But considering grow_dev_page sets __GFP_MOVABLE, any worry about pins
>> from the fs on the block_dev.c pagecache shouldn't be a concern...
>>
>
> Except in quantity. We can cope with some pollution of MIGRATE_MOVABLE
> but if it gets excessive, it will cause a lot of trouble. Superblock
> bh's may not be movable but there are not many of them and they are
> long lived.

We're potentially doomed either way :)

If we allocate a lot of movable pages in non-movable
blocks, we can end up with a lot of slightly polluted
blocks even after reclaiming all the reclaimable page
cache.

If we allocate a few non-movable pages in movable
blocks, we can end up with the same situation.

Either way, we can potentially end up with a lot of
memory that cannot be defragmented.

Of course, it could take the mounting of a lot of
filesystems for this problem to be triggered, but we
know there are people doing that.
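
As a rough illustration of why a small number of unmovable
pages can matter so much, here is a toy user-space model. It
is not kernel code. The names scatter(), pack() and
count_polluted() are hypothetical, and PAGES_PER_BLOCK = 512
is an assumption (a 2MB pageblock of 4KB pages). A block
counts as "polluted" if it holds at least one unmovable page,
since that block can then no longer be fully defragmented.

```c
#include <assert.h>
#include <string.h>

/* Assumption: 2MB pageblocks made of 4KB pages. */
#define PAGES_PER_BLOCK 512

/* Count the blocks that contain at least one unmovable page. */
static int count_polluted(const int *blocks, int nr_blocks)
{
    int polluted = 0;

    for (int i = 0; i < nr_blocks; i++)
        if (blocks[i] > 0)
            polluted++;
    return polluted;
}

/* Worst case: spread unmovable pages round-robin over all blocks. */
static void scatter(int *blocks, int nr_blocks, int nr_pages)
{
    assert(nr_blocks > 0);
    memset(blocks, 0, sizeof(*blocks) * (size_t)nr_blocks);
    for (int i = 0; i < nr_pages; i++)
        blocks[i % nr_blocks]++;
}

/* Best case: fill one block completely before touching the next. */
static void pack(int *blocks, int nr_blocks, int nr_pages)
{
    assert(nr_blocks > 0);
    memset(blocks, 0, sizeof(*blocks) * (size_t)nr_blocks);
    for (int i = 0; i < nr_blocks && nr_pages > 0; i++) {
        int n = nr_pages < PAGES_PER_BLOCK ? nr_pages : PAGES_PER_BLOCK;

        blocks[i] = n;
        nr_pages -= n;
    }
}
```

With 1024 blocks and 512 unmovable pages, scatter() leaves
512 blocks polluted while pack() leaves only one, even though
the amount of unmovable memory is identical.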

>> __GFP_MOVABLE missing block_dev also was not
>> so common and it most certainly contributed to a reclaim more
>> aggressive than it would have happened with that fix. I think you can
>> push things one at time without urgency here, and I'd prefer maybe if
>> block_dev patch is applied and the other reversed in vmscan.c or
>> improved to start limiting only if we're above 8*high or some
>> percentage check to allow a little more reclaim than rc2 allows
>
> The limiting is my current preferred option - at least until it is
> confirmed that it really is ok to mark block_dev pages movable and that
> Rik is ok with the revert.

I am fine with replacing the compaction checks with free limit
checks. Funnily enough, the first iteration of the patch I submitted
to limit reclaim used a free limit check :)

I also suspect we will want to call shrink_slab regardless of
whether or not a memory zone is already over its free limit for
direct reclaim, since that has the potential to free an otherwise
unmovable page.
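
A minimal sketch of that ordering, as a user-space model
only. struct toy_zone, toy_shrink_lru(), toy_shrink_slab()
and toy_direct_reclaim() are hypothetical stand-ins and do
not match the real vmscan.c interfaces. The point is just
that the free limit check gates LRU reclaim, while the slab
shrink runs unconditionally.

```c
#include <assert.h>

/* Hypothetical model of one memory zone; not a kernel structure. */
struct toy_zone {
    unsigned long free_pages;  /* pages currently free */
    unsigned long free_limit;  /* skip LRU reclaim at or above this */
    unsigned long lru_pages;   /* reclaimable LRU pages */
    unsigned long slab_pages;  /* reclaimable slab pages */
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
    return a < b ? a : b;
}

/* Stand-in for LRU reclaim: frees up to nr pages from the LRU. */
static unsigned long toy_shrink_lru(struct toy_zone *z, unsigned long nr)
{
    unsigned long n = min_ul(nr, z->lru_pages);

    z->lru_pages -= n;
    z->free_pages += n;
    return n;
}

/* Stand-in for slab shrinking: frees up to nr slab pages. */
static unsigned long toy_shrink_slab(struct toy_zone *z, unsigned long nr)
{
    unsigned long n = min_ul(nr, z->slab_pages);

    z->slab_pages -= n;
    z->free_pages += n;
    return n;
}

/* Returns the total number of pages freed. */
static unsigned long toy_direct_reclaim(struct toy_zone *z, unsigned long nr)
{
    unsigned long freed = 0;

    assert(z);
    /* Only touch the LRU while the zone is below its free limit. */
    if (z->free_pages < z->free_limit)
        freed += toy_shrink_lru(z, nr);
    /* Always shrink slab: it may free an otherwise unmovable page. */
    freed += toy_shrink_slab(z, nr);
    return freed;
}
```

In this model a zone that is already over its free limit
keeps its LRU pages (and so its working set) intact, but
still gives up slab pages.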

>> (i.e. no reclaim at all which likely results in a failure in hugepage
>> allocation). Not unlimited as 3.1 is ok with me but if kswapd can free
>> a percentage I don't see why reclaim can't (considering more free
>> pages in movable pageblocks are needed to succeed compaction). The
>> ideal is to improve the compaction rate and at the same time reduce
>> reclaim aggressiveness. Let's start with the parts that are more
>> obviously right fixes and that don't risk regressions, we don't want
>> compaction regressions :).
>>
>
> I don't think there are any "obviously right fixes" right now until the
> block_dev patch is proven to be ok and that reverting does not regress
> Rik's workload. Going to take time.

Ironically, the test Andrea is measuring THP allocations with
(dd from /dev/sda to /dev/null) is functionally equivalent to
me running KVM guests with cache=writethrough directly from
a block device.

The difference is that Andrea is measuring THP allocation
success rate, while I am watching how well the programs (and
KVM guests) actually run.

Not surprisingly, swapping out the working set has a pretty
catastrophic effect on performance, even if it helps THP
allocation success :)

-- 
All rights reversed