linux-kernel - Re: MGLRU premature memcg OOM on slow writes

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240308212212.GA38843@cmpxchg.org>
Date: Fri, 8 Mar 2024 16:22:12 -0500
From: Johannes Weiner <hannes@...xchg.org>
To: Axel Rasmussen <axelrasmussen@...gle.com>
Cc: Chris Down <chris@...isdown.name>, cgroups@...r.kernel.org,
	kernel-team@...com, linux-kernel@...r.kernel.org,
	linux-mm@...ck.org, yuzhao@...gle.com
Subject: Re: MGLRU premature memcg OOM on slow writes

On Fri, Mar 08, 2024 at 11:18:28AM -0800, Axel Rasmussen wrote:
> On Thu, Feb 29, 2024 at 4:30 PM Chris Down <chris@...isdown.name> wrote:
> >
> > Axel Rasmussen writes:
> > >A couple of dumb questions. In your test, do you have any of the following
> > >configured / enabled?
> > >
> > >/proc/sys/vm/laptop_mode
> > >memory.low
> > >memory.min
> >
> > None of these are enabled. The issue is trivially reproducible by writing to
> > any slow device with memory.max enabled, but from the code it looks like MGLRU
> > is also susceptible to this on global reclaim (although it's less likely due to
> > page diversity).
> >
> > >Besides that, it looks like the place non-MGLRU reclaim wakes up the
> > >flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()).
> > >Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree it
> > >looks like it simply will not do this.
> > >
> > >Yosry pointed out [1], where MGLRU used to call this but stopped doing that. It
> > >makes sense to me at least that doing writeback every time we age is too
> > >aggressive, but doing it in evict_folios() makes some sense to me, basically to
> > >copy the behavior the non-MGLRU path (shrink_inactive_list()) has.
> >
> > Thanks! We may also need reclaim_throttle(), depending on how you implement it.
> > Current non-MGLRU behaviour on slow storage is also highly suspect in terms of
> > (lack of) throttling after moving away from VMSCAN_THROTTLE_WRITEBACK, but one
> > thing at a time :-)
> 
> 
> Hmm, so I have a patch which I think will help with this situation,
> but I'm having some trouble reproducing the problem on 6.8-rc7 (so
> then I can verify the patch fixes it).
> 
> If I understand the issue right, all we should need to do is get a
> slow filesystem, and then generate a bunch of dirty file pages on it,
> while running in a tightly constrained memcg. To that end, I tried the
> following script. But, in reality I seem to get little or no
> accumulation of dirty file pages.
> 
> I thought maybe fio does something different than rsync which you said
> you originally tried, so I also tried rsync (copying /usr/bin into
> this loop mount) and didn't run into an OOM situation either.
> 
> Maybe some dirty ratio settings need tweaking or something to get the
> behavior you see? Or maybe my test has a dumb mistake in it. :)
> 
> 
> 
> #!/usr/bin/env bash
> 
> echo 0 > /proc/sys/vm/laptop_mode || exit 1
> echo y > /sys/kernel/mm/lru_gen/enabled || exit 1
> 
> echo "Allocate disk image"
> IMAGE_SIZE_MIB=1024
> IMAGE_PATH=/tmp/slow.img
> dd if=/dev/zero of=$IMAGE_PATH bs=1024k count=$IMAGE_SIZE_MIB || exit 1
> 
> echo "Setup loop device"
> LOOP_DEV=$(losetup --show --find $IMAGE_PATH) || exit 1
> LOOP_BLOCKS=$(blockdev --getsize $LOOP_DEV) || exit 1
> 
> echo "Create dm-slow"
> DM_NAME=dm-slow
> DM_DEV=/dev/mapper/$DM_NAME
> echo "0 $LOOP_BLOCKS delay $LOOP_DEV 0 100" | dmsetup create $DM_NAME || exit 1
> 
> echo "Create fs"
> mkfs.ext4 "$DM_DEV" || exit 1
> 
> echo "Mount fs"
> MOUNT_PATH="/tmp/$DM_NAME"
> mkdir -p "$MOUNT_PATH" || exit 1
> mount -t ext4 "$DM_DEV" "$MOUNT_PATH" || exit 1
> 
> echo "Generate dirty file pages"
> systemd-run --wait --pipe --collect -p MemoryMax=32M \
>         fio -name=writes -directory=$MOUNT_PATH -readwrite=randwrite \
>         -numjobs=10 -nrfiles=90 -filesize=1048576 \
>         -fallocate=posix \
>         -blocksize=4k -ioengine=mmap \
>         -direct=0 -buffered=1 -fsync=0 -fdatasync=0 -sync=0 \
>         -runtime=300 -time_based

By doing only the writes in the cgroup, you might just be running into
balance_dirty_pages(), which wakes the flushers and slows the
writing/allocating task before hitting the cg memory limit.

I think the key to what happens in Chris's case is:

1) The cgroup has a certain share of dirty pages, but in aggregate
they are below the cgroup dirty limit (dirty < mdtc->avail * ratio)
such that no writeback/dirty throttling is triggered from
balance_dirty_pages().

2) An unthrottled burst of (non-dirtying) allocations causes reclaim
demand that suddenly exceeds the reclaimable clean pages on the LRU.

Now you get into a situation where allocation and reclaim rate exceeds
the writeback rate and the only reclaimable pages left on the LRU are
dirty. In this case reclaim needs to wake the flushers and wait for
writeback instead of blowing through the priority cycles and OOMing.

Chris might be causing 2) from the read side of the copy also being in
the cgroup. Especially if he's copying larger files that can saturate
the readahead window and cause bigger allocation bursts. Those
readahead pages are accounted to the cgroup and on the LRU as soon as
they're allocated, but remain locked and unreclaimable until the read
IO finishes.