Message-ID: <CAH9Oa-YxL1iu_TVn6bL3Nd4qzYSVDPaO9a96sX4u7dhq+ewasA@mail.gmail.com>
Date:   Tue, 22 Jun 2021 14:29:32 +0200
From:   Michael Stapelberg <stapelberg+linux@...gle.com>
To:     Jan Kara <jack@...e.cz>
Cc:     Miklos Szeredi <miklos@...redi.hu>,
        Andrew Morton <akpm@...ux-foundation.org>,
        linux-kernel@...r.kernel.org, linux-mm <linux-mm@...ck.org>,
        linux-fsdevel@...r.kernel.org, Tejun Heo <tj@...nel.org>,
        Dennis Zhou <dennis@...nel.org>, Jens Axboe <axboe@...nel.dk>,
        Roman Gushchin <guro@...com>,
        Johannes Thumshirn <johannes.thumshirn@....com>,
        Song Liu <song@...nel.org>, David Sterba <dsterba@...e.com>
Subject: Re: [PATCH] backing_dev_info: introduce min_bw/max_bw limits

Thanks for taking a look! Comments inline:

On Tue, 22 Jun 2021 at 14:12, Jan Kara <jack@...e.cz> wrote:
>
> On Mon 21-06-21 11:20:10, Michael Stapelberg wrote:
> > Hey Miklos
> >
> > On Fri, 18 Jun 2021 at 16:42, Miklos Szeredi <miklos@...redi.hu> wrote:
> > >
> > > On Fri, 18 Jun 2021 at 10:31, Michael Stapelberg
> > > <stapelberg+linux@...gle.com> wrote:
> > >
> > > > Maybe, but I don’t have the expertise, motivation or time to
> > > > investigate this any further, let alone commit to get it done.
> > > > During our previous discussion I got the impression that nobody else
> > > > had any cycles for this either:
> > > > https://lore.kernel.org/linux-fsdevel/CANnVG6n=ySfe1gOr=0ituQidp56idGARDKHzP0hv=ERedeMrMA@mail.gmail.com/
> > > >
> > > > Have you had a look at the China LSF report at
> > > > http://bardofschool.blogspot.com/2011/?
> > > > The author of the heuristic has spent significant effort and time
> > > > coming up with what we currently have in the kernel:
> > > >
> > > > """
> > > > Fengguang said he draw more than 10K performance graphs and read even
> > > > more in the past year.
> > > > """
> > > >
> > > > This implies that making changes to the heuristic will not be a quick fix.
> > >
> > > Having a piece of kernel code sitting there that nobody is willing to
> > > fix is certainly not a great situation to be in.
> >
> > Agreed.
> >
> > >
> > > And introducing band aids is not going to improve the above situation,
> > > more likely it will prolong it even further.
> >
> > Sounds like “Perfect is the enemy of good” to me: you’re looking for a
> > perfect hypothetical solution, whereas we have a known-working, low-risk
> > fix for a real problem.
> >
> > Could we find a solution where, medium- to long-term, the code in question
> > is improved, perhaps via a Summer of Code project or similar community
> > efforts, but until then, we apply the patch at hand?
> >
> > As I mentioned, I think adding min/max limits can be useful regardless
> > of how the heuristic itself changes.
> >
> > If that turns out to be incorrect or undesired, we can still turn the
> > knobs into a no-op, if removal isn’t an option.
>
> Well, removal of added knobs is more or less out of the question as it can
> break some userspace. Similarly making them no-op is problematic unless we
> are pretty certain it cannot break some existing setup. That's why we have
> to think twice (or better three times ;) before adding any knobs. Also
> honestly the knobs you suggest will be pretty hard to tune when there are
> multiple cgroups with writeback control involved (which can be affected by
> the same problems you observe as well). So I agree with Miklos that this is
> not the right way to go. Speaking of tunables, did you try tuning
> /sys/devices/virtual/bdi/<fuse-bdi>/min_ratio? I suspect that may
> work around your problems...

Back then, I did try the various tunables (vm.dirty_ratio and
vm.dirty_background_ratio on the global level,
/sys/class/bdi/<bdi>/{min,max}_ratio on the file system level), and
they had no observable effect on the problem at all in my tests.
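
For anyone who wants to check the same tunables on their own system, here is
a minimal sketch that reads them. It assumes the usual procfs locations for
the vm.* sysctls (/proc/sys/vm/...) and the /sys/class/bdi/<bdi>/ files named
above. The function name and its "root" parameter are hypothetical and exist
only so that the code can be pointed at a test directory tree.

```python
from pathlib import Path


def read_writeback_tunables(root="/"):
    """Collect the writeback tunables mentioned in this thread.

    `root` is a hypothetical parameter: it defaults to the real filesystem
    root, but can point at a fake tree for testing.
    """
    root = Path(root)
    result = {"vm": {}, "bdi": {}}

    # Global knobs: vm.dirty_ratio and vm.dirty_background_ratio.
    for name in ("dirty_ratio", "dirty_background_ratio"):
        path = root / "proc/sys/vm" / name
        if path.is_file():
            result["vm"][name] = int(path.read_text().strip())

    # Per-device knobs: /sys/class/bdi/<bdi>/{min,max}_ratio.
    bdi_dir = root / "sys/class/bdi"
    if bdi_dir.is_dir():
        for bdi in sorted(bdi_dir.iterdir()):
            values = {}
            for name in ("min_ratio", "max_ratio"):
                path = bdi / name
                if path.is_file():
                    values[name] = int(path.read_text().strip())
            if values:
                result["bdi"][bdi.name] = values

    return result
```

Calling read_writeback_tunables() with no arguments reads the live system and
returns a dictionary with one "vm" entry and one "bdi" entry per device found.
Missing files are skipped rather than treated as errors.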

>
> Looking into your original report and tracing you did (thanks for that,
> really useful), it seems that the problem is that writeback bandwidth is
> updated at most every 200ms (more frequent calls are just ignored), and the
> updates are triggered only from balance_dirty_pages() (which happens when
> pages are dirtied) and inode writeback code. So if the workload tends to have
> short spikes of activity and extended periods of quiet time, then writeback
> bandwidth may indeed be seriously miscomputed because we just won't update
> writeback throughput after most of writeback has happened, as you observed.
>
> I think the fix for this can be relatively simple. We just need to make
> sure we update writeback bandwidth reasonably quickly after the IO
> finishes. I'll write a patch and see if it helps.

Thank you! Please keep us posted.
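
To make the failure mode Jan describes above concrete, here is a toy model of
a rate-limited bandwidth estimate. It is only a sketch under stated
assumptions and not the kernel's actual algorithm: the class and function
names are hypothetical, the 100 MB/s device speed, the 1 second burst, and the
10 second gap are made up for illustration, and the estimate is a plain
"bytes written divided by elapsed time" with no smoothing. The only detail
taken from the thread is the 200ms minimum interval between updates.

```python
UPDATE_INTERVAL = 0.2        # seconds; the 200ms rate limit from the thread
DEVICE_BW = 100_000_000      # assumed true device throughput, bytes/second
BURST_SECONDS = 1.0          # assumed: writeback of one burst takes 1 second


def written_bytes(t):
    """Total bytes written back by time t for a single burst starting at t=0."""
    return int(min(max(t, 0.0), BURST_SECONDS) * DEVICE_BW)


class BandwidthEstimator:
    """Toy estimator: updates only when called, at most every 200ms."""

    def __init__(self):
        self.last_update = 0.0
        self.written_at_last_update = 0
        self.estimate = None  # bytes/second, None until the first update

    def maybe_update(self, now, total_written):
        elapsed = now - self.last_update
        if elapsed < UPDATE_INTERVAL:
            return False  # more frequent calls are just ignored
        delta = total_written - self.written_at_last_update
        self.estimate = delta / elapsed
        self.last_update = now
        self.written_at_last_update = total_written
        return True


def simulate(trigger_times):
    """Return the estimate produced at each trigger that was not ignored."""
    est = BandwidthEstimator()
    estimates = []
    for now in trigger_times:
        if est.maybe_update(now, written_bytes(now)):
            estimates.append(est.estimate)
    return estimates
```

In this model, simulate([0.0, 10.0]) stands for a bursty workload where the
next update is only triggered when pages are dirtied again 10 seconds later.
It yields 10 MB/s, because the quiet time is averaged into the estimate.
simulate([0.0, 1.0]) stands for an update shortly after the IO finishes and
yields the assumed true rate of 100 MB/s.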