linux-kernel - Re: [PATCH v11 0/3] cachestat: a new syscall for page cache state of files

Message-ID: <20230315191459.f3z3gahxdew4dwrv@awork3.anarazel.de>
Date:   Wed, 15 Mar 2023 12:14:59 -0700
From:   Andres Freund <andres@...razel.de>
To:     Johannes Weiner <hannes@...xchg.org>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Nhat Pham <nphamcs@...il.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, bfoster@...hat.com,
        willy@...radead.org, arnd@...db.de, linux-api@...r.kernel.org,
        kernel-team@...a.com
Subject: Re: [PATCH v11 0/3] cachestat: a new syscall for page cache state of
 files

Hi,

On 2023-03-15 13:09:34 -0400, Johannes Weiner wrote:
> On Tue, Mar 14, 2023 at 04:00:41PM -0700, Andrew Morton wrote:
> > A while ago I asked about the security implications - could cachestat()
> > be used to figure out what parts of a file another user is reading.
> > This also applies to mincore(), but cachestat() newly permits user A to
> > work out which parts of a file user B has *written* to.
>
> The caller of cachestat() must have the file open for reading. If they
> can read the contents that B has written, is the fact that they can
> see dirty state really a concern?

Random idea: Only fill ->dirty/writeback if the fd is open for writing.
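
For illustration only, a minimal userspace sketch of that policy in C. The
struct layout and the names cache_counts, flags_allow_write and
mask_write_state are hypothetical and not taken from the patch; in the kernel
the equivalent check would presumably test FMODE_WRITE on the struct file
rather than the O_ACCMODE bits used here.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result layout; field names are illustrative only. */
struct cache_counts {
    uint64_t nr_cache;     /* pages present in the page cache */
    uint64_t nr_dirty;     /* dirty pages */
    uint64_t nr_writeback; /* pages currently under writeback */
};

/* True if open flags (as returned by fcntl(fd, F_GETFL)) permit writing. */
static bool flags_allow_write(int open_flags)
{
    int acc = open_flags & O_ACCMODE;
    return acc == O_WRONLY || acc == O_RDWR;
}

/*
 * Zero the write-related counters unless the descriptor was opened for
 * writing, so a read-only caller learns residency but not dirty state.
 */
static void mask_write_state(struct cache_counts *c, int open_flags)
{
    assert(c != NULL);
    if (!flags_allow_write(open_flags)) {
        c->nr_dirty = 0;
        c->nr_writeback = 0;
    }
}
```

A caller would pass the result of fcntl(fd, F_GETFL) as open_flags.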


> > Secondly, I'm not seeing a description of any use cases.  OK, it's faster
> > and better than mincore(), but who cares?  In other words, what
> > end-user value compels us to add this feature to Linux?
>
> Years ago there was a thread about adding dirty bits to mincore(), I
> don't know if you remember this:
>
> https://lkml.org/lkml/2013/2/10/162
>
> In that thread, Rusty described a usecase of maintaining a journaling
> file alongside a main file. The idea for testing the dirty state isn't
> to call sync but to see whether the journal needs to be updated.
>
> The efficiency of mincore() was touched on too. Andres Freund (CC'd,
> hopefully I got the email address right) mentioned that Postgres has a
> usecase for deciding whether to do an index scan or query tables
> directly, based on whether the index is cached. Postgres works with
> files rather than memory regions, and Andres mentioned that the index
> could be quite large.

This is still relevant, FWIW. And not just for deciding on the optimal query
plan, but also for reporting purposes. We can show the user what part of the
query has done how much IO, but that can end up being quite confusing because
we're not aware of how much IO was fulfilled by the page cache.
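
For context, a sketch of how residency can be estimated today with the
existing mmap() + mincore() interface, which is the approach cachestat() is
being compared against in this thread. The helper name count_resident_pages
is hypothetical, and this is not what Postgres does; it only shows why the
existing route is awkward for large files (the whole file has to be mapped
and a one-byte-per-page vector allocated).

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Count how many pages of the file behind fd are resident in the page
 * cache. Stores the file's total page count in *total_pages.
 * Returns the resident count, or -1 on error.
 */
static long count_resident_pages(int fd, long *total_pages)
{
    struct stat st;
    assert(total_pages != NULL);
    if (fstat(fd, &st) != 0)
        return -1;

    long page = sysconf(_SC_PAGESIZE);
    long npages = (long)((st.st_size + page - 1) / page);
    *total_pages = npages;
    if (npages == 0)
        return 0; /* empty file: nothing to map */

    void *map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return -1;

    long resident = -1;
    unsigned char *vec = malloc((size_t)npages);
    if (vec != NULL && mincore(map, (size_t)st.st_size, vec) == 0) {
        resident = 0;
        for (long i = 0; i < npages; i++)
            resident += vec[i] & 1; /* low bit set means page is resident */
    }

    free(vec);
    munmap(map, (size_t)st.st_size);
    return resident;
}
```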


> Most recently, the database team at Meta reached out to us and asked
> about the ability to query dirty state again. The motivation for this
> was twofold. One was simply visibility into the writeback algorithm,
> i.e. trying to figure out what it's doing when investigating
> performance problems.
>
> The second usecase they brought up was to advise writeback from
> userspace to manage the tradeoff between integrity and IO utilization:
> if IO capacity is available, sync more frequently; if not, let the
> work batch up. Blindly syncing through the file in chunks doesn't work
> because you don't know in advance how much IO they'll end up doing (or
> how much they've done, afterwards.) So it's difficult to build an
> algorithm that will reasonably pace through sparsely dirtied regions
> without the risk of overwhelming the IO device on dense ones. And it's
> not straight-forward to do this from the kernel, since it doesn't know
> the IO headroom the application needs for reading (which is dynamic).

We ended up building something very roughly like that in userspace - each
backend tracks the last N writes, and once the number reaches a certain
limit, we sort and collapse the outstanding ranges and issue
sync_file_range(SYNC_FILE_RANGE_WRITE) for them. Different types of tasks have
different limits. Without that, latency in write-heavy workloads is ... not
good (to this day, but to a lesser degree than 5-10 years ago).
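
A minimal sketch of the sort-and-collapse step in C, under stated
assumptions: the pending_range struct and the collapse_ranges and
flush_pending helpers are hypothetical names, and this is not the actual
Postgres code. Only sync_file_range() and SYNC_FILE_RANGE_WRITE are real
Linux interfaces.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical record of one pending write: [offset, offset + nbytes). */
struct pending_range {
    off_t offset;
    off_t nbytes;
};

static int cmp_range(const void *a, const void *b)
{
    const struct pending_range *x = a, *y = b;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

/*
 * Sort ranges by offset and merge overlapping or adjacent ones in place.
 * Returns the number of ranges left at the front of the array.
 */
static size_t collapse_ranges(struct pending_range *r, size_t n)
{
    assert(r != NULL || n == 0);
    if (n == 0)
        return 0;

    qsort(r, n, sizeof(*r), cmp_range);

    size_t out = 0;
    for (size_t i = 1; i < n; i++) {
        off_t end = r[out].offset + r[out].nbytes;
        if (r[i].offset <= end) {
            /* Overlaps or touches the current range: extend it if needed. */
            off_t new_end = r[i].offset + r[i].nbytes;
            if (new_end > end)
                r[out].nbytes = new_end - r[out].offset;
        } else {
            r[++out] = r[i];
        }
    }
    return out + 1;
}

/* Start writeback for each collapsed range; 0 on success, -1 on failure. */
static int flush_pending(int fd, struct pending_range *r, size_t n)
{
    n = collapse_ranges(r, n);
    for (size_t i = 0; i < n; i++) {
        if (sync_file_range(fd, r[i].offset, r[i].nbytes,
                            SYNC_FILE_RANGE_WRITE) != 0)
            return -1;
    }
    return 0;
}
```

Collapsing first keeps the number of syscalls proportional to the number of
distinct dirty regions rather than to the number of writes tracked.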


> Another query we get almost monthly is service owners trying to
> understand where their memory is going and what's causing unexpected
> pressure on a host. They see the cache in vmstat, but between a
> complex application, shared libraries or a runtime (jvm, hhvm etc.)
> and a myriad of host management agents, there is so much going on on
> the machine that it's hard to find out who is touching which
> files. When it comes to disk usage, the kernel provides the ability to
> quickly stat entire filesystem subtrees and drill down with tools like
> du. It sure would be useful to have the same for memory usage.

+1

Greetings,

Andres Freund