linux-kernel - Re: [PATCH 1/3] fs: Introduce new flag FALLOC_FL_COLLAPSE

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20130801043220.GN7118@dastard>
Date:	Thu, 1 Aug 2013 14:32:20 +1000
From:	Dave Chinner <david@...morbit.com>
To:	Theodore Ts'o <tytso@....edu>, Namjae Jeon <linkinjeon@...il.com>,
	adilger.kernel@...ger.ca, bpm@....com, elder@...nel.org,
	hch@...radead.org, linux-fsdevel@...r.kernel.org,
	linux-kernel@...r.kernel.org, linux-ext4@...r.kernel.org,
	xfs@....sgi.com, a.sangwan@...sung.com,
	Namjae Jeon <namjae.jeon@...sung.com>
Subject: Re: [PATCH 1/3] fs: Introduce new flag FALLOC_FL_COLLAPSE_RANGE

On Thu, Aug 01, 2013 at 12:06:05AM -0400, Theodore Ts'o wrote:
> On Thu, Aug 01, 2013 at 12:59:14PM +1000, Dave Chinner wrote:
> > 
> > This funtionality is not introducing any new problems w.r.t. mmap().
> > In terms of truncating a mmap'd file, that can already occur and
> > the behaviour is well known.
> 
> That's not race I'm worried about.  Consider the following scenario.
> We have a 10 meg file, and the first five megs are mapped read/write.
> We do a collapse range of one meg starting at the one meg boundary.
> Menwhile another process is constantly modifying the pages between the
> 3 meg and 4 meg mark, via a mmap'ed region of memory.
> 
> You can have the code which does the collapse range call
> filemap_fdatawrite_range() to flush out to disk all of the inode's
> dirty pages in the page cache, but between the call to
> filemap_fdatawrite_range() and truncate_page_cache_range(), more pages
> could have gotten dirtied, and those changes will get lost.

That's one of the many race conditions that Jan's patches are
intended to solve. Locking the range and then unmapping the pages
will then cause the refault to block until the range is unlocked.

While the range is locked, we can then writeback all the dirty
pages, invalidate the page cache and modify the underlying extents
without having to care about user page faults racing with the
operation.

Let's face it, without that we current cannot serialise *any*
multi-page page cache operation against mmap() - direct IO has
exactly the same problems as you are describing, as does hole
punching, and anything else that relies on the i_mutex or other
IO-based locks for serialisation.

> The difference between punch_hole and collapse_range is that
> previously, we only needed to call truncate_page_cache_range() on the
> pages that were going to be disappearing from the page cache, so it
> didn't matter if you had someone dirtying the pages racing with the
> punch_hole operation.  But in the collapse_range case, we need to call
> truncate_page_cache_range() on data that is not going to disappear,
> but rather be _renumbered_.

Sure - as it said up front "the range of the pages being operated
on" need to be invalidating. And, if you look at the ext4 patch, is
actaully does exactly that....

> I'll also note that depending on the use case, doing the renumbering
> of the pages by throwing all of the pages from the page cache and
> forcing them to be read back in from disk might not all be friendly to
> a performance sensitive application.

Application problem.

Besides, this is for accelerating video stream editting (i.e. NLE
workflows). Such applications tend to use direct IO for recording
and playback, and they most certainly won't have concurrent access
to the video stream either by mmap or direct/buferred IO while the
non linear editor is punching the advertisements out of the recorded
video stream......

> In the case where the page size == the fs block size, instead of
> throwing away all of the pages, we could also have VM code which
> remaps the pages in the page cache (including potentially editing the
> page tables for any mmap'ed pages).

That way lies insanity. Not to mention that it's completely
unnecessary for the use case this ioctl is actually accelerating.

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/