lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <E1JHbs1-00025n-Ac@pomaz-ex.szeredi.hu>
Date:	Wed, 23 Jan 2008 10:26:09 +0100
From:	Miklos Szeredi <miklos@...redi.hu>
To:	salikhmetov@...il.com
CC:	linux-mm@...ck.org, jakob@...hought.net,
	linux-kernel@...r.kernel.org, valdis.kletnieks@...edu,
	riel@...hat.com, ksm@...dk, staubach@...hat.com,
	jesper.juhl@...il.com, torvalds@...ux-foundation.org,
	a.p.zijlstra@...llo.nl, akpm@...ux-foundation.org,
	protasnb@...il.com, miklos@...redi.hu, r.e.wolff@...wizard.nl,
	hidave.darkstar@...il.com, hch@...radead.org
Subject: Re: [PATCH -v8 4/4] The design document for memory-mapped file times update

I think it would be more logical to move this patch forward, before
the two patches which this document is actually describing.

> Add a document, which describes how the POSIX requirements on updating
> memory-mapped file times are addressed in Linux.
> 
> Signed-off-by: Anton Salikhmetov <salikhmetov@...il.com>
> ---
>  Documentation/vm/00-INDEX  |    2 +
>  Documentation/vm/msync.txt |  117 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 119 insertions(+), 0 deletions(-)
> 
> diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX
> index 2131b00..2726c8d 100644
> --- a/Documentation/vm/00-INDEX
> +++ b/Documentation/vm/00-INDEX
> @@ -6,6 +6,8 @@ hugetlbpage.txt
>  	- a brief summary of hugetlbpage support in the Linux kernel.
>  locking
>  	- info on how locking and synchronization is done in the Linux vm code.
> +msync.txt
> +	- the design document for memory-mapped file times update
>  numa
>  	- information about NUMA specific code in the Linux vm.
>  numa_memory_policy.txt
> diff --git a/Documentation/vm/msync.txt b/Documentation/vm/msync.txt
> new file mode 100644
> index 0000000..571a766
> --- /dev/null
> +++ b/Documentation/vm/msync.txt
> @@ -0,0 +1,117 @@
> +
> +	The msync() system call and memory-mapped file times
> +
> +	Copyright (C) 2008 Anton Salikhmetov
> +
> +The POSIX standard requires that any write reference to memory-mapped file
> +data should result in updating the ctime and mtime for that file. Moreover,
> +the standard mandates that updated file times should become visible to the
> +world no later than at the next call to msync().
> +
> +Failure to meet this requirement creates difficulties for certain classes
> +of important applications. For instance, database backup systems fail to
> +pick up the files modified via the mmap() interface. Also, this is a
> +security hole, which allows forging file data in such a manner that proving
> +the fact that file data was modified is not possible.
> +
> +Briefly put, this requirement can be stated as follows:
> +
> +	once the file data has changed, the operating system
> +	should acknowledge this fact by updating file metadata.
> +
> +This document describes how this POSIX requirement is addressed in Linux.
> +
> +1. Requirements
> +
> +1.1) the POSIX standard requires updating ctime and mtime not later
> +than at the call to msync() with MS_SYNC or MS_ASYNC flags;
> +
> +1.2) in existing POSIX implementations, ctime and mtime
> +get updated not later than at the call to fsync();
> +
> +1.3) in existing POSIX implementation, ctime and mtime
> +get updated not later than at the call to sync(), the "auto-update" feature;
> +
> +1.4) the customers require and the common sense suggests that
> +ctime and mtime should be updated not later than at the call to munmap()
> +or exit(), the latter function implying an implicit call to munmap();
> +
> +1.5) the (1.1) item should be satisfied if the file is a block device
> +special file;
> +
> +1.6) the (1.1) item should be satisfied for files residing on
> +memory-backed filesystems such as tmpfs, too.
> +
> +The following operating systems were used as the reference platforms
> +and are referred to as the "existing implementations" above:
> +HP-UX B.11.31 and FreeBSD 6.2-RELEASE.
> +
> +2. Lazy update
> +
> +Many attempts before the current version implemented the "lazy update" approach
> +to satisfying the requirements given above. Within the latter approach, ctime
> +and mtime get updated at last moment allowable.
> +
> +Since we don't update the file times immediately, some Flag has to be
> +used. When up, this Flag means that the file data was modified and
> +the file times need to be updated as soon as possible.
> +
> +Any existing "dirty" flag which, when up, mean that a page has been written to,
> +is not suitable for this purpose. Indeed, msync() called with MS_ASYNC
> +would have to reset this "dirty" flag after updating ctime and mtime.
> +The sys_msync() function itself is basically a no-op in the MS_ASYNC case.
> +Thereby, the synchronization routines relying upon this "dirty" flag
> +would lose data. Therefore, a new Flag has to be introduced.
> +
> +The (1.5) item coupled with (1.3) requirement leads to hard work with
> +the block device inodes. Specifically, during writeback it is impossible to
> +tell which block device file was originally mapped. Therefore, we need to
> +traverse the list of "active" devices associated with the block device inode.
> +This would lead to updating file times for block device files, which were not
> +taking part in the data transfer.
> +
> +Also all versions prior to version 6 failed to correctly process ctime and
> +mtime for files on the memory-backed filesystems such as tmpfs. So the (1.6)
> +requirement was not satisfied.

Version -8 also fails: for ram backed filesystems page tables are not
write protected initially, nor after a sync.  This patch does write
protect them after an MS_ASYNC, but that's a bug in the current
context.

> +
> +If a write reference has occurred between two consecutive calls to msync()
> +with MS_ASYNC, the second call to the latter function should take into
> +account the last write reference. The last write reference can not be caught
> +if no pagefault occurs. Hence a pagefault needs to be forced. This can be done
> +using two different approaches. The first one is to synchronize data even when
> +msync() was called with MS_ASYNC. This is not acceptable because the current
> +design of the sys_msync() routine forbids starting I/O for the MS_ASYNC case.

I don't think anyone forbids starting I/O, it's just too expensive,
especially if it means, waiting for previous writeback on page to
finish first.

> +The second approach is to write protect the page for triggering a pagefault
> +at the next write reference. Note that the dirty flag for the page should not
> +be cleared thereby.
> +
> +In the "lazy update" approach, the requirements (1.1), (1.2), (1.3), and (1.4)
> +taken together result in adding code at least to the following kernel routines:
> +sys_msync(), do_fsync(), some routine in the unmap() call path, some routine
> +in the sync() call path.
> +
> +Finally, a file_update_time()-like function would have to be created for
> +processing the inode objects, not file objects. This is due to the fact that
> +during the sync() operation, the file object may not exist any more, only
> +the inode is known.
> +
> +To sum up: this "lazy" approach leads to massive changes, incurs overhead in
> +the block device case, and requires complicated design decisions.
> +
> +3. Immediate update
> +
> +OK, still reading? There's a better way.
> +
> +In a fashion analogous to what happens at write(2), react to the fact
> +that the page gets dirtied by updating the file times immediately.
> +Thereby any page writeback happens when the write reference has already
> +been accounted for from the view point of file times.
> +
> +The only problem which remains is to force refreshing file times at the write
> +reference following a call to msync() with MS_ASYNC. As mentioned above, all
> +that is needed here is to force a pagefault.
> +
> +The vma_wrprotect() routine introduced in this patch series is called
> +from sys_msync() in the MS_ASYNC case. The former routine is essentially
> +a version of existing page_mkclean_one() function from mm/rmap.c. Unlike
> +the latter function, the vma_wrprotect() does not touch the dirty bit.

Benchmark results should be also added to the relevant sections, I
think.  There is a very definite cost to all this, and a 10x slowdown
is usually not taken lightly...

Great document, btw :)

Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ