linux-kernel - Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)


Message-ID: <CA+55aFwzEs3F6-n-Q=DrJgCScYwQ29PVZdCrGAO2y+FPC3Eq=Q@mail.gmail.com>
Date:	Wed, 28 Nov 2012 12:47:58 -0800
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Mikulas Patocka <mpatocka@...hat.com>
Cc:	Jens Axboe <axboe@...nel.dk>,
	Jeff Chua <jeff.chua.linux@...il.com>,
	Lai Jiangshan <laijs@...fujitsu.com>, Jan Kara <jack@...e.cz>,
	lkml <linux-kernel@...r.kernel.org>,
	linux-fsdevel <linux-fsdevel@...r.kernel.org>
Subject: Re: [PATCH] Introduce a method to catch mmap_region (was: Recent
 kernel "mount" slow)

On Wed, Nov 28, 2012 at 12:32 PM, Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
>
> Here is a *COMPLETELY* untested patch. Caveat emptor. It will probably
> do unspeakable things to your family and pets.

Btw, *if* this approach works, I suspect we could just switch the
bd_block_size_semaphore semaphore to be a regular rw-sem.

Why? Because now it's no longer ever gotten in the cached IO paths, we
only get it when we're doing much more expensive things (ie actual IO,
and buffer head allocations etc etc). As long as we just work with the
page cache, we never get to the whole lock at all.

Which means that the whole percpu-optimized thing is likely no longer
all that relevant.

But that's an independent thing, and it's only true *if* my patch
works. It looks fine on paper, but maybe there's something
fundamentally broken about it.

One big change my patch does is to move the sync_bdev/kill_bdev to
*after* changing the block size. It does that so that it can guarantee
that any old data (which didn't see the new block size) will be
sync'ed even if there is new IO coming in as we change the block size.

The old code locked the whole sync() region, which doesn't work with
my approach, since the sync will do IO and would thus cause potential
deadlocks while holding the rwsem for writing.

So with this patch, as the block size changes, you can actually have
some old pages with the old block size *and* some different new pages
with the new block size all at the same time. It should all be
perfectly fine, but it's worth pointing out.

(It probably won't trigger in practice, though, since doing IO while
somebody else is changing the blocksize is fundamentally an odd thing
to do, but whatever. I also suspect that we *should* perhaps use the
inode->i_sem thing to serialize concurrent block size changes, but
that's again an independent issue)

                           Linus