linux-kernel - Re: [RFC PATCH 1/4] locking/percpu-rwsem: add freezable alternative to down

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <ophjhrnyzl7jp55qj35kgiz3zflopsiuiwg4dhxmkyua2w46nz@5p2e2djo3vw4>
Date: Wed, 2 Apr 2025 13:47:06 +0200
From: Jan Kara <jack@...e.cz>
To: James Bottomley <James.Bottomley@...senpartnership.com>
Cc: Jan Kara <jack@...e.cz>, Christian Brauner <brauner@...nel.org>, 
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org, mcgrof@...nel.org, 
	hch@...radead.org, david@...morbit.com, rafael@...nel.org, djwong@...nel.org, 
	pavel@...nel.org, peterz@...radead.org, mingo@...hat.com, will@...nel.org, 
	boqun.feng@...il.com
Subject: Re: [RFC PATCH 1/4] locking/percpu-rwsem: add freezable alternative
 to down_read

On Tue 01-04-25 08:52:02, James Bottomley wrote:
> On Tue, 2025-04-01 at 13:20 +0200, Jan Kara wrote:
> > On Mon 31-03-25 21:13:20, James Bottomley wrote:
> > > On Tue, 2025-04-01 at 01:32 +0200, Christian Brauner wrote:
> [...]
> > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > index b379a46b5576..528e73f192ac 100644
> > > > --- a/include/linux/fs.h
> > > > +++ b/include/linux/fs.h
> > > > @@ -1782,7 +1782,8 @@ static inline void __sb_end_write(struct
> > > > super_block *sb, int level)
> > > >  static inline void __sb_start_write(struct super_block *sb, int
> > > > level)
> > > >  {
> > > >         percpu_down_read_freezable(sb->s_writers.rw_sem + level -
> > > > 1,
> > > > -                                  level == SB_FREEZE_WRITE);
> > > > +                                  (level == SB_FREEZE_WRITE ||
> > > > +                                   level ==
> > > > SB_FREEZE_PAGEFAULT));
> > > >  }
> > > 
> > > Yes, I was about to tell Jan that the condition here simply needs
> > > to be true.  All our rwsem levels need to be freezable to avoid a
> > > hibernation failure.
> > 
> > So there is one snag with this. SB_FREEZE_PAGEFAULT level is acquired
> > under mmap_sem, SB_FREEZE_INTERNAL level is possibly acquired under
> > some other filesystem locks.
> 
> Just for SB_FREEZE_INTERNAL, I think there's no case of
> sb_start_intwrite() that can ever hold in D wait because by the time we
> acquire the semaphore for write, the internal freeze_fs should have
> been called and the filesystem should have quiesced itself.  On the
> other hand, if that theory itself is true, there's no real need for
> sb_start_intwrite() at all because it can never conflict.

This is not true. Sure, userspace should all be blocked, dirty pages
written back, but you still have filesystem background tasks like lazy
initialization of inode tables, inode garbage collection, regular lazy
updates of statistics in the superblock. These generally happen from
kthreads / work queues and they can still be scheduled and executed
although freeze_super() has started blocking SB_FREEZE_WRITE and
SB_FREEZE_PAGEFAULT levels... And generally this freeze level is there
exactly because it needs to be acquired from locking context which doesn't
allow usage of SB_FREEZE_WRITE or SB_FREEZE_PAGEFAULT levels.

> >  So if you freeze the filesystem, a task can block on frozen
> > filesystem with e.g. mmap_sem held and if some other task then blocks
> > on grabbing that mmap_sem, hibernation fails because we'll be unable
> > to hibernate the task waiting for mmap_sem. So if you'd like to
> > completely avoid these hibernation failures, you'd have to make a
> > slew of filesystem related locks use freezable sleeping. I don't
> > think that's feasible.
> 
> I wouldn't see that because I'm on x86_64 and that takes the vma_lock
> in page faults not the mmap_lock.  The granularity of all these locks
> is process level, so it's hard to see what they'd be racing with ...

I agree that because of vma_lock it would be much harder to see this. But
as far as I remember mmap_sem is still a fallback option when we race with
VMA modification even for x86 so this problem is possible to hit, just much
more unlikely.

> even if I conjecture two threads trying to write to something, they'd
> have to have some internal co-ordination which would likely prevent the
> second one from writing if the first got stuck on the page fault. 
> 
> > I was hoping that failures due to SB_FREEZE_PAGEFAULT level not being
> > freezable would be rare enough but you've proven they are quite
> > frequent. We can try making SB_FREEZE_PAGEFAULT level (or even
> > SB_FREEZE_INTERNAL) freezable and see whether that works good
> > enough...
> 
> I'll try to construct a more severe test than systemd-journald ... it
> looks to be single threaded in its operation.

OK, thanks!

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR