Message-ID: <17d719b79f9.d89bf95117881.5882353172682156775@mykernel.net>
Date:   Wed, 01 Dec 2021 00:09:14 +0800
From:   Chengguang Xu <cgxu519@...ernel.net>
To:     "Jan Kara" <jack@...e.cz>
Cc:     "Miklos Szeredi" <miklos@...redi.hu>,
        "Amir Goldstein" <amir73il@...il.com>,
        "linux-fsdevel" <linux-fsdevel@...r.kernel.org>,
        "overlayfs" <linux-unionfs@...r.kernel.org>,
        "linux-kernel" <linux-kernel@...r.kernel.org>,
        "ronyjin" <ronyjin@...cent.com>,
        "charliecgxu" <charliecgxu@...cent.com>
Subject: Re: [RFC PATCH v5 06/10] ovl: implement overlayfs' ->write_inode
 operation


 ---- On Tue, 2021-11-30 19:22:06 Jan Kara <jack@...e.cz> wrote ----
 > On Fri 19-11-21 14:12:46, Chengguang Xu wrote:
 > >  ---- On Fri, 2021-11-19 00:43:49 Jan Kara <jack@...e.cz> wrote ----
 > >  > On Thu 18-11-21 20:02:09, Chengguang Xu wrote:
 > >  > >  ---- On Thu, 2021-11-18 19:23:15 Jan Kara <jack@...e.cz> wrote ----
 > >  > >  > On Thu 18-11-21 14:32:36, Chengguang Xu wrote:
 > >  > >  > > 
 > >  > >  > >  ---- On Wed, 2021-11-17 14:11:29 Chengguang Xu <cgxu519@...ernel.net> wrote ----
 > >  > >  > >  >  ---- On Tue, 2021-11-16 20:35:55 Miklos Szeredi <miklos@...redi.hu> wrote ----
 > >  > >  > >  >  > On Tue, 16 Nov 2021 at 03:20, Chengguang Xu <cgxu519@...ernel.net> wrote:
 > >  > >  > >  >  > >
 > >  > >  > >  >  > >  ---- On Thu, 2021-10-07 21:34:19 Miklos Szeredi <miklos@...redi.hu> wrote ----
 > >  > >  > >  >  > >  > On Thu, 7 Oct 2021 at 15:10, Chengguang Xu <cgxu519@...ernel.net> wrote:
 > >  > >  > >  >  > >  > >  > However, that wasn't what I was asking about.  AFAICS ->write_inode()
 > >  > >  > >  >  > >  > >  > won't start writeback for dirty pages.  Maybe I'm missing something,
 > >  > >  > >  >  > >  > >  > but it looks as if nothing will actually trigger writeback for
 > >  > >  > >  >  > >  > >  > dirty pages in the upper inode.
 > >  > >  > >  >  > >  > >  >
 > >  > >  > >  >  > >  > >
 > >  > >  > >  >  > >  > > Actually, page writeback on the upper inode will be triggered by overlayfs' ->writepages, and
 > >  > >  > >  >  > >  > > overlayfs' ->writepages will be called by the VFS writeback code (i.e. writeback_sb_inodes).
 > >  > >  > >  >  > >  >
 > >  > >  > >  >  > >  > Right.
 > >  > >  > >  >  > >  >
 > >  > >  > >  >  > >  > But wouldn't it be simpler to do this from ->write_inode()?
 > >  > >  > >  >  > >  >
 > >  > >  > >  >  > >  > I.e. call write_inode_now() as suggested by Jan.
 > >  > >  > >  >  > >  >
 > >  > >  > >  >  > >  > Also, we could just call mark_inode_dirty() on the overlay inode
 > >  > >  > >  >  > >  > regardless of the dirty flags on the upper inode, since it shouldn't
 > >  > >  > >  >  > >  > matter and results in simpler logic.
 > >  > >  > >  >  > >  >
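For illustration, a minimal sketch of what such an overlay ->write_inode could look like, following the suggestion above (a sketch only, not the actual patch; ovl_inode_upper() is the existing overlayfs helper, everything else is assumed):

static int ovl_write_inode(struct inode *inode, struct writeback_control *wbc)
{
	struct inode *upper = ovl_inode_upper(inode);

	if (!upper)
		return 0;

	/* Write back the upper inode's dirty pages and metadata; wait for
	 * completion when the caller asked for data integrity (WB_SYNC_ALL). */
	return write_inode_now(upper, wbc->sync_mode == WB_SYNC_ALL);
}
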
 > >  > >  > >  >  > >
 > >  > >  > >  >  > > Hi Miklos,
 > >  > >  > >  >  > >
 > >  > >  > >  >  > > Sorry for the delayed response; I've been busy with another project.
 > >  > >  > >  >  > >
 > >  > >  > >  >  > > I agree with your suggestion above, and furthermore, how about just marking the overlay inode dirty
 > >  > >  > >  >  > > whenever it has an upper inode? This approach would make the dirty-marking logic simple enough.
 > >  > >  > >  >  > 
 > >  > >  > >  >  > Are you suggesting that all non-lower overlay inodes should always be dirty?
 > >  > >  > >  >  > 
 > >  > >  > >  >  > The logic would be simple, no doubt, but there's the cost of walking
 > >  > >  > >  >  > those overlay inodes which don't have a dirty upper inode, right?
 > >  > >  > >  > 
 > >  > >  > >  > That's true.
 > >  > >  > >  > 
 > >  > >  > >  >  > Can you quantify this cost with a benchmark?  It can be totally synthetic,
 > >  > >  > >  >  > e.g. look up a million upper files without modifying them, then call
 > >  > >  > >  >  > syncfs.
 > >  > >  > >  >  > 
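A rough sketch of such a synthetic test could look like the following (the file-%d layout under a single upper dir is an assumption; the interesting number is the time of the final syncfs(2)):

#define _GNU_SOURCE		/* for syncfs() */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *dir = argc > 1 ? argv[1] : ".";
	char path[4096];
	struct stat st;
	int fd, i;

	/* Look up (but do not modify) a million upper files so their
	 * overlay inodes get instantiated. */
	for (i = 0; i < 1000000; i++) {
		snprintf(path, sizeof(path), "%s/file-%d", dir, i);
		stat(path, &st);
	}

	fd = open(dir, O_RDONLY | O_DIRECTORY);
	if (fd < 0)
		return 1;

	/* Time this call, e.g. with time(1) around the whole program. */
	syncfs(fd);
	close(fd);
	return 0;
}
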
 > >  > >  > >  > 
 > >  > >  > No problem, I'll run some performance tests.
 > >  > >  > >  > 
 > >  > >  > > 
 > >  > >  > > Hi Miklos,
 > >  > >  > > 
 > >  > >  > > I did some rough tests and the results are as below.  In practice, I don't
 > >  > >  > > think the extra 1.3s of syncfs time will cause a significant problem.
 > >  > >  > > What do you think?
 > >  > >  > 
 > >  > >  > Well, burning 1.3s worth of CPU time for doing nothing seems like quite a
 > >  > >  > bit to me. I understand this is with 1000000 inodes, but although that is
 > >  > >  > quite a few, it is not unheard of. If there were several containers
 > >  > >  > calling syncfs(2) on the machine, they could easily hog the machine... That
 > >  > >  > is why I was originally against keeping overlay inodes always dirty and
 > >  > >  > wanted their dirtiness to at least roughly track the real need to do
 > >  > >  > writeback.
 > >  > >  > 
 > >  > > 
 > >  > > Hi Jan,
 > >  > > 
 > >  > > Actually, the user and sys times are almost the same as when executing syncfs directly on the underlying fs.
 > >  > > IMO, it only extends the syncfs(2) waiting time for the particular container but doesn't burn CPU.
 > >  > > What am I missing?
 > >  > 
 > >  > Ah, right, I missed that only the real time changed, not the sys time. I'm sorry
 > >  > for the confusion. But why did the real time increase so much? Are we waiting
 > >  > for some IO?
 > >  > 
 > > 
 > > There are many calls to cond_resched() in the writeback process,
 > > so the syncfs process was rescheduled several times.
 > 
 > I was thinking about this a bit more and I don't think I buy this
 > explanation. What I rather think is happening is that the real work for syncfs
 > (writeback_inodes_sb() and sync_inodes_sb() calls) gets offloaded to a flush
 > worker. E.g. writeback_inodes_sb() ends up calling
 > __writeback_inodes_sb_nr() which does:
 > 
 > bdi_split_work_to_wbs()
 > wb_wait_for_completion()
 > 
 > So you don't see the work done in the times accounted to your test
 > program. But in practice the flush worker is indeed burning 1.3s worth of
 > CPU to scan the 1 million inode list and do nothing.
 > 
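Paraphrasing the path described above (simplified, not verbatim fs/fs-writeback.c): the syncfs() caller only queues the work and sleeps, while a per-bdi flush worker does the actual inode-list scan, so that CPU time is charged to the worker rather than to the test program:

static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
				     enum wb_reason reason, bool skip_if_busy)
{
	struct wb_writeback_work work = {
		.sb        = sb,
		.sync_mode = WB_SYNC_NONE,
		.nr_pages  = nr,
		.reason    = reason,
		/* ... completion to wait on, etc. ... */
	};

	/* Hand the work off to the bdi's flush worker(s)... */
	bdi_split_work_to_wbs(sb->s_bdi, &work, skip_if_busy);
	/* ...and sleep until they finish, so the million-inode scan shows up
	 * as worker CPU time, not in the caller's user/sys time. */
	wb_wait_for_completion(/* ... */);
}
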

That makes sense. However, in the real container use case the upper dir is always empty,
so I don't think there is a meaningful difference compared to accurately marking the
overlay inode dirty.

I'm not very familiar with overlayfs use cases other than containers; should we consider
other use cases? Maybe we can also ignore the CPU burden there, because those use cases
don't have dense deployment like containers do.



Thanks,
Chengguang


