Message-ID: <CAO8a2ShwBbMau+N_GstVbbrDPF9gwg30Ztg8keTEva2fZp=Q3Q@mail.gmail.com>
Date: Tue, 8 Jul 2025 12:20:42 +0300
From: Alex Markuze <amarkuze@...hat.com>
To: Viacheslav Dubeyko <Slava.Dubeyko@....com>
Cc: "ceph-devel@...r.kernel.org" <ceph-devel@...r.kernel.org>,
"abinashlalotra@...il.com" <abinashlalotra@...il.com>, Xiubo Li <xiubli@...hat.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "idryomov@...il.com" <idryomov@...il.com>,
"abinashsinghlalotra@...il.com" <abinashsinghlalotra@...il.com>
Subject: Re: [PATCH RFC] fs/ceph : fix build warning in ceph_writepages_start()
My comment is about the fact that we are reallocating an 816B
ceph_osd_request in __ceph_sync_read in a loop.
I would just allocate the memory once and reuse the struct. It's
allocated in ceph_osdc_alloc_request, where this logic is used:
        if (use_mempool) {
                BUG_ON(num_ops > CEPH_OSD_SLAB_OPS);
                req = mempool_alloc(osdc->req_mempool, gfp_flags);
        } else if (num_ops <= CEPH_OSD_SLAB_OPS) {
                req = kmem_cache_alloc(ceph_osd_request_cache, gfp_flags);
        } else {
                BUG_ON(num_ops > CEPH_OSD_MAX_OPS);
                req = kmalloc(struct_size(req, r_ops, num_ops), gfp_flags);
        }
IMHO we should keep consistent allocation logic throughout. @Abinash
Singh, please take a look at the allocation logic mentioned here.
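A rough sketch of the shape I mean for __ceph_sync_read (untested;
more_to_read stands in for the real loop condition, and actual reuse
would also need to reinit the request's ops and data between
submissions):

        struct ceph_osd_request *req;
        int ret;

        /* allocate the request once, before the read loop */
        req = ceph_osdc_alloc_request(osdc, NULL, 1, false, GFP_KERNEL);
        if (!req)
                return -ENOMEM;

        while (more_to_read) {
                /* reinit the per-iteration op/extent fields here instead
                 * of allocating a fresh ~816B request on every pass */
                ceph_osdc_start_request(osdc, req);
                ret = ceph_osdc_wait_request(osdc, req);
                if (ret < 0)
                        break;
        }

        ceph_osdc_put_request(req);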
On Mon, Jul 7, 2025 at 11:10 PM Viacheslav Dubeyko
<Slava.Dubeyko@....com> wrote:
>
> On Mon, 2025-07-07 at 22:57 +0300, Alex Markuze wrote:
> > Well, kmem_cache is a bit faster AFAIK. TBH, I would use a magazine
> > allocator here, which would bridge the two, but that's a different
> > discussion.
> > We already have an allocation of a big struct in the read path as
> > well, where we just kmalloc it; that should change to
> > kmem_cache/mempool too.
>
> Which structure in the read path do you mean here? :)
>
> I assume that mempool could be useful under heavy memory pressure to guarantee
> memory availability. But performance is a really important point here too.
>
> Thanks,
> Slava.
>
> >
> > On Mon, Jul 7, 2025 at 9:24 PM Viacheslav Dubeyko <Slava.Dubeyko@....com> wrote:
> > >
> > > On Mon, 2025-07-07 at 13:11 +0300, Alex Markuze wrote:
> > > > Hi Abinash,
> > > >
> > > > Thanks for your patch. You've correctly identified a real concern
> > > > around stack usage in ceph_writepages_start().
> > > >
> > > > However, dynamically allocating ceph_writeback_ctl inside .writepages
> > > > isn't ideal. This function may be called in memory reclaim paths or
> > > > under writeback pressure, where any GFP allocation (even GFP_NOFS)
> > > > could deadlock or fail unexpectedly.
> > > >
> > > > Instead of allocating the struct on each call, I’d suggest one of the following:
> > > >
> > > > Add a dedicated kmem_cache for ceph_writeback_ctl, initialized during
> > > > Ceph FS client init.
> > > > This allows reuse of cache-hot objects across calls without going
> > > > through the generic kmalloc caches each time.
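> > > >
> > > > A minimal sketch of what I mean (ceph_wbc_cachep and where exactly
> > > > it gets created are hypothetical, not a final patch):
> > > >
> > > > static struct kmem_cache *ceph_wbc_cachep;
> > > >
> > > > /* during Ceph FS client init */
> > > > ceph_wbc_cachep = kmem_cache_create("ceph_writeback_ctl",
> > > >                                     sizeof(struct ceph_writeback_ctl),
> > > >                                     0, 0, NULL);
> > > > if (!ceph_wbc_cachep)
> > > >         return -ENOMEM;
> > > >
> > > > /* in ceph_writepages_start() */
> > > > struct ceph_writeback_ctl *ceph_wbc;
> > > >
> > > > ceph_wbc = kmem_cache_zalloc(ceph_wbc_cachep, GFP_NOFS);
> > > > if (!ceph_wbc)
> > > >         return -ENOMEM;
> > > > /* ... */
> > > > kmem_cache_free(ceph_wbc_cachep, ceph_wbc);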
> > > >
> > >
> > > I had a discussion with Ilya several days ago related to this issue. :)
> > > I specifically considered adding a dedicated kmem_cache for
> > > ceph_writeback_ctl. However, Ilya mentioned that CONFIG_FRAME_WARN
> > > defaults to 2048 on x86_64, so you would not even see this warning on
> > > x86_64 systems. But, yes, this rework makes sense from my point of view.
> > >
> > > By the way, do we need to consider a mempool here instead of a
> > > kmem_cache? Writeback is a pretty intensive and critical code path.
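> > >
> > > Something like this sketch is what I have in mind (ceph_wbc_cachep is
> > > the hypothetical dedicated cache from above; a min_nr of 16 is an
> > > arbitrary guess):
> > >
> > > mempool_t *ceph_wbc_pool;
> > >
> > > /* back the pool with the dedicated cache, pre-reserving 16 objects */
> > > ceph_wbc_pool = mempool_create_slab_pool(16, ceph_wbc_cachep);
> > > if (!ceph_wbc_pool)
> > >         return -ENOMEM;
> > >
> > > /* in the writeback path; a mempool allocation cannot fail
> > >  * for sleeping (GFP_NOFS) requests */
> > > ceph_wbc = mempool_alloc(ceph_wbc_pool, GFP_NOFS);
> > > /* ... */
> > > mempool_free(ceph_wbc, ceph_wbc_pool);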
> > >
> > > Thanks,
> > > Slava.
> > >
> > > > Embed a ceph_writeback_ctl struct inside ceph_inode_info, if it's
> > > > feasible to manage lifetimes and synchronization. Note that
> > > > .writepages can run concurrently on the same inode, so this approach
> > > > would require proper locking or reuse guarantees.
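> > > >
> > > > For illustration only (the field and lock names are hypothetical,
> > > > and serializing .writepages per inode this way may be too coarse):
> > > >
> > > > struct ceph_inode_info {
> > > >         /* ... existing fields ... */
> > > >         struct mutex i_wbc_lock;          /* hypothetical */
> > > >         struct ceph_writeback_ctl i_wbc;  /* hypothetical */
> > > > };
> > > >
> > > > /* in ceph_writepages_start() */
> > > > mutex_lock(&ci->i_wbc_lock);
> > > > /* ... use &ci->i_wbc instead of a stack or heap copy ... */
> > > > mutex_unlock(&ci->i_wbc_lock);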
> > > >
> > > > On Sat, Jul 5, 2025 at 6:54 PM Abinash Singh <abinashlalotra@...il.com> wrote:
> > > > >
> > > > > The function `ceph_writepages_start()` triggers
> > > > > a large stack frame warning:
> > > > > ld.lld: warning:
> > > > > fs/ceph/addr.c:1632:0: stack frame size (1072) exceeds limit (1024) in function 'ceph_writepages_start.llvm.2555023190050417194'
> > > > >
> > > > > Fix it by dynamically allocating the `ceph_writeback_ctl` struct.
> > > > >
> > > > > Signed-off-by: Abinash Singh <abinashsinghlalotra@...il.com>
> > > > > ---
> > > > > The high stack usage of ceph_writepages_start() was further
> > > > > confirmed by using the -fstack-usage flag:
> > > > > ...
> > > > > fs/ceph/addr.c:1837:ceph_netfs_check_write_begin 104 static
> > > > > fs/ceph/addr.c:1630:ceph_writepages_start 648 static
> > > > > ...
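> > > > > (e.g. with something like `make KCFLAGS=-fstack-usage fs/ceph/addr.o`,
> > > > > which emits a fs/ceph/addr.su file listing per-function stack usage)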
> > > > > After optimizations it may go up to 1072, as seen in the warning.
> > > > > After allocating the `ceph_writeback_ctl` struct dynamically, it is reduced to:
> > > > > ...
> > > > > fs/ceph/addr.c:1630:ceph_writepages_start 288 static
> > > > > fs/ceph/addr.c:81:ceph_dirty_folio 72 static
> > > > > ...
> > > > > Is this function used very frequently? Or is dynamic allocation
> > > > > a fine fix for reducing the stack usage?
> > > > >
> > > > > Thank You
> > > > > ---
> > > > > fs/ceph/addr.c | 82 ++++++++++++++++++++++++++------------------------
> > > > > 1 file changed, 43 insertions(+), 39 deletions(-)
> > > > >
> > > > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > > > > index 60a621b00c65..503a86c1dda6 100644
> > > > > --- a/fs/ceph/addr.c
> > > > > +++ b/fs/ceph/addr.c
> > > > > @@ -1633,9 +1633,13 @@ static int ceph_writepages_start(struct address_space *mapping,
> > > > > struct inode *inode = mapping->host;
> > > > > struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
> > > > > struct ceph_client *cl = fsc->client;
> > > > > - struct ceph_writeback_ctl ceph_wbc;
> > > > > + struct ceph_writeback_ctl *ceph_wbc __free(kfree) = NULL;
> > > > > int rc = 0;
> > > > >
> > > > > + ceph_wbc = kmalloc(sizeof(*ceph_wbc), GFP_NOFS);
> > > > > + if (!ceph_wbc)
> > > > > + return -ENOMEM;
> > > > > +
> > > > > if (wbc->sync_mode == WB_SYNC_NONE && fsc->write_congested)
> > > > > return 0;
> > > > >
> > > > > @@ -1648,7 +1652,7 @@ static int ceph_writepages_start(struct address_space *mapping,
> > > > > return -EIO;
> > > > > }
> > > > >
> > > > > - ceph_init_writeback_ctl(mapping, wbc, &ceph_wbc);
> > > > > + ceph_init_writeback_ctl(mapping, wbc, ceph_wbc);
> > > > >
> > > > > if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
> > > > > rc = -EIO;
> > > > > @@ -1656,7 +1660,7 @@ static int ceph_writepages_start(struct address_space *mapping,
> > > > > }
> > > > >
> > > > > retry:
> > > > > - rc = ceph_define_writeback_range(mapping, wbc, &ceph_wbc);
> > > > > + rc = ceph_define_writeback_range(mapping, wbc, ceph_wbc);
> > > > > if (rc == -ENODATA) {
> > > > > /* hmm, why does writepages get called when there
> > > > > is no dirty data? */
> > > > > @@ -1665,55 +1669,55 @@ static int ceph_writepages_start(struct address_space *mapping,
> > > > > }
> > > > >
> > > > > if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
> > > > > - tag_pages_for_writeback(mapping, ceph_wbc.index, ceph_wbc.end);
> > > > > + tag_pages_for_writeback(mapping, ceph_wbc->index, ceph_wbc->end);
> > > > >
> > > > > - while (!has_writeback_done(&ceph_wbc)) {
> > > > > - ceph_wbc.locked_pages = 0;
> > > > > - ceph_wbc.max_pages = ceph_wbc.wsize >> PAGE_SHIFT;
> > > > > + while (!has_writeback_done(ceph_wbc)) {
> > > > > + ceph_wbc->locked_pages = 0;
> > > > > + ceph_wbc->max_pages = ceph_wbc->wsize >> PAGE_SHIFT;
> > > > >
> > > > > get_more_pages:
> > > > > - ceph_folio_batch_reinit(&ceph_wbc);
> > > > > + ceph_folio_batch_reinit(ceph_wbc);
> > > > >
> > > > > - ceph_wbc.nr_folios = filemap_get_folios_tag(mapping,
> > > > > - &ceph_wbc.index,
> > > > > - ceph_wbc.end,
> > > > > - ceph_wbc.tag,
> > > > > - &ceph_wbc.fbatch);
> > > > > + ceph_wbc->nr_folios = filemap_get_folios_tag(mapping,
> > > > > + &ceph_wbc->index,
> > > > > + ceph_wbc->end,
> > > > > + ceph_wbc->tag,
> > > > > + &ceph_wbc->fbatch);
> > > > > doutc(cl, "pagevec_lookup_range_tag for tag %#x got %d\n",
> > > > > - ceph_wbc.tag, ceph_wbc.nr_folios);
> > > > > + ceph_wbc->tag, ceph_wbc->nr_folios);
> > > > >
> > > > > - if (!ceph_wbc.nr_folios && !ceph_wbc.locked_pages)
> > > > > + if (!ceph_wbc->nr_folios && !ceph_wbc->locked_pages)
> > > > > break;
> > > > >
> > > > > process_folio_batch:
> > > > > - rc = ceph_process_folio_batch(mapping, wbc, &ceph_wbc);
> > > > > + rc = ceph_process_folio_batch(mapping, wbc, ceph_wbc);
> > > > > if (rc)
> > > > > goto release_folios;
> > > > >
> > > > > /* did we get anything? */
> > > > > - if (!ceph_wbc.locked_pages)
> > > > > + if (!ceph_wbc->locked_pages)
> > > > > goto release_folios;
> > > > >
> > > > > - if (ceph_wbc.processed_in_fbatch) {
> > > > > - ceph_shift_unused_folios_left(&ceph_wbc.fbatch);
> > > > > + if (ceph_wbc->processed_in_fbatch) {
> > > > > + ceph_shift_unused_folios_left(&ceph_wbc->fbatch);
> > > > >
> > > > > - if (folio_batch_count(&ceph_wbc.fbatch) == 0 &&
> > > > > - ceph_wbc.locked_pages < ceph_wbc.max_pages) {
> > > > > + if (folio_batch_count(&ceph_wbc->fbatch) == 0 &&
> > > > > + ceph_wbc->locked_pages < ceph_wbc->max_pages) {
> > > > > doutc(cl, "reached end fbatch, trying for more\n");
> > > > > goto get_more_pages;
> > > > > }
> > > > > }
> > > > >
> > > > > - rc = ceph_submit_write(mapping, wbc, &ceph_wbc);
> > > > > + rc = ceph_submit_write(mapping, wbc, ceph_wbc);
> > > > > if (rc)
> > > > > goto release_folios;
> > > > >
> > > > > - ceph_wbc.locked_pages = 0;
> > > > > - ceph_wbc.strip_unit_end = 0;
> > > > > + ceph_wbc->locked_pages = 0;
> > > > > + ceph_wbc->strip_unit_end = 0;
> > > > >
> > > > > - if (folio_batch_count(&ceph_wbc.fbatch) > 0) {
> > > > > - ceph_wbc.nr_folios =
> > > > > - folio_batch_count(&ceph_wbc.fbatch);
> > > > > + if (folio_batch_count(&ceph_wbc->fbatch) > 0) {
> > > > > + ceph_wbc->nr_folios =
> > > > > + folio_batch_count(&ceph_wbc->fbatch);
> > > > > goto process_folio_batch;
> > > > > }
> > > > >
> > > > > @@ -1724,38 +1728,38 @@ static int ceph_writepages_start(struct address_space *mapping,
> > > > > * we tagged for writeback prior to entering this loop.
> > > > > */
> > > > > if (wbc->nr_to_write <= 0 && wbc->sync_mode == WB_SYNC_NONE)
> > > > > - ceph_wbc.done = true;
> > > > > + ceph_wbc->done = true;
> > > > >
> > > > > release_folios:
> > > > > doutc(cl, "folio_batch release on %d folios (%p)\n",
> > > > > - (int)ceph_wbc.fbatch.nr,
> > > > > - ceph_wbc.fbatch.nr ? ceph_wbc.fbatch.folios[0] : NULL);
> > > > > - folio_batch_release(&ceph_wbc.fbatch);
> > > > > + (int)ceph_wbc->fbatch.nr,
> > > > > + ceph_wbc->fbatch.nr ? ceph_wbc->fbatch.folios[0] : NULL);
> > > > > + folio_batch_release(&ceph_wbc->fbatch);
> > > > > }
> > > > >
> > > > > - if (ceph_wbc.should_loop && !ceph_wbc.done) {
> > > > > + if (ceph_wbc->should_loop && !ceph_wbc->done) {
> > > > > /* more to do; loop back to beginning of file */
> > > > > doutc(cl, "looping back to beginning of file\n");
> > > > > /* OK even when start_index == 0 */
> > > > > - ceph_wbc.end = ceph_wbc.start_index - 1;
> > > > > + ceph_wbc->end = ceph_wbc->start_index - 1;
> > > > >
> > > > > /* to write dirty pages associated with next snapc,
> > > > > * we need to wait until current writes complete */
> > > > > - ceph_wait_until_current_writes_complete(mapping, wbc, &ceph_wbc);
> > > > > + ceph_wait_until_current_writes_complete(mapping, wbc, ceph_wbc);
> > > > >
> > > > > - ceph_wbc.start_index = 0;
> > > > > - ceph_wbc.index = 0;
> > > > > + ceph_wbc->start_index = 0;
> > > > > + ceph_wbc->index = 0;
> > > > > goto retry;
> > > > > }
> > > > >
> > > > > - if (wbc->range_cyclic || (ceph_wbc.range_whole && wbc->nr_to_write > 0))
> > > > > - mapping->writeback_index = ceph_wbc.index;
> > > > > + if (wbc->range_cyclic || (ceph_wbc->range_whole && wbc->nr_to_write > 0))
> > > > > + mapping->writeback_index = ceph_wbc->index;
> > > > >
> > > > > dec_osd_stopping_blocker:
> > > > > ceph_dec_osd_stopping_blocker(fsc->mdsc);
> > > > >
> > > > > out:
> > > > > - ceph_put_snap_context(ceph_wbc.last_snapc);
> > > > > + ceph_put_snap_context(ceph_wbc->last_snapc);
> > > > > doutc(cl, "%llx.%llx dend - startone, rc = %d\n", ceph_vinop(inode),
> > > > > rc);
> > > > >
> > > > > --
> > > > > 2.43.0
> > > > >
> > > > >
> > > >
>
> --
> Viacheslav Dubeyko <Slava.Dubeyko@....com>