[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20140911154539.GK3052@yliu-dev.sh.intel.com>
Date: Thu, 11 Sep 2014 23:45:39 +0800
From: Yuanhan Liu <yuanhan.liu@...ux.intel.com>
To: Jan Kara <jack@...e.cz>
Cc: linux-ext4@...r.kernel.org, Andi Kleen <ak@...ux.intel.com>,
Yuanhan Liu <yuanhan.liu@...ux.intel.com>
Subject: Re: [PATCH] jbd2: remove __jbd2_journal_clean_checkpoint_list to
solve scalability issue
On Thu, Sep 11, 2014 at 03:51:47PM +0200, Jan Kara wrote:
> On Thu 11-09-14 18:09:53, Yuanhan Liu wrote:
> > Firstly, if I read the code correctly(I have a feeling I missed something),
> > __jbd2_journal_clean_checkpoint_list is not necessary. jbd2_log_do_checkpoint
> > will remove them for the checkpointed transaction. In another word,
> > __jbd2_journal_clean_checkpoint_list can't drop an transaction if it's not
> > checkpointed.
> Yes, that's correct. __jbd2_journal_clean_checkpoint_list() is there only
> to free up buffers that were already checkpointed.
Yes, and that might be something I missed: if a transaction is
checkpointed, those buffers attached to the transaction should also be
released, right? If that's ture, what __jbd2_journal_clean_checkpoint_list()
is for?
>
> > Secondly, I noticed severe scalability limit caused by this function with fsmark:
> > $ fs_mark -d /fs/ram0/1 -D 2 -N 2560 -n 1000000 -L 1 -S 1 -s 4096
> >
> > It mainly writes tons of small files, each with 4K, and calls fsync after
> > each write. It's one thread only, and it's tested on a 32G memory disk(brd).
> >
> > "perf top" shows that journal_clean_one_cp_list() is very hot, like 40%.
> > However, that function is quite cheap, actually. But __jbd2_journal_clean_checkpoint_list()
> > will repeatedly call it for each transaction on the j_checkpoint_transactions.
> > The list becomes longer and longer in linear with time. It soon grows to
> > a list with 3,000 and more transactions. What's worse, we will iterate all
> > those transactions on this list every time before committing a transaction,
> > aka, fsync a file in our case.
> OK, this is kind of a pathological workload (generating lots of tiny
> transactions) but it's not insane. I'd also note that we should eventually
> reach a steady state once the journal fills up and we will be forced to
> checkpoint transactions from the journal to make space for new ones.
> However at that point we will have few thousands of transactions in the
> journal and I agree it takes a while to scan them in
> __jbd2_journal_clean_checkpoint_list().
>
> I'm not yet convinced that just dropping
> __jbd2_journal_clean_checkpoint_list() is the right thing to do.
Yes, I feel the same. BTW, I should add a RFC tag before the subject.
This patch is mainly for informing you guys there might be a scalability
issue with current JBD2 code.
--yliu
> Some
> periodic cleaning of checkpointed entries would seem reasonable so that we
> don't save all that work for the moment we run out of space in the journal.
> I'll think how to do that in a more efficient way...
>
> Honza
>
> > So, that's roughly the reason why journal_clean_one_cp_list() is hot. However,
> > I'm wondering why the j_checkpoint_transactions list keeps growing. Is there
> > a bug? I did more investigates, and I guess I found the root cause.
> >
> > In this case, we invoke fsync after each write, so, it basically means one
> > transaction for one file. One transaction normally contains few blocks of meta
> > data, like bitmap, inode and so on. All files in same group shares one bitmap
> > block. But one inode table block could contains 16 files. Hence, it's possible
> > that 2 different file points to same meta blocks.
> >
> > For example, if file A and B uses same meta blocks, and if we write A and B in
> > following style:
> > write(A);
> > fsync(A);
> >
> > write(B);
> > fsync(B);
> >
> > then, when we are about to commit transation for B, and assume transaction of A
> > is not checkpointted yet, we can safely drop transaction A and replace it with
> > transaction B. Hence, the j_checkpoint_transactions grows by 1 only.
> >
> > And then assume A is the last inode in one inode block, hence, B will use another
> > inode table block. Thus transaction A and B is different. Hence, both A and B are
> > inserted to the j_checkpoint_transactions; the list grows by 2.
> >
> > Here I also got the proves; I added a trace point in jbd2_log_do_checkpoint(),
> > and here are some of them:
> >
> > fs_mark-2646 [000] .... 5.830990: jbd2_start_checkpoint: tid=20
> > fs_mark-2646 [000] .... 5.832391: jbd2_start_checkpoint: tid=36
> > fs_mark-2646 [000] .... 5.833794: jbd2_start_checkpoint: tid=52
> > fs_mark-2646 [000] .... 5.835153: jbd2_start_checkpoint: tid=68
> > fs_mark-2646 [000] .... 5.836517: jbd2_start_checkpoint: tid=84
> > fs_mark-2646 [000] .... 5.837982: jbd2_start_checkpoint: tid=100
> > fs_mark-2646 [000] .... 5.839464: jbd2_start_checkpoint: tid=116
> > fs_mark-2646 [000] .... 5.840820: jbd2_start_checkpoint: tid=132
> >
> > As you can see, the tid jumps by 16 each time, and other transactions are just
> > replaced by the way I described above.
> >
> > Step by step like above, that list grows. And that's why journal_clean_one_cp_list() is hot.
> >
> > It removes the scalability issue when I removed this function, and the fsmark
> > result(Files/sec) jumps from 9000 to 16000, which is an increase about 80%.
> >
> > Thoughts?
> >
> > CC: Andi Kleen <ak@...ux.intel.com>
> > Signed-off-by: Yuanhan Liu <yuanhan.liu@...ux.intel.com>
> > ---
> > fs/jbd2/checkpoint.c | 115 ---------------------------------------------------
> > fs/jbd2/commit.c | 9 ----
> > include/linux/jbd2.h | 1 -
> > 3 files changed, 125 deletions(-)
> >
> > diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
> > index 9ffb19c..31fce78 100644
> > --- a/fs/jbd2/checkpoint.c
> > +++ b/fs/jbd2/checkpoint.c
> > @@ -83,26 +83,6 @@ static inline void __buffer_relink_io(struct journal_head *jh)
> > }
> >
> > /*
> > - * Try to release a checkpointed buffer from its transaction.
> > - * Returns 1 if we released it and 2 if we also released the
> > - * whole transaction.
> > - *
> > - * Requires j_list_lock
> > - */
> > -static int __try_to_free_cp_buf(struct journal_head *jh)
> > -{
> > - int ret = 0;
> > - struct buffer_head *bh = jh2bh(jh);
> > -
> > - if (jh->b_transaction == NULL && !buffer_locked(bh) &&
> > - !buffer_dirty(bh) && !buffer_write_io_error(bh)) {
> > - JBUFFER_TRACE(jh, "remove from checkpoint list");
> > - ret = __jbd2_journal_remove_checkpoint(jh) + 1;
> > - }
> > - return ret;
> > -}
> > -
> > -/*
> > * __jbd2_log_wait_for_space: wait until there is space in the journal.
> > *
> > * Called under j-state_lock *only*. It will be unlocked if we have to wait
> > @@ -412,101 +392,6 @@ int jbd2_cleanup_journal_tail(journal_t *journal)
> >
> > /* Checkpoint list management */
> >
> > -/*
> > - * journal_clean_one_cp_list
> > - *
> > - * Find all the written-back checkpoint buffers in the given list and
> > - * release them.
> > - *
> > - * Called with the journal locked.
> > - * Called with j_list_lock held.
> > - * Returns number of buffers reaped (for debug)
> > - */
> > -
> > -static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
> > -{
> > - struct journal_head *last_jh;
> > - struct journal_head *next_jh = jh;
> > - int ret, freed = 0;
> > -
> > - *released = 0;
> > - if (!jh)
> > - return 0;
> > -
> > - last_jh = jh->b_cpprev;
> > - do {
> > - jh = next_jh;
> > - next_jh = jh->b_cpnext;
> > - ret = __try_to_free_cp_buf(jh);
> > - if (ret) {
> > - freed++;
> > - if (ret == 2) {
> > - *released = 1;
> > - return freed;
> > - }
> > - }
> > - /*
> > - * This function only frees up some memory
> > - * if possible so we dont have an obligation
> > - * to finish processing. Bail out if preemption
> > - * requested:
> > - */
> > - if (need_resched())
> > - return freed;
> > - } while (jh != last_jh);
> > -
> > - return freed;
> > -}
> > -
> > -/*
> > - * journal_clean_checkpoint_list
> > - *
> > - * Find all the written-back checkpoint buffers in the journal and release them.
> > - *
> > - * Called with the journal locked.
> > - * Called with j_list_lock held.
> > - * Returns number of buffers reaped (for debug)
> > - */
> > -
> > -int __jbd2_journal_clean_checkpoint_list(journal_t *journal)
> > -{
> > - transaction_t *transaction, *last_transaction, *next_transaction;
> > - int ret = 0;
> > - int released;
> > -
> > - transaction = journal->j_checkpoint_transactions;
> > - if (!transaction)
> > - goto out;
> > -
> > - last_transaction = transaction->t_cpprev;
> > - next_transaction = transaction;
> > - do {
> > - transaction = next_transaction;
> > - next_transaction = transaction->t_cpnext;
> > - ret += journal_clean_one_cp_list(transaction->
> > - t_checkpoint_list, &released);
> > - /*
> > - * This function only frees up some memory if possible so we
> > - * dont have an obligation to finish processing. Bail out if
> > - * preemption requested:
> > - */
> > - if (need_resched())
> > - goto out;
> > - if (released)
> > - continue;
> > - /*
> > - * It is essential that we are as careful as in the case of
> > - * t_checkpoint_list with removing the buffer from the list as
> > - * we can possibly see not yet submitted buffers on io_list
> > - */
> > - ret += journal_clean_one_cp_list(transaction->
> > - t_checkpoint_io_list, &released);
> > - if (need_resched())
> > - goto out;
> > - } while (transaction != last_transaction);
> > -out:
> > - return ret;
> > -}
> >
> > /*
> > * journal_remove_checkpoint: called after a buffer has been committed
> > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> > index b73e021..1ebdd70 100644
> > --- a/fs/jbd2/commit.c
> > +++ b/fs/jbd2/commit.c
> > @@ -504,15 +504,6 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> > jbd2_journal_refile_buffer(journal, jh);
> > }
> >
> > - /*
> > - * Now try to drop any written-back buffers from the journal's
> > - * checkpoint lists. We do this *before* commit because it potentially
> > - * frees some memory
> > - */
> > - spin_lock(&journal->j_list_lock);
> > - __jbd2_journal_clean_checkpoint_list(journal);
> > - spin_unlock(&journal->j_list_lock);
> > -
> > jbd_debug(3, "JBD2: commit phase 1\n");
> >
> > /*
> > diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> > index 0dae71e..c41ab38 100644
> > --- a/include/linux/jbd2.h
> > +++ b/include/linux/jbd2.h
> > @@ -1042,7 +1042,6 @@ void jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block);
> > extern void jbd2_journal_commit_transaction(journal_t *);
> >
> > /* Checkpoint list management */
> > -int __jbd2_journal_clean_checkpoint_list(journal_t *journal);
> > int __jbd2_journal_remove_checkpoint(struct journal_head *);
> > void __jbd2_journal_insert_checkpoint(struct journal_head *, transaction_t *);
> >
> > --
> > 1.9.0
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to majordomo@...r.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> Jan Kara <jack@...e.cz>
> SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists