Profile TW performance with large task DBs #3618

Open
djmitche opened this issue Sep 7, 2024 · 7 comments
Labels
topic:performance Issues related to performance

Comments

@djmitche
Collaborator

djmitche commented Sep 7, 2024

...Probably running a profiler is the action item from this issue...

Originally posted by @djmitche in #3329 (comment)


I think this would be best done from Taskwarrior, treating the Rust code as a "black box". It would be great to learn which calls into Rust are taking the longest -- and what non-Rust things are taking a long time, too. I can think of a few possibilities we might learn, but the truth is probably something different.

I don't know much about profiling in C++, but perhaps you, dear reader, do?

@felixschurk
Collaborator

felixschurk commented Sep 13, 2024

I think that, in order to keep it comparable, we should use the existing load and run_perf tasks in the performance folder, though probably (at least for a first run) at a somewhat reduced size. Currently it would create 8510 tasks, which is fairly large :D

I am planning to take a look at this; however, I will probably not have time before the end of September :/

@felixschurk
Collaborator

felixschurk commented Sep 27, 2024

I ran it on my local machine, which runs Fedora with an i5-1145G7 processor and 16 GiB of RAM. I was not using the machine for anything else intensive at profiling time, and it produced a call graph like this:
(call graph image)

Based on that, what takes the most time is rendering, startup, and rebuilding Taskwarrior's working set.

I am not really sure what the best way is to share performance-profiling results.

I also ran the same performance suite (the one shipped with Taskwarrior) against the 2.6.2 release, where the total runtime was only 2.7 s vs. 4.25 s on the current develop branch. The absolute numbers should not be taken too seriously (I have only run it once so far; for something consistent we would need multiple runs and an average), but one can still see that Taskwarrior 3 is noticeably slower. From the profiling alone, though, I cannot pin it down to one method where the problem might be coming from.

The performance script also reports some counters for garbage collection etc., and those counters are an order of magnitude higher for Taskwarrior 3. However, I do not know how they are computed, so they might be reporting something incorrectly.

@felixschurk
Collaborator

felixschurk commented Sep 27, 2024

Okay, I just noticed that one "counter" or timer was lost in the transition to the TDB2 backend, more specifically the timer that tracked how long it takes to load a task set into Taskwarrior.

The counter was in Line 419 of TDB2.cpp.
5bb9857#diff-d6d65d1f59ba33ef5268d5bc8a49ed26f397b899b902332404aee34e800cbc0bL419

Context::getContext ().time_load_us += timer.total_us ();

This explains why the performance report always states load:0 for the latest Taskwarrior.

@djmitche
Collaborator Author

That flame graph seems to have multiple commands in it -- is it a concatenation from multiple runs? It's a little difficult to interpret the graphic, since things are cut off. Is there a way to share the source file from which that was drawn?

#3635 should restore the load timer.

@felixschurk
Collaborator

I will come back to this and see how to extract it into a simpler output.

@felixschurk
Collaborator

Okay, I finally found some time to look back into this a bit.

I ran the next report once on the performance database (its statistics are below) and got an overview of the call graph with callgrind.

Category                   Data          
Pending                    2128          
Waiting                    0
Recurring                  0             
Completed                  5957
Deleted                    0             
Total                      8085
Annotations                425           
Unique tags                292
Projects                   230           
Blocked tasks              0
Blocking tasks             0             
Undo transactions          8510
Sync backlog transactions  71007         
Tasks tagged               99.2%
Oldest task                2024-11-15    
Newest task                2024-11-15
Task used for              4min          
Task added every
Task completed every                     
Average time pending       21min
Average desc length        125 characters

The results can be seen in the attached PDF; however, it is a bit messy and mainly shows the call counts. Reproducing it by following the callgrind manual is quite straightforward.
For viewing the attached file I would recommend kcachegrind.

Long story short, I know too little of the TW code to properly understand the output, to get an overview of which calls might be unnecessary, or to see what really limits the performance. I will try to wrap my head around it.

On my laptop, using the same database, it takes approx. 0.2 s with v2.6.2 and 0.6 s with v3.2, meaning we are currently still three times slower than v2.6.2.

call-graph.pdf
callgrind.out.zip

@djmitche
Collaborator Author

This suggests that most of that time is spent in the gc operation, and the included PDF agrees: gc is basically rebuilding the working set, and a lot of the calls are doing that. So that's probably a good place to start! GothenburgBitFactory/taskchampion#414 may be a start there, but there is probably room for more.
(profiler screenshot)

Here's the code for rebuilding the working set:

/// Rebuild the working set using a function to identify tasks that should be in the set.  This
/// renumbers the existing working-set tasks to eliminate gaps, and also adds any tasks that
/// are not already in the working set but should be.  The rebuild occurs in a single
/// transaction against the storage backend.
pub fn rebuild<F>(txn: &mut dyn StorageTxn, in_working_set: F, renumber: bool) -> Result<()>
where                                   
    F: Fn(&TaskMap) -> bool,
{                                                       
    let mut new_ws = vec![None]; // index 0 is always None
    let mut seen = HashSet::new();
                                                                               
    // The goal here is for existing working-set items to be "compressed" down to index 1, so
    // we begin by scanning the current working set and inserting any tasks that should still
    // be in the set into new_ws, implicitly dropping any tasks that are no longer in the
    // working set.             
    for elt in txn.get_working_set()?.drain(1..) {
        if let Some(uuid) = elt {
            if let Some(task) = txn.get_task(uuid)? {
                if in_working_set(&task) {    
                    new_ws.push(Some(uuid));
                    seen.insert(uuid);               
                    continue;
                }                           
            }                        
        }
                                                                               
        // if we are not renumbering, then insert a blank working-set entry here
        if !renumber {
            new_ws.push(None);
        }                 
    }    

    // if renumbering, clear the working set and re-add
    if renumber {             
        txn.clear_working_set()?;                                     
        for elt in new_ws.drain(1..new_ws.len()).flatten() {
            txn.add_to_working_set(elt)?;
        }       
    } else {                           
        // ..otherwise, just clear the None items determined above from the working set
        for (i, elt) in new_ws.iter().enumerate().skip(1) {
            if elt.is_none() {         
                txn.set_working_set_item(i, None)?;
            }            
        }        
    }         
                                                                               
    // Now go hunting for tasks that should be in this list but are not, adding them at the
    // end of the list, whether renumbering or not
    for (uuid, task) in txn.all_tasks()? {
        if !seen.contains(&uuid) && in_working_set(&task) {                                                                                                    
            txn.add_to_working_set(uuid)?;            
        }                                                             
    }           
                                                                                                                                                               
    txn.commit()?;                        
    Ok(())                                                                  
}         

Drilling down into the working set rebuild, the majority of the time is in all_tasks (called once at the end but returning a lot of data) and get_task (called once for each task in the existing working set). There's a little time spent in adding tasks to the working set (perhaps fixed by the issue linked above), but for the most part it's the fetch-all-tasks bit that is taking a lot of time.
(profiler screenshot)
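
One incremental change that stays within the existing storage API would be to do the bulk all_tasks() read up front and reuse it for the membership checks, so the per-entry get_task() round trips disappear. It would not reduce the cost of all_tasks() itself, which the narrower-query idea further down is aimed at. A rough sketch only, not the project's actual code, assuming all_tasks() returns owned (Uuid, TaskMap) pairs as the snippet above suggests:

```rust
use std::collections::{HashMap, HashSet};
use uuid::Uuid;

// StorageTxn, TaskMap, and Result are the same types used by rebuild() above.
pub fn rebuild_bulk<F>(txn: &mut dyn StorageTxn, in_working_set: F, renumber: bool) -> Result<()>
where
    F: Fn(&TaskMap) -> bool,
{
    // One bulk read up front instead of one get_task() per working-set entry,
    // reused again for the final "add missing tasks" pass.
    let all: HashMap<Uuid, TaskMap> = txn.all_tasks()?.into_iter().collect();

    let mut new_ws = vec![None]; // index 0 is always None
    let mut seen = HashSet::new();

    // Keep existing working-set entries that still belong in the set.
    for elt in txn.get_working_set()?.drain(1..) {
        if let Some(uuid) = elt {
            // In-memory lookup instead of a get_task() call per entry.
            if let Some(task) = all.get(&uuid) {
                if in_working_set(task) {
                    new_ws.push(Some(uuid));
                    seen.insert(uuid);
                    continue;
                }
            }
        }
        if !renumber {
            new_ws.push(None);
        }
    }

    // Renumbering logic unchanged from rebuild() above.
    if renumber {
        txn.clear_working_set()?;
        for uuid in new_ws.drain(1..).flatten() {
            txn.add_to_working_set(uuid)?;
        }
    } else {
        for (i, elt) in new_ws.iter().enumerate().skip(1) {
            if elt.is_none() {
                txn.set_working_set_item(i, None)?;
            }
        }
    }

    // Add tasks that should be in the set but are not, reusing the map.
    for (uuid, task) in &all {
        if !seen.contains(uuid) && in_working_set(task) {
            txn.add_to_working_set(*uuid)?;
        }
    }

    txn.commit()?;
    Ok(())
}
```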

Going back to main and drilling down through the initialization, I see that CmdCustom::execute (the "custom report" command) calls handleUntil, which handles deleting pending tasks whose until dates are in the past. It takes a while because it calls pending_tasks to fetch all pending tasks from the DB. This is probably faster due to #3632. But this is surprising -- it is a kind of garbage collection, yet it is not guarded by rc:gc! Since pending_tasks caches its result, and the custom report probably calls it again later, that time is not entirely wasted.

taskwarrior/src/TDB2.cpp

Lines 267 to 286 in 01ced32

////////////////////////////////////////////////////////////////////////////////
const std::vector<Task> TDB2::pending_tasks() {
  if (!_pending_tasks) {
    Timer timer;
    auto pending_tctasks = replica()->pending_task_data();
    std::vector<Task> result;
    for (auto& maybe_tctask : pending_tctasks) {
      auto tctask = maybe_tctask.take();
      result.push_back(Task(std::move(tctask)));
    }
    dependency_scan(result);
    Context::getContext().time_load_us += timer.total_us();
    _pending_tasks = result;
  }
  return *_pending_tasks;
}

A few thoughts off the top of my head:

  • Maybe we could cache some kind of "no changes since last working-set rebuild" flag that would allow us to make it a no-op
    • This might be useful combined with the ability to open a replica in read-only mode: TW could open in read-only mode by default, see if a working-set rebuild is required, and only re-open in read-write mode if it is required.
  • We could probably do a better job of this at the DB level. The only data we need from the DB is the set of uuids for tasks with status:pending - that's enough to know what to add and remove from the working-set table (see the sketch after this list).
  • The handleUntil check should be part of gc. It will almost always make no changes, so the same read-only/read-write considerations apply.
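
Here is the sketch referenced in the second bullet. The get_pending_uuids() method is hypothetical (nothing like it exists in TaskChampion today, and the backend would need a cheap way to answer it), and it assumes the working-set predicate is exactly "status is pending", which is how Taskwarrior uses it; a general predicate would still need the task data:

```rust
use std::collections::HashSet;
use uuid::Uuid;

// StorageTxn and Result are the same types used by rebuild() above.

/// Hypothetical storage extension: return only the uuids of pending tasks,
/// which a backend could answer with a narrow query or an index rather than
/// materializing every task's full data.
pub trait PendingUuids: StorageTxn {
    fn get_pending_uuids(&mut self) -> Result<HashSet<Uuid>>;
}

/// Sketch: rebuild the working set from the pending-uuid set alone.
pub fn rebuild_from_pending(txn: &mut dyn PendingUuids, renumber: bool) -> Result<()> {
    let pending = txn.get_pending_uuids()?;
    let mut new_ws: Vec<Option<Uuid>> = vec![None]; // index 0 is always None
    let mut seen = HashSet::new();

    // Keep existing working-set entries that are still pending.
    for elt in txn.get_working_set()?.drain(1..) {
        match elt {
            Some(uuid) if pending.contains(&uuid) => {
                new_ws.push(Some(uuid));
                seen.insert(uuid);
            }
            _ if !renumber => new_ws.push(None),
            _ => {}
        }
    }

    // Renumbering logic unchanged from rebuild() above.
    if renumber {
        txn.clear_working_set()?;
        for uuid in new_ws.drain(1..).flatten() {
            txn.add_to_working_set(uuid)?;
        }
    } else {
        for (i, elt) in new_ws.iter().enumerate().skip(1) {
            if elt.is_none() {
                txn.set_working_set_item(i, None)?;
            }
        }
    }

    // Append pending tasks that are not yet in the working set.
    for uuid in &pending {
        if !seen.contains(uuid) {
            txn.add_to_working_set(*uuid)?;
        }
    }

    txn.commit()?;
    Ok(())
}
```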

What do you think?

djmitche moved this from Backlog to In progress in Taskwarrior Development, Nov 16, 2024