Profile TW performance with large task DBs #3618

Open
djmitche opened this issue Sep 7, 2024 · 7 comments
Labels
topic:performance Issues related to performance

Comments

@djmitche
Collaborator

djmitche commented Sep 7, 2024

...Probably running a profiler is the action item from this issue...

Originally posted by @djmitche in #3329 (comment)


I think this would be best done from Taskwarrior, treating the Rust code as a "black box". It would be great to learn which calls into Rust are taking the longest -- and what non-Rust things are taking a long time, too. I can think of a few possibilities we might learn, but the truth is probably something different.

I don't know much about profiling in C++, but perhaps you, dear reader, do?

@felixschurk
Collaborator

felixschurk commented Sep 13, 2024

I think that, in order to keep it comparable, we should use the existing load and run_perf tasks in the performance folder, though probably (at least for a first run) at a somewhat reduced size. Currently it would create 8510 tasks, which is fairly large :D

I am planning to take a look at this; however, I will probably not have time before the end of September :/

@felixschurk
Collaborator

felixschurk commented Sep 27, 2024

I ran it on my local machine, which runs Fedora with an i5-1145G7 processor and 16 GiB of RAM. I was not using the machine for anything else intensive at profiling time, and it produced a call graph like this:
(call graph image)

Based on that, what takes the most time is rendering, startup, and rebuilding Taskwarrior's working set.

I am not really sure what the best way is to share performance-profiling results.

I also ran the same performance suite (the one shipped with Taskwarrior) against the 2.6.2 release, where the total runtime was only 2.7 s vs. 4.25 s on the current develop branch. The absolute numbers should not be taken too seriously (I have only run it once so far; for something consistent we would need multiple runs and an average), but one can still see that Taskwarrior 3 is noticeably slower. From the profiling alone, though, I cannot pin it down to one method where the problem might be coming from.

The performance script also reports some counters for garbage collection etc., and those counters are an order of magnitude higher for Taskwarrior 3. However, I do not know how they are computed, so they might be reporting something incorrectly.

@felixschurk
Collaborator

felixschurk commented Sep 27, 2024

Okay, I just noticed that one "counter" or timer was lost in the transition to the TDB2 backend, more specifically the timer that tracked how long it takes to load a task set into Taskwarrior.

The counter was in Line 419 of TDB2.cpp.
5bb9857#diff-d6d65d1f59ba33ef5268d5bc8a49ed26f397b899b902332404aee34e800cbc0bL419

Context::getContext ().time_load_us += timer.total_us ();

This explains why the performance report always states load:0 for the latest Taskwarrior.

@djmitche
Collaborator Author

That flame graph seems to have multiple commands in it -- is it a concatenation from multiple runs? It's a little difficult to interpret the graphic, since things are cut off. Is there a way to share the source file from which that was drawn?

#3635 should restore the load timer.

@felixschurk
Collaborator

I will come back to this and see how to extract it into a simpler output.

@felixschurk
Collaborator

Okay, I finally found some time to look back into this a bit.

I ran the next report once on the performance database (its statistics are below) and got an overview of the call graph with callgrind.

Category                   Data          
Pending                    2128          
Waiting                    0
Recurring                  0             
Completed                  5957
Deleted                    0             
Total                      8085
Annotations                425           
Unique tags                292
Projects                   230           
Blocked tasks              0
Blocking tasks             0             
Undo transactions          8510
Sync backlog transactions  71007         
Tasks tagged               99.2%
Oldest task                2024-11-15    
Newest task                2024-11-15
Task used for              4min          
Task added every
Task completed every                     
Average time pending       21min
Average desc length        125 characters

The results can be seen in the attached PDF; however, it is a bit messy and mainly shows the call counts. Reproducing it by following the callgrind manual is quite straightforward.
For viewing the attached file I would recommend kcachegrind.

Long story short, I know too little of the TW code to properly understand the output, to get an overview of which calls might be unnecessary, or to see what really limits the performance. I will try to wrap my head around it.

On my laptop, using the same database, it takes approx. 0.2 s with v2.6.2 and 0.6 s with v3.2, meaning we are currently still three times slower than v2.6.2.

call-graph.pdf
callgrind.out.zip

@djmitche
Collaborator Author

This suggests that most of that time is spent in the gc operation, and the included PDF agrees: gc is basically rebuilding the working set, and a lot of the calls are doing that. So that's probably a good place to start! GothenburgBitFactory/taskchampion#414 may be a start there, but there is probably room for more.
(profiler screenshot)

Here's the code for rebuilding the working set:

/// Rebuild the working set using a function to identify tasks that should be in the set.  This
/// renumbers the existing working-set tasks to eliminate gaps, and also adds any tasks that
/// are not already in the working set but should be.  The rebuild occurs in a single
/// transaction against the storage backend.
pub fn rebuild<F>(txn: &mut dyn StorageTxn, in_working_set: F, renumber: bool) -> Result<()>
where                                   
    F: Fn(&TaskMap) -> bool,
{                                                       
    let mut new_ws = vec![None]; // index 0 is always None
    let mut seen = HashSet::new();
                                                                               
    // The goal here is for existing working-set items to be "compressed" down to index 1, so
    // we begin by scanning the current working set and inserting any tasks that should still
    // be in the set into new_ws, implicitly dropping any tasks that are no longer in the
    // working set.             
    for elt in txn.get_working_set()?.drain(1..) {
        if let Some(uuid) = elt {
            if let Some(task) = txn.get_task(uuid)? {
                if in_working_set(&task) {    
                    new_ws.push(Some(uuid));
                    seen.insert(uuid);               
                    continue;
                }                           
            }                        
        }
                                                                               
        // if we are not renumbering, then insert a blank working-set entry here
        if !renumber {
            new_ws.push(None);
        }                 
    }    

    // if renumbering, clear the working set and re-add
    if renumber {             
        txn.clear_working_set()?;                                     
        for elt in new_ws.drain(1..new_ws.len()).flatten() {
            txn.add_to_working_set(elt)?;
        }       
    } else {                           
        // ..otherwise, just clear the None items determined above from the working set
        for (i, elt) in new_ws.iter().enumerate().skip(1) {
            if elt.is_none() {         
                txn.set_working_set_item(i, None)?;
            }            
        }        
    }         
                                                                               
    // Now go hunting for tasks that should be in this list but are not, adding them at the
    // end of the list, whether renumbering or not
    for (uuid, task) in txn.all_tasks()? {
        if !seen.contains(&uuid) && in_working_set(&task) {                                                                                                    
            txn.add_to_working_set(uuid)?;            
        }                                                             
    }           
                                                                                                                                                               
    txn.commit()?;                        
    Ok(())                                                                  
}         

Drilling down into the working set rebuild, the majority of the time is in all_tasks (called once at the end but returning a lot of data) and get_task (called once for each task in the existing working set). There's a little time spent in adding tasks to the working set (perhaps fixed by the issue linked above), but for the most part it's the fetch-all-tasks bit that is taking a lot of time.
(profiler screenshot)
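
One incremental change that stays within the existing storage API would be to do the bulk all_tasks() read up front and reuse it for the membership checks, so the per-entry get_task() round trips disappear. It would not reduce the cost of all_tasks() itself, which the narrower-query idea further down is aimed at. A rough sketch only, not the project's actual code, assuming all_tasks() returns owned (Uuid, TaskMap) pairs as the snippet above suggests:

```rust
use std::collections::{HashMap, HashSet};
use uuid::Uuid;

// StorageTxn, TaskMap, and Result are the same types used by rebuild() above.
pub fn rebuild_bulk<F>(txn: &mut dyn StorageTxn, in_working_set: F, renumber: bool) -> Result<()>
where
    F: Fn(&TaskMap) -> bool,
{
    // One bulk read up front instead of one get_task() per working-set entry,
    // reused again for the final "add missing tasks" pass.
    let all: HashMap<Uuid, TaskMap> = txn.all_tasks()?.into_iter().collect();

    let mut new_ws = vec![None]; // index 0 is always None
    let mut seen = HashSet::new();

    // Keep existing working-set entries that still belong in the set.
    for elt in txn.get_working_set()?.drain(1..) {
        if let Some(uuid) = elt {
            // In-memory lookup instead of a get_task() call per entry.
            if let Some(task) = all.get(&uuid) {
                if in_working_set(task) {
                    new_ws.push(Some(uuid));
                    seen.insert(uuid);
                    continue;
                }
            }
        }
        if !renumber {
            new_ws.push(None);
        }
    }

    // Renumbering logic unchanged from rebuild() above.
    if renumber {
        txn.clear_working_set()?;
        for uuid in new_ws.drain(1..).flatten() {
            txn.add_to_working_set(uuid)?;
        }
    } else {
        for (i, elt) in new_ws.iter().enumerate().skip(1) {
            if elt.is_none() {
                txn.set_working_set_item(i, None)?;
            }
        }
    }

    // Add tasks that should be in the set but are not, reusing the map.
    for (uuid, task) in &all {
        if !seen.contains(uuid) && in_working_set(task) {
            txn.add_to_working_set(*uuid)?;
        }
    }

    txn.commit()?;
    Ok(())
}
```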

Going back to main and drilling down through the initialization, I see that CmdCustom::execute (the "custom report" command) calls handleUntil, which handles deleting pending tasks whose until dates are in the past. It takes a while because it calls pending_tasks to fetch all pending tasks from the DB. This is probably faster due to #3632. But this is surprising -- it is a kind of garbage collection, yet it is not guarded by rc:gc! Since pending_tasks caches its result, and the custom report probably calls it again later, that time is not entirely wasted.

taskwarrior/src/TDB2.cpp

Lines 267 to 286 in 01ced32

////////////////////////////////////////////////////////////////////////////////
const std::vector<Task> TDB2::pending_tasks() {
  if (!_pending_tasks) {
    Timer timer;
    auto pending_tctasks = replica()->pending_task_data();
    std::vector<Task> result;
    for (auto& maybe_tctask : pending_tctasks) {
      auto tctask = maybe_tctask.take();
      result.push_back(Task(std::move(tctask)));
    }
    dependency_scan(result);
    Context::getContext().time_load_us += timer.total_us();
    _pending_tasks = result;
  }
  return *_pending_tasks;
}

A few thoughts off the top of my head:

  • Maybe we could cache some kind of "no changes since last working-set rebuild" flag that would allow us to make it a no-op
    • This might be useful combined with the ability to open a replica in read-only mode: TW could open in read-only mode by default, see if a working-set rebuild is required, and only re-open in read-write mode if it is required.
  • We could probably do a better job of this at the DB level. The only data we need from the DB is the set of uuids for tasks with status:pending - that's enough to know what to add and remove from the working-set table (see the sketch after this list).
  • The handleUntil check should be part of gc. It will almost always make no changes, so the same read-only/read-write considerations apply.
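
Here is the sketch referenced in the second bullet. The get_pending_uuids() method is hypothetical (nothing like it exists in TaskChampion today, and the backend would need a cheap way to answer it), and it assumes the working-set predicate is exactly "status is pending", which is how Taskwarrior uses it; a general predicate would still need the task data:

```rust
use std::collections::HashSet;
use uuid::Uuid;

// StorageTxn and Result are the same types used by rebuild() above.

/// Hypothetical storage extension: return only the uuids of pending tasks,
/// which a backend could answer with a narrow query or an index rather than
/// materializing every task's full data.
pub trait PendingUuids: StorageTxn {
    fn get_pending_uuids(&mut self) -> Result<HashSet<Uuid>>;
}

/// Sketch: rebuild the working set from the pending-uuid set alone.
pub fn rebuild_from_pending(txn: &mut dyn PendingUuids, renumber: bool) -> Result<()> {
    let pending = txn.get_pending_uuids()?;
    let mut new_ws: Vec<Option<Uuid>> = vec![None]; // index 0 is always None
    let mut seen = HashSet::new();

    // Keep existing working-set entries that are still pending.
    for elt in txn.get_working_set()?.drain(1..) {
        match elt {
            Some(uuid) if pending.contains(&uuid) => {
                new_ws.push(Some(uuid));
                seen.insert(uuid);
            }
            _ if !renumber => new_ws.push(None),
            _ => {}
        }
    }

    // Renumbering logic unchanged from rebuild() above.
    if renumber {
        txn.clear_working_set()?;
        for uuid in new_ws.drain(1..).flatten() {
            txn.add_to_working_set(uuid)?;
        }
    } else {
        for (i, elt) in new_ws.iter().enumerate().skip(1) {
            if elt.is_none() {
                txn.set_working_set_item(i, None)?;
            }
        }
    }

    // Append pending tasks that are not yet in the working set.
    for uuid in &pending {
        if !seen.contains(uuid) {
            txn.add_to_working_set(*uuid)?;
        }
    }

    txn.commit()?;
    Ok(())
}
```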

What do you think?

djmitche moved this from Backlog to In progress in Taskwarrior Development, Nov 16, 2024