Design Considerations for OS Libraries (Life Modelling Problem) #33
Replies: 18 comments
-
Great summary! Let me address the challenge in @lewisfogden's comment on LinkedIn.
I had the same challenge when developing modelx. On Linux,
-
@fumitoh, that seems like a hack to get around Python's lack of built-in tail call optimization, but it doesn't solve the underlying issue. Have you looked at trampolines or other ways to implement that optimization instead? How about scrapping recursion entirely?
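For context, by a trampoline I mean something like this minimal sketch (illustrative only, not tied to modelx; the function names are made up): instead of recursing directly, the function returns a thunk, and a small driver loop keeps invoking thunks until a plain value comes back, so the Python call stack never grows.

```python
# Minimal trampoline sketch (illustrative only).
def trampoline(fn, *args):
    """Keep calling returned thunks until a non-callable value appears."""
    result = fn(*args)
    while callable(result):
        result = result()
    return result

def cashflow(t):
    return 100.0  # placeholder cashflow

def account_value(t, acc=0.0):
    # Tail-recursive style: return a thunk instead of recursing directly.
    if t == 0:
        return acc
    return lambda: account_value(t - 1, acc + cashflow(t))

print(trampoline(account_value, 100_000))  # no RecursionError despite the depth
```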
-
@fumitoh said:
@Houstonwp said:
I've developed heavymodel with novice programmers in mind, so TCO/trampolines are probably a step too far (at least for me to implement!). I've decided to accept the recursive nature of the model, and to look for ways of minimising the risk of stack overflows. My thinking for heavymodel has been:

```python
def total_sick(t):
    return sum(sick_by_dur(t, dur) for dur in range(t))
```

In practice you will usually want to see aggregated results (e.g. the net cashflow), so I think this risk is low. The user could do a manual projection to avoid this. As to coming up with an alternative without recursion - that might need a completely different way of thinking about these kinds of problems (i.e. multi-state models). You could do a loop with a lot of state information, but this becomes messy (and code order is vital), and it is close to recursion:

```python
# initialise state 0
if0, d0, l0 = 1, 0, 0  # in force, deaths, lapses
for t in range(N):
    d1 = q[t] * if0
    l1 = w[t] * if0
    if1 = if0 - d1 - l1
    # store states here, e.g. assign to dict/list
    if0, d0, l0 = if1, d1, l1  # move everything along one
```
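To illustrate the "manual projection" idea (a toy sketch using `functools.lru_cache`, not heavymodel's actual API; the functions and rates are placeholders): if the cache is warmed in increasing time order, each recursive call only ever needs to look one step back, so the stack stays shallow even for projections longer than Python's recursion limit.

```python
from functools import lru_cache

N = 1200  # projection length beyond Python's default recursion limit

@lru_cache(maxsize=None)
def in_force(t):
    # Recursive definition: each value depends only on the previous period.
    if t == 0:
        return 1.0
    return in_force(t - 1) * (1 - q(t - 1) - w(t - 1))

def q(t):
    return 0.001  # placeholder mortality rate

def w(t):
    return 0.05   # placeholder lapse rate

# "Manual projection": evaluate in time order so each call finds its
# predecessor already cached and never recurses more than one level deep.
for t in range(N):
    in_force(t)

print(in_force(N - 1))
```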
-
@lewisfogden thanks for sharing this. I've been working on flow management for projection models for quite a while now, and I find it extremely interesting to hear other people's ideas. For me the most challenging part is how to manage the flow of bigger models that cannot be solved analytically and where each time-dependent function can refer at time T1 to any other function (or to itself) at any other time T2. Where I got to is building a graph of dependencies between functions, additionally marking the dependencies (edges of the graph) as either forward-looking (where function F1 at time T1 refers to function F2 at time T2 and T1 <= T2) or backward-looking (similarly, where function F1 at time T1 refers to function F2 at time T2 and T1 >= T2). Having that graph built, I'm working on heuristics that would know which functions at which times (mostly either t=0 or t=T_MAX) and in what order would need to be called to ensure a minimal number of calls while staying away from stack overflow errors (something like the targeted version you've mentioned, but the list would potentially be time-dependent and would be generated automatically).
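Roughly sketching the kind of graph I mean (just an illustration, not my actual system; the function names and time offsets are made up): each edge records whether the reference looks forward or backward in time, which is what the ordering heuristic would later consume.

```python
# Illustrative only: a tiny dependency graph with edges tagged by direction.
import networkx as nx

# Each entry: calling function -> list of (called function, time offset).
# An offset of -1 means "refers to t-1" (backward-looking), +1 means t+1.
dependencies = {
    "in_force":     [("in_force", -1), ("deaths", -1), ("lapses", -1)],
    "deaths":       [("in_force", 0)],
    "lapses":       [("in_force", 0)],
    "reserve":      [("reserve", +1), ("net_cashflow", 0)],  # discounting looks forward
    "net_cashflow": [("in_force", 0)],
}

g = nx.DiGraph()
for caller, calls in dependencies.items():
    for callee, offset in calls:
        direction = "backward" if offset < 0 else "forward" if offset > 0 else "same_t"
        g.add_edge(caller, callee, offset=offset, direction=direction)

for u, v, data in g.edges(data=True):
    print(f"{u} -> {v}: {data}")
```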
-
@Houstonwp said:
modelx has a dependency trace feature for debugging like this:

```python
>>> net_cashflow.preds(6)
[Model1.sample.premiums(t=6)=49.35859623121889,
 Model1.sample.claims(t=6)=61.69824528902361]

>>> premiums.succs(6)
[Model1.sample.net_cashflow(t=6)=-12.33964905780472]
```

In order to implement the feature, modelx internally implements its own call stack to record callers and callees with their arguments in a networkx directed graph, as well as its own memoization mechanism. This graph is also used to auto-clear the dependents of a node when the node's value has changed. Is it possible to implement a feature to auto-detect tail calls in user code and apply TCO/trampolines where possible, given that the graph creation relies on the internal call stack?
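As a rough illustration of that mechanism (a toy sketch, not modelx's actual implementation; the decorator and names are invented): a memoizing wrapper can maintain its own call stack and record caller/callee pairs, with their arguments, as edges in a directed graph.

```python
# Illustrative sketch only; the real machinery is more involved.
import networkx as nx

call_graph = nx.DiGraph()   # nodes are (function name, args) tuples
_call_stack = []            # explicit stack of nodes currently executing
_cache = {}                 # (function name, args) -> value

def traced(fn):
    def wrapper(*args):
        node = (fn.__name__, args)
        if _call_stack:                     # record who called us
            call_graph.add_edge(_call_stack[-1], node)
        if node in _cache:
            return _cache[node]
        _call_stack.append(node)
        try:
            value = fn(*args)
        finally:
            _call_stack.pop()
        _cache[node] = value
        return value
    return wrapper

@traced
def premiums(t):
    return 100.0

@traced
def claims(t):
    return 60.0

@traced
def net_cashflow(t):
    return premiums(t) - claims(t)

net_cashflow(6)
# callers of premiums(6), i.e. its dependents:
print(list(call_graph.predecessors(("premiums", (6,)))))
```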
-
I added a Rust implementation to the gist earlier today if anyone is interested.
-
For those not following the linked gist, there's been an interesting comparison as different approaches and languages have been used to calculate the sample problem (a simple 10-period NPV). Here's a current comparison of the results:

(results table not reproduced here)
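For readers who haven't opened the gist, the problem is roughly this (my own Python paraphrase based on the loop versions posted later in this thread, not the gist's exact code; the parameter values are placeholders): project in-force, deaths and lapses period by period and discount the net cashflow.

```python
# Rough paraphrase of the sample problem; values are placeholders.
from functools import lru_cache

q = [0.001] * 10                   # mortality rates
w = [0.05] * 10                    # lapse rates
P, S, r = 100.0, 25_000.0, 0.02    # premium, sum assured, interest rate
v = 1 / (1 + r)

@lru_cache(maxsize=None)
def in_force(t):
    # Policies still in force at the start of period t.
    if t == 0:
        return 1.0
    return in_force(t - 1) * (1 - q[t - 1] - w[t - 1])

def net_cashflow(t):
    return in_force(t) * P - in_force(t) * q[t] * S

def npv():
    return sum(net_cashflow(t) * v ** (t + 1) for t in range(len(q)))

print(npv())
```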
-
Thanks @alecloudenback - I've added in Nim recursive (which should be close in speed to C). I would like to see a few more of the recursive versions (as they can be applied to more general problems - the loop only works on a subset), so I will add in Python. @mglinicka: interesting - have you any bi-directional problems (i.e. referring to both t-1 and t+1), or are they all either one or the other (such as projecting or discounting)? If so, would you be able to share an example on a gist? I thought about doing this - with Python I could look at the AST to see what each function was looking up.
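A quick sketch of that AST idea (illustrative only; the projection functions are made up, `ast`/`inspect` are standard library): parse a function's source and collect which names it calls.

```python
# Illustrative sketch: find which functions a model function calls, via the AST.
import ast
import inspect
import textwrap

def survivors(t):
    if t == 0:
        return 1.0
    return survivors(t - 1) - deaths(t - 1)

def deaths(t):
    return survivors(t) * 0.001

def called_names(fn):
    """Return the set of function names referenced in calls inside fn."""
    tree = ast.parse(textwrap.dedent(inspect.getsource(fn)))
    return {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }

print(called_names(survivors))   # {'survivors', 'deaths'}
print(called_names(deaths))      # {'survivors'}
```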
-
@lewisfogden re: Nim - I didn't add Nim to the table because, with the memoization, the timings don't really measure "how fast can I calculate this actuarial problem" so much as "how fast can I hash and look things up". In practice I think it could be a valid technique, so I don't think it's totally unfair to include it in the table. But the question is how to measure it? The typical benchmarking doesn't really work, because it usually assumes that you are rerunning the same computation each time. With the memoization, it should be "slow" the first time and then run very fast subsequently. Thus the mean would be really sensitive to how many times the benchmark ran the computation. So use the median for everything?
-
Benchmarking is really difficult at the best of times, and in this case we are all running on different hardware/OS, etc. Julia is running under Rosetta 2 on an M1 (ARM) and I'm on x86_64 (Windows), for instance. I could not understand why Julia would be able to beat Rust (as there is no technical reason I know of why this would be the case, even if Julia's GC is never used). I asked on the Rust user forum and it seems it's most likely just down to my laptop being old, or my dev setup. Another user re-ran the benchmarks on their system and Rust was twice as fast as Julia, coming in at 76 ns. So it's hard to conclude too much with such a small problem and so many differences in environment. Interesting and kinda fun though...
-
You still need to calculate the problem (all values), but with memoisation you only calculate them once. I've outlined an example on my site digitalactuary, and there are a few guides if you look for dynamic programming (e.g. geeksforgeeks). For benchmarking with memoisation, you need to clear the cache between benchmark runs so that the full problem is recalculated (or, with the heavymodel library, create a new model instance with an empty cache) - in the nim version there is a call to do this. It's sometimes a little tricky to think about, as both brains and Excel use memoisation by default 😁. Overall, I think we are conflating a few different really interesting topics here (at least!).
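For example, something like this (illustrative only, using `functools.lru_cache` rather than any of the libraries above): clear the cache before each timed run so every run pays the full cost of the calculation.

```python
# Illustrative: benchmark a memoised function fairly by clearing its cache each run.
import timeit
from functools import lru_cache

@lru_cache(maxsize=None)
def in_force(t):
    if t == 0:
        return 1.0
    return in_force(t - 1) * (1 - 0.001 - 0.05)

def run_cold():
    in_force.cache_clear()        # force a full recalculation
    return in_force(500)

def run_warm():
    return in_force(500)          # mostly measures the cache lookup

print("cold:", min(timeit.repeat(run_cold, number=1, repeat=100)))
print("warm:", min(timeit.repeat(run_warm, number=1, repeat=100)))
```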
@paddyhoran: agree - the Julia benchmarks put Julia and Rust fairly close.
-
@lewisfogden There are obviously more possibilities than being purely forward/backward looking. Besides referring to both t+1 and t-1, you could also have functions like sum(funcX[t0:t1]) that refer to a range of times T, or even a T that is only known at runtime (so we cannot decide how that affects the flow). The other thing is that the dependencies between functions can depend on (static) data, so sometimes it might be useful to take this into account as well. Given the variety of dependency types, there's no one correct way of solving this, and I would usually use more or less refined methods of flow management that fit the problem.

But generally speaking it all goes like this: in an ideal world we would like to calculate the model policy by policy from t=0 to t=T_MAX. So, given the graph, as a first step you look for all the functions that can be calculated that way. Then you remove these functions from the graph and check whether there are any functions that could be calculated backwards, assuming the removed ones are already calculated. Then you remove these functions and go back to looking for forward-looking ones. That continues until all the functions are covered. All of the "problematic" ones are edge cases - in the simplest approach you might choose to call them for their full projection length one by one. But again, there's always a way to get to a heuristic that works better in most cases.

I think the other advantage of building the graph is the possibility of speeding up the runs when you run sensitivities or stochastic scenarios. When you know the dependencies, you are able to figure out which functions need to be recalculated when you change a certain assumption. That gives a massive performance improvement compared to the standard models and mode of calculation. Also, sometimes you only need to calculate part of the model (given the functions that you want in your output); then the graph also comes in handy - it would only cover the functions that are needed to get the values of the functions requested in the output.

I'm afraid I do not have anything to share at hand - these were always part of a bigger system and would not be understandable standalone.
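A compact sketch of the peeling idea, though (a paraphrase for illustration, not my production code; it assumes each dependency has already been tagged forward/backward as in the earlier sketch): repeatedly strip off the functions whose remaining dependencies are all backward-looking (computable forwards in time), then those whose remaining dependencies are all forward-looking (computable backwards), until the graph is empty or only "problematic" functions remain.

```python
# Illustrative only: peel a dependency graph into forward/backward passes.
# deps maps each function to a list of (dependency, direction) pairs, where
# direction is "backward" (refers to t-1 or earlier) or "forward" (t+1 or later).
deps = {
    "in_force":     [("in_force", "backward")],
    "net_cashflow": [("in_force", "backward")],
    "reserve":      [("reserve", "forward"), ("net_cashflow", "forward")],
}

def peel(deps):
    remaining = dict(deps)
    passes = []
    while remaining:
        for direction, label in (("backward", "forward pass"), ("forward", "backward pass")):
            batch = [
                f for f, ds in remaining.items()
                if all(d == direction or dep not in remaining for dep, d in ds)
            ]
            if batch:
                passes.append((label, batch))
                for f in batch:
                    remaining.pop(f)
                break
        else:
            passes.append(("problematic", list(remaining)))
            break
    return passes

print(peel(deps))
# [('forward pass', ['in_force', 'net_cashflow']), ('backward pass', ['reserve'])]
```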
-
@alecloudenback my curiosity got the better of me, I could not figure out why my impl was so "slow" on my machine. Turns out it's the exponentiation in the discounting:

```rust
pub fn npv(mortality_rates: &[f64], lapse_rates: &[f64], interest_rate: f64, sum_assured: f64, premium: f64, init_pols: f64, term: Option<usize>) -> f64 {
    let term = term.unwrap_or_else(|| mortality_rates.len());
    let mut result = 0.0;
    let mut inforce = init_pols;
    let v = 1.0 / (1.0 + interest_rate);
    let mut v_t = v;
    for (t, (q, w)) in mortality_rates.iter().zip(lapse_rates).enumerate() {
        let no_deaths = if t < term { inforce * q } else { 0.0 };
        let no_lapses = if t < term { inforce * w } else { 0.0 };
        let premiums = inforce * premium;
        let claims = no_deaths * sum_assured;
        let net_cashflow = premiums - claims;
        result += net_cashflow * v_t;
        v_t *= v;
        inforce = inforce - no_deaths - no_lapses;
    }
    result
}
```

This version runs in 31 ns. I'm curious, as the same optimization should speed up the Julia version too?
-
Been swamped at work, so just getting back to this now. Indeed, the exponentiation is expensive. You are a performance wizard, @paddyhoran! The following runs in 15 ns, compared with 15 ns for the Rust code above on my machine.

```julia
@inline function npv3(q,w,P,S,r,term=nothing)
    term = term === nothing ? length(q) + 1 : term + 1
    inforce = 1.0
    result = 0.0
    v = (1 / ( 1 + r))
    v_t = v
    for (t,(q,w)) in enumerate(zip(q,w))
        deaths = t < term ? inforce * q : 0.0
        lapses = t < term ? inforce * w : 0.0
        premiums = inforce * P
        claims = deaths * S
        ncf = premiums - claims
        result += ncf * v_t
        v_t *= v
        inforce = inforce - deaths - lapses
    end
    return result
end
```

Note that I added `@inline`. Once Julia is available natively for the Mac, I will try this again. I couldn't get Rust working on my Windows machine, so I had to test on the laptop.
-
And starting to sacrifice readability, but this version is 12 ns:

```julia
@inline function npv4(q,w,P,S,r,term=nothing)
    term = term === nothing ? length(q) : term
    inforce = 1.0
    result = 0.0
    v = (1 / ( 1 + r))
    v_t = v
    for (t,(q,w)) in enumerate(zip(q,w))
        t > term && return result
        result += inforce * (P - S * q) * v_t
        inforce -= inforce * q + inforce * w
        v_t *= v
    end
    return result
end
```
-
Pretty interesting, I'd say we have pushed it about as far as it goes. It's crazy how good the emulation is on the M1 - there's basically no overhead.
-
I started to benchmark the code everyone shared on the same hardware for comparison's sake. See the code here: https://github.com/JuliaActuary/Learn/tree/master/LifeModelingProblemBenchmarks

Times are in nanoseconds:

(benchmark table not reproduced here)

And in graphical format (mean if available, otherwise median):

(chart not reproduced here; the distance comparisons are inspired by Grace Hopper trying to explain what a nanosecond is like)

Ideally I'd get samples from some of the other examples shared and bundle it up into a single runnable script, assuming you have the requisite languages/packages installed. I had a hard time getting some of the data; e.g. Rust only appears to log the median time to the terminal, and Python only the average.

Software: all languages/libraries (Julia, Rust, Python) are Mac M1 native unless otherwise noted.
-
Are the Julia and Rust ones single-threaded? I also wonder what the comparison using the standard model would look like.
-
@alecloudenback, @Houstonwp and I have looked at some options for the life modelling problem: Gist Here, evaluating R & Julia using a non-recursive approach (which should be about as fast as you can get), and I've tried out Nim (a statically typed Python-like language which compiles to C), which runs at broadly the same speed with memoized recursion.
Some of my thoughts