detailed documentation examples
maniospas committed Jun 6, 2024
1 parent dc10db2 commit 1cb8fa7
Showing 14 changed files with 393 additions and 219 deletions.
97 changes: 72 additions & 25 deletions docs/advanced/autotuning.md
@@ -8,47 +8,94 @@
…through a `pygrank.Tuner` base class, which wraps
any kind of node ranking algorithm. Ideally, this would wrap end-product
algorithms.

!!! info
    Tuners differ from benchmarks in that they select node ranking algorithms
    on-the-fly based on the graph signal input.

## Getting started

An exhaustive list of ready-to-use tuners can be found [here](../generated/tuners.md).
After initialization, these can run with the same pattern as other node ranking algorithms.
Tuner instances with default arguments use common base settings.
For example, the following code separates training and evaluation
data for a provided personalization signal and then uses a tuner that
by default creates a `GenericGraphFilter` instance with ten parameters.

```python
import pygrank as pg

# load a graph and a node group to serve as the personalization signal
_, graph, group = pg.load_one("eucore")
signal = pg.to_signal(graph, group)

train, test = pg.split(signal, training_samples=0.5)

scores_pagerank = pg.PageRank(max_iters=1000)(train)
scores_tuned = pg.ParameterTuner()(train)

measure = pg.AUC(test, exclude=train)
pg.benchmark_print_line("Pagerank", measure(scores_pagerank))
pg.benchmark_print_line("Tuned", measure(scores_tuned))
# Pagerank .83
# Tuned .91
```

Instead of repeating the whole optimization
process each time a tuner runs, you may
want to tune once and use the created node ranking
algorithm later. This can be achieved with the following pattern:

```python
# continuing the previous example, where `train` was defined
algorithm_tuned = pg.ParameterTuner().tune(train)
scores_tuned = algorithm_tuned(train)
```
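
The returned algorithm has fixed parameters, so it can later run on
graph signals without repeating the optimization process.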

## Customization

Tune your own algorithms by passing to `ParameterTuner`
a method (or lambda expression) that constructs them
from a list of parameters, alongside corresponding
upper and lower bounds for those parameters.
An example follows:

```python
def custom_algorithm(params):
    assert len(params) == 1
    return pg.PageRank(alpha=params[0])

algorithm = pg.ParameterTuner(custom_algorithm,
                              max_vals=[0.99],
                              min_vals=[0.5],
                              measure=pg.NDCG)
```
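
The tuned instance can then run like any other node ranking algorithm,
for example as `scores = algorithm(personalization)`.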


In the above snippet, we used NDCG as the measure of choice for tuning.
If no measure is provided, AUC is the default. If you want to tie the
measure to a specific graph signal with `as_supervised_method`, as below,
also set *fraction_of_training=1* for the tuner. This forces the tuner
to use the whole personalization to produce node ranks internally,
since the validation split is performed a priori.

```python
import pygrank as pg

_, graph, group = pg.load_one("eucore")
signal = pg.to_signal(graph, group)

train, test = pg.split(signal, training_samples=0.5)
train, valid = pg.split(train, training_samples=0.5)

tuner = pg.ParameterTuner(lambda params: pg.PageRank(alpha=params[0]),
                          max_vals=[0.99],
                          min_vals=[0.5],
                          fraction_of_training=1,
                          measure=pg.NDCG(valid, exclude=train+test).as_supervised_method())

scores_pagerank = pg.PageRank(max_iters=1000)(train)
scores_tuned = tuner(train)
```
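
Note that the validation measure excludes both training and test nodes, so
the test set never influences tuning.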

## Optimizations

Graph convolutions are the most computationally intensive operations
node ranking algorithms employ, as their running time scales linearly with the…
75 changes: 45 additions & 30 deletions docs/advanced/graph_preprocessing.md
@@ -5,50 +5,63 @@
…that performs symmetric (i.e. Laplacian-like) normalization
for undirected graphs and column-wise normalization that
follows a true probabilistic formulation of transition probabilities
for directed graphs, such as `DiGraph` instances. The type of
normalization can be specified by passing a *normalization*
argument to constructors of ranking algorithms. This parameter
can have the following values:

| Normalization | Description |
|---------------|-------------|
| `"auto"`      | The above-described default behavior. |
| `"col"`       | Column-wise normalization. |
| `"symmetric"` | Symmetric normalization. |
| `"none"`      | (The string "none", not Python's `None`.) Avoids any normalization, for example because edge weights already hold the normalization. |
| callable      | A callable applied to a `scipy` sparse adjacency matrix of the `"numpy"` backend (irrespective of the actually active backend). When applied, it ignores the preprocessor's *reduction* argument. |
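
For instance, the following minimal sketch passes a callable as the
*normalization* argument; the row-wise normalization it implements is
illustrative rather than a built-in option:

```python
import numpy as np
import scipy.sparse
import pygrank as pg

def row_normalize(adjacency):
    # adjacency is a scipy sparse matrix in the "numpy" backend
    degrees = np.asarray(adjacency.sum(axis=1)).ravel()
    degrees[degrees == 0] = 1  # keep isolated nodes from dividing by zero
    return scipy.sparse.diags(1.0 / degrees) @ adjacency

algorithm = pg.PageRank(normalization=row_normalize)
```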

Additionally, a *renormalization* argument may be provided
to add a multiple of the unit matrix to the adjacency matrix,
a concept called the renormalization trick.
This is 0 by default, but can help shrink the spectrum.
Furthermore, a *transform_adjacency* method can be provided
to modify the final adjacency matrix. For example,
you can combine these arguments to make an algorithm class
use the Laplacian matrix instead of the adjacency matrix:

```python
alg = Algorithm(transform_adjacency=lambda x: -x, renormalization=-1)
```
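In this example, *renormalization=-1* subtracts the unit matrix from the
adjacency matrix, and the lambda then negates the result, yielding a
Laplacian-like matrix (up to the selected normalization).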


In all cases, adjacency matrix normalization involves the
computationally intensive operation of converting the graph
into a scipy sparse matrix each time node
ranking algorithms are called. `pygrank`
provides a way to avoid recomputing the normalization
during large-scale experiments by the same algorithm for
the same graphs by passing an argument `assume_immutability=True`
to the algorithm's constructor, which indicates that
the graph does not change between runs of the algorithm
and hence computes the normalization only once for each given
graph, a process known as hashing.
Hashing only uses the Python object's hash method,
so a different instance of the same graph will recompute the
normalization if it points at a different memory location.

!!! warning
    Do not alter graph objects after passing them to
    `rank(...)` methods of algorithms with
    `assume_immutability=True` for the first time. If altering the
    graph is necessary midway through your code, create a copy
    instance with one of *networkx*'s in-built methods and
    edit that one.
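
A minimal sketch of this copy-before-editing pattern (the toy graph, its
edges, and the personalization are hypothetical):

```python
import networkx as nx
import pygrank as pg

graph = nx.Graph([("A", "B"), ("B", "C")])  # hypothetical toy graph
algorithm = pg.PageRank(assume_immutability=True)
ranks = algorithm(graph, {"A": 1})  # the graph's normalization is now hashed

editable = graph.copy()  # networkx's in-built copy
editable.add_edge("C", "D")  # safe: the original hashed graph is unchanged
ranks2 = algorithm(editable, {"A": 1})  # new object, so normalization is recomputed
```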

For example, hashing the outcome of graph normalization to
speed up multiple calls to the same graph can be achieved
as per the following code:

```python
import pygrank as pg

graph, personalization1, personalization2 = ...
algorithm = pg.PageRank(alpha=0.85, normalization="col", assume_immutability=True)
ranks1 = algorithm(graph, personalization1)
ranks2 = algorithm(graph, personalization2)  # does not re-compute the normalization
```

@@ -82,6 +95,7 @@
…to speed up multiple rank calls to the same graph by
different ranking algorithms can be done as:
```python
import pygrank as pg

graph, personalization1, personalization2 = ...
pre = pg.preprocessor(normalization="col", assume_immutability=True)
algorithm1 = pg.PageRank(alpha=0.85, preprocessor=pre)
algorithm2 = pg.HeatKernel(preprocessor=pre)  # hypothetical stand-in for lines elided by the diff view
ranks1 = algorithm1(graph, personalization1)
ranks2 = algorithm2(graph, personalization2) # does not re-compute the normalization
```

!!! info
    When benchmarking the above code, you can call `pre(graph)`
    before the first `rank(...)` call to make sure that this first call
    does not also perform the initial normalization, whose outcome will
    be hashed and immediately retrieved by subsequent calls.
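
A minimal sketch of this warm-up, reusing the placeholder variables of the
snippets above:

```python
import pygrank as pg

graph, personalization1, personalization2 = ...
pre = pg.preprocessor(normalization="col", assume_immutability=True)
pre(graph)  # performs and hashes the normalization ahead of time
algorithm = pg.PageRank(alpha=0.85, preprocessor=pre)
ranks1 = algorithm(graph, personalization1)  # retrieves the hashed normalization
ranks2 = algorithm(graph, personalization2)  # same
```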