Commit

updated docs

maniospas committed Jun 6, 2024
1 parent 9e283aa commit dc10db2
Showing 14 changed files with 424 additions and 344 deletions.
79 changes: 2 additions & 77 deletions README.md
@@ -18,80 +18,8 @@ Fast node ranking algorithms on large graphs.
# :hammer_and_wrench: Installation
`pygrank` works with Python 3.9 or later. The latest version can be installed with pip:

```
pip install --upgrade pygrank
```

To run the library on backpropagation-capable backends,
either change the automatically created
configuration file (follow the instructions in the stderr console)
or run parts of your code within a
[context manager](https://book.pythontips.com/en/latest/context_managers.html)
that overrides the configuration, like this:

```python
import pygrank as pg
with pg.Backend("tensorflow"):
... # run your pygrank code here
```

Otherwise, everything runs on top of `numpy`, which
is faster for forward passes. Node ranking algorithms
can be defined outside such contexts and still run inside them.
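
For instance, here is a minimal sketch of that pattern; the data placeholders are illustrative, and the tensorflow backend is assumed to be installed:

```python
import pygrank as pg

graph, seeds = ...  # data placeholders

algorithm = pg.PageRank(alpha=0.85)  # defined under the default numpy backend
with pg.Backend("tensorflow"):
    ranks = algorithm(graph, seeds)  # executes with backpropagation-capable operations
```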

# :zap: Quickstart
Before looking at details, here is a fully functional
pipeline that scores the importance of a node in relation to
a list of "seed" nodes within a graph's structure:

```python
import pygrank as pg
graph, seeds, node = ...

pre = pg.preprocessor(assume_immutability=True, normalization="symmetric")
algorithm = pg.PageRank(alpha=0.85)+pre >> pg.Sweep() >> pg.Ordinals()
ranks = algorithm(graph, seeds)
print(ranks[node])
print(algorithm.cite())
```

The graph can be created with `networkx` or, for faster computations,
with the `pygrank.fastgraph` module. Nodes can hold any
kind of object or data type (you don't need to convert them to integers).
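
As a sketch of both options (the node labels here are illustrative):

```python
import networkx as nx
from pygrank import Graph  # pygrank's own lightweight graph class

nx_graph = nx.Graph()
nx_graph.add_edge("Alice", "Bob")  # nodes may be strings or any other objects

fast_graph = Graph()  # same usage pattern, tailored to graph signal processing
fast_graph.add_edge("Alice", "Bob")
```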

The above snippet starts by defining a `preprocessor`,
which controls how graph adjacency matrices are normalized.
In this case, a symmetric normalization
is applied (which is ideal for undirected graphs) and we also
assume graph immutability, i.e., that it will not change in the future.
When this assumption is declared, the preprocessor hashes the graph to cache
costly computations, which considerably speeds up experiments and autotuning.
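
For example, here is a sketch of sharing one preprocessor between two filters so that normalization is computed once per graph; `pg.HeatKernel` and its `t` parameter are assumptions for illustration:

```python
import pygrank as pg

pre = pg.preprocessor(assume_immutability=True, normalization="symmetric")
ppr = pg.PageRank(alpha=0.85) + pre
heat = pg.HeatKernel(t=5) + pre  # reuses the cached normalization of the same graph
```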

The snippet uses the [chain operator](docs/basics/functional.md)
to wrap node ranking algorithms with various kinds of postprocessors.
You can also put algorithms into each other's constructors
if you are not a fan of functional programming, as shown below.
The chain starts from a pagerank graph filter with diffusion parameter
0.85. Other filters can be declared, including automatically tuned ones.
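
For reference, a sketch of the same pipeline written with nested constructors instead of the chain operator:

```python
import pygrank as pg

pre = pg.preprocessor(assume_immutability=True, normalization="symmetric")
# postprocessors accept the wrapped algorithm as their first constructor argument
algorithm = pg.Ordinals(pg.Sweep(pg.PageRank(alpha=0.85) + pre))
```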

The produced algorithm is run as a callable,
yielding a map between nodes and values
(in graph signal processing, such maps are called graph signals)
and the value of a node is printed. Graph signals can
also be created and directly parsed by algorithms, for example as:
```python
signal = pg.to_signal(graph, {v: 1. for v in seeds})
ranks = algorithm(signal)
```

Finally, the snippet prints a recommended citation for the algorithm.

### More examples

[Showcase](docs/advanced/quickstart.md) <br>
[Big data FAQ](docs/tips/big.md) <br>
[Downstream tasks](https://github.com/maniospas/pygrank-downstream) <br>

# :link: Documentation
**https://pygrank.readthedocs.io**

# :brain: Overview
Analyzing graph edges (links) between graph nodes can help
@@ -123,9 +51,6 @@ Some of the library's advantages are:
5. **Modular** components to be combined and a functional chain interface for complex combinations.
6. **Fast** running time with highly optimized operations.

# :link: Material
[Tutorials & Documentation](documentation/documentation.md) <br>
[Functional Interface](docs/basics/functional.md)

# :fire: Features
* Graph filters
50 changes: 22 additions & 28 deletions docs/advanced/autotuning.md
@@ -8,9 +8,13 @@ through a `pygrank.Tuner` base class, which wraps
any kind of node ranking algorithm. Ideally, this would wrap end-product
algorithms.

!!! warning
    Tuners differ from benchmarks in that they select node ranking algorithms
    on-the-fly based on input data. They may overfit even with train-validation-test splits.

An exhaustive list of ready-to-use tuners can be found [here](../generated/tuners.md).
After initialization with the appropriate
parameters, these can run with the same pattern as other node ranking algorithms.
Tuner instances with default arguments use commonly seen base settings.
For example, the following code separates training and evaluation
data of a provided personalization signal and then uses a tuner that
Expand Down Expand Up @@ -45,11 +49,6 @@ scores_tuned = pg.ParameterTuner(algorithm_from_params,
measure=pg.NDCG).tune(personalization)
```
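
For a fuller picture, here is a self-contained sketch of the same pattern; the body of the `algorithm_from_params` helper and the data placeholders are illustrative assumptions:

```python
import pygrank as pg

personalization = ...  # a provided personalization graph signal

def algorithm_from_params(params):
    # hypothetical helper: maps a list of tunable values to a node ranking algorithm
    return pg.PageRank(alpha=params[0])

scores_tuned = pg.ParameterTuner(algorithm_from_params,
                                 measure=pg.NDCG).tune(personalization)
```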


## Tuning speedup

Graph convolutions are the most computationally-intensive operations
node ranking algorithms employ, as their running time scales linearly with the
@@ -58,17 +57,14 @@ aim to optimize algorithms involving graph filters extending the
`ClosedFormGraphFilter` class, graph filtering is decomposed into
weighted sums of naturally occurring
Krylov space base elements {*M<sup>n</sup>p*, *n=0,1,...*}.

To speed up computation time (by many times in some settings) `pygrank`
provides the ability to save the generation of this Krylov space base
so that future runs do *not* recompute it, effectively removing the need
to perform graph convolutions all but once for each personalization.

!!! info
    This speedup can be applied outside of tuners too;
    explicitly pass a graph signal object to node ranking algorithms.

To enable this behavior, a dictionary needs to be passed to closed form
graph filter constructors through an `optimization_dict` argument.
@@ -85,18 +81,16 @@ tuner = pg.ParameterTuner(error_type="iters",
scores = tuner(graph, personalization)
```
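
Putting the pieces together, here is a sketch of manual use outside tuners; `pg.HeatKernel` and its `t` parameter are assumptions, and any filter extending `ClosedFormGraphFilter` should follow the same pattern:

```python
import pygrank as pg

graph, personalization = ...  # data placeholders
optimization_dict = dict()
algorithm = pg.HeatKernel(t=5, optimization_dict=optimization_dict)

signal = pg.to_signal(graph, personalization)  # explicit signal, so it can be hashed
scores = algorithm(signal)
scores = algorithm(signal)  # second run reuses the stored Krylov space base
optimization_dict.clear()  # free the cached memory when done
```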

!!! warning
    Similarly to the `assume_immutability=True` option
    for preprocessors, the optimization dictionary requires that graph signals are not altered in
    the interim, although it is possible to clear signal values.
    Furthermore, using optimization dictionaries multiplies (e.g., at least doubles)
    the amount of used memory, which the system may run out of for large graphs.
    To remove allocated memory, keep a reference to the dictionary and clear
    it afterwards with `optimization_dict.clear()`.

!!! info
    The default algorithms constructed by tuners (if none are provided) use
    *pygrank.SelfClearDict* instead of a normal dictionary. This clears other entries when
    a new personalization is inserted, therefore avoiding memory bloat.
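
For completeness, a sketch of opting into the same self-clearing behavior manually; the filter choice and its `t` parameter are again assumptions:

```python
import pygrank as pg

algorithm = pg.HeatKernel(t=5, optimization_dict=pg.SelfClearDict())
```
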
86 changes: 78 additions & 8 deletions docs/advanced/convergence.md
@@ -6,9 +6,10 @@ error and tolerance for numerical convergence. If no such argument is passed
to the constructor, a `pygrank.ConvergenceManager` object
is automatically instantiated by borrowing whichever extra arguments it can
from those passed to algorithm constructors. These arguments can be:

- *tol:* Indicates the numerical tolerance level required for convergence (default is 1.E-6).
- *error_type:* Indicates how differences between two graph signals are computed. The default value is `pygrank.Mabs` but any other supervised [measure](../basics/evaluation.md) that computes the differences between consecutive iterations can be used. The string "iters" can also be used to make the algorithm stop only when max_iters are reached (see below).
- *max_iters:* Indicates the maximum number of iterations the algorithm can run for (default is 100). This quantity works as a safety net to guarantee algorithm termination.
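
For instance, a sketch of setting these directly through a filter's constructor:

```python
import pygrank as pg

# tighter tolerance than the default, with a larger safety net of iterations
precise_ranker = pg.PageRank(alpha=0.85, tol=1.E-9, max_iters=1000)

# run for exactly max_iters iterations instead of tracking numerical tolerance
fixed_ranker = pg.PageRank(alpha=0.85, error_type="iters", max_iters=50)
```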

Sometimes, it suffices to reach a robust node rank order instead of precise
values. To cover such cases we have implemented a different convergence criterion
@@ -22,11 +23,80 @@ import pygrank as pg

G, personalization = ...
alpha = 0.85
ranker = pg.PageRank(alpha=alpha, convergence=pg.RankOrderConvergenceManager(alpha))
ordered_ranker = ranker >> pg.Ordinals()
ordered_ranks = ordered_ranker(G, personalization)
```

!!! info
    Since the node order was deemed more important than the specific rank values,
    a postprocessing step was added.



# Demo

As a quick start, let us construct a graph
and a set of nodes. The graph's class can be
imported either from the `networkx` library or from
`pygrank` itself. The two are in large part interoperable
and both can be parsed by our algorithms.
However, our implementation is tailored to graph signal
processing needs, and thus tends to be faster and to consume
only a fraction of the memory.

```python
from pygrank import Graph

graph = Graph()
graph.add_edge("A", "B")
graph.add_edge("B", "C")
graph.add_edge("C", "D")
graph.add_edge("D", "E")
graph.add_edge("A", "C")
graph.add_edge("C", "E")
graph.add_edge("B", "E")
seeds = {"A", "B"}
```

We now run a personalized PageRank
to score the structural relatedness of graph nodes to those in the seed set.
First, let us import the library:

```python
import pygrank as pg
```

For instructional purposes,
we experiment with (personalized) *PageRank*
and make it output the node order of ranks.

```python
ranker = pg.PageRank(alpha=0.85, tol=1.E-6, normalization="auto") >> pg.Ordinals()
ranks = ranker(graph, {v: 1 for v in seeds})
```

How much time did it take for the base ranker to converge?
(Depends on backend and device characteristics.)

```python
print(ranker.convergence)
# 19 iterations (0.0021852000063518062 sec)
```

Since for this example only the node order is important,
we can use a different way to specify convergence:

```python
convergence = pg.RankOrderConvergenceManager(pagerank_alpha=0.85, confidence=0.98)
early_stop_ranker = pg.PageRank(alpha=0.85, convergence=convergence) >> pg.Ordinals()
ordinals = early_stop_ranker(graph, {v: 1 for v in seeds})
print(early_stop_ranker.convergence)
# 2 iterations (0.0005241000035312027 sec)
print(ordinals["B"], ordinals["D"], ordinals["E"])
# 3.0 5.0 4.0
```

Close to the previous results at a fraction of the time! For large graphs,
most ordinals would be near the ideal ones. Note that convergence time
does not take into account the time needed to preprocess graphs.
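
That preprocessing cost can be amortized across runs with the caching preprocessor described in the quickstart; here is a sketch that combines it with the rank order convergence manager from above:

```python
import pygrank as pg

pre = pg.preprocessor(assume_immutability=True)
convergence = pg.RankOrderConvergenceManager(pagerank_alpha=0.85, confidence=0.98)
ranker = pg.PageRank(alpha=0.85, convergence=convergence) + pre
# the first run pays the normalization cost; later runs on the same graph reuse it
```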