Extract tutorials and link to tools #90

Merged: 13 commits (Jun 24, 2024)
44 changes: 44 additions & 0 deletions .github/workflows/fetch_all_tutorials.yaml
@@ -0,0 +1,44 @@
name: Fetch all tutorials

on:
  workflow_dispatch:
  schedule:
    # Every Sunday at 8:00 am
    - cron: "0 8 * * 0"

# Allow only one concurrent run, skipping runs queued between the in-progress run and the latest queued one.
# However, do NOT cancel in-progress runs, as we want to let them complete.
concurrency:
  group: "tutorials"
  cancel-in-progress: false

permissions:
  contents: write

jobs:
  fetch-all-tutorials:
    runs-on: ubuntu-20.04
    environment: fetch-tutorials
    name: Fetch all tutorials
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.8'
      - name: Install requirements
        run: python -m pip install -r requirements.txt
      - name: Run script # needs PAT to access other repos
        run: |
          bash ./bin/extract_all_tutorials.sh
        env:
          PLAUSIBLE_API_KEY: ${{ secrets.PLAUSIBLE_API_TOKEN }}
      - name: Commit all tutorials
        # add and commit any changes in results if there was a change, merge with main and push as bot
        run: |
          git config user.name github-actions
          git config user.email [email protected]
          git pull --no-rebase -s recursive -X ours
          git add results
          git status
          git diff --quiet && git diff --staged --quiet || (git commit -m "fetch all tutorials bot - step fetch")
          git push
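The conditional commit line in these workflows is dense; unpacked, it behaves like the following sketch (not part of the PR):

```
# git diff --quiet exits non-zero when there are unstaged changes;
# git diff --staged --quiet exits non-zero when there are staged changes.
# So the commit only runs when something actually changed.
if ! git diff --quiet || ! git diff --staged --quiet; then
    git commit -m "fetch all tutorials bot - step fetch"
fi
```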
51 changes: 50 additions & 1 deletion .github/workflows/filter_communities.yaml
@@ -25,6 +25,52 @@ permissions:
  contents: write

jobs:
  filter-all-tutorials:
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v5
        with:
          python-version: '3.8'
      - name: Install requirements
        run: python -m pip install -r requirements.txt
      - name: Run script
        run: |
          bash ./bin/get_community_tutorials.sh
      - name: Commit results
        # commit the newly filtered data, only if something changed
        run: |
          git config user.name github-actions
          git config user.email [email protected]
          git pull --no-rebase -s recursive -X ours
          git add results
          git status
          git diff --quiet && git diff --staged --quiet || (git commit -m "fetch all tutorials / tools bot - step tutorial filter")
          git push

  update-tools-to-keep-exclude:
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v5
        with:
          python-version: '3.8'
      - name: Install requirements
        run: python -m pip install -r requirements.txt
      - name: Run script
        run: |
          bash ./bin/update_tools_to_keep_exclude.sh
      - name: Commit results
        # commit the newly updated lists, only if something changed
        run: |
          git config user.name github-actions
          git config user.email [email protected]
          git pull --no-rebase -s recursive -X ours
          git add results
          git status
          git diff --quiet && git diff --staged --quiet || (git commit -m "fetch all tutorials / tools bot - step excluded/kept tool list update")
          git push

  filter-all-tools:
    runs-on: ubuntu-20.04
    steps:
@@ -45,5 +91,8 @@ jobs:
          git pull --no-rebase -s recursive -X ours
          git add results
          git status
          git diff --quiet && git diff --staged --quiet || (git commit -m "fetch all tools bot - step filter")
          git diff --quiet && git diff --staged --quiet || (git commit -m "fetch all tools bot - step tool filter")
          git push



10 changes: 8 additions & 2 deletions .github/workflows/run_tests.yaml
@@ -29,11 +29,17 @@ jobs:
          python-version: ${{ matrix.python-version }}
      - name: Install requirements
        run: python -m pip install -r requirements.txt
      - name: Run script
      - name: Test tool extraction
        # run: bash bin/extract_all_tools.sh
        run: |
          export GITHUB_API_KEY=${{ secrets.GH_API_TOKEN }}
          bash ./bin/extract_all_tools_test.sh "${{ matrix.subset }}"
        env:
          GITHUB_API_KEY: ${{ secrets.GH_API_TOKEN }}
      - name: Test tutorial extraction and filtering
        run: |
          bash ./bin/extract_filter_tutorials_test.sh
        env:
          PLAUSIBLE_API_KEY: ${{ secrets.PLAUSIBLE_API_TOKEN }}
      - name: Commit all tools
        # add and commit any changes in results if there was a change, merge with main and push as bot
        run: |
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
.DS_Store
__pycache__
37 changes: 35 additions & 2 deletions README.md
@@ -62,7 +62,9 @@ The tool performs the following steps:
$ python3 -m pip install -r requirements.txt
```

## Extract all tools
## Tools

### Extract all tools

1. Get an API key ([personal token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens)) for GitHub
2. Export the GitHub API key as an environment variable:
@@ -96,7 +98,7 @@ The script will generate a TSV file with each tool found in the list of GitHub r
15. Conda id
16. Conda version

## Filter tools based on their categories in the ToolShed
### Filter tools based on their categories in the ToolShed

1. Run the extraction as explained before
2. (Optional) Create a text file with ToolShed categories for which tools need to be extracted: 1 ToolShed category per row ([example for microbial data analysis](data/microgalaxy/categories))
@@ -118,6 +120,37 @@ The script will generate a TSV file with each tool found in the list of GitHub r
[--status <Path to a TSV file with tool status - 3 columns: ToolShed ids of tool suites, Boolean with True to keep and False to exclude, Boolean with True if deprecated and False if not>]
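For reference, each row of the status file is tab-separated with no header, along these lines (hypothetical suite id):

```
abricate	True	False
```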
```

## Training

### Extract tutorials from GTN

1. Get an API key for [Plausible](https://plausible.io)
2. Export the Plausible API key as an environment variable:

```
$ export PLAUSIBLE_API_KEY=<your Plausible API key>
```

3. Run the script

```
$ bash bin/extract_all_tutorials.sh
```
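The script wraps the `extracttutorials` subcommand of `bin/extract_gtn_tutorials.py` (see `bin/extract_all_tutorials.sh` in this diff) and writes `results/all_tutorials.json`; it expects `results/all_tools.json` from the tool extraction step to already exist and `PLAUSIBLE_API_KEY` to be exported.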

### Filter tutorials based on tags

1. Run the extraction as explained before
2. Create a file named `tutorial_tags` in your community `data` folder with the list of tutorial tags to keep, one per line (see the example after the command below)
3. Run the following command

```
$ python bin/extract_gtn_tutorials.py \
filtertutorials \
--all_tutorials "results/all_tutorials.json" \
--filtered_tutorials "results/<your community>/tutorials.tsv" \
--tags "data/communities/<your community>/tutorial_tags"
```
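A `tutorial_tags` file is plain text with one tag per line, for example (hypothetical tags):

```
microgalaxy
metagenomics
```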

## Development

To make a test run of the tool and check its functionality, follow [Usage](#Usage) to set up the environment and the API key, then run
80 changes: 80 additions & 0 deletions bin/compare_tools.py
@@ -0,0 +1,80 @@
#!/usr/bin/env python

import argparse
from pathlib import Path
from typing import Set

import pandas as pd
import shared_functions


def get_tutorials_tool_suites(tuto_fp: str, tool_fp: str) -> Set:
    """
    Get tool suite ids for all tools in tutorials
    """
    tutorials = pd.read_csv(tuto_fp, sep="\t", keep_default_na=False).to_dict("records")
    all_tools = shared_functions.read_suite_per_tool_id(tool_fp)
    tuto_tool_suites = set()
    for tuto in tutorials:
        tools = tuto["Tools"].split(", ")
        for t in tools:
            if t in all_tools:
                tuto_tool_suites.add(all_tools[t]["Galaxy wrapper id"])
            else:
                print(f"{t} not found in all tools")
    return tuto_tool_suites


def write_tool_list(tools: Set, fp: str) -> None:
    """
    Write tool list with 1 element per row in a file
    """
    tool_list = list(tools)
    tool_list.sort()
    with Path(fp).open("w") as f:
        f.write("\n".join(tool_list))


def update_excl_keep_tool_lists(tuto_tool_suites: Set, excl_tool_fp: str, keep_tool_fp: str) -> None:
    """
    Update the lists of tools to keep and to exclude with tool suites used in tutorials
    """
    # drop tools that appear in tutorials from the exclusion list
    excl_tools = set(shared_functions.read_file(excl_tool_fp)) - tuto_tool_suites
    write_tool_list(excl_tools, excl_tool_fp)
    # add tools that appear in tutorials to the keep list
    keep_tools = set(shared_functions.read_file(keep_tool_fp)) | tuto_tool_suites
    write_tool_list(keep_tools, keep_tool_fp)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Update community-curated list of tools to keep and exclude with tools in community-curated tutorials"
    )
    parser.add_argument(
        "--filtered_tutorials",
        "-t",
        required=True,
        help="Filepath to TSV with filtered tutorials",
    )
    parser.add_argument(
        "--exclude",
        "-e",
        help="Path to a file with ids of tools to exclude (one per line)",
    )
    parser.add_argument(
        "--keep",
        "-k",
        help="Path to a file with ids of tools to keep (one per line)",
    )
    parser.add_argument(
        "--all_tools",
        "-a",
        required=True,
        help="Filepath to TSV with all extracted tools, generated by extractools command",
    )
    args = parser.parse_args()

    tuto_tools = get_tutorials_tool_suites(args.filtered_tutorials, args.all_tools)
    update_excl_keep_tool_lists(tuto_tools, args.exclude, args.keep)
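This script is presumably what `bin/update_tools_to_keep_exclude.sh` (called in the workflow above but not shown in this diff) wraps. A sketch of an invocation for the microgalaxy community, with all paths assumed from the layout used elsewhere in this PR:

```
$ python bin/compare_tools.py \
    --filtered_tutorials "results/microgalaxy/tutorials.tsv" \
    --all_tools "results/microgalaxy/tools.tsv" \
    --exclude "data/communities/microgalaxy/tools_to_exclude" \
    --keep "data/communities/microgalaxy/tools_to_keep"
```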
7 changes: 7 additions & 0 deletions bin/extract_all_tutorials.sh
@@ -0,0 +1,7 @@
#!/usr/bin/env bash

python bin/extract_gtn_tutorials.py \
    extracttutorials \
    --all_tutorials "results/all_tutorials.json" \
    --tools "results/all_tools.json" \
    --api "$PLAUSIBLE_API_KEY"
32 changes: 32 additions & 0 deletions bin/extract_filter_tutorials_test.sh
@@ -0,0 +1,32 @@
#!/usr/bin/env bash

mkdir -p 'results/'

python bin/extract_gtn_tutorials.py \
    extracttutorials \
    --all_tutorials "results/test_tutorials.json" \
    --tools "results/all_tools.json" \
    --api "$PLAUSIBLE_API_KEY" \
    --test

if [[ ! -f "results/test_tutorials.json" ]] ; then
    echo 'File "results/test_tutorials.json" is not there, aborting.'
    exit 1
fi

python bin/extract_gtn_tutorials.py \
    filtertutorials \
    --all_tutorials "results/test_tutorials.json" \
    --filtered_tutorials "results/microgalaxy/test_tutorials.tsv" \
    --tags "data/communities/microgalaxy/tutorial_tags"

if [[ ! -f "results/microgalaxy/test_tutorials.tsv" ]] ; then
    echo 'File "results/microgalaxy/test_tutorials.tsv" is not there, aborting.'
    exit 1
fi

rm "results/test_tutorials.json"
rm "results/microgalaxy/test_tutorials.tsv"
42 changes: 11 additions & 31 deletions bin/extract_galaxy_tools.py
@@ -17,6 +17,7 @@

import pandas as pd
import requests
import shared_functions
import yaml
from github import Github
from github.ContentFile import ContentFile
@@ -91,22 +92,6 @@ def get_tool_stats_from_stats_file(tool_stats_df: pd.DataFrame, tool_ids: List[s
    return int(agg_count)


def read_file(filepath: Optional[str]) -> List[str]:
    """
    Read an optional file with 1 element per line

    :param filepath: path to a file
    """
    if filepath is None:
        return []
    fp = Path(filepath)
    if fp.is_file():
        with fp.open("r") as f:
            return [x.rstrip() for x in f.readlines()]
    else:
        return []


def get_string_content(cf: ContentFile) -> str:
    """
    Get string of the content from a ContentFile
@@ -507,13 +492,6 @@ def check_tools_on_servers(tool_ids: List[str], galaxy_server_url: str) -> int:
    return counter


def format_list_column(col: pd.Series) -> pd.Series:
    """
    Format a column that could be a list before exporting
    """
    return col.apply(lambda x: ", ".join(str(i) for i in x))


def export_tools_to_json(tools: List[Dict], output_fp: str) -> None:
    """
    Export tool metadata to JSON output file
@@ -537,17 +515,19 @@ def export_tools_to_tsv(
"""
df = pd.DataFrame(tools).sort_values("Galaxy wrapper id")
if format_list_col:
df["ToolShed categories"] = format_list_column(df["ToolShed categories"])
df["EDAM operation"] = format_list_column(df["EDAM operation"])
df["EDAM topic"] = format_list_column(df["EDAM topic"])
df["ToolShed categories"] = shared_functions.format_list_column(df["ToolShed categories"])
df["EDAM operation"] = shared_functions.format_list_column(df["EDAM operation"])
df["EDAM topic"] = shared_functions.format_list_column(df["EDAM topic"])

df["EDAM operation (no superclasses)"] = format_list_column(df["EDAM operation (no superclasses)"])
df["EDAM topic (no superclasses)"] = format_list_column(df["EDAM topic (no superclasses)"])
df["EDAM operation (no superclasses)"] = shared_functions.format_list_column(
df["EDAM operation (no superclasses)"]
)
df["EDAM topic (no superclasses)"] = shared_functions.format_list_column(df["EDAM topic (no superclasses)"])

df["bio.tool ids"] = format_list_column(df["bio.tool ids"])
df["bio.tool ids"] = shared_functions.format_list_column(df["bio.tool ids"])

# the Galaxy tools need to be formatted for the add_instances_to_table to work
df["Galaxy tool ids"] = format_list_column(df["Galaxy tool ids"])
df["Galaxy tool ids"] = shared_functions.format_list_column(df["Galaxy tool ids"])

    # if add_usage_stats:
    #     df = add_usage_stats_for_all_server(df)
@@ -764,7 +744,7 @@ def reduce_ontology_terms(terms: List, ontology: Any) -> List:
    with Path(args.tools).open() as f:
        tools = json.load(f)
    # get categories and tools to exclude
    categories = read_file(args.categories)
    categories = shared_functions.read_file(args.categories)
    try:
        status = pd.read_csv(args.status, sep="\t", index_col=0, header=None).to_dict("index")
    except Exception as ex:
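The `read_file` and `format_list_column` helpers removed above now live in a shared module, imported at the top of this file as `shared_functions` and also used by `bin/compare_tools.py`. The module itself is not shown in this diff; below is a minimal sketch of `bin/shared_functions.py`, assuming it only hosts the relocated helpers plus a `read_suite_per_tool_id` lookup whose shape is inferred from its call sites, not the actual implementation:

```
"""Shared helpers used by extract_galaxy_tools.py and compare_tools.py (sketch)."""

from pathlib import Path
from typing import Dict, List, Optional

import pandas as pd


def read_file(filepath: Optional[str]) -> List[str]:
    """Read an optional file with 1 element per line."""
    if filepath is None:
        return []
    fp = Path(filepath)
    if not fp.is_file():
        return []
    with fp.open("r") as f:
        return [x.rstrip() for x in f.readlines()]


def format_list_column(col: pd.Series) -> pd.Series:
    """Format a column that could be a list before exporting."""
    return col.apply(lambda x: ", ".join(str(i) for i in x))


def read_suite_per_tool_id(tool_fp: str) -> Dict:
    """
    Map each Galaxy tool id to its suite record (shape inferred from the
    call sites in compare_tools.py: `t in all_tools` and
    `all_tools[t]["Galaxy wrapper id"]`).
    """
    suites = pd.read_csv(tool_fp, sep="\t", keep_default_na=False).to_dict("records")
    tool_suites: Dict = {}
    for suite in suites:
        # "Galaxy tool ids" is flattened with format_list_column on export,
        # so split it back into individual tool ids
        for tool_id in suite["Galaxy tool ids"].split(", "):
            tool_suites[tool_id] = {"Galaxy wrapper id": suite["Galaxy wrapper id"]}
    return tool_suites
```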