Add a howto on determining packages content for an image #474

Open
wants to merge 6 commits into
base: master
172 changes: 172 additions & 0 deletions docs/howto/determining-packages.md
@@ -0,0 +1,172 @@
# Determining Packages in Pangeo Docker Images

For deployments of Pangeo Docker Images on JupyterHub or Binder, it can sometimes be unclear to users exactly which image and tag are in use. The following images are available as Pangeo Docker Images:

| Image | Description |
|-----------------|-----------------------------------------------|
| [base-notebook](base-notebook/packages.txt) | minimally functional image for pangeo hubs |
| [pangeo-notebook](pangeo-notebook/packages.txt) | base-notebook + core earth science analysis packages |
| [pytorch-notebook](pytorch-notebook/packages.txt) | pangeo-notebook + GPU-enabled pytorch |
| [ml-notebook](ml-notebook/packages.txt) | pangeo-notebook + GPU-enabled tensorflow2 |
| [forge](forge/packages.txt) | pangeo-notebook + [Apache Beam](https://beam.apache.org/) support |

Each image is also assigned a tag of the form `YYYY.MM.DD` indicating when it was last updated.

*Click on an image name in the table above for the current list of installed packages and versions in the most recent tag of that image.*
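
Because tags follow the `YYYY.MM.DD` pattern, they can be parsed as dates and compared chronologically. The following is a minimal sketch under that assumption; the helper name `parse_image_tag` is hypothetical and not part of any Pangeo tooling.

```python
from datetime import date, datetime


def parse_image_tag(tag: str) -> date:
    """Convert a 'YYYY.MM.DD' tag into a date (raises ValueError otherwise)."""
    return datetime.strptime(tag, "%Y.%m.%d").date()


# Pick the most recent of several tags by comparing them as dates.
tags = ["2023.06.20", "2023.06.07"]
newest = max(tags, key=parse_image_tag)
print(newest)
```

A tag that does not match the pattern raises `ValueError`, which can serve as a quick sanity check on a tag typed by hand.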

## How to find my current image tag

The environment variable `JUPYTER_IMAGE_SPEC` tells you which Pangeo Docker Image is currently in use.

### Using the command line with Bash

```bash
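echo "JUPYTER_IMAGE_SPEC: $JUPYTER_IMAGE_SPEC"

# Split "<owner>/<image_name>:<image_tag>" with parameter expansion;
# working from the right also handles specs with a registry prefix.
image_tag=${JUPYTER_IMAGE_SPEC##*:}
image_path=${JUPYTER_IMAGE_SPEC%:*}
image_name=${image_path##*/}

echo "image_name: $image_name"
echo "image_tag: $image_tag"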
```

### Using a Jupyter notebook with Python

```python
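import os

image_spec = os.environ['JUPYTER_IMAGE_SPEC']
print('JUPYTER_IMAGE_SPEC:', image_spec)

# Split "<owner>/<image_name>:<image_tag>"; working from the right also
# handles specs that carry a registry prefix.
image_path, _, image_tag = image_spec.rpartition(':')
image_name = image_path.rsplit('/', 1)[-1]

print('image_name:', image_name)
print('image_tag:', image_tag)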
```

## How to find the list of available image tags

On the GitHub [website](https://github.com/pangeo-data/pangeo-docker-images), open the branches/tags drop-down menu to see the list of available tags.
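
The same list can also be obtained without a browser. The sketch below assumes that `git` is installed and that the repository's git tags correspond to the image tags (the raw-file URLs used elsewhere in this howto rely on the same correspondence). The function names `parse_ls_remote_tags` and `list_image_tags` are hypothetical helpers, not part of any Pangeo package.

```python
import subprocess

REPO_URL = "https://github.com/pangeo-data/pangeo-docker-images"


def parse_ls_remote_tags(output):
    """Extract sorted, de-duplicated tag names from `git ls-remote --tags` output."""
    prefix = "refs/tags/"
    tags = set()
    for line in output.splitlines():
        # Each line has the form "<sha>\t<ref>"
        _, _, ref = line.partition("\t")
        if not ref.startswith(prefix):
            continue
        tag = ref[len(prefix):]
        if tag.endswith("^{}"):
            tag = tag[:-3]  # peeled entry that git prints for annotated tags
        tags.add(tag)
    return sorted(tags)


def list_image_tags(repo_url=REPO_URL):
    """Query the remote repository for its tags (needs git and network access)."""
    result = subprocess.run(
        ["git", "ls-remote", "--tags", repo_url],
        capture_output=True, text=True, check=True,
    )
    return parse_ls_remote_tags(result.stdout)
```

Calling `list_image_tags()` from a notebook returns the tag names as a list of strings; the parsing is kept in a separate function so it can be checked without network access.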

## How to find the list of packages in any given tag

Each image has a `packages.txt` file that lists exactly which packages and versions are installed in the image.

### Using the GitHub website

On the GitHub [website](https://github.com/pangeo-data/pangeo-docker-images), select a particular tag from the branches/tags drop-down menu. Navigate to the image directory and select the `packages.txt` file.

### Using the command line with Bash
Member

IMO, just focusing on using the GitHub UI is good enough.

Author

I think it depends on the user and what other tools they are used to using.

Member

For sure! However, IMO this page is way too overwhelming now, with implementations in multiple languages that must be read and understood, copy pasted and run. It's also hard to keep middle-complexity code like this working fine over time, and there's no linting, etc (for example, there's a few linting violations in the python code). If we want to provide people code based support, I think the better way to do that is to make a python package they can quickly pip install and run, and that can have tests, etc. For now, let's get rid of the code here?


```bash
# Split "<owner>/<image_name>:<image_tag>" with parameter expansion
image_tag=${JUPYTER_IMAGE_SPEC##*:}
image_path=${JUPYTER_IMAGE_SPEC%:*}
image_name=${image_path##*/}

packages_url=https://raw.githubusercontent.com/pangeo-data/pangeo-docker-images/$image_tag/$image_name/packages.txt

curl -s "$packages_url"
```

### Using a Jupyter notebook with Python

```python
import os

import requests

image_spec = os.environ['JUPYTER_IMAGE_SPEC']
image_path, _, image_tag = image_spec.rpartition(':')
image_name = image_path.rsplit('/', 1)[-1]


def get_package_list(image_name, image_tag):
    """Return a list of (name, version) tuples for the packages in an image."""
    packages_url = (
        "https://raw.githubusercontent.com/pangeo-data/pangeo-docker-images/"
        f"{image_tag}/{image_name}/packages.txt"
    )
    response = requests.get(packages_url)
    response.raise_for_status()  # fail loudly on an unknown image or tag
    packages_str = response.content.decode('utf8')

    packages = []
    for line in packages_str.splitlines():
        if not line or line.startswith('#'):
            continue  # skip blank lines and the header
        package_name, package_version = line.split('==')
        packages.append((package_name, package_version))
    return packages


packages = get_package_list(image_name, image_tag)
print(f"{len(packages)} packages found in {image_spec}\n")
for package_name, package_version in packages:
    print(f"{package_name}=={package_version}")
```
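
If only one package is of interest, the list of `(name, version)` tuples can be turned into a dictionary for direct lookup. This is a minimal sketch: the sample data below is made up to stand in for the output of `get_package_list`, and `installed_version` is a hypothetical helper name.

```python
# Hypothetical sample standing in for the result of get_package_list();
# the names and versions are illustrative, not taken from a real image.
packages = [("numpy", "1.24.3"), ("xarray", "2023.5.0")]

# Map each package name to its version for direct lookup.
versions = dict(packages)


def installed_version(name, versions):
    """Return the version string for a package, or None if it is absent."""
    return versions.get(name)


print(installed_version("xarray", versions))
print(installed_version("not-a-package", versions))
```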

## How to find differences in package lists between any two tags

Suppose you want to find out what the differences are between

`pangeo/pangeo-notebook:2023.06.07`

and

`pangeo/pangeo-notebook:2023.06.20`

to determine which packages were changed, added, or removed.

### Using the command line with Bash

```bash
image_name=pangeo-notebook
image_tag_A=2023.06.07
image_tag_B=2023.06.20

url_A=https://raw.githubusercontent.com/pangeo-data/pangeo-docker-images/$image_tag_A/$image_name/packages.txt
url_B=https://raw.githubusercontent.com/pangeo-data/pangeo-docker-images/$image_tag_B/$image_name/packages.txt

# Show only the lines that differ, side by side
diff -y --suppress-common-lines --color <(curl -s "$url_A") <(curl -s "$url_B")
```

### Using a Jupyter notebook with Python

Using the `get_package_list` function defined above, get a list of packages from both tags:

```python
image_name = 'pangeo-notebook'
image_tag_A = '2023.06.07'
image_tag_B = '2023.06.20'

packages_A = get_package_list(image_name, image_tag_A)
packages_B = get_package_list(image_name, image_tag_B)
```

Determine which packages have been added or removed.

```python
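import difflib

# Differ works on sequences of strings, so compare "name==version" lines.
lines_A = [f"{name}=={version}" for name, version in packages_A]
lines_B = [f"{name}=={version}" for name, version in packages_B]

added_packages = {}
removed_packages = {}
for line in difflib.Differ().compare(lines_A, lines_B):
    marker, entry = line[0], line[2:]
    if marker not in '+-':
        continue  # skip unchanged lines and '?' hint lines
    name, version = entry.split('==')
    if marker == '+':
        added_packages[name] = version
    else:
        removed_packages[name] = version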
```

Summarize the differences:

```python
print(f'Differences in `packages.txt` for {image_name} between {image_tag_A} and {image_tag_B}')

print('Packages added:')
for name in sorted(set(added_packages) - set(removed_packages)):
    print(f"  {name} {added_packages[name]}")

print('Packages removed:')
for name in sorted(set(removed_packages) - set(added_packages)):
    print(f"  {name} {removed_packages[name]}")

print('Packages changed:')
for name in sorted(set(added_packages) & set(removed_packages)):
    print(f"  {name} {removed_packages[name]} -> {added_packages[name]}")
```