Commit
Remove .NET SDK installation requirement, other small changes
isaac091 committed May 10, 2024
1 parent bec0ade commit ab0450c
Showing 6 changed files with 52 additions and 64 deletions.
11 changes: 1 addition & 10 deletions .devcontainer/Dockerfile
@@ -8,16 +8,9 @@ WORKDIR /app
ENV POETRY_HOME=/opt/poetry
ENV POETRY_VENV=/opt/poetry-venv
ENV POETRY_CACHE_DIR=/opt/.cache
ENV DOTNET_ROLL_FORWARD=LatestMajor
ENV PIP_DISABLE_PIP_VERSION_CHECK=on
ENV TZ=America/New_York
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
# Install .NET SDK
RUN apt-get update
RUN apt-get install --no-install-recommends -y wget
RUN wget https://packages.microsoft.com/config/ubuntu/20.04/packages-microsoft-prod.deb -O packages-microsoft-prod.deb && \
dpkg -i packages-microsoft-prod.deb && \
rm packages-microsoft-prod.deb
# Install apt packages
RUN apt-get update
RUN apt-get upgrade -y
@@ -29,8 +22,7 @@ RUN apt-get install --no-install-recommends -y \
build-essential \
gdb \
curl \
unzip \
dotnet-sdk-7.0
unzip
# Make some useful symlinks that are expected to exist
RUN ln -sfn /usr/bin/python${PYTHON_VERSION} /usr/bin/python3 & \
ln -sfn /usr/bin/python${PYTHON_VERSION} /usr/bin/python
@@ -52,5 +44,4 @@ ENV SIL_NLP_CACHE_PROJECT_DIR=/root/.cache/silnlp/projects
# Set environment variables
ENV CLEARML_API_HOST="https://api.sil.hosted.allegro.ai"
ENV SIL_NLP_DATA_PATH=/aqua-ml-data
ENV AWS_REGION="us-east-1"
CMD ["bash"]
3 changes: 2 additions & 1 deletion .devcontainer/devcontainer.json
@@ -14,9 +14,10 @@
"-v",
"${env:HOME}/.aws:/root/.aws", // Mount user's AWS credentials into the container
"-v",
"/home/clearml/.clearml/hf-cache:/root/.cache/huggingface"
"${env:HOME}/clearml/.clearml/hf-cache:/root/.cache/huggingface"
],
"containerEnv": {
"AWS_REGION": "${localEnv:AWS_REGION}",
"AWS_ACCESS_KEY_ID": "${localEnv:AWS_ACCESS_KEY_ID}",
"AWS_SECRET_ACCESS_KEY": "${localEnv:AWS_SECRET_ACCESS_KEY}",
"CLEARML_API_ACCESS_KEY": "${localEnv:CLEARML_API_ACCESS_KEY}",
13 changes: 0 additions & 13 deletions Dockerfile
@@ -55,14 +55,6 @@ RUN apt-get install --no-install-recommends -y \
RUN ln -sfn /usr/bin/python${PYTHON_VERSION} /usr/bin/python3 & \
ln -sfn /usr/bin/python${PYTHON_VERSION} /usr/bin/python

# Install .NET SDK
RUN wget https://packages.microsoft.com/config/ubuntu/20.04/packages-microsoft-prod.deb -O packages-microsoft-prod.deb && \
dpkg -i packages-microsoft-prod.deb && \
rm packages-microsoft-prod.deb
RUN apt-get update && \
apt-get install --no-install-recommends -y dotnet-sdk-7.0
ENV DOTNET_ROLL_FORWARD=LatestMajor

# Install dependencies from poetry
COPY --from=builder /src/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt && rm requirements.txt
@@ -111,11 +103,6 @@ RUN mkdir .cache/silnlp/projects
ENV SIL_NLP_CACHE_EXPERIMENT_DIR=/root/.cache/silnlp/experiments
ENV SIL_NLP_CACHE_PROJECT_DIR=/root/.cache/silnlp/projects

# Other environment variables
ENV SIL_NLP_DATA_PATH=/aqua-ml-data
ENV CLEARML_API_HOST="https://api.sil.hosted.allegro.ai"
ENV AWS_REGION="us-east-1"

# Clone silnlp and make it the starting directory
RUN git clone https://github.com/sillsdev/silnlp.git
WORKDIR /root/silnlp
48 changes: 25 additions & 23 deletions README.md
@@ -10,15 +10,14 @@ SIL NLP provides a set of pipelines for performing experiments on various NLP ta
---

## SILNLP Prerequisites
These are the main requirements for the SILNLP code to run on a local machine. The SILNLP repo itself is hosted on Github, mainly written in Python and calls SIL.Machine.Tool. 'Machine' as we tend to call it, is a .NET application that has many functions for manipulating USFM data. Most of the language data we have for low resource languages in USFM format. Since Machine is a .Net application it depends upon the __.NET core SDK__ which works on Windows and Linux. Since there are many python packages that need to be used, with complex versioning requirements we use a Python package called Poetry to mangage all of those. So here is a rough heirarchy of SILNLP with the major dependencies.
These are the main requirements for the SILNLP code to run on a local machine. The SILNLP repo itself is hosted on Github, mainly written in Python and calls SIL.Machine.Tool. 'Machine' as we tend to call it, is an application that has many functions for manipulating USFM data. Most of the language data we have for low resource languages is in USFM format. Since there are many Python packages that need to be used with complex versioning requirements, we use a Python package called Poetry to mangage all of those. So here is a rough heirarchy of SILNLP with the major dependencies.

| Requirement | Reason |
| --------------------- | ----------------------------------------------------------------- |
| GIT | to get the repo from [github](https://github.com/sillsdev/silnlp) |
| Python | to run the silnlp code |
| Poetry | to manage all the Python packages and versions |
| SIL.Machine.Tool | to support many functions for data manipulation |
| .Net core SDK | Required by SIL.Machine.Tool |
| NVIDIA GPU | Required to run on a local machine |
| Nvidia drivers | Required for the GPU |
| CUDA Toolkit | Required for the Machine learning with the GPU |
Expand Down Expand Up @@ -58,32 +57,28 @@ These are the main requirements for the SILNLP code to run on a local machine. T
A docker container should be created. You should be able to see a container named 'silnlp' on the Containers page of Docker Desktop.

5. Create file for environment variables

__If you do not intend to use SILNLP with ClearML and AWS, you can skip this step. If you need to generate ClearML credentials, see [ClearML setup](clear_ml_setup.md).__

Create a text file with the following content and insert your credentials.
Create a text file with the following content and edit as necessary:
```
CLEARML_API_HOST="https://api.sil.hosted.allegro.ai"
CLEARML_API_ACCESS_KEY=xxxxx
CLEARML_API_SECRET_KEY=xxxxx
AWS_REGION="us-east-1"
AWS_ACCESS_KEY_ID=xxxxx
AWS_SECRET_ACCESS_KEY=xxxxx
SIL_NLP_DATA_PATH="/aqua-ml-data"
```
* If you do not intend to use SILNLP with ClearML and/or AWS, you can leave out the respective variables. If you need to generate ClearML credentials, see [ClearML setup](clear_ml_setup.md).
* Note that this does not give you direct access to an AWS S3 bucket from within the Docker container, it only allows you to run scripts referencing files in the bucket.

6. Start container

If you completed step 5: \
In a terminal, run:
```
docker start silnlp
docker exec -it --env-file path/to/env_vars_file silnlp bash
```
If you did not complete step 5: \
In a terminal, run:
```
docker start silnlp
docker exec -it silnlp bash
```

* After this step, the terminal should change to say `root@xxxxx:~/silnlp#`, where `xxxxx` is a string of letters and numbers, instead of your current working directory. This is the command line for the docker container, and you're able to run SILNLP scripts from here.
* To leave the container, run `exit`, and to stop it, run `docker stop silnlp`. It can be started again by repeating step 6. Stopping the container will not erase any changes made in the container environment, but removing it will.

@@ -110,35 +105,36 @@ Follow the instructions below to set up a Dev Container in VS Code. This is the
* Add your user to the docker group by using a terminal to run: `sudo usermod -aG docker $USER`
* Sign out and back in again so your changes take effect

3. Set up the [S3 bucket](s3_bucket_setup.md).
3. Set up [ClearML](clear_ml_setup.md).

4. Set up [ClearML](clear_ml_setup.md).
4. Define environment variables.

5. Define environment variables.

Set the following environment variables with your respective credentials: CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY
Set the following environment variables with your respective credentials: CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY. Additionally, set AWS_REGION. The typical value is "us-east-1".
* Windows users: see [here](https://github.com/sillsdev/silnlp/wiki/Install-silnlp-on-Windows-10#permanently-set-environment-variables) for instructions on setting environment variables permanently
* Linux users: To set environment variables permanently, add each variable as a new line to the `.bashrc` file in your home directory with the format
```
export VAR="VAL"
```
6. Install Visual Studio Code.
5. Install Visual Studio Code.
7. Clone the silnlp repo.
6. Clone the silnlp repo.
8. Open up silnlp folder in VS Code.
7. Open up silnlp folder in VS Code.
9. Install the Dev Containers extension for VS Code.
8. Install the Dev Containers extension for VS Code.
10. Build the dev container and open the silnlp folder in the container.
9. Build the dev container and open the silnlp folder in the container.
* Click on the Remote Indicator in the bottom left corner.
* Select "Reopen in Container" and choose the silnlp dev container if necessary. This will take a while the first time because the container has to build.
* If it was successful, the window will refresh and it will say "Dev Container: SILNLP" in the bottom left corner.
* Note: If you don't have a local GPU, you may need to comment out the `gpus --all` part of the `runArgs` field of the `.devcontainer/devcontainer.json` file.
11. Install and activate Poetry environment.
10. Install and activate Poetry environment.
* In the VS Code terminal, run `poetry install` to install the necessary Python libraries, and then run `poetry shell` to enter the environment in the terminal.
11. (Optional) Locally mount the S3 bucket. This will allow you to interact directly with the S3 bucket from your local terminal (outside of the dev container). See instructions [here](s3_bucket_setup.md).
To get back into the dev container and poetry environment each subsequent time, open the silnlp folder in VS Code, select the "Reopen in Container" option from the Remote Connection menu (bottom left corner), and use the `poetry shell` command in the terminal.
## Setting Up and Running Experiments
@@ -147,3 +143,9 @@ See the [wiki](https://github.com/sillsdev/silnlp/wiki) for information on setti
See [this](https://github.com/sillsdev/silnlp/wiki/Using-the-Python-Debugger) page for information on using the VS code debugger.
If you need to use a tool that is supported by SILNLP but is not installable as a Python library (which is probably the case if you get an error like "RuntimeError: eflomal is not installed."), follow the appropriate instructions [here](https://github.com/sillsdev/silnlp/wiki/Installing-External-Libraries).
## .NET Machine alignment models
If you need to run the .NET versions of the Machine alignment models, you will need to install .NET Core SDK 8.0.
* Windows: [.NET Core SDK](https://dotnet.microsoft.com/download)
* Linux: Installation instructions can be found [here](https://learn.microsoft.com/en-us/dotnet/core/install/linux-ubuntu-2004).
39 changes: 24 additions & 15 deletions manual_setup.md
@@ -1,15 +1,14 @@
# Manual Setup

## SILNLP Prerequisites
These are the main requirements for the SILNLP code to run on a local machine. The SILNLP repo itself is hosted on Github, mainly written in Python and calls SIL.Machine.Tool. 'Machine' as we tend to call it, is a .NET application that has many functions for manipulating USFM data. Most of the language data we have for low resource languages in USFM format. Since Machine is a .Net application it depends upon the __.NET core SDK__ which works on Windows and Linux. Since there are many python packages that need to be used, with complex versioning requirements we use a Python package called Poetry to mangage all of those. So here is a rough heirarchy of SILNLP with the major dependencies.
These are the main requirements for the SILNLP code to run on a local machine. The SILNLP repo itself is hosted on Github, mainly written in Python and calls SIL.Machine.Tool. 'Machine' as we tend to call it, is an application that has many functions for manipulating USFM data. Most of the language data we have for low resource languages is in USFM format. Since there are many Python packages that need to be used with complex versioning requirements, we use a Python package called Poetry to mangage all of those. So here is a rough heirarchy of SILNLP with the major dependencies.

| Requirement | Reason |
| --------------------- | ----------------------------------------------------------------- |
| GIT | to get the repo from [github](https://github.com/sillsdev/silnlp) |
| Python | to run the silnlp code |
| Poetry | to manage all the Python packages and versions |
| SIL.Machine.Tool | to support many functions for data manipulation |
| .Net core SDK | Required by SIL.Machine.Tool |
| NVIDIA GPU | Required to run on a local machine |
| Nvidia drivers | Required for the GPU |
| CUDA Toolkit | Required for the Machine learning with the GPU |
@@ -50,12 +49,7 @@ __Download and install__ the following before creating any projects or starting
```
export PATH="$HOME/.local/bin:$PATH"
```
5. .NET Core SDK
* The necessary versions are 7.0 and 3.1. If your machine is only able to install version 7.0, you can set the DOTNET_ROLL_FORWARD environment variable to "LatestMajor", which will allow you to run anything that depends on dotnet 3.1.
* Note - the .NET SDK is needed for [SIL.Machine.Tool](https://github.com/sillsdev/machine). Many of the scripts in this repo require this .Net package. The .Net package will be installed and updated when the silnlp is initialized in `__init__.py`.
* Windows: [.NET Core SDK](https://dotnet.microsoft.com/download)
* Linux: Installation instructions can be found [here](https://learn.microsoft.com/en-us/dotnet/core/install/linux-ubuntu-2004)
6. C++ Redistributable
5. C++ Redistributable
* Note - this may already be installed. If it is not installed you may get cryptic errors such as "System.DllNotFoundException: Unable to load DLL 'thot' or one of its dependencies"
* Windows: Download from https://support.microsoft.com/en-us/topic/the-latest-supported-visual-c-downloads-2647da03-1eea-4433-9aff-95f26a218cc0 and install
* Linux: Instead of installing the redistributable, run the following commands:
@@ -88,18 +82,33 @@ See [S3 bucket setup](s3_bucket_setup.md).

See [ClearML setup](clear_ml_setup.md).

### Create SILNLP cache
* Create the directory "/home/user/.cache/silnlp", replacing "user" with your username.
* Create the directory "/home/user/.cache/silnlp/experiments" and set the environment variable SIL_NLP_CACHE_EXPERIMENT_DIR to that path.
* Create the directory "/home/user/.cache/silnlp/projects" and set the environment variable SIL_NLP_CACHE_PROJECT_DIR to that path.

### Additional Environment Variables
Set the following environment variables with your respective credentials: CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY
* Windows users: see [here](https://github.com/sillsdev/silnlp/wiki/Install-silnlp-on-Windows-10#permanently-set-environment-variables) for instructions on setting environment variables permanently
* Linux users: To set environment variables permanently, add each variable as a new line to the `.bashrc` file in your home directory with the format
```
export VAR="VAL"
```
* Set the following environment variables with your respective credentials: CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY.
* Set SIL_NLP_DATA_PATH to "/aqua-ml-data" and CLEARML_API_HOST to "https://api.sil.hosted.allegro.ai".

### Setting Up and Running Experiments

See the [wiki](https://github.com/sillsdev/silnlp/wiki) for information on setting up and running experiments. The most important pages for getting started are the ones on [file structure](https://github.com/sillsdev/silnlp/wiki/Folder-structure-and-file-naming-conventions), [model configuration](https://github.com/sillsdev/silnlp/wiki/Configure-a-model), and [running experiments](https://github.com/sillsdev/silnlp/wiki/NMT:-Usage). A lot of the instructions are specific to NMT, but are still helpful starting points for doing other things like [alignment](https://github.com/sillsdev/silnlp/wiki/Alignment:-Usage).

See [this](https://github.com/sillsdev/silnlp/wiki/Using-the-Python-Debugger) page for information on using the VS code debugger.

If you need to use a tool that is supported by SILNLP but is not installable as a Python library (which is probably the case if you get an error like "RuntimeError: eflomal is not installed."), follow the appropriate instructions [here](https://github.com/sillsdev/silnlp/wiki/Installing-External-Libraries).
If you need to use a tool that is supported by SILNLP but is not installable as a Python library (which is probably the case if you get an error like "RuntimeError: eflomal is not installed."), follow the appropriate instructions [here](https://github.com/sillsdev/silnlp/wiki/Installing-External-Libraries).

## Setting environment variables permanently
Windows users: see [here](https://github.com/sillsdev/silnlp/wiki/Install-silnlp-on-Windows-10#permanently-set-environment-variables) for instructions on setting environment variables permanently

Linux users: To set environment variables permanently, add each variable as a new line to the `.bashrc` file in your home directory with the format
```
export VAR="VAL"
```

## .NET Machine alignment models

To be able to run the .NET versions of the Machine alignment models, you will need to install .NET Core SDK 8.0.
* Windows: [.NET Core SDK](https://dotnet.microsoft.com/download)
* Linux: Installation instructions can be found [here](https://learn.microsoft.com/en-us/dotnet/core/install/linux-ubuntu-2004).
2 changes: 0 additions & 2 deletions s3_bucket_setup.md
@@ -3,15 +3,13 @@
We use Amazon S3 storage for storing our experiment data. Here is some workspace setup to enable a decent workflow.

### Install and configure AWS S3 storage
The following will allow the boto3 and S3Path libraries in Python correctly talk to the S3 bucket.
* Install the aws-cli from: https://aws.amazon.com/cli/
* In cmd, type: `aws configure` and enter your AWS access_key_id and secret_access_key and the region (we use region = us-east-1).
* The aws configure command will create a folder in your home directory named '.aws' it should contain two plain text files named 'config' and 'credentials'. The config file should contain the region and the credentials file should contain your access_key_id and your secret_access_key.
(Home directory on windows is usually C:\Users\<Username>\ and on linux it is /home/username)

### Install and configure rclone


**Windows**

The following will mount /aqua-ml-data on your S drive and allow you to explore, read and write.
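
With this commit, `AWS_REGION` is no longer baked into the images: the dev container reads it from the host via `${localEnv:AWS_REGION}`, alongside the ClearML and AWS credentials the README asks users to define. Below is a minimal sketch, assuming Python 3.9 or later, of how one might check the host environment before building the container. The names `REQUIRED_VARS` and `missing_vars` are hypothetical and not part of silnlp.

```python
import os
from typing import Mapping

# Variables the updated README asks dev container users to define on the host.
# This tuple is an illustrative assumption, not something silnlp itself exports.
REQUIRED_VARS = (
    "CLEARML_API_ACCESS_KEY",
    "CLEARML_API_SECRET_KEY",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION",
)


def missing_vars(env: Mapping[str, str], required=REQUIRED_VARS) -> list[str]:
    """Return the required names that are unset or empty in env."""
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars(os.environ)
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All expected environment variables are set.")
```

Users who do not intend to use ClearML or AWS could pass a shorter `required` tuple, mirroring the README's note that the respective variables can be left out.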
