No documentation on how to run this project on Windows #13

Open
kpto opened this issue Oct 23, 2024 · 4 comments
Labels: documentation (Improvements or additions to documentation), enhancement (New feature or request)

Comments

kpto (Collaborator) commented Oct 23, 2024

Since this project uses pyspark to read Parquet files, pyspark and its dependencies Spark and Hadoop are required, but the documentation currently lacks a guideline on how to run the script on Windows.

After some research I managed to run the script on Windows; I jot down the process here.

Dependencies

  1. pyspark
  2. Java
  3. Spark
  4. Hadoop

pyspark

Easily installed by poetry, not a problem.
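A quick way to confirm that it is available in the project environment (a sketch, assuming the project already declares pyspark and is managed with Poetry):

  poetry install
  poetry run python -c "import pyspark; print(pyspark.__version__)"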

Java

At least version 8 is required, but starting with version 23 you get the error below:

java.lang.UnsupportedOperationException: getSubject is supported only if a security manager is allowed

Also, since the minimum Java version required to run neo4j-admin is 17, version 17 seems to be a good choice. It can be downloaded here:
https://www.oracle.com/java/technologies/javase/jdk17-archive-downloads.html

Other versions between 17 and 23, or OpenJDK builds, may also work but have not been tested.
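To double-check which Java installation the shell picks up after installation, run the following (the exact output varies by vendor and build):

  java -version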

Spark

Download the binary from here (3.5.3 is the version I chose):
https://spark.apache.org/downloads.html

Extract the content to C:\spark.
Add a new environment variable with key SPARK_HOME and value C:\spark.
Append %SPARK_HOME%\bin to the Path variable.
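The same setup can also be scripted; a PowerShell sketch, assuming Spark was extracted to C:\spark as above (changes apply to the user scope and require a new terminal to take effect):

  # Persist SPARK_HOME for the current user
  [Environment]::SetEnvironmentVariable("SPARK_HOME", "C:\spark", "User")
  # Append the bin folder to the user Path (literal path, equivalent to %SPARK_HOME%\bin)
  $userPath = [Environment]::GetEnvironmentVariable("Path", "User")
  [Environment]::SetEnvironmentVariable("Path", "$userPath;C:\spark\bin", "User")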

Hadoop

Hadoop must be installed together with something called winutils. Since winutils is not an official product, you use it at your own risk. At the time of writing, winutils supports Hadoop up to 3.3.6; since winutils must match the version of Hadoop, 3.3.6 is the version I installed.

At the time of writing, 3.3.6 could be downloaded from here:
https://hadoop.apache.org/releases.html

Extract the content to C:\hadoop.
Add a new environment variable with key HADOOP_HOME and value C:\hadoop.
Append %HADOOP_HOME%\bin to the Path variable.

Download the matching winutils binaries; for 3.3.6 the location is:
https://github.com/cdarlint/winutils/tree/master/hadoop-3.3.6/bin

This online tool is handy for downloading a folder from a GitHub repo:
https://download-directory.github.io/

Extract the content to C:\hadoop\bin.

One final, very important step: copy hadoop.dll to C:\Windows\System32.
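Scripted, the Hadoop part could look roughly like this in PowerShell (a sketch, assuming Hadoop was extracted to C:\hadoop and the downloaded winutils files sit in a local folder named winutils-bin, which is a placeholder name; the System32 copy needs an elevated shell):

  # Persist HADOOP_HOME and extend the user Path
  [Environment]::SetEnvironmentVariable("HADOOP_HOME", "C:\hadoop", "User")
  $userPath = [Environment]::GetEnvironmentVariable("Path", "User")
  [Environment]::SetEnvironmentVariable("Path", "$userPath;C:\hadoop\bin", "User")
  # Put the downloaded winutils binaries into Hadoop's bin folder
  Copy-Item .\winutils-bin\* C:\hadoop\bin -Force
  # hadoop.dll must also be visible system-wide
  Copy-Item C:\hadoop\bin\hadoop.dll C:\Windows\System32 -Force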

Sources

The above comes mostly from these sites; it is unknown whether all of the steps are required or only some of them:
https://www.machinelearningplus.com/pyspark/install-pyspark-on-windows/
https://stackoverflow.com/questions/41851066/exception-in-thread-main-java-lang-unsatisfiedlinkerror-org-apache-hadoop-io
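As an end-to-end check of the setup above, a small smoke test that writes and reads a Parquet file through pyspark can be run from a fresh terminal (a sketch; C:/temp/pyspark_smoke is just a scratch path, and the expected output, amid Spark's log noise, is 5):

  poetry run python -c "from pyspark.sql import SparkSession; spark = SparkSession.builder.getOrCreate(); spark.range(5).write.mode('overwrite').parquet('C:/temp/pyspark_smoke'); print(spark.read.parquet('C:/temp/pyspark_smoke').count())"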

kpto (Collaborator, Author) commented Oct 23, 2024

Closing, as this comes from pyspark's dependency on Hadoop.

kpto closed this as completed Oct 23, 2024
kpto (Collaborator, Author) commented Oct 23, 2024

Reopening as a guideline for running on Windows.

kpto reopened this Oct 23, 2024
kpto changed the title from "Java exception from pyspark with the latest Java version 23" to "No documentation on how to run this project on Windows" Oct 23, 2024
slobentanzer (Collaborator) commented:

@kpto thanks, guidelines are good; I am wondering, however, if we shouldn't migrate this to the BioCypher docs, as it seems generally applicable and the docs are a more natural place for it.

slobentanzer added the documentation and enhancement labels Oct 25, 2024
kpto (Collaborator, Author) commented Oct 25, 2024

I don't think BioCypher is directly related to pyspark, and it would be awkward to try to find a place for it in BioCypher's docs.

Indeed, if possible I would even like to get rid of the Spark and Hadoop dependencies. The Parquet format itself is open, and it seems that pyspark uses pyarrow under the hood to read Parquet files, so I'd say it is possible; I opened an issue for this: #17.
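For what it's worth, reading a Parquet file without the Spark/Hadoop stack is a one-liner with pyarrow; a sketch, where nodes.parquet is a placeholder file name and pyarrow is assumed to be installed in the environment:

  poetry run python -c "import pyarrow.parquet as pq; t = pq.read_table('nodes.parquet'); print(t.num_rows, t.schema)"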

kpto removed the priority low label Oct 25, 2024
kpto added this to OTAR3088 Oct 25, 2024
slobentanzer added this to the Version v1 milestone Oct 30, 2024