No documentation on how to run this project on Windows #13

Open
kpto opened this issue Oct 23, 2024 · 4 comments
Labels: documentation (Improvements or additions to documentation), enhancement (New feature or request)

Comments

kpto (Collaborator) commented Oct 23, 2024

Since this project uses pyspark to read Parquet files, pyspark and its dependencies Spark and Hadoop are required, but the documentation currently lacks a guideline on how to run the script on Windows.

After some research I managed to run the script on Windows; I jot down the process here.

Dependencies

  1. pyspark
  2. Java
  3. Spark
  4. Hadoop

pyspark

Easily installed by poetry, not a problem.
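A quick way to confirm that it is available in the project environment (a sketch, assuming the project already declares pyspark and is managed with Poetry):

  poetry install
  poetry run python -c "import pyspark; print(pyspark.__version__)"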

Java

At least version 8 is required, but starting with version 23 you get the error below:

java.lang.UnsupportedOperationException: getSubject is supported only if a security manager is allowed

Also, since the minimum Java version required to run neo4j-admin is 17, version 17 seems to be a good choice. It can be downloaded here:
https://www.oracle.com/java/technologies/javase/jdk17-archive-downloads.html

Other versions between 17 and 23, or OpenJDK builds, may also work but have not been tested.
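To double-check which Java installation the shell picks up after installation, run the following (the exact output varies by vendor and build):

  java -version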

Spark

Download the binary from here (3.5.3 is the version I chose):
https://spark.apache.org/downloads.html

Extract the content to C:\spark.
Add a new environment variable with key SPARK_HOME and value C:\spark.
Append %SPARK_HOME%\bin to the Path variable.
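The same setup can also be scripted; a PowerShell sketch, assuming Spark was extracted to C:\spark as above (changes apply to the user scope and require a new terminal to take effect):

  # Persist SPARK_HOME for the current user
  [Environment]::SetEnvironmentVariable("SPARK_HOME", "C:\spark", "User")
  # Append the bin folder to the user Path (literal path, equivalent to %SPARK_HOME%\bin)
  $userPath = [Environment]::GetEnvironmentVariable("Path", "User")
  [Environment]::SetEnvironmentVariable("Path", "$userPath;C:\spark\bin", "User")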

Hadoop

Hadoop must be installed together with something called winutils. Since winutils is not an official product, you use it at your own risk. At the time of writing, winutils supports Hadoop up to 3.3.6; since winutils must match the version of Hadoop, 3.3.6 is the version I installed.

At the time of writing, 3.3.6 could be downloaded from here:
https://hadoop.apache.org/releases.html

Extract the content to C:\hadoop.
Add a new environment variable with key HADOOP_HOME and value C:\hadoop.
Append %HADOOP_HOME%\bin to the Path variable.

Download the matching winutils binaries; for 3.3.6 the location is:
https://github.com/cdarlint/winutils/tree/master/hadoop-3.3.6/bin

This online tool is handy for downloading a folder from a GitHub repo:
https://download-directory.github.io/

Extract the content to C:\hadoop\bin.

One final, very important step: copy hadoop.dll to C:\Windows\System32.
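Scripted, the Hadoop part could look roughly like this in PowerShell (a sketch, assuming Hadoop was extracted to C:\hadoop and the downloaded winutils files sit in a local folder named winutils-bin, which is a placeholder name; the System32 copy needs an elevated shell):

  # Persist HADOOP_HOME and extend the user Path
  [Environment]::SetEnvironmentVariable("HADOOP_HOME", "C:\hadoop", "User")
  $userPath = [Environment]::GetEnvironmentVariable("Path", "User")
  [Environment]::SetEnvironmentVariable("Path", "$userPath;C:\hadoop\bin", "User")
  # Put the downloaded winutils binaries into Hadoop's bin folder
  Copy-Item .\winutils-bin\* C:\hadoop\bin -Force
  # hadoop.dll must also be visible system-wide
  Copy-Item C:\hadoop\bin\hadoop.dll C:\Windows\System32 -Force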

Sources

The above comes mostly from these sites; it is unknown whether all of the steps are required or only some of them:
https://www.machinelearningplus.com/pyspark/install-pyspark-on-windows/
https://stackoverflow.com/questions/41851066/exception-in-thread-main-java-lang-unsatisfiedlinkerror-org-apache-hadoop-io
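As an end-to-end check of the setup above, a small smoke test that writes and reads a Parquet file through pyspark can be run from a fresh terminal (a sketch; C:/temp/pyspark_smoke is just a scratch path, and the expected output, amid Spark's log noise, is 5):

  poetry run python -c "from pyspark.sql import SparkSession; spark = SparkSession.builder.getOrCreate(); spark.range(5).write.mode('overwrite').parquet('C:/temp/pyspark_smoke'); print(spark.read.parquet('C:/temp/pyspark_smoke').count())"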

kpto (Collaborator, Author) commented Oct 23, 2024

Closing, as this comes from pyspark's dependency on Hadoop.

kpto closed this as completed Oct 23, 2024
kpto (Collaborator, Author) commented Oct 23, 2024

Reopening as a guideline for running on Windows.

kpto reopened this Oct 23, 2024
kpto changed the title from "Java exception from pyspark with the latest Java version 23" to "No documentation on how to run this project on Windows" Oct 23, 2024
slobentanzer (Collaborator) commented:

@kpto thanks, guidelines are good; I am wondering, however, if we shouldn't migrate this to the BioCypher docs, as it seems generally applicable and the docs are a more natural place for it.

slobentanzer added the documentation and enhancement labels Oct 25, 2024
kpto (Collaborator, Author) commented Oct 25, 2024

I don't think BioCypher is directly related to pyspark, and it would be awkward to try to find a place for it in BioCypher's docs.

Indeed, if possible I would even like to get rid of the Spark and Hadoop dependencies. The Parquet format itself is open, and it seems that pyspark uses pyarrow under the hood to read Parquet files, so I'd say it is possible; I opened an issue for this: #17.
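For what it's worth, reading a Parquet file without the Spark/Hadoop stack is a one-liner with pyarrow; a sketch, where nodes.parquet is a placeholder file name and pyarrow is assumed to be installed in the environment:

  poetry run python -c "import pyarrow.parquet as pq; t = pq.read_table('nodes.parquet'); print(t.num_rows, t.schema)"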

kpto removed the priority low label Oct 25, 2024
kpto added this to OTAR3088 Oct 25, 2024
slobentanzer added this to the Version v1 milestone Oct 30, 2024