Since this project uses pyspark to read parquet files, pyspark and its dependencies Spark and Hadoop are required, but the documentation currently lacks a guide on how to run the script on Windows.
After some research I managed to run the script on Windows; here I jot down the process.
Dependencies
pyspark
Java
Spark
Hadoop
pyspark
Easily installed by poetry, not a problem.
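As a quick, optional check that pyspark is importable from the Poetry environment (run it with poetry run python, for example); this is a generic snippet, not project code:

# Check that pyspark is importable and print its version
import pyspark
print(pyspark.__version__)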
Java
At least version 8 is required, but starting with version 23 you get the error below:
java.lang.UnsupportedOperationException: getSubject is supported only if a security manager is allowed
Since the minimum version for neo4j-admin to run is 17, version 17 seems to be a good choice; it can be downloaded here:
https://www.oracle.com/java/technologies/javase/jdk17-archive-downloads.html
Other versions between 17 and 23, or OpenJDK, may also work but were not tested.
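If several JDKs are installed side by side, one way to make sure pyspark picks up JDK 17 is to set JAVA_HOME before the JVM is started; the sketch below does this from Python, and the install path shown is only an example, not a required location:

# Point pyspark at a specific JDK before any SparkSession is created.
# The path below is an example; adjust it to wherever JDK 17 was installed.
import os

os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk-17"
os.environ["PATH"] = os.path.join(os.environ["JAVA_HOME"], "bin") + os.pathsep + os.environ["PATH"]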
Spark
Download the binary from here; 3.5.3 was the version I chose:
https://spark.apache.org/downloads.html
Extract the content to C:\spark.
Add a new environment variable with key SPARK_HOME and value C:\spark.
Append the path %SPARK_HOME%\bin to the Path variable.
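A small sanity check that the new variable is visible to Python (note that a terminal or IDE opened before setting it will not see it until restarted); this snippet is generic and assumes nothing about the project:

# Verify SPARK_HOME is set and points at a real Spark install
import os
from pathlib import Path

spark_home = os.environ.get("SPARK_HOME")
print("SPARK_HOME =", spark_home)
if spark_home:
    print("spark-submit.cmd found:", (Path(spark_home) / "bin" / "spark-submit.cmd").exists())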
Hadoop
Hadoop must be installed together with something called winutils. Since winutils is not an official product, you use it at your own risk. At the time of writing, winutils supports Hadoop up to 3.3.6, and since winutils must match the Hadoop version, 3.3.6 was the version I installed. At the time of writing, 3.3.6 could be downloaded from here:
https://hadoop.apache.org/releases.html
Extract the content to C:\hadoop.
Add a new environment variable with key HADOOP_HOME and value C:\hadoop.
Append the path %HADOOP_HOME%\bin to the Path variable.
Download the matching winutils binary; for 3.3.6 the location is https://github.com/cdarlint/winutils/tree/master/hadoop-3.3.6/bin
This online tool is handy for downloading a single folder from a GitHub repo: https://download-directory.github.io/
Extract the content to C:\hadoop\bin.
One final, very important step: copy hadoop.dll to C:\Windows\System32.
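To check that the whole chain works, a minimal sketch like the one below can help; writing a parquet file to the local filesystem is what exercises winutils.exe and hadoop.dll on Windows, so a failure here usually points at the Hadoop setup. The output path is a placeholder, and this is not the project's script:

# Minimal end-to-end check: start a local Spark session, write and read parquet.
# Writing to the local filesystem on Windows goes through winutils.exe/hadoop.dll.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("windows-setup-check").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.mode("overwrite").parquet(r"C:\temp\parquet_check")  # placeholder output path
spark.read.parquet(r"C:\temp\parquet_check").show()
spark.stop()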
Source
The above comes mostly from these sites; it is unclear whether all of the steps are required or only some of them.
https://www.machinelearningplus.com/pyspark/install-pyspark-on-windows/
https://stackoverflow.com/questions/41851066/exception-in-thread-main-java-lang-unsatisfiedlinkerror-org-apache-hadoop-io
@kpto thanks, guidelines are good; I am wondering, however, whether we shouldn't migrate this to the BioCypher docs, as it seems generally applicable and the docs are a more natural place for it.
I don't think BioCypher is directly related to pyspark, and it would be awkward to find a place for this in BioCypher's docs.
Indeed, if possible I would even like to get rid of the Spark and Hadoop dependencies. The parquet file itself is open source, and it seems that pyspark uses pyarrow under the hood to read parquet, so I'd say it is possible; I opened an issue for this, #17.
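For illustration, reading the same kind of file with pyarrow alone could look like the sketch below; this is hypothetical (the path is a placeholder and the project does not currently do this), but it shows why dropping the Spark/Hadoop requirement seems feasible:

# Hypothetical sketch: read a parquet file with pyarrow directly, no Spark/Hadoop needed.
import pyarrow.parquet as pq

table = pq.read_table(r"C:\data\example.parquet")  # placeholder path
print(table.schema)
print(table.num_rows)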