
spark-session cannot be changed (any more ?) #301

Open
behrica opened this issue Dec 6, 2020 · 5 comments

Comments

@behrica
Collaborator

behrica commented Dec 6, 2020

Following the minikube guide:
https://github.com/zero-one-group/geni/blame/develop/docs/kubernetes_basic.md

the verification step at line 118 fails.

It seems that I cannot change the spark session by calling
g/create-spark-session.

I am pretty sure that it worked at some point.

@behrica
Collaborator Author

behrica commented Dec 6, 2020

I just saw that this is not a Delay any more.

So the SparkSession gets initialized as soon as we require the namespace,
and I suppose it cannot be changed afterwards.

So it cannot be re-configured.
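For illustration, a minimal sketch of the Delay-based approach being discussed. The `create-spark-session` function here is just a stand-in for the real constructor, and the config map is made up; the point is only that with a `delay`, nothing is created at namespace load time, so there is still a window to re-configure before first use:

```clojure
(defn create-spark-session
  "Stand-in for the real session constructor; here it just returns its config."
  [conf]
  (println "session created")
  conf)

;; With a delay, requiring this namespace does NOT create the session.
(defonce default-session
  (delay (create-spark-session {:app-name "geni-app" :master "local[*]"})))

;; The session only comes into existence on the first deref:
@default-session ;; prints "session created", returns the config map
```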

@behrica behrica changed the title spark-session cannot be changed (any more) spark-session cannot be changed (any more ?) Dec 6, 2020
@anthony-khong
Member

I see... I think we can change it back to a delay. Would you like to make a PR for that?

@behrica
Collaborator Author

behrica commented Dec 6, 2020

Maybe there is a better way.

Maybe the default "configuration map" for the session https://github.com/behrica/geni/blob/482c4b934f037d32b849916211b509c94d89800e/src/clojure/zero_one/geni/defaults.clj#L5

could become an atom, which can be changed if needed before
requiring the default namespace 'zero-one.geni.core'.

I think that the current feature of potentially changing the session itself is not super useful,
because Spark does not really support this cleanly, correct?
If I have read it right, the Spark session is meant to be instantiated once in the lifetime of a JVM.
I can try this out to see if it works.
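A rough sketch of the atom idea, with a stand-in constructor (the real one lives in Geni): the default config map sits in an atom, and the session is built from the atom's current value only on first deref, so users can `swap!` the config before anything forces the delay:

```clojure
;; Hypothetical sketch: mutable default config + lazily created session.
(defonce session-config
  (atom {:app-name "geni-app" :master "local[*]"}))

(defn create-spark-session
  "Stand-in for the real session constructor."
  [conf]
  conf)

(defonce default-session
  (delay (create-spark-session @session-config)))

;; Before anything derefs default-session, users can still re-configure:
(swap! session-config assoc :master "local[2]")

@default-session ;; => {:app-name "geni-app", :master "local[2]"}
```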

@behrica
Collaborator Author

behrica commented Dec 6, 2020

I think it could work this way.

The issue would be keeping the fully automatic session configuration of the geni CLI.
My opinion is that the current geni CLI session initialization, which:

  • initialises a Spark session object with defaults
  • and can somehow be overridden with some "tricks" (delays, futures)

is brittle, as it will not always work and depends on the order of requiring namespaces / calling functions.

I think we have three options for this:

  1. Do not make it fully automatic, but provide a method which needs to be called (init-default-spark or similar)
    -> this could then allow changing config settings

  2. Allow changing the Spark session configuration from outside the REPL by either:

  • reading a config file
  • taking config options on geni.sh

  3. Do not allow any custom session config for the geni CLI at all, and see it as a "demo".

I still like the overall idea of the geni CLI as a quick, user-friendly entry point, but it needs to allow arbitrary session configs.
The other Spark shells can be fully configured from the command line (and do not allow changing the session from inside either).
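A sketch of what option 1 could look like. Both the function name `init-default-spark!` and the config keys are hypothetical, and the real session constructor is stubbed out; the idea is simply an explicit, once-only initialisation that accepts overrides:

```clojure
;; Hypothetical sketch of option 1: no automatic initialisation; the user
;; calls an init function (optionally with config overrides) exactly once.
(defonce ^:private session-state (atom nil))

(defn init-default-spark!
  "Creates the default session from `overrides` merged onto the defaults.
  Throws if a session has already been initialised."
  [overrides]
  (let [defaults {:app-name "geni-app" :master "local[*]"}]
    (when @session-state
      (throw (ex-info "Spark session already initialised" {})))
    ;; the real implementation would call the session constructor here
    (reset! session-state (merge defaults overrides))))

(init-default-spark! {:master "local[2]"})
@session-state ;; => {:app-name "geni-app", :master "local[2]"}
```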

@erp12
Collaborator

erp12 commented Dec 7, 2020

Here is a link to our previous discussions for reference.

> I think that the current feature of potentially changing the session itself is not super useful,
> because Spark does not really support this cleanly, correct?
> If I have read it right, the Spark session is meant to be instantiated once in the lifetime of a JVM.

You are correct. Typically, a user's Spark session settings would be set during the call to spark-submit. The default session settings in Geni are only applied if no call to spark-submit is made (i.e. when running locally).

Most Spark usage (across all languages) happens by launching a Spark "application" (for example, a Geni REPL) on an existing Spark cluster. The Spark application is not expected to create its own cluster, and thus the session config is supplied when the .jar and main class are specified.

I'm not too familiar with Kubernetes, so I am having trouble following the guide. It looks like the Geni CLI is being started outside of spark-submit. I think the more traditional pattern would be to call spark-submit in the container for the cluster's driver and pass an uberjar of Geni and --class zero-one.geni.main along with any other spark session config you want.

I have had success with starting Geni REPLs on flintrock clusters using spark-submit.
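To make the pattern concrete, a hedged example of what such a spark-submit invocation might look like. The master URL, image name, and jar filename are placeholders (and the Kubernetes-specific `--conf` keys assume a k8s deployment); `--class` and `--conf` are standard spark-submit flags:

```shell
# Hypothetical invocation: all session configuration comes from spark-submit,
# not from inside the REPL. Placeholder values throughout.
spark-submit \
  --master k8s://https://kubernetes.default.svc:443 \
  --class zero-one.geni.main \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=my-geni-image:latest \
  geni-uberjar.jar
```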
