ArityException with cookbook sample 4 #315
Comments
Huh... That's very odd... Thank you for flagging this up! I've added this to my TODO list. I think I'll have time to look into it next week. I hope that's okay!
Hi Anthony,
thank you for your quick reply! I figured out that I ran into this because of a small glitch in Cookbook Chapter 4, where the initial CSV is read without {:kebab-columns true}. Just adding that option "fixed" it and is also consistent with the chapter's sample output (see the sketch below). Still, it is interesting that it does not work without the kebab-case mangling of special characters; I believe it should.
I just started with Spark and Geni the other day, and I can already say it is one of the best APIs I have used in years, congrats!
Such a lot of fun! I will for sure continue to use it. It also seems to be quite fast, even on my local machine.
I hope you find the cause of this quickly. Have a good day!
Ciao
...Jochen
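For reference, a minimal sketch of the fix (the path is the cookbook's weather CSV used below, and {:kebab-columns true} is the read option mentioned above):
; with kebab-case mangling, "Precip. Amount (mm)" becomes precip-amount-mm,
; which g/select accepts without backticks
(-> (g/read-csv! "data/cookbook/weather/weather-2012-3.csv" {:kebab-columns true})
    (g/select "precip-amount-mm"))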
Hi Jochen, thank you for the kind words!! Glad you’re enjoying it. Please let me know if you have any feature requests. On this issue, I think this is related: https://mungingdata.com/pyspark/avoid-dots-periods-column-names/ Basically, Spark doesn’t like column names with dots. One thing we could do is auto-escape them, but I’m not sure that is the best solution, because you lose the correspondence with Spark. I’m leaning towards a ‘safe mode’, on by default in the read functions, that scans the column names for dots and replaces them with underscores 🤔
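A minimal sketch of that safe-mode idea, assuming plain string column names (dots->underscores is a hypothetical helper, not part of Geni's API; interop/->scala-seq and .toDF are used exactly as in the REPL code further down):
(require '[clojure.string :as string]
         '[zero-one.geni.interop :as interop])
(defn dots->underscores
  "Hypothetical safe-mode pass: replaces dots in column names with underscores."
  [dataset]
  (.toDF dataset
         (interop/->scala-seq
          (map #(string/replace % "." "_") (.columns dataset)))))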
Hi Anthony...
thank you for digging into this! Yes, I think you are right; it does look like the issue described there.
My feeling is to simply document this in the Geni docs and recommend kebab-case in these cases. It fixes the Cookbook example very well :-).
Ciao
...Jochen
Hi Anthony...
just a quick follow-up on our recent discussion regarding the dot/backticks column-name issue.
I read up a little on this in the Spark docs: the Spark SQL documentation turns out to be really restrictive, allowing just letters (a-zA-Z), digits, and underscores in identifiers.
https://spark.apache.org/docs/latest/sql-ref-identifier.html
So your :kebab-columns handles this perfectly for Clojure, I think. An issue could remain for people exporting parquet files for use in other languages, which tend to prefer underscores.
Alongside :kebab-columns true, one could offer something like :snake-columns true.
Using {:convert-column-names :kebab | :snake} might be a bit more elegant, but well, that would change the API :-).
I appended my REPL code (reusing your data-sources code) in case you would like to play with it.
Have a good day Anthony!
Ciao
...Jochen
(require '[zero-one.geni.core :as g]
         '[zero-one.geni.core.data-sources :as ds]
         '[zero-one.geni.interop :as interop]
         'camel-snake-kebab.core)
(defn ->normalized-columns
  "Returns a new Dataset with all columns renamed using the passed rename-fn."
  [dataset rename-fn]
  (let [remove-punctuations #'ds/remove-punctuations ; access private vars
        deaccent #'ds/deaccent
        new-columns (->> dataset
                         .columns
                         (map remove-punctuations)
                         (map deaccent)
                         (map rename-fn))]
    (.toDF dataset (interop/->scala-seq new-columns))))
(comment
  ; plain: the dotted name must be backtick-quoted
  (-> (g/read-csv! "data/cookbook/weather/weather-2012-3.csv")
      (g/select "`Precip. Amount (mm)`"))
  ; kebab-columns
  (-> (g/read-csv! "data/cookbook/weather/weather-2012-3.csv")
      (->normalized-columns camel-snake-kebab.core/->kebab-case)
      (g/select "precip-amount-mm"))
  ; snake-columns
  (-> (g/read-csv! "data/cookbook/weather/weather-2012-3.csv")
      (->normalized-columns camel-snake-kebab.core/->snake_case)
      (g/select "precip_amount_mm")))
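And a hypothetical sketch of the {:convert-column-names ...} variant suggested above, built on the ->normalized-columns helper just defined (read-csv-normalized! is made up for illustration; it is not a Geni function):
(defn read-csv-normalized!
  "Hypothetical reader implementing the suggested :convert-column-names option."
  [path options]
  (let [rename-fn (case (:convert-column-names options)
                    :kebab camel-snake-kebab.core/->kebab-case
                    :snake camel-snake-kebab.core/->snake_case
                    identity)]
    (-> (g/read-csv! path (dissoc options :convert-column-names))
        (->normalized-columns rename-fn))))
; usage:
(-> (read-csv-normalized! "data/cookbook/weather/weather-2012-3.csv"
                          {:convert-column-names :snake})
    (g/select "precip_amount_mm"))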
Info
Problem / Steps to reproduce
Standard lein new geni …, bitnami/spark 3.0.2 Docker image, then used the code from Geni Cookbook Chapter 4.
The following code from Cookbook Chapter 4 fails with an ArityException:
The exception is:
Also, a manual select with the column named "Precip. Amount (mm)" does not work. It seems that the names get backticks around them internally.
I tried to rename all the columns with
but the problem persists; the backticks are still there.
Crashes:
(g/select raw-weather-mar-2012 "Precip. Amount (mm)")
Works:
(g/select raw-weather-mar-2012 "`Precip. Amount (mm)`")
(g/column-names (g/select raw-weather-mar-2012 "`Precip. Amount (mm)`"))
yields "Precip. Amount (mm)" without backticks.
This led me to believe that there is some issue in Geni or Spark with these column names.
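For anyone hitting this in the meantime, a minimal workaround sketch based on the "Works:" example above (escape-column is a hypothetical helper; the character class follows the Spark SQL identifier rule linked earlier in the thread):
(defn escape-column
  "Hypothetical helper: backtick-quotes any column name that is not a plain Spark SQL identifier."
  [col-name]
  (if (re-find #"[^a-zA-Z0-9_]" col-name)
    (str "`" col-name "`")
    col-name))
(g/select raw-weather-mar-2012 (escape-column "Precip. Amount (mm)"))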