Many SQL functions and Column methods are overloaded to take either a keyword, a string or a Column instance as an argument. For such cases, Geni implements Column coercion, where:
- Column instances are left as they are;
- strings and keywords are interpreted as column names; and
- other values are interpreted as literal Columns.
Because of this, basic arithmetic operations do not require lit wrapping:
; The following two expressions are equivalent
(g/- (g// (g/sin Math/PI) (g/cos Math/PI)) (g/tan Math/PI))
(g/- (g// (g/sin (g/lit Math/PI)) (g/cos (g/lit Math/PI))) (g/tan (g/lit Math/PI)))
However, string literals do require lit wrapping:
; The following fails, because "Nelson" is interpreted as a column name rather than a string literal
(-> dataframe (g/filter (g/=== "SellerG" "Nelson")))
; The following works, as it checks the column "SellerG" against "Nelson" as a literal
(-> dataframe (g/filter (g/=== "SellerG" (g/lit "Nelson"))))
It may be useful to think of a Spark Dataset as a seq of maps, so that keywords can be idiomatically used to refer to columns (i.e. keys). For that reason, the predicate column above may be more idiomatically written as:
(g/=== :SellerG (g/lit "Nelson"))
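The single-argument coercion rules can be modelled in plain Clojure, without Spark. The following is only a minimal sketch of the rules as described here, not Geni's actual implementation: ColumnModel, lit-model and coerce-column-model are hypothetical names, and the record is a stand-in for a real Spark Column.

```clojure
;; Hypothetical stand-in for a Spark Column: either a named column (:col)
;; or a literal value (:lit). This is an illustration, not Geni's code.
(defrecord ColumnModel [kind value])

;; Hypothetical counterpart of g/lit: always produces a literal.
(defn lit-model [x]
  (->ColumnModel :lit x))

;; Models the three coercion rules in order.
(defn coerce-column-model [x]
  (cond
    (instance? ColumnModel x) x                          ; Columns are left as they are
    (keyword? x)              (->ColumnModel :col (name x)) ; keywords name a column
    (string? x)               (->ColumnModel :col x)        ; strings name a column
    :else                     (->ColumnModel :lit x)))      ; anything else is a literal

(:kind (coerce-column-model :SellerG))             ; => :col
(:kind (coerce-column-model "Nelson"))             ; => :col
(:kind (coerce-column-model (lit-model "Nelson"))) ; => :lit
(:kind (coerce-column-model Math/PI))              ; => :lit
```

In this model, the bare string "Nelson" ends up as a column name, which is why the string comparison above needs explicit lit wrapping, whereas a number such as Math/PI falls through to the literal case.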
Geni implements Column-array coercion for variadic SQL functions and Column methods, such as select and group-by. The coercion rules are as follows:
- maps have their values flattened, coerced into Columns and aliased as the keys;
- other collections have their values flattened and coerced into Columns; and
- otherwise, the argument is directly coerced into a Column.
A function like select can take all of these different types in a single invocation:
(-> dataframe
    (g/select :SellerG
              "Address"
              (g/col "Postcode")
              {:log-price (g/log :Price) :rooms :Rooms}
              [:Date :Method]
              #{:Lattitude :Longtitude})
    g/columns)
=> (:SellerG :Address :Postcode :log-price :rooms :Date :Method :Lattitude :Longtitude)
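The Column-array coercion rules can be sketched in plain Clojure in the same spirit. This is a simplified model under stated assumptions, not Geni's implementation: ColumnModel, ->column-model, ->column-array-model and output-names are hypothetical names, and map values are assumed to be single column-like values rather than nested collections.

```clojure
;; Hypothetical stand-in for a Spark Column with an optional alias.
(defrecord ColumnModel [kind value as-name])

;; Single-value coercion: Columns pass through, keywords and strings
;; name a column, everything else becomes a literal.
(defn ->column-model [x]
  (cond
    (instance? ColumnModel x) x
    (keyword? x)              (->ColumnModel :col (name x) nil)
    (string? x)               (->ColumnModel :col x nil)
    :else                     (->ColumnModel :lit x nil)))

;; Variadic coercion: flattens the arguments into one seq of ColumnModels.
(defn ->column-array-model [args]
  (mapcat (fn [arg]
            (cond
              ;; Records also satisfy map?, so Columns are checked first.
              (instance? ColumnModel arg) [arg]
              ;; Maps: coerce each value and alias it as its key.
              (map? arg) (for [[k v] arg]
                           (assoc (->column-model v) :as-name (name k)))
              ;; Other collections: flatten recursively.
              (coll? arg) (->column-array-model arg)
              ;; Anything else: coerce directly.
              :else [(->column-model arg)]))
          args))

;; Resulting column names: the alias if present, else the underlying value.
(defn output-names [columns]
  (map (fn [c] (keyword (str (or (:as-name c) (:value c))))) columns))

(output-names
 (->column-array-model [:SellerG "Address" {:rooms :Rooms} [:Date :Method]]))
;; => (:SellerG :Address :rooms :Date :Method)
```

The Column check comes before the map check because Clojure records are themselves maps; the order of the remaining branches mirrors the three rules listed above.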
The predicate Columns passed to filter and remove are implicitly cast to booleans. This means that the Columns can be left as, say, integers:
(-> dataframe
    (g/remove (g/mod :Rooms 2))
    (g/select :Rooms)
    g/distinct
    (g/collect-col :Rooms))
=> (4 6 2 10 8)
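The effect of the implicit cast can be imitated on an ordinary Clojure seq. The sketch below assumes the usual numeric-to-boolean cast, where zero becomes false and any other number becomes true; cast-to-boolean-model and the rooms vector are hypothetical and only illustrate the idea.

```clojure
;; Hypothetical model of an integer-to-boolean cast:
;; zero is false, any other number is true.
(defn cast-to-boolean-model [x]
  (cond
    (boolean? x) x
    (number? x)  (not (zero? x))
    :else        (boolean x)))

;; Made-up room counts, for illustration only.
(def rooms [1 2 3 4 6 8 10])

;; (mod n 2) is 1 for odd counts, which casts to true, so odd counts are removed.
(remove #(cast-to-boolean-model (mod % 2)) rooms) ; => (2 4 6 8 10)

;; The same predicate under filter keeps the odd counts instead.
(filter #(cast-to-boolean-model (mod % 2)) rooms) ; => (1 3)
```

This matches the result above, where only even room counts remain after the call to remove.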