Releases: zero-one-group/geni
v0.0.31 - Spark Doc Scraper
- Spark Doc Scraper: `scripts/scrape-spark-docs.clj` is able to scrape the relevant docs for the four modules.
- Partial Docstrings: docstrings are available for the `core.column` and `ml.regression` namespaces.
v0.0.30 - Some Basic Support for Spark Streaming
- Basic Spark Streaming functionalities: added some low-hanging fruit in terms of `JavaDStream` and `JavaStreamingContext` methods.
- More robust Spark Streaming testing function: now expects an `:expected` key and automatically retries to make the test less flaky.
v0.0.29 - Start of Spark Streaming Support
- DStream Testing Function: a more reliable and repeatable way to test Spark Streaming's StreamingContext and DStream methods.
- Automated Version Bump: done with Babashka.
- Updated Contributing Guide: thanks to @erp12 for pointing out certain gotchas on the guide.
v0.0.27 - Excel Support and Version Bumps
- Excel Support: basic functions `read-xlsx!` and `write-xlsx!` are now available, backed by `zero.one/fxl`.
- Version Bumps: Spark and nrepl bumped to their latest versions.
- Install CI steps: Dockerless installs are now tested on Ubuntu and macOS.
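The Excel functions above can be sketched as a small read-aggregate-write pipeline. This is illustrative only: the file paths and column names are made up, and the `group-by`/`agg`/`sum` helpers are assumed to mirror geni's Spark SQL wrappers rather than confirmed against this release.

```clojure
(require '[zero-one.geni.core :as g])

;; Hypothetical input spreadsheet.
(def sales (g/read-xlsx! "data/sales.xlsx"))

;; Summarise and write back out as a new spreadsheet
;; (assumes the sheet has `Region` and `Amount` columns).
(-> sales
    (g/group-by :Region)
    (g/agg (g/sum :Amount))
    (g/write-xlsx! "data/sales-summary.xlsx"))
```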
v0.0.26 - Better RDDs, EDN Support and Data-Oriented Schemas
- Schema option for read functions: all read functions now support a `:schema` option, which can be an actual Spark schema or its data-oriented version.
- Basic support for EDN: `read-edn!` and `write-edn!` are now available with an added dependency on `metosin/jsonista`. The functions may not be performant, but can come in handy for small-data compositions.
- More RDD functions: this closes the RDD function gaps with sparkplug and adds variadicity to functions that take more than one RDD.
- RDD name unmangling: this follows the sparkplug model of unmangling RDD names after each transformation.
- Version bump for dependencies: `nrepl` bumped to 0.8.1.
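A minimal sketch of the `:schema` option and the EDN round-trip described above. The map shorthand for the data-oriented schema, the file paths, and the column names are assumptions for illustration, not a verified API:

```clojure
(require '[zero-one.geni.core :as g])

;; `:schema` may be an actual Spark StructType or its data-oriented
;; version; the keyword-map shorthand below is an assumption.
(def df
  (g/read-csv! "data/transactions.csv"
               {:schema {:id :int :amount :double :memo :string}}))

;; EDN round-trip; fine for small data, not tuned for performance.
(g/write-edn! df "data/transactions.edn")
(def df-again (g/read-edn! "data/transactions.edn"))
```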
v0.0.25 - RDD Serialisation Model and More Methods
- RDD Function Serialisation Model: changed from the sparkling model to the sparkplug model. Slack user @g on clojurians/geni mentioned that the sparkplug model results in fewer serialisation problems than the sparkling one.
- More RDD Methods: added methods related to partitioners and JavaSparkContextMethods.
- Community Guidelines: added a code of conduct and an issue template.
- Design Goals Docs: first draft of the design goal outlining some of the main focuses of the project.
v0.0.24 - Basic RDD and PairRDD Support
- RDD and PairRDD Support: basic actions and transformations are supported, but passing serialisable functions to RDD's higher-order functions requires AOT compilation. Therefore, the RDD REPL experience is rather poor.
- Isolated Docker Runs: all Docker operations in the `Makefile` now run in a temporary directory, so that there are no race conditions in writing to the `target` directory. This means that `make ci --jobs 3` is now possible on a single machine.
v0.0.23 - Basic RDD Support + Spark ML Cookbook
Preliminary RDD support, with only certain transformations completed, plus two new parts of the Spark ML cookbook.
- Basic RDD support: mainly basic transformations such as `map`, `reduce`, `map-to-pair` and `reduce-by-key`. The main challenge has been the serialisation of functions, which is mostly taken from Sparkling and sparkplug.
- Spark ML cookbook: added two chapters on Spark ML pipelines and ported a customer-segmentation blog post with non-negative matrix factorisation.
- Better Geni CLI: new `--submit` command-line argument to emulate `spark-submit`.
- Better CI steps: automated Geni CLI tests to avoid manual testing of the Geni REPL.
- Completed benchmark results: added results from dplyr, data.table, tablecloth and tech.ml.dataset.
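The transformations named above compose into the usual word-count example. This is a sketch only: the `zero-one.geni.rdd` namespace alias and the `text-file` and `tuple` helpers are assumptions modelled on Sparkling and sparkplug, and as noted in these release notes, passing plain Clojure fns to RDD higher-order functions may require AOT compilation.

```clojure
(require '[clojure.string :as string]
         '[zero-one.geni.rdd :as rdd]) ;; namespace name is an assumption

;; Word count over a hypothetical text file.
(-> (rdd/text-file "data/lines.txt")
    (rdd/flat-map #(string/split % #"\s+"))
    (rdd/map-to-pair (fn [word] (rdd/tuple word 1))) ;; `tuple` assumed
    (rdd/reduce-by-key +)
    (rdd/collect))
```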
v0.0.22 - Basic Geni CLI + Namespace Alignments
Better getting-started experience with the new `geni` command and better alignment of Geni namespaces with Spark packages.
- New `geni` script with install instructions and a new asciinema screencast. This will be the main way to use Geni for small, one-off analyses and throwaway scripts.
- Created another layer of namespaces with `zero-one.geni.core` and `zero-one.geni.ml`. The idea is that the `core` namespaces should refer only to Spark SQL and the `ml` namespaces only to Spark ML. This will help the mapping of Geni functions to the original Spark functions.
- Added a simple benchmark piece that compares the performance of Pandas vs. Geni on a particular problem.
- An asciinema screencast for downloading the uberjar and interacting with the Geni REPL.
v0.0.21 - First Alpha Release
Initial alpha release documented here on cljdoc.
The release includes an uberjar that should provide a Geni REPL (i.e. a Clojure `spark-shell`) within seconds. Download the uberjar, and simply try out the REPL with `java -jar geni-repl-uberjar-0.0.21.jar`! An nREPL server is automatically started with an `.nrepl-port` file, so that common Clojure text editors should be able to jack in automatically.
The initial namespace automatically requires:

```clojure
(require '[zero-one.geni.core :as g]
         '[zero-one.geni.ml :as ml])
```

so that functions such as `g/read-csv!` and `ml/logistic-regression` are immediately available.
The Spark session is available as a Clojure future, which can be dereferenced with `@spark`. To see the full default Spark config, invoke `(g/spark-conf @spark)`!
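Putting the pieces above together, a first REPL session might look like the following. The CSV path is made up, and `g/print-schema` is an assumed helper name; only `g/read-csv!`, `@spark`, and `g/spark-conf` come from the release notes themselves.

```clojure
;; Inside the Geni REPL, `g` and `ml` are already required.
(def df (g/read-csv! "data/houses.csv")) ;; hypothetical file

(g/print-schema df) ;; inspect the inferred schema (helper name assumed)

;; The Spark session is a future, hence the deref.
(g/spark-conf @spark)
```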