Releases: zero-one-group/geni
v0.0.31 - Spark Doc Scraper
- Spark Doc Scraper: `scripts/scrape-spark-docs.clj` is able to scrape the relevant docs for the four modules.
- Partial Docstrings: docstrings are available for the `core.column` and `ml.regression` namespaces.
v0.0.30 - Some Basic Support for Spark Streaming
- Basic Spark Streaming functionalities: added some low-hanging fruit in terms of `JavaDStream` and `JavaStreamingContext` methods.
- More robust Spark Streaming testing function: now expects an `:expected` key and automatically retries to make the test less flaky.
v0.0.29 - Start of Spark Streaming Support
- DStream Testing Function: a more reliable and repeatable way to test Spark Streaming's StreamingContext and DStream methods.
- Automated Version Bump: done with Babashka.
- Updated Contributing Guide: thanks to @erp12 for pointing out certain gotchas on the guide.
v0.0.27 - Excel Support and Version Bumps
- Excel Support: basic functions `read-xlsx!` and `write-xlsx!` are now available, backed by `zero.one/fxl`.
- Version Bumps: Spark and nrepl bumped to their latest versions.
- Install CI steps: Dockerless installs are now tested on Ubuntu and macOS.
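The Excel functions above can be sketched as a small read-aggregate-write pipeline. This is illustrative only: the file paths and column names are made up, and the `group-by`/`agg`/`sum` helpers are assumed to mirror geni's Spark SQL wrappers rather than confirmed against this release.

```clojure
(require '[zero-one.geni.core :as g])

;; Hypothetical input spreadsheet.
(def sales (g/read-xlsx! "data/sales.xlsx"))

;; Summarise and write back out as a new spreadsheet
;; (assumes the sheet has `Region` and `Amount` columns).
(-> sales
    (g/group-by :Region)
    (g/agg (g/sum :Amount))
    (g/write-xlsx! "data/sales-summary.xlsx"))
```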
v0.0.26 - Better RDDs, EDN Support and Data-Oriented Schemas
- Schema option for read functions: all read functions now support a `:schema` option, which can be an actual Spark schema or its data-oriented version.
- Basic support for EDN: `read-edn!` and `write-edn!` are now available with an added dependency on `metosin/jsonista`. The functions may not be performant, but can come in handy for small-data compositions.
- More RDD functions: this closes the RDD function gaps with sparkplug and adds variadicity to functions that take more than one RDD.
- RDD name unmangling: this follows the sparkplug model of unmangling RDD names after each transformation.
- Version bump for dependencies: `nrepl` bumped to 0.8.1.
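A minimal sketch of the `:schema` option and the EDN round-trip described above. The map shorthand for the data-oriented schema, the file paths, and the column names are assumptions for illustration, not a verified API:

```clojure
(require '[zero-one.geni.core :as g])

;; `:schema` may be an actual Spark StructType or its data-oriented
;; version; the keyword-map shorthand below is an assumption.
(def df
  (g/read-csv! "data/transactions.csv"
               {:schema {:id :int :amount :double :memo :string}}))

;; EDN round-trip; fine for small data, not tuned for performance.
(g/write-edn! df "data/transactions.edn")
(def df-again (g/read-edn! "data/transactions.edn"))
```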
v0.0.25 - RDD Serialisation Model and More Methods
- RDD Function Serialisation Model: changed from the sparkling model to the sparkplug model. Slack user @g on clojurians/geni mentioned that the sparkplug model results in fewer serialisation problems than the sparkling one.
- More RDD Methods: added methods related to partitioners and JavaSparkContextMethods.
- Community Guidelines: added a code of conduct and an issue template.
- Design Goals Docs: first draft of the design goal outlining some of the main focuses of the project.
v0.0.24 - Basic RDD and PairRDD Support
- RDD and PairRDD Support: basic actions and transformations are supported, but passing serialisable functions to RDD's higher-order functions requires AOT compilation. Therefore, the RDD REPL experience is rather poor.
- Isolated Docker Runs: all Docker operations in the `Makefile` now run in a temporary directory, so that there are no race conditions in writing to the `target` directory. This means that `make ci --jobs 3` is now possible on a single machine.
v0.0.23 - Basic RDD Support + Spark ML Cookbook
Preliminary RDD support, with only certain transformations completed, plus two new parts of the Spark ML cookbook.
- Basic RDD support: mainly basic transformations such as `map`, `reduce`, `map-to-pair` and `reduce-by-key`. The main challenge has been the serialisation of functions, which is mostly taken from Sparkling and sparkplug.
- Spark ML cookbook: added two chapters on Spark ML pipelines and ported a customer-segmentation blog post with non-negative matrix factorisation.
- Better Geni CLI: new `--submit` command-line argument to emulate `spark-submit`.
- Better CI steps: automated Geni CLI tests to avoid manual testing of the Geni REPL.
- Completed benchmark results: added results from dplyr, data.table, tablecloth and tech.ml.dataset.
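The transformations named above compose into the usual word-count example. This is a sketch only: the `zero-one.geni.rdd` namespace alias and the `text-file` and `tuple` helpers are assumptions modelled on Sparkling and sparkplug, and as noted in these release notes, passing plain Clojure fns to RDD higher-order functions may require AOT compilation.

```clojure
(require '[clojure.string :as string]
         '[zero-one.geni.rdd :as rdd]) ;; namespace name is an assumption

;; Word count over a hypothetical text file.
(-> (rdd/text-file "data/lines.txt")
    (rdd/flat-map #(string/split % #"\s+"))
    (rdd/map-to-pair (fn [word] (rdd/tuple word 1))) ;; `tuple` assumed
    (rdd/reduce-by-key +)
    (rdd/collect))
```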
v0.0.22 - Basic Geni CLI + Namespace Alignments
Better getting-started experience with the new `geni` command and better alignment of Geni namespaces with Spark packages.
- New `geni` script with install instructions and a new asciinema screencast. This will be the main way to use Geni for small, one-off analyses and throwaway scripts.
- Created another layer of namespaces with `zero-one.geni.core` and `zero-one.geni.ml`. The idea is that the `core` namespaces should refer only to Spark SQL and the `ml` namespaces only to Spark ML. This will help the mapping of Geni functions to the original Spark functions.
- Added a simple benchmark piece that compares the performance of Pandas vs. Geni on a particular problem.
- An asciinema screencast for downloading the uberjar and interacting with the Geni REPL.
v0.0.21 - First Alpha Release
Initial alpha release documented here on cljdoc.
The release includes an uberjar that should provide a Geni REPL (i.e. a Clojure `spark-shell`) within seconds. Download the uberjar, and simply try out the REPL with `java -jar geni-repl-uberjar-0.0.21.jar`! An nREPL server is automatically started with an `.nrepl-port` file, so that common Clojure text editors should be able to jack in automatically.
The initial namespace automatically requires:

```clojure
(require '[zero-one.geni.core :as g]
         '[zero-one.geni.ml :as ml])
```

so that functions such as `g/read-csv!` and `ml/logistic-regression` are immediately available.
The Spark session is available as a Clojure future, which can be dereferenced with `@spark`. To see the full default Spark config, invoke `(g/spark-conf @spark)`!
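Putting the pieces above together, a first REPL session might look like the following. The CSV path is made up, and `g/print-schema` is an assumed helper name; only `g/read-csv!`, `@spark`, and `g/spark-conf` come from the release notes themselves.

```clojure
;; Inside the Geni REPL, `g` and `ml` are already required.
(def df (g/read-csv! "data/houses.csv")) ;; hypothetical file

(g/print-schema df) ;; inspect the inferred schema (helper name assumed)

;; The Spark session is a future, hence the deref.
(g/spark-conf @spark)
```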