liangjh edited this page Sep 19, 2012 · 10 revisions

Casper Datasets

A concise dataset framework in Java

Contents

  • About
  • Features
  • Tutorials & Instructions
  • Performance

About

The Casper Datasets library is, in short, an in-memory dataset technology. It lets a programmer define a typed, n-dimensional dataset and run simple sorting, filtering, and aggregation operations on it.

Within a typical application stack, a data access or business object tier requires developers to write several context-specific pieces of code:

  • Objects that represent units of data (typically JavaBeans in Java)
  • A means to map rows from the data store to their object counterparts in memory
  • Custom sorters, comparators, and filters to manipulate these objects

In addition, this work has to be repeated for each object abstraction within an application.

The Casper Datasets library aims to unify these requirements into a single library that is both simple and intuitive.

Features

Casper Datasets exposes a generic data abstraction and searching framework that allows a user to define any mapping via a metadata object (essentially, a mapping of column names to types, akin to java.sql.ResultSetMetaData). This dataset definition is then used to manipulate, search, and sort the objects held in memory.
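
To make the idea of a metadata-driven dataset concrete, here is a minimal sketch in plain Java. It is not the Casper Datasets API: the class names SketchMetaData and SketchDataSet and all of their methods are hypothetical, chosen only to illustrate how a column-name-to-type mapping can validate and type the rows held in memory.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical dataset definition: ordered column names mapped to Java types. */
final class SketchMetaData {
    private final Map<String, Class<?>> columns = new LinkedHashMap<>();

    SketchMetaData column(String name, Class<?> type) {
        columns.put(name, type);
        return this;
    }

    Class<?> typeOf(String name) {
        Class<?> type = columns.get(name);
        if (type == null) {
            throw new IllegalArgumentException("Unknown column: " + name);
        }
        return type;
    }
}

/** Hypothetical in-memory dataset that checks every value against the metadata. */
final class SketchDataSet {
    private final SketchMetaData meta;
    private final List<Map<String, Object>> rows = new ArrayList<>();

    SketchDataSet(SketchMetaData meta) {
        this.meta = meta;
    }

    void addRow(Map<String, Object> row) {
        for (Map.Entry<String, Object> e : row.entrySet()) {
            Class<?> expected = meta.typeOf(e.getKey());
            if (e.getValue() != null && !expected.isInstance(e.getValue())) {
                throw new IllegalArgumentException(
                        "Column " + e.getKey() + " expects " + expected.getSimpleName());
            }
        }
        rows.add(new LinkedHashMap<>(row)); // defensive copy
    }

    int size() {
        return rows.size();
    }

    /** Typed read: fails if the requested type does not match the declared column type. */
    <T> T get(int rowIndex, String column, Class<T> type) {
        if (!type.isAssignableFrom(meta.typeOf(column))) {
            throw new IllegalArgumentException("Column " + column + " is not a " + type.getSimpleName());
        }
        return type.cast(rows.get(rowIndex).get(column));
    }

    public static void main(String[] args) {
        SketchMetaData meta = new SketchMetaData()
                .column("symbol", String.class)
                .column("price", Double.class);
        SketchDataSet ds = new SketchDataSet(meta);
        ds.addRow(Map.<String, Object>of("symbol", "ABC", "price", 10.5));
        System.out.println(ds.get(0, "symbol", String.class) + " @ " + ds.get(0, "price", Double.class));
    }
}
```

The point of the design is that one definition object replaces a hand-written bean class per table: the same dataset code can serve any column layout.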

The following major features are supported:

  • In-memory caching of data rows
  • Searching and filtering through the dataset based on attributes on any given column. This is accomplished through the use of filter chains, which support the following types of matches: (1) Equality, (2) Regular expressions, (3) Numeric Ranges, (4) Date Ranges.
  • Scrolling through query results via a cursor, akin to java.sql.ResultSet
  • Indexing of columns for optimized lookup
  • Basic aggregation functionality: sums, weighted sums, averages, weighted averages
  • Relational database adapters to load relational rows effortlessly
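
The filter-chain and aggregation features listed above can be pictured with a short, self-contained sketch. It uses only java.util.function.Predicate and java.util.regex.Pattern from the JDK; the class FilterChainSketch and its methods are hypothetical and do not reflect the actual Casper Datasets classes. Rows are modeled as plain maps, and a date-range filter would follow the same pattern as the numeric range shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Hypothetical filter chain over rows modeled as column-name-to-value maps. */
final class FilterChainSketch {

    /** Equality match on one column. */
    static Predicate<Map<String, Object>> equalTo(String column, Object value) {
        return row -> value.equals(row.get(column));
    }

    /** Regular-expression match on the string form of one column (whole value must match). */
    static Predicate<Map<String, Object>> matches(String column, String regex) {
        Pattern pattern = Pattern.compile(regex);
        return row -> {
            Object v = row.get(column);
            return v != null && pattern.matcher(v.toString()).matches();
        };
    }

    /** Inclusive numeric range match on one column. */
    static Predicate<Map<String, Object>> inRange(String column, double min, double max) {
        return row -> {
            Object v = row.get(column);
            if (!(v instanceof Number)) {
                return false;
            }
            double d = ((Number) v).doubleValue();
            return d >= min && d <= max;
        };
    }

    /** Applies every filter in the chain; a row is kept only if all of them accept it. */
    @SafeVarargs
    static List<Map<String, Object>> filter(List<Map<String, Object>> rows,
                                            Predicate<Map<String, Object>>... chain) {
        Predicate<Map<String, Object>> all = row -> true;
        for (Predicate<Map<String, Object>> p : chain) {
            all = all.and(p);
        }
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            if (all.test(row)) {
                out.add(row);
            }
        }
        return out;
    }

    /** Weighted average of valueColumn using weightColumn as weights; NaN if total weight is zero. */
    static double weightedAverage(List<Map<String, Object>> rows, String valueColumn, String weightColumn) {
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (Map<String, Object> row : rows) {
            double value = ((Number) row.get(valueColumn)).doubleValue();
            double weight = ((Number) row.get(weightColumn)).doubleValue();
            weightedSum += value * weight;
            totalWeight += weight;
        }
        return totalWeight == 0.0 ? Double.NaN : weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = List.of(
                Map.<String, Object>of("symbol", "AAPL", "sector", "tech", "price", 100.0, "qty", 2),
                Map.<String, Object>of("symbol", "AMZN", "sector", "tech", "price", 200.0, "qty", 1),
                Map.<String, Object>of("symbol", "XOM", "sector", "energy", "price", 50.0, "qty", 4));
        List<Map<String, Object>> tech = filter(rows, equalTo("sector", "tech"), matches("symbol", "A.*"));
        System.out.println(tech.size() + " tech rows, weighted avg price " + weightedAverage(tech, "price", "qty"));
    }
}
```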

The following import/export features are supported (via CasperDatasetsIo):

  • Import/export from collection of POJO beans
  • Import/export from delimited (CSV) files
  • Import from Excel (XLS/XLSX) files

The following major additional features are supported (via CasperDatasetsExt):

  • Narrowing functionality, which converts each column in a Casper dataset to the smallest value type that can hold all of its items without losing fidelity
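
As an illustration of what narrowing means, the sketch below picks the smallest numeric type that can represent every value in a column and then converts the column to it. This is an assumption about the general technique, not the CasperDatasetsExt implementation: the class NarrowingSketch, the candidate order (Short, Integer, Long, Double), and the restriction to the standard boxed numeric types are all choices made for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical narrowing of a numeric column; supports Byte, Short, Integer, Long, Float, Double. */
final class NarrowingSketch {
    private static final List<Class<? extends Number>> CANDIDATES =
            List.of(Short.class, Integer.class, Long.class, Double.class);

    /** Returns the smallest candidate type able to hold every non-null value in the column. */
    static Class<? extends Number> narrowestType(List<? extends Number> column) {
        int rank = 0;
        for (Number n : column) {
            if (n != null) {
                rank = Math.max(rank, rankOf(n));
            }
        }
        return CANDIDATES.get(rank);
    }

    private static int rankOf(Number n) {
        if (n instanceof Double || n instanceof Float) {
            double d = n.doubleValue();
            // Stay at Double unless the value is a whole number that fits an int exactly.
            // NaN and infinities also end up here because the comparisons fail for them.
            if (d != Math.rint(d) || d < Integer.MIN_VALUE || d > Integer.MAX_VALUE) {
                return 3;
            }
        }
        long v = n.longValue();
        if (v >= Short.MIN_VALUE && v <= Short.MAX_VALUE) {
            return 0;
        }
        if (v >= Integer.MIN_VALUE && v <= Integer.MAX_VALUE) {
            return 1;
        }
        return 2;
    }

    /**
     * Converts every value to the narrowest type. Simplification: a column that mixes
     * fractional values with longs beyond 2^53 would lose precision as Double; a real
     * implementation would need a wider type for that case.
     */
    static List<Number> narrow(List<? extends Number> column) {
        Class<? extends Number> target = narrowestType(column);
        List<Number> out = new ArrayList<>(column.size());
        for (Number n : column) {
            if (n == null) {
                out.add(null);
            } else if (target == Short.class) {
                out.add(n.shortValue());
            } else if (target == Integer.class) {
                out.add(n.intValue());
            } else if (target == Long.class) {
                out.add(n.longValue());
            } else {
                out.add(n.doubleValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> ids = List.of(1L, 40_000L);
        System.out.println(narrowestType(ids).getSimpleName() + " " + narrow(ids));
    }
}
```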

Tutorials & Instructions

Please see the wiki section for more information. Here are some quick links:

  • Tutorial-A: Basic code samples and usage instructions for the Casper Datasets framework.
  • Tutorial-B: Why use Casper Datasets? We present a fundamental data-driven problem to illustrate why a generic dataset technology may be beneficial for your next project.
  • Casper I/O: casperdatasets-io overview and code examples of importing/exporting datasets to/from files and bean collections.
  • Casper Ext: casperdatasets-ext overview and code examples of narrowing a dataset.

Performance

In a test on approximately 370,000 rows of data, a non-indexed search (i.e. a full table scan) plus a sort, yielding approximately 16,000 rows, took about 360 ms on a high-end desktop machine. An equality-based search on an indexed field performed several orders of magnitude better.
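
The gap between a full scan and an indexed equality search comes from the lookup strategy: a scan touches every row, while an index maps each distinct value directly to the rows that hold it. The sketch below shows that difference with a plain HashMap. It is an assumption about the general technique, not a description of how Casper Datasets builds its indexes, and the class ColumnIndexSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical equality lookup that uses a hash index when one exists, else a full scan. */
final class ColumnIndexSketch {
    private final List<Map<String, Object>> rows;
    // column name -> (column value -> positions of rows holding that value)
    private final Map<String, Map<Object, List<Integer>>> indexes = new HashMap<>();

    ColumnIndexSketch(List<Map<String, Object>> rows) {
        this.rows = rows;
    }

    /** Builds the index for one column in a single pass over the rows. */
    void index(String column) {
        Map<Object, List<Integer>> idx = new HashMap<>();
        for (int i = 0; i < rows.size(); i++) {
            idx.computeIfAbsent(rows.get(i).get(column), k -> new ArrayList<>()).add(i);
        }
        indexes.put(column, idx);
    }

    /** Returns the positions of all rows whose column equals the given value. */
    List<Integer> findEqual(String column, Object value) {
        Map<Object, List<Integer>> idx = indexes.get(column);
        if (idx != null) {
            // Indexed path: one hash lookup, independent of the number of rows.
            return idx.getOrDefault(value, Collections.emptyList());
        }
        // Non-indexed path: full scan over every row.
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            if (Objects.equals(rows.get(i).get(column), value)) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = List.of(
                Map.<String, Object>of("id", 1, "sector", "tech"),
                Map.<String, Object>of("id", 2, "sector", "energy"),
                Map.<String, Object>of("id", 3, "sector", "tech"));
        ColumnIndexSketch ds = new ColumnIndexSketch(rows);
        ds.index("sector");
        System.out.println(ds.findEqual("sector", "tech"));
    }
}
```

The trade-off is the usual one: the index costs memory and one pass to build, so it pays off for columns that are searched by equality repeatedly.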

  • Performance - A brief sample of performance tests run on the Casper Datasets framework (on several dataset sizes).