Field API
The `Field` object is ExeTera's analogue of the Pandas `Series` or the Numpy `ndarray`.
Fields contain (often very large) arrays of a given data type, with an API that allows intuitive manipulations of the data.
In order to store very large data arrays as efficiently as possible, Fields store their data in ways that may not be intuitive to people familiar with Pandas or Numpy. Numpy makes certain design decisions that reduce the flexibility of lists in order to gain speed and memory efficiency, and ExeTera does the same to further improve on speed and memory. The `IndexedStringField`, for example, uses two arrays: one containing the concatenated byte values of all of the strings in the field, and another containing indices that indicate where each string starts and ends. This is much faster and more memory efficient to iterate over than a Numpy string array when the variability of string lengths is very high. This kind of change, however, creates a great deal of complexity when exposed to the user, and `Field` does its best to hide that away and act like a single array of string values.
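The two-array layout can be illustrated with a small, plain-Python sketch. This is not ExeTera's implementation: the function names `encode_strings` and `decode_string` are hypothetical, and the convention that the index array holds one more entry than there are strings (so that string `i` occupies `values[indices[i]:indices[i + 1]]`) is an assumption made for the example.

```python
def encode_strings(strings):
    """Pack strings into one byte buffer plus a list of boundary offsets.

    Hypothetical helper: illustrates the layout only, not ExeTera's code.
    """
    values = bytearray()
    indices = [0]  # assumed convention: n strings give n + 1 offsets
    for s in strings:
        values.extend(s.encode("utf-8"))
        indices.append(len(values))
    return bytes(values), indices


def decode_string(values, indices, i):
    """Recover the i-th string from the packed representation."""
    return values[indices[i]:indices[i + 1]].decode("utf-8")


values, indices = encode_strings(["a", "", "hello", "xy"])
third = decode_string(values, indices, 2)
print(values, indices, third)
```

Short and empty strings cost nothing beyond their own bytes and one offset, which is why this layout suits data with highly variable string lengths.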
Operations on fields can be divided into the following groups:
- accessing underlying data
- constructing compatible empty fields
- arithmetic operations
- logical operations
- comparison operations
- application of filters, indices and spans
Underlying data can be accessed as follows:
- All fields have a `data` property that provides access to the underlying data that they contain. For most field types, it is very efficient to read from and write to this property, provided it is done using slice syntax
  - Indexed string fields provide `data` as a convenience method, but this should only be used when performance is not a consideration
  - Indexed string fields provide `indices` and `values` properties should you need to interact with their underlying data efficiently and directly. For the most part, we discourage this and have tried to provide you with all of the methods that you need under the hood
Fields have a `create_like` method that can be used to construct an empty field of a compatible type:
- when called with no arguments, this creates an in-memory field that can be further manipulated before eventually being assigned to a DataFrame (or not)
- when called with a DataFrame and a name, it will create an empty field on that DataFrame of the given name that can subsequently be written to

See below for examples.
Numeric and timestamp fields have the standard set of arithmetic operations that can be applied to them:
- These are `+`, `-`, `*`, `/`, `//`, `%`, and `divmod`
Numeric fields can have logical operations performed on them on an element-wise basis:
- These are `&`, `|`, `^`
Numeric, categorical and timestamp fields have comparison operations that can be applied to them:
- These are `<`, `<=`, `==`, `!=`, `>=`, `>`
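As a rough illustration of what element-wise means here, the following plain-Python sketch applies a few of these operators to ordinary lists. It says nothing about how ExeTera implements them: the helper `elementwise` is hypothetical, and the assumption is that each operator combines the values at matching positions of two equal-length fields.

```python
import operator


def elementwise(op, a, b):
    """Apply a binary operator to matching positions of two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    return [op(x, y) for x, y in zip(a, b)]


a = [7, 3, 9]
b = [2, 3, 4]

sums = elementwise(operator.add, a, b)         # arithmetic: +
floors = elementwise(operator.floordiv, a, b)  # arithmetic: //
pairs = elementwise(divmod, a, b)              # divmod: (quotient, remainder) per element
anded = elementwise(operator.and_, a, b)       # logical: &
differs = elementwise(operator.ne, a, b)       # comparison: !=
print(sums, floors, pairs, anded, differs)
```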
```python
f = ...  # get a field from somewhere
g = f.create_like()  # creates an empty, in-memory field
```

```python
f = ...  # get a field from somewhere
df = ...  # get a dataframe from somewhere
f.create_like(df, 'foo')  # creates an empty field named 'foo' on 'df'
new_field = df['foo']  # get the new, empty field
```
The following snippet involves the creation of 'memory-based' fields. These are intermediate fields that are the result of performing operations on fields that are part of a DataFrame.
```python
df = ...  # get a dataframe from somewhere
df['c'] = df['a'] + df['b']
```
If we were to modify this snippet so that the result of the addition is first written to a separate variable, and that variable is then assigned to the DataFrame, it would essentially do the same thing under the hood:
```python
df = ...  # get a dataframe from somewhere
c = df['a'] + df['b']  # 'c' is an in-memory field
df['c'] = c
```
```python
filt = ...  # get a filter from somewhere
f = ...  # get a field from somewhere
df = ...  # get a dataframe from somewhere

g = f.apply_filter(filt)  # in-memory field 'g' produced by performing the filter: 'f' is unchanged

# create a field on dataframe 'df' and then assign to it - the slightly awkward way
h = f.create_like(df, 'h')
h = f.apply_filter(filt, h)

# the one-line way
df['h'] = f.apply_filter(filt)

# destructive, in-field filter
f.apply_filter(filt, in_place=True)
```
```python
inds = ...  # get indices from somewhere
f = ...  # get a field from somewhere
df = ...  # get a dataframe from somewhere

g = f.apply_indices(inds)  # in-memory field 'g' produced by performing the index: 'f' is unchanged

# create a field on dataframe 'df' and then assign to it - the slightly awkward way
h = f.create_like(df, 'h')
h = f.apply_indices(inds, h)

# the one-line way
df['h'] = f.apply_indices(inds)

# destructive, in-field reindexing
f.apply_indices(inds, in_place=True)
```
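The difference between a filter and a set of indices can be sketched in plain Python. The helpers below are hypothetical stand-ins rather than ExeTera code, and they assume the usual conventions: a filter is a sequence of booleans, one per element, that keeps the elements marked `True`, whereas indices are integer positions that select (and may reorder or repeat) elements.

```python
def filter_values(values, filt):
    """Keep the elements whose matching filter entry is True (hypothetical helper)."""
    if len(values) != len(filt):
        raise ValueError("filter must have one entry per element")
    return [v for v, keep in zip(values, filt) if keep]


def take_indices(values, inds):
    """Select elements by integer position; positions may repeat or reorder."""
    return [values[i] for i in inds]


values = [10, 20, 30, 40]
filtered = filter_values(values, [True, False, True, False])
reindexed = take_indices(values, [3, 0, 0])
print(filtered, reindexed)
```

Under these assumptions a filter can only shorten a field while preserving order, whereas indices can produce a result of any length and order.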
```python
session = ...  # the Session object
s = ...  # field from which we will obtain spans (typically a primary key)
f = ...  # get a field from somewhere
df = ...  # get a dataframe from somewhere

spans = session.get_spans(s)

g = f.apply_spans_max(spans)  # in-memory field 'g' produced by performing the span application: 'f' is unchanged

# create a field on dataframe 'df' and then assign to it - the slightly awkward way
h = f.create_like(df, 'h')
h = f.apply_spans_max(spans, h)

# the one-line way
df['h'] = f.apply_spans_max(spans)

# destructive, in-field span application
f.apply_spans_max(spans, in_place=True)
```
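Spans are less familiar than filters and indices, so a plain-Python sketch may help. It assumes that the spans of a sorted key field are the boundaries of each run of equal values (so `n` runs give `n + 1` boundaries), and that a span operation such as `apply_spans_max` reduces each run of the target field to a single value. The helpers `spans_of` and `spans_max` are hypothetical illustrations, not ExeTera's implementation.

```python
def spans_of(keys):
    """Return the boundaries of each run of equal, adjacent keys."""
    spans = [0]
    for i in range(1, len(keys)):
        if keys[i] != keys[i - 1]:
            spans.append(i)  # a new run starts at position i
    if keys:
        spans.append(len(keys))  # close the final run
    return spans


def spans_max(spans, values):
    """Reduce each span of 'values' to its maximum."""
    return [max(values[start:end]) for start, end in zip(spans[:-1], spans[1:])]


keys = [1, 1, 2, 2, 2, 3]      # e.g. a sorted primary key
values = [5, 9, 1, 7, 3, 4]    # the field being reduced

spans = spans_of(keys)
maxima = spans_max(spans, values)
print(spans, maxima)
```

With these assumptions the result has one entry per distinct run of keys, so it is typically shorter than the field it was computed from.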