Interactive Graphics on the Web

Motivation

The web is the most accessible medium of the age. Thus, it is an effective means for, e.g., communicating results to scientists and others of a non-technical persuasion.

Goal

Make it easy for data analysts to distribute their analyses and results through the web, as interactive, graphics-rich documents. A more concrete version of this goal is to make it as easy as possible to port something like cranvas to the web.

Requirements

Computation

Generating static web documents from an analysis environment (R) is a mostly solved problem with packages like rook, brew, and knitr. Much more difficult are interactive documents, because they require logic to be executed whenever the document is viewed. This logic can happen in the client, the server, or both.

Running R code on the server is becoming increasingly easy with frameworks like RApache/rook, Rserve/FastRWeb, and RStudio’s shiny. Rserve and shiny have moved to support websockets, which permit bi-directional communication. However, latency persists.

This latency is really only a problem when interacting with plots. Any delays spoil the flow of the analysis.

Running the logic on the client avoids this latency, although the client may be less computationally powerful than the server. The low-level language of the web client is JavaScript. So, a central question is whether we program directly in JS, in some JS-based language, or stick with R, somehow.

R on the client

Running R in the web browser would require one of the following solutions:

Implement R in JS

A JS implementation of R would be useful if we needed to interactively evaluate R code in the browser. There is no performance reason for doing this; the server should suffice for code evaluation. Direct compilation is probably faster for reactive logic.

Translate R code to JS code

Translating R code to JS code might be workable. The code would need to call down to the server for R routines that have not been ported. The RJavaScript package has a prototype translator directly from R to JS. Other possibilities exist, like compilation to LLVM (using Rllvm) and from that to JS through emscripten. A challenge here would be debugging.

Bridge to the C implementation of R using a plugin

Gabe has implemented RBrowserPlugin for embedding the C/Fortran implementation of R inside any web browser supporting NPAPI. The unfortunate downside is that the user needs to install a plugin. Also, it is not clear for how long browsers will support plugins, given the power of HTML/SVG/CSS/JS. Similar plugins exist for many languages, like Python, but I at least have never seen a web page make use of one.

JS on the client

The alternative, of course, is to simply write JS for the client, just as we still write C code by hand today (Simon’s R2C package notwithstanding). This would probably be easier to debug, since it skips a compilation stage. If we went this route, how could we lessen the burden?

We could take a page from Rcpp sugar, i.e., make JavaScript more like R by adding vectorized operations, etc. Such a library could eventually be accelerated with WebCL. And it would be a lot more accessible than R.
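
As a rough sketch of what such sugar might look like (every name here is hypothetical, not an existing library), vectorized arithmetic with R-style recycling can be layered onto plain JS arrays:

// Hypothetical R-style vectorization over JS arrays, with recycling.
function map2(f, x, y) {
  var n = Math.max(x.length, y.length);
  var out = new Array(n);
  for (var i = 0; i < n; i++)
    out[i] = f(x[i % x.length], y[i % y.length]);
  return out;
}
function plus(x, y)  { return map2(function(a, b) { return a + b; }, x, y); }
function times(x, y) { return map2(function(a, b) { return a * b; }, x, y); }

// plus([1, 2, 3, 4], [10, 20]) => [11, 22, 13, 24], as with R's recycling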

Drawing

One key question is how to draw the graphics. The graphics need to be scalable (in both speed and memory usage). On the desktop, we settled on OpenGL, and Qt was a big help. It is doubtful that anything similar exists for WebGL, yet.

The primary scene graph on the web is SVG. SVG offers some useful features, like collision detection, efficient incremental updates, etc. However, like many such scene graphs, SVG is unlikely to scale very well to large numbers of primitives.

With qtpaint, we had the choice of using a single built-in QGraphicsItem for every datum or instead implementing custom QGraphicsItems that would each render multiple data. Obviously, we chose the latter. To do the same thing on the web, we need our own custom nodes. The truly unfortunate thing about SVG is that it exposes no low-level drawing API, so there is no way to extend it with custom nodes at the JS level. Instead, we need to draw to <canvas> and possibly integrate with a <canvas> scene graph like paper.js, or just use HTML/CSS/jQuery.

The qtpaint package allowed for multiple views of the same scene. While the uses of multiple views of the same data model are obvious, it is not clear whether we would ever want multiple views of the same scene. The closest use case is a scrollable view with an overview window. But even then, there are differences; for example, the current viewport is shown as a rectangle in the overview. In qtpaint, we hacked in an overlay layer to try to handle such situations – but there are probably better ways.

In Firefox, one could use “-moz-element(#id)” for the CSS background to get multiple views of the same DOM element.

We then would need one or more high-level APIs on top of our low-level scene graph, in order to facilitate usage from JS. Perhaps D3 could be extended to provide an intermediate layer, and then things like ggplot2 and cranvas could be implemented on top of that.

Spatial Queries

One thing that we lose by choosing <canvas> over SVG is the mapping of user events to visual primitives, which is another requirement. The spatial index needs to be able to store points, rectangles, and polygons, and to accept the same shapes as queries.

R Interface

At some point, it would be nice to be able to produce web graphics from R. We could create a translator from R to our JS sugar. Then it would be easy to create a backend for anypaint that relies on that translator to deploy R logic to the client and that calls down to the low-level drawing API.

Architecture

Here are the main architectural components:

  • Low-level drawing library on the client (JS) that is optimized for statistical graphics and is based on a scene graph of layers
  • Some efficient mechanism for mapping user events back to primitives
  • Some sort of stat computing environment/language in JS
  • Some way to compile/translate R to JS
  • Backend for anypaint that would deploy a cranvas plot to the web with minimal effort

Other software can tie into this architecture at any level. In particular, someone might implement an API in the style of cranvas or ggplot2 directly on top of the low-level drawing API, completely in JS.

Design

Low-level Drawing

From our requirements, we need a fast, low-level drawing API that would be based on <canvas> and WebGL. Some libraries exist to help with this.

Drawing API

We will base the design on the Painter API from anypaint. There will be an abstract API and an implementation for both the 2D context and WebGL.

See this for 2D in WebGL: http://games.greggman.com/game/webgl-fundamentals/

This may help for buffered rendering: https://developer.mozilla.org/en-US/docs/Web/API/OffscreenCanvas But there may be simpler ways.

Some important characteristics of the API:

  • Vectorized over graphical parameters: we often have vectors of data, even for things like color and font size, and the API should accept them as vectors. As long as the data has been sorted based on the parameters, there will be minimal state changes. It might be a bit more efficient to split by the attributes, but that would complicate the data structures. The API could sort, but that would affect the drawing order, which might not be desirable. And the pipeline could cache the sorting. This style of API works particularly well with WebGL, which is itself array-oriented. A sketch follows this list.
  • Transforms data to device space: the Layer will request the data limits for a particular dataset from the geom and use them to establish a transformation on the Painter for data to device space. This is different from the low-level transform on the <canvas> context, because the painter transforms only positions, not the sizes of lines, images, text, etc. If the drawing layer does not do this, it is left to the geoms, which would not be very convenient.
  • Color representation: the 2D canvas uses CSS color specs that are normalized to “rgba(r, g, b, a)”. This is great for hard coding colors; much less great for computing on colors. It would be particularly inconvenient for the WebGL backend, which would prefer something like an RGBA integer matrix, like the output of col2rgb, except it would be stored row-wise and transposed in order to fit as a column. Wait on this for now.
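
As a concrete sketch of the first two points (all names are hypothetical; nothing here is settled API), a 2D-context painter that accepts vectorized positions and colors, and owns a data-to-device transform that maps positions but not sizes:

// Hypothetical Painter over the 2D context. Positions are mapped from data
// to device space; sizes (here, the point radius) stay in device units.
function Painter(ctx, xlim, ylim) {
  this.ctx = ctx;
  var w = ctx.canvas.width, h = ctx.canvas.height;
  this.tx = function(x) { return (x - xlim[0]) / (xlim[1] - xlim[0]) * w; };
  this.ty = function(y) { return h - (y - ylim[0]) / (ylim[1] - ylim[0]) * h; };
}

// Vectorized over position and fill color; fill may be a single CSS color
// string or one color per point (recycled).
Painter.prototype.drawPoints = function(x, y, fill, radius) {
  var ctx = this.ctx, last = null;
  for (var i = 0; i < x.length; i++) {
    var col = typeof fill === "string" ? fill : fill[i % fill.length];
    if (col !== last) { ctx.fillStyle = col; last = col; } // minimize state changes
    ctx.beginPath();
    ctx.arc(this.tx(x[i]), this.ty(y[i]), radius, 0, 2 * Math.PI);
    ctx.fill();
  }
};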

<canvas> Libraries

Processing.js

The Processing language ported to JS. Very mature, strong community. Has support for WebGL (oriented towards 3D though). Big downside is that it is not data-focused, and it is not vectorized.

Facet

There is the Facet library by Carlos, which tries to hide all of the OpenGL ES shader stuff. It seems a wee bit rough right now (demos do not work, not a lot of documentation). Also, there is no interactivity in the plot itself, which explains why Simon was using D3/SVG for that. In theory, we could fill in the gaps in the “marks” API to support the equivalent of qtpaint’s Painter.

Non-WebGL 2D <canvas> scene graphs

We probably want to just use the DOM as our scene graph. This avoids extraneous dependencies, as these Flash-oriented libraries offer little for statistical graphics.

  • KineticJS: Designed for performance; seems very mature and well documented. And probably the most popular. Supports caching of layers.
  • Fabric.js: 2D scene graph with SVG parsing
  • Easel.js: 2D scene graph with a vector graphics abstraction
  • oCanvas

Instead, we could just use this to stack up the canvas elements: http://www.arc.id.au/CanvasLayers.html

pixi.js rendering engine

This might greatly simplify the implementation of splotch. It has WebGL-based 2D primitive rendering with a canvas fallback, as well as a scene graph.

Optimizations

Here are the keys to making the <canvas> fast:

  • Avoid sub-pixel rendering (we try to disable this)
  • Prerender (we support image blitting)
  • Update incrementally (we encourage use of layers)
  • Draw in batch, avoid stroke/fill (we try our best but user must sort)
  • Avoid state changes (again, we try our best)
  • Get closer to the hardware: WebGL – TODO

The current design for batch drawing is to accept state vectors and only e.g. stroke/fill when the style changes. This requires checking for changes in the loop, as in qtpaint, but since we are no longer in C++, the check itself could be a performance drain. We should profile. If it is slow, we will need to implement fast paths or maybe even ask the user to provide run-length encodings.
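
A sketch of that batching loop, assuming the caller has sorted so that equal styles are adjacent: segments accumulate into one path, and stroke() fires only when the pen color changes.

// Hypothetical batched segment drawing: one stroke() per run of equal color.
function drawSegments(ctx, x0, y0, x1, y1, stroke) {
  var current = null;
  for (var i = 0; i < x0.length; i++) {
    if (stroke[i] !== current) {
      if (current !== null) ctx.stroke(); // flush the previous run
      current = stroke[i];
      ctx.strokeStyle = current;
      ctx.beginPath();
    }
    ctx.moveTo(x0[i], y0[i]);
    ctx.lineTo(x1[i], y1[i]);
  }
  if (current !== null) ctx.stroke();
}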

Note on not using floating-point coordinates with the 2D context: http://seb.ly/2011/02/html5-canvas-sprite-optimisation/ This means we either (a) need to do all transformations ourselves and then round, (b) use WebGL, or (c) accept poor performance. We might be able to disable smoothing in FF/WebKit: ctx.mozImageSmoothingEnabled = false; ctx.webkitImageSmoothingEnabled = false; ctx.oImageSmoothingEnabled = false;

Scene graph

The Painter API will draw to a canvas element, and multiple canvases could be combined as nodes in a scene graph. This comes directly from HTML/CSS and jQuery-based layouts. There is a convenient lightweight CanvasStack library for stacking canvases on top of each other.

The Layer object will associate a <canvas> with a geom and pipeline stage. It will listen for changes in the pipeline and respond to them by asking the geom to draw to the <canvas> through a Painter. It will pass both the data (pipeline stage) and Painter to the geom. Similarly, the Layer will listen to the <canvas> to know when it needs to be redrawn (typically due to a resize).
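
A rough sketch of that wiring, building on the hypothetical Painter above and assuming a stage that fires a “change” event (none of these interfaces are settled):

// Hypothetical Layer: binds a <canvas> to a pipeline stage and a geom.
function Layer(canvas, stage, geom, limits) {
  this.canvas = canvas;
  this.stage = stage;   // tabular data model; fires "change" on updates
  this.geom = geom;     // knows how to draw the data through a Painter
  this.limits = limits; // data limits, e.g. {x: [0, 1], y: [0, 1]}
  var self = this;
  stage.on("change", function() { self.redraw(); });
  window.addEventListener("resize", function() { self.redraw(); });
}

Layer.prototype.redraw = function() {
  var ctx = this.canvas.getContext("2d");
  ctx.clearRect(0, 0, this.canvas.width, this.canvas.height);
  this.geom.draw(this.stage, new Painter(ctx, this.limits.x, this.limits.y));
};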

Spatial Queries

We of course need some way to find the points that fall within a region, or the rectangles that contain a given point. This might fit the bill: https://github.com/mikechambers/ExamplesByMesh/tree/master/JavaScript/QuadTree However, it supports only points and bounded items (rectangles), not arbitrary polygons. Polygons are a pain in the ass! It also seems like it might be buggy. The Closure library has one, but it only works for a single point. Here is another option: https://github.com/silflow/quadtree-javascript

D3 also provides one, although it is pretty barebones. Other options include: porting Gabe’s SearchTrees package, or waiting for the hit detection in canvas v5.
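
If none of these pans out, a minimal point quadtree is not much code. A sketch (illustrative only: points and rectangular queries, no polygons):

// Minimal point quadtree: insert points, query by rectangle {x, y, w, h}.
function QuadTree(bounds, capacity) {
  this.bounds = bounds;
  this.capacity = capacity || 8;
  this.points = [];
  this.children = null;
}

QuadTree.prototype.insert = function(p) { // p: {x, y, ...}
  var b = this.bounds;
  if (p.x < b.x || p.x >= b.x + b.w || p.y < b.y || p.y >= b.y + b.h)
    return false; // outside this node
  if (this.children === null) {
    if (this.points.length < this.capacity) { this.points.push(p); return true; }
    this.subdivide();
  }
  for (var i = 0; i < 4; i++)
    if (this.children[i].insert(p)) return true;
  return false;
};

QuadTree.prototype.subdivide = function() {
  var b = this.bounds, hw = b.w / 2, hh = b.h / 2;
  this.children = [
    new QuadTree({x: b.x,      y: b.y,      w: hw, h: hh}, this.capacity),
    new QuadTree({x: b.x + hw, y: b.y,      w: hw, h: hh}, this.capacity),
    new QuadTree({x: b.x,      y: b.y + hh, w: hw, h: hh}, this.capacity),
    new QuadTree({x: b.x + hw, y: b.y + hh, w: hw, h: hh}, this.capacity)
  ];
  var old = this.points;
  this.points = [];
  for (var i = 0; i < old.length; i++) this.insert(old[i]);
};

QuadTree.prototype.query = function(r, out) {
  out = out || [];
  var b = this.bounds;
  if (r.x > b.x + b.w || r.x + r.w < b.x || r.y > b.y + b.h || r.y + r.h < b.y)
    return out; // query rect misses this node entirely
  for (var i = 0; i < this.points.length; i++) {
    var p = this.points[i];
    if (p.x >= r.x && p.x <= r.x + r.w && p.y >= r.y && p.y <= r.y + r.h)
      out.push(p);
  }
  if (this.children !== null)
    for (var j = 0; j < 4; j++) this.children[j].query(r, out);
  return out;
};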

High-level Drawing

On the JS side, we need to facilitate usage of the low-level API. One of the popular APIs is D3: Data Driven Documents. It has a clean, vectorized API for mapping data to visual primitives. Interactivity is fully supported, with quad trees, etc. Unlike Facet and Processing, D3 is oriented around SVG instead of <canvas>. We cannot extend SVG directly, but we could extend D3 to target a different type of DOM, one that draws in aggregate via <canvas>. Since the R interface will likely target the low-level API directly, we should probably consider this a lower priority, but other projects, like the JS port of ggplot2, could benefit.

See: http://bl.ocks.org/1276463

Data Binding

We have always modeled data according to data.frame, and the plumbr package provided a mutable version of it. On the web, there is a welcome move towards client-side data models, where the server essentially becomes a RESTful service (this seems very much in line with shiny).

There is a good list of these frameworks here: http://weblogs.asp.net/dwahlin/archive/2012/07/08/javascript-data-binding-frameworks.aspx

We will create a tabular model that supports chaining into pipelines. It will be based on the popular backbone.js and could use crossfilter for filtering/grouping.
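
A sketch of the shape this could take (TableStage and its methods are hypothetical; Backbone.Events and underscore are the real libraries):

// Hypothetical column-oriented table stage built on Backbone.Events.
var TableStage = function(columns) { // columns: {name: array, ...}
  this.columns = columns;
};
_.extend(TableStage.prototype, Backbone.Events, {
  get: function(name) { return this.columns[name]; },
  set: function(name, values) {
    this.columns[name] = values;
    this.trigger("change", name); // downstream stages listen for this
  },
  // Chain a derived stage that recomputes itself on upstream changes.
  transform: function(name, f) {
    var child = new TableStage(_.clone(this.columns));
    var self = this;
    var update = function() { child.columns[name] = f(self.columns); };
    this.on("change", function() { update(); child.trigger("change", name); });
    update();
    return child;
  }
});

// Usage sketch:
// var root = new TableStage({mpg: [21, 22.8], hp: [110, 93]});
// var kpl = root.transform("kpl", function(c) {
//   return c.mpg.map(function(v) { return v * 0.425; });
// });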

An Interactive Grammar of Graphics

Until now, interactions have been programmed directly in general purpose languages, without any real grammar-like abstraction. The benefits of an abstraction are many and include the ability for the server to pass the interactive logic to the client and thus avoid client/server communication. A grammar also affords a framework through which we can systematically construct, describe, evaluate and compare plots. It is a cognitive tool for exploring the space of possible plots, and thus helps us discover the appropriate plot for a given question, without relying on leaps of intuition.

Before we extend the base grammar of graphics to interactivity, we first need to consider dynamic graphics, i.e., graphics that change over time. A graphic is dynamic if any of the data underlying the plot changes. The dynamic data might be empirical, in the case of streaming data, or it might be computational, such as when we animate the values of a parameter over a range. To support dynamic data, we need two things: a controller that drives the updates and a data model that signals when the data change.

A dynamic graphic becomes interactive when we allow the user to change the data, i.e., the user becomes the controller. Typically the scope of modification is limited to computational data, i.e., the parameters, and what we might call user data, such as a selection of points. For example, the user might adjust the parameter of a statistic using a slider. Or, the user might modify the selected status of points by drawing an enclosing rectangle in the plot.

An interactive grammar of graphics extends a conventional grammar with the logic of the controller: it describes how user actions translate into modifications to the data. The interactive grammar relies on the base grammar for mapping the data to graphical primitives, and it reverses that pipeline to implement interactions. We define the control component as the inverse of the geom. While the geom maps data to a rendering, the control maps device-level input to the data. Other grammar components, such as the coord and the stat, also need to operate in reverse.

In the reverse pipeline, the data model plays a role analogous to the renderer in the forward pipeline. The data model becomes the drawing surface for the user. The typical rendering library offers abstractions for drawing shapes and images. Analogously, there should be data models with specific support for e.g. storing rectangular indications. We define these specialized data types under the data component. The data model will typically consist of multiple, interdependent data components.
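
To make the inversion concrete, here is a hypothetical control that inverts the painter’s coordinate transform (invX/invY are assumed inverse mappings, not part of the Painter sketch above) and draws onto a rectangular indication model:

// Hypothetical control, the inverse of a geom: device events are mapped
// back to data coordinates and written into an indication data model,
// whose change events then drive the rest of the visualization.
function RectControl(canvas, painter, indication) {
  canvas.addEventListener("mousedown", function(e) {
    indication.set("x0", painter.invX(e.offsetX));
    indication.set("y0", painter.invY(e.offsetY));
  });
  canvas.addEventListener("mousemove", function(e) {
    if (e.buttons !== 0) { // dragging: update the far corner
      indication.set("x1", painter.invX(e.offsetX));
      indication.set("y1", painter.invY(e.offsetY));
    }
  });
}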

Typical Interactions

The user interacts with a plot to issue a command that changes some aspect of the plot (the view) or the pipeline (the data model). Typically, the user adjusts the plot axes, or manipulates annotations on the data, such as whether a point is selected. Often, interactions affect multiple plots (linking). These plots need to share the same (root) pipeline.

From the user perspective, the most common commands are query, pan/zoom, brush and transform.

Query

Exposes more detail about a row selection. Views are typically filtered based on selection status, so that the selected rows become visible in a plot, table, or tooltip/label.

Pan/zoom

Pan and zoom correspond to translation and scaling of the (continuous) limits, respectively. Pan/zoom is related to querying in that pan/zoom is based on a regional selection, while querying is based on a row selection. Interfaces include click/drag pan, wheel zoom, and rubberband zoom (which also pans).
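
In code, both are simple affine updates of the limits; a sketch, where zooming is anchored at a point in data coordinates (e.g., under the wheel cursor):

// Pan: translate the limits; zoom: scale them about an anchor point.
function pan(lim, delta) {
  return [lim[0] + delta, lim[1] + delta];
}
function zoom(lim, anchor, factor) { // factor < 1 zooms in
  return [anchor + (lim[0] - anchor) * factor,
          anchor + (lim[1] - anchor) * factor];
}
// zoom([0, 10], 5, 0.5) => [2.5, 7.5]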

Brush

Brushing changes one or more visual properties based on a selection. Typically, it changes the color of the selected rows, but it could also, e.g., change the glyph type, or even color the selected region.

Transform

The pipeline often transforms, filters and/or aggregates data before it reaches the plot. The user can change the parameters of these operations by e.g. dragging a slider or click/dragging in the plot (tours, binning). Reordering the levels of a categorical axis also falls into this category.

Commands

In any GUI, the user issues a command by first specifying any parameters of the command through “stateful” controls like sliders and dropdown menus. The user executes the command by taking some instantaneous action, such as clicking a button or pressing enter in a text entry. Typically, the command parameters are spread across several widgets. Alternatively, we could collect the state into a single object (model) representing the command. This is called the command design pattern and it fits well with MVC.

Since a plot is a GUI, the same applies to plots. There are commands to select a point, color a point with a brush, reorder a categorical axis, etc. The user first configures the command by, e.g., drawing a rectangle, or dragging an axis label to a different position. Releasing the mouse button invokes the command. Following the command design pattern, each command should have its own data model holding its parameters, and a function that carries out the command.
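
A sketch of the pattern in this setting (all names hypothetical): the command’s parameters live in a small model, and releasing the mouse calls execute().

// Hypothetical brush command: parameters in a model, action in a function.
function BrushCommand(indication, selection) {
  this.indication = indication; // the brushed rectangle {x0, y0, x1, y1}
  this.selection = selection;   // selection variable in the data model
}
BrushCommand.prototype.execute = function(x, y) {
  var r = this.indication; // assumes normalized corners (x0 <= x1, y0 <= y1)
  var inside = x.map(function(xi, i) {
    return xi >= r.x0 && xi <= r.x1 && y[i] >= r.y0 && y[i] <= r.y1;
  });
  this.selection.set(inside);
};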

Indication

Plot commands are usually parameterized by a set of zero-, one-, or two-dimensional shapes that the user defines with a pointing device. We call that geometry the indication. To define the indication, the user performs actions with the pointer device. The controller maps the resulting events to changes in the indication, and changes in the indication propagate to the rest of the visualization: they might alter the selection status of some points, pan/zoom the extents of another plot, etc.

The indication will typically be defined in terms of data coordinates; however, other coordinate systems are possible. For example, the GGobi identify tool selects the nearest point, as long as the point is within some fixed number of pixels on the device. To reimplement that, we might define the indication in screen coordinates and the selection function would somehow compare the indication to the screen coordinates of the points.

Cursors

To guide the user in making an indication, the system renders a guide that we call the cursor.

Many commands are based on a single point, indicated by the pointer position, as reflected by the cursor from the windowing system. Sometimes additional feedback is helpful, such as drawing a line when selecting a univariate cutoff. Cursors become more complex when defining 2D regions, such as rectangles or polygons, or an angle, as in angular brushing. Some cursors might be a permanent control in a plot, and the user can manipulate them like any other widget. For example, a rectangular brush could be resized by clicking on a corner and dragging.

Here are some important axes along which we can categorize different types of cursors:

  • Control: hover (query) or click/drag (brush)
  • Permanence: does the cursor remain after selecting?
  • Dimensionality: zero (point), one (line), or two (rect, lasso, angle)

Selection

Most user actions boil down to a type of selection. Fundamentally, a selection is implemented by a function that maps the indication to the selection variable, which is typically boolean (selected or not) but could also be continuous (weighted selection).

There may be multiple selections in a plot. Use cases include transient vs. persistent selection, and a separate selection for each color (like the GGobi brush), although this might be best represented as a categorical selection. Selections may be merged with logical operators, i.e., and/or/not/xor. If the selections are disjoint, a factor can be generated from selection membership, using NA for unselected rows.
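
Merging boolean selections then reduces to elementwise logic; a sketch:

// Elementwise merging of boolean selection vectors.
function and(a, b) { return a.map(function(v, i) { return v && b[i]; }); }
function or(a, b)  { return a.map(function(v, i) { return v || b[i]; }); }
function not(a)    { return a.map(function(v) { return !v; }); }
function xor(a, b) { return a.map(function(v, i) { return v !== b[i]; }); }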

The function that maps the indication to the selection variable has at least two important aspects:

  • Codomain: boolean, factor, or continuous (proximity)?
  • Cardinality: select one or multiple?

We can also parameterize how the function is applied:

  • Target: data points or regions?
  • Persistency: do selections remain even after the cursor has moved?

We consider downstream effects of selection to be outside the scope of this definition. Those effects include mapping the selection status to a color, filtering based on selection, and selection propagation (e.g., when selecting a point leads to the selection of all “similar” points).

The GGobi-style Graphics Pipeline

The design of the pipeline developed for GGobi is still very much appropriate for interactive web applications. The pipeline consists of a series of chained tabular data models (also seen in plumbr).

There are really only a few types of pipeline stages. There may be multiple instances of each type of stage, and the stages can be arranged in any order (after the root). These types include:

  • Data source (root)
  • Filter rows
  • Transform columns
  • Pivot between long and wide formats
  • Aggregation
  • Joins between datasets
  • Splits (just coordinated list of filters)

The data source is at the root of the pipeline; it hides the actual storage of the data.

The filter stage removes rows based on the value of some variables or other input. Multiple filter stages connected to the same parent stage allow splitting of data.

The transform stage produces a new variable from some other variables or other input. The new variable might replace an existing variable or be added to the dataset.

The merge stage joins together two datasets. This could be a simple concatenation or a more complex join, usually according to the common variables. It might also add a variable describing the original source of each row.

It’s worth noting that GGobi also contained stages like planar and screen. We now view these as part of the rendering process. The planar stage roughly corresponds to the “geom” (with the tour functionality becoming a transformation) and the screen stage is part of the drawing layer.

Interactions as the root of the (reverse) pipeline

GGobi also introduced the concept of the reverse pipeline, and this is fundamental to handling interactions. The interactive grammar modifies the data at the leaves (the plots), and the reverse pipeline maps those changes back to the root dataset. In the reverse pipeline, the interaction is the root and the leaves are the datasets.
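
A sketch of the bookkeeping a filter stage needs for this, building on the hypothetical TableStage above: it remembers which parent rows it kept, so a row selection at the leaf can be translated back up toward the root.

// Hypothetical filter stage that supports the reverse pipeline by keeping
// the indices of the parent rows it retained.
function FilterStage(parent, predicate) { // predicate(columns, i) -> boolean
  this.parent = parent;
  this.predicate = predicate;
  var self = this;
  parent.on("change", function() { self.refresh(); self.trigger("change"); });
  this.refresh();
}
_.extend(FilterStage.prototype, Backbone.Events, {
  refresh: function() {
    var cols = this.parent.columns;
    var n = cols[_.keys(cols)[0]].length;
    this.keep = [];
    for (var i = 0; i < n; i++)
      if (this.predicate(cols, i)) this.keep.push(i);
  },
  // Reverse step: map leaf row indices back to parent row indices.
  up: function(rows) {
    var keep = this.keep;
    return rows.map(function(i) { return keep[i]; });
  }
});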

The critical point here is that the interaction is part of the pipeline. It is a special root stage of the reverse pipeline. In the same way that the geom calls a widget rendering API, the interaction registers event listeners with a widget. It could really be any control, but if we limit ourselves to the plot itself, the widgets are layers in a scene graph. In most scene graphs, a layer can be both an output and an input, in shiny terminology. That does not mean that a layer must always do both. For simplicity in these discussions, we will keep drawing and listening separated between layers.

Besides modifying the data, the interactions can also modify the properties of the pipeline stages.

So there are some obvious general types of interactions:

  • Data assignment
  • Property assignment (pipeline, widget, etc)

Geoms and interactions both sit at the end of the pipeline and both interface with the layers. Often, the distinction is blurred, such as how a brush listens to the mouse and draws the selection cursor. We might use a general term for these objects: plidgets (for plot widgets).

Use cases

Let’s consider some use cases (from cranvas) and how we might express them in the grammar:

Identify points in a scatterplot

data -> points
  `-> identified -> labels
          ^-- identify [transient]
  `-> not_identified <- identify [persistent]

Here, the identify interaction modifies a variable in the original data so that the identified points pass the identified filter and have their labels plotted. We use the high-level terms labels and identify although in reality these are just certain parameterizations of more general components. The labels component is really the text geom, with the necessary aesthetic mapping to draw the IDs (or any variable of interest) of the points. The text is aligned and offset according to some parameters. The offset could be derived from a transform stage, but we ignore that for simplicity.

Similarly, identify is a parameterization of something more general: it is a type of data assignment. One parameter is the variable name(s) that it sets. It might set the entire variable so that only the currently selected point is identified, or, in the case of persistent/sticky identification, it would perform a sub-assignment so that the selected point is added to the set. The persistent mode would probably be implemented by a not-identified filter operating in reverse, as shown in the diagram.

The other parameter is the value that is set. That needs to be dynamically computed from the low-level input event. The function would take the event as its input and yield the new values for the variable(s). This function is analogous to the geom function that draws given one or more variables. It would need to support many parameters to handle both identification and other types of brushing.

Specifying interactions

These would include the type of query: nearest neighbor or within region (circle, rectangle)? If an abstract region is specified, there needs to be a mapping from event to region. This would consider the alignment between the region and the mouse point, and whether the region is given in device or data coordinates.

JS syntax

The JS syntax for constructing pipelines could be similar to that of D3:

data.points();
data.is("_identified").text("id");
data.isnt("_identified").action(identify);

R to Client/Javascript Interface

This will be difficult.

The developer can program using D3 directly, but what about the R user? Hadley has started ggvis, a package for using Vega as a ggplot2/grammar-of-graphics backend. It attempts to define and implement an interactive grammar of graphics. It depends on Shiny to implement R callbacks. That means there generally needs to be a server, and there will be network overhead. Ideally, we could express our computations in R, but have them run in the browser.

Compilation path

The first question is whether we translate to JavaScript directly, or whether we proceed through LLVM and Emscripten.

Compiling R to JavaScript in general is infeasible. However, a subset may be doable. The RJavaScript package is a good start. We would probably need to add support for manipulating pipeline properties and the data. The pipeline (and thus the data) would be passed to the callback handlers. All coordination should happen through that object. We can use codetools to ensure that external dependencies are resolved and uploaded to the client.
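
To give a flavor (this is a hypothetical translation, not RJavaScript’s actual output), a vectorized R expression might lower to calls into a small runtime library:

// R source:      selected <- mpg > 20 & cyl == 4
// Hypothetical JS output, targeting a vectorized runtime object R:
var selected = R.and(R.gt(mpg, 20), R.eq(cyl, 4));

// where the runtime supplies elementwise primitives, e.g.:
// R.gt = function(x, y) { return x.map(function(v) { return v > y; }); };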

Server-side integration

Can the server pass off code that is enclosed in the currently running session (kind of like mclapply)? If there is a client and server-side runtime, how are they kept in sync?

R Runtime

How do we mimic the R runtime, with its dynamic search path, etc.? Symbols cannot be resolved at compile time; resolution must happen at runtime.

Leveraging the R plugin through an abstraction

The R browser plugin avoids the need to use JS for stat computing. The obvious problem is that not everyone will have the plugin installed. Could we somehow make it so that the plugin is optional, without having to change the code?

It comes down to the level of abstraction.

Low-level abstraction: direct manipulation of R objects

The first is a low-level abstraction, where JS code manipulates R objects as if they were in memory, and vice-versa. The plugin supports this already, and supporting evaluation on the server would mean that the proxy objects would communicate via a web socket, instead of via the plugin.

On the R side, the abstraction would be fairly easy to implement. On the JS side, there is no way to create actual proxy objects without using a plugin, and we do not want to screw with the syntax, so we have two options:

  • Use a different language that compiles to JS, where every object has a method like .R() that performs the necessary communication. Not sure how this would distinguish between R proxies and normal JS objects, unless it is typed, or we take a performance hit by figuring it out at run time. Typing could be optional.
  • Constantly update the JS objects via the web socket to remain in sync with the R session. This would be complicated and not very performant.

Either way, this general approach lets us distribute the logic between the two languages. The high-level logic could be in JS and R would be used only for stat computing tasks, or JS could call directly to R and R would manipulate the DOM. Compare this to the design of anypaint or shiny, where all logic is in R. In anypaint, there is an imperative API available to R for manipulating the UI. In shiny, R returns a list of outputs. Shiny has a very functional design, so there is no way to directly manipulate the client state during the evaluation of the handler. That said, it is rare for an anypaint handler to rely on mutable state. The handlers mostly issue a series of calls that could be easily deferred for execution in the client. In theory, one could establish a web socket that would perform the updates immediately. Latency then becomes an issue.

High-level abstraction: different UI implementations

If the communication model between R and JS comes to resemble shiny, then it becomes easier to have the widgets implement the shiny API and then have the plugin register different callbacks. This is way simpler than the low-level solution, but it is of course much more limited. We gain the performance of having a client-side R, but we lose the direct manipulation of objects on either side. Of course, the R code will need to be robust to running on any client (there must be sufficient resources, access to the data, etc.).

Statistical Computing in the Web Browser

If we are to implement interactive graphics, we need to compute on data on the client. This requires both a convenient language and a feature-rich library of statistical routines and graphics. R is one of the best examples of an effective language and platform for statistical computing, so we should use it as a baseline and as inspiration. One approach would be to embed the C implementation of R inside the web browser. Although a proven concept, the system dependencies (a plugin) make that undesirable. We could also implement R in JS. Emscripten would facilitate borrowing from the C implementation. But that would present all sorts of complications and the performance would probably be terrible.

A Web-based Language for Statistical Computing

It should be possible to implement an R-like language in JS. The goal here would not be to replicate the functionality of R. Rather, we simply want a language that is suitable for working with data, liberally borrowing ideas from R (and other places).

Important aspects of the R language include:

  • Vectorized
  • Functional (like JS)
  • Dynamically typed (like JS)
  • Object-oriented (like JS)
  • Implicit variable declaration (a la CoffeeScript, but different scoping)

We do not necessarily want to set out to copy all of the features in R. In particular, the functional OOP paradigms of S3 and S4 may conflict too much with the OOP semantics of JavaScript.

The key feature for me is vectorization. That keeps the syntax clear and could potentially lead to better performance (through e.g. WebCL).

The question is whether to come up with a new language that would translate to JS (like CoffeeScript) or simply embed these semantics into JS (similar to how the D3 API is vectorized). Obviously, the former choice would afford a lot more flexibility. I think that “x + y” is much more readable than “x.plus(y)”, and for some vector “i”, “x[i]” is better than “x.extract(i)”. A translator is much more difficult to implement, however. Perhaps we could base this on top of CoffeeScript? Or maybe it is better to just use D3 for now.

A Web-based Platform for Statistical Computing

A statistical computing platform needs to provide extensible functionality for computing on data. This includes:

  • Implementations of statistical algorithms
  • Support for reproducible research (provenance)
  • A modular extension system (packages)
  • Self-describing objects (get this for free with JS)
  • An object representing a dataset (data.frame)
  • Graphics

Notably, the above list does not include any sort of user interface. Various interfaces would be built on top of the core platform.

The two most important components from our perspective are data and graphics. We need some abstraction of a dataset, where the data might be stored on the client, on the server, on the GPU, etc. For graphics, we need a scene graph that is optimized for drawing data (already described above).

One idea is to simply look at what cranvas needs from R and decide how important it is, how hard it would be to implement in JS, and how fast it needs to be (could it be done in R?). We then endeavor to make the important, performance-critical pieces easy to write in JS/D3. And the rest can use the server.

Backend for anypaint

If the drawing happens in R (on the server) then this is just encoding the plot commands in JSON (like Vega and probably others) and then consuming those messages on the client side.

But if we want to draw in the client, then the drawing logic itself is being compiled and sent to the client. The anypaint API would never actually be called from R.

R API for Interactive Graphics

An R API for interactive graphics should look and feel much like an R API for static graphics. The ggplot2 API would be a good starting point, and perhaps we could even express this as an extension of ggplot2.

Interactive plots differ from static plots because they:

  • Respond to user input, i.e., there is a controller,
  • Model the data as a dynamic and mutable “surface”.

Interactive displays combine multiple plots and must share mutable state across plots.

We represent each type of hardware/device event as a control object. Responding to user input (commands) suggests an imperative style of programming, which would complement the declarative style popularized by ggplot2. We allow the user to implement input handling logic by passing an R function to the control object.

Data modeling becomes more complex in the interactive case. The ggplot2 package eagerly coerces data to a data.frame through the fortify generic and treats it as static. Interactive graphics often defer to user input and might only show a portion of the data at once. This means data access should be lazy and the data are dynamic. The data might also be modified in response to user input, so it needs to be mutable.

Sketch: Selection in a scatterplot

Consider the typical scatterplot with a brush that colors the selected points.

Static plot

First, we need a scatterplot. With ggplot2, we would do:

ggplot(mtcars) + geom_point(aes(x = mpg, y = hp))

Note that the plot is given a data.frame directly, so it has its own copy of the data. This also works in the interactive case, as long as we are not sharing the data.

If we wanted to share the data, then we would explicitly create a data model:

m <- data_model(mtcars)
p1 <- ggplot(m) + geom_point(aes(x = mpg, y = hp))
p2 <- ggplot(m) + geom_point(aes(x = mpg, y = wt))

The model object is a pipeline of operations, starting from the data. It looks like an R data.frame, but it is mutable and supports chaining. Hopefully we can use mutaframe for that. When compiling, we do need some way to tell whether two model references are the same. Hopefully md5 hashing works.

Transient brushing

But now we need a way to accept user input. When the user clicks the mouse, we initialize a new selection, and as the user drags the mouse, we update the selection, until the mouse button is released. First, we create a rectangular indication model, then a layer to render it, and finally we add input handlers to update it.

r <- data_rect()
brush <- geom_rect(data=r) +
    control_press(r$reset) +
    control_drag(r$end)

When we create the plot, we configure it to map the selected status to the color of the points, and add the brush control:

p1 <- ggplot(m + stat_overlap(r)) +
    geom_point(aes(x = mpg, y = hp, color = ..overlapped..)) +
    brush

The stat_overlap object asks the data_rect which elements it overlaps and generates a logical vector that is TRUE for the overlapping elements. Note that stat objects in ggplot2 imply a new layer. That would not be desirable here.

Problem
How to control which coordinates are passed to the controls, including device, raw, etc?

Note that data_rect() just creates a model with something like:

model(x0=integer(), x1=integer(), y0=integer(), y1=integer())

and then adds methods for stateful drawing of rectangles. It should probably also provide hit testing.

As brush is commonly used, we could wrap up much of this into a tool:

brush <- tool_transient_brush()

Build system

We might need the following from a build system:

  • Dependency management, including automatic downloading of packages
  • Optimization, including compilation of coffeescript
  • Possibly some sort of continuous integration (reload on save)

For the optimization, it is pretty clear we want to use an AMD-based system, probably RequireJS. There is a module for RequireJS that compiles coffeescript (require-cs).

For package/dependency management there are many options:

  • NPM: node.js package manager, mature and widespread, has github support; but does not support single-file dependencies (like the AMDified coffee-script.js) and is very node.js-centric.
  • bower: from twitter, low-level, only downloads dependencies, which can be any URL, including git repositories; uses a different description format and thus misses dependencies.
  • volo: from RequireJS author, integrates with AMD, supports only github, including specific files, also automates tasks; seems like it is more for creating front-end projects than libraries.
  • jam: based on RequireJS, also includes compilation; restricted to its own package repository.
  • component: similar to volo/bower, except based on the commonjs system instead of the browser-oriented AMD
  • ender: only interface is through the browser

Bower seems attractive. It is unopinionated and just what we need. One issue is that, like component, it introduces a component.json, as an alternative to package.json. This sort of fragments things, but it also separates browser-based assets from those of node.js. This is apparently well accepted, at least by the yeoman and brunch folks.

Build tools:

  • grunt

Seems like grunt is the main leader here.

All-in-one environments:

  • mimosa
  • yeoman
  • brunch

Those are a bit too opinionated… in theory we could create our own yeoman generator for libraries like this one; probably best to do it once by hand just to understand.

Possible bootstrap strategies:

  • simply put all dependencies into git
  • use make to wget the necessary files
  • expect user to have bower installed and call bower install
  • npm update installs bower dependency, hook calls bower

The first seems most common. The last two come down to whether we are primarily an npm package or a bower package. Conceptually, we are a browser package, not a node.js module, so bower makes a bit more sense. If a bower package, then we operate under the expectation that the user has bower installed and that updating bower is not important when updating the application. The whole thing is a bit nuts: distribution package => node.js/npm => bower => our package.

If we drive the build with grunt, then we might want to have a grunt task that uses bower to download the dependencies. https://github.com/yatskevich/grunt-bower-task Note that the user needs to have that task installed (along with bower, grunt, etc). For the convenience of the user, we should have these listed as devDependencies in a package.json. But then we have both a package.json and component.json!! Is that a big deal? Having a package.json does not make something a node.js package; but npm does understand it, which is a plus. The only important redundancy is the version. Needs to be bumped in both places. This makes me lean toward the jam/volo approach. Different package managers are namespaced out within the package.json. Or we could claim that package.json is our build system only. Haha. Ok.

An issue is that Bower just grabs the entire repository. To use it with RequireJS, we need to either

  • copy the files into some common directory, point require to it
  • point require to the individual files in the larger repository

It would seem that the second approach is cleanest, as there is no redundancy. The grunt-bower-hooks package does this, as long as the main files are declared. A problem with bower is that the great majority of packages lack a component.json, so there are no specified “main” files. An approach that is probably as good as any right now is to just list the exact URLs to each file, in place of the version in component.json. After all, there is no point worrying about a version until component.json exists.

Directory layout

There are few conventions in the JS community for directory layout. We need to place the following components:

  • Source code
  • Dependencies
  • Optimized distribution
  • Build tools
  • Tests

Bower puts the dependencies in the ‘components’ directory, and we should follow that convention. The distribution files could land in ‘dist’ and the build tools in ‘tools’. The source code is trickier. Options include ‘lib’ (it is a library, but this is often associated with external deps), ‘src’ (it is source code) or ‘[package-name]’ (like GTK+). It seems elegant to have the main module be named ‘[package-name].js’ and to sit above a ‘[package-name]’ directory, since that module will construct the umbrella library export object. So the main module could be in the root dir, ‘lib’ or ‘src’. Since it is itself source code, let’s put it in ‘src’.

The require path configuration is important for mapping module names to their paths. It is used in two scenarios: compilation and testing. The two configurations are fairly different otherwise. It makes sense to have the require.config() call in a js module that is required from the tests HTML page, and a similar require.config() in a ‘build.js’ under ‘tools’. In theory, these paths could be determined automatically using the bower API. Actually, it has already been done: https://github.com/yeoman/grunt-bower-hooks
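
For reference, a minimal configuration of that shape might look like this (the module names and paths are hypothetical; require.config with baseUrl and paths is real RequireJS API):

// Shared in spirit by tools/build.js and the test page:
require.config({
  baseUrl: "src",
  paths: {
    // modules downloaded by bower into 'components'
    "underscore": "../components/underscore/underscore",
    "backbone":   "../components/backbone/backbone",
    "d3":         "../components/d3/d3"
  }
});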

Alternative: Bokeh

The bokeh library was developed by the Python community to support web-based interactive graphics, defined in Python. It supports server-based Python callbacks, as well as a serverless mode that uses JS callbacks compiled from Python with PyScript.

Perhaps inspired by Pandas, bokeh breaks from the conventions of the D3-based libraries in Javascript. The data are represented as a column-oriented table, not a document. It uses backbone.js to treat the dataset as a conventional data model. There is a scene graph consisting of vectorized glyph objects, which abstract rendering over canvas and WebGL. There is a selection model, and updates report the selected rows.

The library is still immature, and has some gaps and probably some performance deficiencies. Better use of data models would avoid the need for very much custom JS. There is currently very little support for client-side data manipulation. However, it has the right idea.

There already exists an R binding for bokeh called rbokeh, by Ryan Hafen. So far, it is not as extensible as the Python interface, but it could be improved.

What we really need is an abstraction that sits on top of R interfaces to different interactive graphics systems.

Implementation Roadmap

This will serve as our TODO list.

Low-level Drawing (paint)

Play around with Facet, chat with Carlos, can we build on it?

  • State “DONE” from “TODO” [2012-12-14 Fri 05:09]

Looked at it. It seems pretty immature, and does not work on Firefox due to some bug in the typed arrays. That said, it is the closest thing to a 2D canvas based on WebGL.

JS/<canvas> 2D implementation of the anypaint Painter API

WebGL painter implementation

jsperf comparisons between the two

Data Models (plumbing)

data.frame-like model in backbone.js

Geoms

Actions

Plots

GGobi on the web
