GSoC-2014: Data Wrangling, Analysis and Visualization

Hello everyone,

I am not quite sure about the formal procedure (I will hopefully figure it out), but I have an idea for an extension that, in my experience, would be really helpful for many Processing users.

I am a mathematician who is currently continuing his studies with a bachelor's degree in media art. My current focus is data analysis and, most importantly, visualization. Creative, good-looking, interactive visualizations play a key role in understanding complex, high-dimensional data sets. Besides being a great interactive and beautiful form of narration, they are a major tool for extracting meaning from information.

In my daily work I spend a lot of time on retrieving, wrangling and processing data. For this I mostly use Python (with the pandas library), JavaScript and R. If I want to create a good-looking visualization, especially if I want to work on a prototype, I always switch to Processing in the next step. It is the fastest and easiest way to give your data an aesthetic flavor.

A big drawback of this procedure is that there is always a large abstraction gap between wrangling and analyzing the data and visualizing it in the end. This makes feedback loops in your work (generate your visualization, detect errors, switch back to the data...) really annoying and time-consuming. Also, from my point of view, it leaves Processing in the design / arts hemisphere, while it would be a great tool for visualization, especially if typical Processing users like artists and designers were able to work on data-related stuff easily.

So my suggestion for a contribution is a library that works similarly to pandas, or maybe one that just provides a kind of framework / interface for handling data in Processing more easily.

Possible starting points could be:

  • get data the easy way: it should not matter whether your data is a CSV file, a JSON file, a REST web service, or an SQL or NoSQL database. You should just have one main function to get your data into Processing as a data frame.

  • process data the easy way: a data frame gives you an easy-to-use interface for performing operations on your data (building sums, grouping the data, filling NaN entries, creating SQL queries on the data frame...). In the best case it should also provide a set of standard statistics functionality.

  • provide output the easy way: here Processing plays the key role. If your processed data can easily be converted to a custom output format (native Array, List, File ...) which can then be easily attached to standard Processing graphics functionality, a prototyping feedback loop for good visualization would be a lot faster and more effective, and it would include a lot more designers and artists who are not so familiar with data wrangling the hard way...
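
To make the three points above a bit more concrete, here is a minimal sketch in plain Java of what such an interface could look like. Everything in it is hypothetical: the class name `MiniFrame` and the methods `fromCsv`, `fillNaN`, `groupSum` and `column` are made up for illustration and do not exist in Processing or any other library. It only handles naive CSV text (no quoting or escaping) and is meant as a starting point for discussion, not as a proposed implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical minimal data frame; none of these names exist in Processing.
public class MiniFrame {
    private final String[] header;
    private final List<String[]> rows = new ArrayList<>();

    private MiniFrame(String[] header) {
        this.header = header;
    }

    // Step 1: get data. Only naive CSV text is handled (no quoting, no escaping).
    public static MiniFrame fromCsv(String csv) {
        String[] lines = csv.trim().split("\\r?\\n");
        MiniFrame frame = new MiniFrame(lines[0].split(",", -1));
        for (int i = 1; i < lines.length; i++) {
            frame.rows.add(lines[i].split(",", -1));
        }
        return frame;
    }

    private int indexOf(String column) {
        for (int i = 0; i < header.length; i++) {
            if (header[i].trim().equals(column)) return i;
        }
        throw new IllegalArgumentException("Unknown column: " + column);
    }

    // Parse a cell; empty or non-numeric cells become NaN.
    private static float parse(String cell) {
        try {
            return Float.parseFloat(cell.trim());
        } catch (NumberFormatException e) {
            return Float.NaN;
        }
    }

    // Step 2: process. Replace NaN entries in a column with a fixed value.
    public MiniFrame fillNaN(String column, float value) {
        int c = indexOf(column);
        for (String[] row : rows) {
            if (Float.isNaN(parse(row[c]))) row[c] = Float.toString(value);
        }
        return this;
    }

    // Step 2: process. Group by one column and sum another.
    public Map<String, Float> groupSum(String keyColumn, String valueColumn) {
        int k = indexOf(keyColumn), v = indexOf(valueColumn);
        Map<String, Float> sums = new LinkedHashMap<>();
        for (String[] row : rows) {
            sums.merge(row[k].trim(), parse(row[v]), Float::sum);
        }
        return sums;
    }

    // Step 3: output. A native float[] that a sketch could pass on to drawing calls.
    public float[] column(String name) {
        int c = indexOf(name);
        float[] out = new float[rows.size()];
        for (int i = 0; i < out.length; i++) out[i] = parse(rows.get(i)[c]);
        return out;
    }

    public static void main(String[] args) {
        MiniFrame frame = MiniFrame.fromCsv("city,visitors\nBerlin,10\nParis,\nBerlin,5\n");
        frame.fillNaN("visitors", 0f);
        System.out.println(frame.groupSum("city", "visitors"));
        System.out.println(frame.column("visitors").length);
    }
}
```

In a sketch, the `float[]` returned by `column` could be looped over inside `draw()` to set bar heights or circle sizes. A real library would of course need proper CSV parsing, typed columns and more data sources than this toy version.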

What do you think about such an extension?

If you don't really know what I have in mind yet, you should have a look at the Pandas library for Python, which does exactly this. But as Python is not as strong for visualization as Processing, a merge of both benefits would be total awesomeness! :-)

Cheers Jonas

Edit: You can find the proposal here: http://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/jonaskoehler/5629499534213120

Answers

  • Answer ✓

    Dunno whether it's possible, but there's a Processing/Python version around.
    Who knows, maybe you could use that Panda library there? :-/

  • @GoToLoop: Thanks for your feedback. As Python is as flexible with libraries as Java is, that would not be a problem. But I am not sure your suggestion addresses my point.

    I still don't see why Processing, as a rapid-prototyping platform, should not support rapid prototyping for data visualizations.

    Also, the barrier for designers / artists would be a lot lower if they could use Processing directly, as they have been using it before.

    As a professional data scientist / visualizer you probably won't start doing things with Processing. But for information designers, data artists or engineers experimenting with interaction-based data exploration it would be a great tool.

    I see a demand for it, and sending a bigger group of experienced Java-based Processing users to the Python-based Processing library will only split up the user groups and lead to duplicated development effort...

  • Answer ✓

    Although it seems that Panda might be a better specific tool, I would love to see something that works with or is compatible with D3. I am not advocating that this is the best tool; however, with the recent inclusion of Processing.js in Processing 2+ as well as the huge community surrounding D3, I believe it might be a good candidate for a port.

    Is there a good Java library that has this type of functionality? The fastest solution would be to wrap and/or extend an existing Java library for p5 rather than looking to port or transcode existing solutions in other languages.

  • @manofstone: Pandas (don't forget the s ;-) ) is a very specific tool, and it is really powerful for huge calculations. You are right, it would be pointless to reimplement its full functionality for Processing. I just used it as an example of a really good, easy-to-use data handling tool. A lot of the stuff related to professional statistics / analytics could be left out.

    I am also thinking more along the lines of the D3-ish way of handling things. It is probably a better example for my suggestion than Pandas.

    In D3 we have something like a DOM and a predefined document structure. I don't really see anything similar in Processing. Is there something like a DOM structure for p5? If so, this would be an interesting approach.

    On the other hand, plain rendering would also be interesting, because I think this is the pure strength of Processing: easy access to good-looking visualizations combined with easy ways to provide interaction.

  • Answer ✓

    My apologies to Pandas. ;)

    So, I did a little searching and there are DOM implementations for Java. In my understanding, from a graphics perspective, the DOM references in D3 are a way of handling graphics and interaction natively in the browser. This is as opposed to using the HTML5 SVG or Canvas objects. I believe D3 also supports SVG, so it would be a better model to follow, since it is already similar to the graphics calls that Processing supports.

    I think the ideal would be a library that also has a counterpart in the JavaScript world, like the shims/wrappers Shiffman wrote to use his pBox2d/jBox2d physics calls with the box2dweb JavaScript library. This is an idealized solution, but the benefits are reaching a larger audience and being able to put HTML5-compliant sketches online, to be used as examples, demos and other projects. This also allows you to run the sketches on iOS and other devices that do not support Java applets but do support HTML5.

  • After some research I am currently leaning towards working on a wrapper / interface which allows the use of CPython code (NumPy, Pandas) in Processing.

    Besides the Jython interface for CPython (http://jyni.org/), which is currently in alpha, there is also the possibility of embedding those libraries via execnet (http://codespeak.net/execnet/index.html).

    If we can use NumPy and Pandas directly in Java through an easy-to-use interface for Processing, there would be no need for a data wrangling library like the one I suggested.
