The Art of Data Science

I want to think about data science as an art form: the art of telling stories about data. These are stories decision-makers can understand and use to make good decisions. These stories highlight what’s most important about the data and what it means.

Thinking about ordinary stories, what makes a good story? Here are some characteristics I think a good story needs to satisfy:

  • It resembles the real world situation being described in the ways that matter. It can be completely different from the real world in ways that don’t matter.
  • It is told in an “artistic” way that provides insight into the situation.
  • It leads the listener to appreciate the message of the storyteller.

All these things are also true of the stories told by data scientists. So this leads me to a question: What can we do to help data scientists make good art?

Specifically, the goal is to help data scientists craft, structure, and interpret stories about data. To do this, we need a language that can express everything that is needed in stories, contains the right constructs to tell stories artistically, and is easy to interpret to draw the right message. We also need tools for constructing stories that are collaborative, support exploration, and enable the scientist to quickly evaluate whether a story is effective.

I claim that probabilistic programming can potentially provide the right sort of language constructs for telling stories about data. Probabilistic programming is expressive and the models are interpretable. Most important for storytelling, the models are what we call generative. This means that the models describe how the data came to be, not just what’s in it.

I have two daughters in second grade. They just did a unit on the Wampanoags, who are a Native American group in Massachusetts. The Wampanoags have an art form called Pourquoi tales, which are stories about why things in the natural world are the way they are. My children wrote Pourquoi tales like “Why the squirrel collects acorns for winter.” The thing about Pourquoi tales is that the explanation for why something is the way it is always takes the form of a story of how that thing came to be. So the squirrel collects acorns for winter because one winter she was hungry and had a hard time finding food, so the chipmunk convinced her that it would be better to collect food in advance. Generative models are like Pourquoi tales. They turn a why into a how.

The other thing about probabilistic programming is that you can quickly use a program to generate examples of data that would be produced by the program. You can then evaluate whether the program is telling the right story, i.e., whether it describes the real data. It doesn’t have to resemble the real data in all aspects, but it should correspond in the ways that matter to the story.
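
To make this generate-and-check loop concrete, here is a minimal sketch in plain Python (not Figaro). The toy program, its probabilities, and the observed value are all assumptions made up for illustration; the point is only the shape of the loop: run the program many times, then ask whether the real observation looks like something the program would produce.

```python
import random
import statistics

def generate_story(rng):
    """One run of a toy generative program: how a count of shapes comes to be."""
    # Assumption: 12 candidate slots, each filled with probability 0.75.
    return sum(1 for _ in range(12) if rng.random() < 0.75)

def simulate(n_runs, seed=0):
    """Run the program n_runs times with a seeded generator for repeatability."""
    rng = random.Random(seed)
    return [generate_story(rng) for _ in range(n_runs)]

def tail_fraction(samples, observed):
    """Fraction of simulated values at least as far from the sample mean as `observed`."""
    mean = statistics.fmean(samples)
    gap = abs(observed - mean)
    return sum(1 for s in samples if abs(s - mean) >= gap) / len(samples)

samples = simulate(2000)
observed_count = 9  # hypothetical observation, used only for illustration
print(statistics.fmean(samples), tail_fraction(samples, observed_count))
```

A tail fraction near zero would suggest the program is telling the wrong story about this aspect of the data. A larger value only says the story is not contradicted on this one summary, which is exactly the "ways that matter" judgment the modeler has to make.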

So how do you help data scientists do “art”? We decided to start studying this question by observing ourselves trying to tell stories about data. And the data we chose happened to be art – specifically, the art of Wassily Kandinsky, a Russian artist who was active in the first half of the twentieth century. Kandinsky was one of the pioneers of abstract art, where there is no attempt to depict any real-world objects in any manner whatsoever. We chose Kandinsky for several reasons. Most importantly, because his art is abstract, we don’t need any knowledge of real-world objects to model it; it is closer to pure “data” than representational art. Kandinsky also wrote a book called “Point and Line to Plane” that outlines his theory of art and gives us some tools to think about this problem. In addition, Kandinsky is my favorite artist, so I was a little biased in the selection of subject.

So, data science as art, and art as data… This is very much work in progress. We’re far from creating good models of Kandinsky, but we’ve made a start and we’re working on it.

Our modeling approach is to first create formal representations of examples of the data (in this case, particular artworks), then to create a generative model that captures these formal representations, and finally to generate data examples and evaluate and refine the model by comparing them to actual data.

Here’s an example of data for which we created a formal representation. It’s Diagram 14 from “Point and Line to Plane”. The diagram comprises several “planes”, or complexes of shapes that are generated and organized in a particular manner.

Kandinsky Diagram Number 14

I’m not going to go through all aspects of the representation in full detail. Also, there’s a lot of interpretation and subjectivity in understanding a diagram like this and representing its organizing principles. At a high level, these are some of the things we came up with:

  • At the top, there is a horizontal plane consisting of nine black or white arcs, all with the same orientation and similar curvature. The number nine doesn’t seem to be anything special; we could say the number is drawn from a distribution. The arcs also look like they’re uniformly distributed horizontally and vertically over the top quarter of the figure. There is no overlap between the arcs, although a few of them are touching.
  • Just below that, there’s another plane of four tilted black arcs. Again, there’s no overlap but there’s a similar touching formation to the first arc plane.
  • There are three circles in the lower right of the diagram with varying radii that are drawn from some distribution. The line thickness appears to depend on the radius.
  • In the lower half of the figure, left and center, is a rather hard-to-model plane. There are two black triangles that are near images of each other. Each triangle is intersected by a number of lines, but the number and nature of the lines differ. Both triangles have regions of white where they intersect the lines. For the left triangle, the white region is where they intersect, while for the right triangle, the white region is the entire area where the lines intersect the triangle, except for the lines themselves.
  • The striped archway appears to be a singular element. It is slightly related to the arcs but, unlike them, has a constant thickness and describes a semicircle. And there is no other object in the diagram with this texture.
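
Two of these observations translate almost directly into a generative sketch. The following is plain Python rather than Figaro, and every number in it (the 6 to 12 range for the arc count, the radius range, the proportional rule for line thickness) is an assumption chosen for illustration, not something measured from the diagram.

```python
import random
from dataclasses import dataclass

@dataclass
class Arc:
    x: float      # horizontal position, 0..1 across the canvas
    y: float      # vertical position, 0 (top) .. 1 (bottom)
    color: str    # "black" or "white"

@dataclass
class Circle:
    x: float
    y: float
    radius: float
    line_width: float

def sample_arc_plane(rng):
    """Top plane: a random number of arcs spread over the top quarter."""
    # Assumed prior on the count, so that nine arcs is nothing special.
    count = rng.randint(6, 12)
    return [
        Arc(x=rng.uniform(0.0, 1.0),
            y=rng.uniform(0.0, 0.25),
            color=rng.choice(["black", "white"]))
        for _ in range(count)
    ]

def sample_circles(rng, count=3):
    """Lower-right circles whose line thickness depends on the radius."""
    circles = []
    for _ in range(count):
        radius = rng.uniform(0.03, 0.12)  # assumed radius distribution
        circles.append(Circle(
            x=rng.uniform(0.5, 1.0),
            y=rng.uniform(0.5, 1.0),
            radius=radius,
            line_width=0.1 * radius,      # assumed: thickness proportional to radius
        ))
    return circles

rng = random.Random(14)
arcs = sample_arc_plane(rng)
circles = sample_circles(rng)
print(len(arcs), [round(c.radius, 3) for c in circles])
```

This sketch deliberately ignores the observation that the arcs touch but never overlap; capturing that would take a constraint on the joint layout rather than independent draws.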

We studied a number of diagrams like this and developed our own generative model in Figaro for these kinds of diagrams. Our model recursively decomposed the diagram into planes and subplanes that included a variety of shapes. The choices of which shapes to include as well as the number and properties of those shapes depended on the properties of the plane that contained them. For example, a horizontal plane is more likely to contain rectangles than a diagonal plane.
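
The recursive structure can be sketched in a few lines. This is a plain-Python illustration of the idea, not our Figaro model: the shape vocabulary, the weights, and the limits on depth and counts are all hypothetical.

```python
import random

# Assumed shape weights per plane orientation; the numbers are illustrative only.
SHAPE_WEIGHTS = {
    "horizontal": {"rectangle": 5, "arc": 3, "circle": 2},
    "diagonal": {"rectangle": 1, "arc": 4, "circle": 2, "triangle": 3},
}

def sample_plane(rng, orientation="horizontal", depth=0, max_depth=2):
    """Sample a plane: its shapes depend on its orientation, and it may hold subplanes."""
    weights = SHAPE_WEIGHTS[orientation]
    n_shapes = rng.randint(1, 4)
    shapes = rng.choices(list(weights), weights=list(weights.values()), k=n_shapes)
    subplanes = []
    if depth < max_depth:
        for _ in range(rng.randint(0, 2)):
            child_orientation = rng.choice(list(SHAPE_WEIGHTS))
            subplanes.append(sample_plane(rng, child_orientation, depth + 1, max_depth))
    return {"orientation": orientation, "shapes": shapes, "subplanes": subplanes}

def count_shapes(plane):
    """Total number of shapes in a plane and all of its subplanes."""
    return len(plane["shapes"]) + sum(count_shapes(p) for p in plane["subplanes"])

rng = random.Random(0)
diagram = sample_plane(rng)
print(diagram["orientation"], count_shapes(diagram))
```

With these weights a horizontal plane picks a rectangle half the time and a diagonal plane only one time in ten, which is one simple way to encode "more likely to contain rectangles".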

Here are a few examples of images generated by our model. First, a two-plane image with one dominating shape:

Single Shape

Here’s a multi-plane image with several shapes:

Multishape

Finally, here is a more complex image:

Complex Image

I think it’s clear that although the images have the right sorts of things in them, we haven’t yet captured the essence of the Kandinsky diagrams. One possible reason is that our models don’t yet include any constraints; for example, a constraint that the objects should fill up the canvas in some way that’s logical. It will require more investigation to formalize what those constraints might be.
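
As a rough illustration of what such a constraint could look like, here is a plain-Python sketch that conditions a layout on a made-up "fills the canvas" rule using rejection sampling. The rule itself (every quadrant of the canvas must contain at least one shape center) is purely hypothetical; it stands in for whatever the real constraints turn out to be.

```python
import random

def sample_centers(rng, n_shapes=6):
    """Unconstrained layout: shape centers placed uniformly on the unit canvas."""
    return [(rng.random(), rng.random()) for _ in range(n_shapes)]

def fills_canvas(centers):
    """Hypothetical constraint: each quadrant of the canvas holds at least one center."""
    quadrants = {(x >= 0.5, y >= 0.5) for x, y in centers}
    return len(quadrants) == 4

def sample_with_constraint(rng, max_tries=1000):
    """Rejection sampling: keep drawing layouts until one satisfies the constraint."""
    for _ in range(max_tries):
        centers = sample_centers(rng)
        if fills_canvas(centers):
            return centers
    raise RuntimeError("no layout satisfied the constraint")

rng = random.Random(0)
layout = sample_with_constraint(rng)
print(layout)
```

Rejection sampling is the simplest way to impose a hard constraint on a generative program, but it becomes wasteful as constraints get tighter, which is one reason to lean on a probabilistic programming system's inference algorithms instead.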

The main point of our study is to see what kinds of models we want to express and try to identify languages and tools that would support that. Right now, we’re thinking about what our ideal language for expressing these models would be. Although it’s possible to express these models in Figaro, Figaro is a more general-purpose language and not designed explicitly for telling stories like this. Any specialized language we develop would compile into Figaro, which will provide all the reasoning capabilities. Another thing we’re looking at is collaborative model development tools for telling stories.

A final thought: I’ve talked about helping data scientists do art. Can we also help artists do data science? Artists, through intuition and training, have an eye for seeing things in an insightful way that might not be immediately apparent to non-artists. This might be a very useful attribute for someone who’s trying to make sense of data and tell stories about it.

3 Comments

  1. Interesting article, Avi. I admit, I was a bit intimidated when I began, but you did a great job of explaining the goal and the process in ways that I could understand. I look forward to reading more!

  2. I think I get the general idea: Analyze the pictorial data for patterns or “formal representations”; use these representations as algorithms in a model that generates pictorial data; and finally perform an iterative tuning of the model until it starts outputting pictures that look Kandinsky-esque. My early comments are …
    (1) Although I’m an architect and photographer with considerable computer graphic experience, I have no idea what Figaro is, or how it works.
    (2) Your second generation output looks pretty Kandinsky-esque to me. I am not disturbed by the “empty space” in the frame; if there is a “problem”, it may be that the shapes overlap rather than touch at a point, a characteristic of the original you chose to emphasize.
    (3) You are probably familiar with Edward Tufte’s “The Visual Display of Quantitative Information”, which is a pioneering work in the use of pictures to transform data into good stories. This isn’t quite what you’re up to, but there’s some congruence.
    I look forward to hearing more about this. Thanks, RPD

  3. Great essay, Avi! I like the idea of explaining the data by telling a story. That’s for sure more pleasant and less pedantic than building a “model”. A model is a story that you express with equations. Besides mathematical skills, it requires knowledge of patterns, perspective, balance and sense of beauty.
    As the developer of Lea, a PP tool in Python, I find some echo of your ideas in some examples I provided. You can have a look at my “Murder case”, a whodunit story using Bayesian reasoning (http://bitbucket.org/piedenis/lea/wiki/Examples#markdown-header-a-murder-case). Also, I like Kandinsky. For your research, he’s probably an easier starting point than Jackson Pollock! I think that what you’re looking for is Kandinsky’s grammar. On a more sarcastic note, you could have a look at my bullshit generator (http://bitbucket.org/piedenis/lea/wiki/Examples#markdown-header-bullshit-generator). The idea is a bit similar to yours; just replace Kandinsky with any modern talkative software guru.
    Congratulations on this blog initiative. I’m sure it can help PP to be more accessible.
    Pierre
    PS: Sorry for modern software gurus; I do like them actually, except when they are talkative.
