7/ Data Visualisation
Most people are not used to interpreting charts and graphs, which poses an unsolvable problem as mass media publishes increasingly sophisticated data visualisations.
Graph designers have a great deal of control over the message a graphic conveys and the way the reader feels about it. Clever positioning of variables can create an illusion of correlation where there is none, or gaps between variables can be made disproportionately large. A careless or malicious designer can ruin even the best analysis.
In architecture the term “duck” describes a building where form is put ahead of function (the term was inspired by the “Big Duck”, a duck-shaped building). In a similar fashion, data visualisation must be about the data first and foremost, with the decoration being secondary. Ducks used to be native to the mass media, but recently they have started creeping into the scientific literature.
“Ducks” are like clickbait: not necessarily harmful, they just capture attention for a few seconds. Looking cute becomes more important than the underlying data.
Content marketers are seemingly obsessed with periodic tables of everything. Unlike the chemical periodic table of Dmitry Mendeleev, which has an enormous theoretical basis behind it, the modern “periodic tables of everything” are purely form over function: they create an illusion of classification and structure but are in fact just loosely assembled pieces of information.
Subway maps are masterpieces of visualisation of the information that matters most to the commuter. Graphic designers are obsessed with them, too, creating their own “subway maps of everything”. While the sequential structure of each “line” can make sense for describing progress over time, the physical positioning of the lines (which on a real map encodes geography) carries no meaning here, and that’s where the concept falls apart.
Venn diagrams are often abused as an ornamental backdrop for numbers and words, but the actual intersections of the areas don’t mean anything at best and are misleading at worst. Again, it’s form over function.
Using a picture of an object whose parts are labelled with text is questionable, too, unless there is a clear logical link between each part of the object and its label.
In charts, always look at the axis labels and the scale (especially when the axis doesn’t start at 0). Bar lengths can be made disproportionately long to emphasise a gap on an emotional level. It’s also a huge red flag when an axis starts at a negative number for values that can only be positive.
Even if the Y axis does start at 0, a chart can still deceive when the variation in the variable is small relative to the scale (e.g., a change from 57 to 59 plotted over time on an axis with a range of 0–100 is simply not visible). Put kindly, the “graphical display choices are inconsistent with the story”.
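To make the two failure modes above concrete, here’s a minimal matplotlib sketch of my own (not from the book) that plots the same 57-to-59 change on a cropped axis and on a 0–100 axis:

```python
# Minimal matplotlib sketch: the same small change (57 -> 59) shown on a
# tightly cropped Y axis vs. a full 0-100 axis. Illustration only.
import matplotlib.pyplot as plt

years = [2018, 2019, 2020, 2021]
values = [57.0, 57.5, 58.2, 59.0]

fig, (ax_zoomed, ax_full) = plt.subplots(1, 2, figsize=(8, 3))

# Cropped scale: the 2-point shift looks like a dramatic surge.
ax_zoomed.plot(years, values, marker="o")
ax_zoomed.set_ylim(56.5, 59.5)
ax_zoomed.set_title("Y range 56.5-59.5: looks dramatic")

# Full 0-100 scale: the same shift is barely visible at all.
ax_full.plot(years, values, marker="o")
ax_full.set_ylim(0, 100)
ax_full.set_title("Y range 0-100: barely visible")

plt.tight_layout()
plt.show()
```

Neither panel is automatically “honest”; the right choice depends on whether a two-point change is substantively meaningful for the story being told.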
Displaying two variables on two Y axes (left and right) is likely to tell a misleading story if both axes don’t start at 0. (With two independent scales it’s easy to flatten or steepen either slope.)
Uneven and varying scales on the X axis are a sign of data massaging.
“Binning” (averaging the data within a range) can disguise variance in the data and create an illusion of uniformity.
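A small numpy sketch (my own illustration, not from the book) shows how much spread binning can hide:

```python
# Binning demo: averaging noisy values within bins makes the data look far
# more uniform than it really is. Numbers here are synthetic.
import numpy as np

rng = np.random.default_rng(seed=0)
raw = rng.normal(loc=100, scale=20, size=1000)  # 1000 noisy measurements

# Average consecutive groups of 100 values ("bins" of 100 points each).
binned = raw.reshape(10, 100).mean(axis=1)

print(f"std of raw data:    {raw.std():.1f}")    # about 20: the real spread
print(f"std of binned data: {binned.std():.1f}") # about 2: looks almost flat
```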
The principle of proportional ink means that the amount of ink used to represent a number must be directly proportional to the number itself. E.g., if the Y axis starts at 50 and there are two bars on it, one at 100 and another at 150, the second bar appears twice as tall as the first (instead of 50% taller). This has far-reaching consequences.
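A quick arithmetic check of the example above (the 50/100/150 numbers are taken from the note):

```python
# Proportional-ink check: how tall the bars *look* when the axis starts at 50
# versus how large the underlying numbers actually are.
baseline = 50          # where the truncated Y axis starts
bar_a, bar_b = 100, 150

true_ratio = bar_b / bar_a                             # 1.5 -> B is 50% taller
drawn_ratio = (bar_b - baseline) / (bar_a - baseline)  # 2.0 -> B looks twice as tall

print(f"true ratio of values: {true_ratio:.2f}")
print(f"ratio of drawn bars:  {drawn_ratio:.2f}")
```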
Bar charts are designed to reflect magnitudes, while line graphs tell stories about changes (hence their axes don’t necessarily need to start at 0, as long as the area under the line isn’t shaded).
3D bar charts are useful for displaying two independent variables, but not one (a 2D chart is better suited for that). There’s no need to impress the viewer unnecessarily, and the use of perspective makes it harder to assess the relative sizes of the chart elements.
“It’s all right to decorate construction, but never to construct decoration” © Robert Venturi.
8/ Calling Bullshit on Big Data
The key challenge in machine learning is generalisation: the ability to recognise patterns in data the model has never seen before. Overfitting is when noise is treated as signal, making the model unreliable.
Complicated models fit the training data well, but simpler models often perform better on test data. The trick is finding a simpler model that still doesn’t ignore useful information.
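A minimal scikit-learn sketch (my own example, not from the book) of that trade-off: a high-degree polynomial drives the training error down, but a plain linear model typically does better on held-out data.

```python
# Overfitting demo on synthetic data: compare a degree-1 and a degree-15
# polynomial fit. The flexible model usually wins on training error and
# loses on test error. Illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(seed=1)
x = rng.uniform(0, 10, size=60).reshape(-1, 1)
y = 2 * x.ravel() + 5 + rng.normal(scale=3, size=60)  # linear signal + noise

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.5, random_state=1
)

for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(x_train))
    test_mse = mean_squared_error(y_test, model.predict(x_test))
    print(f"degree {degree:>2}: train MSE {train_mse:.1f}, test MSE {test_mse:.1f}")
```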
No algorithm can overcome flawed training data. The process of obtaining and cleaning training data is long, expensive, and still performed by humans, who bring their own biases. Adding more variables isn’t always a good idea either, because the model needs to be retrained with more data, which is expensive to collect and clean; in medical research there may simply not be enough patients with a certain combination of symptoms or genes. This is the “curse of dimensionality”.
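A back-of-the-envelope sketch of why this bites: if each added variable is split into just 10 levels, the number of combinations the training data would need to cover grows exponentially.

```python
# Curse of dimensionality, roughly: combinations to cover grow as 10^n
# when every variable has 10 possible levels. Illustration only.
levels_per_variable = 10
for n_variables in (1, 3, 5, 10):
    combinations = levels_per_variable ** n_variables
    print(f"{n_variables:>2} variables -> {combinations:,} combinations to cover")
```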
And then there’s fake news: a claim that can reasonably be labelled “fake” in 2020 may turn out to be perfectly correct in 2022. Training sets need to evolve, too.
Sometimes algorithms pick up on external cues when analysing objects (environment, layout, etc.), which makes them unusable in other conditions. (E.g., an animal gets classified as a wolf because the background is full of snow, simply because the training data contained lots of images of wolves in the snow.)
The book also touches on algorithmic accountability (the user of the algorithm is responsible for its outputs) and algorithmic transparency (a person needs to know what factors went into a decision about them). But even then there’s the issue of interpretability: it can be hard or even impossible to understand how exactly a decision is made.
Data doesn’t have to be directly biased for bias to be picked up by an algorithm: it’s possible to infer someone’s race or gender with high certainty from the university they attended, the societies they are a member of, or a hobby with a skewed demographic representation.
Big data is not better, it’s just bigger.
Part 5.