Why is data dangerous?

In the words of @RorySutherland: “The data made me do it” is the 21st Century equivalent of “I was only obeying orders”. The growing power and influence of Data Science touches everyone’s lives. Sutherland also remarks: “Markets are complex and there can be more than one right answer. People in business prefer the pretence of ‘definitive’ because if you can show you’ve done the ‘only right thing’ you have covered yourself in event of failure”. These are all attempts at Plausible Deniability, and they are weak.

For the record, plain old data is not dangerous. You are unlikely to be hit by an errant Spearman’s rho, or by a rogue control variable that has detached itself from an analysis. Data is just a record of the measurable values of something that has happened in the past. Digital exhaust, if you will. As with speed in a car, it is the inappropriate use of it that causes issues.


Doing the right thing often sees people becoming enslaved to Type 1 and Type 2 data, because they are the easy parts. You can hire experts who can count well, use the software and understand how to tease out knowledge from the data points. What the majority can’t manage, or may even manipulate intentionally, is the presentation, context and language used when presenting their findings. This is the Type 3 data I talk about, which isn’t traditional data as we know it.

Type 3 data is the really dangerous stuff. The reason for this is our complete fallibility as human beings. This is nothing to be ashamed of; it is how we are made and conditioned. It is, in fact, entirely, boringly and ordinarily normal. I was recently told by a lawyer – I say this because she is pretty well-educated – that all statistics are a lie. She then cited the famous Mark Twain saying (nicked from Disraeli), “There are lies, damn lies and statistics”, as if this were all the proof she required. Interestingly, when I challenged her on this and made a case for accurate uses of statistics, she refused even to acknowledge it. She was wedded to her belief and I must be wrong. Case closed.


I think immersion in courtroom rhetoric may have been getting the better of her. However, this goes to show just how dangerous we humans can be. Imagine being a client with a lawyer whose dogmatism may cause them to overlook, or be unable to question, relevant statistical evidence, all stemming from a strongly held view that all statistics are lies. Professor Bobby Duffy recently wrote an excellent book called The Perils of Perception, and on p.100 he shows just how problematic this view can be.

My point is: if a person who is well-educated, and practising in a profession like law, can hold such a position, then it is not beyond any of us to do so, quite unwittingly. That remains true until one is more familiar with the behavioural biases that we are all susceptible to, with the way Type 1 and Type 2 data can be misrepresented (Type 3 data), and with how that misrepresentation uses our in-built foibles to generate a reaction.

This is where someone who understands both of these areas, and can blend that knowledge into an expertise which is useful, can help you. When important decisions on strategy, direction and spending are conditional on interpreting data from others, you want to get it right first time. If not, you’ll be forced into “The data made me do it”, and that rarely ends well.


Type 3 data in action. The Guardian is at it again.

The purpose of this blog is to get behind the data stories we encounter. Understandably, most commercial data is sensitive and remains unpublished. This means I have to rely on publicly available mangling of the data to illustrate the points.

The article of 11th October 2018 carries the snappy title, “Profits slide at big six energy firms as 1.4m customers switch”. (The 3 types of data are explained here.)

I will stick to the problems with the data and not make this a critique of the article for its weaknesses alone. That would just be churlish. Read the following and imagine being presented with a document like this and having to critique its worth as something to base your decision-making on.

This article illustrates Type 3 data so very well! It appears that the journalist has started with an idea and then worked backwards, mangling what Type 1 data they have to fit the idea they want to transmit to the reader. To be clear: this post is not written as an opinion piece about the Guardian, but as a critique of an article purporting to use Type 1 data to support the ‘Sliding Profits’ hypothesis.

Before we go any further, the Golden Rule of data has been broken. You simply mustn’t decide the answer and then try to manipulate, mangle and torture the data to fit your conclusion. You must be led by the data, not the other way round. It is fine to start with a hypothesis and then test it against the data to see whether it holds. It is a major credibility red flag when the conclusion is simply the answer that was assumed at the outset.


If this is meant to be a business article, it is rather worrying that the journalist obviously doesn’t know the difference between profit margins and profit¹. These are two distinctly different ideas, yet they are used interchangeably in the piece. Red flag number two (if the first wasn’t enough). Paragraph five manages to combine the margins of two companies with the profits of another and then plugs in (excuse the pun) an apparently random reference to a merger and the Competition Commission.
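To see why profit and profit margin cannot be swapped for one another, here is a minimal sketch in Python. The figures are invented for illustration and are not taken from the Guardian article or from Ofgem; the `Results` class and its field names are my own assumptions. It shows a hypothetical supplier whose profit falls while its margin rises, so a headline about ‘sliding profits’ and one about ‘sliding margins’ could be describing opposite things.

```python
from dataclasses import dataclass


@dataclass
class Results:
    """One year of results for a hypothetical energy supplier (figures in £m)."""
    revenue: float
    costs: float

    @property
    def profit(self) -> float:
        # Profit is an absolute amount of money: revenue minus costs.
        return self.revenue - self.costs

    @property
    def margin(self) -> float:
        # Profit margin is a ratio: profit as a share of revenue.
        return self.profit / self.revenue


# Invented figures, for illustration only.
last_year = Results(revenue=1000.0, costs=950.0)
this_year = Results(revenue=700.0, costs=658.0)

for label, r in (("last year", last_year), ("this year", this_year)):
    print(f"{label}: profit £{r.profit:.0f}m, margin {r.margin:.1%}")
```

In this made-up case the supplier has lost customers and revenue, so the pile of money it keeps is smaller, yet it keeps a larger share of each pound it takes in. Which of the two numbers matters depends on the question being asked, which is exactly why an article should say which one it means.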

Terms like the ‘Big Six’ are used but nowhere does the author bother to say who the Big Six are. Whilst it is a moderately common term it cannot be assumed that everyone knows who they are. This is sloppy reportage and another Red Flag for the reader. Sloppy here, sloppy elsewhere. Who knows? This is back to the Type 3 issue of how it is presented to you. In this case, so far, very poorly.

The energy market regulator, Ofgem, is cited as the source for the first graphic. The Y (vertical) axis is numbered with no units or qualification, and neither the date nor the document that the data is taken from is mentioned. Type 1 data being mangled by the Type 3 data. Overall – poor sourcing and not worth the bother. You can dismiss graphics like this, as you can reasonably assume they are a form of visual semiotic designed to elicit a feeling and not to communicate any reliable Type 1 data to you. (Note that profits and profit margins are even conflated in the graphic title!)

Poor graphic designed to mislead – taken from the Guardian article.

The final critique is the one that speaks to the concept of Type 3 data. The language used is a blatant attempt to skew the article away from reportage about how the entry of challengers into the marketplace is affecting the profits, and profit margins, of the established players. I think the subsidiary point is that consumers aren’t switching suppliers as much as expected. I had to read the article several times to distil those as the most likely objectives of the piece.

Finally, if you re-read the article and just look at the tone and, more specifically, the adjectives used, you’ll be surprised. What I can’t work out is the author’s agenda. To just report such a muddle of data is one thing, but most of the popular press has an agenda of some kind.

NB: I really hope the Guardian doesn’t just keep gifting such poorly written articles. I think I may look at the coconut oil debate next!


What is Type 3 data and why is it so important?

A simple enough sounding question, though something that is quite contested. I propose that we need to look at three distinct subsets of the concept of data. You’ll see in a moment why this article isn’t a technical explanation of data in stats. For that (and it is necessary) this is a super post that explains them.

This article is intended as a guide to help you categorise the data that is presented to you in the course of a day.

Type 1 – This is ‘just’ the hard numbers.

By this I mean just what you imagine. The figures that get plugged into SPSS, Stata, R, SAS and the like. How these are analysed determines the output. It is necessary – and can be mind-numbingly boring, I know this as I’ve had to do it many times! – to check how any of the variables may have been re-coded, re-weighted and then analysed in the data-management components (.do files, syntax files etc) of the popular stats packages. [Why isn’t Excel listed? I asked my ex-supervisor and a Professor who specialises in this stuff. He politely guffawed and told me that it isn’t a ‘proper’ statistical analysis program. Once the heavy lifting has been done it may be exported to Excel as that is what the majority of people are used to seeing.]


Type 2 – This type of data is the so-called softer numbers.

The first type of data is useful for analysing the patterns of turnout for an election, the way different materials on an aircraft fatigue, how people move through a supermarket and so on. Type 1 relies on quantifiable and easily measurable variables, converted into numerical values for analysis: one step right, a right turn, then two steps at a 40 degree angle over a nine second period, and so on.

Type 2 data is an attempt to record and analyse human emotions, behaviour, and sometimes capture the strength of intent to do or not do something. We have all been asked things like, “How did that make you feel? Please rate your reply from Very Unhappy, Unhappy, Neutral, Happy to Very Happy?” This is the classic Likert scale.

Stop though. Have you considered whether Semantic Differential Scales were used instead? Perhaps a mixture of the two, or two different data sets derived using different assessment methodologies? These too can be plugged into the stats programs and analysed. The trickier thing here is the subjectivity element. Is my Very Unhappy the equivalent of your Very Unhappy? This effect is mitigated by large-scale testing, as averaging over many respondents produces a happy medium in which individual outliers carry little weight. Hence, be very wary when a small sample size is used to generate an indication of feeling or intent.
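The small-sample warning can be made concrete with a short Python sketch. The coding of the scale as 1 to 5 is a common convention that I am assuming here, and the function names are hypothetical. It measures how far a single Very Unhappy answer drags the average of a group of otherwise Neutral respondents, for groups of different sizes.

```python
from statistics import mean

# Assumed numeric coding of the five-point scale; the numbers are a convention.
LIKERT = {
    "Very Unhappy": 1,
    "Unhappy": 2,
    "Neutral": 3,
    "Happy": 4,
    "Very Happy": 5,
}


def mean_score(answers):
    """Average the numeric codes for a list of Likert answers."""
    return mean(LIKERT[a] for a in answers)


def shift_from_one_extreme_answer(n_neutral):
    """How far one 'Very Unhappy' moves the mean of n_neutral 'Neutral' answers."""
    before = mean_score(["Neutral"] * n_neutral)
    after = mean_score(["Neutral"] * n_neutral + ["Very Unhappy"])
    return before - after


# Compare a tiny survey with progressively larger ones.
for n in (5, 50, 500):
    print(n, round(shift_from_one_extreme_answer(n), 3))
```

The shift works out as 2 / (n + 1), so one disgruntled respondent moves a five-person survey by a third of a scale point but barely registers among five hundred. This is a deliberately simplified picture: it treats Likert codes as evenly spaced numbers, which is itself a contestable choice.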


Type 3 – And this is where it gets hazy and interesting!

Type 3 data is the way in which data is framed and presented to you. This may be in a newspaper, an internal report or perhaps a sales presentation. They are all trying to sell you something. The wrapping of the data and analysis may be in a manner to enhance the credibility and believability of the package, or you may be being steered away from robust data because it doesn’t fit with someone’s agenda. Either way, you are being encouraged to buy in to a point of view and the ‘data’ is being used in an effort to burnish the idea.

Clever Visual Semiotics that speak to far deeper parts of our brain are often employed. You already know what these are: they’re the graphs, symbols and pie charts, as well as the tangentially relevant accompanying images. See the recent post on the mangling of data by the Guardian newspaper – the image of the white police officer discharging a taser directly towards you – for an example of this. Creative affect labeling, which is the process of putting feelings into words, is also influential when applied to some of the characteristics of the data, certainly the ones that focus is being directed towards. The latest research techniques have allowed scientists to show this happens, however much you may think you can override such feelings.


Although Type 3 data is all about the way in which the data is framed, it isn’t numbers in the traditional sense. It is the third part of the package. Type 1 data, even if correctly produced and analysed, is completely susceptible to the influence of Type 3 data, as is Type 2 data.

Type 3 data is the processing, packing and presentation of the digital exhaust that makes up Types 1 and 2 data. It is important because it mediates between that data and us unpredictable humans, slaves to our emotions, with all our psychological foibles and weaknesses hidden just below the surface. As such, Type 3 data should be afforded as much significance as Types 1 and 2 when analysing any data that is presented to us.