Why Doesn’t Big Data Always = Good Data?

The Data Scientists out there will sigh, feeling they have heard this a thousand times before. However, it is human beings that are the problem. Numbers are just numbers; it is what we humans do with them that is the issue.

Very quickly, then: this is the correlation versus causation argument writ large.

correlation causation.jpg
But it must be true???

Can you see the issue? On the face of it, it makes sense. I prefer the elegance of expression of the original description: post hoc ergo propter hoc. Merely acquiring more and more data points – a bigger data set, better hardware, software and human expertise to manipulate it – does not equal better results from the data.
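A tiny, entirely made-up illustration of that trap: two series that merely share a common trend (here, a synthetic "summer heat" drift – the classic ice-cream-and-drownings example) will correlate strongly even though neither causes the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two unrelated quantities that both drift upward over time --
# say, ice-cream sales and drowning incidents across 100 summers.
trend = np.linspace(0, 10, 100)
ice_cream = trend + rng.normal(0, 1, 100)
drownings = trend + rng.normal(0, 1, 100)

# They correlate strongly, yet neither causes the other: a shared
# third factor (the trend) drives both.
r = np.corrcoef(ice_cream, drownings)[0, 1]
print(f"correlation: {r:.2f}")
```

The correlation comes out strongly positive, and yet changing ice-cream sales would do nothing to drownings – the number on its own tells you nothing about cause.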

Big data is great and powerful when it is clean and accurate. But pause and think: before plunging into the analysis and insight phase, the cleaning and tidying phase – the often-skipped boring stuff – needs to be complete. The crazy outliers need to be identified, partial data from a single source needs to be investigated, in the case of human surveys the ‘don’t know’ answers may need to be coded out, and so on.
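As a sketch of what that boring stuff looks like in practice – with a hypothetical five-row survey extract and invented column names – the three chores above might be:

```python
import numpy as np
import pandas as pd

# Hypothetical survey extract; the columns and values are illustrative only.
df = pd.DataFrame({
    "respondent": [1, 2, 3, 4, 5],
    "age":        [34, 29, 41, 412, 38],                 # 412: a crazy outlier
    "income":     [32000, np.nan, 41000, 39000, 37000],  # partial data
    "satisfied":  ["yes", "don't know", "no", "yes", "don't know"],
})

# 1. Flag implausible outliers for investigation, rather than silently dropping them.
outliers = df[df["age"] > 110]

# 2. Isolate rows with partial data so their source can be checked.
partial = df[df["income"].isna()]

# 3. Code out the 'don't know' answers so they don't masquerade as opinions.
df["satisfied"] = df["satisfied"].replace("don't know", pd.NA)

print(len(outliers), len(partial), df["satisfied"].isna().sum())
```

The point of flagging rather than deleting is that each oddity carries its own question: is the 412 a typo for 42, or a broken collection form?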

There are a variety of ways to allow the Data Scientists to do this, but the heart of the matter is that if they are not given the time, tools and budget to do this then you are back to the junk in, junk out scenario that affects everything to do with computers.

As humans we are programmed in ways that really hamper us. This is especially true when we are operating outside our field of expertise or are very out of date in a subject area. Our brains crave clarity and simplicity; we avoid the unknown, as that is where danger may lie. We want to make as smooth and risk-free a transit through life as possible. Because of this, the best and the brightest can suddenly become very credulous and succumb to deep-seated fear and prejudice. This propensity feeds the behaviour of some who, on being told something, seize upon it and then happily transmit it to others as fact. The recipients believe it, often more so when it is passed to them by a person or source in whom additional credibility is invested.

I was struck yesterday when listening to an episode of The Infinite Monkey Cage – a science programme on BBC Radio 4 – in which anthropologists and evolutionary biologists were tearing their hair out at the traction gained by an image we are all familiar with: the evolution of man from ape to upright-walking human, apparently a terribly inaccurate and misleading picture. It first appeared in a French school textbook back in the Fifties and resonated so much (which shows the power of a credible source and a good image) that it stuck and has been reproduced millions of times over. I had no idea how inaccurate it was, and I like to think that I am not very credulous. It goes to show the power of something that has been ever-present, though. Few people except the experts challenge it, even now.

human-evolution-670
The iconic, contested and wholly inaccurate image

Bringing this to business: I feel for the person or team at Apple that had to brief Tim Cook and co that the earnings forecast had to be dramatically trimmed because the previous cash-cow, the iPhone, was no longer selling as quickly. I appreciate I have the benefit of hindsight here, but people were hanging onto their devices for longer and railing against the so-called planned obsolescence that many believed was being built in. Couple that with the belief that the latest OS was designed to overwhelm older devices – and yet without the latest OS the functionality would henceforth be limited – and consumers were really upset. Combine that in turn with the ever-longer service contracts we are all but forced to agree to by the network providers (here in the UK at any rate) in order to have the latest tech, subsidised by those growing contracts, and I suspect this wouldn’t have been such new news.

We can see the clever PR operation swing into action. Great PR relies so heavily on gut feelings and relationships that people overlook how incredible humans are at computing very complex Big Data – still far ahead of any computer. To wit: the entire slowdown has been pinned almost completely on the Chinese market, something I find hard to swallow. I have no doubt it is a large component, and it is very politically expedient given the way China is portrayed in the US these days. The messaging seems to play heavily on the deterioration of relations between the US and China. The PR teams are operating on very thick and contextual data, nothing more; the human brains are the computers here. Either way, it is, apparently, not the fault of Apple… *coughs politely*

blaming everyone else

On the other hand, perhaps they knew of this trend and the feelings that underpinned it because they had excellent Big Data, had combined it with the Thick Data approach and the insights of Anthropologists, Sociologists and Political Scientists who specialise in these fields, and had synthesised the findings into usable data – so the real issue wasn’t knowing this, but when to let the markets know. Sadly, few large companies manage to meld their data very effectively; usually the larger they are, the greater the disconnect between the boardroom and the customer, and the inadequacies of the information providers aren’t spotted soon enough.

What about the person responsible, or is there one? Challenging assumptions is often uncomfortable and often seen in an organisation as disruptive and potentially unwanted behaviour. A Chief Data Officer (CDO) ought to have both the support and power to ask the ‘who, what, when, where and why’ questions relentlessly. In fact, if they aren’t querying the data they are to use for gaining insight and helping the other leaders to make the best informed decisions, they are probably falling short in their role.

How Do I Know…

…if I am getting the entire Data Story?

…if it was analysed properly?

…if I can trust the conclusions and recommendations?

Every executive reliant on decision-making data presented to them by other people shares these doubts. If you don’t know how to ask the correct questions, parse the information in the replies correctly and follow up with the right requests for more information, you will forever be at the mercy of others. My experience is that people with responsibility do not enjoy that situation.

Without an impartial assessment of the Data Story, they will not be able to satisfy themselves that the Data Story they are being told is the right one. Every big decision ends up being made with a greater element of faith than was intended.

untrustworthy.png

There are two basic elements to achieving an accurate Data Story. The first is the human, and the second is the technical.

  1. Human

Everything may be tickety-boo: the best, most loyal people are giving you a perfect Data Story. If you know this to be true then stop reading now – life is great. On the other hand, if you ever wonder, then keep reading.

(Type 1, Type 2, and Type 3 data – a recap, for clarity: I am writing about Type 2 and Type 3 data. Remember, Type 1 is the Mars Lander sort of stuff!)
  • “These results are from AI. It can do things we can’t.”

Whether the results are attributed to AI, which has spotted a very subtle pattern in a vast mass of data, or to a straight survey designed, run and analysed by people, means nothing in and of itself.

Even if an AI tool uses the best and the brightest to program the algorithms it ‘thinks and learns’ with, the fact remains that people – with all their attendant beliefs, prejudices, biases, agendas etc – set the rules, at least to start. If the machine has indeed learned by trial and error, it was still programmed by people. Therein lies the weakness.

human AI blend

This weakness comes from the initial decision makers, precisely because they aren’t you or your Board. The Board is likely to have a much wider range of experience and carry more responsibility than the Data Science/IT/Marketing departments.

How often have you spent time with these people? Are they even in the same office as you? How old are they? What are their social and political biases? And so on. Unless you know this, how can you begin to understand anything about the initial algorithms that set the AI going? When were they written, what was the market like then, by whom, in which country?

With all data collection and manipulation it is crucial to have the fuller story: the background and understanding of those setting the questions, writing the algorithms, tweaking the machine learning and analysing the data; their managers; the instructions they have been given; the emphasis this Data Story has received in the rest of the organisation before you see it. It also includes insight into the marketplace provided by the sort of Thick Data that Tricia Wang and other ethnographers have popularised.

My message to you is that data is so much more than numbers. Numbers alone can misrepresent the story enormously. We are social animals, and as long as there are people involved in the production, analysis and presentation of data, it doesn’t matter a jot how incredibly intelligent and fast the tools are. We are the weakness.

complicated employees

If you still struggle to believe this, think about electronic espionage. It is rarely a failure in something mechanical that causes catastrophic breaches of security; it is the relative ease with which people can be compromised and made to share information. The people are the weak link. In the very first days of hacking, a chap called Kevin Mitnick in the US spoke of Social Engineering as the means to an end. We are all inherently flawed, and these flaws are shaped and amplified by our social and work environments – so why couldn’t that affect the Data Story you get?

  2. Technical

  • “The data we have used is robust.”

I’ve heard that line trotted out many times. Gosh, where to start? It may well be robust. Nonetheless, a lot can and does happen to the data before you see the pretty graph. Here are just a few things to consider before agreeing with that assertion:

What was/were the hypothesis/hypotheses being tested?

Why?

When was it collected?

By whom (in-house or bought in from a third-party)?

Qualitative, quantitative, or a blend?

What was the method of collection (face-to-face interviews, Internet, watching and ticking boxes, survey, correlational, experimental, ethnographic, narrative, phenomenological, case study – you get the idea, there are more…)?

How was the study designed?

Who designed it?

How large was the sample (or samples)?

How was the data edited before analysis (by whom, when, with what tools, any change logs, what questions were excluded and why)?

How was the data analysed (univariate, multivariate, logarithmic, what were the dummy variables and why, etc.)?

How is it being presented to me, and why this way (scales, chart types, colouring, size, accompanying text etc.)?

Research design

And so on. This is just a taste of the complexity behind the pretty pictures shown to you as part of the Data Story. From these manicured reports you are expected to make serious decisions that can have serious consequences.
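To make one of those questions concrete – "what were the dummy variables and why" – here is a hypothetical sketch (the `region` column and its values are invented) of how a categorical field gets expanded before analysis, and why the choice of dropped baseline matters when reading the resulting coefficients.

```python
import pandas as pd

# Illustrative only: a categorical field from some survey.
df = pd.DataFrame({"region": ["north", "south", "south", "west"]})

# Expand into dummy (indicator) variables. drop_first=True removes one
# category ("north", alphabetically first) to avoid perfect collinearity;
# that dropped category becomes the implicit baseline every coefficient
# is measured against -- a choice the analyst made for you.
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
print(list(dummies.columns))
```

A model built on these columns reports "south" and "west" effects relative to "north"; pick a different baseline and every headline number changes, with the underlying data untouched. That is exactly the kind of silent decision the questions above are meant to surface.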

You must ask yourself if you are happy knowing that the Data Story you get may be intentionally curated or unintentionally mangled. I started this site and the consultancy because I am an independent sceptic. In this age of data-driven decision-making you mustn’t forget one thing: incorrect data can’t take responsibility for mistakes, but you will be held to account. This is not scaremongering; it is simply fact.

If you need a discreet, reliable and sceptical third party to ask these questions then drop me an email. I compile the answers, or identify and highlight the gaps. You make the decisions, albeit far better informed and with the ability to show that you didn’t take the proffered Data Story at face value, but asked an expert to help you understand it.