
Data Science - the devil is in the detail

Friday, 5 Jul 2013 (revised date: Friday, 19 Jul 2013)
Sayara Beg

It could be argued that cleansing data, transforming it and loading it into structured database tables is a certain way to eliminate all the interesting discoveries that could possibly be made.

I believe the real data science takes place whilst analysing the dirty, uncleansed and unstructured data. Why? Because, as the saying goes, 'the devil is in the detail'.

Performing any form of scientific experiment using big data is a huge task and runs a high risk of misinterpretation and misrepresentation if noise and bias are not accurately identified and clearly labelled. The smallest error or oversight at the beginning of any scientific data experiment can snowball into a completely useless set of analytical results, which can become uncomfortable to unravel or expose to the audience once realised.

If we cleanse the data using pre-defined rules and logic, then transform it and store it all in structured tables, have we not sanitised the data so thoroughly that any result we produce will be 'biased by default' towards the rules and logic applied pre-analysis, rendering the results 'dismissible, with no real research value'?

I say you should apply all the data science analytical methods you want before you extract, transform and load, because, to quote Jeff Jonas, "all errors such as misspellings and numeric transpositions are valuable regardless of whether these errors have been generated by accident or are professionally fabricated lies created by sophisticated criminals". In other words, the devil is in the detail.


Sayara (Chair)




Michael Mortenson
Fri, Jul 12 2013 19:40 GMT

Hi Sayara. Interestingly, I think Google eschewed using stemming or lemmatization for many years and has only started to use it more recently. I certainly have no inside scoop (more's the pity!) but I would surmise that they must be using the mother of all ensemble methods, combining the above with synonyms, chunking / tagging, n-grams, boosting, bagging and a whole lot more, I'm sure.

Those of us using Googlemail or Google+ may also notice that, if used in the same web session as normal searches, you will be "logged in" when searching. I have already seen evidence that Google is using my search history and email activity to affect the results shown - taking the whole thing to another level again.

They certainly seem to be the ones to watch ... maybe the best answer to "what is data science?" might simply be "the stuff that Google does!!".

Sayara Beg
Wed, Jul 10 2013 16:54 GMT

Michael, this is really interesting.  I am looking forward to hearing about your approach at the OR55.

It also throws up the question, what algorithms are used by search engines like 'Google' to display the closest result sets to the original question, given that the search is often across unstructured and uncleansed data.

Sayara (Chair)

Michael Mortenson
Sun, Jul 07 2013 16:08 GMT

Thanks Sayara - a very thought-provoking post.

One thing that jumps straight to mind is that perhaps this is one of the things to commend about Hadoop-style big data architecture. Unlike BI data warehouses, where the transform stage is performed in advance of any analysis (i.e. in the ETL process), in NoSQL databases you just extract and load in advance and then transform when you want to do the analysis. Invariably some form of transformation occurs, but at least the data remains in its original form in storage. So any transformation involved in one analysis doesn't affect the transformation you may use in another.
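That "transform at analysis time" idea can be sketched in a few lines of plain Python (illustrative only - not MongoDB, and the record fields are invented): the raw records are loaded untouched, and each analysis applies its own transformation at read time, so one analysis's cleansing never alters what another sees.

```python
# Raw records loaded as-is, typos and inconsistent formats included.
raw_adverts = [
    {"title": "Analytcs Manager", "salary": "£45,000"},   # typo kept deliberately
    {"title": "Data Scientist",   "salary": "45000 GBP"},
]

def analysis_a(records):
    # One analysis normalises salaries to integers at read time...
    return [int("".join(ch for ch in r["salary"] if ch.isdigit())) for r in records]

def analysis_b(records):
    # ...while another works on the raw titles, errors and all.
    return [r["title"].lower() for r in records]

print(analysis_a(raw_adverts))  # [45000, 45000]
print(analysis_b(raw_adverts))  # ['analytcs manager', 'data scientist']
```

Neither analysis mutates `raw_adverts`, which is the point: the original, "dirty" data stays available for any future analysis with different rules.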

I have had a similar decision to make in a current project where I am analysing some job adverts using text mining. Typically in such analyses one would transform the text using stemming or lemmatization so that all words are reduced to their root (e.g. manager, management and manages would all be reduced to manage or manag depending on the method used).

The benefit is that it can recognise the following two sentences as the same:

"the job involves managing analytics projects" 

"the job involves the management of analytical projects"

However, it will also mean that these following sentences are treated as approximately the same:

"has experience running projects using Management Science or Analytics"

"managing a team of scientists analysing the experience of runners"

As can be seen, there is a clear trade-off between better capturing the meaning of a sentence (precision) and recognising sentences as the same when they are merely phrased differently - in another tense, for example (recall).
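The conflation described above can be demonstrated with a deliberately crude suffix-stripping stemmer - a toy for illustration, not a real algorithm such as Porter's, and not necessarily what was used in the project:

```python
# Strip the first matching suffix, provided at least 3 characters remain.
SUFFIXES = ("ement", "ing", "ical", "ics", "ers", "er", "es", "s")

def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stems(sentence):
    return {stem(w) for w in sentence.lower().split()}

# All three variants collapse to the same root:
print(stem("manager"), stem("management"), stem("manages"))  # manag manag manag

# The intended match: two phrasings of the same requirement share their stems.
a = stems("the job involves managing analytics projects")
b = stems("the job involves the management of analytical projects")
print(a <= b)  # True - every stem in the first sentence appears in the second

# The unintended match: very different sentences still overlap.
c = stems("has experience running projects using Management Science or Analytics")
d = stems("managing a team of scientists analysing the experience of runners")
print(sorted(c & d))  # ['experience', 'manag', 'runn']
```

A bag-of-stems comparison sees real overlap between the last two sentences even though one describes a candidate and the other a completely different activity - exactly the precision/recall trade-off above.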

(To find out which approach I chose you'll have to attend my presentation at OR55 in September!!).

In summary, I think that even if you are not using prescriptive ETL methods (I used MongoDB) and you are working with unstructured data, you still can't avoid facing trade-offs that will have a big impact on the results.