It could be argued that cleansing data, transforming it and loading it into structured database tables is a sure way to eliminate many of the interesting discoveries that could otherwise be made.
I believe the real data science takes place whilst analysing the dirty, uncleansed and unstructured data. Why? Because, as the saying goes, 'the devil is in the detail'.
Performing any form of scientific experiment using big data is a huge task, and one at high risk of misinterpretation and misrepresentation if noise and bias are not accurately identified and clearly labelled. The smallest error or oversight at the beginning of a scientific data experiment can snowball into a completely useless set of analytical results, which, once realised, can be uncomfortable to unravel or expose to the audience.
If we cleanse the data using pre-defined rules and logic, then transform it and store it all in structured tables, have we not sanitised the data so thoroughly that any result we produce is 'biased by default' by the rules and logic applied pre-analysis, rendering the results 'dismissible, with no real research value'?
I say, apply all the data science analytical methods you want before you extract, transform and load, because, to quote Jeff Jonas, "all errors such as misspellings and numeric transpositions are valuable regardless of whether these errors have been generated by accident or are professionally fabricated lies created by sophisticated criminals". In other words, the devil is in the detail.
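To make the point concrete, here is a minimal sketch, using entirely hypothetical records and thresholds, of why raw errors carry signal: a misspelled name combined with a transposed account number can suggest that two 'different' records describe the same underlying entity, a link that a pre-analysis cleansing pass standardising or discarding 'bad' values would erase. Only Python's standard library is used.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude ratio in [0, 1] of how alike two raw strings are."""
    return SequenceMatcher(None, a, b).ratio()

def is_transposition(a: str, b: str) -> bool:
    """True if b equals a with exactly one pair of adjacent characters swapped."""
    if len(a) != len(b) or a == b:
        return False
    diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    return (len(diffs) == 2 and diffs[1] == diffs[0] + 1
            and a[diffs[0]] == b[diffs[1]] and a[diffs[1]] == b[diffs[0]])

# Hypothetical raw records: a misspelling plus a digit transposition.
# A cleansing step might reject or 'correct' both fields independently,
# destroying the very pattern that links them.
rec_a = {"name": "John Smith", "account": "18273645"}
rec_b = {"name": "Jhon Smith", "account": "18273654"}

name_score = similarity(rec_a["name"], rec_b["name"])
linked = name_score > 0.8 and is_transposition(rec_a["account"], rec_b["account"])
print(f"name similarity={name_score:.2f}, likely same entity: {linked}")
```

The 0.8 threshold is illustrative, not a recommendation; the point is simply that the comparison is only possible while the errors are still present in the data.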