Messy data is fun.

Whether you are an ETL developer, a business analyst or a data scientist, we all spend far too much time cleaning up and conforming our datasets, bent on extracting the precious, valuable information they hold. We spend long hours preparing our datasets, and when we are finally ready to pursue our analysis, a new set of data is available. Then we need to start all over again! If we want the solution to be sustainable, we will eventually ask a developer to build a job. However, even with tools like DataStage or Informatica, it is not always easy to clean up datasets.

These “anomalies” that we seek to correct or structure have multiple origins: unstructured feeds of information, human data-entry errors, faulty sensor or automated readings, and inconsistencies between IT processes.

New tools are now appearing on the market to help us quickly structure messy data. The beauty of it is that you don’t need to write a single line of code. Trifacta, for example, uses what they call “predictive interaction”: the algorithm analyzes your interactions with the source file and automatically suggests a list of possible transformations, ordered by their probability of relevance. For example, in an unstructured file, after you select a few email addresses, the program will automatically propose a rule to extract email addresses and display the distribution of data the rule applies to. The application also facilitates the detection of anomalies, data aggregation based on advanced rules, and the resolution of erroneous encoding problems on large sets, all without writing scripts or complex SQL.
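
To make the idea concrete, here is a minimal, hand-written sketch of the kind of email-extraction rule such a tool infers, in Python; the regex and the sample lines are illustrative assumptions, not Trifacta’s actual output.

```python
import re
from collections import Counter

# Naive email pattern -- an illustrative assumption, not Trifacta's inferred rule.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

lines = [
    "Contact: alice@example.com, backup bob@example.org",
    "No address on this line",
    "carol@example.net sent the report",
]

# Extract every match, then summarize how many lines the rule applied to,
# mimicking the "distribution" view such tools display.
emails = [m for line in lines for m in EMAIL_RE.findall(line)]
matched_lines = sum(1 for line in lines if EMAIL_RE.search(line))

print(emails)                              # ['alice@example.com', 'bob@example.org', 'carol@example.net']
print(f"{matched_lines}/{len(lines)} lines matched")
print(Counter(e.split('@')[1] for e in emails))  # distribution by domain
```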

A slightly less mature but also very interesting product comes from Mountain View. “Google Refine” works on similar principles to Trifacta: it can quickly prepare a dataset for a visualization tool or for analysis, without any coding. In addition, it is free, and videos are available to learn the tool quickly.
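
One of Refine’s best-known tricks is clustering similar values so they can be merged in one click. The sketch below shows the “key collision” idea behind that kind of feature, under the simplifying assumptions of lowercasing, punctuation stripping and token sorting; the real implementation is more elaborate.

```python
import re
from collections import defaultdict

def fingerprint(value: str) -> str:
    """Normalize a value into a key: lowercase, strip punctuation,
    then deduplicate and sort its tokens."""
    tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
    return " ".join(sorted(set(tokens)))

values = ["Acme Inc.", "acme inc", "Inc. Acme", "Globex Corp"]

clusters = defaultdict(list)
for v in values:
    clusters[fingerprint(v)].append(v)

# Values sharing a key are candidates to be merged into one canonical form.
for key, members in clusters.items():
    if len(members) > 1:
        print(key, "->", members)
# acme inc -> ['Acme Inc.', 'acme inc', 'Inc. Acme']
```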

The main advantage of these tools is that they quickly turn a dataset into actionable information and maximize the time left for analysis and decision-making.

On the other hand, tools such as WhereScape Red can play a similar role, rapidly cleaning and merging multiple data sources and automating their loading into a database. WhereScape Red can also generate SQL code that could serve as a basis for development with a traditional ETL tool or, depending on the size of the solution and the company, eventually become the production system itself!
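
WhereScape generates its SQL through a GUI, so the code below is not its output; it is just a hand-written Python sketch of the clean-merge-load pattern these tools automate, with made-up table and column names.

```python
import sqlite3
import pandas as pd

# Two hypothetical source extracts with inconsistent formatting.
crm = pd.DataFrame({"email": ["Alice@Example.com", "bob@example.org"],
                    "region": ["EMEA", "AMER"]})
billing = pd.DataFrame({"email": ["alice@example.com ", "carol@example.net"],
                        "plan": ["pro", "basic"]})

# Clean: normalize the join key the same way on both sides.
for df in (crm, billing):
    df["email"] = df["email"].str.strip().str.lower()

# Merge: a full outer join keeps customers seen in either system.
customers = crm.merge(billing, on="email", how="outer")

# Load: write the conformed table to a database.
with sqlite3.connect("warehouse.db") as conn:
    customers.to_sql("dim_customer", conn, if_exists="replace", index=False)
```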

A range of products is now available to provide analysts with datasets that can be rapidly exploited by visualization products (such as Tableau Software, SAP Lumira, QlikView or Yellowfin), by tools for predictive or statistical analysis (R, TIBCO Spotfire, SAS, SPSS, etc.) or by the good old Excel spreadsheet.

These applications also influence the way we approach the preliminary phases of a BI project, since it is now easier to test our assumptions about data quality and the dispersion of the data by performing fast and accurate proofs of concept. Our users can now have data available to validate their needs before development of the solution starts. It is another set of tools we can add to our agile project framework to be even more efficient.
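
As an illustration of how quick such a proof of concept can be, the sketch below profiles the quality and dispersion of a small, made-up extract in a few lines of pandas.

```python
import pandas as pd

# A hypothetical extract with the usual defects: nulls, outliers, mixed encodings.
df = pd.DataFrame({"order_id": [1, 2, 3, 4, 5],
                   "amount":   [19.9, 22.5, None, 21.0, 1900.0],
                   "country":  ["FR", "fr", "DE", None, "FR"]})

# Data quality: how complete is each column?
print(df.isna().mean().rename("null_rate"))

# Dispersion: do the numbers look plausible, or do outliers distort them?
print(df["amount"].describe())

# Consistency: inconsistent encodings of the same value show up immediately.
print(df["country"].value_counts(dropna=False))
```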

In short, this new product line is expanding and certainly deserves the attention of all those for whom data is their bread and butter.

Raphael Colsenet – November 2014
