Tuesday, January 20, 2015

Big Data and the importance of Meta-Data

Data isn't really respected in businesses. You can see that because, unlike other corporate assets, there is rarely a decent corporate catalog that shows what exists and who has it.  In the vast majority of companies there is more effort and automation put into tracking laptops than into cataloging and curating information.

Historically we've sort of been able to get away with this because information has resided in disparate systems, and even those which join it together, an Enterprise Data Warehouse (EDW) for instance, have had only a limited number of sources and have viewed the information in only a single way (the final schema).  So basically we've relied on local knowledge of the information to get by.  This really doesn't work in a Big Data world.

The whole point in a Big Data world is having access to everything: being able to combine information from multiple places within a single Business Data Lake so the business can create its own views.

Quite simply, without Meta-Data you are not giving them any sort of map to find the information they need, or to understand the security required.  Meta-Data needs to be a day-one consideration on a Big Data program; by the time you've got a few dozen sources imported, it's going to be a pain going back and adding the information.  This also means the tool used to search the Meta-Data is going to be important.
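To make "day one" concrete, here is a minimal sketch of capturing Meta-Data at ingest time. Everything in it is hypothetical (the function name, record fields and catalog layout are invented for illustration, not any product's API); the point is simply that every load writes a catalog record at the same moment it writes the data.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

CATALOG_DIR = Path("/catalog/entries")  # hypothetical catalog location


def ingest_with_metadata(source_system: str, dataset: str,
                         hdfs_path: str, security_policy: str) -> None:
    """Write a catalog record alongside every load, at load time.

    The actual data movement is elided; the sketch only shows the
    Meta-Data capture that should happen with it.
    """
    entry = {
        "dataset": dataset,
        "source_system": source_system,      # lineage: where it came from
        "landed_at": hdfs_path,              # where it lives in the lake
        "security_policy": security_policy,  # who may see it
        "ingested_utc": datetime.now(timezone.utc).isoformat(),
    }
    CATALOG_DIR.mkdir(parents=True, exist_ok=True)
    (CATALOG_DIR / f"{dataset}.json").write_text(json.dumps(entry, indent=2))


# Example: a procurement feed landing in the lake
ingest_with_metadata("SAP ECC", "procurement_invoices",
                     "/lake/raw/procurement/invoices", "finance-restricted")
```

Retrofitting those records after a few dozen sources are in means reverse-engineering the loading jobs; writing them at ingest is nearly free.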

In a Big Data world Meta-Data is crucial to making the Data Lake business friendly, and essential to ensuring the data can be secured.  Let's be clear here: HCatalog does matter, but it's not sufficient.  You can do a lot with HCatalog, but that is only the start, because you've got to look at where information comes from, what its security policy is, and where you've distilled that information to.  So it's not just about what is in the HDFS repository; it's about what you've distilled into SQL or Data Science views, and about how the business can access that information, not just "you can find it here in HDFS".
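To continue the hypothetical sketch above: HCatalog can know about the HDFS-backed table, but the moment you distil that data into a SQL view the lineage has to be recorded somewhere too, otherwise the catalog stops at the lake's edge. (Same invented field names as before, not a real tool's API.)

```python
def record_distillation(source_dataset: str, destination: str,
                        destination_type: str, process: str) -> None:
    """Append a lineage record each time data is distilled onward.

    HCatalog covers the HDFS-resident table; this (hypothetical) record
    is what ties the raw data to its SQL or Data Science views.
    """
    entry = {
        "from_dataset": source_dataset,
        "to_destination": destination,         # e.g. a SQL view name
        "destination_type": destination_type,  # "hdfs", "sql" or "analytical"
        "distillation_process": process,       # the job that produced it
    }
    out = CATALOG_DIR / f"lineage_{source_dataset}__{destination}.json"
    out.write_text(json.dumps(entry, indent=2))


# The raw procurement feed distilled into a reporting view
record_distillation("procurement_invoices", "vw_spend_by_supplier",
                    "sql", "spend_aggregation_job")
```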

This is what Gartner were talking about in the "Data Lake Fallacy", but as I've written elsewhere, that sort of missed the point: HDFS isn't the only part of a data lake, and EDW approaches solve only one set of problems, not the broader challenge of Big Data.

Meta-Data tools are out there, and you've probably not really looked at them, but here is what you need to test (not a complete list, but these for me are the must-have requirements):
  1. Lineage from source - can it automatically link to the loading processes to say where information came from?
  2. Search - can I search to find the information I want?  Can a non-technical user search?
  3. Multiple destinations - can it support HDFS, SQL and analytical destinations?
  4. Lineage to destination - can it link to the distillation process and automatically provide lineage to the destination?
  5. Business View - can I model the business context of the information (Business Service Architecture style)?
  6. My own attributes - can I extend the Meta-Data model with my own views on what is required?
The point about modelling in a business context is really important.  Knowing information came from an SAP system is technically interesting, but knowing it's Procurement data that is created and blessed by the procurement department (as opposed to being a secondary source) is significantly more valuable.  If you can't present the Meta-Data in a business structure, you aren't going to get business users able to use it; it's just another IT-centric tool.
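Pulling those six requirements together, here is a rough sketch of the shape of catalog entry they imply. The field names and the BusinessContext structure are invented for illustration (loosely "Business Service Architecture style"), not taken from any particular Meta-Data product.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BusinessContext:
    """Requirement 5: the business view, not just the technical one."""
    business_service: str    # e.g. "Procurement"
    owning_department: str   # who creates and blesses the data
    is_primary_source: bool  # blessed master, or a secondary copy?


@dataclass
class CatalogEntry:
    dataset: str
    source_system: str                 # 1. lineage from source
    search_tags: List[str]             # 2. what a non-technical user searches on
    destinations: List[str]            # 3. & 4. HDFS paths, SQL views, analytical models
    business_context: BusinessContext  # 5. business view
    custom_attributes: Dict[str, str] = field(default_factory=dict)  # 6. my own attributes


entry = CatalogEntry(
    dataset="procurement_invoices",
    source_system="SAP ECC",
    search_tags=["procurement", "invoices", "spend"],
    destinations=["/lake/raw/procurement/invoices", "vw_spend_by_supplier"],
    business_context=BusinessContext("Procurement", "Procurement department", True),
    custom_attributes={"retention": "7 years"},
)
```

A business user searching that catalog sees "Procurement data, owned by the procurement department", not an SAP table name.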

The advantage of Business Service-structured Meta-Data is that it matches up with how you evolve and manage your transactional systems as well.

