Metadata Management as a Data Governance Enabler
"I know Metadata Management quite well! Just managing a bunch of Data Dictionaries will do!" [Facepalm]
The common understanding of metadata is "data about data": the information an organization needs to effectively and efficiently manage its data and information resources.
So are we just managing a stack of Data Dictionaries from different source systems? Not quite. Nowadays it is much more than managing a stack of ordinary data dictionaries.
Metadata Management helps us classify data into topics based on similar characteristics and describe data by its meaning, structure, and lineage. With well-managed metadata, we can guide and control the usage of data, from keyword-based search to the retention period of data.
The early development of Metadata Management began with library classification systems: the Dewey Decimal System of 1876 introduced the card catalog, letting readers search for a book by author, topic, location, and so on. Today most libraries have converted their catalog data to digital databases. You can imagine Metadata Management as managing books in a huge library using a digitized card catalog, both for existing books and for newly arrived ones waiting for librarians to index them onto a shelf based on each book's criteria.
If libraries use digitized card catalogs to manage books, how do data-centric organizations manage data using metadata? There are six components related to metadata management:
Books with an Eye-Catching Title Are Always a Good Start for Reading
A book with a good title is always a good start; it attracts our eyes and makes us pick it up for a glance through the contents. Imagine looking at the cover of the third Harry Potter book, "Harry Potter and the Prisoner of Azkaban", which tells the story of Sirius Black's escape from Azkaban prison and the secret that ties him to Harry. From the title alone it is clear that the plot involves a prisoner and the main character, Harry.
The same goes for today's business terms and technical tables, which also need dedicated names and definitions. As a simple case in point, consider Insurance Company ABC to illustrate the terms used across different business departments and IT:
Criteria to consider:
Data Elements & Data Entities
Database Tables & Database Columns
How well these are defined depends on the business knowledge and the big-picture, enterprise-wide view of a person or a team. Clear naming matters for data communication, as it avoids ambiguity in data meaning.
Next... A Perfect Table of Contents, A Perfect Lethal Drop Shot
My reading habit normally starts with an eye-catching book title, followed by the table of contents. If the table of contents is well organized and reflects the writer's thinking, I will want to dig into the details of the entire book.
The same applies to an organization's data management: data structures reflect business rules, while data models are visual representations of those structures, from conceptual to logical and finally to physical.
Looking back at Insurance Company ABC, its data structure can be described with the scenario below:
A policy is owned by one policyholder, and a holder can own different types of policies, such as Life Insurance and Automobile Insurance. Only Life Insurance policies can have beneficiaries, while a beneficiary can be the Primary Beneficiary of many policies.
Converting the business rules above into a visual diagram gives us the company's data model:
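For illustration, the same business rules could also be sketched in code. The following is a minimal Python sketch (class and field names are hypothetical, not from any real system) that encodes the one-holder-to-many-policies relationship and the rule that only Life policies carry beneficiaries:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Beneficiary:
    name: str
    is_primary: bool = False  # a beneficiary can be primary on many policies

@dataclass
class Policy:
    policy_no: str
    policy_type: str  # e.g. "Life" or "Automobile"
    beneficiaries: List[Beneficiary] = field(default_factory=list)

    def add_beneficiary(self, b: Beneficiary) -> None:
        # Business rule: only Life Insurance policies may have beneficiaries
        if self.policy_type != "Life":
            raise ValueError("Only Life policies can have beneficiaries")
        self.beneficiaries.append(b)

@dataclass
class Policyholder:
    name: str
    policies: List[Policy] = field(default_factory=list)  # one holder, many policies
```

The point is not the code itself but that each constraint in the diagram corresponds to something checkable, which is exactly what the data model should make explicit.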
From the scenario we can see that business users such as Customer Service take the lead in providing business knowledge for:
Data structuring processes and usage, based on the shape and constraints of business processes and practices.
For instance, the SOP for client portals when handling enquiry requests from a newly onboarded policyholder, with notification via messaging app for first-time login activation and information creation; Policyholder and Mobile Number are the main components in this data structure.
Data modelling activities, helping to verify that the data model accurately represents the business rules.
Continuing the scenario: after login activation, a new client must provide a Social Security Number (for a resident) or a Passport Number (for a non-resident) during information creation in the client portal. This impacts the technical table: the ID and Resident Status attributes are interrelated, and neither can be null or blank in the source system for an individual.
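This interrelated-attribute rule could be expressed as a small validation check. Below is a hedged sketch (function and status values are assumptions for illustration, not the actual source system's logic):

```python
from typing import Optional

def validate_identity(resident_status: Optional[str],
                      ssn: Optional[str],
                      passport_no: Optional[str]) -> bool:
    """Interrelated attributes: a Resident must have a Social Security Number,
    a Non-resident must have a Passport Number, and neither the ID nor the
    Resident Status may be null or blank."""
    if not resident_status or not resident_status.strip():
        return False  # status itself may not be blank
    if resident_status == "Resident":
        return bool(ssn and ssn.strip())
    if resident_status == "Non-resident":
        return bool(passport_no and passport_no.strip())
    return False  # unknown status value
```

A rule like this is exactly the kind of constraint that belongs in the metadata describing the table, not just buried in application code.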
Wait... Data Profiles and Data Quality? Do We Need Them?
Data profiles are metadata extracted by examining actual data values in a database. You can think of a data profile as a "data detective" with a magnifying glass, looking at every detail necessary to investigate a case.
So why do we need Data Profiles and Data Quality in Metadata Management? Put simply, a data profile enriches our metadata with the min and max values and distinct counts behind a list of values. Take the value frequencies of the gender column in a customer list: if gender is free text rather than a drop-down, and multilingual clients fill in their own information, you will be surprised to find far more than just the abbreviations "F" and "M" for female and male.
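A basic column profile of that gender column might look like the sketch below (the sample values, including a Malay entry a multilingual client might type, are invented for illustration):

```python
from collections import Counter

def column_profile(values):
    """Basic column profile: counts, min/max, distinct count and value frequencies."""
    non_null = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "null_or_blank": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "frequencies": Counter(non_null),
    }

# Free-text gender column filled in by clients themselves (hypothetical sample)
gender = ["F", "M", "F", "Female", "M", "", "Perempuan", "M"]
profile = column_profile(gender)
# The frequencies reveal far more than just "F" and "M" once free text is allowed
```

The frequencies alone already tell the governance team that the column needs a controlled list of values.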
Data profiling is a collection of many techniques, and the example above is the most basic form, known as Column Profiling.
There are other profiling techniques, such as:
Attribute Dependency Profiling, which looks for hidden relationships between attribute values. Take premium calculation in Insurance Company ABC: lifestyle behaviour, age and gender are among the related attributes that affect the premium amount of a policy.
Profiling Time-Dependent Data, to learn how much history exists, whether the data follows any predictable pattern, and whether its meaning changes over time. As an illustration, consider Policyholder Date of Birth in Insurance Company ABC. Organizations that migrate from an obsolete system to a new one often find existing customers without a Date of Birth, so the migration date is assigned to those customers. That is only part of the historical problem when migration is done without proper cleansing, and the longer it is left, the more expensive the cleansing becomes.
Subject Profiling, which examines a subject across or within databases, ranging from high-level counts to detailed counts, per system or across combinations of systems. A good case in point: for the Policyholder subject in Insurance Company ABC, examine all basic personal info and count the active policies held by each customer to check for abnormalities or outliers in the list.
Profiling State Transition Models, which examines the life cycle of actual state-dependent objects. For instance, the underwriting process of an insurance company: how customer info is processed through KYC, AML and internal processes before a policy is confirmed as in force for a policyholder. This is challenging because the process is complex, but it is also important, since underwriting is one of an organization's core processes.
Profiling Relational Data Models, which provides information about the actual keys and relationships within data models. For Insurance Company ABC: are the data models up to date with the actual data, for instance the primary keys, foreign keys and relationships of the policyholder, policy and beneficiary tables?
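Two of the techniques above can be sketched in a few lines of Python. This is only an illustration; the table names, field names and the 20% threshold are assumptions, not values from a real system:

```python
from collections import Counter
from datetime import date

def orphan_foreign_keys(child_rows, fk_field, parent_keys):
    """Relational profiling: child rows whose foreign key has no matching parent."""
    return [r for r in child_rows if r[fk_field] not in parent_keys]

def suspicious_default_dates(dobs, threshold=0.2):
    """Time-dependent profiling: flag any single Date of Birth that accounts for
    an implausibly large share of records (e.g. a migration date used as a default)."""
    freq = Counter(dobs)
    total = len(dobs)
    return {d: n for d, n in freq.items() if n / total >= threshold}

# Hypothetical sample data
policies = [{"policy_no": "P1", "holder_id": 10},
            {"policy_no": "P2", "holder_id": 99}]   # holder 99 does not exist
holders = {10, 11}
orphans = orphan_foreign_keys(policies, "holder_id", holders)

dobs = [date(1970, 1, 1)] * 4 + [date(1988, 5, 2), date(1991, 7, 9)]
flagged = suspicious_default_dates(dobs)
```

In practice these checks run against the actual database, but the logic is the same: compare the documented model against what the data really contains.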
Profiling only becomes valuable when people start to analyze the data and raise questions about its status in order to understand the data's reality.
Only when we understand that reality can the next exercise, Data Quality, be done: rule-based assessment categorized as Completeness, Validity, Accuracy, Consistency, Timeliness and Integrity, plus stakeholder surveys on utility and perceived usefulness. The structure and meaning of data change over time, and a one-time profile only tells us about the past and present. It is advisable to repeat the exercise on a regular basis and compare results to analyze the root causes of data quality deficiencies.
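Two of those rule-based dimensions, Completeness and Validity, can be sketched as simple scoring functions (the email sample and the regex are illustrative assumptions, not a production-grade validator):

```python
import re

def completeness(values):
    """Completeness: share of rows that are neither null nor blank."""
    filled = sum(1 for v in values if v not in (None, ""))
    return filled / len(values)

def validity(values, pattern):
    """Validity: share of non-null rows matching an expected format."""
    non_null = [v for v in values if v]
    return sum(1 for v in non_null if re.fullmatch(pattern, v)) / len(non_null)

emails = ["a@abc.com", "", None, "not-an-email", "b@abc.com"]
completeness_score = completeness(emails)                        # 3 of 5 filled
validity_score = validity(emails, r"[^@\s]+@[^@\s]+\.[^@\s]+")   # 2 of 3 valid
```

Running the same rules on a schedule and trending the scores is what turns a one-off profile into an ongoing quality assessment.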
Still Remember Yellow Pages, the Telephone Directory of Businesses? You Might Need Your Own Version of "Yellow Pages".
Kids born in the early 1990s might still remember the Yellow Pages, a weighty tome full of business listings: restaurants, boutiques, workshops and so on. Stakeholders who work with data all the time need their own version of the "Yellow Pages" in just the same way.
Most of the time, data consumers such as business users or data analysts work with data without fully understanding the full picture of what they have in their hands. That costs them extra time and effort finding, understanding, and even recreating datasets that already exist, which can lead to incorrect analysis. If you are lucky enough to have a team of SMEs who understand the data across different source systems very well, that is the ideal scenario, but it is rarely realistic.
This is where a Data Catalog comes in: a collection of metadata that guides data consumers to what data exists in an organization, along with information such as definitions, formats, lineage and accessibility. In turn, it lets data consumers evaluate the data's suitability for different use cases.
Let's understand the Data Catalog better with an illustration. In our insurance company scenario, a newly onboarded Data Analyst, Siti, is tasked to analyze the "distribution of active policyholders by type of policy purchased across Malaysia". How does she start?
Here is the standard procedure Siti would go through: work from the available documentation, which is usually outdated; if she is lucky, pick up tribal knowledge from colleagues willing to tell her which person in charge to talk to about data usage; then go back and forth between searching for and evaluating suitable data before she can prepare and analyze it. It is brute-force work that wastes time better spent on higher-quality tasks, such as deciding on an approach that delivers the insight simply yet powerfully.
With a Data Catalog, Siti can search on terms such as "Client" or "In Force" to find candidate data to evaluate for suitability, as the Data Catalog presents a catalogue of available datasets with information such as:
Dataset names, definitions and locations
Data formats and constraints
Database types and platforms
Data provenance and lineage
The interaction should include not only search terms but also the SMEs for the data, whom she can approach personally to learn more, as illustrated in the Catalog Model below.
The entire process then runs efficiently and boosts Siti's confidence and trust in the data she is about to analyze.
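To make the idea concrete, a catalog entry and keyword search could be sketched as below. The entry fields mirror the list above; the dataset names, locations and SME names are purely hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    definition: str
    location: str
    data_format: str
    sme: str                       # subject-matter expert to contact
    tags: list = field(default_factory=list)

def search_catalog(catalog, term):
    """Keyword search across names, definitions and tags (case-insensitive)."""
    t = term.lower()
    return [e for e in catalog
            if t in e.name.lower() or t in e.definition.lower()
            or any(t in tag.lower() for tag in e.tags)]

catalog = [
    CatalogEntry("POLICY_MASTER", "All policies, with status flag for in-force",
                 "db.core.policy_master", "table", "Ahmad", ["policy", "in force"]),
    CatalogEntry("CLIENT_DIM", "Client dimension incl. residency and state",
                 "dw.dim.client", "table", "Mei Ling", ["client", "customer"]),
]
hits = search_catalog(catalog, "in force")
```

Each hit also carries the SME's name, which is what turns a search result into a conversation rather than a dead end.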
By now you may be asking, "Isn't a data catalog quite similar to a data dictionary?" The answer is no.
The similarity is easy to see: both a Data Dictionary and a Data Catalog can provide data definitions and describe data formats. What distinguishes a Data Catalog is that it not only tells us about definitions and formats, but also lets people like Siti, together with participants in roles such as data expert, data producer and data consumer, crowdsource metadata and use the tool for faster collaboration and sharing.
Do You Know the Story of Human Evolution?
Human evolution consists of several stages, from apelike ancestors to today's humans, with far more advanced abilities than six million years ago. The lineage in between unfolded through natural selection, variation, and survival ability. Lineage still happens in nature today, and even the data we produce goes through complex transformations from its source point downstream, for purposes such as reporting.
Before getting to know data lineage, let's briefly understand data provenance. "Provenance" comes from the Latin "pro" (forth) and "venire" (to come), entering English with the meaning "the place of origin". Data provenance is thus the point of origin in a data lineage trail, normally the initial source: a legacy system, ERP, social media, SaaS, open data and so on. Provenance is normally static metadata that remains consistent regardless of time.
Data lineage, on the other hand, refers to the entire trail of events and actions in the data flow from the point of origin to a point of interest. It is captured as a combination of static metadata (the mapping of source, staging and target) and active metadata (such as data processing logs). This also means data lineage has a direct relationship with data engineering, as illustrated in the diagram below.
Lineage happens from the point where data is acquired from a source system, such as a legacy system, through integration and preparation, and finally to storage and usage. The questions in the diagram need to be answered when we are:
Looking at the lineage of a use case, such as commission calculation for insurance agents.
Complying with BCBS 239 requirements on data governance and management, for example identifying and monitoring the lineage of PII data from multiple source systems to multiple target systems or downstream applications, including any transformation logic involved.
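At its simplest, a lineage trail is a set of directed edges from source elements to target elements, each annotated with its transformation. The sketch below traces a PII column downstream; all system, table and column names are invented for illustration:

```python
# Lineage as directed edges: (source element, transformation, target element)
EDGES = [
    ("legacy.CUST.NRIC",        "copy",          "staging.customer.nric"),
    ("staging.customer.nric",   "hash(SHA-256)", "dw.dim_client.nric_hash"),
    ("dw.dim_client.nric_hash", "join",          "mart.agent_commission.client_key"),
]

def downstream(element, edges):
    """Walk the lineage trail from a point of origin to every point of interest."""
    trail, frontier = [], [element]
    while frontier:
        current = frontier.pop()
        for src, transform, tgt in edges:
            if src == current:
                trail.append((src, transform, tgt))
                frontier.append(tgt)
    return trail

trail = downstream("legacy.CUST.NRIC", EDGES)
```

A real lineage tool does the same walk over metadata harvested from ETL jobs and logs, which is exactly why lineage and data engineering are so tightly coupled.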
Of course, a lineage tool should do more than answer the questions above; the versioning of the entire lineage trail as metadata will also change over time. It would also be ideal to attach a data quality scorecard to the different data elements, so we can monitor the health of our data and keep it accurate. The endgame of complete metadata lineage is to help data consumers and data stewards trust the data, and to support troubleshooting and root cause analysis when a problem arises.
Hello? Is the Sensitivity Radar Home?
Today's organizations hold a greater variety of data than we can imagine: Customer, Product, Sales, Marketing, Human Resources, Service, Finance, Equipment, IT, Operations, Facilities and more, depending on the industry. For all the data we collect, we should be able to answer the questions below when it comes to Data Sensitivity:
What data is governed by regulation and policies?
The classic illustration is Personally Identifiable Information (PII): any information that could potentially identify an individual, by direct or indirect means. Obvious PII includes demographic data (name, email address, home address) and government IDs (Social Security Number, vehicle registration number), while other sensitive data such as ethnicity, gender and religion may not be stored by, say, retail shops at all (especially under the EU GDPR). The scope of PII keeps evolving: nowadays digital identifiers such as an IP address or a LinkedIn account are considered part of PII.
Who has authorized access, to prevent a data breach that may lead to corporate reputational loss?
So who can view those PII value?
Can a data lake developer view them? They need sample data to develop pipelines from the source system to the lake, so masked PII data is practical for development purposes.
And what PII is sufficient for Marketing to run a campaign promoting a new product? Is demographic data enough, while clients' health conditions remain concealed from the Marketing department?
The answers depend on the regulations and policies driving the corporation, and they differ in each industry.
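For the developer case above, masking is often as simple as hiding all but the last few characters of an identifier. A minimal sketch, assuming a Malaysian-style NRIC as the sample value (the format and function name are illustrative only):

```python
def mask_pii(value, keep_last=4):
    """Mask all but the last few characters, so developers can build and test
    pipelines against realistic-looking data without seeing real identifiers."""
    if value is None or len(value) <= keep_last:
        return "*" * len(value or "")
    return "*" * (len(value) - keep_last) + value[-keep_last:]

masked = mask_pii("900101-14-5678")  # hypothetical national ID
```

Real deployments usually prefer irreversible techniques such as hashing or tokenization, but the governance question is the same: who sees the raw value, and who sees only the masked one.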
What are the potential internal and/or external threats and risks while data is in motion or at rest?
Numerous threats and risks can occur, such as:
Unauthorized, illegal access to an application, leading to misuse of data by revealing personal details, for example when external hackers enter the application through phishing;
Data stored without the necessary protection, or in an improper location, for instance PII details kept in a CSV file on a business user's desktop without any encryption.
These risks rarely originate within the systems and applications themselves or during data movement (core systems, cloud applications, analytical applications and so on), as each has its own function and focus in handling privacy and security.
More often they depend on users' knowledge and capabilities, which can be addressed by educating users on data protection concepts and guidelines.
Answering these questions, which focus on data privacy and protection by sensitivity level, depends on the privacy and security constraints that come from:
The business process, where stakeholders (CDO, Legal, Compliance, business users and IT) identify sensitivity in the business content, followed by the source system elements;
A cross-functional team (Legal, IT and Data Governance) that decides on sensitivity levels, which are always affected by current legislation such as the EU GDPR and PDPA, or other legal and compliance issues in each industry.
It is not solely the CISO's job; their focus has historically been network and platform protection, but times change. With cloud adoption, the CISO's role and responsibilities are shifting from on-prem information security to broader duties, including handling real-time threats for applications running in the cloud and collaborating with regulators on the security of data stored in the cloud.
So, do you know what Metadata Management actually does now?
When people ask what components Metadata Management should have, and how to plan or initiate it:
Unambiguous data names and meanings (common understanding) are fundamental for communication.
Only then can business knowledge be visualized accurately as data models and structures (a forward-looking business architecture) to support business analysis and other data work such as data profiling, which becomes valuable when people know what questions to raise and build a Data Quality matrix for periodic root cause analysis.
With catalogued metadata (the holy grail of assets for strategy), contributed by data stewards, collaboration and sharing among data consumers become faster, and they can consume data more confidently, effectively and efficiently.
Finally, maintain data lifecycles based on data sensitivity classification (customer data protection), where Data Provenance and Lineage help data stewards build customer trust and participate in troubleshooting and root cause analysis when problems arise.