To-do: Serve the Data Elegantly

Imagine having to consume large volumes of data in various formats…

Liana S
5 min readAug 6, 2019

As Mapan's business environment gets increasingly volatile, our data architecture must stay flexible and adapt to large volumes of data in varied formats. Moreover, with 2.5 million members, we realized it's essential to protect individual privacy. To tackle these common challenges, we adopted data modeling techniques and data security policies in building our data layers.

Let’s take a look at how we take care of our data!

Mapan’s oldest data pipeline repository — 2019

What do we have in our Data Layers?

1 Raw data refers to any data object that hasn’t undergone thorough processing, either manually or through automated computer software (as defined by Techopedia). It can be in the form of files, visual images, database records, or any other digital data.

As time goes by, Mapan no longer stores data only in RDBMSes (MySQL, PostgreSQL); we also have APIs, Kafka messages, data transaction logs, and Google Sheets, all of which we call raw data.

Mapan has data stored in many sources such as MySQL, PostgreSQL, third-party systems, and even Google Spreadsheets

2 Data Lake is a centralized repository that contains data items mirroring the raw data in the transactional data pipeline. The Data Lake retains all data types and schemas exactly the way the data is stored in the previous layer: we can store our data as-is.

To make it easier to aggregate data items from the various sources in Mapan’s raw data, we store them in the Data Lake, whether totally unstructured or highly structured. That’s why we need object storage, like Google Cloud Storage or Amazon S3, which can hold data in any file format.

This level of freedom makes data lakes highly adaptive and allows for a broader range of analysis on the data stored in them. Specifically, data lakes let Data Scientists analyze data that wasn’t previously accessible, e.g. logistics tracking, customer support call notes, etc.
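The “store as-is” behavior described above can be sketched in a few lines. This is a minimal illustration, with a local directory standing in for a GCS or S3 bucket and hypothetical source and object names; the real pipeline would use a cloud storage client instead:

```python
import json
import pathlib
import tempfile
from datetime import date

def land_raw_object(lake_root: pathlib.Path, source: str, name: str, payload: bytes) -> pathlib.Path:
    """Write a payload into the lake as-is, under a date-partitioned key."""
    key = lake_root / source / f"dt={date.today().isoformat()}" / name
    key.parent.mkdir(parents=True, exist_ok=True)
    key.write_bytes(payload)  # no parsing or transformation: the lake keeps raw bytes
    return key

# A local temp directory stands in for the bucket in this sketch
lake = pathlib.Path(tempfile.mkdtemp())
message = json.dumps({"member_id": 1, "event": "order_created"}).encode()
stored = land_raw_object(lake, "kafka_orders", "msg-000001.json", message)
print(stored.read_bytes() == message)  # the object round-trips byte-for-byte
```

Because no schema is imposed on write, the same function can land a Kafka message, a CSV export, or an API response unchanged; schema is applied only on read.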

3 Data Warehouse contains data items extracted from the various data sources in the Data Lake, after data cleansing and transformation to meet the requirements of data users. In Mapan, the Data Warehouse consists of three sublayers, and we use this layer as our single source of truth.

(i) Data Vault is a hybrid approach that combines the best of Third Normal Form (3NF) and dimensional modeling. This data modeling technique enables historical storage of data coming into the database by separating it into three main components: Hub, Satellite, and Link tables.
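To make the three components concrete, here is a minimal sketch of splitting one source record into Hub, Link, and Satellite rows. The table and column names (member_hk, mysql.orders, etc.) are illustrative assumptions, not Mapan’s actual schema:

```python
import hashlib
from datetime import datetime, timezone

def hash_key(*business_keys: str) -> str:
    """Deterministic surrogate key derived from one or more business keys."""
    return hashlib.md5("|".join(business_keys).encode()).hexdigest()

def vault_rows(order: dict, record_source: str = "mysql.orders"):
    """Split one source record into Hub, Link, and Satellite rows."""
    load_ts = datetime.now(timezone.utc).isoformat()
    # Hubs hold only the business keys
    hub_member = {"member_hk": hash_key(order["member_id"]),
                  "member_id": order["member_id"],
                  "load_ts": load_ts, "record_source": record_source}
    hub_order = {"order_hk": hash_key(order["order_id"]),
                 "order_id": order["order_id"],
                 "load_ts": load_ts, "record_source": record_source}
    # The Link relates the two Hubs; the Satellite holds descriptive, historized attributes
    link = {"member_order_hk": hash_key(order["member_id"], order["order_id"]),
            "member_hk": hub_member["member_hk"], "order_hk": hub_order["order_hk"],
            "load_ts": load_ts, "record_source": record_source}
    satellite = {"order_hk": hub_order["order_hk"], "status": order["status"],
                 "amount": order["amount"], "load_ts": load_ts,
                 "record_source": record_source}
    return hub_member, hub_order, link, satellite
```

Because every row carries a load timestamp and record source, re-loading the same order with a new status appends a new Satellite row rather than overwriting history.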

(ii) Dimensional model is a data structure technique optimized for data warehousing tools. It is designed for reading, summarizing, and analyzing information, and is composed of Fact and Dimension tables.
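A tiny star schema shows the Fact/Dimension split and the typical read path. The tables and values here are invented for illustration, using an in-memory SQLite database:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Dimension: descriptive attributes of a member
CREATE TABLE dim_member (member_key INTEGER PRIMARY KEY, member_name TEXT, region TEXT);
-- Fact: measurable events, keyed to the dimension
CREATE TABLE fact_order (order_key INTEGER PRIMARY KEY, member_key INTEGER, amount REAL);
INSERT INTO dim_member VALUES (1, 'Sari', 'Jakarta'), (2, 'Budi', 'Bandung');
INSERT INTO fact_order VALUES (10, 1, 50000), (11, 1, 25000), (12, 2, 40000);
""")

# Summarize the fact table by a dimension attribute -- the typical star-schema query
rows = con.execute("""
SELECT d.region, SUM(f.amount)
FROM fact_order f JOIN dim_member d USING (member_key)
GROUP BY d.region ORDER BY d.region
""").fetchall()
print(rows)  # [('Bandung', 40000.0), ('Jakarta', 75000.0)]
```

Facts stay narrow and append-only while dimensions carry the human-readable context, which is what makes “read, summarize, analyze” queries cheap.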

(iii) Integration Layer is a combination of semantic, reporting, and analytical technologies.

Data Warehouse enables higher-level analytics, be it business intelligence or machine learning

In this third layer, we also have Data Health Check Collections, which contain anomalous data items that fail completeness, accuracy, or consistency checks in both the transactional and analytical data pipelines. How do we check these aspects?

Inspired by DAMA UK’s white paper on Data Quality Dimensions, Mapan’s Data Governance Team created Core Data Quality Dimensions as a measurement of data quality. These collections are used as parameters to monitor the quality of raw data by running a Data Quality script in the Data Lake environment.
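The three dimensions mentioned above can be sketched as a per-record check. The field names and rules are hypothetical examples, not the actual Core Data Quality Dimensions script:

```python
def quality_check(record: dict, required=("member_id", "amount", "created_at")) -> list:
    """Return a list of failed checks; a failing record would land in the
    Data Health Check Collections for follow-up."""
    issues = []
    # Completeness: every required field is present and non-empty
    for field in required:
        if record.get(field) in (None, ""):
            issues.append(f"completeness:{field}")
    # Accuracy: values fall within a plausible range
    amount = record.get("amount")
    if isinstance(amount, (int, float)) and amount < 0:
        issues.append("accuracy:amount")
    # Consistency: cross-field rules hold
    if record.get("status") == "paid" and record.get("paid_at") is None:
        issues.append("consistency:paid_at")
    return issues

print(quality_check({"member_id": "M1", "amount": -5, "created_at": "2019-08-06"}))
# ['accuracy:amount']
```

Running this over every record in the lake and collecting the non-empty results is exactly what turns quality dimensions into monitorable parameters.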

4 Because the Data Warehouse collects data from the entire business, it’s important to control who can access it. Additionally, querying data directly from the Data Warehouse is difficult for anyone unfamiliar with Data Warehouse concepts. So we decided to create the next layer on top of the Data Warehouse to transform information into specific insights: the Data Mart.

While a Data Warehouse is built to store data from the entire business, a Data Mart is built to fulfill the requests of data users. It is often seen as a small slice of the Data Warehouse. Therefore, we can use a Data Mart to isolate — or partition — a smaller set of data from the whole and provide data access to the end consumer.
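The “small slice” idea reduces to projecting a few columns and filtering a few rows. A minimal sketch, with made-up warehouse rows and an operations-mart example (note the privacy angle: the phone column is simply never projected into the mart):

```python
def build_mart(warehouse_rows, columns, predicate):
    """A Data Mart is a purpose-built slice of the warehouse:
    keep only the agreed columns and only the relevant rows."""
    return [{c: row[c] for c in columns} for row in warehouse_rows if predicate(row)]

warehouse = [
    {"member_id": 1, "region": "Jakarta", "amount": 50000, "phone": "08xx"},
    {"member_id": 2, "region": "Bandung", "amount": 40000, "phone": "08yy"},
]
# Hypothetical operations mart: Jakarta orders only, no personally identifiable phone column
ops_mart = build_mart(warehouse,
                      columns=("member_id", "amount"),
                      predicate=lambda r: r["region"] == "Jakarta")
print(ops_mart)  # [{'member_id': 1, 'amount': 50000}]
```

In practice this partitioning is usually expressed as views or scheduled queries rather than Python, but the shape of the operation is the same.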

How do we present this information to data users? We use Metabase and Tableau as analysis tools for business and operational users, making it easier to query and view data.

Now, who can access, and who is responsible for, each data layer?

Since we need to protect our Agents’ data privacy and develop a better way of working, Mapan’s Data Governance team created a data protection and sharing policy to ensure only the agreed parties can access the agreed data items within the agreed time frame.

This is how the workflow goes — 2019

Data User: Requests data items to perform operational or analytical tasks in their local environment. They can only access data based on the Point of Difference (POD) between roles, such as internal pods, like operational users, and external pods, like company-level.
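The sharing policy — agreed party, agreed items, agreed time frame — can be sketched as one predicate over a policy table. The pod names, items, and dates below are hypothetical placeholders, not Mapan’s real policy:

```python
from datetime import date

# Hypothetical policy entries, one per pod
POLICY = [
    {"pod": "operational", "items": {"orders", "deliveries"}, "expires": date(2019, 12, 31)},
    {"pod": "company",     "items": {"orders"},               "expires": date(2019, 9, 30)},
]

def can_access(pod: str, item: str, today: date) -> bool:
    """Grant access only to the agreed party, for the agreed item, within the agreed time frame."""
    return any(
        entry["pod"] == pod and item in entry["items"] and today <= entry["expires"]
        for entry in POLICY
    )

print(can_access("operational", "deliveries", date(2019, 8, 6)))  # True: agreed party, item, and time
print(can_access("company", "orders", date(2019, 10, 1)))         # False: outside the agreed time frame
```

Expiring grants by default means access must be re-agreed rather than silently accumulating.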

Data Engineer: Maintains and continuously improves the data infrastructure so it can deliver data from across our products to Data Lake storage.

Business Intelligence Engineer: Maintains ETL processes to improve the capacity of our Data Warehouse.

Data Science Engineer: Develops and implements data products, e.g. a regression engine, classification engine, or recommendation engine. They usually use tables in the Data Warehouse to build these products, but if they need raw or denormalized data, they can also access the Data Lake layer.
