Estimates suggest that by 2025, there will be more than 150 zettabytes – that’s 150 trillion gigabytes – of unanalysed data. Much of this data will be unstructured, compelling almost every business to formulate a big data strategy. This will require them to radically rethink data storage and analysis.
“Big data” has become a somewhat ubiquitous term since it was popularised in the mid-2000s. Big data is information that’s characterised by the four V’s of big data:
- Volume: unstructured data, such as sound and image, occupies more space than conventional data.
- Variety: comprises structured, unstructured and semi-structured data.
- Velocity: information is generated at a very high speed.
- Veracity: the assumption that with higher volumes of data we will be closer to the “truth”.
These conditions have far-reaching implications for data analysis. Artificial intelligence has significantly advanced the way we analyse big data. Now, we have gone beyond simple hypothesis and query analytics to be able to actually explore data, detect trends, and make forecasts to facilitate better decision-making. However, the environments required to handle these data are growing more complex.
As the era of big data advances, traditional data warehouses have been found wanting.
Gartner estimated that 60% of data warehouse implementations have been found to have limited acceptance or failed entirely five years ago. It goes without saying that seven years down the line the situation may be even more critical still.
However, we shouldn’t be quick to discard years of progress and technological development. Instead, we should apply existing knowledge to consider more evolved data storage and analysis solutions. Then, we can ensure insights translate to profits. This is what we’ll explore here.
Table of Contents
The dawn of the data warehouse
In the beginning, there was the data warehouse. The data warehouse was fed by the database, which was the repository of all the organisation’s data. There can sometimes be some confusion between a database and a data warehouse. For those outside of data disciplines, the key differences are summarised in the table below. The list isn’t exhaustive, but it provides a basic overview.
|Designed to capture data.||Designed to analyse data.|
|An amalgamation of related data.||An information system that contains data from one or several sources.|
|Uses simple transaction queries.||Complex queries are used for analysis.|
|An application-oriented collection of data.||A subject-oriented collection of data.|
|Contains detailed data.||Presents a summary of available data.|
In the database, the data points are explored and organised into tables and columns so they can be transferred to the data warehouse. Within the data warehouse, the data is further organised and standardised depending on how they relate to each other. How this is done will be dictated by the data model.
An optimised data model will create logical datasets that are relevant and easy to retrieve. The data warehouse enables companies to rationalise their data storage to ensure the most relevant information is retained for analysis. This is crucial to generating insights and complying with regulations.
What are the different types of data models?
Since its inception, the data warehouse has commonly used two different data models: one designed by Bill Inmon, the “father of data warehousing” and Ralph Kimball, another eminent Data Architect. The key difference between these two models is that Kimball recommends a bottom-up approach and Inmon a top-down approach.
Which model generates the best results is still an ongoing discussion among Data Architects. Without going into excessive detail – as of course there are many technical differences between their approaches – the key distinction is what happens to the data after it passes through the staging environment to the data warehouse.
When the data arrives in the staging environment from the database, an Extract Transform Load (ETL) tool will process the data. According to Inmon, the data should then be fed directly into the data warehouse before being broken down into data marts that serve each business unit. Kimball, by contrast, says that data should be loaded into data marts before being unified (conceptually) as the data warehouse.
The debate as to which approach is better is neatly summarised by contrasting statements from the two Data Architects. Kimball states, “The data warehouse is nothing more than the union of all data marts”. Inmon responds with a metaphor: “You can catch all the minnows in the ocean and stack them together – they still do not make a whale”. The pros and cons of each approach are summarised below:
|Data warehouse acts as a unified source of “truth” for the whole organisation.||More expensive and time-consuming to implement.|
|More flexible to changing business requirements and modifications to data sources.||Requires a high level of expertise to set up and run.|
|Data normalisation avoids redundancy, so analytic processes are less prone to errors.||Separating data marts from the data warehouse requires more ETL processes thus a greater engineering overhead.|
|Simpler and faster to design and set up.||Less flexible to change and adapts more slowly.|
|Enables rapid data retrieval, as data is segregated into fact tables and dimensions.||The initial division of data into data marts means there is no “single source of truth”.|
|The dimensional data model is easier to understand.||The dimensional model creates more data redundancy and therefore a higher chance of errors.|
|The bottom-up approach makes all data integrations relevant to business needs.||The approach is confined to business processes so it does not offer a complete view of enterprise data.|
Big data shakes up data storage
After years of development and debate in the data warehousing field, big data came along. Harvested from unstructured and semi-structured sources like sound, images or social media posts, this information disrupted data storage as we knew it. This is because unstructured data can’t be nearly organised into rows and columns to be fed into the data warehouse.
This presented Data Architects with a significant conundrum. These data were undoubtedly packed with useful insights, but how could they design an environment to effectively analyse this information? The architecture had to be manageable, navigable, and cost-effective, without compromising the richness of this sheer volume of information.
After years of finessing the data warehouse, they created a new concept: the data vault. If you’ve been researching business intelligence and data storage options, you’ve likely already come across products like Azure Data Vault. But what is Azure Data Vault? And how are products like this different from a data warehouse?
What is a data vault?
In essence, data storage solutions like Azure Data Vault store raw data. Data transformation happens on-demand, as opposed to relying on pre-processing, which is done by tweaking the ETL process. Although traditional ETL processing makes sure all the data is clean, relevant and predictable, it’s not suitable for all scenarios – namely the handling of big data.
In traditional data warehousing, the transform phase manipulates raw data to fit the warehouse schema or business rules. The load phase then transfers the data in its standardised format into the data warehouse. Then, when it’s needed, the user accesses the data directly.
By contrast, data vault modelling will extract raw data from the source as normal, but then only perform a basic transformation, such as erasing corrupt values. Then, the data will be loaded into the raw data vault. The system will apply more complex transformations according to business rules on demand. The output will then be transferred to the business vault where users access data through data marts.
Advantages of the data vault model
- Adds flexibility to the data integration process.
- Retains raw data, in case there is other useful unexplored information.
- Simplifies transformations as the approach is more responsive, as opposed to attempting to create one-size-fits-all schemas.
- Business rules and data are decoupled, making updates and new transformations easier to apply.
- Different departments can apply different transformations according to their needs.
The ultimate data storage solution?
Although the data vault model may appear to crack the data storage code for the big data era, it doesn’t come without its drawbacks. Ad hoc processing can impact overall performance, while multiple versions of transformations and datasets can lead to bloated data architecture. This can drive up data storage costs.
Furthermore, implementing the solution requires greater expertise. Whereas traditional data warehouses use no-code ETL, a company will need to hire a knowledgeable Data Architect to get a project off the ground. Overall, it’s likely to cost more – but ultimately, this will have to be weighed up against the return on investment.
At the end of the day, the business needs to define what they want to know. If this knowledge resides in structured data, we need an ordering principle; something that can help us to store and find the information we need as fast as possible. In these scenarios, perhaps repeatedly restructuring the data is unnecessary.
However, an interesting proposition is the implementation of a data vault 2.0 of sorts. The data vault model has features in common with Kimball’s model. Both deploy dimensional models, which subsequently, can be layered. This means that if a company has an existing data warehouse, there’s no need to start over.
Besides, there’s likely to be a wealth of structured data from ERP, CRM and the majority of common software ready and waiting for analysis. Instead, they can leverage a “loose coupling” data model and attempt to reap the best of both worlds.
Turn to experts to get the right solution
The crux of the issue is that we need to analyse the business to determine the best fit. Sometimes, the most advanced solution won’t be the best for the business’s needs. Instead, a more traditional approach to data warehousing could save time and monetary resources.
On the other hand, deploying the most technically advanced solution could represent long-term gains. Implementing a data vault, or a combination of a data vault and warehousing system, could reveal never before seen insights. This could translate into more efficient operations, better decision-making, and enhanced customer service.
To help you analyse the business’s data storage needs, you should enlist the help of a Data Architect. They will be able to evaluate the system, make recommendations, and supervise the installation of the solution. There are numerous experienced Big Data Architects in the Outvise network. Wherever your organisation is based, there will be an Outvise expert ready to work onsite or remotely.
The correct solution will always be the one that generates more profit. Get the expertise you need to create synergy between tech and your business goals.
9 years working as BI/Analytics Professional. He has experience developing several models for Telco's like retention, segmentation, time series, etc. He develops a risk model for the National Custom to detect risk in the imported goods using Neural Networks with SAS.
5 years developing from scratch the Data Warehouse and the BI solutions for a local bank using SAS - ETLs, Cubes, Reports.