What is a Data Warehouse - DataPlatform.gr

Author: Stratos Matzouranis

My journey began in 2008, when I officially started working in IT. From my first semester of studies I felt a particular attraction to databases and automation. I have worked with databases such as Microsoft SQL Server and Oracle Database, with data analysis, and with automation using the command line (CLI), Visual Basic for Applications and Python. Over the years I have developed these skills so that I can make my own life easier. To me, the goal of every IT professional and every office worker is to know their tools well enough to work a little but produce a lot. Through this website, DataPlatform.gr, I try to share knowledge and propose solutions to everyday problems.

In this article we will look at what a Data Warehouse is, what it is made of, and what it is used for.

In the age of IT, managing ever-increasing volumes of data is becoming more and more difficult. We want to make use of this data without slowing down our applications at the same time.

A Data Warehouse is a system used for data analysis. Data is collected through an ETL process (Extract, Transform, Load) from various sources, such as transactional databases (OLTP) and Big Data, into a database called staging. After cleansing (data quality) has been performed on the data, it is transferred to the Data Warehouse database itself (OLAP), organized into smaller entities (Data Marts). From there, the Data Warehouse is connected to reporting tools such as Power BI, Excel, Qlik, Tableau, etc., so that the information reaches the end user.
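
The flow from source to staging to warehouse can be sketched in a few lines of Python. This is only an illustration under assumptions: the sample rows and the `extract`, `cleanse` and `load` functions are hypothetical names, and a real warehouse would use a dedicated ETL tool and a database rather than in-memory lists.

```python
# Minimal ETL sketch: extract -> staging -> cleanse -> load.
# All names and sample rows are hypothetical, for illustration only.

def extract():
    """Pretend to read rows from an OLTP source."""
    return [
        {"order_id": 1, "product": " Laptop ", "amount": "1200.50"},
        {"order_id": 2, "product": "mouse", "amount": "25"},
        {"order_id": 2, "product": "mouse", "amount": "25"},    # duplicate row
        {"order_id": 3, "product": "Monitor", "amount": None},  # missing amount
    ]

def cleanse(rows):
    """Basic data-quality step: drop incomplete and duplicate rows, normalize values."""
    seen = set()
    clean = []
    for row in rows:
        if row["amount"] is None or row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        clean.append({
            "order_id": row["order_id"],
            "product": row["product"].strip().title(),
            "amount": float(row["amount"]),
        })
    return clean

def load(rows, warehouse):
    """Append cleansed rows to the warehouse table (a plain list here)."""
    warehouse.extend(rows)
    return warehouse

staging = extract()                      # 1. collect raw data in staging
warehouse = load(cleanse(staging), [])   # 2. cleanse, 3. load
for row in warehouse:
    print(row)
```

Only two of the four source rows reach the warehouse here: the duplicate order and the order without an amount are filtered out during cleansing.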

Figure: Data Warehouse overview (image from Wikipedia)

How is data distributed in a Data Warehouse?

The most popular data analysis technique uses multidimensional cubes, also called OLAP cubes (Online Analytical Processing).

There, our data is divided into dimensions, which can be time, product, geographical area, …, and into facts, where each cell contains a measure, which can be the number of sales made, the profit, the cost, etc. The aggregations (e.g. average / total / lowest sales) are also pre-calculated and stored when the cube is updated with new data, through an operation called "process".
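
To make the idea of pre-calculated aggregations concrete, here is a small Python sketch. It assumes three hypothetical dimensions (`year`, `product`, `region`) and a single `sales` measure, and it pre-calculates only totals. It is not how an OLAP engine is implemented internally, just an illustration of the principle.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical dimensions and fact rows, for illustration only.
DIMENSIONS = ("year", "product", "region")
facts = [
    {"year": 2023, "product": "Laptop", "region": "North", "sales": 10},
    {"year": 2023, "product": "Laptop", "region": "South", "sales": 5},
    {"year": 2023, "product": "Mouse", "region": "North", "sales": 40},
    {"year": 2024, "product": "Laptop", "region": "North", "sales": 12},
]

def process_cube(rows):
    """Pre-calculate the total of the 'sales' measure for every combination of dimensions."""
    cube = defaultdict(int)
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            for row in rows:
                # A cell is identified by the chosen dimensions and their values.
                cell = tuple((d, row[d]) for d in dims)
                cube[cell] += row["sales"]
    return dict(cube)

cube = process_cube(facts)

# Lookups are now plain dictionary reads; the fact rows are not scanned again.
print(cube[()])                                             # grand total: 67
print(cube[(("year", 2023),)])                              # total for 2023: 55
print(cube[(("product", "Laptop"), ("region", "North"))])   # Laptop in North: 22
```

The cost is paid once, when the cube is processed, so that queries afterwards only read stored values.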

How is data stored in a multidimensional cube?

The data of a multidimensional cube is stored in the form of either a Star Schema or a Snowflake Schema, but before we see in detail what these two forms mean, we should know the three terms below.

Fact tables

In the Fact table, the measures of specific facts are recorded, such as the number of sales made, the cost and the profit. It also includes foreign keys, which link it to the Dimension tables.

To ensure the uniqueness of each record over time, since changes may have been made in the data source, a unique number called a Surrogate key is defined as the primary key.

Dimension tables

In the Dimension tables we have the descriptive data that may be shared by the measures in the Fact tables, such as time, employee, product and store.

A Surrogate key is used in them as well, to ensure the uniqueness of the records.
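
The role of the Surrogate key can be illustrated with a short Python sketch. It assumes a hypothetical store dimension in which a store changes city in the source system. Adding a new row with a new key on every change is one common approach (often called a type 2 slowly changing dimension), shown here as an assumption rather than as the only way to do it.

```python
from itertools import count

# Hypothetical dimension table for stores; every name here is illustrative.
store_dim = []              # rows of the dimension table
next_key = count(start=1)   # generator of surrogate keys

def upsert_store(business_id, city):
    """Return the surrogate key of the current version of a store.

    If the store's attributes changed in the source, a new row with a new
    surrogate key is added, so older facts still point to the old version.
    """
    current = [r for r in store_dim
               if r["business_id"] == business_id and r["is_current"]]
    if current and current[0]["city"] == city:
        return current[0]["store_key"]
    for r in current:
        r["is_current"] = False
    row = {"store_key": next(next_key), "business_id": business_id,
           "city": city, "is_current": True}
    store_dim.append(row)
    return row["store_key"]

k1 = upsert_store("S-01", "Athens")         # first load
k2 = upsert_store("S-01", "Athens")         # nothing changed: same key
k3 = upsert_store("S-01", "Thessaloniki")   # source changed: new key, old row kept
print(k1, k2, k3)       # 1 1 2
print(len(store_dim))   # 2
```

The business identifier `S-01` appears twice in the dimension, but each version has its own surrogate key, so every record stays unique.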

Data Marts

Each separate subject area, for example finance or sales, is called a Data Mart and contains its own Fact table along with its Dimension tables.

Star Schema

In a multidimensional Data Warehouse, the simplest form of a Data Mart is a Star Schema. Each Dimension table is linked directly to the Fact table through a foreign key.
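
A Star Schema can be sketched with Python's built-in `sqlite3` module. The table and column names (`fact_sales`, `dim_product`, `dim_store`) and the sample rows are assumptions made for the example, not taken from a real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_store   (store_key   INTEGER PRIMARY KEY, city TEXT);
-- The fact table holds the measures plus one foreign key per dimension.
CREATE TABLE fact_sales (
    sales_key   INTEGER PRIMARY KEY,
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key   INTEGER REFERENCES dim_store(store_key),
    quantity    INTEGER,
    profit      REAL
);
INSERT INTO dim_product VALUES (1, 'Laptop'), (2, 'Mouse');
INSERT INTO dim_store   VALUES (1, 'Athens'), (2, 'Patras');
INSERT INTO fact_sales  VALUES
    (1, 1, 1, 2, 300.0),
    (2, 2, 1, 5, 20.0),
    (3, 1, 2, 1, 150.0);
""")

# Each dimension is exactly one join away from the fact table.
rows = conn.execute("""
SELECT p.name, s.city, SUM(f.quantity), SUM(f.profit)
FROM fact_sales f
JOIN dim_product p ON p.product_key = f.product_key
JOIN dim_store   s ON s.store_key   = f.store_key
GROUP BY p.name, s.city
ORDER BY p.name, s.city
""").fetchall()
for row in rows:
    print(row)
```

The query returns one row per product and city with the summed quantity and profit.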

Snowflake Schema

The Snowflake Schema is a more complex variant of the Star Schema. The difference is that the Dimension tables are normalized into smaller sub-tables. This reduces redundancy in the dimension data, but queries usually need more joins to retrieve the details, so it is generally preferred when storage savings and consistency of the dimension data matter more than the speed of data retrieval.
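
A normalized dimension can be sketched in the same way with Python's built-in `sqlite3` module. In this hypothetical example the product dimension is split so that the category lives in its own sub-table; all table names and rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The product dimension is normalized: the category moves to its own sub-table.
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    name         TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);
CREATE TABLE fact_sales (
    sales_key   INTEGER PRIMARY KEY,
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER
);
INSERT INTO dim_category VALUES (1, 'Computers'), (2, 'Accessories');
INSERT INTO dim_product  VALUES (1, 'Laptop', 1), (2, 'Mouse', 2), (3, 'Keyboard', 2);
INSERT INTO fact_sales   VALUES (1, 1, 2), (2, 2, 5), (3, 3, 4);
""")

# Reaching the category takes two joins: fact -> product -> category.
totals = conn.execute("""
SELECT c.category, SUM(f.quantity)
FROM fact_sales f
JOIN dim_product  p ON p.product_key  = f.product_key
JOIN dim_category c ON c.category_key = p.category_key
GROUP BY c.category
ORDER BY c.category
""").fetchall()
print(totals)   # [('Accessories', 9), ('Computers', 2)]
```

Each category name is stored only once, at the price of one extra join per query that needs it.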

Advantages of using a Data Warehouse:

It combines data from many different sources, so the data is easy to extract with a query.
It does not cause blocking in the production OLTP databases, since the data has been copied to the Data Warehouse infrastructure.
It provides historical data over time, even if changes have been made in the OLTP database, thanks to the use of the Surrogate key.
It offers clean data: information that could lead to wrong conclusions is removed, and wrong information, e.g. typographical errors, can be corrected.
It offers high performance even in complex data analysis queries.
