A data warehouse is a centralized repository of integrated data from one or more disparate sources. Data warehouses store current and historical data and are used for reporting and analysis of the data.
To move data into a data warehouse, data is periodically extracted from various sources that contain important business information. As the data is moved, it can be formatted, cleaned, validated, summarized, and reorganized. Alternatively, the data can be stored in the lowest level of detail, with aggregated views provided in the warehouse for reporting. In either case, the data warehouse becomes a permanent data store for reporting, analysis, and business intelligence (BI).
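As a minimal sketch of the extract, clean, validate, and summarize steps described above, the following Python uses the standard-library `sqlite3` module as a stand-in for both the source system and the warehouse. The table names (`orders`, `daily_sales`), the columns, and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3


def run_etl(source: sqlite3.Connection, warehouse: sqlite3.Connection) -> int:
    """Extract order rows, clean and validate them, then load a daily summary."""
    # Extract: read raw rows from the (hypothetical) operational table.
    rows = source.execute("SELECT order_date, region, amount FROM orders").fetchall()

    # Transform: normalize text, drop invalid rows, and summarize per day and region.
    totals = {}
    for order_date, region, amount in rows:
        if order_date is None or amount is None or amount < 0:
            continue  # validation: skip incomplete or negative records
        key = (order_date.strip(), (region or "UNKNOWN").strip().upper())
        totals[key] = totals.get(key, 0.0) + float(amount)

    # Load: write the summarized rows into the warehouse table.
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS daily_sales ("
        "order_date TEXT, region TEXT, total_amount REAL, "
        "PRIMARY KEY (order_date, region))"
    )
    warehouse.executemany(
        "INSERT OR REPLACE INTO daily_sales VALUES (?, ?, ?)",
        [(d, r, t) for (d, r), t in sorted(totals.items())],
    )
    warehouse.commit()
    return len(totals)


def make_sample_source() -> sqlite3.Connection:
    """Build an in-memory source database with a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_date TEXT, region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [
            ("2024-01-01", " west ", 10.0),
            ("2024-01-01", "West", 5.0),
            ("2024-01-01", "east", 7.5),
            ("2024-01-02", None, 3.0),
            ("2024-01-02", "east", -1.0),  # fails validation
        ],
    )
    return conn
```

The alternative mentioned above, storing the lowest level of detail and exposing aggregated views, would move the summarization out of the load step and into the warehouse itself.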
Data warehouse architectures
The following reference architectures show end-to-end data warehouse architectures on Azure:
- Enterprise BI in Azure with Azure Synapse Analytics. This reference architecture implements an extract, load, and transform (ELT) pipeline that moves data from an on-premises SQL Server database into Azure Synapse.
- Automated enterprise BI with Azure Synapse and Azure Data Factory. This reference architecture shows an ELT pipeline with incremental loading, automated using Azure Data Factory.
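Incremental loading, as mentioned in the second architecture, is often built around a high-water mark: each run copies only the rows modified since the previous run. The sketch below illustrates that general technique with the standard-library `sqlite3` module. It is not the Azure Data Factory pipeline from the reference architecture, and the `sales` and `load_watermark` tables and the `modified_at` column are assumptions made for the example.

```python
import sqlite3


def incremental_load(source: sqlite3.Connection, warehouse: sqlite3.Connection) -> int:
    """Copy only the rows changed since the last recorded watermark."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(id INTEGER PRIMARY KEY, amount REAL, modified_at TEXT)"
    )
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS load_watermark "
        "(table_name TEXT PRIMARY KEY, last_modified TEXT)"
    )

    # Read the previous watermark; an empty string sorts before any ISO timestamp.
    row = warehouse.execute(
        "SELECT last_modified FROM load_watermark WHERE table_name = 'sales'"
    ).fetchone()
    watermark = row[0] if row else ""

    # Extract only the rows modified after the watermark.
    changed = source.execute(
        "SELECT id, amount, modified_at FROM sales "
        "WHERE modified_at > ? ORDER BY modified_at",
        (watermark,),
    ).fetchall()
    if not changed:
        return 0

    # Upsert the changed rows, then advance the watermark to the newest timestamp.
    warehouse.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", changed)
    warehouse.execute(
        "INSERT OR REPLACE INTO load_watermark VALUES ('sales', ?)",
        (changed[-1][2],),
    )
    warehouse.commit()
    return len(changed)
```

Calling `incremental_load` twice in a row without source changes copies nothing on the second call, which is the property that keeps repeated scheduled runs cheap.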
When to use this solution
Choose a data warehouse when you need to turn massive amounts of data from operational systems into a format that is easy to understand. Data warehouses don't need to follow the same terse data structure you may be using in your OLTP databases. You can use column names that make sense to business users and analysts, restructure the schema to simplify relationships, and consolidate several tables into one. These steps help guide users who need to create reports and analyze the data in BI systems, without the help of a database administrator (DBA) or data developer.
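To illustrate the idea of renaming terse OLTP columns and consolidating tables, here is a small sketch using the standard-library `sqlite3` module. The source tables `cust` and `ord`, their abbreviated column names, and the `customer_orders` view are all hypothetical; a real warehouse would more likely materialize this shape as a table during loading.

```python
import sqlite3


def build_reporting_view(conn: sqlite3.Connection) -> None:
    """Expose two terse OLTP tables as one view with business-friendly names."""
    conn.execute(
        """
        CREATE VIEW IF NOT EXISTS customer_orders AS
        SELECT o.o_id  AS order_id,
               c.c_nm  AS customer_name,
               o.o_amt AS order_amount
        FROM ord AS o
        JOIN cust AS c ON c.c_id = o.c_id
        """
    )


def make_oltp_sample() -> sqlite3.Connection:
    """Create an in-memory database that mimics a terse OLTP schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cust (c_id INTEGER PRIMARY KEY, c_nm TEXT)")
    conn.execute("CREATE TABLE ord (o_id INTEGER PRIMARY KEY, c_id INTEGER, o_amt REAL)")
    conn.execute("INSERT INTO cust VALUES (1, 'Contoso')")
    conn.executemany("INSERT INTO ord VALUES (?, ?, ?)", [(100, 1, 42.0), (101, 1, 8.0)])
    return conn
```

An analyst can then query `customer_orders` directly without knowing how `cust` and `ord` relate.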
Consider using a data warehouse when you need to keep historical data separate from the source transaction systems for performance reasons. Data warehouses make it easy to access historical data from multiple locations, by providing a centralized location using common formats, keys, and data models.
Because data warehouses are optimized for read access, generating reports is faster than using the source transaction system for reporting.
Other benefits include:
- The data warehouse can store historical data from multiple sources, representing a single source of truth.
- You can improve data quality by cleaning up data as it is imported into the data warehouse.
- Reporting tools don't compete with the transactional systems for query processing cycles. A data warehouse allows the transactional system to focus on handling writes, while the data warehouse satisfies the majority of read requests.
- A data warehouse can consolidate data from different software.
- Data mining tools can find hidden patterns in the data using automatic methodologies.
- Data warehouses make it easier to provide secure access to authorized users, while restricting access to others. Business users don't need access to the source data, removing a potential attack vector.
- Data warehouses make it easier to create business intelligence solutions, such as OLAP cubes.
Properly configuring a data warehouse to fit the needs of your business can bring some of the following challenges:
- Committing the time required to properly model your business concepts. Data warehouses are information driven. You must standardize business-related terms and common formats, such as currency and dates. You also need to restructure the schema in a way that makes sense to business users but still ensures accuracy of data aggregates and relationships.
- Planning and setting up your data orchestration. Consider how to copy data from the source transactional system to the data warehouse, and when to move historical data from operational data stores into the warehouse.
- Maintaining or improving data quality by cleaning the data as it is imported into the warehouse.
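Standardizing common formats such as dates and currency can be sketched as a pair of small helper functions. The accepted date formats, the day-first reading of slash-separated dates, and the fixed exchange rates below are assumptions for illustration only; real rates and source formats would come from your own systems.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical source formats and fixed exchange rates, for illustration only.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")
RATES_TO_USD = {"USD": Decimal("1"), "EUR": Decimal("1.10")}


def standardize_date(raw: str) -> str:
    """Return the date as ISO 8601 (YYYY-MM-DD), trying each known source format."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")


def standardize_amount(amount: str, currency: str) -> Decimal:
    """Convert an amount to USD and round it to two decimal places."""
    rate = RATES_TO_USD[currency.upper()]
    return (Decimal(amount) * rate).quantize(Decimal("0.01"))
```

Using `Decimal` rather than floating point avoids rounding surprises when monetary values are later aggregated.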
Data warehousing in Azure
You may have one or more sources of data, whether from customer transactions or business applications. This data is traditionally stored in one or more OLTP databases. The data could be persisted in other storage mediums such as network shares, Azure Storage Blobs, or a data lake. The data could also be stored by the data warehouse itself or in a relational database such as Azure SQL Database. The purpose of the analytical data store layer is to satisfy queries issued by analytics and reporting tools against the data warehouse. In Azure, this analytical store capability can be met with Azure Synapse, or with Azure HDInsight using Hive or Interactive Query. In addition, you will need some level of orchestration to move or copy data from data storage to the data warehouse, which can be done using Azure Data Factory or Oozie on Azure HDInsight.
There are several options for implementing a data warehouse in Azure, depending on your needs. The following lists are broken into two categories, symmetric multiprocessing (SMP) and massively parallel processing (MPP).
SMP:
- Azure SQL Database
- SQL Server in a virtual machine
MPP:
- Azure Synapse Analytics (formerly Azure SQL Data Warehouse)
- Apache Hive on HDInsight
- Interactive Query (Hive LLAP) on HDInsight
As a general rule, SMP-based warehouses are best suited for small to medium data sets (up to 4-100 TB), while MPP is often used for big data. The delineation between small/medium and big data partly has to do with your organization's definition and supporting infrastructure. (See Choosing an OLTP data store.)
Beyond data sizes, the type of workload pattern is likely to be a greater determining factor. For example, complex queries may be too slow for an SMP solution, and require an MPP solution instead. MPP-based systems usually have a performance penalty with small data sizes, because of how jobs are distributed and consolidated across nodes. If your data sizes already exceed 1 TB and are expected to continually grow, consider selecting an MPP solution. However, if your data sizes are smaller, but your workloads are exceeding the available resources of your SMP solution, then MPP may be your best option as well.
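The rules of thumb in this paragraph can be written down as a small decision function. This is only a sketch of the guidance above, not a sizing tool; the function name and parameters are hypothetical, and the 1 TB threshold is the one quoted in the text.

```python
def suggest_architecture(
    data_size_tb: float,
    expected_to_grow: bool,
    smp_resources_exceeded: bool,
) -> str:
    """Return "SMP" or "MPP" following the rules of thumb in the text."""
    # Workloads that already exhaust an SMP system point to MPP regardless of size.
    if smp_resources_exceeded:
        return "MPP"
    # Data that exceeds 1 TB and keeps growing also points to MPP.
    if data_size_tb > 1 and expected_to_grow:
        return "MPP"
    # Otherwise SMP avoids the overhead MPP incurs on small data sets.
    return "SMP"
```

For example, a 0.5 TB warehouse whose queries already saturate its SMP server still comes out as "MPP", which matches the last sentence of the paragraph.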
The data accessed or stored by your data warehouse could come from a number of data sources, including a data lake, such as Azure Data Lake Storage. For a video session that compares the different strengths of MPP services that can use Azure Data Lake, see Azure Data Lake and Azure Data Warehouse: Applying Modern Practices to Your App.
SMP systems are characterized by a single instance of a relational database management system sharing all resources (CPU/Memory/Disk). You can scale up an SMP system. For SQL Server running on a VM, you can scale up the VM size. For Azure SQL Database, you can scale up by selecting a different service tier.
MPP systems can be scaled out by adding more compute nodes (which have their own CPU, memory, and I/O subsystems). There are physical limitations to scaling up a server, at which point scaling out is more desirable, depending on the workload. However, the differences in querying, modeling, and data partitioning mean that MPP solutions require a different skill set.
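The scale-out model can be illustrated with a toy example: rows are hash-distributed across nodes, each node aggregates its own slice, and the partial results are combined. This is a conceptual sketch in plain Python, not how any particular MPP engine is implemented; the function names and the choice of CRC32 as the hash are assumptions.

```python
import zlib
from collections import Counter


def node_for(value, node_count: int) -> int:
    """Pick a node with a deterministic hash of the distribution key."""
    return zlib.crc32(str(value).encode("utf-8")) % node_count


def distribute(rows, key: str, node_count: int):
    """Assign each row to a node based on its distribution key."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        nodes[node_for(row[key], node_count)].append(row)
    return nodes


def distributed_sum(nodes, group_key: str, value_key: str) -> dict:
    """Aggregate on each node independently, then merge the partial results."""
    partials = []
    for node_rows in nodes:  # in a real system these loops run in parallel
        local = Counter()
        for row in node_rows:
            local[row[group_key]] += row[value_key]
        partials.append(local)

    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values per key
    return dict(total)
```

The extra distribute-and-merge steps are the overhead that makes MPP a poor fit for small data sets, and the choice of distribution key is one of the modeling skills the paragraph above refers to.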
When deciding which SMP solution to use, see A closer look at Azure SQL Database and SQL Server on Azure VMs.
Azure Synapse (formerly Azure SQL Data Warehouse) can also be used for small and medium datasets, where the workload is compute and memory intensive. Read more about Azure Synapse patterns and common scenarios:
- Azure SQL Data Warehouse Workload Patterns and Anti-Patterns
- Azure SQL Data Warehouse loading patterns and strategies
- Migrating data to Azure SQL Data Warehouse in practice
- Common ISV application patterns using Azure SQL Data Warehouse
Key selection criteria
To narrow the choices, start by answering these questions:
- Do you want a managed service rather than managing your own servers?
- Are you working with extremely large data sets or highly complex, long-running queries? If yes, consider an MPP option.
- For a large data set, is the data source structured or unstructured? Unstructured data may need to be processed in a big data environment such as Spark on HDInsight, Azure Databricks, Hive LLAP on HDInsight, or Azure Data Lake Analytics. All of these can serve as ELT (Extract, Load, Transform) and ETL (Extract, Transform, Load) engines. They can output the processed data into structured data, making it easier to load into Azure Synapse or one of the other options. For structured data, Azure Synapse has a performance tier called Optimized for Compute, for compute-intensive workloads requiring ultra-high performance.
- Do you want to separate your historical data from your current, operational data? If so, select one of the options where orchestration is required. These are standalone warehouses optimized for heavy read access, and are best suited as a separate historical data store.
- Do you need to integrate data from several sources, beyond your OLTP data store? If so, consider options that easily integrate multiple data sources.
- Do you have a multitenancy requirement? If so, Azure Synapse is not ideal for this requirement. For more information, see Azure Synapse Patterns and Anti-Patterns.
- Do you prefer a relational data store? If so, choose an option with a relational data store, but also note that you can use a tool like PolyBase to query non-relational data stores if needed. If you decide to use PolyBase, however, run performance tests against your unstructured data sets for your workload.
- Do you have real-time reporting requirements? If you require rapid query response times on high volumes of singleton inserts, choose an option that supports real-time reporting.
- Do you need to support a large number of concurrent users and connections? The ability to support a number of concurrent users/connections depends on several factors.
  - For Azure SQL Database, refer to the documented resource limits based on your service tier.
  - SQL Server allows a maximum of 32,767 user connections. When running on a VM, performance will depend on the VM size and other factors.
  - Azure Synapse has limits on concurrent queries and concurrent connections. For more information, see Concurrency and workload management in Azure Synapse. Consider using complementary services, such as Azure Analysis Services, to overcome limits in Azure Synapse.
- What sort of workload do you have? In general, MPP-based warehouse solutions are best suited for analytical, batch-oriented workloads. If your workloads are transactional by nature, with many small read/write operations or multiple row-by-row operations, consider using one of the SMP options. One exception to this guideline is when using stream processing on an HDInsight cluster, such as Spark Streaming, and storing the data within a Hive table.
The following tables summarize the key differences in capabilities.
| Capability | Azure SQL Database | SQL Server (VM) | Azure Synapse | Apache Hive on HDInsight | Hive LLAP on HDInsight |
|---|---|---|---|---|---|
| Is managed service | Yes | No | Yes | Yes [1] | Yes [1] |
| Requires data orchestration (holds copy of data/historical data) | No | No | Yes | Yes | Yes |
| Easily integrate multiple data sources | No | No | Yes | Yes | Yes |
| Supports pausing compute | No | No | Yes | No [2] | No [2] |
| Relational data store | Yes | Yes | Yes | No | No |
| Flexible backup restore points | Yes | Yes | No [3] | Yes [4] | Yes [4] |
[1] Manual configuration and scaling.
[2] HDInsight clusters can be deleted when not needed, and then re-created. Attach an external data store to your cluster so your data is retained when you delete your cluster. You can use Azure Data Factory to automate your cluster's lifecycle by creating an on-demand HDInsight cluster to process your workload, then delete it once the processing is complete.
[3] With Azure Synapse, you can restore a database to any available restore point within the last seven days. Snapshots start every four to eight hours and are available for seven days. When a snapshot is older than seven days, it expires and its restore point is no longer available.
[4] Consider using an external Hive metastore that can be backed up and restored as needed. Standard backup and restore options that apply to Blob Storage or Data Lake Storage can be used for the data, or third-party HDInsight backup and restore solutions, such as Imanis Data can be used for greater flexibility and ease of use.
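The seven-day retention rule described in the Azure Synapse restore-point note above can be modeled with a short function. This sketch only mirrors the rule as stated in the text and does not call any Azure API; the function name and the list of snapshot timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Retention window taken from the note above: snapshots are available for seven days.
RETENTION = timedelta(days=7)


def available_restore_points(snapshots, now: datetime):
    """Return the snapshots that are not older than the retention window."""
    cutoff = now - RETENTION
    # Keep snapshots taken at or after the cutoff and not in the future.
    return sorted(s for s in snapshots if cutoff <= s <= now)
```

A snapshot taken exactly seven days ago is kept here, because the note says a restore point expires only once it is older than seven days.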
| Capability | Azure SQL Database | SQL Server (VM) | Azure Synapse | Apache Hive on HDInsight | Hive LLAP on HDInsight |
|---|---|---|---|---|---|
| Redundant regional servers for high availability | Yes | Yes | Yes | No | No |
| Supports query scale out (distributed queries) | No | No | Yes | Yes | Yes |
| Dynamic scalability | Yes | No | Yes [1] | No | No |
| Supports in-memory caching of data | Yes | Yes | Yes | Yes | Yes |
[1] Azure Synapse allows you to scale up or down by adjusting the number of data warehouse units (DWUs). See Manage compute power in Azure Synapse.
| Capability | Azure SQL Database | SQL Server in a virtual machine | Azure Synapse | Apache Hive on HDInsight | Hive LLAP on HDInsight |
|---|---|---|---|---|---|
| Authentication | SQL / Azure Active Directory (Azure AD) | SQL / Azure AD / Active Directory | SQL / Azure AD | local / Azure AD [1] | local / Azure AD [1] |
| Data encryption at rest | Yes [2] | Yes [2] | Yes [2] | Yes [2] | Yes [1] |
| Row-level security | Yes | Yes | Yes | No | Yes [1] |
| Supports firewalls | Yes | Yes | Yes | Yes | Yes [3] |
| Dynamic data masking | Yes | Yes | Yes | No | Yes [1] |
[1] Requires using a domain-joined HDInsight cluster.
[2] Requires using Transparent Data Encryption (TDE) to encrypt and decrypt your data at rest.
[3] Supported when used within an Azure Virtual Network.
This article is maintained by Microsoft. It was originally written by the following contributors.
- Zoiner Tejada | CEO and Architect
Read more about securing your data warehouse:
- Securing your SQL Database
- Secure a database in Azure Synapse
- Extend Azure HDInsight using an Azure Virtual Network
- Enterprise-level Hadoop security with domain-joined HDInsight clusters
- Enterprise BI in Azure with Azure Synapse Analytics
- Automated enterprise BI with Azure Synapse and Azure Data Factory
- Logical data warehouse with Azure Synapse serverless SQL pools
- Enterprise data warehouse