技术见解 / Data warehouse

What is a data warehouse?

A data warehouse is a type of data management system that enables today’s business intelligence initiatives. The intention of the data warehouse is to ingest data, and then organize and manage the data in such a way that enables data engineers, data scientists, and key stakeholders to leverage business intelligence tools to gain important insights and make meaningful business decisions.

In a common data warehouse design pattern, data is sourced from relational databases that power business tools, such as CRMs, ERPs, or flat files and aggregates, and stores the data in a uniform manner that’s optimized for business intelligence operations.


Benefits of having a data warehouse

In today’s hyper-competitive, data-driven business environment, making more intelligent business decisions is a differentiating factor that influences the long-term success of companies. Data warehouses play a critical role in an organization’s ability to stay competitive and let data play a driving force in how they develop new products, market to customers, or foster collaboration between business units. Key elements include:

  • Business intelligence: By aggregating data in a way that’s easily accessible to business units and optimized for business intelligence tools, organizations can make more sophisticated data-driven decisions.
  • Maintaining data: Unlike an operational database, where data can be overwritten to maintain the most up-to-date information, a data warehouse maintains copies of data entries, preserving historical data, which is a key benefit of the data warehouse.
  • Increasing revenue: Building on the value of business intelligence, data warehouses deliver standardized data on a variety of business operations. Clear, insightful data on business trends, processes, and operations leads to measurable and sustainable increased revenue.
  • Intelligent forecasting: With in-depth business data, scientists can analyze and forecast more effectively, assess the historical performance of any organization, and report key insights backed with undeniable information.

Data lake vs. data warehouse

Data lakes and data warehouses are both effective management systems for storing and managing data. In reality, however, they provide uniquely different value propositions to organizations. A data lake is unmanaged data in open file formats that can be read and modified by multiple technologies, whereas a data warehouse provides a closed format and is optimized for performance of the data warehouse engine. This is a trade-off, as a data lake provides openness and accessibility of the data whereas the data warehouse gives control, security, performance, and increased organization of the data.

Architecturally, a data lake differs from a data warehouse in how it stores and manages data. A data lake aggregates raw storage to be used for data management and analytics, whereas a data warehouse uses a unified schema to store and manage data. The important distinction with a data lake is that the raw storage can be used to store and manage a variety of data structures, such as structured data, semi-structured data, or unstructured data. The value to organizations is that they’re able to store an array of different data types without worrying about conflicting schema or translating the data to a uniform schema.




Types of data warehouses

Aside from differing data management systems, there also are different types of data warehouses. Some of the most common types include cloud data warehouse, enterprise data warehouse, and data mart.

Cloud data warehouse vs. data warehouse

The cloud data warehouse is a warehouse that’s consumed via the cloud-based data Platform as a Service (PaaS) model. PaaS solutions, such as Amazon Redshift, Google BigQuery, and Azure Synapse Analytics enable organizations to build data warehouse initiatives in the cloud, capitalizing on the incredible computing power and cost optimization offered by the cloud. One of the primary ways that cloud data warehouses differ from the traditional on-premises data warehouse is in the transformation of data, with the two most common data transformation strategies being Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT).

ETL and ELT

ETL and ELT are data-integration techniques that are used to move data from source(s) to the destination data warehouse. ETL is known to be the go-to data-integration technique for on-premises data warehouse design patterns. With this technique, data is extracted from the data sources, cleansed or transformed on a staging server, and loaded into the data warehouse.

ELT has become a rising replacement for the traditional ETL for companies using cloud-based data warehouses. With ELT, data is extracted and loaded into the cloud-based data warehouse where the data cleansing and transformation are performed. This evolution of data integration allows companies to offload the responsibility to the cloud-based PaaS model, taking advantage of the incredible computing power of the public cloud.

Enterprise data warehouse vs. data warehouse

An enterprise data warehouse is a warehouse that stores all of an organization’s data. The enterprise data warehouse is often confused as being synonymous with a data warehouse. They are, however, somewhat different. An enterprise data warehouse is designed to store all of a company’s data in a uniform schema. Once all of this data is loaded into the enterprise data warehouse, companies can perform comprehensive analysis on the data, garnering key insights regarding their organization.

A data warehouse, in comparison, may be a smaller database used to store historical data regarding one single business unit. With this approach, companies typically use multiple data warehouses that serve specific data analytics functions.

Operational data store vs. data warehouse

An operational data store is a data warehouse that provides a snapshot of the most up-to-date data aggregated from various sources. The value of this warehouse strategy is in the single view of real-time data spanning multiple data sources. With an operational data store, companies make better business decisions based on fast querying of real-time data.

The data warehouse, on the other hand, is not designed for fast querying or aggregating the most recent snapshots of various data sources. Rather, the data warehouse is intended for deeper analysis of historical data over a defined time period.

Data mart vs. data warehouse

The data mart follows the same structured database schema constraints found in a traditional data warehouse design. Where these data warehouse design patterns deviate is in the data scope. One way to understand how these design patterns differ is in the flow of data in a typical data pipeline.

In a common data pipeline design, data is sourced from various operational databases, staged for transformation, and then loaded into the data warehouse. In this approach, business units are responsible for accessing their required historical data from this single data warehouse, making it challenging for business units to easily access the information they need.

Data marts build on this architectural design by copying particular datasets into standalone data marts. For example, a data mart may hold only historical data for purchasing, sales, or inventory. This architectural design addresses some of the underlying challenges found in the data warehouse, making it easier for business units to perform analysis that is relevant to them.




Designing an optimized data warehouse architecture

When it comes to data warehouse design, there is no one-size-fits-all approach. Depending on the unique business requirements of a company, it may make sense to choose a particular data warehouse schema or one data integration technique over another. Although there are a variety of variables that influence the design approach to the data warehouse, there are several high-level design patterns that should be considered when optimizing a data warehouse architecture. There are, however, four key design characteristics of an optimized data warehouse architecture.

  • Theme focused: Also known as business purpose data warehouses, theme-focused data warehouses are the subject the data warehouse is designed around. Some common data warehouse themes are sales, inventory, or product development.
  • Unified: Data is preserved in a consistent naming convention and format. This uniformity allows business units to easily source and analyze information from the data warehouse without having to transform or clean the data.
  • Time variance: This is a design parameter where the timestamp is recorded when an entry is made in the data warehouse. This allows data analysis to be performed based on when data is recorded in the warehouse.
  • Nonvolatility: Represents a state in which data cannot be deleted or overwritten. This ensures historical data is preserved.



Modern data warehouse solutions with VMware Tanzu

With a growing variety of data sources, as well as new strategies coming online to make use of enterprise data in meaningful ways, it can be challenging for companies to make informed decisions regarding their data warehouse approach.

VMware Tanzu Greenplum

VMware Tanzu Greenplum is a parallel-processing database server based on PostgreSQL that supports next-generation data warehousing and large-scale analytics processing.

With Tanzu Greenplum, companies can partition data automatically and run parallel queries on it, allowing services to function as a single database supercomputer. This unique parallel processing of databases allows Tanzu Greenplum to be tens to hundreds of times faster than the traditional database, supporting large-scale data analytics processing.

Analytics from BI to AI

Through massive parallel processing, Tanzu Greenplum removes the siloed database effect by delivering a single, scaled-out environment. With it, companies can execute lightning-fast data querying and effective data ingestion, and scale operations far beyond the capabilities of today’s traditional business intelligence and analytics operations.

Enterprise data science

Tanzu Greenplum leverages Apache MADlib, an open source library of machine-learning services designed for the PostgreSQL family of databases. This evolution of the PostgreSQL ecosystem delivers multinode, multi-GPU, deep-learning capabilities. Additionally, it offers model versioning, along with the capability to push models from training to production via a REST API. This and other attributes of the Tanzu Greenplum solution are writing the next chapter in streamlined and intuitive enterprise data science.

The promise of data

Organizations need to ensure they’re using database tools that scale with the complexity of new data structures and types. Traditional relational databases cannot manage the scale and, therefore, require separate instances to support the size of modern data. Object store infrastructure scales dramatically, yet is extremely difficult to query in an efficient manner. Tools that do scale and provide fast query, however, tend to be costly and proprietary. They often run on expensive customized hardware, live only in the public cloud, or are missing key features.

The Tanzu Greenplum database enables organizations to utilize a rich toolset based on the open source Postgres database. Greenplum provides a cost-effective, unified platform that aggregates massive data stores and types, enabling extraordinarily fast access to deep analytic insights. By doing this, Tanzu Greenplum offers a path to truly realize the promise of data.

Learn more about Tanzu Greenplum