Modern analytics has changed the landscape of how we store, access, and present data. This article covers the difference between a data lake and data warehouse along with information for one to choose between the two.
Traditional data warehousing is not necessarily a thing of the past, but its use has changed from being the only way to provide business analytics to a more specified purpose. Extensive planning is involved with creating an Enterprise Data Warehouse (EDW) and in most cases, it satisfies a majority of a company’s information needs.
Big data technologies are allowing for companies to quickly store raw data directly from the source and into a data lake. Emails, videos, and manufacturing equipment feeds are just some examples. The usability of this raw data is extensive, but it requires more technical skillsets.
Quite often a hybrid approach is used to satisfy business requirements.
What is a Data Lake?
A Data Lake is a storage system that allows all raw and unstructured data from source systems to be in one location. This may include native operational data from a RDBMS system in which case it would appear to be like an EDW’s Operational Data Store (ODS). Don’t be mistaken, this is not an EDW by any means.
Raw data requires vast amounts of storage since nothing is excluded from the source system. The lowest level of data is retained, referred to as leaf level. This leaf level data is intended to be stored long term and to be updated in near real-time from the source systems.
Agile development is typical with Data Lakes since there is no structure and multiple tools can be used to access data. Data scientists would be the main users of this raw data; preparing and creating models for predictive analytics, machine learning, and data mining. Business users will require processed data and a more intuitive tool for consuming data from this storage type.
What is a Data Warehouse?
Data Warehouses store transformed data that has undergone a proper Extract, Transform, and Loading (ETL) process. Typically, there is a business requirement defined before the data is stored in an EDW.
Defining business requirements and building to spec adds increased development time and costs for delivering tangible results to the business. However, what’s delivered to the business professionals is easier to consume and can be self-service.
Structured development cycles are commonplace when creating and maintaining an EDW. This is due to the nature of the schema-oriented design for ease of use. Enterprise toolsets such as IBM Cognos, MicroStrategy, Power BI, and Tableau are some examples that allow the business to govern data and deliver full Analytics solutions for a Data Warehouse. Business users are the main consumers for this storage system.
What are the Differences?
|Data Lake||Data Warehouse|
|Data Storage Architecture||
|Users and Data Access||
How to Choose?
Enterprise solutions don’t come in a one size fits all, so it’s important to balance the operational needs of the business with real-time analytics of a data lake. In some cases, a data warehouse is used to fulfill business needs that the data lake can’t and vice versa when a mature EDW is already in place.
For an organization that is moving towards consolidating their reporting or starting a data science program, it’s important to consider both options. There are also software vendors that specialize in a hybrid approach for these storage types. These vendors may not have 100% functionality of both a data lake and data warehouse, but for many organizations, it’s a perfect fit.