In the first article of this data engineering series, I covered the four fundamental aspects of handling data: data engineering, data analysis, data science and data governance. This time, I’ll dive deeper into the most common architectures for storing data.
Before that, though, we need to start by defining what structured and unstructured data are.
Structured and unstructured data
You can think about structured data as an Excel spreadsheet table. When adding a new item, you have columns that indicate what goes where, for example: name, type, identification number, date of production, and so on. Thanks to that, you can easily sort items by name, group by type, and filter by date of production.
Imagine now that you put everything you know about a product in one spreadsheet cell. On the one hand, it’s way quicker to add that item; on the other, doing simple operations like sorting and grouping becomes significantly more difficult. That’s an example of unstructured data. Images, video files, audio files, PDFs and Word documents are also types of unstructured data. The Excel example nicely illustrates the advantages and disadvantages of unstructured data: it’s cheap and fast to acquire and store – but using it becomes complicated.
Theory in Action 1
To better illustrate these topics let’s use a fictional company – AED – a chain of electronic hardware stores that operates in the United States. AED gets products from multiple providers and sells them in 120 stores across multiple states.
Example of unstructured data: “MSI GeForce 4060 Ti 16 GB SDDR 6 PCIe 4.0 x16 1162038”
Example of structured data: {"Item_type": "Graphic Card", "Producer": "MSI", "Product Line": "GeForce 4060", "RAM size": "16 GB", "RAM type": "SDDR6", "bus type": "PCIe 4.0 x16", "Product code": "1162038"}
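To show why the structured version is easier to work with, here is a minimal sketch of how a raw description like the one above might be parsed into a structured record. The field names and the regular expression are illustrative guesses, not AED’s actual schema, and the pattern only handles descriptions shaped like this example.

```python
import re

# Unstructured: everything about the product crammed into one string.
raw = "MSI GeForce 4060 Ti 16 GB SDDR 6 PCIe 4.0 x16 1162038"

# A hypothetical parser that recovers a structured record from that string.
# It only works for descriptions that follow the exact pattern above.
pattern = re.compile(
    r"^(?P<producer>\S+)\s+(?P<product_line>GeForce \d+(?: Ti)?)\s+"
    r"(?P<ram_size>\d+ GB)\s+(?P<ram_type>\S+ ?\d)\s+"
    r"(?P<bus_type>PCIe [\d.]+ x\d+)\s+(?P<product_code>\d+)$"
)

match = pattern.match(raw)
record = {"item_type": "Graphic Card", **match.groupdict()} if match else None
print(record)
# -> {'item_type': 'Graphic Card', 'producer': 'MSI', 'product_line': 'GeForce 4060 Ti', ...}
```

The point is not this particular regex; it is that every new shape of unstructured input needs new parsing work before you can sort, group or filter on it.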
What is a data warehouse?
A data warehouse is a special type of database that is designed to store large quantities of data. Data warehouses can be built using standard SQL database engines (like PostgreSQL, Microsoft SQL Server or Oracle Database), NoSQL databases (like CouchDB or Cassandra), columnar file formats (like Parquet) or specialized dedicated solutions (for instance, Snowflake). A data warehouse stores structured data in a way that makes it easy to perform operations like aggregation, segmentation, and statistical calculations (median, average and so on). It’s a perfect way to store data that can be used to generate reports, draw business insights, get a high-level view of your business and even drill down from that high-level view into details. A disadvantage of this approach is that it takes a lot of effort to transform source data into a format compatible with a data warehouse (especially when the source data is unstructured).
Theory in Action 2
AED, the fictional company introduced in the first part of this article, decided to pull data from its operational systems (like point of sale and accounting) to create a data warehouse that stores revenue and costs for each store in the chain. Data was structured to make it easy to get information about those two financial parameters for a single store, city, distribution area or even holistically for the whole group. This data would then be used to generate Power BI reports that data analysts could use, as an input for data science algorithms used for financial forecasts, and for other systems within the company that require this kind of information.
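Here is a minimal sketch of the kind of roll-up a warehouse makes trivial, using an in-memory SQLite table purely for illustration. The table and column names are made up for this example, not AED’s real schema.

```python
import sqlite3

# Tiny in-memory stand-in for a warehouse fact table (illustrative only).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE store_finance (store_id TEXT, city TEXT, revenue REAL, costs REAL)"
)
con.executemany(
    "INSERT INTO store_finance VALUES (?, ?, ?, ?)",
    [
        ("S001", "Austin", 120_000, 85_000),
        ("S002", "Austin", 95_000, 70_000),
        ("S003", "Denver", 110_000, 90_000),
    ],
)

# Structured storage makes roll-ups per store, city or the whole chain a one-liner.
for row in con.execute(
    "SELECT city, SUM(revenue) AS revenue, SUM(costs) AS costs, "
    "SUM(revenue) - SUM(costs) AS profit FROM store_finance GROUP BY city"
):
    print(row)
# ('Austin', 215000.0, 155000.0, 60000.0)
# ('Denver', 110000.0, 90000.0, 20000.0)
```

The same GROUP BY pattern scales from a single store up to the whole group, which is exactly the “drill down from a high-level view” capability described above.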
What is a data lake?
Data lakes are similar to the hard drive in your computer – they can store any type of data, structured and unstructured (documents, images, audio and video files and so on). In that regard they are very flexible and, in contrast with a data warehouse, you don’t need to manipulate data to store it. But this comes with a huge disadvantage: if you do not pay enough attention to how you organize things, it can quickly become a big mess that data engineers sometimes jokingly call a data swamp. So why use data lakes instead of standard file stores (like AWS S3)? Data lakes store data redundantly in the cloud and are optimized to handle huge amounts of it (putting a lot of data in and pulling a lot of data out can be done quickly). Additionally, modern data lakes have some nice features that make it easier to access files containing structured data – for example, querying a CSV file using SQL, so that working with them feels like working with a relational database.
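The article doesn’t name a specific engine for this SQL-over-files capability, so here is a minimal sketch using DuckDB as one example; the file path and column names are hypothetical.

```python
import duckdb  # one of several engines that can run SQL directly over files in a lake

# Hypothetical raw CSV file sitting in the lake; the engine infers the schema
# and lets you query the file as if it were a relational table. With the right
# extensions, the same query can run against cloud object storage paths.
daily_sales = duckdb.sql(
    """
    SELECT store_id, SUM(amount) AS total_sales
    FROM read_csv_auto('raw/sales_2024-05-01.csv')
    GROUP BY store_id
    ORDER BY total_sales DESC
    """
).df()

print(daily_sales.head())
```

Nothing about the CSV file itself changed; the engine simply layers a relational view over a raw file in the lake.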
Theory in Action 3
AED noticed that a data warehouse was not enough for them as they had a lot of unstructured data, so they decided to put it into a data lake. At first everything was going great. It was fast and easy to get data in and out, they did not need to worry about backups as data in the lake was stored redundantly, and they were less worried about accidental loss of data due to deletion (as a data lake keeps all data, including historical data). Soon, however, a problem emerged. Different departments were entering data in different ways, which made finding things difficult. The same data was stored in multiple places and there were often issues with data consistency (like conflicting bits of information). AED realized that instead of having a data lake from which it’s easy to fish data out, they had created a data swamp that was hard to navigate and where it was almost impossible to find what they were looking for.
What is a data lakehouse?
A data lakehouse, as the name suggests, is an attempt to combine the advantages of a data warehouse and a data lake, while limiting the disadvantages those technologies have on their own. In a data lakehouse, data is pulled in and stored in raw format in a data lake (this storage is called the bronze layer). Next, using a set of useful built-in tools, the data is cleaned, transformed into a common format and stored again in a data lake, this time in a semi-structured format (this storage is called the silver layer). Lastly, data is deduplicated, aggregated, and stored in a data warehouse-like format (this storage is called the gold layer). This three-layer approach is called medallion architecture. It makes it easy to organize and transform the data, allows it to be used for data analysis (pulling data from the gold layer) and for data science (pulling data from any or all of the three layers).
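Below is a minimal sketch of the bronze/silver/gold flow using plain pandas and local folders as stand-ins; in a real lakehouse these layers would typically live in object storage as Delta Lake or Iceberg tables, and the file names here are purely illustrative.

```python
import json
from pathlib import Path

import pandas as pd

# Layer directories are illustrative stand-ins for object-storage locations.
for layer in ("bronze", "silver", "gold"):
    Path(layer).mkdir(exist_ok=True)

# --- Bronze: land the raw payload exactly as it arrived from the source.
raw_payload = json.dumps([
    {"store": "S001", "item": "1162038", "price": "399.99"},
    {"store": "S001", "item": "1162038", "price": "399.99"},  # duplicate on purpose
    {"store": "S002", "item": "1162038", "price": "389.00"},
])
Path("bronze/prices_2024-05-01.json").write_text(raw_payload)

# --- Silver: parse into a common format and standardise types.
silver = pd.DataFrame(json.loads(Path("bronze/prices_2024-05-01.json").read_text()))
silver["price"] = silver["price"].astype(float)
silver.to_parquet("silver/prices.parquet")  # parquet writing assumes pyarrow is installed

# --- Gold: deduplicate and aggregate into warehouse-style reporting tables.
gold = (
    silver.drop_duplicates()
    .groupby(["store", "item"], as_index=False)["price"]
    .mean()
    .rename(columns={"price": "avg_price"})
)
gold.to_parquet("gold/avg_item_price_per_store.parquet")
print(gold)
```

The key design choice is that each layer is written out and kept: raw data stays queryable in bronze, while analysts and downstream systems read the progressively cleaner silver and gold layers.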
Theory in Action 4
In AED’s journey to handle data, they decided to try a data lakehouse approach. They put all data about product prices, revenue and costs (raw HTML files, images, CSV files, extracts from other databases, and JSON files gathered from API calls) into bronze layer structures in the data lake. Next, all text and numeric data was transformed into JSON format, passed through quality gates (including data anomaly detection) and put into the silver layer. Lastly, it was transformed into relational data and uploaded into data-warehouse-like structures in the gold layer. The entire data pipeline was configured so that each time new input data was available, the entire process was automatically run for that new data and the results appended to the existing data structures. The data science team used images stored in the bronze layer to detect product properties and validate them against written product specifications to detect potential errors. They also used this opportunity to add missing properties (like color or product size). The data science team also used data available in the data warehouse to craft algorithms for suggesting prices in a simpler and more efficient way. The data analyst team, on the other hand, used data from the gold layer to create Power BI reports. The engineering team, thanks to the medallion architecture, also had an easier job detecting problems with data. Thanks to data lineage built into the data lakehouse, they could track what happened with each piece of data in the data warehouse – all the way back to the source file. Data governance structures used the data lakehouse’s unified access control to set permissions for all databases involved in the process.
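The article doesn’t specify which tooling AED’s quality gates use, so here is a minimal, hypothetical sketch of what such a gate could look like for price data moving from bronze to silver; the function name, rules and thresholds are chosen purely for illustration.

```python
import pandas as pd

def price_quality_gate(batch: pd.DataFrame) -> pd.DataFrame:
    """Reject obviously bad rows before they reach the silver layer.

    The rules below are illustrative stand-ins for real quality gates.
    """
    checked = batch.copy()
    checked["price"] = pd.to_numeric(checked["price"], errors="coerce")

    valid = (
        checked["store"].notna()
        & checked["item"].notna()
        & checked["price"].between(1, 50_000)  # crude anomaly bound, illustrative only
    )
    rejected = checked[~valid]
    if not rejected.empty:
        # In a real pipeline, rejected rows would go to a quarantine table and alerting.
        print(f"Quarantined {len(rejected)} suspicious rows")
    return checked[valid]

batch = pd.DataFrame([
    {"store": "S001", "item": "1162038", "price": "399.99"},
    {"store": "S002", "item": "1162038", "price": "-1"},      # anomaly: negative price
    {"store": None,   "item": "1162038", "price": "405.00"},  # anomaly: missing store
])
clean = price_quality_gate(batch)
print(clean)
```

Gates like this are what keep the silver and gold layers trustworthy even when the bronze layer accepts anything the sources send.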
From complex clutter to actionable insights
Partner with a team that can help you develop a customized storage strategy that scales with your business. Store smarter, not harder – contact us using this form to learn more.
About the author
Piotr Jachimczak
General Manager
A seasoned manager with 20 years of international experience in the IT industry, Piotr has a master’s degree in IT and is an EMBA graduate. A career that began as a software engineer led to numerous different roles on the technical side of software product development, including QA specialist, BI specialist and business analyst. Noticing a chasm between the technology and business worlds, he switched his career to management to help bridge that gap. Since then, Piotr has worked as a scrum master, project manager, delivery manager, IT director and general manager.