Data has been at the center of information technologies since the first calculating machines were invented. But in recent years, with the explosion of Big Data, data governance, data lakes and machine learning (ML), the landscape of data-related topics has become far more complex than it was just 10 years ago. Moreover, research forecasts that 2024’s Big Data market size of $84 billion USD will reach $103 billion USD by 2027. Read on for a high-level view of a modern data approach that makes its different aspects easier to navigate.
Theory in Action 1
To better illustrate these topics, let’s introduce a fictional company – AED – a chain of electronic hardware stores that operates in the United States. AED gets products from multiple providers and sells them in 120 stores across multiple states.
What is data engineering?
Data engineering is a process of building systems to collect and prepare data for further usage. Usually, it consists of three parts:
- Data ingestion / extraction – pulling data from internal or external sources, including CSV and JSON files, spreadsheets, databases, other systems or internet webpages
- Data transformation – verifying, cleaning, deduplicating and aggregating pulled data
- Data storage / load – storing data so other systems can access it
This original process, sometimes referred to as ETL (extract, transform, load), has evolved into an ELT approach (extract, load, transform), which has recently gained more popularity (more on this in a subsequent article).
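The three parts described above can be sketched as a tiny pipeline. This is only a minimal illustration using Python's standard library: the sample CSV content, the `prices` table and all function names are assumptions made up for this example, and an in-memory SQLite database stands in for real storage.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a supplier; in practice this would be a file or an API response.
RAW_CSV = """sku,product,price
A100,USB-C cable,9.99
A100,USB-C cable,9.99
B200,Wireless mouse,
C300,HDMI adapter,14.50
"""


def extract(raw_text):
    """Ingestion: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transformation: drop rows without a price and remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        if not row["price"]:
            continue  # verification: skip incomplete records
        key = (row["sku"], row["price"])
        if key in seen:
            continue  # deduplication: this record was already kept
        seen.add(key)
        cleaned.append((row["sku"], row["product"], float(row["price"])))
    return cleaned


def load(records, connection):
    """Storage: write the cleaned records to a table other systems can query."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS prices (sku TEXT, product TEXT, price REAL)"
    )
    connection.executemany("INSERT INTO prices VALUES (?, ?, ?)", records)
    connection.commit()


def run_etl(raw_text, connection):
    """Run the steps in ETL order: extract, then transform, then load."""
    load(transform(extract(raw_text)), connection)


connection = sqlite3.connect(":memory:")
run_etl(RAW_CSV, connection)
stored = connection.execute("SELECT sku, price FROM prices ORDER BY sku").fetchall()
print(stored)
```

In an ELT variant, the raw rows would be loaded first and the cleaning step would run afterwards inside the storage layer.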
Data engineering is just one part of how organizations need to work with data. Once the data engineering phases are complete, experts can start data analysis and data science initiatives. Of course, throughout all this, data governance must be in place – to ensure security is maintained and facilitate further work.
A well-organized data ecosystem looks like this:
Theory in Action 2
AED advertises its stores as having the lowest prices. To fulfill that promise, the company needs to constantly monitor prices at affiliated wholesalers, as well as the prices its competition offers. To do that, it created a system that pulls (Extract) data from wholesalers and competition websites. Once data is acquired from external sources, it’s grouped by product (Transform) and stored in the data warehouse (Load). This way, for each product the company has a list of prices from wholesalers and a list of prices offered by its competition. Data in the warehouse is used by multiple other systems within the company, as well as by data analysts to monitor and set new prices.
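The grouping step in this scenario might look roughly like the sketch below. The record layout, seller names and prices are invented for illustration, and a plain dictionary stands in for the data warehouse.

```python
from collections import defaultdict

# Hypothetical records as they might look after extraction.
extracted = [
    {"product": "USB-C cable", "source": "wholesaler", "seller": "Wholesaler A", "price": 4.10},
    {"product": "USB-C cable", "source": "wholesaler", "seller": "Wholesaler B", "price": 3.95},
    {"product": "USB-C cable", "source": "competitor", "seller": "Competitor X", "price": 9.49},
    {"product": "HDMI adapter", "source": "wholesaler", "seller": "Wholesaler A", "price": 7.20},
    {"product": "HDMI adapter", "source": "competitor", "seller": "Competitor X", "price": 14.99},
]


def group_by_product(records):
    """Build, per product, one list of wholesaler prices and one of competitor prices."""
    grouped = defaultdict(lambda: {"wholesaler": [], "competitor": []})
    for record in records:
        entry = (record["seller"], record["price"])
        grouped[record["product"]][record["source"]].append(entry)
    return dict(grouped)


warehouse = group_by_product(extracted)
for product, prices in warehouse.items():
    print(product, prices)
```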
Data analysis
Once data is acquired, cleaned, stored, and partially transformed, it can be modeled, visualized and analyzed to support decision-making processes. An important aspect of that analysis is tracking trends and anomalies. To accomplish this, Business Intelligence tools such as PowerBI, Looker or Tableau are used. An advantage of these tools is that they do not require technical knowledge to use.
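BI tools handle this without any code, but the idea behind anomaly tracking can be shown in a few lines. The sketch below is one simple approach among many, not how any particular BI tool works: it flags a value when it deviates strongly from the values that came just before it. The sales figures, window size and threshold are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indexes of values that deviate strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        spread = stdev(history)
        if spread == 0:
            continue  # a perfectly flat history gives no scale to compare against
        if abs(values[i] - mean(history)) > threshold * spread:
            anomalies.append(i)
    return anomalies


# Hypothetical daily unit sales for one product; day 6 contains a spike.
daily_sales = [100, 102, 98, 101, 99, 100, 180, 101, 99, 100]
anomalies = find_anomalies(daily_sales)
print(anomalies)
```

Note that once a spike enters the window it inflates the spread, so the points right after it are judged more leniently. A production solution would typically use a more robust method.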
Theory in Action 3
A data analyst at AED uses reports created in PowerBI to figure out which wholesaler has the lowest prices for an analyzed product. Then he checks the competition’s retail prices, and based on that data, recommends a new price for that item to the business team. PowerBI reports are created using data acquired and stored through an ETL (data engineering) process.
Data science
Usually, data analysis requires humans to draw conclusions from data, whereas data science focuses on using statistical methods, algorithms and machine learning to extract information from data. While data analysts need structured or semi-structured data to work with, data science works with those, as well as completely unstructured data (like images, audio and video files, and text-heavy files). The main purpose of data science is to automatically draw conclusions from data that is hard for humans to interpret (for example due to the sheer size of it).
Theory in Action 4
Last year, the number of product types sold by AED tripled. Since data analysts are not able to process prices for such a large number of products, the company created a data science solution to automate this process. Specially created algorithms check data acquired from wholesalers and competitors (again acquired during an ETL process). Because the developed solution is not limited by human ability to analyze information, it can work on much larger data sets. When making suggestions, the new solution takes into account not only current market data but also historical data. Thanks to that, it can predict how prices will be set in the future and use this information to adjust prices.
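The scenario does not say which algorithms AED uses. As a deliberately simple stand-in, the sketch below fits a straight line to a competitor's historical prices with ordinary least squares and extrapolates it one week ahead. The price history and function names are hypothetical; a real solution would likely use richer models and more features.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_price(history, weeks_ahead=1):
    """Extrapolate the fitted trend beyond the last observed week."""
    weeks = list(range(len(history)))
    slope, intercept = fit_line(weeks, history)
    target_week = len(history) - 1 + weeks_ahead
    return slope * target_week + intercept


# Hypothetical weekly competitor prices for one product, oldest first.
competitor_history = [20.0, 19.5, 19.0, 18.5]
forecast = predict_price(competitor_history)
print(round(forecast, 2))
```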
Data governance
Data governance refers to the set of processes, tools and governing bodies an organization employs to ensure data is secure, of proper quality, valuable, traceable, and easy to locate. It also empowers companies to comply with applicable regulations and address privacy issues. It’s worth mentioning that while data engineering, data analysis and data science focus mostly on technical aspects, data governance is a much broader topic that combines technology with management practices.
A company can create its own data governance framework, but some use existing ones like:
- Data Management Body of Knowledge (DMBOK)
- Data Governance Institute (DGI) Framework
- The SAS Data Governance Framework
- BCG Data Governance Framework
Theory in Action 5
AED is in turmoil. The California Privacy Protection Agency has fined the company $10 million USD for improper use of privacy-protected data. The company’s monthly email newsletter was sent to people who had not given consent. This occurred because the list of customer email addresses used by the marketing team had been collected from many different databases in the company. After a month-long investigation, it was determined that the source databases did not track users’ consent for marketing communication channels, and the algorithm used to collect data automatically filled any missing Boolean cell with the value ‘true’. This unfortunate incident revealed AED’s serious problems with ensuring data privacy and data quality, and with tracking where data comes from (data lineage). Even more shocking, during the investigation the company discovered that application log files stored credit card numbers in plain sight among the technical details of errors. The AED board came to the painful realization that with the increase in the number of systems, they had lost their grip on their data. It was time to put some new tools in place, revisit processes related to the handling of data and put people in charge of managing data. In other words, it was time to introduce a proper data governance framework.
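Two of the technical failures in this story have simple safeguards, sketched below under stated assumptions: missing consent is treated as no consent, and anything that looks like a card number is masked before a line is written to a log. The field name `marketing_consent`, the sample data and the simplified card pattern are all hypothetical. Code like this supports a governance framework but does not replace one.

```python
import re


def parse_consent(raw_value):
    """Return True only for an explicit 'true'; missing or unknown values mean no consent."""
    if raw_value is None:
        return False
    return str(raw_value).strip().lower() == "true"


def build_mailing_list(customers):
    """Keep only customers whose consent was explicitly recorded."""
    return [c["email"] for c in customers if parse_consent(c.get("marketing_consent"))]


# Simplified pattern: 13 to 16 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")


def redact_card_numbers(log_line):
    """Mask anything that looks like a card number before the line is logged."""
    return CARD_PATTERN.sub("[REDACTED]", log_line)


# Hypothetical customer records merged from several source databases.
customers = [
    {"email": "a@example.com", "marketing_consent": "true"},
    {"email": "b@example.com", "marketing_consent": None},
    {"email": "c@example.com"},
    {"email": "d@example.com", "marketing_consent": ""},
]
mailing_list = build_mailing_list(customers)
safe_line = redact_card_numbers("Payment failed for card 4111 1111 1111 1111: timeout")
print(mailing_list)
print(safe_line)
```

Defaulting to no consent means a gap in the source data costs the company a newsletter recipient rather than a privacy violation.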
I think this covers the basic high-level overview of the data landscape. Please remember that, to make it easier to understand, I simplified or omitted some aspects of these topics. In reality, data engineering, data analysis, data science and data governance intertwine with each other. For example, data science is increasingly used in ETL processes to improve data quality or to enhance data with additional information. But to quote George E. P. Box: “All models are wrong, but some are useful.”
In the next part of this series, I’ll cover data storage approaches like data warehouses, data lakes, and data lakehouses.
About the author
Piotr Jachimczak
General Manager
A seasoned manager with 20 years of international experience in the IT industry, Piotr has a master’s degree in IT and is an EMBA graduate. A career that began as a software engineer led to numerous different roles on the technical side of software product development, including QA specialist, BI specialist and business analyst. Noticing a chasm between the technology and business worlds, he switched his career to management to help bridge that gap. Since then, Piotr has worked as a scrum master, project manager, delivery manager, IT director and general manager.