Software Development

Data Engineering with Databricks: Key Functionalities and Expertise in Practice 

Published: 2025/01/02

Databricks describes itself as the world’s first data intelligence platform powered by generative AI. What are its key features, and which tools and techniques are worth using for specific use cases? Read on to find out.

Why is data engineering important?

Data is a critical business resource as it enables organizations to maximize productivity. Today four of the top five companies in terms of market capitalization are data companies. […] [1]

The data economy, the ecosystem that enables use of data for business performance, is becoming increasingly embraced worldwide. Data has enabled firms such as Netflix, Facebook, Google, and Uber to acquire a distinct competitive advantage. McKinsey Global Institute indicates that data-driven organizations are 23 times more likely to acquire customers, six times as likely to retain customers, and 19 times more profitable.

However, most organizations struggle to convert data for improved business performance. There are many reasons for this, and one of the most important is a lack of high-quality data. […] [1]

Nowadays, it’s not about storing and keeping data; it’s about extracting value from the data and processing it in a way that brings the greatest benefits to an organization. That’s why ETL and ELT processes are moving to the cloud, where they aggregate data and adapt it to market expectations and an organization’s growth.

What is Databricks?

Currently, cloud providers offer many different data processing solutions, both native to each vendor and multi-cloud. However, in my opinion, Databricks is the most truly multi-cloud platform available today. It offers a unified approach to data analytics and machine learning (ML), regardless of the cloud platform, which is a significant advantage for companies implementing multi-cloud strategies.

Databricks serves as a comprehensive, open analytics platform designed for seamless development, collaboration, analytics, and AI solutions.

The Databricks workspace delivers a consolidated environment equipped with tools to handle a wide range of data-related tasks, such as:

  • Scheduling and managing data processes, particularly ETL operations
  • Overseeing security, governance, high availability, and disaster recovery strategies
  • Facilitating ML workflows, including model development, tracking, and deployment
  • Supporting the implementation of generative AI applications

When won’t you need Databricks?

Databricks isn’t a universal solution suitable for every company. For smaller businesses, it may not be the best fit, especially from a financial perspective. The platform’s costs can add up quickly, making it less feasible for small projects or organizations with limited budgets. Additionally, unlocking the full potential of Databricks often requires specialized training.

As an alternative, you can choose Snowflake, Azure Notebooks, PySpark, or Pandas and save data on physical disks in CSV format.

Why do you need Databricks?

While I’ve mentioned some disadvantages of using Databricks, it’s also important to point out the many benefits. Many companies use Databricks because it helps gather data in one place, making it easier for teams to work together. It can grow with your needs, so it works for both small and large projects. Since it’s cloud-based, you can choose to set it up on AWS, Azure, or Google Cloud.

Databricks also provides strong security features, like encryption, identity management, and access controls, which can help keep your data safe.

Using computing to empower ELT processes

Databricks offers the following compute options:

  • Serverless compute for notebooks: Instantly available, flexible compute resources designed to run SQL and Python code within notebooks.
  • Serverless compute for jobs: Effortless and elastic compute resources that automatically handle the infrastructure setup, so you can execute your Databricks jobs seamlessly.
  • All-purpose compute: Dedicated compute resources provisioned for data analysis in notebooks, with options to start, stop, or restart using the UI, CLI, or REST API, which gives you full control over the cluster lifecycle. Just remember that starting a cluster can take four to five minutes.
  • Jobs compute: Purpose-built compute for executing automated tasks, dynamically created by the Databricks job scheduler when a new job is triggered. The compute shuts down upon job completion and cannot be restarted.
  • Instance pools: Pre-warmed compute instances that stand by, ready for use to minimize startup and scaling delays, configurable through the UI, CLI, or REST API.

Expert tip #1:

Databricks bills in DBUs (Databricks Units), which can be compared to a commission charged by Databricks. With serverless compute, the DBU charge is the only thing we pay for. Here is an example price:

Interactive Serverless Compute: $0.74/DBU-hour

A rate of $0.74 per DBU-hour means that you are charged 74 cents for every DBU your workload consumes. The more processing power a task uses per hour, the more DBUs it consumes.

Databricks uses a “pay-as-you-go” pricing model, meaning users only pay for the actual usage of resources, making it a flexible solution for various computational needs.
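To make the arithmetic concrete, here is a minimal Python sketch of how a serverless bill could be estimated. The $0.74 rate is the example price quoted above; the workload size (4 DBUs per hour for 2.5 hours) and the function name are made up for illustration, and real rates vary by cloud, region, and plan.

```python
def serverless_cost(dbus_per_hour, hours, rate_per_dbu=0.74):
    """Estimate a serverless bill: DBUs consumed multiplied by the DBU rate.

    rate_per_dbu defaults to the example price quoted in the article;
    dbus_per_hour and hours are hypothetical inputs.
    """
    dbus_consumed = dbus_per_hour * hours
    return round(dbus_consumed * rate_per_dbu, 2)


# Hypothetical workload: 4 DBUs per hour, running for 2.5 hours.
estimate = serverless_cost(dbus_per_hour=4, hours=2.5)
print(f"Estimated serverless cost: ${estimate:.2f}")
```

Because serverless has no separate virtual machine charge, this single multiplication is the whole estimate.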

However, when it comes to clusters, we pay the cloud vendor for server usage, as usual, as well as the DBU commission. Here is an example.

Pricing structure for the DS3 v2

This table outlines the pricing structure for the DS3 v2 instance across different payment options, including “Pay As You Go,” 1-year and 3-year reserved VM plans, and Spot pricing, with respective savings percentages for the reserved and Spot options.

You can find more about prices here.

Instance pools consist of idle instances that are kept ready for immediate use. Databricks does not charge DBUs (Databricks Units) while instances sit idle in the pool, so the DBU charge applies only to the time instances actually run workloads, which can lead to significant savings. With ready-to-use instances, the time required to start a cluster is significantly reduced.

Additionally, by predefining the Databricks Runtime version for the pool, users can further speed up the launch process. However, the cloud provider still bills for the idle virtual machines, so this feature pays off mainly when fast cluster startup matters.
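The difference between a running cluster and an idle pool can be sketched in a few lines of Python. All numbers below (VM price, DBUs per node, DBU rate) are hypothetical placeholders rather than actual Azure or Databricks prices, and the function names are mine; check the pricing pages for real values.

```python
def cluster_hourly_cost(nodes, vm_price_per_hour, dbus_per_node_hour, dbu_rate):
    """Hourly cost of a running cluster: cloud VM charge plus Databricks DBU charge."""
    vm_cost = nodes * vm_price_per_hour
    dbu_cost = nodes * dbus_per_node_hour * dbu_rate
    return round(vm_cost + dbu_cost, 2)


def idle_pool_hourly_cost(idle_instances, vm_price_per_hour):
    """Hourly cost of idle pool instances: the VM charge only, with no DBUs."""
    return round(idle_instances * vm_price_per_hour, 2)


# Hypothetical prices, for illustration only.
running = cluster_hourly_cost(
    nodes=4, vm_price_per_hour=0.30, dbus_per_node_hour=0.75, dbu_rate=0.40
)
idle = idle_pool_hourly_cost(idle_instances=2, vm_price_per_hour=0.30)
print(f"Running cluster: ${running:.2f}/hour, idle pool: ${idle:.2f}/hour")
```

The idle pool is cheaper per instance because the DBU part of the bill disappears, but it is never free, which is why pool size is worth tuning.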

Expert tip #2:

Choosing the right Databricks compute option depends on project needs: Serverless is best for short ad-hoc tasks, Jobs Compute for regular tasks that need more control, and Instance Pools for frequent, similar tasks to optimize time and costs.

  • Serverless Compute for Jobs: Ideal for running short, ad-hoc tasks and automating resource management to improve productivity.
  • Jobs Compute: Best suited for regularly scheduled tasks and complex data pipelines that need more control over cluster configuration.
  • Instance Pools: Useful for organizations running similar tasks frequently, as they reduce cluster startup time and costs by reusing instances.

Databricks storage: how does it really work?

ADLS Gen2 (Azure Data Lake Storage Gen2) is a type of object storage in Azure that organizes files in a hierarchical structure. This makes it possible to work with a data lake in the cloud and store data in files like CSV or Parquet.

To be clear, Databricks adds more value with a lakehouse. It supports operational flexibility, enabling both historical analysis and real-time stream processing. A data lakehouse combines the best features of a data warehouse and a data lake, providing enhanced reliability, scalability, and performance. To implement a data lakehouse, three key components are needed:

  1. Delta Lake: Open-source software that helps maintain a transaction log for all records stored in the data lake files.
  2. SQL query engine: Allows data analysts to use SQL to query tables created in Delta Lake.
  3. Data Catalog: Provides visibility into the data stored in a data lake.

[Figure: The components and data flow in a Data Lakehouse system]

Expert tip:

The Data Lakehouse is a modern approach to data management that combines the advantages of a data warehouse and a data lake. A very important feature is data partitioning using PySpark. This allows users to easily partition data by date and organize it by months or years. This approach helps improve data management and processing efficiency.
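In PySpark, this kind of layout is typically produced by writing a DataFrame with `df.write.partitionBy("year", "month")`. The plain-Python sketch below does not call Spark at all; it only mimics the Hive-style `column=value` directory naming that such a write results in, so you can see how records end up grouped. The base path, record fields, and function names are hypothetical.

```python
from collections import defaultdict
from datetime import date


def partition_dir(base_path, day):
    """Build a Hive-style partition directory for a date (year=.../month=...)."""
    return f"{base_path}/year={day.year}/month={day.month}"


def group_by_partition(base_path, records):
    """Group records by the partition directory they would be written to."""
    groups = defaultdict(list)
    for record in records:
        groups[partition_dir(base_path, record["event_date"])].append(record)
    return dict(groups)


# Hypothetical records and base path, for illustration only.
records = [
    {"order_id": 1, "event_date": date(2024, 12, 31)},
    {"order_id": 2, "event_date": date(2025, 1, 2)},
    {"order_id": 3, "event_date": date(2025, 1, 15)},
]
layout = group_by_partition("/lake/orders", records)
for directory in sorted(layout):
    print(directory, len(layout[directory]))
```

A query that filters on year and month only has to read the matching directories, which is where the efficiency gain comes from.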

What are the benefits of using PySpark?

Databricks is built on Apache Spark, a powerful engine for big data and machine learning. Here’s what it offers:

DataFrames

DataFrames are the main data structure in Apache Spark. Think of them like a spreadsheet or SQL table: they organize data into rows (records) and columns (fields). Each column can hold different types of data.

Lazy Evaluation

Spark optimizes data processing by figuring out the best way to run your code. Transformations such as filters or joins are only recorded in an execution plan; Spark doesn’t actually execute any steps until an action is performed. Actions include operations like collect, count, or saving data to a file. Because Spark sees the whole plan before running it, it can process data more efficiently.
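The idea can be illustrated with a toy model in plain Python. This is not Spark and not how Spark is implemented; the `LazyPipeline` class is a hypothetical stand-in that records transformations and only runs them when an action such as `collect` or `count` is called.

```python
class LazyPipeline:
    """Toy model of lazy evaluation: transformations are recorded, not executed."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: returns a new pipeline with one more recorded step.
        return LazyPipeline(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also only recorded.
        return LazyPipeline(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        rows = iter(self._source)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return rows

    def collect(self):
        # Action: triggers execution of all recorded steps.
        return list(self._run())

    def count(self):
        # Action: triggers execution and counts the resulting rows.
        return sum(1 for _ in self._run())


calls = []


def double(x):
    calls.append(x)  # track when the function is really executed
    return x * 2


pipeline = LazyPipeline(range(5)).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # nothing has run yet
result = pipeline.collect()       # the action triggers the work
print(calls_before_action, result, len(calls))
```

Spark goes further than this toy: because it holds the full plan before executing, it can reorder and combine steps, which a simple recorded list does not do.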

APIs and Libraries

PySpark, Spark’s Python interface, provides many tools and libraries, including:

  • Structured streaming: Process streaming data using the same code you would for static data. The Spark SQL engine handles the data continuously and incrementally as it arrives.
  • Pandas API on Spark: This feature allows you to run pandas code on large datasets across multiple nodes, giving you the flexibility to work with smaller datasets locally and scale up to distributed computing for production workloads.
  • Machine learning (MLlib): MLlib is Spark’s library for scalable machine learning. It provides consistent APIs for building and optimizing machine learning workflows.
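  • GraphX: A library for graph analytics, GraphX offers tools for creating graphs, performing graph-parallel computations, and running algorithms, making it easier to work with connected data. Note that GraphX itself exposes Scala and Java APIs; from Python, graph workloads typically rely on the separate GraphFrames package.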

These features make Databricks a versatile platform for data engineering, analytics, and machine learning.

Expert tip:

The first thing that impressed me about Databricks was PySpark. I used to work with Spark on on-premises setups, and it was often complicated and difficult. When I first tried Databricks and saw how easy it was to use Spark in the cloud, it felt amazing. The cloud made everything simpler and much more powerful.

Is the DevOps process straightforward with Databricks?

Workspace and code development

Databricks is a powerful platform for data analysis that makes it easy to work with data. It organizes everything in a workspace, which is a place where you can access different tools and resources, like notebooks and code repositories.

  • Notebooks: These are the main tools for writing and running code in different programming languages like Python, Scala, or SQL. Think of them as interactive documents where you can write code, see results, and even add notes.
  • Repos (Repositories): They let you connect to version control systems like Git, making it simple to keep track of changes to your code and collaborate with others.

Expert tip #1: Notebooks are perfect for quick prototyping and data exploration. Use Repos when you need to work with production code or on test and development environments.

Expert tip #2: Integrate with CI/CD: As a project grows, it’s a good idea to implement continuous integration and delivery (CI/CD) practices to automate the process of deploying code to different environments. Databricks connects with Azure DevOps and other CI/CD tools.

Expert tip #3: A well-organized folder structure in the workspace makes it easier to navigate and manage code. For example, you could create separate folders for notebooks, ETL scripts, ML models, and so on.

Jobs orchestration

With its built-in features, Databricks makes it easy to schedule, monitor, and manage complex workflows. As a result, it helps to use resources efficiently and reduces project completion time.

Databricks jobs – A job is the main way to schedule and organize tasks in Databricks. It can consist of one or multiple tasks that perform different operations, like ETL processes, data analysis, or training machine learning models. This lets you group tasks into logical units, making it easier to manage and reuse code.

Tasks – Each task within a job represents a specific piece of work to be done. Tasks can include notebooks, SQL queries, or Spark scripts, allowing for a flexible approach to handling various types of operations.

Expert tip #1: It’s worth considering creating separate workspaces for different stages of the application lifecycle, such as development, testing, and production.

Expert tip #2: Creating a Job: Users can create jobs using the Databricks user interface or the Jobs API. You can set different parameters, such as the schedule for running the job, the types of tasks, and the computing resources needed.

Managing Dependencies: Within a job, you can define dependencies between tasks to control the order in which they run. It’s also possible to set conditional logic that determines which tasks should run based on the results of previous tasks.
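As a rough sketch, a job with three dependent tasks could be described like this in Python. The field names (`tasks`, `task_key`, `depends_on`, `notebook_task`, `schedule`) follow my understanding of the Jobs API request format and should be checked against the current API reference; the job name and notebook paths are hypothetical. Nothing is sent to Databricks here. The helper only derives a valid run order from the declared dependencies using the standard library.

```python
from graphlib import TopologicalSorter

# Hypothetical job definition; verify field names against the Jobs API docs.
job_spec = {
    "name": "nightly_etl",
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # every day at 02:00
        "timezone_id": "UTC",
    },
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/etl/ingest"},
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],
            "notebook_task": {"notebook_path": "/Repos/etl/transform"},
        },
        {
            "task_key": "publish",
            "depends_on": [{"task_key": "transform"}],
            "notebook_task": {"notebook_path": "/Repos/etl/publish"},
        },
    ],
}


def run_order(spec):
    """Return task keys in an order that respects every depends_on entry."""
    graph = {
        task["task_key"]: {dep["task_key"] for dep in task.get("depends_on", [])}
        for task in spec["tasks"]
    }
    return list(TopologicalSorter(graph).static_order())


print(run_order(job_spec))
```

Keeping the definition as data like this also makes it easy to store in a repository and review alongside the notebooks it runs.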

IaC with Databricks

Infrastructure as Code (IaC) in Databricks is an approach that allows you to manage and set up resources and infrastructure in the Databricks environment using code. This means you can automate tasks like creating clusters, setting up data connections, or configuring security settings. To do this, you can use the Databricks Terraform provider. It helps you work with almost all Databricks resources, making it easier to create and manage your Databricks setup using simple code instructions.

Expert tip: With an IaC approach, you can create jobs, tasks, notebooks, and clusters in Databricks. This is very helpful because you can automate this work and make it easier to set up different types of environments, such as production, testing, or acceptance.

This approach fits well with cloud governance and compliance practices.

For more information visit this page.

Monitoring

Databricks offers the ability to monitor job execution results and configure email notifications or integrate with systems like Slack or Microsoft Teams, so teams can quickly respond to potential issues.

Expert Tip: Check out a tutorial on how to enable notifications in Microsoft Teams.

Additionally, Databricks logs events related to the cluster and provides a dashboard with insights into metrics for hardware, GPU, and Spark. Examples of Spark metrics include active tasks and total completed tasks. Hardware metrics include CPU utilization and data received through the network.

[Figure: “Received through network” metric chart]

What is the Databricks Unity Catalog?

Before diving into the definition of Unity Catalog, let’s first understand the problem it solves. Imagine you would like to create a Databricks workspace, where you need to manage users and control their access to resources like clusters and data. This can be challenging when you have multiple workspaces for different environments (dev, UAT, prod) or projects. Managing user permissions, access controls, and metadata separately for each workspace can be difficult and time-consuming.

Unity Catalog addresses this issue by providing a centralized governance layer for data and AI. It offers a single interface for managing user access, metadata, and auditing across all your workspaces. Instead of duplicating user roles and access controls for each workspace, you can manage everything from one place.

[Figure: Unity Catalog]

Key Features of Unity Catalog:

  • Centralized Management: Unity Catalog supports the management of users and metadata across different workspaces, making it easier to control access and manage data within an organization. This helps better regulate permissions and access to resources.
  • Hierarchical Metadata Structure: Unity Catalog introduces a three-level namespace, catalog.schema.object, that sits under a metastore, the top-level container for metadata. This structure enables users to organize database objects like tables and views in a way that simplifies management and auditing.
  • Managed and External Tables: Unity Catalog supports both managed tables (where Unity Catalog controls the entire data lifecycle) and external tables (where access is managed by Unity Catalog, but you manage the data lifecycle and file layout in your own cloud storage).
  • Data Lineage Tracking: Unity Catalog enables data lineage tracking at the column level, allowing users to understand the sources of the data and its transformations over time.

[Figure: The Hierarchical Metadata Structure]

Expert Tip #1: Remember that some Databricks features, like serverless compute, only work in workspaces where Unity Catalog is enabled.

Expert Tip #2: Grant permissions wisely: Implement a robust permissions strategy by carefully granting users access to catalogs, schemas, and database objects. Follow the principle of least privilege so that users have only the permissions they need to perform their tasks.
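As a small illustration of least privilege, the Python sketch below builds the SQL statements a read-only group would need for a single table. In Unity Catalog, objects are addressed as catalog.schema.object, and reading a table requires `USE CATALOG` and `USE SCHEMA` on its parents in addition to `SELECT` on the table itself. The catalog, schema, table, and group names are hypothetical, and the helper functions are mine; the statements are only generated as strings, not executed.

```python
def full_name(catalog, schema, obj):
    """Build a three-level Unity Catalog name: catalog.schema.object."""
    return f"{catalog}.{schema}.{obj}"


def read_only_grants(catalog, schema, table, principal):
    """Return the minimal GRANT statements for read-only access to one table."""
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`",
        f"GRANT SELECT ON TABLE {full_name(catalog, schema, table)} TO `{principal}`",
    ]


# Hypothetical names, for illustration only.
grants = read_only_grants("prod", "sales", "orders", "analysts")
for statement in grants:
    print(statement)
```

Granting to groups rather than to individual users keeps the number of statements small and makes access easier to audit.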

How to start?

Databricks offers several options for testing the platform without incurring costs. Here are the details of the available trials:

Databricks Free Trial

Duration: 14 days.

Resources: Users get access to the full Databricks platform, including ETL, data analytics, and machine learning features.

Cost: After the trial period ends, you only pay for the compute resources you use.

Registration: You can sign up on the Databricks website, choosing your preferred cloud provider (AWS, Azure, or Google Cloud). More information here.

Databricks Community Edition

Duration: Unlimited, but with limited resources.

Resources: Does not require your own cloud account or providing compute resources. Ideal for learning and testing the basic features of Databricks.

Registration: Simply provide your details and verify your email address. You can start using the Community Edition at no additional cost. More information here.

5 things worth remembering

  1. When implementing Databricks, you must always remember Unity Catalog. Unity Catalog is a foundation for building other things in Databricks.
  2. The next thing to remember is cluster management and cost control. Costs matter, and it’s always good to present the cost of running a job to the business.
  3. The third thing worth remembering is the DevOps approach. Using repositories and CI/CD tools with Databricks is always a fantastic practice.
  4. Next, it’s important to focus on monitoring. Always set up alerts and notifications for your jobs. This will help you reduce the risk of errors and fix any issues that come up in the future. Also, look at the metrics and charts provided by Databricks.
  5. The last thing to remember is to keep checking for new features in Databricks, as something changes almost every week. As more functionalities become available, the product keeps getting better.

About the author: Kasper Kalfas

Cloud Architect

With over 8 years’ experience in software development, Kasper has been designing cloud infrastructures, developing DevOps solutions and creating data lakehouses for companies across sectors. Specializing and certified in Amazon Web Services (AWS), Azure and Google Cloud Platform (GCP), he’s passionate about finding innovative answers to complex problems and exploring opportunities offered by new technologies. After work, he tests and reviews data and AI tools on his blog, where he’s building a community of API and AI enthusiasts.

Copyright © 2024 by Software Mind. All rights reserved.