25 Feb 2022

Unit testing of Google Cloud Dataflow applications (Apache Beam)

In this article, I’ll describe the best practices we’ve developed while working with the Apache Beam framework, which allows us to simplify the implementation of unit tests for Dataflow applications and make the tests easier to understand. The effective testing of such applications requires, of course, employing “standard” best-practices of writing unit tests such as good test isolation, test determinism, testing API – not implementation, testing one thing in one test, and using clear input data and expected outputs. However, in order to make the most out of using Apache Beam for testing Dataflow applications, it’s worth implementing a few additional ones. What are these? Read on to find out.

Google Cloud Dataflow applications 101

Although this text focuses on testing Dataflow applications, it’s worth briefly recalling what they are. Dataflow applications have certain characteristics that distinguish them from other programs and their task is usually to transport data from one side of the corporate ecosystem to another. These programs often use the following scheme: one part reads data, another part implements some business logic, and a third part stores data. Data is usually read and stored in files, databases, data warehouses, message queues and similar entities.

Separate plumbing from testing business logic

When testing such a solution, it’s tempting to test the correctness of functionalities in conjunction with data sources. The simpler the business logic between program input and output, the bigger the temptation. However, it can lead to several unpleasant consequences, which I will describe below.

Problem 1. Long and non-deterministic tests

The most obvious consequence of this approach is that the test setup grows drastically. Let’s assume that we need to load records from a CSV file, process each record and then store the processed records on a Kafka topic. In the test setup, a file would have to be generated and then deleted after test execution, or it would have to be permanently stored in the program’s resources. The Kafka instance would have to be launched in a container and initialized accordingly, which means, for example, creating a topic and deleting it after the test.

Unit testing of Google Cloud Dataflow applications

 

Additionally, data would have to be read and stored, which requires disk operations and communication with test instances of services running in containers or even on an external server . These operations are time-consuming, increase the test duration and are risky if done over the network, so the test determinism decreases.

Problem 2. All cases in one test

Time-consuming and lengthy tests do not directly affect the quality of a product. However, they can trigger certain behaviors in developers, which could have a negative impact. Let’s consider a drastic case (which does occur in real life) when the whole test setup causes the test to last about 30 seconds. We don’t want the tests to take many minutes to run, because that slows things down. On the other hand, we need to test 5 corner cases. This ultimately causes the tests to last more than 3 minutes, which is an unacceptable situation. In this case, the developer is faced with a decision: either cram all the test cases into one test, making it completely incomprehensible and prone to erosion over time, or worse, removing some of the corner cases from being tested. By only testing the business logic, we can test all cases in less than a second.

Problem 3. Asynchronous assertions

Assertions are also problematic in such tests. If we want to check if the expected data has been kept in storage (e.g., a message queue or a data warehouse), we cannot simply check if the expected data is in the target location. Storages that we deal with in Dataflow applications usually do not operate via transactions and storing data may be delayed due to various optimizations (e.g., cost optimization or batch storage). Therefore, it is necessary to check if the data has already been stored in the loop. Of course, a test cannot run indefinitely, so you need to define the path/conditions for invalid termination, known as timeouts (the time after which we consider the code to be not working). Such tests will not pass when the code does not work properly AND when the termination conditions are badly defined, i.e., the timeout is set too short.

Problem 4. Not leveraging Beam’s test API

Data processing applications often operate on data divided into portions according to an event-time (the time of loading into the pipeline or a certain timestamp) – e.g., data should be processed every 15 minutes. Apache Beam offers a set of really useful classes that allow you to simulate windows and data timestamps. If the processing window is set for 1 day, you can effortlessly generate test data for several days. Unfortunately, you can easily prevent yourself from taking advantage of these benefits. If an application consumes data from a message queue on input (e.g., a Kafka topic) which should be processed in 15-minute portions, and in a test the data input is an instance of such a queue running in a container, then we would have to generate messages in such a queue with a 15-minute delay. If the window size is bigger than a few seconds, then such tests have no validity. And even if the window time size was very small, it would be very difficult and non-deterministic to direct the data generation in the test.
In short, you should separate business logic testing from pipeline input-output testing. It is worth limiting yourself to a few integration tests that check if the whole application reads, processes and stores data correctly. Basic test data sets are sufficient for this. Additional test cases should be verified with a focus on specific elements of the application. Moreover, if the structure of the application is designed properly, it is possible to test the entire pipeline logic by applying it to PCollection prepared in a test. Also, in case of more complex solutions, it’s worth considering moving a step further and completely separating business logic and its tests from Apache Beam framework.

New call-to-action

Don’t write unit tests on huge objects

Google Cloud Dataflow is a data processing tool. It enables converting objects of one type to objects of another type, changing the values of selected attributes of objects and combining different objects. Classes of processed objects may consist of dozens of attributes, but individual transformations often operate on only a few. This could be filtering out objects based on a certain date, filling in a few fields based on objects from the old system, or aggregating all objects with the same ID. When testing these transformations, there is no need to create full objects with dozens of fields. It is a battle-tested practice to focus only on those attributes that are relevant to the transformation that’s being tested.

Problem 1. You cannot see what is being tested

The Post object from the public Stackoverflow.com database consists of 23 attributes. If every transformation in the program operated on Post objects, most tests would have to create such objects. It is easy to imagine the appearance of a test class where each test begins with 23 lines of builder, which creates the object that’s being tested. Apart from the excessive length of such tests, using very large objects in their implementation has another negative effect. The readability of such tests strongly decreases – you can’t see at first glance which fields are important for the tested logic and which have been set only to create the object. For example, when testing a transformation that filters out old posts, it will be hard to immediately see which date is taken into account because the Post class has 6 of them.

Problem 2. Lack of test isolation

Of course, in such a class most of these 23 attributes are irrelevant to the business logic being tested, so in every test, these irrelevant attributes will have the same value. No self-respecting developers will be indifferent to such code duplication and will move the creation of test objects to one place. However, there are quite a few ways to do this.

public class TestObjects {
    public static Post post1() {
        return Post.builder()
            .id(“1”)
            .tags(List.of(“Java”))
            …
            .build();
    }

    public static Comment comment1() {
        return Comment.builder()
            .postId(“1”)
            …
            .build();
    }
}

class PostSearchTest {
    @Test
    void shouldFindPostByTag() {
        Post post = TestObjects.post1();
        Comment comment = TestObjects.comment1();
        …
        PAssert.that(
            allPosts.apply(filterByTag(“Java”))
        ).containsInAnyOrder(post);
    }
}

 
One way is to prepare a set of global objects for your tests, which can speed up writing tests and serve as a collection of sample attribute values. If they are closed inside a class as a bunch of static factory methods, then we have an Object Mother pattern. However, such a set will usually lack the object needed to test the corner case we are concerned with at a particular moment. It will also be difficult to keep track of the relationship between objects – the input object and the corresponding output object. This causes a subconscious resistance to add more test objects so as not to complicate our lives even more. This, in turn, leads to a reluctance to add new test scenarios, ultimately leaving the code not fully covered.

From a testing perspective, globally shared test objects make it hard to see what’s being tested and which attributes matter. Test isolation also suffers, because when we modify a global object for a new test, we change the data on which tests previously using that object relied. This way a case-specific test that relies on a particular value of a certain field can becomes useless. The worst part is that we won’t find out about it.

The solution that’s best at simplifying data in tests without sacrificing readability and isolation seems to be a combination of Object Mother and Builder design patterns. Instead of global, predefined objects in plain Object Mother, we prepare global builders of objects with all fields pre-set to some default values. Even though the difference between both approaches seems small, it’s actually significant. Using Builder lets us set only desired fields of the test object. The key here is to set exactly all test-relevant attributes of test objects in each test.

public class TestObjects {
    public static PostBuilder post() {
        return Post.builder()
            .id(123)

            .tags(List.of(“C#”))            …;
    }

    public static CommentBuilder comment() {
        return Comment.builder()
            .postId(5)
            …;
        
    }
}

class PostSearchTest {
    @Test
    void shouldFindPostByTag() {      

        Post post = TestObjects.post().id(“1”).tag(“Java”).build();
     
        Comment comment = TestObjects.comment().postId(“1”).build();
        …
        PAssert.that(
            allPosts.apply(filterByTag(“Java”))
        ).containsInAnyOrder(post);
    }
}


When a builder that creates a test object is properly set, the test reader immediately sees the important fields (and does not see the irrelevant ones). He/she can also see the relationships between objects because identifiers are visible.

Additionally, it’s much harder to violate the isolation of tests because each test explicitly sets all the key fields. So, from the point of view of the test, the values of the other attributes, as well as any future changes to them, should not matter or affect the outcome of the test.

Summary

There are many articles on test writing best practices, so I tried to avoid repeating what can be found in many other sources. I focused on my experience with ETL pipeline testing and ETL-specific best practices developed by our team. Using the tips proposed in this article will lead to the creation of readable tests that better document the code and improve test coverage by making it easier to implement. The value of such tests is considerable and devoting an appropriate amount of attention to them is crucial, as their absence or poor quality may be a direct reason for the failure of an entire project.  If you want to dive deeper into this subject and discuss most efficient testing strategies, fill in the contact form – we’re eager to share our experience and help you solve all the challenges you’re facing.

Share

Related posts

09 Jun 2022

A stable partnership is essential to software outsourcing

02 Jun 2022

Controlling CI/CD pipelines with commit messages

24 Mar 2022

3 Data Trends Set to Shape 2022

We may already be just one Email away from working together!

contact icon
Witold Gawlowski
Chief Commercial Officer
Jakub Śmietana
Sales Manager
Przemysław Jarecki
Sales Director
Piotr Brzózka
Sales Director
Ravi Saini
Sales Manager
Jacek Szmatka
VP of Business Development
scroll down icon back to
top