Introduction to Hazelcast Jet

1. Introduction

In this tutorial, we’ll learn about Hazelcast Jet. It’s a distributed data processing engine provided by Hazelcast, Inc., built on top of Hazelcast IMDG.

If you want to learn about Hazelcast IMDG, here is an article for getting started.

2. What is Hazelcast Jet?

Hazelcast Jet is a distributed data processing engine that treats data as streams. It can process data that is stored in a database or files as well as the data that is streamed by a Kafka server.

It can perform aggregate functions over infinite data streams by dividing the streams into subsets and applying aggregation over each subset. This concept is known as windowing in Jet terminology.
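To make the windowing idea concrete, here is a minimal plain-Java sketch of a tumbling window, which divides a stream into fixed-size, non-overlapping time buckets and aggregates each bucket independently. It uses no Jet APIs; the class, method, and sample timestamps are illustrative names of our own:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TumblingWindowSketch {

    // Assigns each event timestamp to the start of its window
    // and counts the events that fall into each window.
    static Map<Long, Long> countPerWindow(List<Long> timestamps, long windowSizeMs) {
        Map<Long, Long> counts = new TreeMap<>();
        for (long ts : timestamps) {
            long windowStart = (ts / windowSizeMs) * windowSizeMs;
            counts.merge(windowStart, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Long> events = Arrays.asList(100L, 950L, 1200L, 1800L, 2100L);
        // events at 100 and 950 fall into window [0, 1000),
        // 1200 and 1800 into [1000, 2000), and 2100 into [2000, 3000)
        System.out.println(countPerWindow(events, 1000L)); // {0=2, 1000=2, 2000=1}
    }
}
```

Jet applies the same principle, but over a live stream and distributed across the cluster.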

We can deploy Jet in a cluster of machines and then submit our data processing jobs to it. Jet automatically distributes the processing across all members of the cluster. Each member consumes a part of the data, which makes it easy to scale up to any level of throughput.

Here are the typical use cases for Hazelcast Jet:

  • Real-Time Stream Processing

  • Fast Batch Processing

  • Processing Java 8 Streams in a distributed way

  • Data processing in Microservices

3. Setup

To set up Hazelcast Jet in our environment, we just need to add a single Maven dependency to our pom.xml.

Here’s how we do it:

<dependency>
    <groupId>com.hazelcast.jet</groupId>
    <artifactId>hazelcast-jet</artifactId>
    <version>0.6</version>
</dependency>

Including this dependency will download an approximately 10 MB jar file, which provides all the infrastructure we need to build a distributed data processing pipeline.

The latest version of Hazelcast Jet can be found here.

4. Sample Application

To learn more about Hazelcast Jet, we’ll create a sample application that takes a list of sentences and a word to find, and returns the number of occurrences of that word across the sentences.

4.1. The Pipeline

A Pipeline forms the basic construct for a Jet application. Processing within a pipeline follows these steps:

  • draw data from a source

  • transform the data

  • drain the data into a sink

For our application, the pipeline will draw from a distributed List, apply grouping and aggregation transformations, and finally drain to a distributed Map.

Here’s how we write our pipeline, along with the static imports it relies on:

import static com.hazelcast.jet.Traversers.traverseArray;
import static com.hazelcast.jet.aggregate.AggregateOperations.counting;
import static com.hazelcast.jet.function.DistributedFunctions.wholeItem;

private Pipeline createPipeLine() {
    Pipeline p = Pipeline.create();
    p.drawFrom(Sources.<String> list(LIST_NAME))
      .flatMap(
        word -> traverseArray(word.toLowerCase().split("\\W+")))
      .filter(word -> !word.isEmpty())
      .groupingKey(wholeItem())
      .aggregate(counting())
      .drainTo(Sinks.map(MAP_NAME));
    return p;
}

Once we’ve drawn from the source, we traverse the data and split each line into words using a regular expression that matches runs of non-word characters. After that, we filter out the empty strings.

Lastly, we group the words, aggregate them and drain the results to a Map. 
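For intuition, the same grouping and counting can be expressed with plain Java 8 Streams on a single JVM; Jet performs the equivalent steps, but partitioned across the members of the cluster. The class and method names below are illustrative, not part of the Jet API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class LocalWordCount {

    static Map<String, Long> countWords(List<String> sentences) {
        return sentences.stream()
          // split each sentence on non-word characters, as in the pipeline
          .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
          .filter(word -> !word.isEmpty())
          // the single-JVM analogue of groupingKey(wholeItem()) + aggregate(counting())
          .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }
}
```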

4.2. The Job

Now that our pipeline is defined, we create a job for executing the pipeline.

Here’s how we write a countWord function that accepts the sentences and the word to search for, and returns the count:

public Long countWord(List<String> sentences, String word) {
    long count = 0;
    JetInstance jet = Jet.newJetInstance();
    try {
        List<String> textList = jet.getList(LIST_NAME);
        textList.addAll(sentences);
        Pipeline p = createPipeLine();
        jet.newJob(p)
          .join();
        Map<String, Long> counts = jet.getMap(MAP_NAME);
        count = counts.get(word);
    } finally {
        Jet.shutdownAll();
    }
    return count;
}

We create a Jet instance first in order to create our job and use the pipeline. Next, we copy the input List to a distributed list so that it’s available over all the instances.

We then submit a job using the pipeline that we built above. The newJob() method returns an executable job that Jet starts asynchronously. The join() method waits for the job to complete and throws an exception if the job completed with an error.

When the job completes, the results are available in the distributed Map we defined as the sink in our pipeline. So, we retrieve the Map from the Jet instance and look up the count of the word in it.

Lastly, we shut down the Jet instance. It is important to shut it down after execution has ended because a Jet instance starts its own threads. Otherwise, our Java process would remain alive even after our method has exited.

Here is a unit test that tests the code we have written for Jet:

@Test
public void whenGivenSentencesAndWord_ThenReturnCountOfWord() {
    List<String> sentences = new ArrayList<>();
    sentences.add("The first second was alright, but the second second was tough.");
    WordCounter wordCounter = new WordCounter();
    long countSecond = wordCounter.countWord(sentences, "second");
    assertEquals(3, countSecond);
}

5. Conclusion

In this article, we’ve learned about Hazelcast Jet. To learn more about it and its features, refer to the manual.

As usual, the code for the example used in this article can be found over on Github.
