svencowart (3)
#1
I am unsure how to separate the concept of a StreamTask from a partition. The text says: “State Stores are Assigned Per Task -
The statement above could be interpreted to mean that each partition has its own state store, but that is not the case. Partitions are assigned to a StreamTask and each StreamTask has its own state store.”

What defines a StreamTask? At what point are multiple copies of a stream topology running via multiple copies of the same StreamTask? In my mind, if I run a StreamTask, there is one instance of that StreamTask, so why is there a need to repartition the data? The only way I can make sense of the excerpt above is if each broker has its own copy of a StreamTask; then I can see the necessity of repartitioning the data.

I apologize if my question seems silly, as I am still fairly new to Kafka and Kafka Streams. I just find the excerpt about a StateStore important but lacking a proper explanation.
Bill Bejeck (45)
#2
svencowart,

Thanks for asking the question.

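The first point to keep in mind is that you only need to repartition the data when you change keys and your topic has more than one partition. Partitions are what allow data to be processed in parallel, so after a key change, records with the same new key may be spread across different partitions until they are repartitioned.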
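
To make the first point concrete, here is a minimal sketch in plain Java (no Kafka dependency). It assumes a simplified, hypothetical partitioner based on `String.hashCode()`; Kafka's default partitioner hashes the serialized key bytes with murmur2 instead, so the actual partition numbers will differ. The class name, method names, and sample records are all made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

class RepartitionSketch {

    // Hypothetical stand-in for a key-based partitioner: maps a key to a
    // partition number in [0, numPartitions). Not Kafka's real hash function.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // True when a record would land on a different partition under its new key,
    // which is the situation repartitioning is meant to correct.
    static boolean mayNeedMove(String oldKey, String newKey, int numPartitions) {
        return partitionFor(oldKey, numPartitions) != partitionFor(newKey, numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 3;

        // Sample records as {customer, item}; originally keyed by customer.
        List<String[]> purchases = Arrays.asList(
                new String[] {"alice", "coffee"},
                new String[] {"bob", "tea"},
                new String[] {"carol", "coffee"});

        for (String[] purchase : purchases) {
            String customer = purchase[0];
            String item = purchase[1];
            // Re-keying by item changes where the record "belongs".
            System.out.printf("%s/%s: partition by customer=%d, by item=%d, move=%b%n",
                    customer, item,
                    partitionFor(customer, numPartitions),
                    partitionFor(item, numPartitions),
                    mayNeedMove(customer, item, numPartitions));
        }
    }
}
```

With a single partition, `mayNeedMove` always returns false, which matches the point that repartitioning only matters when the topic has more than one partition.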

The second point is that you don't directly run or control a StreamTask. A StreamTask is created by Kafka Streams and is assigned to a StreamThread for processing. The reason a state store

Depending on the number of topics in your topology and the number of partitions, you may have multiple StreamTask objects, but it's important to keep in mind that there are never "copies" of a StreamTask; each one is distinct.
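
As a rough mental model of the above, the sketch below (plain Java, no Kafka dependency) creates one task per partition, gives each task its own state store, and spreads the tasks over a fixed number of threads. It assumes a single sub-topology whose task count equals the largest partition count among its source topics. The `Task` class, `createTasks`, and the round-robin `assign` are hypothetical simplifications, not the Kafka Streams classes or its actual assignment logic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TaskSketch {

    // Hypothetical stand-in for a StreamTask: one per partition, with its own store.
    static final class Task {
        final String id;
        final int partition;
        final Map<String, Long> stateStore = new HashMap<>(); // never shared between tasks

        Task(int subTopology, int partition) {
            this.id = subTopology + "_" + partition; // e.g. "0_2"
            this.partition = partition;
        }
    }

    // One task per partition, using the largest partition count among the source topics.
    static List<Task> createTasks(int... partitionCounts) {
        int max = 0;
        for (int count : partitionCounts) {
            max = Math.max(max, count);
        }
        List<Task> tasks = new ArrayList<>();
        for (int partition = 0; partition < max; partition++) {
            tasks.add(new Task(0, partition));
        }
        return tasks;
    }

    // Simplified round-robin assignment of tasks to thread indexes (an assumption).
    static Map<Integer, List<Task>> assign(List<Task> tasks, int numThreads) {
        Map<Integer, List<Task>> byThread = new HashMap<>();
        for (int thread = 0; thread < numThreads; thread++) {
            byThread.put(thread, new ArrayList<>());
        }
        for (int i = 0; i < tasks.size(); i++) {
            byThread.get(i % numThreads).add(tasks.get(i));
        }
        return byThread;
    }

    public static void main(String[] args) {
        // Two source topics with 3 and 2 partitions, processed by 2 threads.
        List<Task> tasks = createTasks(3, 2);
        Map<Integer, List<Task>> assignment = assign(tasks, 2);
        for (Map.Entry<Integer, List<Task>> entry : assignment.entrySet()) {
            for (Task task : entry.getValue()) {
                System.out.println("thread " + entry.getKey() + " runs task " + task.id);
            }
        }
    }
}
```

Each `Task` here is a separate object tied to one partition number, so adding threads changes where tasks run but not how many tasks exist.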

Does this clear things up?

HTH,
Bill