Jacek Laskowski (37) [Avatar] Offline
#1
Hi,

I'm wondering what's a StateStore exactly, esp. after I read the trick to avoid writing to a topic to create a KTable in https://docs.confluent.io/current/streams/faq.html#how-can-i-convert-a-kstream-to-a-ktable-without-an-aggregation-step.

Is a StateStore a topic in the "lowest depths"? Are there any use case where they should be avoided? I understand that topics are permanent data sources and sinks while state stores are temporary data storage.

Jacek
Bill Bejeck (44) [Avatar] Offline
#2
Hi Jacek,

I'll try to answer your questions in order.

1. In my opinion, a StateStore is a mechanism for "remembering" previously seen records, but I would not call it a topic as the idea of a topic implies anyone can consume the data, while StateStores are more of a private access variable, again in my opinion.

With stream processing, you are usually processing the latest available record, but sometimes you need to apply logic from a record you've already processed, and StateStores make this possible. State is necessary for any stream processing framework; sometimes you'll need to use the information you've already read to create a complete picture. I'm sure you already know this, I'm just trying to lay out my thoughts on the issue.

2. As for avoiding the use of a StateStore, I can't think of any situation-specific situation where you should avoid a StateStore; instead, it's a question of whether you need a stateful operation at all.

Thanks for the questions and I hope this answers your questions.

Cheers,
Bill
Jacek Laskowski (37) [Avatar] Offline
#3
Hi Bill,

Thanks Bill for the answers. That makes a lot of sense. My understanding of the purpose of StateStores also advanced a bit too and the last sentence where you said "it's a question of whether you need a stateful operation at all." is exactly when a StateStore is the only option built-in / natively supported by Kafka Streams.

As to whether a StateStore is or is not a topic, I'm yet to dig deeper in the code and see myself how Kafka Streams implements the concept. I would not be surprised if it was a topic (as that would make implementation more consistent).

Jacek
Bill Bejeck (44) [Avatar] Offline
#4
Hi Jacek,

Just to be clear, StateStores are not implemented as topics. A StateStore can either be persistent or in-memory. Persistent StateStores are implemented by using RocksDB under the covers and is stored on local disk. Furthermore, each task has its StateStore, so there is no sharing of information between tasks. In the case of persistent StateStores, this means there is a RocksDB instance on local disk for each task, for example, five tasks means 5 RocksDB instances. In-memory stores just use a HashMap or an LRU Map when a max size is specified.

Topics only come into play when considering fault tolerance. By default, StateStores are backed by a changelog (which is a topic) which gives StateStores fault tolerance. When a Kafka Streams application starts up, StateStores are re-populated by replaying the data from the changelog topic. In the case of a persistent store, we use local checkpoint files for restoration, but if those files were gone for some reason, then the StateStore is restored by reading data from the changelog topic.

The changelog topic backup can be turned off if desired.

Hope this clears things up a little more.

Thanks,
Bill
Jacek Laskowski (37) [Avatar] Offline
#5
Hi Bill,

That was exactly the kind of (low-level) answer I was dreaming of. Thanks!

Jacek