duanjingjing (6)
#1
Regarding "bypassing the process", why is that not an appropriate thing to do? For a given TodoList, there's always a single TodoServer that handles it (based on its name). So I think we can bypass the TodoDatabase process and just let the TodoServer handle persistence. The database will be consistent because a TodoServer is internally synchronized. You can't do two operations to the same database/file simultaneously.

On the other hand, I can see a potential problem with the proposed solution of "handling requests concurrently". Imagine a client issues two updates to the same TodoList. The TodoDatabase would spawn two workers to handle them. Assuming those two workers are unrelated, they may be scheduled in arbitrary order. It's possible that the 2nd update runs first, and you end up with the 1st update overriding the 2nd. Now you have a data consistency problem. This can be solved by "pooling", meaning we always send updates for a given TodoList to the same worker.

Also, I can see a benefit of separating updating the TodoList from persisting it to the database. When they are separated, you can achieve a higher level of concurrency. This is better than doing both in the TodoServer.

Thoughts?
sjuric (86)
#2
All excellent points. My answers are inline.

duanjingjing wrote:Regarding "bypassing the process", why is that not an appropriate thing to do? For a given TodoList, there's always a single TodoServer that handles it (based on its name). So I think we can bypass the TodoDatabase process and just let the TodoServer handle persistence. The database will be consistent because a TodoServer is internally synchronized. You can't do two operations to the same database/file simultaneously.


That is indeed true from the standpoint of this simple application: all requests go through the TodoServer, so no additional synchronization is required.

However, by removing synchronization from the database, we're dropping all concurrency guarantees in that layer, and we have to ensure proper synchronization in the layer above. If you have some other process that wants to persist a to-do list, or you want to use the database for some other kind of data, you need to make sure that all db calls are properly synchronized.

This may or may not be fine, depending on the particular use case.
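For readers following along, the synchronized approach under discussion can be sketched as a single process that owns all file writes. This is a minimal illustrative sketch, not the book's exact code; the module name and storage format are assumptions:

```elixir
defmodule Todo.Database do
  use GenServer

  def start_link(db_folder) do
    GenServer.start_link(__MODULE__, db_folder, name: __MODULE__)
  end

  # Every caller funnels through this single registered process, so two
  # stores to the same key can never run at the same time.
  def store(key, data) do
    GenServer.cast(__MODULE__, {:store, key, data})
  end

  @impl true
  def init(db_folder) do
    File.mkdir_p!(db_folder)
    {:ok, db_folder}
  end

  @impl true
  def handle_cast({:store, key, data}, db_folder) do
    db_folder
    |> Path.join(to_string(key))
    |> File.write!(:erlang.term_to_binary(data))

    {:noreply, db_folder}
  end
end
```

Because the process mailbox serializes requests, the file layer needs no extra locking, at the cost of making this one process a potential bottleneck.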

duanjingjing wrote:
On the other hand, I can see a potential problem with the proposed solution of "handling requests concurrently". Imagine client just issues two updates to the same TodoList. The TodoDatabase would spawn off two workers to handle them. Assuming those two workers are unrelated, they may be scheduled to run at arbitrary order. It's possible that the 2nd update runs first and you will end up having the 1st update override the 2nd update. Now you have a data consistency problem. This can be solved by "pooling", meaning we always send updates to a TodoList to the same worker.


Yes, I need to mention this issue explicitly in the next edition. It is indeed a problem with that approach, and it's one of the reasons I propose the pooling solution with hashing-based distribution: the same item always goes to the same db worker, while different items may be handled in different processes.
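The hashing-based routing mentioned above can be sketched in a few lines. The pool size and keys here are assumed values for illustration:

```elixir
# :erlang.phash2/2 maps any term to an integer in 0..pool_size-1
# deterministically, so the same key always selects the same worker index.
pool_size = 3

choose_worker = fn key -> :erlang.phash2(key, pool_size) end

# Updates for "bob's list" always land on one worker, serializing them,
# while other lists may hash to different workers and run concurrently.
bob_worker = choose_worker.("bob's list")
^bob_worker = choose_worker.("bob's list")
```

This gives per-key ordering guarantees without funneling all keys through a single process.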



duanjingjing wrote:
Also, I can see a benefit of separating updating the TodoList from persisting it to the database. When they are separated, you can achieve a higher level of concurrency. This is better than doing both in the TodoServer.


There's no "one size fits all" approach here. It all depends on the case you have and the guarantees you want to make. Updating the cache and the db concurrently should improve efficiency, but you lose consistency. If the cache update succeeds but the db store fails, your cache is no longer in sync with the disk. Depending on the specific scenario, this might (but need not) be a problem.

The approach I'd most likely consider in a real system is to first store to the db, and only if that succeeds, update the cache. This avoids the problem of "ghosts": the cache will mostly point to the new data, with only a brief interval of pointing to stale data (after the data is stored but before the cache is updated). At the same time, this approach keeps processes "locked" for a minimal time: the cache is not busy while the db is storing, and the db worker is not busy while the cache is updating.
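A minimal sketch of this "db first, then cache" ordering. `Database` and `Cache` here are hypothetical Agent-backed stand-ins, not the book's real modules:

```elixir
defmodule Database do
  def start_link, do: Agent.start_link(fn -> %{} end, name: __MODULE__)
  # Returns :ok on success; a real db worker could return {:error, reason}.
  def store(key, data), do: Agent.update(__MODULE__, &Map.put(&1, key, data))
end

defmodule Cache do
  def start_link, do: Agent.start_link(fn -> %{} end, name: __MODULE__)
  def put(key, data), do: Agent.update(__MODULE__, &Map.put(&1, key, data))
  def get(key), do: Agent.get(__MODULE__, &Map.get(&1, key))
end

defmodule Persist do
  # Update the cache only after the db write succeeds, so the cache never
  # points at data that was never persisted.
  def save(key, data) do
    case Database.store(key, data) do
      :ok -> Cache.put(key, data)
      {:error, _} = error -> error
    end
  end
end
```

If the store fails, the cache is left untouched and still reflects the last persisted state, which is exactly the consistency property described above.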

But again, the appropriate solution really depends on the particular nuances of the use case, so in some other situations another approach might work better.

Keep in mind that this whole to-do example is radically simplified. The example serves as a mock of a real app, but it's kept extremely simple so we can focus on the Erlang/OTP specifics, rather than wrestle with the nuances of the problem we're solving. The main goal is to make readers aware of how processes tick and what the trade-offs are. Once you know that, you can choose the most appropriate solution for each specific problem. Judging by your questions, you're already there :)

duanjingjing (6)
#3
Thanks for your comments. Really appreciate it!

sjuric wrote:All excellent points. My answers are inline.

duanjingjing wrote:Regarding "bypassing the process", why is that not an appropriate thing to do? For a given TodoList, there's always a single TodoServer that handles it (based on its name). So I think we can bypass the TodoDatabase process and just let the TodoServer handle persistence. The database will be consistent because a TodoServer is internally synchronized. You can't do two operations to the same database/file simultaneously.


That is indeed true from the standpoint of this simple application: all requests go through the TodoServer, so no additional synchronization is required.

However, by removing synchronization from the database, we're dropping all concurrency guarantees in that layer, and we have to ensure proper synchronization in the layer above. If you have some other process that wants to persist a to-do list, or you want to use the database for some other kind of data, you need to make sure that all db calls are properly synchronized.

This may or may not be fine, depending on the particular use case.
Good point. If there are additional clients, it's definitely better to have a database layer that synchronizes persistence. I was only thinking of the example in the book, so I didn't consider additional clients.

duanjingjing wrote:
On the other hand, I can see a potential problem with the proposed solution of "handling requests concurrently". Imagine a client issues two updates to the same TodoList. The TodoDatabase would spawn two workers to handle them. Assuming those two workers are unrelated, they may be scheduled in arbitrary order. It's possible that the 2nd update runs first, and you end up with the 1st update overriding the 2nd. Now you have a data consistency problem. This can be solved by "pooling", meaning we always send updates for a given TodoList to the same worker.


Yes, I need to mention this issue explicitly in the next edition. It is indeed a problem with that approach, and it's one of the reasons I propose the pooling solution with hashing-based distribution: the same item always goes to the same db worker, while different items may be handled in different processes.

Yes indeed.



duanjingjing wrote:
Also, I can see a benefit of separating updating the TodoList from persisting it to the database. When they are separated, you can achieve a higher level of concurrency. This is better than doing both in the TodoServer.


There's no "one size fits all" approach here. It all depends on the case you have and the guarantees you want to make. Updating the cache and the db concurrently should improve efficiency, but you lose consistency. If the cache update succeeds but the db store fails, your cache is no longer in sync with the disk. Depending on the specific scenario, this might (but need not) be a problem.

The approach I'd most likely consider in a real system is to first store to the db, and only if that succeeds, update the cache. This avoids the problem of "ghosts": the cache will mostly point to the new data, with only a brief interval of pointing to stale data (after the data is stored but before the cache is updated). At the same time, this approach keeps processes "locked" for a minimal time: the cache is not busy while the db is storing, and the db worker is not busy while the cache is updating.
Nice trick. This sounds like a good idea. Thanks for sharing it!

But again, the appropriate solution really depends on the particular nuances of the use case, so in some other situations another approach might work better.

Keep in mind that this whole to-do example is radically simplified. The example serves as a mock of a real app, but it's kept extremely simple so we can focus on the Erlang/OTP specifics, rather than wrestle with the nuances of the problem we're solving. The main goal is to make readers aware of how processes tick and what the trade-offs are. Once you know that, you can choose the most appropriate solution for each specific problem. Judging by your questions, you're already there :)

Sounds like you are writing, or thinking of writing, a 2nd edition. What's new in it?

sjuric (86)
#4
duanjingjing wrote:Thanks for your comments. Really appreciate it!
Sounds like you are writing, or thinking of writing, a 2nd edition. What's new in it?


To be honest, I haven't even discussed this with Manning yet, but I don't think a 2nd edition is needed at this point, given that the 1st one was released only a few months ago.

I'd say that the major factor is the amount of changes, especially breaking ones, in subsequent releases of Elixir and Erlang. Once we have enough of those, the book should be updated. Since most of the book focuses on concurrency, fault-tolerance, and distribution via OTP, and these concepts and the underlying APIs are not likely to change much, I expect that most of the updates will be required in the first part of the book, which deals with the Elixir language. Also, at that point I plan to address the errata and do minor revisions of some parts, inspired by questions such as the ones you asked in this thread.

When it comes to expanding the book, I'm not currently thinking of covering other topics (e.g. Agents, Tasks, or the upcoming GenRouter). I never envisioned EiA as a "reference bible". Instead, the aim is to treat topics that are essential, but likely unfamiliar to most readers, who presumably arrive from OO. This keeps the book less cumbersome: it has a reasonable size while still treating the most important topics. Once you gain confidence in the fundamental concepts, it should be fairly easy to research the remaining areas on your own.

Of course, suggestions are always welcome. Every idea, proposal, or criticism will be carefully considered and addressed :)
duanjingjing (6)
#5
sjuric wrote:
duanjingjing wrote:Thanks for your comments. Really appreciate it!
Sounds like you are writing, or thinking of writing, a 2nd edition. What's new in it?


To be honest, I haven't even discussed this with Manning yet, but I don't think a 2nd edition is needed at this point, given that the 1st one was released only a few months ago.

I'd say that the major factor is the amount of changes, especially breaking ones, in subsequent releases of Elixir and Erlang. Once we have enough of those, the book should be updated. Since most of the book focuses on concurrency, fault-tolerance, and distribution via OTP, and these concepts and the underlying APIs are not likely to change much, I expect that most of the updates will be required in the first part of the book, which deals with the Elixir language. Also, at that point I plan to address the errata and do minor revisions of some parts, inspired by questions such as the ones you asked in this thread.

When it comes to expanding the book, I'm not currently thinking of covering other topics (e.g. Agents, Tasks, or the upcoming GenRouter). I never envisioned EiA as a "reference bible". Instead, the aim is to treat topics that are essential, but likely unfamiliar to most readers, who presumably arrive from OO. This keeps the book less cumbersome: it has a reasonable size while still treating the most important topics. Once you gain confidence in the fundamental concepts, it should be fairly easy to research the remaining areas on your own.

Of course, suggestions are always welcome. Every idea, proposal, or criticism will be carefully considered and addressed :)


Totally agree. I think the OTP material you cover in the book is very good and time-tested. The examples in the book are particularly useful to me: I've been trying to write them myself and then compare them with your solutions, which has worked very well for me. I've finished the first 7 chapters, and the remaining chapters look even more interesting to me :)