Someone (4)
#1
Hi

I read through MEAPv4. In general I liked it, could have used it in the past and know people to whom I would recommend the final version.

There is some stuff I want to point out that can be IMHO improved.

1.) Engineering is all about trade-offs. I found this insight very clarifying, and somewhere in chapter 2 or 3 you should point it out loudly. You are not going to make everybody happy on your budget. You have to find a compromise with the users.

2.) I find it surprising that you ignore encoding in chapter 7. In particular, I would have expected "Listing 7.1. musicians-metadata.json" to define the encoding of the musicians.csv file. In my experience this is one of the more common and difficult-to-catch errors.

3.) In chapter "8 reloading data from the source files" the trade-off subject pops up again. I'm grateful that you point out how to approach it, but in practice I expect it to be more difficult than you make it appear. Your original ETL may be based on a certain Python version, and you may have to port it to new versions. Likewise, as you move to newer versions of a library, you have to port your old code to work with them. The ETL process may also be influenced by business rules in the database, which are better off properly versioned. This all adds a lot of weight to your development process that has to be justified.
Recovering from source may sound good until you realize that it takes a few hours to load a day of data, so recovering from source files and a backup from last year actually takes a whole month of loading.
In my career, I only supported old formats when it was very clear that it was necessary. And I never regretted when I did not support old formats.
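
To put rough numbers on that, here is a back-of-the-envelope sketch. The function name `reload_days` is made up for illustration, and "a few hours" is assumed to mean 2 hours of loading per day of data.

```python
def reload_days(days_of_data, load_hours_per_day):
    """Estimate the wall-clock days needed to reload data from source files."""
    # Total loading hours, converted to days of non-stop loading.
    return days_of_data * load_hours_per_day / 24


# Assumption: one year of data, 2 hours of loading per day of data.
print(round(reload_days(365, 2), 1))  # 30.4, roughly a month
```

Under those assumptions the reload runs around the clock for about a month, and any pause or failed run only adds to that.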

4.) Finally, I'm undecided about the "not production code" Python examples. On the plus side, they concretely show the process. But then you can't apply them to production. Maybe you can put candidates for the various categories in an appendix? "monitoring:nagios; backup:...."

I'm curious about the next version.

Best,
Henryk
Tryggvi Björgvinsson (2)
#2
Hi Henryk,

Thanks for the message. I really appreciate your feedback. It's great and very helpful. I'll try to write down my thoughts on each of your points below.

Someone wrote: Engineering is all about trade-offs. I found this insight very clarifying, and somewhere in chapter 2 or 3 you should point it out loudly. You are not going to make everybody happy on your budget. You have to find a compromise with the users.


YES! This is a really good point. This is exactly why I go through the whole exercise of prioritizing needs. That helps you find your trade-offs. I'll definitely take this pointer to heart and will point it out as loudly as I can in chapter 3 when I discuss how to prioritize needs (as one of the reasons for doing it).

Someone wrote: I find it surprising that you ignore encoding in chapter 7. In particular, I would have expected "Listing 7.1. musicians-metadata.json" to define the encoding of the musicians.csv file. In my experience this is one of the more common and difficult-to-catch errors.


Oh yeah. I hate encodings. I come from a country where you need to deal with encodings all the time. CSV on the Web, which I use in the chapter, includes the possibility of defining the encoding and defaults to UTF-8, which is also the default in Python 3. I don't know if it's helpful to mix that into the example or if it will just blur the learning experience (I'm having a really difficult time trying to keep the book short and avoid too much detail).
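
For illustration, a minimal sketch of what declaring the encoding could look like, assuming the metadata uses the CSV on the Web `dialect` / `encoding` property. The metadata contents, the sample row and the helper name `read_rows` are hypothetical and not taken from the book's Listing 7.1.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical metadata in the spirit of CSV on the Web; only the
# "dialect" / "encoding" property matters for this sketch.
METADATA = {
    "url": "musicians.csv",
    "dialect": {"encoding": "iso-8859-1"},
}


def read_rows(metadata, base_dir):
    """Read the CSV named in the metadata using its declared encoding."""
    # Fall back to utf-8, the CSV on the Web default, when nothing is declared.
    encoding = metadata.get("dialect", {}).get("encoding", "utf-8")
    path = Path(base_dir) / metadata["url"]
    with open(path, newline="", encoding=encoding) as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Made-up sample data, written as Latin-1; these bytes are not valid utf-8.
        (Path(tmp) / "musicians.csv").write_bytes(
            "name,band\nBjörk,Sugarcubes\n".encode("iso-8859-1")
        )
        print(json.dumps(read_rows(METADATA, tmp), ensure_ascii=False))
```

Without the declared encoding, the same file would raise a `UnicodeDecodeError` under the UTF-8 default, which is the kind of error that is easy to miss until real data shows up.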

Someone wrote: In chapter "8 reloading data from the source files" the trade-off subject pops up again. I'm grateful that you point out how to approach it, but in practice I expect it to be more difficult than you make it appear. Your original ETL may be based on a certain Python version, and you may have to port it to new versions. Likewise, as you move to newer versions of a library, you have to port your old code to work with them. The ETL process may also be influenced by business rules in the database, which are better off properly versioned. This all adds a lot of weight to your development process that has to be justified.
Recovering from source may sound good until you realize that it takes a few hours to load a day of data, so recovering from source files and a backup from last year actually takes a whole month of loading.
In my career, I only supported old formats when it was very clear that it was necessary. And I never regretted when I did not support old formats.


I agree with you. There are of course a lot of different aspects we must consider all the time. Supporting old formats may in the end be more hassle and definitely not as easy as I make it out to be. It's all about circumstances. It may even be more hassle supporting older formats than just converting them to newer formats (at the expense of not keeping the source intact).

I'm trying to keep the examples short and teach the mindset rather than the exact solution. The way I implement the examples is probably not applicable to most situations, but thinking about how you'll read old data, whether you're going to support it (or convert it), and measuring whether you're doing it properly is a mindset that's needed no matter what solution you use.

The metrics are also important to define based on your situation. Time to recover may be the metric you're interested in (MTTR - Mean Time To Recovery), not whether something is recoverable or not. Reducing the MTTR is a perfectly valid quality cycle iteration (and metric).
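
As a rough sketch of how such a metric could be measured automatically: the incident log below is made up, and `mean_time_to_recovery` is a hypothetical helper, not code from the book.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average the time between each failure and its recovery."""
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    # Sum the individual recovery durations, starting from a zero timedelta.
    total = sum(
        (recovered - failed for failed, recovered in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Made-up incident log: (failure detected, data restored).
INCIDENTS = [
    (datetime(2018, 1, 3, 9, 0), datetime(2018, 1, 3, 13, 0)),    # 4 hours
    (datetime(2018, 2, 10, 22, 0), datetime(2018, 2, 11, 4, 0)),  # 6 hours
]

if __name__ == "__main__":
    print(mean_time_to_recovery(INCIDENTS))  # 5:00:00
```

A quality cycle iteration could then compare this value against a target you pick for your own situation and try to push it down.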

Someone wrote: Finally, I'm undecided about the "not production code" Python examples. On the plus side, they concretely show the process. But then you can't apply them to production. Maybe you can put candidates for the various categories in an appendix? "monitoring:nagios; backup:...."


You've touched on one of the biggest problems I've had with the book. It's very easy to make the book excruciatingly dry, theoretical and boring by just talking about all the different topics from a solution-agnostic, generic overview. I didn't want that kind of book.

I want the book to be more action-oriented. Have people do some coding to get into this mindset of working on improvements in iterations and automatically measuring the quality levels. Now I could have picked different solutions and written the code around them but as you said, there are trade-offs.

The data environment of all readers is not homogeneous. We all have a different environment, different tools and in my experience that just throws confusion into the mix. It's easier for people to disregard the message by just saying: "I don't use postgres like he does" or "my backup solution already includes integrity matching so I don't have to worry about recoverability".

This heterogeneous environment, and my fear of not being able to teach the theoretical mindset through action if I use specific solutions, is why I chose to use easier-to-understand, non-production code. I'm hoping people will know what to think about in their own environment.

In addition to that, I'd find myself obliged to help the reader set up the solution if I introduce a specific one (and set it up on all kinds of operating systems). I've given the specific tools and production code some thought, and I go back and forth about whether this is a good idea or not. I've been toying with the idea of using containers, like Docker, to make it easier to set up the environment and some concrete tools. I just don't want this to turn into a book on how to use Docker (I've actually gotten some reviews that say the book successfully teaches people how to use click, as if that were something I wanted to teach).

I'll give this some more thought and maybe stumble upon a good way to teach with production code without affecting the real lesson, without allowing people to disregard the lesson because their environment is different, and while keeping it interesting and understandable even if it becomes more technical because of solution-specific implementations.

As you can see, I'm on the fence, but you've actually pinpointed the biggest trouble I've had with writing the book: getting the lesson across in an understandable and useful way.

Thanks again for your feedback. I really appreciate it and it helps me improve my book (and I'm all for improvements).

/Tryggvi