560191 (2)
Hi :)

We are doing our master's thesis at the IT University of Copenhagen, and we have a series of questions that we hope have some useful answers :)

We are working with a setup very similar to the spam-filter application case from chapter 3 in the "Practical Probabilistic Programming" book, and our questions concern the efficiency of learning for such a model. In essence, we have several model instances that all reference some "shared" Beta elements for learning, which results in quite a large net of connected elements. We would like to learn our Beta elements without having to evaluate the entire network of connected elements at once, and instead train on each individual model instance one at a time.

Here are some more specific questions:

- Why does EMWithVE use a completely different setup (ExpectationMaximizationWithFactors) compared to the other inference algorithms when used with EM? What optimizations or differences apply here, and is there some literature you could point us to that would help us understand them?

- If we attempt to use GeneralizedEM with VE, it seems that all active elements in the universe (and thereby all our connected model instances) are passed as inputs to the inference algorithm. As the number of model instances increases, this quickly becomes infeasible for an algorithm such as VE.
If we consider the spam filter case from Chapter 3, would it not be possible to use the inference algorithm on each sample separately and then combine their results during the expectation step, rather than attempting to calculate the sufficient statistics for all model instances' elements all at once?
We figured that this splitting approach might be feasible with VE (if each individual model instance is not very complex), and that it would have the added benefit of being parallelizable (since each sample can be reasoned about separately) if we can use StructuredVE for the task.
Is there a reason why this approach is not used? Is it not feasible? If it is possible, could you provide some pointers for how we can achieve this goal?
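
To make the splitting idea concrete, here is a minimal sketch in plain Python (not Figaro code, and not how Figaro implements EM). It assumes a toy spam model with one hidden label per instance and binary word features, and it assumes the instances are independent of each other once the shared parameters are fixed at their current values. The function names, the toy data and the Beta pseudo-count update are all our own illustrative choices. In Figaro, the per-instance posterior would come from an inference algorithm such as VE; here it is computed in closed form.

```python
from functools import reduce
from math import prod

def e_step_one(features, prior_spam, theta):
    """Expected sufficient statistics contributed by ONE instance.

    theta[c][j] is the current estimate of P(feature j present | class c),
    with c = 0 for ham and c = 1 for spam.
    """
    spam = prior_spam * prod(t if f else 1 - t for f, t in zip(features, theta[1]))
    ham = (1 - prior_spam) * prod(t if f else 1 - t for f, t in zip(features, theta[0]))
    p_spam = spam / (spam + ham)
    weights = (1 - p_spam, p_spam)
    class_counts = list(weights)
    feature_counts = [[w if f else 0.0 for f in features] for w in weights]
    return class_counts, feature_counts

def merge(a, b):
    """Combine the statistics of two instances (or two batches) by addition."""
    class_counts = [x + y for x, y in zip(a[0], b[0])]
    feature_counts = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a[1], b[1])]
    return class_counts, feature_counts

def m_step(stats, alpha=1.0, beta=1.0):
    """Re-estimate the shared parameters from the merged statistics.

    Assumption: expected counts are smoothed with Beta(alpha, beta)
    pseudo-counts; this is not necessarily the update Figaro applies.
    """
    class_counts, feature_counts = stats
    total = sum(class_counts)
    prior_spam = (class_counts[1] + alpha) / (total + alpha + beta)
    theta = [[(n + alpha) / (class_counts[c] + alpha + beta) for n in feature_counts[c]]
             for c in (0, 1)]
    return prior_spam, theta

def em_iteration(instances, prior_spam, theta):
    """One EM iteration: per-instance E-steps, merged, then a single M-step."""
    stats = reduce(merge, (e_step_one(x, prior_spam, theta) for x in instances))
    return m_step(stats)

# Hypothetical toy data: each tuple marks which of three words is present.
data = [(1, 1, 0), (1, 0, 0), (0, 0, 1), (0, 1, 1)]
prior, theta = em_iteration(data, 0.5, [[0.4, 0.5, 0.6], [0.6, 0.5, 0.4]])
```

Because `merge` is plain addition, the per-instance E-steps could in principle run in any order or in parallel, and only the merged counts are needed for the M-step.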

To give some perspective: we are trying to optimize the training setup for our thesis work, so that the effect of an alteration to the probabilistic model takes as little time as possible to see, both with regard to training and, of course, evaluation. Having our model instances tangled into each other through the shared Beta elements seems to hurt the efficiency of most inference algorithms in Figaro that are usable with EM. Is there some other approach that we could use as an alternative setup?

As another note, we believe that we can build our model so that it has little to no hidden variables (none in the learning case, and only a single one in the evaluation phase), which should help the efficiency of whatever inference algorithm we end up with.
Also, according to the literature (https://ai.stanford.edu/~chuongdo/papers/em_tutorial.pdf), if one has no hidden variables, then one is in the "complete data case", meaning that Maximum Likelihood Estimation should be feasible for the problem: simply learning the frequencies in one's dataset, rather than requiring the use of EM. Is there some way to access the MLE logic that is used as part of the EM algorithm from somewhere in the source code?
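
As a small illustration of what we mean by the complete data case, here is a sketch in plain Python (again not Figaro code; the function name and data are hypothetical). With fully observed Bernoulli outcomes, the maximum likelihood estimate is just the observed frequency, and a Beta(alpha, beta) prior only adds pseudo-counts to it. The MAP formula below is the standard mode of the Beta posterior; we are not claiming it is the exact update Figaro uses internally.

```python
def estimate_bernoulli(observations, alpha=1.0, beta=1.0):
    """Return (MLE, MAP) for a Bernoulli parameter from fully observed data.

    MLE is the observed frequency. MAP is the mode of the
    Beta(alpha + successes, beta + failures) posterior, which requires
    alpha + beta + n > 2 to be well defined.
    """
    n = len(observations)
    if n == 0:
        raise ValueError("need at least one observation")
    successes = sum(1 for o in observations if o)
    mle = successes / n
    map_estimate = (successes + alpha - 1) / (n + alpha + beta - 2)
    return mle, map_estimate

# Hypothetical data: whether a given word occurred in each of four spam emails.
observed = [True, True, False, True]
mle_flat, map_flat = estimate_bernoulli(observed)            # uniform Beta(1, 1) prior
mle_b22, map_b22 = estimate_bernoulli(observed, 2.0, 2.0)    # Beta(2, 2) prior
```

With the uniform Beta(1, 1) prior the two estimates coincide, which is why plain frequency counting should be enough when nothing is hidden.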

Thanks a lot,
Hoping the best,
Best regards,
Christian and Kenneth,
Students at the IT University of Copenhagen, Denmark
560191 (2)
We received a response on our GitHub issue:

This thread may be closed or deleted (we couldn't find any option to do so ourselves).