arindamb (1)
#1
Hi All,
I am new to machine learning and data mining. This is my first post, and I hope you will not find it irrelevant. Let me describe my problem.

I have a file with 180 columns and around 200,000 records, so effectively I have a matrix of 230,000 x 180. Now I want to count how often values co-occur across pairs of columns within each record.
For example, if the file contains 5 columns, then I need to find the count of (column1, column2), (column1, column3), (column1, column4), (column1, column5), (column2, column3), (column2, column4), (column2, column5), (column3, column4), (column3, column5), (column4, column5). This needs to be repeated for all 200,000 records to come up with the total counts.
I am comfortable using Python and R. I have also heard of MapReduce on AWS, which can do similar tasks. Can you please advise and recommend the quickest and best approach?

Regards,
Arindam.
peter.harrington (82)
#2
Re: Co-occurrence of words
Hi Arindam,

What you are trying to do is find frequent itemsets, such as column2 & column5. For speed, you probably want to use the FP-growth algorithm in chapter 12.
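
If only pairs of columns are needed, a direct count may be enough before reaching for FP-growth. The following is a minimal sketch of that simpler approach, not FP-growth itself. It assumes each record is a list of 0/1 values; `count_pairs` and `demo_rows` are hypothetical names made up for illustration.

```python
from collections import Counter
from itertools import combinations

def count_pairs(rows):
    """Count, for every column pair (i, j) with i < j, the rows where both are 1."""
    counts = Counter()
    for row in rows:
        # 1-based numbers of the columns set to 1 in this record (assumes 0/1 data)
        active = [i for i, v in enumerate(row, start=1) if v == 1]
        # every unordered pair of active columns co-occurs once in this record
        counts.update(combinations(active, 2))
    return counts

# Tiny made-up example with 5 columns
demo_rows = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 1, 1, 0, 0],
]
pair_counts = count_pairs(demo_rows)
print(pair_counts[(1, 5)])  # column1 and column5 are both 1 in two records
```

The work per record grows with the square of the number of 1s in it, so this is only practical when the records are fairly sparse.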

I have used a Java version that can process 1M rows in 29 seconds on my laptop. Here it is:
https://cgi.csc.liv.ac.uk/~frans/KDD/Software/FPgrowth/fpGrowth.html

You may have to do a little bit of formatting. It sounds like your data is binary, so output the column number if the value is 1, and if the value is 0, don't output anything.
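
The formatting step could be sketched roughly like this in Python. It assumes the target format is one line per record with space-separated, 1-based column numbers, which should be checked against the tool's own documentation; `to_transactions` and the sample rows are hypothetical.

```python
def to_transactions(rows):
    """Yield one line per record listing the 1-based numbers of columns equal to 1."""
    for row in rows:
        # keep the column number when the value is 1, output nothing for 0
        items = [str(i) for i, v in enumerate(row, start=1) if v == 1]
        yield " ".join(items)

# Tiny made-up example with 3 columns
sample = [
    [1, 0, 1],
    [0, 0, 0],
    [0, 1, 1],
]
lines = list(to_transactions(sample))
for line in lines:
    print(line)
```

A record with no 1s produces an empty line here. Whether to keep or drop such lines is a judgment call, since dropping them changes the record count used for support.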