476469
First of all, I love the book so far. I have been reading everything I can get my hands on about machine learning, and this is by far one of the better books on applying ML algorithms and on how they operate in the real world.

Chapter 5 has no code examples, either in the book or on GitHub, so I decided to try it myself. I pre-processed the data from Kaggle, converted the categorical columns to numerical columns, imputed the missing values, trained the model using the random forest algorithm, and evaluated it using 10-fold cross-validation. The best score I can get is 0.579, not the 0.81 that the book has. I looked at the data again, thinking the problem might be in the pre-processing, but that looks fine. Below is my eval code, which is based on examples from the book. If it looks OK, then my problem must be with the data itself.

Sorry for the code review question here, but the examples were pretty straightforward, so replicating your results felt like a no-brainer.

Any help would be greatly appreciated.

# Evaluate the model using K-fold cross-validation, scoring each fold by ROC AUC
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_model(K, features, target):
    # train() and predict() are my helpers, defined elsewhere and based on the book's examples
    N = features.shape[0]

    AUC_all = []

    preds_kfold = np.empty(N)
    # Randomly assign each row to one of the K folds
    folds = np.random.randint(0, K, size=N)

    for idx in np.arange(K):
        features_train = features[folds != idx, :]
        target_train = target[folds != idx]
        features_test = features[folds == idx, :]

        # Build and predict for CV fold
        model = train(features_train, target_train)
        preds_kfold[folds == idx] = predict(model, features_test)

        # Score only the held-out fold; the rest of preds_kfold is not filled in yet
        AUC_all.append(roc_auc_score(target[folds == idx], preds_kfold[folds == idx]))

    # np.max gives the best AUC itself; np.argmax would give only the fold index
    maxAUC = np.max(AUC_all)

    print("Maximum = %.3f" % maxAUC)

    return maxAUC
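
For comparison, the steps described in this post (encode the categorical columns, impute missing values, train a random forest, evaluate with 10-fold cross-validation) can be sketched end to end on synthetic data. This is only a minimal sketch under stated assumptions: the data is randomly generated rather than the Kaggle set, the `train` and `predict` helpers are hypothetical stand-ins and not the book's code, and `predict` is assumed to return class probabilities from scikit-learn's `predict_proba` because ROC AUC ranks scores rather than hard 0/1 labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import OrdinalEncoder

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: one categorical and two numeric columns
N = 600
category = rng.choice(["a", "b", "c"], size=N)
numeric = rng.normal(size=(N, 2))
# The target depends on both column types, plus noise
signal = numeric[:, 0] + (category == "a") * 1.5
target = (signal + rng.normal(scale=0.5, size=N) > 0.5).astype(int)
# Knock out about 10% of the numeric values to mimic missing data
numeric[rng.random(numeric.shape) < 0.1] = np.nan

# Convert the categorical column to integer codes
codes = OrdinalEncoder().fit_transform(category.reshape(-1, 1))
raw_features = np.hstack([codes, numeric])


def train(features, target):
    # Hypothetical helper: fit a random forest on the training fold
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    return model.fit(features, target)


def predict(model, features):
    # Hypothetical helper: probability of the positive class, not a hard label
    return model.predict_proba(features)[:, 1]


K = 10
folds = rng.integers(0, K, size=N)
preds_kfold = np.empty(N)

for idx in range(K):
    train_mask = folds != idx
    test_mask = folds == idx
    # Fit the imputer on the training fold only, then apply it to both parts
    imputer = SimpleImputer(strategy="mean").fit(raw_features[train_mask])
    model = train(imputer.transform(raw_features[train_mask]), target[train_mask])
    preds_kfold[test_mask] = predict(model, imputer.transform(raw_features[test_mask]))

# Every row now has an out-of-fold prediction, so score them all at once
pooled_auc = roc_auc_score(target, preds_kfold)
print("Pooled cross-validated AUC = %.3f" % pooled_auc)
```

Two design choices here are worth noting. The AUC is computed once, after the loop, when every entry of `preds_kfold` has been filled in; scoring the full array inside the loop would include uninitialized values from `np.empty`. The imputer is also fitted inside each fold so that the held-out rows do not influence the fill values.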