m.dr (70)
#1
I am working on a dataset with numeric data. I want to compute mean, median, min, max for each row.

So I write something like this:
computedVec <- vector(mode='numeric', length=nrow(dataSet))
for (i in 1:nrow(dataSet)) {
  computedVec[i] = max(dataSet[i,])
}
----------
Q.1:
If I use the max or min function, it works fine and returns the max or min for each row.
But mean or median returns the following error for each ROW:

10: In computedVec[i] = mean(dataSet[i, ]) :
number of items to replace is not a multiple of replacement length
I don't understand why it's giving an error on mean and median. I thought all I was doing was assigning the mean for the row to computedVec[i], which should be a similar call to max or min for the row?

The 10 in the message is for the 10th row, as I am just working with 10 ROWS for now. But there was an error message for each of the first 9 ROWS as well. I have some missing values and I tried na.rm as well.
----------
Q.2:
Is there a way to not do the loop and just call a stat function on the dataset to get the mean / median / max / min by ROW or by COLUMN into a vector or such. For example max(dataSet) returns the max of all the numbers - just wondering if there is a way to call those functions by ROW or by COLUMN without having to call them through a for loop. I understand I can write my own, just wondering if R already has something built in to be used?
----------
Q.3:
To build on Q.2: Again when I do the mean / median / max / min - I would like to see if there is a way to make the call and return me a 2 column vector with the ROW NUMBERS (or ids). The first column would be the ROW NUMBERS and the second would be the values of the stat function for that ROW. Again I understand I can write it - but just wondering if R already has a package or capabilities to do that automatically.
----------
Q.4:
To build on 3 again I need to get the TOP 10 ROW NUMBERS for each of mean / median / max / min. Again I know I can write it but I would think that there would be a package that can do this?
----------
I am just trying to see if there is a way to avoid the iterations and take advantage of built-in R functions for some of these basic computations, whether BY ROW or BY COLUMN, and return TOP N counts?

Again, I am just getting started with R. I come from a db / Java background, where TOP N and such are available, so I have been trying to find them in R but without much luck yet.

Thanks for your help.
robert.kabacoff (170)
#2
Re: some questions around statistical measures
Hi,

I will try to answer all your questions at once. You want to try to avoid using loops wherever possible. In this case, the apply() function should work for you. The format is

apply(x, margin, function)

where x is the data frame or matrix, margin is 1 for rows, 2 for columns, and function is any function you want (built in or written by you).

So, apply(x, 1, median) will get the row medians of x.
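As a rough sketch of the row/column idea (the data frame dataSet and its values are made up here purely for illustration), calls along these lines should work. Extra arguments such as na.rm are passed through to the function, and base R also has rowMeans() and colMeans() for the common case of means.

```r
# Hypothetical data with one missing value (assumed for illustration)
dataSet <- data.frame(c1 = c(1, 4, 7), c2 = c(2, NA, 8), c3 = c(3, 6, 9))

# Row-wise statistics (margin = 1); na.rm = TRUE is passed on to max/median
rowMax    <- apply(dataSet, 1, max, na.rm = TRUE)
rowMedian <- apply(dataSet, 1, median, na.rm = TRUE)

# Column-wise statistics (margin = 2)
colMin <- apply(dataSet, 2, min, na.rm = TRUE)

# Dedicated base R helpers for row and column means
rowAvg <- rowMeans(dataSet, na.rm = TRUE)
colAvg <- colMeans(dataSet, na.rm = TRUE)
```

Each result is a plain vector with one element per row (or per column), so no explicit loop is needed.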

Here is an example. Give it a try.


options(digits=3)

# create some data: 3 rows x 5 columns of normal deviates (mean 10, sd 2)
x <- matrix(rnorm(15, 10, 2), nrow=3)
x <- as.data.frame(x)
x

# write a function to create some statistics
mystats <- function(x, na.omit=TRUE){
  # optionally drop missing values first
  if (na.omit)
    x <- x[!is.na(x)]
  m <- mean(x)
  md <- median(x)
  mn <- min(x)
  mx <- max(x)
  return(c(mean=m, median=md, min=mn, max=mx))
}

# apply the function to the rows of the data frame
# (the result has one column per row of x)
results <- apply(x, 1, mystats)
results

# transpose the results so each row of x becomes a row of statistics
t(results)
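
For the row-number and TOP N parts of the question, one possible approach (a sketch only, using made-up data and the hypothetical names rowStats and top10) is to put the row numbers and the statistic in a data frame, then sort with order() and keep the first rows with head():

```r
# Hypothetical data: 12 rows, 4 numeric columns (assumed for illustration)
set.seed(42)
dataSet <- as.data.frame(matrix(rnorm(48, 10, 2), nrow = 12))

# Two-column result: the row number and the mean of that row
rowStats <- data.frame(row  = seq_len(nrow(dataSet)),
                       mean = apply(dataSet, 1, mean, na.rm = TRUE))

# Sort by the statistic, largest first, and keep the top 10 rows
top10 <- head(rowStats[order(rowStats$mean, decreasing = TRUE), ], 10)
top10
```

Swapping mean for median, min, or max in the apply() call gives the corresponding ranking.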
m.dr (70)
#3
Re: some questions around statistical measures
hi Robert -

That worked out really well. Thank you! It even returned the row numbers as column headers, which is exactly what I needed. The apply() family of functions is what I needed.
----------
I got a chance to look up the apply family of functions, including in your book, which brings me to the next question, the last in this sequence.

I would like to create a user-defined function that applies a certain formula across the columns. In sec 5.5 in your book and also other places I found user-defined functions that apply in a similar way but they are not column specific.
----------
For example if I have a dataSet with columns c1, c2, c3. I would like to return sum of columns as:

ComputeStats <- function(x, na.omit=TRUE) {
  voteTotal <- (sum(1 * x$c1, 2 * x$c2, 3 * x$c3))
  return(voteTotal)
}
voteTotals <- apply(dataSet, 1, ComputeStats)

and I expected it to behave like what you showed for mean, median etc.
But it complains on dataSet$c1 - so I tried dataSet[c1], dataSet[,c1] and even dataSet[,1]
----------
No luck.

Anyway - just thought I'd check to see if that can be done without looping. Again, I found examples of apply that work with simple custom functions, such as x = x+2, which applies x+2 across all columns individually.

Just wondering if this last aspect can be managed
----------

Again thank you for the suggestions Robert and really appreciate your book in understanding R and more importantly how to stepwise analyze data.

Btw a suggestion would be to add some material with using ggplot2.

Thanks again.
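
For anyone reading along, here is a minimal sketch of one way the weighted sum might be done without a loop. It assumes a data frame with columns named c1, c2 and c3 as in the post; the values are made up. apply() converts the data frame to a matrix and hands each row to the function as a plain named vector, so $ does not work inside the function, but indexing by name with [ ] does. Because column arithmetic in R is already vectorized, apply() may not be needed at all.

```r
# Hypothetical data with the column names from the post
dataSet <- data.frame(c1 = c(1, 0, 2), c2 = c(3, 1, 0), c3 = c(0, 2, 5))

# Option 1: inside apply(), each row is a named vector,
# so index by name with [ ] rather than $
ComputeStats <- function(x, na.rm = TRUE) {
  sum(1 * x["c1"], 2 * x["c2"], 3 * x["c3"], na.rm = na.rm)
}
voteTotals <- apply(dataSet, 1, ComputeStats)

# Option 2: operate on whole columns at once, no apply() needed
voteTotals2 <- with(dataSet, 1 * c1 + 2 * c2 + 3 * c3)
```

Both options give one total per row. Note that Option 2 returns NA for any row with a missing value, while Option 1 skips missing values when na.rm is TRUE.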