creating NA?

I'm trying to calculate the mean number of unique fruits per person (my usual practice data). This works perfectly well with both these lines of code:

with(df, tapply(fruit, names, FUN = function(x) length(unique(x))))->uniques
sum(uniques)/length(unique(df$names))

aggregate(df[,"fruit"], by=list(id=names), FUN = function(x) length(unique(x)))->d1
sum(d1$x)/length(unique(df$names))

My problem is that when I use the code on my real data, it doesn't work. My real data is prescribing data, where I want the mean number of unique drugs per person. The tapply code appears to create brand-new patient ids that do not exist in the original df. It also returns thousands of NA values. There are no missing values in my id column and none in the drug_code column either.

with(dt3, tapply(drug_code, id, FUN = function(x) length(unique(x))))->uniques    

head(uniques)
                   uniques
Patient HAI0000001      NA
Patient HAI0000003      NA
Patient HAI0000008      NA
Patient HAI0000010      NA
Patient HAI0000014      NA
Patient HAI0000020      NA

table(dt3$id=="Patient HAI0000001")  ## checking whether HAI0000001 occurs in the original df; the dim of the df are 228954 rows and 5 cols

FALSE 
228954

For the aggregate code I get an error:

aggregate(dt3[,"drug_code"], by=list(id=id), FUN = function(x) length(unique(x)))->d1

Error in aggregate.data.frame(as.data.frame(x), ...) : 
  arguments must have same length

I don't understand what's happening. My real data is similar to my practice data in that it has an id column and a drug/fruit column. There are no missing data in either df. I know lapply is better for data frames, but I don't necessarily need a df back, and in any case the tapply code works on the practice data, which is a df. Does anyone have any idea what is happening here?

Practice DF:

 names<-as.character(c("john", "john", "john", "john", "john", "mary", "mary","mary","mary","mary", "jim", "sylvia","ted","ted","mary", "sylvia", "jim", "ted", "john", "ted"))
dates<-as.Date(c("2010-07-01",  "2010-09-01", "2010-11-01", "2010-12-01", "2011-01-01", "2010-08-12",  "2010-11-11", "2010-05-12",  "2010-12-03", "2010-07-12",  "2010-12-21", "2010-02-18",  "2010-10-29", "2010-08-13",  "2010-11-11", "2010-05-12",  "2010-04-01", "2010-05-06",  "2010-09-28", "2010-11-28" ))
fruit<-as.character(c("kiwi","apple","banana","orange","apple","orange","apple","orange", "apple", "apple", "pineapple", "peach", "nectarine", "grape", "melon", "apricot", "plum", "lychee", "watermelon", "apple" ))
df<-data.frame(names,dates,fruit) 

Example of real data:

head(dt3)
                  id quantity date_of_claim drug_code      index
1 Patient HAI0000560        1    2009-10-15   R03AC02 2010-04-06
2 Patient HAI0000560        1    2009-10-15   R03AK06 2010-04-06
3 Patient HAI0000560       30    2009-10-15   R03BB04 2010-04-06
4 Patient HAI0000560       30    2009-10-15   A02BC01 2010-04-06
5 Patient HAI0000560       50    2009-10-15   M02AA15 2010-04-06
6 Patient HAI0000560       30    2009-10-15   N02BE51 2010-04-06

In your case you are asking for a single number: the mean of the individual counts of unique values (length(unique(fruit))) within each patient id. This shows you first the individual unique counts and then the result of the mean function:

> with(df,  tapply(fruit, names, function(x) length(unique(x)) ))
   jim   john   mary sylvia    ted 
     2      5      3      2      4 
> mean ( with(df,  tapply(fruit, names, function(x) length(unique(x)) )) )
[1] 3.2

I would add that the test for a particular value in your code above had a trailing space, which might have caused problems: "string " will not equal "string". I have put a copy of the trim function from the gdata package in my .Rprofile file to make it easier for me to handle this possibility.
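As a minimal sketch of the trailing-space point: base R's trimws() (available since R 3.2.0) does the same job for character vectors without needing gdata. The ids vector below is made up for illustration and is not taken from the real data.

```r
# Hypothetical ids: the second one carries a trailing space
ids <- c("Patient HAI0000001", "Patient HAI0000001 ")

# An exact comparison treats the padded value as a different string
raw_match <- ids == "Patient HAI0000001"

# Strip leading and trailing whitespace before comparing
clean_ids <- trimws(ids)
clean_match <- clean_ids == "Patient HAI0000001"

print(raw_match)
print(clean_match)
```

If the real id column is a factor, it would need as.character() first, since trimws() returns a character vector.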


I might be missing something, but wouldn't a simple tapply work here? The line below calculates the number of different fruits per person

x=tapply(df$fruit,df$names,function(x){length(unique(x))})

And then mean(x) would give you the average across people?
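One possibility the thread does not confirm: if id in the real data is a factor that still carries levels with no rows (for example, after subsetting a larger table), tapply returns one cell per level and fills the empty ones with NA. That would look like new patient ids with NA counts. The sketch below uses a made-up toy data frame to show the effect and how droplevels() removes it; whether this applies to dt3 is an assumption.

```r
# Hypothetical data: id is a factor that still lists a level ("C") with no rows
toy <- data.frame(
  id = factor(c("A", "A", "B"), levels = c("A", "B", "C")),
  drug_code = c("R03AC02", "R03AK06", "R03AC02"),
  stringsAsFactors = FALSE
)

n_unique <- function(x) length(unique(x))

# The empty level "C" comes back as NA
with_unused <- with(toy, tapply(drug_code, id, n_unique))

# droplevels() removes levels that no longer occur in the data
toy$id <- droplevels(toy$id)
without_unused <- with(toy, tapply(drug_code, id, n_unique))

print(with_unused)
print(without_unused)
print(mean(without_unused))
```

Checking levels(dt3$id) against unique(as.character(dt3$id)) would show whether the real data has such unused levels.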
