Safe version of subset()

As the subset() manual states:

Warning: This is a convenience function intended for use interactively

I learned from this great article not only the secret behind this warning, but also a good understanding of substitute(), match.call(), eval(), quote(), call, promise and other related R subjects, which are a little complicated.

Now I understand what the warning above is for. A super-simple implementation of subset() could be as follows:

subset = function(x, condition) x[eval(substitute(condition), envir=x),]

While subset(mtcars, cyl==4) returns the rows of mtcars that satisfy cyl==4, wrapping subset() in another function fails:

sub = function(x, condition) subset(x, condition)

sub(mtcars, cyl == 4)
# Error in eval(expr, envir, enclos) : object 'cyl' not found

The original subset() produces exactly the same error. This is a limitation of the substitute()/eval() pair: it works when condition is literally cyl==4, but when the call is routed through the wrapper sub(), substitute() inside subset() no longer sees cyl==4. It sees only the symbol condition that appears in the body of sub(). Evaluating that symbol forces the promise chain back to cyl==4 in the environment where sub() was called, where no cyl exists, so eval() fails.

Is there any other implementation of subset(), with exactly the same arguments, that would be programming-safe, i.e. able to evaluate its condition when it is called by another function?
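
To make the failure concrete, here is a minimal sketch of what substitute() captures in each case. The function names show_captured() and relay() are hypothetical and exist only for this illustration.

```r
# Hypothetical helper: return the unevaluated expression supplied as 'condition'.
show_captured <- function(condition) substitute(condition)

# Hypothetical wrapper that simply forwards its argument, like sub() above.
relay <- function(condition) show_captured(condition)

direct <- show_captured(cyl == 4)  # captures the call cyl == 4
nested <- relay(cyl == 4)          # captures only the symbol 'condition'

print(direct)
print(nested)
```

In the nested case the user's expression is hidden one promise further up the call stack, which is why evaluating the captured symbol inside the data frame does not find cyl.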


Just because it's such mind-bending fun (??), here is a slightly different solution that addresses a problem Hadley pointed to in comments to my accepted solution.

Hadley posted a gist demonstrating a situation in which my accepted function goes awry. The twist in that example (copied below) is that a symbol passed to SUBSET() is defined in the body (rather than the arguments) of one of the calling functions; it thus gets captured by substitute() instead of the intended global variable. Confusing stuff, I know.

f <- function() {
  cyl <- 4
  g()
}

g <- function() {
  SUBSET(mtcars, cyl == 4)$cyl
}
f()

Here is a better function that will only substitute the values of symbols found in calling functions' argument lists. It works in all of the situations that Hadley or I have so far proposed.

SUBSET <- function(`_dat`, expr) {
   ff <- sys.frames()
   n <- length(ff)
   ex <- substitute(expr)
   ii <- seq_len(n)
   for(i in ii) {
       ## 'which' is the frame number, and 'n' is # of frames to go back.
       margs <- as.list(match.call(definition = sys.function(n - i),
                                   call = sys.call(sys.parent(i))))[-1]
       ex <- eval(substitute(substitute(x, env = ll),
                             env = list(x = ex, ll = margs)))
   }
   `_dat`[eval(ex, envir = `_dat`),]
}

## Works in Hadley's counterexample ...
f()
# [1] 4 4 4 4 4 4 4 4 4 4 4

## ... and in my original test cases.
sub <- function(x, condition) SUBSET(x, condition)
sub2 <- function(AA, BB) sub(AA, BB)

a <- SUBSET(mtcars, cyl == 4)  ## Direct call to SUBSET()
b <- sub(mtcars, cyl == 4)     ## SUBSET() called one level down
c <- sub2(mtcars, cyl == 4)
all(identical(a, b), identical(b, c))
# [1] TRUE

IMPORTANT: Please note that this still is not (nor can it be made into) a generally useful function. There's simply no way for the function to know which symbols you want it to use in all of the substitutions it performs as it works up the call stack. There are many situations in which users would want it to use the values of symbols assigned to within function bodies, but this function will always ignore those.


The [ function is what you're looking for; see ?"[". mtcars[mtcars$cyl == 4,] is equivalent to the subset() call and is "programming" safe.

sub = function(x, condition) {
 x[condition,]
}

sub(mtcars, mtcars$cyl==4)

This does what you're asking, without the implicit with() in the function call. The specifics are complicated; however, a function like:

sub = function(x, quoted_condition) {
  x[with(x, eval(parse(text=quoted_condition))),]
}

sub(mtcars, 'cyl==4')

sorta does what you're looking for, but there are edge cases where it will have unexpected results.
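
As a possible alternative to passing the condition as a string, the caller can pass an already-quoted expression built with quote(). The sketch below assumes callers are willing to do that; subset_quoted() and relay() are hypothetical names, not functions from base R.

```r
# Hypothetical helper: 'quoted_condition' must be a language object,
# e.g. quote(cyl == 4), not a bare expression and not a string.
subset_quoted <- function(x, quoted_condition) {
  # Look up names in the columns of x first, then in the caller's environment.
  rows <- eval(quoted_condition, envir = x, enclos = parent.frame())
  # Treat NA results as FALSE so they do not produce NA rows.
  x[rows & !is.na(rows), , drop = FALSE]
}

# Because the condition is an ordinary value, it survives being forwarded.
relay <- function(x, cond) subset_quoted(x, cond)

res <- relay(mtcars, quote(cyl == 4))
nrow(res)
# [1] 11
```

The trade-off is that the caller has to write quote(), but no parsing of text or walking of the call stack is needed.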


Using data.table and the [ subset function, you can get the implicit with(...) you're looking for.

library(data.table)
MT = data.table(mtcars)

MT[cyl==4]

There are better, faster ways to do this subsetting in data.table, but this illustrates the point well.


Using data.table, you can also construct expressions to be evaluated later:

cond = expression(cyl==4)

MT[eval(cond)]

These two can now be passed through functions:

wrapper = function(DT, condition) {
  DT[eval(condition)]
}

Here's an alternative version of subset() which continues to work even when it's nested -- at least as long as the logical subsetting expression (e.g. cyl == 4) is supplied to the top-level function call.

It works by climbing up the call stack, substitute()ing at each step to ultimately capture the logical subsetting expression passed in by the user. In the call to sub2() below, for example, the for loop works up the call stack from expr to condition to BB and finally to cyl == 4.

SUBSET <- function(`_dat`, expr) {
    ff <- sys.frames()
    ex <- substitute(expr)
    ii <- rev(seq_along(ff))
    for(i in ii) {
        ex <- eval(substitute(substitute(x, env=sys.frames()[[n]]),
                              env = list(x = ex, n=i)))
    }
    `_dat`[eval(ex, envir = `_dat`),]
}

## Define test functions that nest SUBSET() more and more deeply
sub <- function(x, condition) SUBSET(x, condition)
sub2 <- function(AA, BB) sub(AA, BB)

## Show that it works, at least when the top-level function call
## contains the logical subsetting expression
a <- SUBSET(mtcars, cyl == 4)  ## Direct call to SUBSET()
b <- sub(mtcars, cyl == 4)     ## SUBSET() called one level down
c <- sub2(mtcars, cyl == 4)    ## SUBSET() called two levels down

identical(a,b)
# [1] TRUE
identical(a,c)
# [1] TRUE
a[1:5,]
#                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
# Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
# Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
# Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
# Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2

** For some explanation of the construct inside the for loop, see Section 6.2, paragraph 6 of the R Language Definition manual.
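
The construct inside the for loop can also be seen in isolation. This is only a minimal sketch: the names ex and vals are illustrative, and vals is a plain list standing in for one frame of the call stack.

```r
# Expression captured so far: just the symbol 'condition'.
ex <- quote(condition)

# Stand-in for one stack frame: what 'condition' was bound to one level up.
vals <- list(condition = quote(cyl == 4))

# The outer substitute() splices the *values* of ex and vals into the inner
# call, producing substitute(condition, env = <list>); eval() then runs it.
step <- eval(substitute(substitute(x, env = ll),
                        list(x = ex, ll = vals)))
print(step)
```

Writing substitute(ex, env = vals) directly would not work, because substitute() does not evaluate its first argument and would look up the symbol ex itself rather than the expression stored in it.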
