How to delete a row by reference in data.table?

2018-06-24 13:22:43

My question is related to assignment by reference versus copying in data.table . I want to know if one can delete rows by reference, similar to

DT[ , someCol := NULL]

I want to know about

DT[someRow := NULL, ]

I guess there's a good reason for why this function doesn't exist, so maybe you could just point out a good alternative to the usual copying approach, as below. In particular, going with my favourite from example(data.table),

DT = data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
#      x y v
# [1,] a 1 1
# [2,] a 3 2
# [3,] a 6 3
# [4,] b 1 4
# [5,] b 3 5
# [6,] b 6 6
# [7,] c 1 7
# [8,] c 3 8
# [9,] c 6 9

Say I want to delete the first row from this data.table. I know I can do this:

DT <- DT[-1, ]

but often we may want to avoid that, because we are copying the object (and that requires about 3*N memory, if N object.size(DT) , as pointed out here. Now I found set(DT, i, j, value) . I know how to set specific values (like here: set all values in rows 1 and 2 and columns 2 and 3 to zero)

set(DT, 1:2, 2:3, 0) 
DT
#      x y v
# [1,] a 0 0
# [2,] a 0 0
# [3,] a 6 3
# [4,] b 1 4
# [5,] b 3 5
# [6,] b 6 6
# [7,] c 1 7
# [8,] c 3 8
# [9,] c 6 9

But how can I erase the first two rows, say? Doing

set(DT, 1:2, 1:3, NULL)

sets the entire DT to NULL.

My SQL knowledge is very limited, so you guys tell me: given data.table uses SQL technology, is there an equivalent to the SQL command

DELETE FROM table_name
WHERE some_column=some_value

in data.table?

Good question. data.table can't delete rows by reference yet.

data.table can add and delete columns by reference since it over-allocates the vector of column pointers, as you know. The plan is to do something similar for rows and allow fast insert and delete . A row delete would use memmove in C to budge up the items (in each and every column) after the deleted rows. Deleting a row in the middle of the table would still be quite inefficient compared to a row store database such as SQL, which is more suited for fast insert and delete of rows wherever those rows are in the table. But still, it would be a lot faster than copying a new large object without the deleted rows.

On the other hand, since column vectors would be over-allocated, rows could be inserted (and deleted) at the end, instantly; eg, a growing time series.

the approach that i have taken in order to make memory use be similar to in-place deletion is to subset a column at a time and delete. not as fast as a proper C memmove solution, but memory use is all i care about here. something like this:

DT = data.table(col1 = 1:1e6)
cols = paste0('col', 2:100)
for (col in cols){ DT[, (col) := 1:1e6] }
keep.idxs = sample(1e6, 9e5, FALSE) # keep 90% of entries
DT.subset = data.table(col1 = DT[['col1']][keep.idxs]) # this is the subsetted table
for (col in cols){
  DT.subset[, (col) := DT[[col]][keep.idxs]]
  DT[, (col) := NULL] #delete
}

Here is a working function based on @vc273's answer and @Frank's feedback.

delete <- function(DT, del.idxs) {           # pls note 'del.idxs' vs. 'keep.idxs'
  keep.idxs <- setdiff(DT[, .I], del.idxs);  # select row indexes to keep
  cols = names(DT);
  DT.subset <- data.table(DT[[1]][keep.idxs]); # this is the subsetted table
  setnames(DT.subset, cols[1]);
  for (col in cols[2:length(cols)]) {
    DT.subset[, (col) := DT[[col]][keep.idxs]];
    DT[, (col) := NULL];  # delete
  }
   return(DT.subset);
}

And example of its usage:

dat <- delete(dat,del.idxs)   ## Pls note 'del.idxs' instead of 'keep.idxs'

Where "dat" is a data.table. Removing 14k rows from 1.4M rows takes 0.25 sec on my laptop.

> dim(dat)
[1] 1419393      25
> system.time(dat <- delete(dat,del.idxs))
   user  system elapsed 
   0.23    0.02    0.25 
> dim(dat)
[1] 1404715      25
>

PS. Since I am new to SO, I could not add comment to @vc273's thread :-(

链接地址: http://www.djcxy.com/p/68790.html

上一篇: 在一个函数调用中将多个列添加到R data.table？

下一篇: 如何通过data.table中的引用删除行？