Fast, efficient way to loop over millions of rows and match columns

2018-07-02 02:52:13

I'm working with eye-tracking data, so I have a huge dataset (millions of rows) and need a fast way to do this task. Here's a simplified version of it.

The data tells you where the eye is looking at each time point, for each file being viewed. X1, Y1 are the coordinates of the point being looked at. There are multiple time points for each file (representing the eye looking at different locations in the file over time).

Filename    Time    X1    Y1
   1         1      10    10
   1         2      12    10

I also have a file of where items are located for each filename. Each file contains (in this simplified case) two objects. X1, Y1 are the lower-left coordinates and X2, Y2 are the upper-right coordinates. You can think of this as the bounding box where the item is located in each file. E.g.:

Filename    Item    X1   Y1   X2   Y2
  1          Dog    11   10   20   20

What I'd like to do is add another column to the first data frame that tells me which object the person is looking at at each time point for each file. If they are not looking at any of the objects, I'd like the column to say "none". Points on the border count as being looked at. E.g.:

Filename    Time    X1    Y1   LookingAt
   1         1      10    10    none
   1         2      12    10    Dog
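
To pin down the rule described above, here is a minimal base R sketch for a single gaze point. The function name `looked_at` and the one-row `boxes` data frame are hypothetical and only mirror the Dog example; it is meant to illustrate the border-inclusive test, not to be a fast solution.

```r
# Hypothetical helper: label one gaze point (x, y) against a set of boxes.
# The comparisons are inclusive, so points on the border count as hits.
looked_at <- function(x, y, boxes) {
  hit <- x >= boxes$X1 & x <= boxes$X2 &
         y >= boxes$Y1 & y <= boxes$Y2
  if (any(hit)) as.character(boxes$Item[hit][1]) else "none"
}

# One bounding box, taken from the Dog example
boxes <- data.frame(Item = "Dog", X1 = 11, Y1 = 10, X2 = 20, Y2 = 20,
                    stringsAsFactors = FALSE)

res_outside <- looked_at(10, 10, boxes)  # x = 10 is left of the box
res_border  <- looked_at(12, 10, boxes)  # y = 10 lies on the lower border
```

Calling this once per row is essentially the for-loop approach, which is what becomes slow at millions of rows.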

I know how to do this with a for loop, but it takes forever (and crashed my RStudio). Is there a faster, more efficient way that I'm missing?

Here's the dput for the first data frame (it contains more rows than the example shown above):

structure(list(Filename = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 
3L, 3L, 3L), .Label = c("1", "2", "3"), class = "factor"), Time = structure(c(1L, 
2L, 3L, 1L, 2L, 1L, 2L, 4L, 5L), .Label = c("1", "2", "3", "5", 
"6"), class = "factor"), X1 = structure(c(1L, 4L, 3L, 2L, 1L, 
4L, 6L, 5L, 1L), .Label = c("10", "11", "12", "15", "20", "25"
), class = "factor"), Y1 = structure(c(1L, 5L, 6L, 4L, 1L, 2L, 
3L, 4L, 1L), .Label = c("10", "11", "12", "15", "20", "25"), class = "factor")), .Names = c("Filename", 
"Time", "X1", "Y1"), row.names = c(NA, -9L), class = "data.frame")

And here's the dput for the second:

structure(list(Filename = structure(c(1L, 1L, 2L, 2L), .Label = c("1", 
"3"), class = "factor"), Item = structure(1:4, .Label = c("Cat", 
"Dog", "House", "Mouse"), class = "factor"), X1 = structure(c(2L, 
4L, 3L, 1L), .Label = c("10", "11", "20", "35"), class = "factor"), 
Y1 = structure(c(2L, 4L, 3L, 1L), .Label = c("10", "11", 
"13", "35"), class = "factor"), X2 = structure(c(1L, 3L, 
4L, 2L), .Label = c("10", "11", "20", "35"), class = "factor"), 
Y2 = structure(c(1L, 3L, 4L, 2L), .Label = c("10", "11", 
"13", "35"), class = "factor")), .Names = c("Filename", "Item", 
"X1", "Y1", "X2", "Y2"), row.names = c(NA, -4L), class = "data.frame")

Using data.table and the sample data you provided, I would approach it as follows:

library(data.table)

# getting the data in the right format
# ('dat' and 'lu' are defined under "Used data" below)
datcols <- c("X","Y")
lucols <- c("X1","X2","Y1","Y2")
# the columns are factors, so convert via character to get the real numbers
setDT(dat)[, (datcols) := lapply(.SD, function(x) as.numeric(as.character(x))), .SDcols = datcols
           ][, Filename := as.character(Filename)]
setDT(lu)[, (lucols) := lapply(.SD, function(x) as.numeric(as.character(x))), .SDcols = lucols
          ][, `:=` (Filename = as.character(Filename),
                    X1 = pmin(X1,X2), X2 = pmax(X1,X2),   # make sure that 'X1' is always the lowest value
                    Y1 = pmin(Y1,Y2), Y2 = pmax(Y1,Y2))]  # make sure that 'Y1' is always the lowest value

# matching the 'Items' to the correct rows
dat[, looked_at := lu$Item[Filename==lu$Filename &
                      between(X, lu$X1, lu$X2) &
                      between(Y, lu$Y1, lu$Y2)],
    by = .(Filename,Time)]

which gives:

> dat
   Filename Time  X  Y looked_at
1:        1    1 10 10       Cat
2:        1    2 15 20        NA
3:        1    3 12 25        NA
4:        2    1 11 15        NA
5:        2    2 10 10        NA
6:        3    1 15 11        NA
7:        3    2 25 12        NA
8:        3    5 20 15     House
9:        3    6 10 10     Mouse
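
The output above reports NA for unmatched rows, while the question asks for "none". As a possible alternative to grouping by every row, the same lookup can be sketched as a data.table non-equi update join. The tables `gaze` and `boxes` below are hypothetical miniature data (already numeric, with the lower-left corner in X1, Y1), not the full sample data, and I have not benchmarked this on millions of rows.

```r
library(data.table)

# Hypothetical miniature tables; columns are already numeric/character
gaze <- data.table(
  Filename = c("1", "1", "3", "3"),
  Time     = c(1, 2, 5, 6),
  X        = c(10, 15, 20, 10),
  Y        = c(10, 20, 15, 10)
)
boxes <- data.table(
  Filename = c("1", "3", "3"),
  Item     = c("Cat", "House", "Mouse"),
  X1 = c(10, 20, 10), Y1 = c(10, 13, 10),
  X2 = c(11, 35, 11), Y2 = c(11, 35, 11)
)

# Start with the default label, then overwrite the rows that fall inside a box.
# The inequalities are inclusive, so border points count as hits.
gaze[, LookingAt := "none"]
gaze[boxes,
     on = .(Filename, X >= X1, X <= X2, Y >= Y1, Y <= Y2),
     LookingAt := i.Item]
```

If boxes can overlap, a point may match more than one item, and this sketch keeps only one of them.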

Used data:

dat <- structure(list(Filename = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("1", "2", "3"), class = "factor"), 
                     Time = structure(c(1L, 2L, 3L, 1L, 2L, 1L, 2L, 4L, 5L), .Label = c("1", "2", "3", "5", "6"), class = "factor"), 
                     X = structure(c(1L, 4L, 3L, 2L, 1L, 4L, 6L, 5L, 1L), .Label = c("10", "11", "12", "15", "20", "25"), class = "factor"), 
                     Y = structure(c(1L, 5L, 6L, 4L, 1L, 2L, 3L, 4L, 1L), .Label = c("10", "11", "12", "15", "20", "25"), class = "factor")), 
                .Names = c("Filename", "Time", "X", "Y"), row.names = c(NA, -9L), class = "data.frame")
lu <- structure(list(Filename = structure(c(1L, 1L, 2L, 2L), .Label = c("1", "3"), class = "factor"), 
                     Item = structure(1:4, .Label = c("Cat", "Dog", "House", "Mouse"), class = "factor"), 
                     X1 = structure(c(2L, 4L, 3L, 1L), .Label = c("10", "11", "20", "35"), class = "factor"), 
                     X2 = structure(c(1L, 3L, 4L, 2L), .Label = c("10", "11", "20", "35"), class = "factor"), 
                     Y1 = structure(c(2L, 4L, 3L, 1L), .Label = c("10", "11", "13", "35"), class = "factor"), 
                     Y2 = structure(c(1L, 3L, 4L, 2L), .Label = c("10", "11", "13", "35"), class = "factor")), 
                .Names = c("Filename", "Item", "X1", "X2", "Y1", "Y2"), row.names = c(NA, -4L), class = "data.frame")
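
A note on why the conversion step goes through `as.character()`: every column in the sample data is a factor, and calling `as.numeric()` directly on a factor returns its internal integer codes rather than the numbers shown in the labels. A small illustration with a made-up factor:

```r
# Made-up factor whose labels look like numbers
f <- factor(c("10", "15", "12"))

codes  <- as.numeric(f)                # internal level codes, not the values
values <- as.numeric(as.character(f))  # the numbers shown in the labels
```

Here `codes` holds the positions of each label among the sorted levels, while `values` holds 10, 15 and 12.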
