Fast, efficient way to loop over millions of rows and match columns

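I am working with eye-tracking data, so I have a huge dataset (think millions of rows) and would like a fast way to complete this task. Here is a simplified version of it.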

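The data tells you where the eye is looking at each time point, and which file is being looked at. X1, Y1 are the coordinates of the point being looked at. There are multiple time points per file (representing the eye looking at different locations in the file over time).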

Filename    Time    X1    Y1
   1         1      10    10
   1         2      12    10

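I also have a file that shows where the items are located for each filename. Each file contains (in this simplified case) two objects. X1, Y1 are the bottom-left coordinates and X2, Y2 are the top-right. You can think of this as the bounding box of each item in each file. For example: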

Filename    Item    X1   Y1   X2   Y2
  1          Dog    11   10   20   20

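What I would like to do is add another column to the first data frame that tells me which object the person is looking at, for each file at each time. If they are not looking at any object, I would like the column to say "none". Points on the boundary count as being looked at. For example: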

Filename    Time    X1    Y1   LookingAt
   1         1      10    10    none
   1         2      12    11    Dog

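I know how to do this with a for loop, but it takes forever (and crashes my RStudio). I was wondering whether there is a faster, more efficient way that I am missing.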

Here is the dput for the first data frame (it contains more rows than the example I showed above):

structure(list(Filename = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 
3L, 3L, 3L), .Label = c("1", "2", "3"), class = "factor"), Time = structure(c(1L, 
2L, 3L, 1L, 2L, 1L, 2L, 4L, 5L), .Label = c("1", "2", "3", "5", 
"6"), class = "factor"), X1 = structure(c(1L, 4L, 3L, 2L, 1L, 
4L, 6L, 5L, 1L), .Label = c("10", "11", "12", "15", "20", "25"
), class = "factor"), Y1 = structure(c(1L, 5L, 6L, 4L, 1L, 2L, 
3L, 4L, 1L), .Label = c("10", "11", "12", "15", "20", "25"), class = "factor")), .Names = c("Filename", 
"Time", "X1", "Y1"), row.names = c(NA, -9L), class = "data.frame")

Here is the dput for the second:

structure(list(Filename = structure(c(1L, 1L, 2L, 2L), .Label = c("1", 
"3"), class = "factor"), Item = structure(1:4, .Label = c("Cat", 
"Dog", "House", "Mouse"), class = "factor"), X1 = structure(c(2L, 
4L, 3L, 1L), .Label = c("10", "11", "20", "35"), class = "factor"), 
Y1 = structure(c(2L, 4L, 3L, 1L), .Label = c("10", "11", 
"13", "35"), class = "factor"), X2 = structure(c(1L, 3L, 
4L, 2L), .Label = c("10", "11", "20", "35"), class = "factor"), 
Y2 = structure(c(1L, 3L, 4L, 2L), .Label = c("10", "11", 
"13", "35"), class = "factor")), .Names = c("Filename", "Item", 
"X1", "Y1", "X2", "Y2"), row.names = c(NA, -4L), class = "data.frame")

Using data.table and the example data you provided, I would approach it as follows:

library(data.table)

# getting the data in the right format:
# the columns are factors, so convert via character to get the real numbers
datcols <- c("X","Y")
lucols <- c("X1","X2","Y1","Y2")
setDT(dat)[, (datcols) := lapply(.SD, function(x) as.numeric(as.character(x))), .SDcols = datcols
           ][, Filename := as.character(Filename)]
setDT(lu)[, (lucols) := lapply(.SD, function(x) as.numeric(as.character(x))), .SDcols = lucols
          ][, `:=` (Filename = as.character(Filename),
                    X1 = pmin(X1,X2), X2 = pmax(X1,X2),   # make sure that 'X1' is always the lowest value
                    Y1 = pmin(Y1,Y2), Y2 = pmax(Y1,Y2))]  # make sure that 'Y1' is always the lowest value

# matching the 'Items' to the correct rows
dat[, looked_at := lu$Item[Filename==lu$Filename &
                      between(X, lu$X1, lu$X2) &
                      between(Y, lu$Y1, lu$Y2)],
    by = .(Filename,Time)]

This gives:

> dat
   Filename Time  X  Y looked_at
1:        1    1 10 10       Cat
2:        1    2 15 20        NA
3:        1    3 12 25        NA
4:        2    1 11 15        NA
5:        2    2 10 10        NA
6:        3    1 15 11        NA
7:        3    2 25 12        NA
8:        3    5 20 15     House
9:        3    6 10 10     Mouse
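
With millions of rows, evaluating the lookup once per group may still be slow, and unmatched rows end up as NA rather than "none". One alternative that may be worth trying is a non-equi update join, which recent versions of data.table support. The following is only a sketch under stated assumptions: the two tables are hand-typed copies of the example data after conversion to numeric and after ordering the box corners, and the "none" default is added for illustration.

```r
library(data.table)

# Hand-typed copy of the example data after cleaning (an assumption for
# illustration): numeric coordinates, character Filename, X1 <= X2, Y1 <= Y2.
dat <- data.table(Filename = c("1","1","1","2","2","3","3","3","3"),
                  Time     = c(1, 2, 3, 1, 2, 1, 2, 5, 6),
                  X        = c(10, 15, 12, 11, 10, 15, 25, 20, 10),
                  Y        = c(10, 20, 25, 15, 10, 11, 12, 15, 10))
lu  <- data.table(Filename = c("1", "1", "3", "3"),
                  Item     = c("Cat", "Dog", "House", "Mouse"),
                  X1 = c(10, 20, 20, 10), X2 = c(11, 35, 35, 11),
                  Y1 = c(10, 13, 13, 10), Y2 = c(11, 35, 35, 11))

# Start from the default label, then overwrite the rows of 'dat' whose point
# lies inside (or on the boundary of) a box from 'lu' for the same Filename.
dat[, looked_at := "none"]
dat[lu, on = .(Filename, X >= X1, X <= X2, Y >= Y1, Y <= Y2),
    looked_at := i.Item]

print(dat)
```

In the `on` clause the left-hand side of each condition refers to a column of `dat` and the right-hand side to a column of `lu`. If bounding boxes within a file overlap, a point can match more than one item and only one of them is kept, so that case would need separate handling.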

Data used (the gaze coordinate columns are named X and Y here):

dat <- structure(list(Filename = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("1", "2", "3"), class = "factor"), 
                     Time = structure(c(1L, 2L, 3L, 1L, 2L, 1L, 2L, 4L, 5L), .Label = c("1", "2", "3", "5", "6"), class = "factor"), 
                     X = structure(c(1L, 4L, 3L, 2L, 1L, 4L, 6L, 5L, 1L), .Label = c("10", "11", "12", "15", "20", "25"), class = "factor"), 
                     Y = structure(c(1L, 5L, 6L, 4L, 1L, 2L, 3L, 4L, 1L), .Label = c("10", "11", "12", "15", "20", "25"), class = "factor")), 
                .Names = c("Filename", "Time", "X", "Y"), row.names = c(NA, -9L), class = "data.frame")
lu <- structure(list(Filename = structure(c(1L, 1L, 2L, 2L), .Label = c("1", "3"), class = "factor"), 
                     Item = structure(1:4, .Label = c("Cat", "Dog", "House", "Mouse"), class = "factor"), 
                     X1 = structure(c(2L, 4L, 3L, 1L), .Label = c("10", "11", "20", "35"), class = "factor"), 
                     X2 = structure(c(1L, 3L, 4L, 2L), .Label = c("10", "11", "20", "35"), class = "factor"), 
                     Y1 = structure(c(2L, 4L, 3L, 1L), .Label = c("10", "11", "13", "35"), class = "factor"), 
                     Y2 = structure(c(1L, 3L, 4L, 2L), .Label = c("10", "11", "13", "35"), class = "factor")), 
                .Names = c("Filename", "Item", "X1", "X2", "Y1", "Y2"), row.names = c(NA, -4L), class = "data.frame")
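
Every column in the dput output above is stored as a factor, which is why the preparation step converts through as.character before as.numeric. A small illustration of the difference, using a made-up factor `f` that is not part of the data above:

```r
# 'f' is a hypothetical factor used only to show the conversion pitfall
f <- factor(c("10", "15", "12"))

codes  <- as.numeric(f)                # internal level codes, not the values
values <- as.numeric(as.character(f))  # the numbers the labels represent

print(codes)
print(values)
```

Calling as.numeric directly on a factor returns the position of each value among the sorted levels, so coordinate comparisons made on those codes would be silently wrong.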