Limiting size of hierarchical data for reproducible example

2018-05-30 10:43:53

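I want to come up with a reproducible example (RE) for this question: Errors related to data frame columns during merging. To qualify as having an RE, the question lacks only reproducible data. However, when I tried the fairly standard dput(head(myDataObj)), the generated output was a 14MB file. The problem is that my data object is a list of data frames, so the head() limit does not appear to work recursively.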

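I haven't found any options for the dput() and head() functions that would let me control data size recursively for complex objects. Unless I am wrong about the above, what other approaches to creating a minimal RE data set would you recommend in this situation?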

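Along the lines of @MrFlick's comment about using lapply, you can use any of the apply family of functions to run head or sample, depending on your needs, to reduce the size both for REs and for testing purposes (I have found that working with subsets or subsamples of large data sets is preferable for debugging and even for charting).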
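As a minimal sketch of the `lapply` idea, assuming the data object is a flat list of data frames (the list `my_data_obj` below is a hypothetical stand-in built from built-in data sets, not the asker's data):

```r
# Hypothetical stand-in for a list of data frames
my_data_obj <- list(cars = mtcars, flowers = iris)

# head() on the list itself only limits the number of list elements,
# so every data frame keeps all of its rows
nrow(head(my_data_obj)$cars)

# Applying head() to each element truncates every data frame
small_obj <- lapply(my_data_obj, head, n = 3)
sapply(small_obj, nrow)

# The truncated object is now small enough to paste into a question
dput(small_obj)
```

The same pattern works with `tail` or a sampling function in place of `head`.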

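It should be noted that head and tail give you the first or last part of a structure, but sometimes those parts do not have enough variance for RE purposes, and they are certainly not random. This is where sample can be more useful.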
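To illustrate the variance point, here is a small sketch with a hypothetical data frame `df` that is sorted by group, so its first rows all belong to one group:

```r
set.seed(1)

# Hypothetical data: 30 rows sorted by group
df <- data.frame(group = rep(c("a", "b", "c"), each = 10), value = 1:30)

# head() only ever sees the first group
first_rows <- head(df, 6)
unique(first_rows$group)

# Sampling row indices draws from the whole data frame instead
random_rows <- df[sample(nrow(df), 6), ]
unique(random_rows$group)
```

Sampling row indices rather than the data frame itself keeps whole rows together.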

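Suppose we have a hierarchical tree structure (a list of lists ...) and we want to subset each "leaf" while preserving the structure and labels in the tree.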

x <- list( 
    a=1:10, 
    b=list( ba=1:10, bb=1:10 ), 
    c=list( ca=list( caa=1:10, cab=letters[1:10], cac="hello" ), cb=toupper( letters[1:10] ) ) )

Note: in the following, I actually can't tell the difference between using how="replace" and how="list".
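One difference does show up once the `classes` argument restricts which leaves are processed: according to the `rapply` documentation, `how="replace"` leaves non-matching leaves unchanged, while `how="list"` replaces them with `deflt` (`NULL` by default). A small sketch, using a hypothetical list `y` rather than the `x` above:

```r
# Hypothetical two-leaf list: one integer leaf, one character leaf
y <- list(num = 1:10, chr = letters[1:10])

# Only integer leaves are passed to head(); the character leaf is kept as-is
replaced <- rapply(y, head, classes = "integer", how = "replace", n = 2)

# Same call with how = "list": the character leaf becomes deflt (NULL)
listed <- rapply(y, head, classes = "integer", how = "list", n = 2)

str(replaced)
str(listed)
```

With the default `classes="ANY"` every leaf matches, which would explain why the two modes look alike in the examples here.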

Also note: this will not work well for data.frame leaf nodes.
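The likely reason is that a data frame is itself a list, so `rapply` descends into it and treats each column as a leaf. One possible workaround, sketched here with a hypothetical helper `shrink_tree` (not part of base R), is to recurse manually and treat data frames as leaves:

```r
# Hypothetical helper: truncate every leaf, treating data frames as leaves
shrink_tree <- function(obj, n = 2) {
  if (is.data.frame(obj)) {
    head(obj, n)                      # keep the first n rows
  } else if (is.list(obj)) {
    lapply(obj, shrink_tree, n = n)   # recurse; lapply keeps the names
  } else {
    head(obj, n)                      # plain vector leaf
  }
}

# Hypothetical nested structure with a data frame leaf
nested <- list(
  vec = 1:10,
  sub = list(df = mtcars, chr = letters[1:10])
)
small <- shrink_tree(nested, n = 2)
str(small, max.level = 2)
```

The `is.data.frame` check has to come before the `is.list` check, because a data frame satisfies both.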

# Set seed so the example is reproducible with randomized methods:
set.seed(1)

You can use the default head in a recursive apply like this:

rapply( x, head, how="replace" )

Or pass an anonymous function that modifies the behavior:

# Complete anonymous function
rapply( x, function(y){ head(y,2) }, how="replace" )
# Same behavior, but using the rapply "..." argument to pass the n=2 to head.
rapply( x, head, how="replace", n=2 )

Or take a randomly ordered sample of each leaf:

# This works because we use minimum in case leaves are shorter
# than the requested maximum length.
rapply( x, function(y){ sample(y, size=min(length(y),2) ) }, how="replace" )

# Less efficient, but maybe easier to read
# (note: head() keeps up to 6 elements by default, not 2):
rapply( x, function(y){ head(sample(y)) }, how="replace" )

# XXX: The following does NOT work, because `sample` with a
# `size` greater than the length of the item being sampled
# fails (when sampling without replacement).
rapply( x, function(y){ sample(y, size=2) }, how="replace" )
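A related pitfall, documented in `?sample`: when `x` is a single number of at least 1, `sample(x)` samples from `1:x` rather than from `x` itself, so a numeric leaf of length one may be silently replaced by a different value. A sketch of a more defensive variant, using a hypothetical helper `sample_leaf` that samples indices with `sample.int`:

```r
set.seed(1)

# Hypothetical helper: sample at most n elements of a leaf by index,
# so a length-one numeric leaf is returned unchanged
sample_leaf <- function(y, n = 2) {
  y[sample.int(length(y), size = min(length(y), n))]
}

# Hypothetical tree with a length-one numeric leaf
tree <- list(single = 5, short = "hello", long = 1:10)
sampled <- rapply(tree, sample_leaf, how = "replace")
str(sampled)
```

Sampling positions instead of values also avoids the error above, since the requested size never exceeds the leaf length.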
