Errors related to data frame columns during merging

2018-06-08 04:06:35

The following piece of code is supposed to load and prepare datasets from a specified directory for further data analysis. The problem is that the code generates the following errors during attempts to merge data (one for each merging option). I'm confused about what is going on here. However, my gut feeling tells me that the errors might be due to the absence of column names in some data frames. I would appreciate a clarification. Also, please advise on the preferred merging option (between #1 and #2). Thank you!

UPDATE 2 (Reworked with minimal reproducible example, previous versions removed):

Current error (Merging Option 1 enabled):

Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column

Current error (Merging Option 2 enabled):

Error in `[.data.frame`(x, rep.int(NA_integer_, nyy), nm.x, drop = FALSE) : undefined columns selected

Required Packages: none, other than standard ones.

Source Code (includes reproducible data):

# load the datasets of transformed data

# real data
#dataSets <- loadDataSets(SRDA_DIR)

# reproducible example data
# (generated via `dput(lapply(dataSets, head))`, thanks to @MrFlick)

dataSets <- list(structure(list(`NA` = c("284", "284", "284", "284", "284", 
"284"), `NA` = c("490", "490", "490", "490", "490", "490")), .Names = c(NA_character_, 
NA_character_), SQL = structure("ClNFTEVDVCBnLmdyb3VwX2lkLCB1LnVzZXJfaWQKRlJPTSBzZjA1MTQuZ3JvdXBzIGcsIHNmMDUxNC51c2VycyB1LCBzZjA1MTQucHJvamVjdF9oaXN0b3J5IHBoLCBzZjA1MTQucHJvamVjdF90YXNrIHB0LCBzZjA1MTQucHJvamVjdF9ncm91cF9saXN0IHBnbApXSEVSRSBwaC5wcm9qZWN0X3Rhc2tfaWQgPSBwdC5wcm9qZWN0X3Rhc2tfaWQKQU5EIHB0Lmdyb3VwX3Byb2plY3RfaWQgPSBwZ2wuZ3JvdXBfcHJvamVjdF9pZApBTkQgZy5ncm91cF9pZCA9IHBnbC5ncm91cF9pZApBTkQgcGgubW9kX2J5ID0gdS51c2VyX2lk", class = "base64"), indicatorName = structure("Y29udHJpYlBlb3BsZQ==", class = "base64"), resultNames = structure("TkE=", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"), structure(list(`Project ID` = c("85684", 
"172552", "228484", "173865", "94140", "179097"), Enabled = c("1", 
"1", "1", "1", "1", "1"), `Repo URL` = c("http://svn.tr51.org/svn/variomat/trunk/", 
"http://svn.hyperic.org/?root=Hyperic+SIGAR", "http://code.google.com/p/ufolder/source/browse", 
"https://svn.canoo.com/trunk/webtestclipse/", "http://www.rasilon.net/svn/sptools/trunk/sptools", 
"http://trac.pocoo.org/repos/pygments"), `Repo Instructions` = c("Login is currently disabled.", 
"For anonymous access, simply issue the command 'svn co http://svn.hyperic.org/projects/sigar'  For developer access, send email to sourceforge user &quot;hyperic&quot;.", 
"Anonymous browsing", "https://svn.canoo.com/trunk/webtestclipse/", 
"SVN stuff to go here.  If you just want a copy of the source, run svn co http://www.rasilon.net/svn/sptools/trunk/sptools", 
"The Subversion repository is at http://trac.pocoo.org/repos/pygments."
)), .Names = c("Project ID", "Enabled", "Repo URL", "Repo Instructions"
), SQL = structure("ClNFTEVDVCBncm91cF9pZCwgZW5hYmxlZCwgdXJsX3ByaW1hcnksIGluc3RydWN0aW9uc19wdWJsaWMKRlJPTSBzZjA1MTQuZXh0ZXJuYWxfdG9vbF9saW5rcw==", class = "base64"), indicatorName = structure("ZGV2TGlua3M=", class = "base64"), resultNames = structure("UHJvamVjdCBJRCwgRW5hYmxlZCwgUmVwbyBVUkwsIFJlcG8gSW5zdHJ1Y3Rpb25z", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"), structure(list(`NA` = c("1343228", 
"230959", "1938195", "1883362", "404683", "650286"), `NA` = c("6", 
"6", "6", "6", "6", "6"), `NA` = c("http://sourceforge.net/p/aprpg/discussion", 
"http://sourceforge.net/project/memberlist.php?group_id=230959", 
"http://www.polishavenue.com", "http://sourceforge.net/p/wakemypc/tickets", 
"http://sourceforge.net/apps/trac/graphz/", "http://testando1"
)), .Names = c(NA_character_, NA_character_, NA_character_), SQL = structure("ClNFTEVDVCBncm91cF9pZCwgcHJlZmVycmVkX3N1cHBvcnRfdHlwZSwgcHJlZmVycmVkX3N1cHBvcnRfcmVzb3VyY2UKRlJPTSBzZjA1MTQuZ3JvdXBzCldIRVJFIHByZWZlcnJlZF9zdXBwb3J0X3R5cGUgPSA2", class = "base64"), indicatorName = structure("ZGV2U3VwcG9ydA==", class = "base64"), resultNames = structure("TkE=", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"), structure(list(`Project ID` = c("1692507", 
"1095949", "685064", "900864", "976917", "1949934"), `Development Team Size` = c(1, 
1, 1, 1, 1, 1)), .Names = c("Project ID", "Development Team Size"
), SQL = structure("ClNFTEVDVCBncm91cF9pZCwgQ09VTlQodXNlcl9pZCkKRlJPTSBzZjA1MTQudXNlcl9ncm91cApXSEVSRSBncmFudGN2cyA9IDEKR1JPVVAgQlkgZ3JvdXBfaWQ=", class = "base64"), indicatorName = structure("ZGV2VGVhbVNpemU=", class = "base64"), resultNames = structure("UHJvamVjdCBJRCwgRGV2ZWxvcG1lbnQgVGVhbSBTaXpl", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"), structure(list(`NA` = c("1844416", 
"1849571", "1850512", "1850521", "1854556", "1855148"), `NA` = c("0", 
"0", "0", "0", "0", "0"), `NA` = c("1", "1", "1", "1", "1", "1"
)), .Names = c(NA_character_, NA_character_, NA_character_), SQL = structure("ClNFTEVDVCBncm91cF9pZCwgdXNlX3dpa2ksIHVzZV9mb3J1bQpGUk9NIHNmMDUxNC5ncm91cHM=", class = "base64"), indicatorName = structure("ZG1Qcm9jZXNz", class = "base64"), resultNames = structure("TkE=", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"), structure(list(`Project ID` = c("2107960", 
"2068039", "2156229", "2068032", "2068046", "2081469"), `Project Age` = c(5, 
6.5, 4, 6.5, 6.5, 6)), .Names = c("Project ID", "Project Age"
), row.names = c(NA, 6L), class = "data.frame"), structure(list(
    `Project ID` = c("708994", "1586967", "581072", "738614", 
    "758081", "782990"), `Project License` = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("", "afl", "apache", "apache2", 
    "artistic", "boostlicense", "bsd", "cddl", "eclipselicense", 
    "educom", "fair", "gpl", "ibm", "ibmcpl", "iosl", "jabber", 
    "lgpl", "mit", "mpl", "mpl11", "ms-rl", "nasalicense", "ncsa", 
    "nethack", "none", "nposl3", "osl", "other", "php", "php-license", 
    "psfl", "public", "publicdomain", "python", "qpl", "sissl", 
    "sunpublic", "website", "wxwindows", "zlib", "zope"), class = "factor"), 
    `License Restrictiveness` = structure(c(NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_), .Label = c("Highly Restrictive", 
    "Permissive", "Restrictive", "Unknown"), class = "factor")), .Names = c("Project ID", 
"Project License", "License Restrictiveness"), SQL = structure("ClNFTEVDVCBncm91cF9pZCwgbGljZW5zZQpGUk9NIHNmMDUxNC5ncm91cHM=", class = "base64"), indicatorName = structure("cHJqTGljZW5zZQ==", class = "base64"), resultNames = structure("UHJvamVjdCBJRCwgUHJvamVjdCBMaWNlbnNl", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"), structure(list(`Project ID` = c("2", 
"3", "7", "11", "12", "14"), `Latest Release` = c("Snapshots", 
"7.5", "gedit 0.9.5", "r2-00", "0.9.7", "dhiggen_merge-5.0"), 
    `Project Maturity` = structure(c(NA, 3L, 1L, 3L, 1L, 3L), .Label = c("Alpha/Beta", 
    "Stable", "Mature"), class = "factor")), .Names = c("Project ID", 
"Latest Release", "Project Maturity"), SQL = structure("ClNFTEVDVCBmcC5ncm91cF9pZCwgTUFYKGZyLm5hbWUpCkZST00gc2YwNTE0LmZyc19wYWNrYWdlIGZwLCBzZjA1MTQuZnJzX3JlbGVhc2UgZnIsIHNmMDUxNC5mcnNfc3RhdHVzIGZzCldIRVJFIGZwLnBhY2thZ2VfaWQgPSBmci5wYWNrYWdlX2lkCkdST1VQIEJZIGZwLmdyb3VwX2lk", class = "base64"), indicatorName = structure("cHJqTWF0dXJpdHk=", class = "base64"), resultNames = structure("UHJvamVjdCBJRCwgTGF0ZXN0IFJlbGVhc2U=", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"), structure(list(`NA` = c("1660372", 
"1590394", "1590772", "85777", "1591062", "1591181"), `NA` = c("0", 
"0", "0", "0", "0", "0")), .Names = c(NA_character_, NA_character_
), SQL = structure("ClNFTEVDVCBncm91cF9pZCwgdXNlX3dpa2kKRlJPTSBzZjA1MTQuZ3JvdXBz", class = "base64"), indicatorName = structure("cHViUm9hZG1hcA==", class = "base64"), resultNames = structure("TkE=", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"), structure(list(ID = c("141", "66", 
"55", "45", "75", "80"), `Software Type` = c("Clustering", "Database", 
"Desktop", "Development", "Financial", "Games")), .Names = c("ID", 
"Software Type"), SQL = structure("ClNFTEVDVCB0cm92ZV9jYXRfaWQsIGRlc2NyaXB0aW9uCkZST00gc2YwNTE0LnRyb3ZlX2Zyb250cGFnZQ==", class = "base64"), indicatorName = structure("c29mdHdhcmVUeXBl", class = "base64"), resultNames = structure("SUQsIFNvZnR3YXJlIFR5cGU=", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"), structure(list(`Project ID` = c("142", 
"129", "120", "119", "107", "106"), `User Community Size` = c("153237", 
"3299", "135710", "16249", "6042", "2508")), .Names = c("Project ID", 
"User Community Size"), SQL = structure("ClNFTEVDVCBncm91cF9pZCwgZG93bmxvYWRzCkZST00gc2YwNTE0LnN0YXRzX3Byb2plY3RfYWxs", class = "base64"), indicatorName = structure("dXNlckNvbW11bml0eVNpemU=", class = "base64"), resultNames = structure("UHJvamVjdCBJRCwgVXNlciBDb21tdW5pdHkgU2l6ZQ==", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"))

# Merging Option 1

flossData <- data.frame(dataSets[[1]][1])

# merge all loaded datasets by common column ("Project ID")
silent <- lapply(seq(2, length(dataSets) - 1),
                 function(i) {merge(flossData, dataSets[[1]][i],
                                    by = "Project ID",
                                    all.y = TRUE)})

# Merging Option 2

#flossData <- Reduce(function(...) 
#  merge(..., by.x = "row.names", by.y = "Project ID", all = TRUE),
#  dataSets)

# Additional Transformations

# convert presence of Repo URL to integer
flossData[["Repo URL"]] <- as.integer(flossData[["Repo URL"]] != "")

# convert License Restrictiveness' factor levels to integers
#flossData[["License Restrictiveness"]] <- 
#  as.integer(flossData[["License Restrictiveness"]])

# convert User Community Size from character to integer
flossData[["User Community Size"]] <- 
  as.integer(flossData[["User Community Size"]])

# remove NAs
#flossData <- flossData[complete.cases(flossData[,3]),]
rowsNA <- apply(flossData, 1, function(x) {any(is.na(x))})
flossData <- flossData[!rowsNA,]

Environment:

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C           
 [4] LC_COLLATE=C         LC_MONETARY=C        LC_MESSAGES=C       
 [7] LC_PAPER=C           LC_NAME=C            LC_ADDRESS=C        
[10] LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C 

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] plspm_0.4.1   turner_0.1.7  tester_0.1.7  diagram_1.6.2 shape_1.4.1   amap_0.8-12  

loaded via a namespace (and not attached):
[1] tools_3.1.1

UPDATE 3:

An attempt to merge data frames, using reshape package ( reshape::merge_all(dataSets) ), resulted in the following error message: Error: cannot allocate vector of size 332.8 Gb . This is quite strange, considering that the total size of R objects stored in that directory and being merged is only 4.3 MB.

An attempt to merge data fames, using plyr package ( plyr::join_all(dataSets) ), resulted in the following error message: Error in ``[.data.frame``(x, by) : undefined columns selected . This seems to match the error message in the Merging Option 2.

lapply(dataSets, names)
pids = which(sapply(dataSets, FUN=function(x) { 'Project ID' %in% names(x) }))

acc = dataSets[[pids[1]]]

for (id in pids[2:length(pids)]) {
  acc = merge(acc, dataSets[[id]], by='Project ID', all=T)
}

Assumptions I had to make:

not all data sets from dataSet have Project ID column, so assumed you need to join only those which have it. So first I find them and put their indexes to pids

since there are data sets that don't have any project ids in common, I assumed you want to perform outer join - all=T flag in merge

Regarding your question about what way of joining to use, I'd say it's not a good idea to do this in R. I'd use an RDBMS whenever possible to do joins and other things before loading data to R

链接地址: http://www.djcxy.com/p/24798.html

上一篇: 如何加入（合并）数据框架（内部，外部，左侧，右侧）？

下一篇: 合并期间与数据帧列相关的错误