Options for caching / memoization / hashing in R

I am trying to find a simple way to use something like Perl's hashes in R (essentially caching), since I intended both to do Perl-style hashing and to write my own memoisation of calculations. However, others have beaten me to the punch and already have packages for memoisation. The more I dig, the more I find, e.g. memoise and R.cache, but the differences between them aren't readily clear. It's also not clear how else one can get Perl-style hashes (or Python-style dictionaries) and write one's own memoization, other than by using the hash package, which doesn't seem to underpin the two memoization packages.

Since I can find no information on CRAN or elsewhere to distinguish between the options, perhaps this should be a community wiki question on SO: What are the options for memoization and caching in R, and what are their differences?


As a basis for comparison, here is a list of the options I've found. It also seems to me that all of them depend on hashing, so I'll note the hashing options as well. Key/value storage is somewhat related, but opens a huge can of worms regarding DB systems (e.g. BerkeleyDB, Redis, MemcacheDB and scores of others).

It looks like the options are:

Hashing

  • digest - provides hashing for arbitrary R objects.
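
As a rough illustration of what digest is useful for here, the sketch below turns a set of function arguments into a fixed-length key. It assumes the digest package is installed; the variable names are made up for the example.

```r
library(digest)

# Hypothetical inputs to an expensive calculation
args1 <- list(mean = 2.3, sd = 3.0)
args2 <- list(mean = 2.3, sd = 3.0)
args3 <- list(mean = 2.3, sd = 3.5)

# digest() serializes the object and hashes the result (MD5 by default)
key1 <- digest(args1)
key2 <- digest(args2)
key3 <- digest(args3)

identical(key1, key2)  # equal objects give equal keys
identical(key1, key3)  # different objects give different keys (barring collisions)
nchar(key1)            # an MD5 digest is 32 hex characters
```

Keys like these can serve as names in an environment or as file names in an on-disk cache.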
Memoization

  • memoise - a very simple tool for memoization of functions.
  • R.cache - offers more functionality for memoization, though it seems some of the functions lack examples.
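
For a sense of how little code memoise needs, here is a minimal sketch. It assumes the memoise package is installed; `slow_square` and the `calls` counter are hypothetical names used only to show when the underlying function really runs.

```r
library(memoise)

calls <- 0L  # counts how often the underlying function really runs

slow_square <- function(x) {
    calls <<- calls + 1L
    Sys.sleep(0.1)  # stand-in for an expensive calculation
    x^2
}

fast_square <- memoise(slow_square)

r1 <- fast_square(4)  # computed
r2 <- fast_square(4)  # served from the cache
r3 <- fast_square(5)  # new argument, computed

calls                     # 2: the repeated call did not run slow_square
is.memoised(fast_square)  # TRUE
forget(fast_square)       # clears the cache for this function
```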
Caching

  • hash - Provides caching functionality akin to Perl's hashes and Python dictionaries.
Key/value storage

    These are basic options for external storage of R objects.

  • stashR
  • filehash
Checkpointing

  • cacher - this seems to be more akin to checkpointing.
  • CodeDepends - An OmegaHat project that underpins cacher and provides some useful functionality.
  • DMTCP (not an R package) - appears to support checkpointing in a bunch of languages, and a developer recently sought assistance testing DMTCP checkpointing in R.
Other

  • Base R supports: named vectors and lists, row and column names of data frames, and names of items in environments. It seems to me that using a list is a bit of a kludge. (There's also pairlist, but it is deprecated.)
  • The data.table package supports rapid lookups of elements in a data table.
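
Of the base R options, an environment comes closest to a Perl hash or Python dictionary, because it is mutable and looked up by name. The following is a minimal sketch using only base R; the keys and values are invented for illustration.

```r
# An environment used as a mutable key/value store
h <- new.env(hash = TRUE, parent = emptyenv())

# Insert or update
h[["apple"]] <- 3L
assign("pear", 5L, envir = h)
h[["apple"]] <- h[["apple"]] + 1L

# Look up; a missing key returns NULL with [[ ]]
h[["apple"]]          # 4
is.null(h[["kiwi"]])  # TRUE

# Test for a key, list keys, delete a key
exists("pear", envir = h, inherits = FALSE)  # TRUE
ls(h)                                        # "apple" "pear"
rm("pear", envir = h)

# Iterate over the remaining key/value pairs
for (k in ls(h)) cat(k, "=", h[[k]], "\n")
```

Using `emptyenv()` as the parent keeps lookups from falling through to unrelated objects, which is why `inherits = FALSE` is also passed to `exists()`.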

Use case

    Although I'm mostly interested in knowing the options, I have two basic use cases that arise:

  • Caching: Simple counting of strings. [Note: This isn't for NLP, but general use, so NLP libraries are overkill; tables are inadequate because I prefer not to wait until the entire set of strings is loaded into memory. Perl-style hashes are at the right level of utility.]
  • Memoization of monstrous calculations.
These really arise because I'm digging into the profiling of some slooooow code and I'd really like to just count simple strings and see if I can speed up some calculations via memoization. Being able to hash the input values, even if I don't memoize, would let me see if memoization can help.
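
To make the second use case concrete, here is a hand-rolled sketch in base R that both memoizes a function and reports how often inputs repeat, which is the number that tells you whether memoization can help. `memoize_simple`, its key scheme and the `stats` attribute are illustrative inventions, not the API of any package.

```r
# Wrap f so that results are cached under a key built from the arguments.
memoize_simple <- function(f) {
    cache <- new.env(hash = TRUE, parent = emptyenv())
    hits <- 0L
    g <- function(...) {
        # deparse() is adequate for small atomic arguments; hashing the
        # arguments would be more robust for large or complex objects
        key <- paste(deparse(list(...)), collapse = "")
        if (exists(key, envir = cache, inherits = FALSE)) {
            hits <<- hits + 1L
            return(cache[[key]])
        }
        val <- f(...)
        assign(key, val, envir = cache)
        val
    }
    # Expose cache statistics without exposing the cache itself
    attr(g, "stats") <- function() c(hits = hits, stored = length(cache))
    g
}

slow_sum <- function(a, b) {
    Sys.sleep(0.05)  # stand-in for a monstrous calculation
    a + b
}
fast_sum <- memoize_simple(slow_sum)

inputs <- list(c(1, 2), c(3, 4), c(1, 2), c(1, 2))
results <- vapply(inputs, function(p) fast_sum(p[1], p[2]), numeric(1))

results                    # 3 7 3 3
attr(fast_sum, "stats")()  # hits = 2, stored = 2
```

A high ratio of hits to stored entries suggests memoization will pay off; a ratio near zero means the inputs rarely repeat and the cache only costs memory.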


    Note 1: The CRAN Task View on Reproducible Research lists a couple of the packages (cacher and R.cache), but there is no elaboration on usage options.

    Note 2: To aid others looking for related code, here are a few notes on some of the authors or packages. Some of the authors use SO. :)

  • Dirk Eddelbuettel: digest - a lot of other packages depend on this.
  • Roger Peng: cacher, filehash, stashR - these address different problems in different ways; see Roger's site for more packages.
  • Christopher Brown: hash - Seems to be a useful package, but the links to ODG are down, unfortunately.
  • Henrik Bengtsson: R.cache & Hadley Wickham: memoise -- it's not yet clear when to prefer one package over the other.
    Note 3: Some people use memoise/memoisation, others use memoize/memoization. Just a note if you're searching around. Henrik uses "z" and Hadley uses "s".


    For simple counting of strings (and not using table or similar), a multiset data structure seems like a good fit. The environment object can be used to emulate this.

    # Define the insert function for a multiset
    msetInsert <- function(mset, s) {
        if (exists(s, mset, inherits=FALSE)) {
            mset[[s]] <- mset[[s]] + 1L
        } else {
            mset[[s]] <- 1L 
        }
    }
    
    # First we generate a bunch of strings
    n <- 1e5L  # Total number of strings
    nus <- 1e3L  # Number of unique strings
    ustrs <- paste("Str", seq_len(nus))
    
    set.seed(42)
    strs <- sample(ustrs, n, replace=TRUE)
    
    
    # Now we use an environment as our multiset    
    mset <- new.env(TRUE, emptyenv()) # Ensure hashing is enabled
    
    # ...and insert the strings one by one...
    for (s in strs) {
        msetInsert(mset, s)
    }
    
    # Now we should have nus unique strings in the multiset    
    identical(nus, length(mset))
    
    # And the names should be correct
    identical(sort(ustrs), sort(names(as.list(mset))))
    
    # ...And an example of getting the count for a specific string
    mset[["Str 3"]] # "Str 3" instance count (97)
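
Once the counts are in an environment, they can be pulled out as an ordinary named vector. The sketch below is self-contained and uses a tiny made-up input instead of the 1e5 strings above; `counts_env` is a hypothetical name.

```r
# Small stand-in for the stream of strings
strs <- c("b", "a", "b", "c", "b", "a")

counts_env <- new.env(hash = TRUE, parent = emptyenv())
for (s in strs) {
    old <- counts_env[[s]]  # NULL if s has not been seen yet
    counts_env[[s]] <- if (is.null(old)) 1L else old + 1L
}

# Pull all counts out as a named integer vector, sorted by key
keys <- sort(ls(counts_env))
counts <- vapply(keys, function(k) counts_env[[k]], integer(1))
counts  # a = 2, b = 3, c = 1

# Cross-check against table(), which needs all strings in memory at once
tab <- table(strs)
all(counts == as.integer(tab[keys]))  # TRUE
```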
    

    I did not have luck with memoise: it ran into a too-deep-recursion problem with a function from a package I tried it on. I had better luck with R.cache. The following is annotated code that I adapted from the R.cache documentation; it shows different ways to do caching.

    # Workaround to avoid question when loading R.cache library
    dir.create(path="~/.Rcache", showWarnings=FALSE)
    library("R.cache")
    setCacheRootPath(path="./.Rcache") # Create .Rcache at current working dir
    # In case we need the cache path, but not used in this example.
    cache.root <- getCacheRootPath()
    simulate <- function(mean, sd) {
        # 1. Try to load cached data, if already generated
        key <- list(mean, sd)
        data <- loadCache(key)
        if (!is.null(data)) {
            cat("Loaded cached data\n")
            return(data)
        }
        # 2. If not available, generate it.
        cat("Generating data from scratch...")
        data <- rnorm(1000, mean=mean, sd=sd)
        Sys.sleep(1) # Emulate slow algorithm
        cat("ok\n")
        saveCache(data, key=key, comment="simulate()")
        data
    }
    data <- simulate(2.3, 3.0)
    data <- simulate(2.3, 3.5)
    a <- 2.3
    b <- 3.0
    data <- simulate(a, b) # Will load cached data; the key is compared by value
    # Clean up
    file.remove(findCache(key=list(2.3,3.0)))
    file.remove(findCache(key=list(2.3,3.5)))
    
    simulate2 <- function(mean, sd) {
        data <- rnorm(1000, mean=mean, sd=sd)
        Sys.sleep(1) # Emulate slow algorithm
        cat("Done generating data from scratch\n")
        data
    }
    # Easy step to memoize a function.
    # This works with any function, including ones from external packages.
    mzs <- addMemoization(simulate2)
    
    data <- mzs(2.3, 3.0)
    data <- mzs(2.3, 3.5)
    data <- mzs(2.3, 3.0) # Will load cached data
    # It is also possible to reassign the function name,
    # but different memoizations of the same
    # function will return the same cached result
    # if the input parameters are the same.
    simulate2 <- addMemoization(simulate2)
    data <- simulate2(2.3, 3.0)
    
    # If the expression being evaluated depends on
    # "input" objects, then these must be specified
    # explicitly as "key" objects.
    for (ii in 1:2) {
        for (kk in 1:3) {
            cat(sprintf("Iteration #%d:\n", kk))
            res <- evalWithMemoization({
                cat("Evaluating expression...")
                a <- kk
                Sys.sleep(1)
                cat("done\n")
                a
            }, key=list(kk=kk))
            # The expression passed to evalWithMemoization() is skipped on repeated runs
            print(res)
            # Sanity checks
            stopifnot(a == kk)
            # Clean up
            rm(a)
        } # for (kk ...)
    } # for (ii ...)
    

    Related to @biocyperman's solution: R.cache provides a wrapper, evalWithMemoization(), that handles loading, evaluating and saving the cache for you. You can simplify the code like this:

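    simulate <- function(mean, sd) {
        key <- list(mean, sd)
        # The expression is evaluated only when the key is not yet cached
        data <- evalWithMemoization(key = key, expr = {
            cat("Generating data from scratch...")
            data <- rnorm(1000, mean=mean, sd=sd)
            Sys.sleep(1) # Emulate slow algorithm
            cat("ok\n")
            data
        })
        data # Return visibly rather than as the value of an assignment
    }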
    