* `Cache` now captures the first argument passed to it without evaluating it, so `Cache(rnorm(1))` now works as expected.
* `Cache` now works with the base pipe `|>` (R >= 4.1).
* `Cache`: there should be no problems with previous cache databases being successfully recovered.
* `Cache` internals revised so that digesting is more accurate: the correct methods for functions are more accurately found, and objects within functions are more precisely evaluated.
* `()` in DESCRIPTION for functions;
* `.Rd` files for exported methods (structure, the class, the output meaning);
* `postProcess` now also checks resolution when assessing whether to project.
* `prepInputs` has an internal `Cache` call for loading the object into memory; this was not correctly evaluating all files when more than one file was downloaded and extracted. This resulted in cases, e.g. shapefiles, being considered identical if they had identical geometries, even if their data were different. This is fixed now, as it uses the digest of all files extracted.
* `terra`, because previous versions were causing collisions.
* `sf` are used throughout
* `prepInputs` can now take `fun` as a quoted expression on `x`, the object loaded by
* `dlFun` can now be a quoted expression.
* `objSize` is now primarily a wrapper around `lobstr::obj_size`, but has an option to get more detail for lists and environments.
* `.robustDigest` now deals explicitly with numerics, which digest differently on different OSs; namely, they get rounded prior to digesting. Through trial and error, it was found that setting `options("reproducible.digestDigits" = 7)` was sufficient for all known cases; rounding to deeper than 7 decimal places was insufficient. There are also new methods for `data.frame` (which digests each column one at a time to address the numeric issue)
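The rounding idea above can be seen directly in base R. This is illustrative only, not the package's internals: rounding numerics before digesting hides tiny floating-point differences that can vary between operating systems.

```r
# Illustrative only (not reproducible's internals): rounding numerics
# before digesting hides tiny floating-point differences between OSs.
x1 <- 0.1 + 0.2   # not bitwise-identical to 0.3
x2 <- 0.3
identical(x1, x2)                       # FALSE
identical(round(x1, 7), round(x2, 7))   # TRUE
```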
* `postProcessTerra`. This will eventually replace `postProcess`, as it is much faster in all cases with a simpler code base, thanks to the fantastic work of Robert Hijmans (`terra`) and all the upstream work that
* `Cache` call. If there is a delay after this message, then it is the code following the `Cache` call that is (silently) slow.
* `retry` can now return a named list for the `exprBetween`, which allows more than one object to be modified between retries.
* `.robustDigest` was removing Cache attributes from objects under many conditions, when it should have left them there. It is unclear what the issues were, as this would likely not have impacted `Cache`. Now these attributes are left on.
* `data.table` objects appeared to not be recovered correctly from disk (e.g., from the Cache repository). We have added a `data.table::copy` when recovering from the Cache repository.
* `cc` did not correctly remove file-backed raster files (when not clearing the whole CacheRepo); this may have resulted in a proliferation of files, each a filename with an underscore and a new, higher number. This fix should eliminate the problem.
* `maskInputs()` when not passing
* `terra` class objects can now be correctly saved and recovered by
* `fixErrors` can now distinguish `testValidity = NA`, meaning don't fix errors; `testValidity = FALSE`, meaning run buffering, which fixes many errors, without first testing whether there are any invalid polygons (maybe slow); and `testValidity = TRUE`, meaning test for validity, then, if some are invalid, run buffer.
* `reproducible.useNewDigestAlgorithm = 2`, which will have user-visible changes. To keep the old behaviour, set `options(reproducible.useNewDigestAlgorithm = 1)`.
* `options(reproducible.showSimilar)` is set. It is now more compact, e.g., 3 lines instead of 5.
* reproducible will be slowly changing the defaults for vector GIS datasets from the `sp` package to the `sf` package. There is a large user-visible change that will come (in the next release), which will cause `prepInputs` to read `.shp` files with `sf::st_read` instead of `raster::shapefile`, as it is much faster. To change now, set `options("reproducible.shapefileRead" = "sf::st_read")`.
* `prepInputs` for shapefiles (`.shp`) now defaults to `sf::st_read` if the system has `sf` installed. This can be overridden with `options("reproducible.shapefileRead" = "raster::shapefile")`; this is indicated with a message at the moment it is occurring, as it will cause different behaviour.
* `Cache` can now be a character vector, allowing individual character arguments to be digested as character vectors and others to be digested as files located at the path represented by the character vector.
* `objSize` previously included objects in `emptyenv`, so it was generally too large. Now it uses the same criteria as
* `unzip` missing (thanks to @CeresBarros, #202)
* `7z.exe` on Windows if the object is larger than 2GB, if it can't find
* `prepInputs` and family can now be a quoted expression.
* `prepInputs` can now be `NA`, which means to treat the file downloaded not as an archive, even if it has a
* `postProcess`, especially for very large objects (>5GB tested). Previously, it was running many `fixErrors` calls; now `fixErrors` is called only on failure of the proximate call (e.g., `st_crop` or whatever).
* `retry` now has a new argument, `exprBetween`, to allow doing something after a failure (for example, if an operation such as `st_crop` fails, then run `fixErrors`, then return to `st_crop` for the retry).
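The fail-fix-retry pattern that `exprBetween` enables can be sketched in base R. `retry_sketch` below is a hypothetical helper for illustration, not the package's implementation of `retry`:

```r
# Base-R sketch of the retry-with-intermediate-fix pattern (hypothetical
# helper, not reproducible's retry()): if expr fails, run exprBetween
# (e.g., a fixErrors-style repair), then try expr again.
retry_sketch <- function(expr, exprBetween = NULL, retries = 3) {
  for (i in seq_len(retries)) {
    res <- try(expr(), silent = TRUE)
    if (!inherits(res, "try-error")) return(res)   # success: stop retrying
    if (!is.null(exprBetween)) exprBetween()       # attempt a repair between tries
  }
  stop("all retries failed")
}
```

In the package's version, `expr` would be something like the `st_crop` call and `exprBetween` the `fixErrors` repair.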
* `Cache` now has MUCH better nested-levels detection, with messaging, and control of how deep the caching goes seems good: `useCache = 2` will only Cache 2 levels in.
* `prepInputs` family can now be `NA`, meaning do not try to unzip even if it is a `.zip` file or other standard archive extension.
* `gdb.zip` files (e.g., a file with a `.zip` extension, but that should not be opened with an unzip-type program) can now be opened with `prepInputs(url = "whateverUrl", archive = NA, fun = "sf::st_read")`.
* `prepInputs` can now be a quoted function call.
* `preProcess` now does a better job with large archives that can't be correctly handled with R's default `unzip`, by trying `system2` calls to `7z.exe` or other options on Linux-alikes.
* The `Copy` generic no longer has a `fileBackedDir` argument; it is now passed through with the `...`. The argument was creating a bug in some cases where `fileBackedDir` was not being correctly used.
* `fixErrors()` now better handles `sf` polygons with mixed geometries that include points.
* `writeOutputs.Raster` attempted to change `Raster` class objects using the replacement function `dataType<-`, without subsequently writing to disk via `writeRaster`. This created bad values in the `Raster*` object. It now performs a `writeRaster` if there is a
* `updateSlotFilename` has many more tests.
* `prepInputs(..., fun = NA)` is now the correct specification for "do not load the object into R". This essentially replicates `preProcess` with the same arguments.
* `Copy` did not correctly copy `RasterStack`s when some of the `RasterLayer` objects were in memory and some on disk; `FALSE` in those cases meant `Copy` didn't occur on the file-backed layer files. It now uses `Filenames` instead to determine whether there are any files that need copying.
* `options("reproducible.useNewDigestAlgorithm" = 2)`
* `options("reproducible.polygonShortcut" = FALSE)`, as there were still too many edge cases that were not covered.
* `RasterStack` objects with a single file (thus acting like a `RasterBrick`) are now handled correctly by the `prepInputs` family, especially with the new `options("reproducible.useNewDigestAlgorithm" = 2)`, though in tests it worked with the default also.
* `RSQLite` now uses a RNG during `dbAppend`; this affected 2 tests (#185).
* `paddedFloatToChar` moved to reproducible from SpaDES.core.
* `magrittr`, to allow the cached alternative, `%C%`. With the new `magrittr` pipe now in compiled source code, more of the legacy code is required here.
* `reproducible.messageColourQuestion` for questions that require user input. Defaults are `green`, respectively. These are user-visible colour changes.
* `Cache` cases where a `file.link` is used instead of saving.
* `options(reproducible.verbose = 0)` will turn off almost all messaging.
* `postProcess` and family now have `filename2 = NULL` as the default, so outputs are not saved to disk. This is a change.
* `verbose` is now an argument throughout, whose default is `getOption("reproducible.verbose")`, which is set by default to `1`. Thus, individual function calls can be more or less verbose, or the whole session via the option.
* `postProcess` now uses a simpler single call to `gdalwarp`, if available, for the `RasterLayer` class to accomplish `writeOutputs` all at once. This should be faster, simpler and, perhaps, more stable. It will only be invoked if the `RasterLayer` is too large to fit into RAM. To force its use, set `useGDAL = "force"` in `postProcess` or globally with `options("reproducible.useGDAL" = "force")`.
* `postProcess`, when using the new `gdalwarp`, has better persistence of the colour table and NA values, as these are kept with better reliability.
* `Cache` now works as expected (e.g., with parallel processing it will avoid collisions) with SQLite, thanks to the suggestion here: https://stackoverflow.com/a/44445010
* `Raster` class objects to account for more of the metadata (including the colortable). This will change the digest value of all `Raster` layers, causing re-runs of
* `checkPath` that were moved to the `Require` package. For backwards compatibility, these are imported and re-exported.
* `file.move` used to rename/copy files across disks (a situation where
* `DBI`-type functions now have default
* `Cache(prepInputs, ...)` on a file-backed `Raster*` class object now gives the non-Cache repository folder as the `filename(returnRaster)`. Previously, the returned object would contain the cache repository as the folder for the file-backed
* versions; moved to Suggests: `Require`. Now there are 12 non-base packages listed in Imports, down from 31 prior to version 1.0.0.
* `saveToCache`. This would have resulted in C stack overflow errors due to a missing original file in the
* `unzip` when extracting large (>= 4GB) files (#145, @tati-micheletti)
* `projectInputs` when converting to longlat projections,
* `Filenames` now consistently returns a character vector (#149)
* `raster`) are updated.
* `options('reproducible.cacheSaveFormat')` on the fly; cache will look for the file by `cacheId` and write it using `options('reproducible.cacheSaveFormat')`. If it is in another format, Cache will load it and re-save it with the new format. Still experimental.
* `ANY`, as it would be dispatched for unknown classes that inherit from `environment`, of which there are many, and this should be intercepted
* `Require` can now handle minimum version numbers, e.g., `Require("bit (>=1.1-15.2)")`; this can be worked into downstream tools. Still experimental.
* `file.symlink` if an existing Cache entry with identical output exists and it is large (currently `1e6` bytes); this will save disk space.
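The link-instead-of-copy idea can be sketched with base R's file functions. `link_or_copy` is a hypothetical helper for illustration (not reproducible's code): a hard link or symlink shares the bytes on disk, with a real copy only as a fallback.

```r
# Hedged sketch (not reproducible's internals): try a hard link first,
# then a symlink, then fall back to a full copy. Links share the same
# bytes on disk, so no extra space is used.
link_or_copy <- function(from, to) {
  ok <- suppressWarnings(tryCatch(file.link(from, to),
                                  error = function(e) FALSE))
  if (!isTRUE(ok)) ok <- suppressWarnings(tryCatch(file.symlink(from, to),
                                                   error = function(e) FALSE))
  if (!isTRUE(ok)) ok <- file.copy(from, to)  # last resort: real copy
  isTRUE(ok)
}
```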
* `preProcess`). Includes 2 new functions: `tempfile2`, for use with `reproducible.tempPath`, which is used for the new control of temporary files. Defaults to `file.path(tempdir(), "reproducible")`. This feature was requested to help manage large amounts of temporary objects that were not being easily and automatically cleaned.
* `conn`; the user may need to manually call `movedCache` if the cache is not responding correctly. File-backed Rasters are automatically updated with new paths.
* `Raster*` objects will have their filenames updated on the fly during a Cache recovery. The user doesn't need to do anything.
* `postProcess` now performs simple tests and skips `projectInputs`, with a message, if it can, rather than using `Cache` to "skip". This should speed up `postProcess` in many cases.
* `Cache` has changed. Now, the `cacheId` is shown in all cases, making it easier to identify specific items in the cache.
* `Copy` only creates a temporary directory for file-backed rasters; previously, any `Copy` command was creating a temporary directory, regardless of whether it was needed.
* `cropInputs.spatialObjects` had a bug when the object was a large non-Raster class.
* `cropInputs` may have failed due to a "self intersection" error when `x` was a `SpatialPolygons*` object; it now catches the error, then runs `crop`. Great reprex by @tati-micheletti. Fixed in commit
* `Filenames` bugfix related to
* `prepInputs` does a better job of keeping all temporary files in a temporary folder, and cleans up after itself better.
* `prepInputs` now will not show the message that it is loading an object into R if `fun = NULL` (#135).
* `options("reproducible.useDBI" = FALSE)`
* `DBI` package directly, without `archivist`. This has much improved speed.
* `options("reproducible.cacheSaveFormat")`. This can be either `qs`. All cached objects will be saved with this format. Previously it was `qs::qsave`. In many cases, this has much improved speed and file sizes compared to `rds`; however, testing across a wide range of conditions will occur before it becomes the default.
* `Cache` is now much faster; the default is to turn memoising off, via `options("reproducible.useMemoise" = FALSE)`. In cases of large objects, memoising should still be faster, so the user can still activate it by setting the option to
* `useGDAL` can now take `"force"`, as the default behaviour is to not use GDAL if the problem can fit into RAM and `raster` tools will be faster than
* `Cache` and family have slightly modified functionality (see `?Cache`, new section on `useCloud`) and now have more tests, including edge cases such as `useCloud = TRUE, useCache = 'overwrite'`. The cloud version now will also follow the
* `archivist`; moved to Suggests.
* `tidyselect`. Some of these went to Suggests.
* `postProcess` calls that use GDAL made more robust (including #93).
* `dplyr` as a direct dependency. It is still an indirect dependency through
* `reproducible.showSimilarDepth` allows a deeper assessment of nested lists for differences between the nearest cached object and the present object. This greater depth may allow a more fine-tuned understanding of why an object is not correctly caching.
* `options("reproducible.futurePlan")` set to something other than `FALSE` will show download progress if the file is "large".
* `googledrive` v1.0.0 (#119)
* `pkgDep2`, a new convenience function to get the dependencies of the "first order" dependencies.
* `useCache`, used in many functions (incl. `postProcess`), can now be numeric, a qualitative indicator of "how deep" nested `Cache` calls should set `useCache = TRUE`; implemented as 1 or 2 in
* `pkgDep` was becoming unreliable for unknown reasons. It has been reimplemented, much faster, without memoising. The speed gains should be immediately noticeable (6 seconds to 0.1 seconds for
* `retry` to use exponential backoff when attempting to access online resources (#121)
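Exponential backoff means doubling the wait after each failed attempt so a flaky online resource gets progressively more breathing room. The sketch below is generic and illustrative; the helper name and delays are assumptions, not reproducible's internals:

```r
# Generic exponential-backoff sketch (illustrative; not reproducible's code).
# Each failed attempt doubles the sleep before the next try.
with_backoff <- function(expr, retries = 4, base_delay = 0.05) {
  for (i in seq_len(retries)) {
    res <- try(expr(), silent = TRUE)
    if (!inherits(res, "try-error")) return(res)
    Sys.sleep(base_delay * 2^(i - 1))  # 0.05s, 0.1s, 0.2s, ...
  }
  stop("all retries failed")
}
```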
* `cloudFolderID`. This is a new approach to cloud caching. It has been tested with file-backed `RasterBrick` and all normal R objects. It will not work for any other class of disk-backed files, e.g., `bigmatrix`, nor is it likely to work for R6 class objects.
* `downloadData` from Google Drive now protects against an HTTP2 error by capturing the error and retrying. This is a curl issue for interrupted connections.
* `rcnst` errors on R-devel, tested using `devtools::check(env_vars = list("R_COMPILE_PKGS" = 1, "R_JIT_STRATEGY" = 4, "R_CHECK_CONSTANTS" = 5))`
* `retry`, a new function, wraps `try` with an explicit attempt to retry the same code upon error. Useful for flaky functions, such as `googledrive::drive_download`, which sometimes fails due to
* `Rcpp` functionality, as the functions were no longer faster than their base R alternatives.
* `prepInputs` was not correctly passing
* `cropInputs` was reprojecting the extent of `y` as a time-saving approach, but this was incorrect if a `SpatialPolygon` is not close to filling the extent. It now reprojects `studyArea` directly, which will be slower, but correct. (#93)
* `CHECKSUMS.txt` should now be ordered consistently across operating systems (note: `base::order` will not succeed in doing this --> now using
* `cloudSyncCache` has a new argument, `cacheIds`. The user can now control entries by `cacheId`, and so can delete/upload individual objects by
* `%>%` pipe that was long ago deprecated. Users should use `%C%` if they want a pipe that is Cache-aware. See examples.
* `options` descriptions now in
* `options("reproducible.cachePath")` can take a vector of paths. Similar to how `.libPaths()` works for libraries, `Cache` will search first in the first entry of the `cacheRepo`, then the second, etc., until it finds an entry. It will only write to the first entry.
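A read-through hierarchy of cache paths might be configured like this (the paths are assumptions for illustration):

```r
# Hypothetical paths: Cache searches the first entry, then the second,
# but writes only to the first entry -- analogous to .libPaths().
options(reproducible.cachePath = c("~/projectCache", "/shared/teamCache"))
```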
* `options("reproducible.useCache" = "devMode")`. The point of this mode is to facilitate using the Cache when functions and datasets are continually in flux, and old Cache entries are likely stale very often. In `devMode`, the cache mechanism works as normal if the Cache call is the first time for a function OR if it successfully finds a copy in the cache based on the normal Cache mechanism. It differs from normal Cache when the Cache call does not find a copy in the `cacheRepo` but does find an entry that matches based on `userTags`: in this case, it will delete the old entry in the `cacheRepo` (identified by matching `userTags`), then continue with normal `Cache`. For this to work correctly, `userTags` must be unique for each function call. This should be used with caution, as it is still experimental.
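A typical session toggling this mode might look like the following (illustrative config fragment; the option values come from the entry above):

```r
# Turn devMode on while functions/data are in flux; each Cache call
# should carry unique userTags so stale entries can be replaced.
options(reproducible.useCache = "devMode")
# ... iterative development ...
options(reproducible.useCache = TRUE)  # back to normal caching
```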
* `options("reproducible.useNewDigestAlgorithm" = FALSE)`. There is a message about this change on package load.
* `cloudCache`, which allows sharing of Cache among collaborators. Currently only works with
* `assessDataType` into a single function (#71, @ianmseddy)
* `cc`: new function, a shortcut for some commonly used options for
* `.rar` archives, on systems with the correct binaries to deal with them (#86, @tati-micheletti)
* `fastdigest::fastdigest`, as it does not return an identical hash across operating systems
* `prepInputs` on GIS objects that don't use `raster::raster` to load the object were skipping
* `prepInputs` would cause virtually all entries in `CHECKSUMS.txt` to be deleted. Two cases where this happened were identified and corrected.
* `data.table` class objects would sometimes give an error due to the use of `attr(DT)`. Internally, attributes are now added with `data.table::setattr` to deal with this.
* `postProcess` now correctly matches extent (#73, @tati-micheletti)
* New value possible for `options(reproducible.useCache = 'overwrite')`, which allows use of `Cache` in cases where the function call has an entry in the `cacheRepo`: it will purge that entry and add the output of the current call instead.
* `FALSE`), which will be used in `prepInputs` as possible directory sources (searched recursively or not) for files being downloaded/extracted/prepared. This allows the use of local copies of files in (an)other location(s) instead of downloading them. If the local location does not have the required files, it will proceed to download, so there is little cost in setting this option. If the files do exist on the local system, the function will attempt to use a hardlink before making a copy.
* `dlGoogle()` now sets `options(httr_oob_default = TRUE)` if using RStudio Server.
* `CHECKSUMS` now sorted alphabetically.
* `Checksums` can now have a `CHECKSUMS.txt` file located in a different place than the
* `assessDataTypeGDAL`, used in `postProcess`, to identify the smallest `datatype` for large `Raster*` objects passed to the GDAL system call
* `gdalwarp` system call if `raster::canProcessInMemory(x, 4) = FALSE`, for faster and memory-safe processing
* `Raster` objects, including factor rasters
* `extractFromArchive` for large (>2GB) zip files: `unzip` fails for zip files >2GB. This uses a system call if the zip file is too large and fails using
* `Cache()` when deeply nested, due to a `grep(sys.calls(), ...)` that would take long and hang.
* `preProcess(url = NULL)` (#65, @tati-micheletti)
* `clearCache` (#67), especially for large `Raster` objects that are stored as binary
* `raster` package changes in the development version of
* `.robustDigest` now does not include
* `Cache` saving to the SQLite database, via `options("reproducible.futurePlan")`, if the `future` package is installed. This is
* When a `do.call` function is Cached, previously it would be labelled in the database as `do.call`; now it attempts to extract the actual function being called by the `do.call`. Messaging is similarly changed.
* `reproducible.ask`: logical, indicating whether `clearCache` should ask for deletions when in an interactive session.
* `dlFun`, to pass a custom function for downloading (e.g., `"raster::getData"`)
* `prepInputs` will automatically use `readRDS` if the file is a
* `prepInputs` will return a
* `fun = "base::load"`, with a message; can still pass an `envir` to obtain the standard behaviour of
* `clearCache`: new argument
* `assessDataType`, used in `postProcess`, to identify the smallest `datatype` for `Raster*` objects, if the user does not pass an explicit
* `git2r` update (@stewid, #36).
* `.prepareRasterBackedFile`: now will postpend an incremented numeric to a cached copy of a file-backed Raster object, if it already exists. This mirrors the behaviour of the `.rda` file. Previously, if two Cache events returned the same file name backing a Raster object, even if the content was different, it would allow the same file name. If either cached object was deleted, therefore, it would cause the other one to break, as its file backing would be missing.
* `spades.XXX` and should have been
* `copyFile` did not perform correctly in all cases; now better handling of these cases, often sending to `file.copy` (slower, but more reliable).
* `extractFromArchive` needed a new `Checksum` function call under some circumstances.
* `extractFromArchive`: when dealing with nested zips, not all args were passed in recursively (#37, @CeresBarros).
* `prepInputs`: arguments that were the same as `Cache`'s were not being correctly passed internally to `Cache`; and if wrapped in `Cache`, they were not passed into `prepInputs`. Fixed.
* `.prepareFileBackedRaster` was failing in some cases (specifically, if it was inside a `do.call`) (#40, @CeresBarros).
* `Cache` was failing under some cases of `Cache(do.call, ...)`. Fixed.
* `Cache`: when arguments to Cache were the same as the arguments in `FUN`, Cache would "take" them. Now, they are correctly passed to the
* `preProcess`: writing to checksums may have produced a warning if `CHECKSUMS.txt` was not present. Now it does not.
* `convertRasterPaths`, to assist with renaming moved files.
* `prepInputs`: new features
* `alsoExtract` now has more options (`"similar"`) and defaults to extracting all files in an archive (
* `postProcess` altogether if no `rasterToMatch`. Previously, this would invoke Cache even if there was nothing to
* `prepInputs`, to aid in data downloading and preparation problems, solved in a reproducible, Cache-aware way.
* `postProcess`, which is a wrapper for sequences of several other new functions (
* `downloadFile` can handle Google Drive and ftp/http(s) files.
* `compareNA` does comparisons with NA as a possible value, e.g., `compareNA(c(1, NA), c(2, NA))` returns
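The NA-aware comparison described above can be sketched in a few lines of base R. `compareNA_sketch` is illustrative only, not the package's implementation: two NAs compare as TRUE, and an NA against a value compares as FALSE instead of propagating NA.

```r
# Minimal base-R sketch of NA-aware comparison (not the package's code):
# NA == NA yields TRUE; NA vs a non-NA value yields FALSE, never NA.
compareNA_sketch <- function(v1, v2) {
  same <- (v1 == v2) | (is.na(v1) & is.na(v2))
  same[is.na(same)] <- FALSE   # NA vs value: treat as not-equal
  same
}
compareNA_sketch(c(1, NA), c(2, NA))  # FALSE TRUE
```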
* `Cache`: new features
* `verbose`, which can help with debugging
* `useCache`, which allows turning caching on and off at a high level (e.g., `options("useCache")`)
* `cacheId`, which allows the user to hard code a result from a Cache
* `Cache` function calls, unless explicitly set on the inner functions
* `userTags` added automatically to cache entries, so much more powerful searching via
* `checksums` now returns a `data.table` with the same columns whether `write = TRUE` or `write = FALSE`.
* `showCache` now gives messages and requires user intervention if a `clearCache` request would delete large quantities of data.
* `memoise::memoise` is now used on the 3rd run through an identical `Cache` call, dramatically speeding it up in most cases.
* `asPath` has a new argument indicating how deep the path should be considered when included in caching (only relevant when `quick = TRUE`).
* New vignette on using Cache.
* parallel-safe, meaning there are `tryCatch`es around every attempt at writing to the SQLite database, so it can be used safely on multi-threaded machines.
* bug fixes, unit tests, more
* imports for packages, e.g.,
* updates for R 3.6.0 compact storage of sequence vectors
* experimental pipes (`%C%`) and assign
* several performance enhancements
* `mergeCache`: a new function to merge two different Cache repositories.
* `memoise::memoise` is now used on `loadFromLocalRepo`, meaning that the 3rd time `Cache()` is run on the same arguments (and the 2nd time in a session), the returned Cache will be from a RAM object via memoise. To stop this behaviour and use only disk-based Caching, set `options(reproducible.useMemoise = FALSE)`.
* Cache assign: `%<%` can be used instead of a normal assign, equivalent to `lhs <- Cache(rhs)`.
* New option: `reproducible.verbose`, set to `FALSE` by default; if set to `TRUE`, it may help in understanding caching behaviour, especially for complex, highly nested code.
* all options now described in
* All Cache arguments other than FUN and `...` will now propagate to internal, nested Cache calls, if they are not specified explicitly in each of the inner Cache calls.
* Cached pipe operator `%C%`: use to begin a pipe sequence, e.g., `Cache() %C% ...`
* `sideEffect` can now be a path.
* `digestPathContent` default changed from `FALSE` (was for speed) to `TRUE` (for content accuracy).
* `searchFull`, which shows the full search path, known alternatively as "scope" or "binding environments". It is where R will search for a function when requested by a user.
* `memoise::memoise` for several functions (`available.packages`) for speed; this will improve speed at the expense of memory.
* `require` on those 20 packages, but `require` does not check for dependencies and deal with them if missing: it just errors. This speed should be fast enough for many purposes.
* `dplyr` from Imports
* `RCurl` to Imports
* change name of
* `digestRaster` affecting in-memory rasters