Saturday, 21 September 2013

A few gotchas with R date-time classes

Date and time handling is essential to many modelling and analysis exercises, in R and in other languages used for scientific computing. Over the past few months I tackled the mapping of date-time concepts between R and the .NET framework as part of the work on the rClr package. A few weeks ago Mollie Taylor posted on Date formats in R, which I found an interesting read, as I always have to remind myself of date-time formats when I need them. I thought I'd share what I learned about date-time handling in R in light of its mapping to .NET (Date-time mapping).

Date-time handling essentials

R has several classes that represent date-time concepts: Date, POSIXct and POSIXlt, each with its own methods for the generic as.* functions to convert back and forth between them (Ripley and Hornik, 2001). Date effectively has a precision limited to one day, whereas POSIXt objects go down to a second. Importantly, POSIXt objects always have a time zone attached to them, implicitly or explicitly; check out ?Sys.timezone for details.
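As a minimal sketch of the conversions mentioned above, using only base R (the variable names are illustrative, and the time stamp is arbitrary):

```r
# A date-time with an explicit time zone, stored as seconds since the epoch
ct <- as.POSIXct("2013-09-21 14:30:00", tz = "UTC")

# The same instant as a broken-down (list-like) representation
lt <- as.POSIXlt(ct)

# Day precision only: the time-of-day part is dropped
d <- as.Date(ct)

class(ct)           # "POSIXct" "POSIXt"
lt$hour             # 14
format(d)           # "2013-09-21"
attr(ct, "tzone")   # "UTC"

# Going back from Date to POSIXct lands on midnight UTC of that day
back <- as.POSIXct(d)
as.numeric(back) == as.numeric(d) * 86400   # TRUE
```

POSIXct is usually the more convenient class for storage and arithmetic, while POSIXlt is handy when you need individual fields such as the hour or the day of the week.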

Important contributions have been made to date-time handling in R with the lubridate package (Grolemund and Wickham, 2011). It is very tempting to use lubridate classes, but because of the level of generality rClr aims for, it really needs to map to the core R date-time classes.

.NET has the types DateTime, DateTimeOffset, and TimeZoneInfo to deal with most date-time operations. A crucial difference with R is that DateTime purposely does not include time zone information, although it can be tagged as a UTC or Local date-time. Its system is overall less machine-dependent than R's, though not totally.

Daylight Saving Time. Whew! Where do I start? You'd think it is bad enough when it makes you miss a breakfast catch-up with friends, but it gets much worse when dealing with it in software. Leap seconds are lurking too, but thankfully I do not think I needed to worry about them for R/.NET interop.

R date-time gotchas

Here are a few things I noticed when setting up unit tests for rClr. When converting dates and times from UTC to local time you want to be careful which time zone you use; in particular, avoid Sys.timezone without arguments.
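A hedged sketch of the safer route, naming the target time zone explicitly rather than relying on what the machine reports. The zone "Australia/Sydney" and the time stamp are only illustrative choices, and the exact value returned by Sys.timezone() depends on the platform and R version, so it is not used here:

```r
# An instant defined unambiguously in UTC
utc <- as.POSIXct("2013-09-21 02:00:00", tz = "UTC")

# Render the same instant in an explicitly named (Olson) time zone
format(utc, tz = "Australia/Sydney")   # "2013-09-21 12:00:00"

# Alternatively, change only the time zone used for display;
# the underlying number of seconds since the epoch is untouched
local <- utc
attr(local, "tzone") <- "Australia/Sydney"
as.numeric(local) == as.numeric(utc)   # TRUE
```

Naming the zone in full keeps the conversion reproducible across machines, which matters for unit tests in particular.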

Of course daylight saving time comes with a few gotchas of its own; be careful of its effect when calculating time spans in time zones affected by DST.
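To illustrate the effect on time spans, here is a small sketch around the start of daylight saving time in Sydney on 6 October 2013 (the zone and dates are my choice of example; any DST transition would do):

```r
tz <- "Australia/Sydney"

# Local midnight on the day DST starts, and local midnight one day later
before <- as.POSIXct("2013-10-06 00:00:00", tz = tz)
after  <- as.POSIXct("2013-10-07 00:00:00", tz = tz)

# One calendar day apart, but the clocks jumped forward by an hour in between
span_local <- as.numeric(difftime(after, before, units = "hours"))
span_local   # 23

# The same wall-clock labels interpreted in UTC are a full 24 hours apart
span_utc <- as.numeric(difftime(
  as.POSIXct("2013-10-07 00:00:00", tz = "UTC"),
  as.POSIXct("2013-10-06 00:00:00", tz = "UTC"),
  units = "hours"))
span_utc     # 24
```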

If you create time stamps to use as time series indexes, you have to choose between round time stamps and constant time intervals when using DST-affected POSIXt objects: you cannot get both. 'Date' objects in R work around the issue for a daily time step, but if you need sub-daily time steps you need to think about it more carefully.
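The trade-off can be sketched with seq in base R, again using the Sydney DST transition of October 2013 as an illustrative case:

```r
tz <- "Australia/Sydney"
start <- as.POSIXct("2013-10-05 00:00:00", tz = tz)

# Option 1: step by calendar day. Round stamps, unequal intervals.
by_day <- seq(start, by = "day", length.out = 4)
format(by_day, "%H:%M")            # "00:00" "00:00" "00:00" "00:00"
diff(as.numeric(by_day)) / 3600    # 24 23 24

# Option 2: step by 86400 seconds. Equal intervals, stamps drift off midnight.
by_sec <- seq(start, by = 86400, length.out = 4)
format(by_sec, "%H:%M")            # "00:00" "00:00" "01:00" "01:00"
diff(as.numeric(by_sec)) / 3600    # 24 24 24

# Date objects side-step the issue at a daily time step
days <- seq(as.Date("2013-10-05"), by = "day", length.out = 4)
as.numeric(diff(days))             # 1 1 1
```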


Conclusion

I highlighted in this post only a few gotchas: be assured there are more peculiarities and oddities both in R and .NET date-time handling (not to mention the COM stuff, *shiver*). A few take-home messages to avoid the main traps:
  • Use lubridate. I could not in rClr, but you probably should.
  • Use UTC as an explicit time zone in your data time stamps, if you can.
  • Prefer, by a long shot, ISO 8601 date-time formats such as '2011-02-23 23:50:53', in R, Excel or anything else. Using them in your data and software will very likely save you a lot of grief down the track.
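As a short illustration of the last two points, here is a sketch in base R. The slash-separated date string is a made-up example of an ambiguous format, not data from any real source:

```r
# An ISO 8601 style stamp parses without a format string, and UTC is explicit
stamp <- as.POSIXct("2011-02-23 23:50:53", tz = "UTC")
iso <- format(stamp, "%Y-%m-%dT%H:%M:%SZ")
iso   # "2011-02-23T23:50:53Z"

# By contrast, a day/month ordering has to be guessed or documented elsewhere
ambiguous <- "02/03/2011"
as_dmy <- as.Date(ambiguous, format = "%d/%m/%Y")   # 2 March 2011
as_mdy <- as.Date(ambiguous, format = "%m/%d/%Y")   # 3 February 2011
format(as_dmy)   # "2011-03-02"
format(as_mdy)   # "2011-02-03"
```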

References

Brian D. Ripley and Kurt Hornik (2001), Date-Time Classes, R News, Volume 1/2. http://www.r-project.org/doc/Rnews/Rnews_2001-2.pdf
Garrett Grolemund and Hadley Wickham (2011), Dates and Times Made Easy with lubridate, Journal of Statistical Software, April 2011, Volume 40, Issue 3. http://www.jstatsoft.org/v40/i03/paper
Microsoft, Choosing Between DateTime, DateTimeOffset, and TimeZoneInfo. http://msdn.microsoft.com/en-us/library/bb384267.aspx

Tuesday, 25 June 2013

rClr: low level access to .NET from R

rClr is a package to access arbitrary .NET code seamlessly. The "CLR" part of the package name stands for Common Language Runtime. C# and R being languages I regularly use, I have felt the need for better interoperability between them for a few years. What started as a weekend investigation out of curiosity grew into rClr. There have already been a few rounds of beta releases and it is quite functional running on Windows with the Microsoft .NET Framework, hence this post. I have used it regularly for my work over the past nine months. Running on other operating systems with the Mono CLR is also supported and is almost at feature parity. After a bit more testing a tarball will be available.

A new beta version of the Windows binary package is currently available at rClr on Codeplex, alongside the source code under LGPL 2.1. While it is likely to work as is on many Windows boxes, you may need to install the latest Microsoft Visual C++ runtime. Instructions on how to do this are on the web site.

A quick tour with some sample code, starting with a customary "Hello world" with a bit of GUI for good measure.
The following sample shows that some of the package functions help to discover the content of loaded assemblies (i.e. .NET dynamic libraries), to reduce the need to go back to the source code.
A "complex" .NET object is essentially an external pointer (a structure similar to that in rJava).
The package is designed to allow access to existing .NET code without modification to that code (well, for code well designed for access anyway). rClr is also designed to be as intuitive as possible for users accustomed to R programming idioms. A corollary of that design is that data types are converted to their natural representation in each runtime whenever this is possible without ambiguity. The following table gives the conversions for the most commonly used one-dimensional vectors. This is not an exhaustive list of supported conversions.


mode       type       class      length  clrType
character  character  character  3       System.String[]
numeric    integer    integer    3       System.Int32[]
numeric    double     numeric    3       System.Double[]
logical    logical    logical    3       System.Boolean[]
numeric    double     Date       3       System.DateTime[]
numeric    double     POSIXct    3       System.DateTime[]
character  character  character  1       System.String
numeric    integer    integer    1       System.Int32
numeric    double     numeric    1       System.Double
logical    logical    logical    1       System.Boolean
numeric    double     Date       1       System.DateTime
numeric    double     POSIXct    1       System.DateTime
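The R-side columns of the table (mode, type, class, length) can be checked with base R alone. The sketch below is only an illustration: the helper name `describe` and the sample values are mine, and the clrType column is not reproduced since it requires rClr and a .NET runtime.

```r
# Hypothetical helper: summarise how R sees a vector
describe <- function(x) {
  data.frame(mode = mode(x), type = typeof(x), class = class(x)[1],
             length = length(x), stringsAsFactors = FALSE)
}

# One sample vector of length 3 per row of the table
samples <- list(
  c("a", "b", "c"),
  1:3,
  c(1.5, 2, 3),
  c(TRUE, FALSE, TRUE),
  as.Date("2013-06-25") + 0:2,
  as.POSIXct("2013-06-25 00:00:00", tz = "UTC") + 0:2
)

res <- do.call(rbind, lapply(samples, describe))
print(res)
```

Note that Date and POSIXct are both plain double vectors underneath; only the class attribute tells them apart, which is what a conversion layer has to rely on.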

I've used rClr to access environmental time stepping models written in C#, to combine them with the statistical and visualisation strengths of R. One of the tutorials on the web site is a self-contained simplified use case.




Roadmap

I am presenting at the useR conference in a couple of weeks. It is my first attendance, and I am really looking forward to meeting a new crowd.
A few wrinkles need ironing out for a first stable release of course, notably for running on *nix and MacOS (I "only" develop and test on a Debian box). Trailblazing testers and contributors are very welcome. The build process is inherently more complicated than for your average package, but this is alleviated with configure scripts. You can post questions and discussions through the web site.
Submission to CRAN is probably the next big item on the list, in preference to more features. While Codeplex is fine for my codebase management needs, it is not a typical go-to place for R users.

Acknowledgements

I gratefully acknowledge Kosei Abe for the nicely crafted R.NET library, which is reused in places in the rClr package. R.NET is primarily designed for .NET developers to access the R engine, but I envisage a growing role for it in rClr.
The package rJava by Simon Urbanek and other contributors was also a natural source of insight in my early investigations on how to tackle in-process interop of R and .NET.
Simon Knapp a few years ago presented a neat way to mix in-process R code with .NET via Python for .NET, and this led to the idea of the rClr package.