The world through Dutch eyes in Sydney
Category Archives: R
Advanced Excel user
About 10 years ago I started to look for data processing tools that can be an alternative to excel. The need arose at work to use a data analysis platform that allows the user:
- to have better reproducibility of the analysis
- to experience fewer data quality problems in the sense that corrupting cells in excel is so hard to avoid, especially if there are multiple users
- to apply more powerful algorithms
- to deal better with expanding data files without ending up copying formulas into the new data
- create great looking and complex graphs
- the not spend a great deal of money
Don’t get me wrong, excel is a versatile tool that has many useful applications. I have used array formulas in Excel, have applied it to data files with hundreds of thousands of lines, done the VBA thing and used all of this in a professional manufacturing environment in high volume electronics
If you have, like me, a more than passing interest in climate science and you are an engineer it helps if you have some understanding of the R statistical environment. Once you have mastered the key concepts in the global warming discussion, the question that inevitably presents itself is: what does the data tell you? The data beats the mainstream media and politics if you want to know what is going on.
Once you arrive at that question, a whole new universe opens itself. Many sources of information are available that allow you to directly look at the data. The KNMI (Dutch bureau of meteorology) becomes your friend as they have a website that provides access to essentially all climatic data and climate model output. Many universities publish raw data on their websites. You find that many scientist and bloggers alike use such websites to get their data. What happens if you have the data? Things get harder! Climatic data is often multidimensional, discouragingly lengthy and complex. As so many, I started by dumping the data into Excel as this is the data processing tool that most of us know. It is tempting to use Excel because we use it so often. Bad idea!
The problem with Excel, when handling large amounts of data, is that Excel puts the data in the center. The processing of the data and turning it into information is hard, slow and inconvenient in Excel. And this is where programming or scripting languages come in. Scripting languages such as R put the data processing algorithm in the center. This turns out to be nifty: if more data becomes available, you only have to obtain the new data (e.g. download it from the KNMI climate explorer) and run your script. And voila, updated information appears at the press of a button.
R is an open source statistical programming environment. That means it is free. If you ask the R guys what R is, you get this:
R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes
an effective data handling and storage facility,
a suite of operators for calculations on arrays, in particular matrices,
a large, coherent, integrated collection of intermediate tools for data analysis,
graphical facilities for data analysis and display either on-screen or on hard copy, and
a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
Well, I have some thoughts about the ‘simple’! After some time (it can be a while, the R learning curve can be steep) you start to understand that in R you can:
- Load data in a variety of formats (e.g. excel, csv, text etc)
- Apply analytical functions to the data (mostly statistical, e.g. mean, standard deviation to name a few very basic functions)
- Create scripts (programs) where you store your analysis into, so you can run them again later
- Plot your analysis results into charts
I use R at work, being in quality means to regularly work with real world data.
Recently I read a blog post by Willis Eschenbach about snow cover in the Norther Hemisphere. Willis is smart guy who blogs regularly at WUWT. In his blog post he speaks about the impact of albedo on climate. He says a couple of smart things about it, as he usually does. Read the post yourself if you are interested. He presents an analysis of northern hemisphere snow cover and he presents a chart, which I recreated at home using R. He was so kind as to link to his R code that he used for his analysis. The data comes from Rutgers University who have a department that tracks snow cover.
Figure 1: Northern Hemisphere snow cover analysis
This looks like a complex image with a substantial amount of data processing, and it is! The original data is shown in the top panel, a trend in the second panel, a seasonal analysis (average) in the third panel and finally the remainder between the observed data and the seasonal component in the fourth panel. In R you can do this with less than 10 lines of code! Try doing that with Excel!
The R code is here and let me talk you through the code.
# data source http://climate.rutgers.edu/snowcover/table_area.php?ui_set=0&ui_sort=0 options(digits=4) rsnow=read.csv("rutgers snow 2013.csv")
A reference to the source of the data is provided as a comment (# means comment). The ‘options’ command determines how many significant digits I want and the read.csv function looks for the csv file “rutgers snow 2013.csv” in my working directory. It reads the data into a variable called “rsnow”.
The ts function creates a time series from the date (newts) and allows us to apply an analysis relevant to time series. The function looks at data onwards from 1966 with a weekly resolution (frequency = 52). The window function selects the relevant time period in order to avoid data with gaps. And the third line checks for empty data records.
The decompose function is an excellent example of powerful high level analysis functions in R. It decomposes the time series into a seasonal (average) time series, a residual and and a trend. The object “mydecompose” is then plotted and a title of the plot is defined.
R is a pretty cool environment if you are interested in data and data analysis. It certainly takes a while to learn but it may be a good skill to have.
Basic linear regression is very important in engineering and science. In linear regression one tries to fit a line to whatever data you are sitting on top of. The objective is off course to investigate if that line would reveal a meaningful relationship between the variables of your data. The linear regression can also be used to test a hypothesis that tells you if the new-found relationship is statistically significant.
Depending on the format of your data, it is pretty easy to do a simple linear regression in R. To do that, we first need to get some data. In this case I use a simple data file that contains measurements of the boiling point of water as a function of height. The data is available on the internet but I downloaded to my hard disk. Here is the content of the file:
In the last post I have done something very simple: I called the built-in data set ‘sunspots’ and drew a basic plot. With very basic means I can do some more to this plot. E.g. I can add a title and titles for the axes:
plot(sunspots,xlab=”years”,ylab = “sunspot count”,main = “Sunspot numbers”)
Learning to work comfortably with a powerful statistics and graphics package has long been a wish. In my work as a quality manager I often come across problems that require me to perform either some statistical analysis or create graphs or both. I have laboured for years with Excel as do so many. But there are better ways!
Having read some of the work of professor Tufte (e.g. his amazing book “The Visual Display of Quantitative Information“) I am aware of the many other software solutions out there to tackle this problem: Minitab, Mathlab, S and R to just name a few. The attractive aspect of R is that it is available as open source software and free. And it is amazingly powerful! You can find several good examples of the graphing capabilities of R here. Read more of this post