About 10 years ago I started to look for data processing tools that could serve as an alternative to Excel. The need arose at work for a data analysis platform that offered more than Excel could.
Don’t get me wrong, Excel is a versatile tool that has many useful applications. I have used array formulas in Excel, applied it to data files with hundreds of thousands of lines, done the VBA thing, and used all of this in a professional high-volume electronics manufacturing environment.
Today, I still use Excel periodically but mostly for quick & dirty analysis or for easy viewing of data sets. Functions such as the filter function, remove duplicates, conditional formatting and some formulas in Excel are wonderful and allow for fast, easy and simple analysis of your data.
But the lack of reproducibility, the lack of effective validation methods if your formulas grow big, the focus on the data rather than on the script, the meager graphing abilities, the slow speed when analysing large data and the deplorable use of memory have certainly not made Excel the tool of choice for professional data analysis for me.
My skill level in R has grown over the last 10 years to a level that I can develop an interactive R script in the same time or faster compared to setting up a proper Excel sheet.
R is flexible
This is perhaps the biggest advantage and drawback of R at the same time: there are many ways to do the same thing in R and this may frustrate the inexperienced user. In the end R’s flexibility allows you to develop your own coding style.
R has great support
The support sphere for R is just amazing! StackOverflow anyone?
R is free!
R is released under the GNU General Public License and is essentially free to use.
R packages
There are thousands of useful R libraries (packages) out there. If you need a data or statistical analysis problem solved, chances are there is a package that does it for you!
R is powerful
R is constantly being expanded and becomes increasingly powerful. It supports very large data sets and interacts with a variety of database systems, Spark clusters and more.
The Tidyverse
R guru Hadley Wickham has been a driving force in developing a ‘universe’ of code libraries that are aimed at making data analysis easier. I am well aware that there are supporters and detractors with respect to the Tidyverse, but I am squarely in the supporter camp!
Why or when not to use R?
In the end, R is a tool with strengths and weaknesses. R can’t do everything, and sometimes other tools or computer languages should be used instead. The learning curve to master R is a little steep. A useful text about this is The R Inferno. Another good text pointing out the difficulties in learning R can be found here.
Once you arrive at that question, a whole new universe opens itself. Many sources of information are available that allow you to directly look at the data. The KNMI (Dutch bureau of meteorology) becomes your friend, as it has a website that provides access to essentially all climatic data and climate model output. Many universities publish raw data on their websites. You find that many scientists and bloggers alike use such websites to get their data. What happens when you have the data? Things get harder! Climatic data is often multidimensional, discouragingly lengthy and complex. Like so many, I started by dumping the data into Excel, as this is the data processing tool that most of us know. It is tempting to use Excel because we use it so often. Bad idea!
The problem with Excel, when handling large amounts of data, is that Excel puts the data in the center. The processing of the data and turning it into information is hard, slow and inconvenient in Excel. And this is where programming or scripting languages come in. Scripting languages such as R put the data processing algorithm in the center. This turns out to be nifty: if more data becomes available, you only have to obtain the new data (e.g. download it from the KNMI climate explorer) and run your script. And voila, updated information appears at the press of a button.
R is an open source statistical programming environment. That means it is free. If you ask the R guys what R is, you get this:
R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes
an effective data handling and storage facility,
a suite of operators for calculations on arrays, in particular matrices,
a large, coherent, integrated collection of intermediate tools for data analysis,
graphical facilities for data analysis and display either on-screen or on hard copy, and
a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
Well, I have some thoughts about the ‘simple’! After some time (it can be a while, the R learning curve can be steep) you start to understand what you can really do in R.
I use R at work; being in quality means regularly working with real-world data.
Recently I read a blog post by Willis Eschenbach about snow cover in the Northern Hemisphere. Willis is a smart guy who blogs regularly at WUWT. In his blog post he speaks about the impact of albedo on climate and, as he usually does, he says a couple of smart things about it. Read the post yourself if you are interested. He presents an analysis of northern hemisphere snow cover along with a chart, which I recreated at home using R. He was kind enough to link to the R code he used for his analysis. The data comes from Rutgers University, which has a department that tracks snow cover.
Figure 1: Northern Hemisphere snow cover analysis
This looks like a complex image with a substantial amount of data processing, and it is! The original data is shown in the top panel, a trend in the second panel, a seasonal analysis (average) in the third panel and finally the remainder between the observed data and the seasonal component in the fourth panel. In R you can do this in fewer than 10 lines of code! Try doing that with Excel!
The R code is here; let me talk you through it.
# data source http://climate.rutgers.edu/snowcover/table_area.php?ui_set=0&ui_sort=0
options(digits=4)
rsnow=read.csv("rutgers snow 2013.csv")
A reference to the source of the data is provided as a comment (# means comment). The ‘options’ command determines how many significant digits I want and the read.csv function looks for the csv file “rutgers snow 2013.csv” in my working directory. It reads the data into a variable called “rsnow”.
The ts function creates a time series from the date (newts) and allows us to apply an analysis relevant to time series. The function looks at data onwards from 1966 with a weekly resolution (frequency = 52). The window function selects the relevant time period in order to avoid data with gaps. And the third line checks for empty data records.
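The lines doing this are not reproduced in the post, so here is a sketch of what they plausibly look like; the synthetic data, the column layout and the window start year are my assumptions, not the original script.

```r
# Sketch of the time-series steps described above. Synthetic weekly data
# stand in for the Rutgers snow file read earlier via read.csv.
snow  <- runif(52 * 10, min = 5, max = 50)        # ten years of fake weekly cover
newts <- ts(snow, start = 1966, frequency = 52)   # weekly series from 1966 onwards
newts <- window(newts, start = 1968)              # keep only the gap-free period
sum(is.na(newts))                                 # check for empty data records
```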
The decompose function is an excellent example of the powerful high-level analysis functions in R. It decomposes the time series into a seasonal (average) time series, a residual and a trend. The object “mydecompose” is then plotted and a title of the plot is defined.
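As with the previous step, here is a sketch of what the decomposition plausibly looks like; a seasonal sine wave plus noise is my stand-in for the snow series, not the original data.

```r
# Sketch of the decomposition step: a weekly series with a built-in
# seasonal cycle is split into trend, seasonal and random components.
x <- ts(sin(2 * pi * (1:520) / 52) + rnorm(520, sd = 0.1),
        start = 1966, frequency = 52)
mydecompose <- decompose(x)   # seasonal (average), trend and random parts
plot(mydecompose)             # the four-panel figure discussed above
```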
R is a pretty cool environment if you are interested in data and data analysis. It certainly takes a while to learn but it may be a good skill to have.
When you read the news from the IPCC about the latest climate predictions it is hard to not be impressed by the confidence that the climate scientists have in their models. They make statements such as:
It is extremely likely that more than half of the observed increase in global average surface temperature from 1951 to 2010 was caused by the anthropogenic increase in greenhouse gas concentrations and other anthropogenic forcings together. The best estimate of the human induced contribution to warming is similar to the observed warming over this period.
And extremely likely in this case means 95% certainty. But I ask myself, what does this mean?
The IPCC has just published its fifth Summary for Policymakers (SPM), and a couple of weeks ago the report from the so-called Working Group 1 (WG1, which looks at the physics of climate change) was leaked. One would expect these documents to contain an explanation of how this level of confidence was determined, and that there would be a section in the report providing insight into the calculation of these confidence intervals. But you would be disappointed.
So what is the meaning of a confidence interval in statistics? Well, your gut feeling would tell you it must be something like this: suppose I am interested in the average weight of men older than 40 years in my suburb. I would go out and collect data by weighing some of my neighbours and calculate their average weight. Some of my neighbours may think this is a bit strange, so I may only get ten of them to agree. But in the end I have calculated an average weight of 75 kg, and I hope that this tells me something about the weight of all men in my area. Of course, I did not measure all the men in my area, and because of that I am not certain of the average. If I had measured all of them, I would be absolutely certain. So I need to make an estimate of my confidence. Suppose I have calculated a 95% confidence interval around this average and it turns out to be 71 to 78 kg. Now what does that mean?
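In R this calculation is a one-liner with `t.test()`; here is a sketch with ten invented weights (the interval comes out narrower than the 71 to 78 kg in the text because the data are made up).

```r
# Ten hypothetical weights with mean 75 kg; t.test() returns a 95%
# confidence interval for the mean. The data are invented for illustration.
weights <- c(71, 73, 74, 74, 75, 75, 76, 76, 77, 79)
mean(weights)                    # 75
ci <- t.test(weights)$conf.int   # 95% confidence interval for the mean
round(ci, 1)
```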
Well, it depends on which statistical theory you believe in. If you are a ‘frequentist’, a confidence interval means something like this: if you repeated your experiment many times (infinitely many, which is pretty hard to do) and calculated a confidence interval each time, then 95% of those intervals would contain the real and true average weight of the men in my area. This says nothing about the one experiment that I just did. Not very helpful, is it?
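This ‘repeat the experiment’ idea is easy to check by simulation; a sketch, assuming normally distributed weights around a true mean of 75 kg:

```r
# Frequentist coverage check: draw many samples of ten 'neighbours', build
# a 95% interval each time, and count how often it captures the true mean.
set.seed(42)
true_mean <- 75
covered <- replicate(10000, {
  s  <- rnorm(10, mean = true_mean, sd = 3)  # one simulated experiment
  ci <- t.test(s)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})
mean(covered)   # close to 0.95
```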
On the other hand, if you are a Bayesian statistician, the conclusion is a bit closer to home. These guys call this interval a credible interval, and it means what we really want: given my data, the chance that the true average is between 71 kg and 78 kg is 95%. This makes sense, but unfortunately the math is now a lot harder!
But for both interpretations, the confidence interval tells me something really important: I can never be totally confident about the average weight of the men in my area until I have measured them all. In other words, my confidence comes from the data that I collected!
But things get really tricky when it comes to climate. Unlike other disciplines, climate scientists cannot easily do experiments with the climate. They have to rely on computer models instead to make predictions. That means there is no observational data to work with and to calculate averages from. So how do they do it?
The most common way goes like this: you get a bunch of very smart physicists and let them make a computer program that simulates the climate of the planet. This program will predict the average temperature of the planet. You let them run the computer models many times, and every time you run it, you change some parameter of the model. For instance, you let it run with slightly different starting conditions. Or you let it run with some physical parameters set to a different value. As a result, every model run will be a little different. Then you ask your colleagues to make their own versions of the models. And you let them run those many times with different parameters as well. If you are the IPCC, you do this for more than 100 models. In the end, you pile all those climate model runs on top of each other and you see a kind of envelope emerging. These are called ‘ensemble runs’. The average of all runs then becomes the ‘most likely’ value of your prediction. (In reality this is much more complex; e.g. the IPCC runs so-called scenarios for how they think carbon dioxide emissions will develop in the future.) In this way, climate scientists work out how much variation is to be expected in their predictions.
Because of the physics that goes into the models, some of which is not at all well understood at the fundamental level, climate scientists claim they are able to separate natural variation (what the climate would do if we were not there) from man’s influence. They model, and publish in the reports, graphs that show how the world would have warmed (or not) with and without man’s carbon dioxide emissions. In my humble view, there are big problems in this area. For example, the influence of clouds and aerosols (soot and dust) is not well understood. Climate sensitivity (how much the planet will warm if CO2 doubles) is another area of heated debate.
We are now getting closer to what we want. But first ask yourself, is this way of coming up with an estimate of variation a good way? It may be the only way, but suppose your model has a problem, for instance it does not get the clouds right? And as most models are based on the same basic physics, it may well be that they all get the clouds wrong. That would mean that all your runs are suddenly suspect. What does that do to your estimate of variation?
But we have to move on! Now, what about the data? Yes, this is where the data comes in. There are 4 main data sets available for global temperatures and they are all very close. I now have my model runs and my data so I can start comparing. How does one do this? The math is somewhat involved but you could do a statistical test and ask yourself if the model predictions and the observations are the same. Actually, the temperature trends of the models and observations are compared because the trend is not so sensitive to year on year variation. And the trend is actually what we are interested in. For such a test you have to define the confidence level and it is customary to choose 95%.
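A sketch of such a trend comparison, using an invented annual temperature series in place of the real observations; the trend value and noise level are assumptions for illustration only.

```r
# Fit a straight line to a synthetic 1979-2013 temperature series and get
# a 95% confidence interval for the observed trend (degrees per year).
set.seed(1)
year <- 1979:2013
temp <- 0.015 * (year - 1979) + rnorm(length(year), sd = 0.1)  # invented data
fit  <- lm(temp ~ year)
confint(fit, "year", level = 0.95)
# A model-predicted trend lying above this interval would be inconsistent
# with these 'observations' at the 95% level.
```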
Independent statisticians do such work (e.g. Lucia Liljegren or Steve McIntyre) and what they find is remarkable: the average of all those climate model runs is not consistent with observations with 95% confidence! In fact, they find that most models predict temperature trends that are much higher than what we observe. See the figure below.
Figure 1: On the horizontal axis we see climate model runs and on the vertical axis we see the temperature trend. The red line is the average trend between 1979-2013 from HadCrut4 data (image source).
The box plots show the computer model ensemble runs. Note that the observed trend is well below the vast majority of the climate model predictions for the period mentioned!
To return to the IPCC statement, they are 95% confident that at least half of the warming between 1951 and 2010 was caused by humans. That is a lot of confidence given the challenges that are still out there.
What should we make of this? Of course, the IPCC speaks about the period from 1951 to 2010. And the data is clear: there has been warming in this period, probably about 0.8 of a degree. But when we look at the last 15 years, it becomes clear that the tool of choice of the climatologists, computer simulations, makes predictions that are well above what is observed. If we follow the scientific method, we are getting ourselves into serious problems. The scientific method requires us to test, through experiment or observation, whether the real world behaves as predicted by our theory. If our tests agree with the theory, our confidence increases; otherwise it decreases.
Now ask yourself again, how certain are you about this 95%?
Paul
I am not a climate scientist, but an engineer who started to read up on climate change a couple of years ago after a friend started to ask some uncomfortable questions about the consensus view. Disturbingly, I found that there is a lot that cannot be explained by the consensus. My views on climate change can be characterized as follows: the basic physics of radiative transfer of heat through the atmosphere is correct, and more CO2 will warm the planet, all else being the same. Where I disagree with the consensus is in the claims that this warming will be catastrophic (as far as we can tell today) and that it warrants draconian measures on the world’s economy. As Australian film maker Topher Field says about climate change mitigation measures in his 50 to 1 project: if they are expensive enough to be effective, they are unaffordable; if they are affordable, they are not going to be effective.
Depending on the format of your data, it is pretty easy to do a simple linear regression in R. To do that, we first need to get some data. In this case I use a simple data file that contains measurements of the boiling point of water as a function of height. The data is available on the internet, but I downloaded it to my hard disk. The first step is to read the file into a data frame:
Alps<-read.table("alps-R (1).dat", header=TRUE)
I then need to attach the data so that I can refer to the columns as ‘BPt’ and ‘Pressure’ instead of the longer ‘Alps$BPt’ and ‘Alps$Pressure’.
attach(Alps)
The result is that the commands ‘Pressure’ and ‘Alps$Pressure’ now give the same output:
> Pressure
[1] 20.79 20.79 22.40 22.67 23.15 23.35 23.89 23.99 24.02 24.01 25.14 26.57
[13] 28.49 27.76 29.04 29.88 30.06
To visualise the data I can say:
plot(Alps)
And I get, as expected, this standard plot:
Now I would like to make a linear model in R. This turns out to be fairly easy once you understand the R notation. I need to tell R the following:
Alps.lm=glm(Pressure ~ BPt)
Alps.lm creates an object that will contain everything I need for my linear model. I call the standard R function ‘glm’, which creates the linear model with Pressure as the dependent variable and BPt as the independent variable. The operator ~ tells R which variable is dependent and which is independent.
I can now ask R for a summary of this generalised linear model:
summary(Alps.lm)
And I get a lot:
Call:
glm(formula = Pressure ~ BPt)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.25717 -0.11246 -0.05102 0.14283 0.64994
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -81.06373 2.05182 -39.51 <2e-16 ***
BPt 0.52289 0.01011 51.74 <2e-16 ***
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
(Dispersion parameter for gaussian family taken to be 0.05420953)
Null deviance: 145.93778 on 16 degrees of freedom
Residual deviance: 0.81314 on 15 degrees of freedom
AIC: 2.5629
Number of Fisher Scoring iterations:
Without going into the details, the summary tells me the slope of the line I am trying to fit (0.52) and also the value where the line intercepts the Y-axis (-81.06). So I can now try to plot this line. R provides a very easy way to do this:
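The same numbers can also be pulled out of the model programmatically with `coef()` instead of reading them off the printout; a sketch, using synthetic data that roughly follows the fitted relationship (the real values come from the Alps file):

```r
# Extract slope and intercept with coef(). Synthetic data following the
# fitted line stand in for the original Alps data set.
set.seed(7)
BPt      <- seq(190, 212, length.out = 17)              # boiling points
Pressure <- -81.06 + 0.523 * BPt + rnorm(17, sd = 0.2)  # invented pressures
fit <- glm(Pressure ~ BPt)
coef(fit)["(Intercept)"]   # close to -81
coef(fit)["BPt"]           # close to 0.52
```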
abline(Alps.lm)
And this simple command will add the regression line to my graph:
At this point I have achieved all that I wanted: I have fitted a linear model to my data, I have learned the slope and the intercept of that line, and I have plotted the line onto my graph. All of that with these 5 lines of code:
Alps<-read.table("alps-R (1).dat", header=TRUE)
attach(Alps)
Alps.lm=glm(Pressure ~ BPt)
plot(Alps)
abline(Alps.lm)
I would like to see you do that with Excel in just 5 steps!
plot(sunspots, xlab = "years", ylab = "sunspot count", main = "Sunspot numbers")
Now I have titles with just a simple command.
In the next step I would like to change the plot from a line plot to a dot plot. I do that by adding the type = "p" instruction to the plot command:
plot(sunspots, type = "p", xlab = "years", ylab = "sunspot count", main = "Sunspot numbers")
Easy enough. But now I would like to add a line to the points that represents the average of the data. Now I need two commands:
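The two commands have gone missing from the original post; a plausible reconstruction (my guess at them, not the original code) would be:

```r
# Hypothetical reconstruction: replot the points, then draw a horizontal
# line at the average of the built-in sunspots series.
plot(sunspots, type = "p", xlab = "years",
     ylab = "sunspot count", main = "Sunspot numbers")
abline(h = mean(sunspots), col = "red")   # average sunspot count
```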
Having read some of the work of professor Tufte (e.g. his amazing book “The Visual Display of Quantitative Information“) I am aware of the many other software solutions out there to tackle this problem: Minitab, Matlab, S and R to name just a few. The attractive aspect of R is that it is available as open source software and free. And it is amazingly powerful! You can find several good examples of the graphing capabilities of R here.
So where does one start? First of all, get R onto your computer. That can be achieved by going to the R website, where you can download the R package. This is the easy part! As you will discover with me, R has a rather steep learning curve.
Once R is on your system and you have started the software it looks something like this:
One of the fun things about R is that it contains many ready-made data sets. One of them is ‘sunspots’, and by just typing:
> plot(sunspots)
You get this result:
This is a good example of how powerful R can be: with just one simple command you can plot a complex chart! In a future blog I will give some more examples.
This is my weblog and it is a very new experience for me. The title and sub-title demand some explanation: Uppyn is the name of my viking ancestors, and it is an inspiring thing to be able to refer to one’s ancestry in the title of this blog. The sub-title is pretty obvious: the world through Dutch eyes in Sydney. This blog will therefore be all about how I perceive the world while I live in Sydney. My observations, some serious, some funny (you be the judge of that), some stimulating, will be presented here.
I start this blog as well to find easier ways to communicate with friends, family and anybody who would be interested. A few topics hold my profound interest,
And I will blog about these topics more than about others. I also hope that this blog can become a place where I can show the strangeness of the world and share it with you.
Finally, as a non-native speaker, I hope that this blog will help me improve my English language skills as well as my ability to commit my thoughts to paper (in an internet kind of way).
This blog is a kind of experiment and I thought that the following quote from Ralph Waldo Emerson is very fitting:
All life is an experiment. The more experiments you make the better.
Enjoyable reading!
Paul