Tim Berners-Lee’s missing star
Most of you Open Data enthusiasts out there will be familiar with Tim Berners-Lee’s five star system, a no-nonsense rating system for the usefulness and utility of an openly released data set:
| Rating | Awarded for |
| --- | --- |
| ★ | 1 star for releasing data at all (even a PDF of scanned paper) |
| ★★ | 2 stars for releasing it in structured, machine-readable formats (e.g. an Excel file) |
| ★★★ | 3 stars for releasing it using non-proprietary file formats (e.g. a CSV file) |
| ★★★★ | 4 stars for releasing it as linked open data |
| ★★★★★ | 5 stars for linking the data to other linked data sources |
For those not up to speed, here’s Sir Tim explaining in a short video (first 2 minutes will do it for this purpose):
As stated in Matt’s earlier post in praise of CSV, we firmly believe that the biggest bang for the buck comes from reaching 3 stars fast and then aiming for the fourth and the fifth star as a part of your organization’s long-term data platform strategy.
However, there is a missing star in Tim’s grading system. Releasing your data in CSV or another structured, machine-readable, non-proprietary format is certainly worthy of three stars, but if you are releasing dozens, hundreds or even thousands of data sets, you should also aim to do so in a consistent, well-documented manner across all your data sets.
Why? Because a developer or a data scientist hacking away at your data should not have to determine the structure of each data set individually. They’ll want to be able to write a generic piece of code that slurps up any (or all) of your data sets in the same way. If you have 100 different data sets, structured in 100 slightly different ways, it will take them almost 100 times longer to make use of all your valuable data.
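As a rough sketch of what that generic piece of code might look like, assume every data set is CSV text with a header row and the same layout. The column names `date` and `value`, the data set names and the sample values are all made up for illustration.

```python
import csv
import io

def load_dataset(text):
    """Parse one CSV data set into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

# Two hypothetical data sets that share one layout: a header row, then data rows.
DATASETS = {
    "population": "date,value\n2010,100\n2011,110\n",
    "unemployment": "date,value\n2010,7\n2011,8\n",
}

# Because the layout is consistent, the same function handles every data set.
loaded = {name: load_dataset(text) for name, text in DATASETS.items()}

for name, rows in loaded.items():
    print(name, len(rows), "rows, columns:", list(rows[0].keys()))
```

If each data set had its own layout, `load_dataset` would need a special case per file, which is exactly the extra work described above.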
The same goes for the discoverability of the available data: provide proper, machine-readable directories. And it goes for associations with meta-data, whether the meta-data lives in the data file itself or in separate files with a clear association (see Matt’s post for details).
Oh, and you want to avoid preparing the files by hand, even for final touch-ups. It will lead to mistakes. If you do prepare them by hand, make sure you write tests that check for your consistent structure and other possible errors before publishing a data set.
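A minimal sketch of such a pre-publication test, assuming each data set is CSV text that must start with an agreed header and carry the same number of fields on every row. The expected header and the function name `check_dataset` are illustrative assumptions, not part of any standard.

```python
import csv
import io

EXPECTED_HEADER = ["date", "value"]  # assumed house layout, for illustration only

def check_dataset(text, expected_header=EXPECTED_HEADER):
    """Return a list of human-readable problems; an empty list means the file passed."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    problems = []
    if rows[0] != expected_header:
        problems.append(f"unexpected header: {rows[0]}")
    # Row 1 is the header, so data rows are numbered from 2.
    for row_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(expected_header):
            problems.append(
                f"row {row_no}: expected {len(expected_header)} fields, got {len(row)}"
            )
        elif any(cell.strip() == "" for cell in row):
            problems.append(f"row {row_no}: empty cell")
    return problems

good = "date,value\n2010,100\n2011,110\n"
bad = "date,value\n2010,100\n2011\n2012,,\n"  # one short row, one long row

print(check_dataset(good))
print(check_dataset(bad))
```

Running a check like this over every file before release catches the kind of slip that a manual touch-up tends to introduce.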
So, that said, here’s our revised version of Tim Berners-Lee’s 5 star system:
| Rating | Awarded for |
| --- | --- |
| ★ | 1 star for releasing data at all (even a PDF of scanned paper) |
| ★★ | 2 stars for releasing it in structured, machine-readable formats (e.g. an Excel file) |
| ★★★ | 3 stars for releasing it using non-proprietary file formats (e.g. a CSV file) |
| ★★★½ | 3.5 stars for using a consistent format, discoverability methods and meta-data associations across all your data sets |
| ★★★★ | 4 stars for releasing it as linked open data |
| ★★★★★ | 5 stars for linking the data to other linked data sources |
In your early open data initiatives, aim for at least 3.5 stars!



I think there are a few more things that could make it into the 3.5th star: data providers work with vintages and publish every vintage in a new directory, so you can’t automatically find the latest content. I would also argue that CSVs with time across columns are bad practice, especially if the rightmost columns are footnotes, not to mention the unstandardized way of placing table/metric/series/entity/data point annotations, which makes things unnecessarily hard.
Stars 4 and 5 will gradually evolve, but I think the big step is getting from OPEN to USEABLE OPEN data.
jurgen
May 25, 2012 at 5:50 pm
I like CSV, and it is my main storage mechanism for data sets, for the reasons stated before. But should that be such a goal for this list? To me, open source tends to be about being accessible and understandable to many users.
Often, I see novice users more bewildered by CSV data than by Excel. While CSV is better, I think the emphasis should be on user understanding. I rarely see formats like Excel being huge barriers. Besides, apart from CSV, merely stating “non-proprietary” can lead to something much worse than Excel. To wit, XML-structured data can be absolute hell for a novice or even an intermediate user. Even in R, it can be a lot of work to get an XML document imported. Given the choice, Excel is a much better option.
To me, it makes more sense for Excel and CSV (i.e., machine-readable formats) to be 2 stars and for consistent format/discoverability (3.5 stars) to be placed at 3 stars. That said, “consistent” needs to be defined. It’s very easy to say data needs to be consistent, but that can be a deep topic.
Tom Schenk Jr. (@tomschenkjr)
May 25, 2012 at 6:59 pm
I can agree with this, up to a point. TBL’s rationale here is probably that when the GOVERNMENT is publishing data, it should do so in formats that are non-proprietary to ensure that there are absolutely no discriminating issues or “favoritism” involved. And I agree with him.
On the other hand, I will agree with you that for most people, Excel files are more approachable than CSV files for human consumption. However, I’d add that there are other, even more approachable methods for that, e.g. using proper, web-based data publishing solutions such as (tooting my own horn) DataMarket’s Enterprise solution.
Hjalmar Gislason
May 25, 2012 at 7:31 pm
Good point! So, how about making associated meta-data 4/5 star rated before going beyond 3 star for the many data sets?
Kerstin Forsberg
May 25, 2012 at 8:07 pm