Is there such a thing as the perfect data format? No, of course not, but does anything come close? Yes. Trusty old comma-separated values, or CSV.
CSV gets a lot of flak and I think it’s due a little TLC. It doesn’t excite anyone, it’s unfashionable, and it’s old technology — these are all good things for a data format, where you don’t want fast-changing fads to get in the way of data communication. Yes, it has its blemishes, but who doesn’t? It’s an excellent fit for statistical data, so do away with the trouble of finding that perfect format and demand CSV for six supremely practical reasons:
- CSV isn’t proprietary. CSV has existed for decades and no-one owns the format. You needn’t worry about paying to use it or buying proprietary software to open and save it. Every spreadsheet application supports it and since CSV is open and unchanging, every spreadsheet application will continue to support it for a long time.
- Excel supports CSV. Whether we like it or not, much of the data that comes from governments, statistics agencies, and companies is stored in Excel spreadsheets, and while these are theoretically machine-readable they tend towards an ambiguity and complexity that’s difficult for computer programs to understand. The older and more widely-used Excel formats are proprietary (newer versions aim to change that but haven’t been entirely successful) and contain bugs; macros and formulas abound; pie charts are embedded all over the place; and the data hierarchies created by its users (I include myself here) can often be ambiguous and hard for a computer program to comprehend. Many of these problems are solved by saving a spreadsheet to CSV, and either you or your source can convert an Excel spreadsheet to a CSV with a few clicks of a mouse button.
- CSV and non-technical people are friends. You’re not likely to be able to demand that data is provided in a particular format, and you’re even less likely to be able to demand that wonderful format you’ve invented. You’ll be lucky to get Excel documents. So asking for CSV is a good bet and risk-free. People can understand it and non-technical staff can make it for you.
- CSV is tabular data. If you want to keep the data permanently, or if you’re going to do any serious data manipulation, you’re almost certainly going to put it in a relational database. CSV is very well suited for this because its structure is identical to a database table. It won’t be in third normal form, but it will be easy to convert it into third normal form, and it’s easy to (programmatically) pivot if you need to.
- CSV is incredibly easy to parse. CSV is unusual in that no formal specification exists for the format but that doesn’t mean you’ll have difficulty parsing it with a computer program. The closest thing to a spec is RFC 4180; its definition of the format runs to seven bullet points and just over 300 words. And you’ll be hard-pressed to find a programming language that doesn’t come with a CSV parser built in.
- Tim Berners-Lee likes it. “Save the best for last”, as the saying goes, and this one’s a corker. Tim Berners-Lee, the man who invented the Web, has a five-star system for open data, and using CSV immediately gets you three stars: by making your data “available as machine-readable structured data […] plus non-proprietary format (e.g. CSV instead of Excel)”. Getting the fourth and fifth stars is more difficult (it involves a lot more theoretical heavy-lifting) but getting three stars from Tim Berners-Lee can only be a good thing.
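Two of the claims above, that CSV parses trivially and that its structure maps directly onto a database table, can be sketched in a few lines of Python using only the standard library. The file contents and column names here are invented for illustration:

```python
import csv
import io
import sqlite3

# A small CSV document, standing in for a downloaded file (made-up data).
raw = "country,year,population\nIceland,2012,320000\nDenmark,2012,5580000\n"

rows = list(csv.reader(io.StringIO(raw)))
header, data = rows[0], rows[1:]

# Because CSV is already tabular, mapping it onto a database table is direct.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stats (country TEXT, year INTEGER, population INTEGER)")
conn.executemany("INSERT INTO stats VALUES (?, ?, ?)", data)

total = conn.execute("SELECT SUM(population) FROM stats").fetchone()[0]
print(total)  # 5900000
```

The `csv` module handles the RFC 4180 quoting rules for you, so even fields containing commas or newlines come through intact.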
CSV isn’t perfect, and the most obvious downsides are its lack of support for metadata and character-encoding. If you want metadata for your CSV you’ll either need to store it elsewhere — probably on a publicly-accessible server — or squeeze it into the data file itself in an ugly fashion.
The first idea is great if done correctly. To paraphrase Tim Berners-Lee, if you generate a small, separate metadata file for each data file, the results can be harvested and, like the data itself, distributed as linked data. Any open dataset can be registered at thedatahub.org, data.gov.uk, and data.gov, among others.
But what’s more likely is that the metadata will be dumped at either the beginning or the end of the CSV file as if it were a second embedded set of CSV keys and values, and it will cause you some minor trouble.
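One way to cope with that trouble, assuming you know in advance what the real header row looks like, is simply to discard everything before it. A rough sketch (the file contents are invented):

```python
import csv
import io

# A CSV with metadata rows dumped above the real header (made-up content).
raw = (
    "source,Statistics Office\n"
    "retrieved,2012-04-05\n"
    "\n"
    "country,year,population\n"
    "Iceland,2012,320000\n"
)

def read_skipping_preamble(text, expected_header):
    """Discard every row until the expected header appears, then return the rest."""
    rows = csv.reader(io.StringIO(text))
    for row in rows:
        if row == expected_header:
            break
    return list(rows)  # the reader resumes on the row after the header

data = read_skipping_preamble(raw, ["country", "year", "population"])
print(data)  # [['Iceland', '2012', '320000']]
```

Metadata dumped at the end of the file needs the mirror-image treatment: stop reading once a row no longer matches the expected column count.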
There’s also the perennial problem of character-encoding. A CSV file has no in-built way to describe what character-encoding it uses, so you’re out of luck unless it’s been downloaded from a server that sends a Content-Type header — and even that shouldn’t be trusted. Instead, resign yourself to asking for a particular character-encoding and cushioning yourself with a heuristic.
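That heuristic can be as simple as trying a short list of likely encodings in order. A minimal sketch using only the standard library; the candidate list is an assumption you would tune for your own sources:

```python
def guess_decode(raw_bytes, candidates=("utf-8", "windows-1252", "latin-1")):
    """Try likely encodings in order; latin-1 never fails, so it acts as a last resort."""
    for encoding in candidates:
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings worked")

# "café" encoded as Windows-1252: the lone 0xE9 byte is invalid UTF-8,
# so the heuristic falls through to the second candidate.
text, used = guess_decode(b"caf\xe9")
print(text, used)  # café windows-1252
```

Note the ordering matters: latin-1 accepts any byte sequence, so it must come last or it will silently swallow everything.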
But don’t let those two minor issues put you off: as Winston Churchill was once overheard saying, CSV really is the least worst data format. It provides a format that is both programmatically easy to read and simple for non-technical people to manage. It might not be perfect but it comes as close as is practically possible.
These are the slides from the presentation by our founder and CEO, Hjalmar Gislason, at the Boston Data Visualization Meetup on April 5, 2012.
We’re frequently asked: What is the best tool to visualize data?
There is obviously no single answer to that question. It depends on the task at hand, and what you want to achieve.
Here’s an attempt to categorize these tasks and point to some of the tools we’ve found to be useful to complete them:
The right tool for the task
Simple one-off charts
The most common tool for simple charting is clearly Excel. It is possible to make near-perfect charts of most chart types using Excel – if you know what you’re doing. Many Excel defaults are sub-optimal, and some of the chart types it offers are simply for show and have no practical application. 3D cone-shaped “bars”, anyone? And Excel makes no attempt at guiding a novice user to the best chart for what she wants to achieve. Here are three alternatives we’ve found useful:
- Tableau is fast becoming the number one tool for many data visualization professionals. It’s client software (Windows only) that’s available for $999 and gives you a user-friendly way to create well-crafted visualizations on top of data that can be imported from all of the most common data file formats. Common charting in Tableau is straightforward, while some of the more advanced functionality may be less so. Then again, Tableau enables you to create pretty elaborate interactive data applications that can be published online and work in all common browsers, including those on tablets and mobile handsets. If you’re a non-programmer who sees data visualization as an important part of your job, Tableau is probably the tool for you.
- Tableau’s visual gallery is a great way to see what the program is capable of.
- DataGraph is a little-known tool that deserves a lot more attention. A very different beast, DataGraph is a Mac-only application ($90 on the App Store) originally designed to create proper charts for scientific publications, but it has become a powerful tool for creating a wide variety of charts for any occasion. Nothing we’ve tested comes close to DataGraph when creating crystal-clear, beautiful charts that are also done “right” as far as most of the information visualization literature is concerned. The workflow and interface may take a while to get to grips with, and some of the more advanced functionality may lie hidden even from an avid user for months, but a wide range of samples, aggressive development and an active user community make DataGraph a really interesting solution for professional charting. If you are looking for a tool to create beautiful, yet easy to understand, static charts, DataGraph may be your tool of choice. And if your medium is print, DataGraph outshines any other application on the market.
- The best way to see samples of DataGraph’s capabilities is to download the free trial and browse the samples/templates on the application’s startup screen.
- R is an open-source programming environment for statistical computing and graphics. A super powerful tool, R takes some programming skills to even get started, but it is becoming a standard tool for any self-respecting “data scientist”. An interpreted, command-line-controlled environment, R does a lot more than graphics: it enables all sorts of number crunching and statistical computing, even with enormous data sets. In fact, we’d say that graphics are a bit of a weak spot of R. There is little to complain about from an information visualization standpoint, but most of the charts R creates would not be considered refined, and they therefore need polishing in other software, such as Adobe Illustrator, before they are ready for publication. Not to be missed if you are working with R is the ggplot2 package, which helps overcome some of the thornier aspects of making R’s charts and graphs look proper. If you can program and need a powerful tool for graphical analysis, R is your tool, but be prepared to spend significant time making your output look good enough for publication, either in R or by exporting the graphics to another piece of software for touch-up.
- The R Graphical Manual holds an enormous collection of browsable samples of graphics created using R – and the code and data used to make a lot of them.
Videos and custom high-resolution graphics
If you are creating data visualization videos or high-resolution data graphics, Processing is your tool. Processing is an open source integrated development environment (IDE) that uses a simplified version of Java as its programming language and is especially geared towards developing visual applications.
Processing is great for rapid development of custom data visualization applications that can either be run directly from the IDE, compiled into stand-alone applications, or published as Java applets on the web.
The area where we have found that Processing really shines as a data visualization tool, is in creating videos. It comes with a video class called MovieMaker that allows you to compose videos programmatically, frame-by-frame. Each frame may well require some serious crunching and take a long time to calculate before it is appended to a growing video file. The results can be quite stunning. Many of the best known data visualization videos are made using this method, including:
- Aaron Koblin’s Flight Patterns
- Jer Thorp’s Kepler Exoplanet Candidates
- DataMarket’s own Earthquakes and Eruptions
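The frame-by-frame pattern itself is language-independent. As a toy illustration of the same idea outside Processing, here is a Python sketch that fully computes each frame before appending it to the growing sequence; plain-text PPM images stand in for video frames, and a real pipeline would hand them to an encoder:

```python
# Each frame may involve heavy computation; only once it is complete is it
# appended to the growing sequence, mirroring MovieMaker's frame-by-frame model.
WIDTH, HEIGHT, FRAMES = 8, 8, 3

def render_frame(t):
    """Compute one frame: a white vertical bar sweeping left to right (toy example)."""
    bar_x = t % WIDTH
    pixels = [
        (255, 255, 255) if x == bar_x else (0, 0, 0)
        for y in range(HEIGHT)
        for x in range(WIDTH)
    ]
    header = f"P3 {WIDTH} {HEIGHT} 255\n"
    body = "\n".join(f"{r} {g} {b}" for r, g, b in pixels)
    return header + body + "\n"

frames = []
for t in range(FRAMES):
    frames.append(render_frame(t))  # append only after the frame is fully rendered

print(len(frames))  # 3
```

The point is the structure, not the toy graphics: however long a single frame takes to compute, the output file only ever grows by whole, finished frames.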
Many other great examples showing the power of Processing – and for a lot more than just videos – can be found in Processing.org’s Exhibition Archives.
As can be seen from these examples, Processing is obviously also great for rendering static, high-resolution bitmap visualizations.
So if data-driven videos or high-resolution graphics are your thing, and you’re not afraid of programming, we recommend Processing.
Charts for the Web
There are plenty – dozens, if not hundreds – of programming libraries that allow you to add charts to your web sites. Frankly, most of them are sh*t. Some of the more flashy ones use Flash or even Silverlight for their graphics, and there are strong reasons for not depending on browser plugins for delivering your graphics.
We believe we have tested most of the libraries out there, and there are only two we feel comfortable recommending; each has its pros and cons depending on what you are looking for:
- Take a look at the HighCharts Demo gallery for an idea of HighChart’s capabilities (and limitations)
- You will find samples of gRaphaël charts on the project’s home page
Special Requirements and Custom Visualizations
If you want full control of the look, feel and interactivity of your charts, or if you want to create a custom data visualization for the web from scratch, the out-of-the box libraries mentioned above will not suffice.
In fact, you’ll be surprised how soon you run into limitations that will force you to compromise on your design. Seemingly simple preferences such as “I don’t want drop shadows on the lines in my line chart” or “I want to control what happens when a user clicks the X-axis” may already be stretching your chosen library. But consider yourself warned: the compromises may well be worth it. You may not have the time and resources to dive deeper, let alone write yet-another-charting-tool™.
However, if you are not one to compromise on your standards, or if you want to take it up a notch and follow the lead of some of the wonderful and engaging data journalism happening at the likes of the NY Times and The Guardian, you’re looking for something that a charting library is simply not designed to do.
The tool for you will probably be one of the following:
- Take a look at the demos on the Raphaël project page for an idea of its capabilities
- Protovis was originally written by Mike Bostock (now data scientist at Square) and Jeffrey Heer of the Stanford Visualization Group. Their architectural approach is ingenious, but it also takes a bit of an effort to wrap your head around, so be prepared for somewhat of a learning curve. Luckily there are plenty of complete and well-written examples and decent documentation. Once you get going, you will be amazed at the flexibility and power that the Protovis approach provides.
- The wide range of examples available on the project’s web site will certainly testify to this flexibility.
- D3.js, or “D3” for short, is in many ways the successor of Protovis. In fact, Protovis is no longer under active development by the original team because its primary developer, Mike Bostock, is now working on D3 instead.
D3 builds on many of the concepts of Protovis. The main difference is that instead of having an intermediate representation that separates the rendering of the SVG (or HTML) from the programming interface, D3 binds the data directly to the DOM representation. If you don’t understand what that means – don’t worry, you don’t have to. But it has a couple of consequences that may or may not make D3 more attractive for your needs.
The first is that it almost without exception makes rendering faster, and thereby makes animations and smooth transitions from one state to another more feasible. The second is that it will only work in browsers that support SVG, so you will be leaving Internet Explorer 7 and 8 users behind. And due to the deep DOM integration, enabling VML rendering for D3 is a far bigger task than for Protovis, and one that nobody has embarked on yet.
- That said, many of the examples on the D3 website are simply mind-blowing
After thorough research of the available options, we chose Protovis as the base for building out DataMarket’s visualization capabilities, with an eye on D3 as our future solution once modern browsers finally saturate the market. We expect that to be about two years from now.
Here are the slides from Hjalmar Gislason’s presentation at Strata in Santa Clara on Feb 29th, 2012
Note that most images are links to further information such as demonstrations, libraries, blog posts, etc.
As some of you may have seen in our blog post last week, we have been preparing a major upgrade of DataMarket.com with new functionality and new subscription plans.
Well, today is the day! We’re proud to introduce the new DataMarket.com:
…and while the new plans are geared largely towards data publishers as mentioned in last week’s post, there’s certainly something exciting in store for everybody.
Removing the pay-wall for data seekers
We’re shifting the focus of our subscription plans to data publishers, helping them publish their data, manage their data offerings and enable visualizations and interactivity on top of it.
This means that end-user features that have until now only been available to our Pro subscribers are now free of charge. As a registered user, you can download any data you have access to on DataMarket in any format (CSV, Excel, bitmaps and vector images), connect to live data from Excel, and create your own “Live reports”, which have been upgraded dramatically and are now called “Topic pages”.
Topic pages are essentially dashboards or reports that any user can easily create, so that the latest data on whatever matters most to you, your industry or your area of interest is instantly available and up to date in one place. You can keep your topic pages private, publish them for anybody to see, or share them with a selected group of users.
You can “Follow” any topic page that you have access to, making it easily accessible on your home page (when logged in). This way you can keep track of any updated data or new insights from the topic page author.
Users can now upload their own data. This part is still in beta, so bear with us on any quirks and errors that may arise. The data format is fairly strict, but if you have your data in Excel files or CSV it should not be too hard to reshape it so that our importer understands it. We provide templates that show – by example – how to format the data before uploading it, and we are working to streamline this process.
Anybody can upload a data set for their own private use, but to publish or sell data you need to subscribe to one of our publisher plans (see also below).
To update an existing data set, simply upload an updated file. For large collections of data, or data that is frequently updated, subscribers to our Corporate plan and higher can be set up with automated ways to maintain their data by connecting directly with their file repositories or databases.
Data publisher plans
Any registered user can now upload data sets for private use on DataMarket.com. This is a brilliant way to test the upload mechanism and run your own data against some of the fantastic data available from other data sources.
- Pro ($59/month): Allows a user to publish and sell their uploaded data. Even the casual user can use this plan to make their data available to a large audience in an interactive and user-friendly way, and potentially make money by selling subscriptions to their research and insights.
- Corporate (starting at $299/month): Allows automated data updates, group sharing, on-site branding and a range of off-site possibilities, such as easily building dashboards to publish on other web sites. Perfect for an organization that wants to sell its data online yet maintain a strong identity on DataMarket.com, or an organization that wants to work with its own data, compare and view it in relation to data from other sources, and share these findings internally.
- Enterprise (contact us for pricing): Targeted at research and analyst firms, this plan offers a full rebranding of DataMarket’s system to run as an integrated part of customers’ web sites using live data.
A full overview of our plans is available on the “Plans & pricing” page on DataMarket.com
To learn more about DataMarket in general, you might want to take a look at our product tour. It will give you the rundown on all the important things DataMarket can do for you.
US office and new customers
To follow up on these new plans and our existing business, we are setting up a sales and marketing office in the US. Our offices are in the Boston area, more precisely in Cambridge, MA. This is where we plan to build out our business operations, while development will stay in Iceland.
We’re also very pleased to announce the first two customers of our Enterprise plan. Both are fantastic research companies well known within their respective fields of expertise: Yankee Group and Lux Research.
We will release details about how these companies are using our platform later on, and actually hope to have several more such announcements to make before long.
- – -
Needless to say, we’re thrilled about all of this, and very excited about the times ahead!
Looking forward to your feedback.
Note the reference to a new major customer (and more to come) and all the exciting new things that are hinted at. I’ll give you a few additional keywords:
- Data Uploads
- Topic pages (build your own automated dashboards and reports)
- Group sharing and private content
- More end-user functionality for free
- World-leading premium data providers
- …and more
This is by far our biggest upgrade since the launch of the international data offering last year. We will publish in-depth descriptions and examples here as we launch next week.
So, here it goes. Please share this with your media contacts and friends:
DataMarket Announcing Data Publishing Solutions for Research Companies at Strata
BOSTON — February 21, 2012
DataMarket, the company behind the leading data portal DataMarket.com, is launching a range of data publishing solutions for research companies, analysts and data enthusiasts at O’Reilly’s Strata Conference next week.
These solutions allow customers to easily publish their data sets and collections and make them available for users to search, visualize, compare and download, either for free or for a fee.
Ranging from simple uploads of data sets for private use on DataMarket.com to full rebranding of DataMarket’s system to run on top of customers’ databases as an integrated part of their web site, these new solutions open exciting possibilities to data providers of all sizes.
“We’re excited about DataMarket’s Enterprise solution as a new interactive and visual tool to analyze our research data,” says Carl Howe, VP of Data Sciences Research at Yankee Group, one of several information companies already implementing DataMarket’s solutions as part of their research and data publishing process. “We believe that tools such as DataMarket’s will democratize access to the ‘big data’ driving today’s mobile ecosystem, so we’re excited to be working together to bring that capability to our analysts and users.”
DataMarket’s new data publishing solutions will be launched and immediately available to new and existing customers on February 29th. Details on functionality and pricing will be announced at the launch.
- – -
DataMarket helps business users find and understand data, and data providers efficiently publish and monetize their data and reach new audiences.
DataMarket’s unique data portal – DataMarket.com (http://DataMarket.com/) – provides visual access to billions of facts and figures from a wide range of public and private data providers including the United Nations, the World Bank, Eurostat and the Economist Intelligence Unit.
For further information contact:
Hjalmar Gislason, founder and CEO
For those of you not familiar with the background: DataMarket was originally founded in Iceland in the summer of 2008. That’s where our product team is, and will be, located. We initially launched our services here, mainly for the local market, but with the obvious intention to broaden our scope. The opportunity for an active marketplace for data is obviously a global one and certainly not limited to our tiny island of 320,000 inhabitants!
In fact, today – January 24th – marks the first anniversary of our international data offering.
A lot has happened since. We’ve learned a bit about what works, and a lot about what doesn’t in the emerging field of data markets. We’ve managed to build a significant and largely recurring revenue base, even though some of the revenues are coming from services we didn’t necessarily foresee a year ago. We’ve established good connections with some of the most interesting data providers out there. And we’ve learned a lot from feedback from our users and customers. Some of that feedback has already been incorporated in our product and technology.
At the Strata conference in late February, we will announce a range of new features, subscription plans and data sources, all resulting from the lessons we’ve learned in the last 12 months. More on that later!
The US office is also a result of this learning curve. Despite all the wonders of modern communication technologies, location still matters. Nothing beats meeting people face to face, looking them in the eye, listening to them describe their challenges and watching their reaction to your demo, your pitch and your sales arguments. Hardly anything sells itself over the Internet. Even Google has an army of people doing traditional sales: wining and dining, manning call centers, networking, meeting, greeting and doing business like business has been done for ages. And they’re Google!
Also, it turns out that there are more enterprise-level opportunities in our business than we originally thought. And while data and feature subscription plans can indeed be marketed and sold online, enterprise solutions most certainly cannot.
So, we’re setting up an office in the United States to build out our sales, marketing and business development operations.
And why Cambridge? First of all, the East coast was almost a no-brainer for us. The industries that have expressed the most interest in what we are doing, namely research, media and finance, are stronger on the East coast than the West. Our sales pipeline is dominated by companies in Boston, New York and Washington D.C. This is also true of investors interested in the type of business we’re building. To overgeneralize, the data start-ups we’ve seen funded on the West coast tend to be toward the social, consumer-oriented end of the spectrum, while those on the East coast are more of the B2B, business analytics and financial nature.
We had our eyes set pretty firmly on New York, but in the span of a few weeks late last year we saw good success with a few really interesting leads in the Boston area. In fact, we’ve already signed a couple of super-interesting customers there, and there are more in the pipeline. The research industry is really strong in the Boston area, and it seems to be quite interconnected, giving us a lot of opportunities to work the network and get more business going. Last but not least, we value being close to the great universities in the area. So Cambridge it is.
And the commute from Boston to New York is quite convenient, especially compared to the commute from Iceland.
I (Hjalmar) will be moving over in a few weeks’ time to start building the team and our success in this very dynamic market. I’d be most interested in hearing from people who would like to join our team or look into opportunities to work with us. If you are interested, please do not hesitate to get in touch.