In high-quality software there are various things going on behind the scenes to make the user experience more pleasant. Many of these are entirely invisible to the user, but the experience would degrade if they were not done. These are the things that give you that “it just works” feeling when interacting with good software.
At DataMarket we take pride in doing such things. One of them is data downsampling. There are various reasons a product like ours needs to downsample data, the main two being:
- Avoid sending excessive amounts of data “over the wire” in order to speed up interactions. There is no need to send a million data points over the internet in order to render a chart on a monitor that is a mere 1800 pixels across.
- Speed up calculations and rendering of charts by only focusing on the data points that best represent the overall trends in the data.
But let’s give the word to Sveinn:
As human beings, we often wish to visualize certain information in order to make better sense of it. This can be somewhat challenging in the case of large amounts of data, especially when viewing a large data set in an interactive way. Receiving and rendering all the raw data can be time consuming and is often unnecessary since the user cannot even perceive most of it, at least when viewing all the data at once. One solution is to downsample the data, retaining only the important visual characteristics.
A recently published master’s thesis in computer science at the University of Iceland explores the problem of downsampling line charts (especially time series) for visual representation. The topic was initially suggested by DataMarket since the company has experienced this challenge first hand.
Of all the algorithms evaluated, the ones that yielded the best results all used methods similar to those in the field of cartographic generalization (polyline simplification). One algorithm in particular, called Largest-Triangle-Three-Buckets, turned out to be both efficient and to produce good results in most cases. It is already being put to use by DataMarket and it has also been published on GitHub under a free software license.
- An interactive demonstration can be viewed here: http://flot.base.is/
- The thesis can be found here: Downsampling Time Series for Visual Representation
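The post itself does not spell out how Largest-Triangle-Three-Buckets works, so what follows is only a minimal Python sketch of the idea as it is generally described: keep the first and last points, split the rest into equally sized buckets, and from each bucket keep the point that forms the largest triangle with the previously kept point and the average of the next bucket. The function name `lttb`, the helper `edge` and the integer bucket arithmetic are choices made for this illustration, not DataMarket's published code.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def lttb(points: List[Point], threshold: int) -> List[Point]:
    """Downsample `points` (sorted by x) to `threshold` points."""
    n = len(points)
    # Nothing to do if the data is already small enough; thresholds
    # below 3 are treated as degenerate and returned unchanged.
    if threshold >= n or threshold < 3:
        return list(points)

    inner = n - 2            # points between the fixed first and last
    buckets = threshold - 2  # one output point per bucket

    def edge(k: int) -> int:
        # Index where bucket k starts (integer math avoids float drift).
        return 1 + k * inner // buckets

    sampled = [points[0]]
    a = 0  # index of the most recently kept point
    for i in range(buckets):
        start, end = edge(i), edge(i + 1)

        # Average of the next bucket; for the final bucket this is
        # simply the last data point.
        nxt = points[end:min(edge(i + 2), n)]
        avg_x = sum(p[0] for p in nxt) / len(nxt)
        avg_y = sum(p[1] for p in nxt) / len(nxt)

        ax, ay = points[a]
        best_area, best_idx = -1.0, start
        for j in range(start, end):
            bx, by = points[j]
            # Area of the triangle (kept point, candidate, next average).
            area = abs((ax - avg_x) * (by - ay)
                       - (ax - bx) * (avg_y - ay)) / 2
            if area > best_area:
                best_area, best_idx = area, j

        sampled.append(points[best_idx])
        a = best_idx

    sampled.append(points[-1])
    return sampled
```

Calling `lttb(series, 500)` on a series of a million `(x, y)` pairs would return 500 points, which is roughly what a chart a few hundred pixels wide can show anyway.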
Today is #datainnovation day. This blog post is our contribution to the day’s activities, recognizing our role in the data ecosystem.
Corporate decision makers feed on data from a wide and growing variety of sources. More often than not the problem they face is not the lack of available data, but how readily that data can be accessed, understood and used.
This problem was recognized early on in the history of enterprise software and in the 1990s the field of business intelligence really took off as a way to help companies aggregate and make sense of the growing amount of data available to them. This was in the early days of the Internet, and largely before the Internet was recognized as a major force in the world, let alone a source of data and intelligence for serious business.
Therefore, BI took off as a tool that focused on data from internal systems, largely data from operational databases and other enterprise software systems. When properly implemented, a BI system gives companies really valuable insight into operational matters ranging from the performance of the call-center to the results of the latest efforts to fight customer churn – in short, pretty much anything that falls under the “COO’s world”.
So that’s all good to understand what’s happening inside the organization. But what about the external business environment? The market conditions, competitive landscape and economies that the company is operating in? You can operate your company perfectly looking at the internal metrics, and still come crashing down in flames if you don’t navigate the market environment properly.
In other words: You don’t drive your car staring at the dashboard. If you don’t look out the windshield, you’re guaranteed to crash into something very soon (see the beginning of this video):
But, as history will have it, Business Intelligence systems still seem to see the Internet almost as an afterthought. You can easily hook them up to the various internal databases through ODBC connections, take in data from a variety of enterprise systems and so on, but if you want to hook them up to a simple online API – not so much.
This is slowly changing, but the bigger problem is that most of the good data out there on the internet, whether from Open Data sources, financial and economic databases or market research companies, doesn’t even exist in APIs or other well-structured, machine-readable formats.
Therefore, BI has never made its way in any meaningful way to those that are looking at external data: The strategy teams, insights teams and tactical marketing teams, in short – “the CMO’s world”. (see also my previous blog post Data-driven decision making: Beyond today’s BI)
So, the world of decision making data today looks something like this:
Company logos are placeholders and in reality represent thousands of companies and data providers and dozens of BI systems
Now the interesting thing is that BI in total is about a $13.8B industry in 2013 (as measured by Gartner), whereas the “market intelligence” industry is over $70B (adding up the IT & Market Research and Science, Technology and Medical information segments as defined and measured by Outsell). Yet, there have not been any good solutions to help unify and simplify access to MI-data – a world that is dominated by delivery of data through – *shudder* – PDFs, PowerPoints, proprietary data systems and Excel-sheets, sometimes “enriched” with complicated pivot tables and macros.
Enter DataMarket. Our value proposition is that we help organizations get unified and normalized access to all this data, providing them with a Data Hub – a portal that provides a single point of access to all this data, allowing users to search, visualize, compare, share and download all of this data in the most useful format for the task at hand. Furthermore we enrich the collection of premium data they’re already subscribed to with data coming from open and publicly available sources across the internet. There’s more good data out there than most people realize. The problem is not availability, but discovery and access.
It is clear to me that in the long run, all this data belongs in the same place – in the same systems – and that’s in fact what we increasingly hear from our customers: “This is fantastic! But now that’s solved, how can I get my internal data into the same view?”
Now we’re happy to add custom data feeds with some key internal data to our customers’ data hub setups, but we’re not a BI or analytics tool – nor do we intend to become one. In fact, most enterprises have very strict rules about internal data ever crossing the company’s firewall (which is why I don’t see pure SaaS BI systems having a place in the large enterprise world, at least not for a good while – but that’s material for another blog post).
What we offer these customers – in addition to the Data Hub access – is that we’ll deliver all this external data in a normalized way to their BI systems, to be analyzed and acted on inside the firewall, using the tools that the organization (at least the COO’s side of the house) is already comfortable with. We take care of aggregating and normalizing the data and maintaining the connections to the data providers, delivering up-to-date data from hundreds of sources to these organizations through a single connection:
And this is my prediction: Business Intelligence and Market Intelligence are about to meet up in a big way, a way that will be transformational for both industries: BI will have to learn about the needs and desires of the world’s marketing departments and Market Intelligence companies will have to learn how to deliver their data in a more useful way than they currently do.
Maybe we can call this new combination “Unified Intelligence”?
But, whatever it will be called, we’re looking forward to being in the middle of this transformation.
Presentation given by Hjalmar Gislason, CEO of DataMarket at WARC 2014 in London, January 16, 2014
A week ago we silently released a new, exciting feature to DataMarket: Choropleth maps.
Choropleth maps are geographical maps where areas, such as countries or states, are colored based on the value of an indicator. They are a great way to explore and reveal geographical patterns in data.
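To make the idea concrete, here is a minimal Python sketch of the colouring step only: each area's value is placed into one of a few equal-width classes, and each class gets a colour from a light-to-dark palette. The function name `choropleth_colors`, the palette and the equal-interval classification are assumptions made for illustration; the post does not say how DataMarket classifies values.

```python
from typing import Dict, List

# Hypothetical light-to-dark palette; any sequential scheme would do.
PALETTE = ["#eff3ff", "#bdd7e7", "#6baed6", "#2171b5"]


def choropleth_colors(values: Dict[str, float],
                      palette: List[str] = PALETTE) -> Dict[str, str]:
    """Map each area to a colour using equal-width value classes."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    k = len(palette)
    colors = {}
    for area, value in values.items():
        if span == 0:
            cls = 0  # all areas share one value: use a single class
        else:
            # Scale to [0, k) and clamp the maximum into the top class.
            cls = min(int((value - lo) / span * k), k - 1)
        colors[area] = palette[cls]
    return colors
```

With made-up input such as `{"area_a": 0.0, "area_b": 50.0, "area_c": 100.0}`, the lowest area gets the lightest colour and the highest the darkest. Equal-width classes are the simplest choice; skewed indicators often read better with quantile classes instead.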
You’ve seen choropleth maps a million times, but may not have known what they’re called. And creating one has never been this simple. Find the data you want to view on DataMarket.com (or upload your own), and select the Choropleth map chart type. That’s it.
Here’s an example showing GDP per capita across the globe:
And you can zoom into different regions of the world, such as Europe:
There are even two sub-national maps available already, one for the regions of Brazil and the other one for the states of the United States of America. Here’s one that shows the latest unemployment numbers in the US, by state:
Implementation details – for those that are interested
As with everything we do at DataMarket, we have to approach things in a very generic way. There are almost 70 thousand data sets already available on our public site alone, holding more than 310 million time series. Furthermore, users can upload their own data. So every chart, export format and feature of the system has to be able to act in a generic way to accommodate a wide range of values, different sizes of data sets and all the weird edge-cases. This is very different from creating a single choropleth by hand, or hacking a single map for use with a single data set.
Choropleths are no exception. We believe we’ve done quite well, but we had to make all sorts of design choices when implementing these. We will be covering these separately in a follow-up post in the coming days.
DataMarket aggregates and normalizes quantitative data from a wide variety of sources, enabling users to “go from question to shared insight in minutes”.
To do so, DataMarket guides users through a specific workflow. An analyst has come up with a question – say, “Who are the world’s biggest oil producers and where is the US in that regard?” – and is seeking data that can provide an answer or serve as an input into a data model or decision making process.
The workflow to answer that question is the following:
- Search: Keyword queries or navigation that helps users to identify data sets that may be relevant to the question. The keyword query here might be: [US oil production]. This will return a list of data sets that the user can scan and select one that is likely to have an answer, e.g. “Oil: Production tonnes” from BP.
- Select: Selecting the data from the data set that may shed light on the question. When first opening a data set, this is done automatically based on information in the user’s search query (if such hints are available) or the properties of the data itself, favouring e.g. world totals over values for individual countries, etc. In some cases no such hints or properties are available, in which case the user is guided through the selection process. In our example above, the system will automatically select the US because the term was used in the search query, immediately resulting in a line chart showing the history of oil production in the US from 1965 to 2012.
However, as our question was about the world’s largest oil producers, we select all of the countries by checking the box next to the title of the country “dimension”. This will result in a somewhat incomprehensible line chart with more than 60 lines!
- Display: This is where the user decides how to display the selected data. In this case we want to see a ranked comparison of oil production in different countries in the latest available year. The chart type that yields that view is the bar chart. Selecting that, we see that there are some totals in this data set that “pollute” the list, so we go back to the select step to remove these. This leaves us with a chart showing the long tail of oil production in the world.
We’ve arrived at our answer: In 2012, Saudi Arabia was the world’s biggest oil producer, followed closely by Russia with the US in a distant (but somewhat secure) 3rd position.
- Export: Now all we need to do is share our finding with colleagues or customers, or take the data elsewhere for further work or analysis. The Export tab allows the user to export the data or the resulting chart in a number of different ways (Excel, CSV, PowerPoint, PNG, SVG or PDF), connect to the data from other systems using live data feeds (Excel, R) or our generic API, or share a link to the data in email, IM or on social media. Let’s assume that the user simply wants to share her new knowledge with a co-worker. Click the “Short URL” option under “Share”, copy the snappy little data.is link and send it off via email.
Clicking the link will take the recipient to exactly the same view. And the user can continue whatever she was doing when the question came up.
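For readers who think in code, the Select and Display steps of this example boil down to a filter and a sort: drop the aggregate rows that “pollute” the list, keep the latest available year, and rank what is left. The sketch below is an illustration only; the `Row` type, the `rank_latest` function and the idea of passing aggregate names as a set are hypothetical and say nothing about how DataMarket does this internally.

```python
from typing import List, NamedTuple, Set


class Row(NamedTuple):
    area: str     # country or aggregate label
    year: int
    value: float  # e.g. production in the data set's own unit


def rank_latest(rows: List[Row], aggregates: Set[str]) -> List[Row]:
    """Rank non-aggregate areas by value in the latest available year."""
    latest = max(r.year for r in rows)
    kept = [r for r in rows
            if r.year == latest and r.area not in aggregates]
    # Largest value first, as in a ranked bar chart.
    return sorted(kept, key=lambda r: r.value, reverse=True)
```

Feeding the result to a bar chart gives the ranked, long-tail view described above.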
In the above example the answer to the question was found in a time series data set, the most common type of research document found in DataMarket, but the same workflow also applies to other types such as survey questions and text documents, although the available functionality varies at each step given the nature of the data/document.
The example also relies on the fact that the data to answer the original question is indeed available in our data collection. The more than 50,000 data sets that are already there hold more than 3 billion numerical facts about the world, so there is a lot in there. But in short we’re strong when it comes to macro-economic data, and spotty when it comes to more industry-specific or sub-national data. This is where our Data Hub product comes in, allowing you to use DataMarket with any data you want, whether from public sources, syndicated premium research or private data from your computer or corporate network. Read more about the Data Hub here.
There are various other aspects of the system that support the workflow and common data tasks around it. Most importantly:
- Combining data: The system allows data from two or more data sets to be combined in a single view, regardless of the original format or source of the data. As an example one could compare the historical growth of oil production in Saudi Arabia to their GDP growth.
- Collecting data: Multiple data views (charts, tables, maps, …) – typically on the same topic or project – can be gathered on a single page, called a topic page, for quick reference. Here’s one on food and agriculture as an example. This can serve several purposes:
- Bookmarking for later reference
- Creating a dashboard for quick overview (each data widget updates as new data becomes available)
- Sharing collected insights with a team to facilitate discussion and decision making.
- Embedding: A chart or a table can be embedded on a 3rd party website similar to a YouTube video or a SlideShare slide show.
You can read more about individual features of the system in our product tour.
This weekend we rolled out the largest upgrade of our data viewer user interface since we introduced the HTML5 charts in 2011. As this interface is in many ways the heart of the DataMarket experience, we are obviously quite excited to see this go live.
Long-time DataMarket users will notice a series of changes, and new users should now find the site even more user friendly and attractive. Here’s what’s new:
Three tab interface
All the controls that used to be around the chart on the right are now gone, putting more emphasis on the chart itself and leaving the chart area cleaner. The controls are now found under their respective tabs in the panel on the left hand side. Each tab supports a logical step in a workflow once you’ve opened the data set you wanted to work with:
- Select: This is where you select the data you want to view from the open dataset(s). This is in fact what has always been on the left-hand side panel. A notable difference, however, is that there is no “Visualize” button. Instead charts and tables update immediately as the data selection is changed. See more on this below.
- Display: This is where you control the display of the data. Pick your chart types, select the period or point in time you want to view, configure chart settings, edit chart titles and so on. Look for new and exciting chart types here soon!
- Export: This is where you “take it away”: Export the data you have selected or the charts you have created in your desired format (CSV, Excel, PowerPoint, PDF, PNG, SVG, …); connect live to our vast collection of normalized data from other systems (Excel, R, or anywhere else using our generic API); embed interactive versions of the charts in your own blog posts and articles; or share your findings with others on social media or email/instant messaging using the short URLs.
We have been fans of instant feedback for a long time, and we finally got around to doing something about it. Every change you make to a data selection or a data view is now instantly reflected on the screen. There are no “Visualize” or “Apply” buttons. Instead your actions have immediate effect on the data you are looking at. We believe this approach makes working with data a lot more natural – almost tactile – encouraging exploration and experimentation and facilitating faster insight.
In the world of data, context is everything, and often you need additional meta-data to fully understand what you are looking at, how the data is collected and what it really means. In fact, two of our principles remind us that we must provide all the necessary information and context for our users to fully understand any data view. We therefore go to great lengths to acquire associated meta-data and details for any data we make available 1. Access to this information has always been available under “Detailed information”, a menu item on the “hamburger icon” next to each data set, but we’ve now given it a lot more weight, putting it right below the chart area, easily expanded by clicking the “Show detailed information” link next to the source reference.
We are quite proud of this upgrade and it provides us with a UI-framework that allows us to logically expand on the data viewer’s functionality, including data transformations, statistical analysis and additional visualization types. Stay tuned for that.
We welcome any questions, comments and ideas on the new interface. Feel free to comment here or reach out to email@example.com with your feedback.
- – -
1 Unfortunately many data providers leave users of their data – us included – a little too much in the dark the way they make their meta-data available.
My job has a serious occupational hazard. We work with so much interesting data, holding the keys to so many – sometimes untold – stories, that a casual opening of a data set can quickly lead to hours of nerdy investigation, trying to understand what might explain a sudden rise, drop or trend.
Following are some of my favorites. Click the thumbnails for full context and interactive charts:
Potatoes and the Irish
Japanese fire horses
We got CO2 too
The “terrible” medicalization of childbirth