We have some big news to share!
DataMarket has been acquired by one of the most innovative companies in the data and analytics space: Qlik.
For those of you who don’t know Qlik, they – uh, I mean “we” – are the Business Intelligence vendor that pioneered the field of data discovery – a nimble, interactive and visual way to work with data that really disrupted the entire BI industry a few years back. With the recent launch of a next-generation product – Qlik Sense – the BI process has been made even simpler, with self-service discovery and visualizations so easy to use that anyone in an organization can build a visually compelling story using their data.
Qlik is a rapidly growing organization, with more than $470M in revenue last year, and publicly traded on NASDAQ since 2010.
We are not disclosing a lot about our joint plans at this time, but I’ll still say that we see some very interesting opportunities in bringing together Qlik’s superb analytics products with DataMarket’s unique ability to pull in, maintain and normalize data from a vast range of third-party sources – all in an effort to fulfill the Qlik vision of simplifying decisions for everyone, everywhere.
Also – while headquartered in the US – Qlik is originally out of Sweden with most of their R&D efforts still there. Qlik has some unmistakably Scandinavian characteristics, and after getting to know the company over the last few months I can say that the cultural fit has felt completely natural.
So, for us DataMarketeers, a new and exciting chapter begins. You won’t hear a lot from us in the next couple of months, but when we come back we’ll have a new home on the Qlik blog on Qlik Community.
In the meantime we encourage you to learn more about Qlik:
The outbreak of the Ebola virus disease (EVD) in western Africa that started in March this year is the deadliest outbreak of EVD to date. In August the World Health Organization declared the epidemic to be an international public health emergency and in a statement on September 26 said that “The Ebola epidemic ravaging parts of West Africa is the most severe acute public health emergency seen in modern times. Never before in recorded history has a biosafety level four pathogen infected so many people so quickly, over such a broad geographical area, for so long.”
Data on the epidemic can be hard to find. Each affected country produces data at a different granularity, and usually it is published in daily PDF reports (here are the Liberia reports, for example). PDF, while possibly useful for something, is a really bad format for public and open data: it’s hard to extract data from without human intervention and is generally a pain to work with.
Recently we were approached by Ola Rosling of the Gapminder foundation, urging us to import metrics on the EVD epidemic into DataMarket. He got us in contact with Caitlin Rivers, a computational epidemiology student at Virginia Tech, who has been collecting the data from the official PDF files and making it public in a GitHub repository. A fantastic initiative. She, along with others, has also been publishing some great insights into the data on her blog.
Today we are making public four datasets with EVD data on DataMarket. These sets are free for anyone to use, either on our site or via our API, our R package or by export into another system (via CSV or Excel).
(Update: If you are working with this data using our API and hit the request limit for free API usage, just reach out to our support, and we’ll set you up with a generous quota to continue your work).
Caitlin provides us with these datasets:
Additionally we now also import Sub-national time series data on Ebola cases and deaths from the OCHA Regional Office for West and Central Africa:
You can access these sets and others relating to the epidemic from a single topic-page here. They will be updated with the most recent data on a daily basis.
We know that there is more work to be done getting data on EVD into usable formats, and if you want to help, check out Caitlin’s GitHub repository. There are PDFs that need digitizing and data that needs cleaning. Note that there might be discrepancies in the data. If you notice anything that seems off, do not hesitate to let us know.
We will continue to add to our data collection as we come across useful data.
In high-quality software there are various things going on behind the scenes to make the user experience more pleasant. Many of these things are entirely invisible to the user, but would degrade the user experience if they were NOT done. These are the things that give you that “it just works” feeling when interacting with good software.
At DataMarket we take pride in doing such things. One of them is data downsampling. There are various reasons a product like ours needs to downsample data, the main two being:
- Avoid sending excessive amounts of data “over the wire” in order to speed up interactions. There is no need to send a million data points over the internet to render a chart on a monitor that is a mere 1,800 pixels across.
- Speed up calculations and rendering of charts by only focusing on the data points that best represent the overall trends in the data.
But let’s give the word to Sveinn:
As human beings, we often wish to visualize certain information in order to make better sense of it. This can be somewhat challenging in the case of large amounts of data, especially when viewing a large data set in an interactive way. Receiving and rendering all the raw data can be time consuming and is often unnecessary since the user cannot even perceive most of it, at least when viewing all the data at once. One solution is to downsample the data, retaining only the important visual characteristics.
A recently published master’s thesis in computer science at the University of Iceland explores the problem of downsampling line charts (especially time series) for visual representation. The topic was initially suggested by DataMarket since the company has experienced this challenge first hand.
Of all the algorithms evaluated, the ones that yielded the best results all used methods similar to those found in cartographic generalization (polyline simplification). One algorithm in particular, called Largest-Triangle-Three-Buckets, turned out to be both efficient and to produce good results in most cases. It is already being put to use by DataMarket and has also been published on GitHub under a free software license.
- An interactive demonstration can be viewed here: http://flot.base.is/
- The thesis can be found here: Downsampling Time Series for Visual Representation
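For those curious, here is a minimal Python sketch of the Largest-Triangle-Three-Buckets idea – our own illustration of the algorithm, not the reference implementation linked above:

```python
# Sketch of Largest-Triangle-Three-Buckets (LTTB): keep the first and last
# points, split the rest into equal buckets, and from each bucket keep the
# point forming the largest triangle with the previously kept point and the
# average of the next bucket.

def lttb(points, threshold):
    """Downsample a list of (x, y) points to `threshold` points."""
    n = len(points)
    if threshold >= n or threshold < 3:
        return list(points)

    sampled = [points[0]]                     # always keep the first point
    # Bucket size, excluding the fixed first and last points
    bucket_size = (n - 2) / (threshold - 2)

    prev = points[0]
    for i in range(threshold - 2):
        # Current bucket boundaries
        start = int(i * bucket_size) + 1
        end = int((i + 1) * bucket_size) + 1
        bucket = points[start:end]

        # Average point of the NEXT bucket (or the last point at the end)
        next_end = min(int((i + 2) * bucket_size) + 1, n)
        nxt = points[end:next_end] or [points[-1]]
        avg_x = sum(p[0] for p in nxt) / len(nxt)
        avg_y = sum(p[1] for p in nxt) / len(nxt)

        # Pick the point forming the largest triangle (by twice its area)
        best = max(bucket, key=lambda p: abs(
            (prev[0] - avg_x) * (p[1] - prev[1]) -
            (prev[0] - p[0]) * (avg_y - prev[1])))
        sampled.append(best)
        prev = best

    sampled.append(points[-1])                # always keep the last point
    return sampled
```

Note how a lone spike survives the downsampling: a point far from its neighbours forms a large triangle, so it is exactly the kind of visually important feature the algorithm retains.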
Today is #datainnovation day, and this blog post is our contribution to the day’s activities, recognizing our role in the data ecosystem.
Corporate decision makers feed on data from a wide and growing variety of sources. More often than not the problem they face is not the lack of available data, but how readily that data can be accessed, understood and used.
This problem was recognized early on in the history of enterprise software and in the 1990s the field of business intelligence really took off as a way to help companies aggregate and make sense of the growing amount of data available to them. This was in the early days of the Internet, and largely before the Internet was recognized as a major force in the world, let alone a source of data and intelligence for serious business.
Therefore, BI took off as a tool that focused on data from internal systems, largely data from operational databases and other enterprise software systems. When properly implemented, a BI system gives companies really valuable insight into operational matters, ranging from the performance of the call center to the results of the latest efforts to fight customer churn – in short, pretty much anything that falls under the “COO’s world”.
So that’s all good for understanding what’s happening inside the organization. But what about the external business environment – the market conditions, competitive landscape and economies that the company is operating in? You can run your company perfectly by its internal metrics and still come crashing down in flames if you don’t navigate the market environment properly.
In other words: You don’t drive your car staring at the dashboard. If you don’t look out the windshield, you’re guaranteed to crash into something very soon (see the beginning of this video):
But, as history would have it, Business Intelligence systems still seem to treat the Internet almost as an afterthought. You can easily hook them up to various internal databases through ODBC connections, take in data from a variety of enterprise systems and so on – but if you want to hook them up to a simple online API, not so much.
This is slowly changing, but the bigger problem is that most of the good data out there on the internet, whether from Open Data sources, financial and economic databases or market research companies doesn’t even exist in APIs or other well-structured, machine-readable formats.
Therefore, BI has never meaningfully made its way to those who look at external data: the strategy teams, insights teams and tactical marketing teams – in short, “the CMO’s world”. (See also my previous blog post, Data-driven decision making: Beyond today’s BI.)
So, the world of decision making data today looks something like this:
Company logos are placeholders and in reality represent thousands of companies and data providers and dozens of BI systems
Now the interesting thing is that BI in total is about a $13.8B industry in 2013 (as measured by Gartner), whereas the “market intelligence” industry is over $70B (adding up the IT & Market Research and the Science, Technology and Medical information segments as defined and measured by Outsell). Yet there have not been any good solutions to help unify and simplify access to MI data – a world dominated by delivery of data through – *shudder* – PDFs, PowerPoints, proprietary data systems and Excel sheets, sometimes “enriched” with complicated pivot tables and macros.
Enter DataMarket. Our value proposition is that we help organizations get unified and normalized access to all this data, providing them with a Data Hub – a portal that provides a single point of access to all this data, allowing users to search, visualize, compare, share and download all of this data in the most useful format for the task at hand. Furthermore we enrich the collection of premium data they’re already subscribed to with data coming from open and publicly available sources across the internet. There’s more good data out there than most people realize. The problem is not availability, but discovery and access.
It is clear to me that in the long run, all this data belongs in the same place – in the same systems – and that’s in fact what we increasingly hear from our customers: “This is fantastic! But now that’s solved, how can I get my internal data into the same view?”
Now, we’re happy to add custom data feeds with some key internal data to our customers’ data hub setups, but we’re not a BI or analytics tool – nor do we intend to become one. In fact, most enterprises have very strict rules about internal data ever crossing the company’s firewall (which is why I don’t see pure SaaS BI systems having a place in the large enterprise world, at least not for a good while – but that’s material for another blog post).
What we offer these customers – in addition to the Data Hub access – is that we’ll deliver all this external data in a normalized way to their BI systems, to be analyzed and acted on inside the firewall, using the tools that the organization (at least the COO’s side of the house) is already comfortable with. We take care of aggregating and normalizing the data and maintaining the connections to the data providers, delivering up-to-date data from hundreds of sources to these organizations through a single connection:
And this is my prediction: Business Intelligence and Market Intelligence are about to meet up in a big way, a way that will be transformational for both industries: BI will have to learn about the needs and desires of the world’s marketing departments and Market Intelligence companies will have to learn how to deliver their data in a more useful way than they currently do.
Maybe we can call this new combination “Unified Intelligence”?
But, whatever it will be called, we’re looking forward to being in the middle of this transformation.
Presentation given by Hjalmar Gislason, CEO of DataMarket at WARC 2014 in London, January 16, 2014
A week ago we silently released a new, exciting feature to DataMarket: Choropleth maps.
Choropleth maps are geographical maps where areas, such as countries or states, are colored based on the value of an indicator. They are a great way to explore and reveal geographical patterns in data.
You’ve seen choropleth maps a million times, but may not have known what they’re called. And creating one has never been this simple: find the data you want to view on DataMarket.com (or upload your own), and select the Choropleth map chart type. That’s it.
Here’s an example showing GDP per capita across the globe:
And you can zoom into different regions of the world, such as Europe:
There are even two sub-national maps available already: one for the regions of Brazil and the other for the states of the United States of America. Here’s one that shows the latest unemployment numbers in the US, by state:
Implementation details – for those that are interested
As with everything we do at DataMarket, we have to approach things in a very generic way. There are almost 70 thousand data sets already available on our public site alone, holding more than 310 million time series. Furthermore, users can upload their own data. So every chart, export format and feature of the system has to act in a generic way to accommodate a wide range of values, different sizes of data sets and all the weird edge cases. This is very different from creating a single choropleth by hand, or hacking a single map for use with a single data set.
Choropleths are no exception. We believe we’ve done quite well, but we had to make all sorts of design choices when implementing these. We will be covering these separately in a follow-up post in the coming days.
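As a flavor of the kind of choice involved, here is a tiny Python sketch of one common approach to assigning map colors – quantile class breaks, where each color class holds roughly the same number of regions. This is a generic illustration with made-up figures, not necessarily the scheme we settled on:

```python
# Assign each region a color-class index using quantile breaks.
# Country codes and GDP figures below are made up for illustration.

def quantile_breaks(values, n_classes):
    """Compute n_classes - 1 break points so classes hold ~equal counts."""
    ordered = sorted(values)
    return [ordered[int(len(ordered) * k / n_classes)]
            for k in range(1, n_classes)]

def classify(value, breaks):
    """Return the color-class index (0 = lightest shade) for a value."""
    for i, b in enumerate(breaks):
        if value < b:
            return i
    return len(breaks)

gdp = {"NO": 100_000, "US": 51_000, "DE": 46_000,
       "BR": 12_000, "IN": 1_500, "ET": 500}

breaks = quantile_breaks(gdp.values(), 3)
classes = {country: classify(v, breaks) for country, v in gdp.items()}
```

Quantile breaks keep the map readable even when values are heavily skewed (as GDP per capita is); equal-interval breaks, by contrast, would lump most countries into a single shade.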
DataMarket aggregates and normalizes quantitative data from a wide variety of sources, enabling users to “go from question to shared insight in minutes”.
To do so, DataMarket guides users through a specific workflow. An analyst has come up with a question – say, “Who are the world’s biggest oil producers and where is the US in that regard?” – and is seeking data that can provide an answer or serve as an input into a data model or decision making process.
The workflow to answer that question is the following:
- Search: Keyword queries or navigation that helps users to identify data sets that may be relevant to the question. The keyword query here might be: [US oil production]. This will return a list of data sets that the user can scan and select one that is likely to have an answer, e.g. “Oil: Production tonnes” from BP.
- Select: Selecting the data from the data set that may shed light on the question. When first opening a data set, this is done automatically, based on hints in the user’s search query (if available) or on the properties of the data itself, favouring e.g. world totals over values for individual countries. Where no such hints or properties are available, the user is guided through the selection process. In our example, the system will automatically select the US because that term was used in the search query, immediately resulting in a line chart showing the history of oil production in the US from 1965 to 2012.
However, as our question was about the world’s largest oil producers, we select all of the countries by checking the box next to the title of the country “dimension”. This will result in a somewhat incomprehensible line chart with more than 60 lines!
- Display: This is where the user decides how to display the selected data. In this case we want to see a ranked comparison of oil production in different countries in the latest available year. The chart type that yields that view is the bar chart. Selecting that, we see that there are some totals in this data set that “pollute” the list, so we go back to the select step to remove these. This leaves us with a chart showing the long tail of oil production in the world.
We’ve arrived at our answer: In 2012, Saudi Arabia was the world’s biggest oil producer, followed closely by Russia with the US in a distant (but somewhat secure) 3rd position.
- Export: Now all we need to do is share our finding with colleagues or customers, or take the data elsewhere for further work or analysis. The Export tab allows the user to export the data or the resulting chart in a number of formats (Excel, CSV, PNG, SVG, PDF or straight to PowerPoint), connect to the data from other systems using live data feeds (Excel, R) or our generic API, or share a link to the data in email, IM or on social media. Let’s assume that the user simply wants to share her new knowledge with a co-worker: click the “Short URL” option under “Share”, copy the snappy little data.is link and send it off via email.
Clicking the link will take the recipient to exactly the same view. And the user can continue whatever she was doing when the question came up.
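To give a flavor of the generic API route mentioned in the Export step, here is a small Python sketch of requesting a data set programmatically. Note that the endpoint path, parameter names and data set ID below are illustrative assumptions for the sake of the example, not the documented interface – the real details are in our API documentation:

```python
# Illustrative only: the base URL, path and parameter names below are
# assumptions for this example, not DataMarket's documented API.
from urllib.parse import urlencode

API_BASE = "https://datamarket.com/api/v1"  # assumed base URL

def series_csv_url(dataset_id, country=None):
    """Build a CSV-export URL for a data set, optionally filtered to one country."""
    params = {"ds": dataset_id}
    if country:
        params["country"] = country  # hypothetical filter parameter
    return f"{API_BASE}/series.csv?{urlencode(params)}"

# A client would then fetch this URL (e.g. with urllib.request or requests)
url = series_csv_url("17tm", country="us")  # "17tm" is a placeholder data set ID
```

The point is that the same selection a user makes interactively can be expressed as a plain URL, so a script, spreadsheet or BI tool can pull refreshed data on a schedule.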
In the above example the answer to the question was found in a time series data set, the most common type of data set found in DataMarket, but the same workflow also applies to other types, such as survey questions and text documents, although the available functionality varies at each step given the nature of the data or document.
The example also relies on the fact that the data to answer the original question is indeed available in our data collection. The more than 50,000 data sets that are already there hold more than 3 billion numerical facts about the world, so there is a lot in there. In short, we’re strong when it comes to macro-economic data, and spottier when it comes to industry-specific or sub-national data. This is where our Data Hub product comes in, allowing you to use DataMarket with any data you want, whether from public sources, syndicated premium research or private data from your computer or corporate network. Read more about the Data Hub here.
There are various other aspects of the system that support the workflow and common data tasks around it. Most importantly:
- Combining data: The system allows data from two or more data sets to be combined in a single view, regardless of the original format or source of the data. As an example one could compare the historical growth of oil production in Saudi Arabia to their GDP growth.
- Collecting data: Multiple data views (charts, tables, maps, …) – typically on the same topic or project – can be gathered on a single page, called a topic page, for quick reference. Here’s one on food and agriculture as an example. This can serve several purposes:
- Bookmarking for later reference
- Creating a dashboard for quick overview (each data widget updates as new data becomes available)
- Sharing collected insights with a team to facilitate discussion and decision making.
- Embedding: A chart or a table can be embedded on a 3rd party website similar to a YouTube video or a SlideShare slide show.
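The “combining data” item above boils down to aligning series from different sources on their common dates. A trivial Python sketch of the idea, with made-up numbers:

```python
# Two yearly series from different (hypothetical) sources; figures are
# made up for illustration, not real oil-production or GDP numbers.
oil_production = {2010: 467.0, 2011: 517.7, 2012: 547.0}
gdp = {2010: 528.2, 2011: 671.2, 2012: 734.0, 2013: 744.3}

# Keep only the years both series cover, in order
common_years = sorted(oil_production.keys() & gdp.keys())

# One row per year: (year, oil production, GDP) – ready to chart together
combined = [(year, oil_production[year], gdp[year]) for year in common_years]
```

The hard part in practice is not this join but the normalization that precedes it: getting two sources with different frequencies, units and country codings onto a shared axis in the first place.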
You can read more about individual features of the system in our product tour.