My process - gathering data

Published on 2023-03-10 18:40

Every project starts with gathering data. The quality and completeness of that data is probably the most important part of the project. Visualizing without data is close to impossible.

In my experience, one either starts with a dataset from a client, Kaggle or some other source, or needs to gather the data themselves.

Existing dataset

Let's look at working with an existing dataset. Getting someone else's data is the first step in the project. During the initial gathering, the provider of the data should be able to answer some questions about the dataset.

In the age of AI generators, the first question should be "did you obtain the data legally?" Before the rise of bad crypto and AI generators my assumption would have been "yes". It's 2023, and more than a handful of companies are using the work of artists, writers and developers obtained through questionable means. More experienced designers taught me to make sure I have licenses for all the images I use. Otherwise it could lead to costly lawsuits.

Next I'd look at the completeness of the data and how it was obtained. This means asking about the domain of the data: is it produced by hardware or wetware (humans)? What data is missing, and why?

As an example, if I were to visualize data about a pool over the past three years and there's a gap in the middle of this period because a pump stopped working, that's valuable to know.
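A gap like this can be spotted before asking the provider about it. Below is a minimal sketch, assuming the readings come with one date per day; the `find_gaps` helper and the sample dates are made up for illustration and are not taken from a real dataset.

```python
from datetime import date, timedelta


def find_gaps(days, max_step=timedelta(days=1)):
    """Return (last_seen, next_seen) pairs where two consecutive
    readings are further apart than max_step."""
    ordered = sorted(days)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_step]


# Hypothetical daily pool readings with a hole in the middle.
readings = [
    date(2022, 6, 1),
    date(2022, 6, 2),
    date(2022, 6, 3),
    date(2022, 6, 10),
    date(2022, 6, 11),
]

gaps = find_gaps(readings)
for last_seen, next_seen in gaps:
    print(f"No readings between {last_seen} and {next_seen}")
```

The list of gaps is a good starting point for the "what is missing and why" conversation. It only says where the holes are, not what caused them.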

Another example is questionnaires, where it's important to know which answers are mandatory and which are optional.
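One way to make that distinction visible is to count empty answers per question and flag the mandatory ones. This is only a sketch: the `missing_report` helper, the question names and the responses are hypothetical, and it assumes each response is a dictionary where a skipped question is either absent or an empty string.

```python
def missing_report(responses, mandatory):
    """Count empty answers per question and note whether each
    question was mandatory."""
    questions = sorted({q for r in responses for q in r} | set(mandatory))
    report = {}
    for q in questions:
        missing = sum(1 for r in responses if r.get(q) in (None, ""))
        report[q] = {"missing": missing, "mandatory": q in mandatory}
    return report


# Hypothetical questionnaire responses.
responses = [
    {"age": "34", "city": ""},
    {"age": "", "city": "Brno"},
    {"age": "41"},
]

report = missing_report(responses, mandatory={"age"})
for question, info in report.items():
    print(question, info)
```

A missing answer to a mandatory question usually points to a problem in how the data was collected, while a missing optional answer is expected and should be read differently.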

At this point it's good to consider whether external data published by governments, trade groups and NGOs would be useful. Sometimes it helps frame the existing data in a bigger picture. Other times it creates more noise.

Last but not least, I like to ask, "What are you trying to get out of it?" While it's possible to make a dashboard with all the data, taking a closer look at a certain time frame or a selection of the data is also useful.

Assembling the data

This happens mostly on my side projects, when I want to visualize something and don't have a dataset yet. This means that gathering the data is completely up to me. There is a wide range of techniques one can use.

For personal data visualization, pen and paper are more than enough to do the job. In the past I used to have a small notebook where I wrote down interesting data points. Later I created a couple of data visualizations from them.

For projects where the data is available on the internet, for example on Wikipedia, one can write a scraper that transforms the data stored on a page into something structured. Just remember that Wikipedia's texts are available under a Creative Commons license.
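To give an idea of what such a scraper does, here is a minimal sketch using only Python's standard library. It assumes the data sits in a plain HTML table; the `TableRows` class and the sample markup are made up for illustration, and a real page would first need to be downloaded and would likely need more careful handling.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read
        self._cell = None  # text pieces of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page fragment standing in for a downloaded page.
SAMPLE = """
<table>
  <tr><th>Pool</th><th>Length (m)</th></tr>
  <tr><td>Example A</td><td>25</td></tr>
  <tr><td>Example B</td><td>50</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE)

# Treat the first row as the header and turn the rest into records.
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
print(records)
```

The records come out as text, so numbers and dates still need to be converted before they can be visualized.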

Another possibility is to use a book as a source, like Sonja Kuijpers did in her data visualization course.

At the end of the day, as long as the data is not obtained through shady means, the sky is the limit, unless you are an astronomer.
