Monthly Archives: May 2014

Data Journalism Workshop, May 26 – 30

Objectives: By the end of the workshop, participants should be able to:

  1. Appreciate data journalism
  2. Mine, scrape and analyze data on health
  3. Use simple tools to visualize data
  4. Write a data-driven story proposal
  5. Package data into simple, compelling and accessible stories.


Day One:                 Monday 26

08:30 – 09:30            Introduction/Expectations/Survey – Dorothy/Lydia

09:30 – 11:00            Journalism in the age of data – Dorothy

11:00 – 11:30           Tea break

11:30 – 01:00            Finding stories in data – Eva

01:00 – 02:00           Lunch

02:00 – 03:30            Interviewing your data – Dorothy

03:30 – 03:45           Tea break

03:45 – 05:15            Multimedia storytelling – Dol


Day Two:                   Tuesday 27

09:00 – 10:00            Finding data for stories – Eva

10:00 – 11:30            Finding data on the web – Eva

11:30 – 12:00           Tea break

12:00 – 01:00            Cleaning your data – Aggrey

01:00 – 02:00           Lunch

02:00 – 04:00            Converting data into friendly formats – Eva/Agnes

04:00 – 04:15           Tea break

04:15 – 05:15            Introduction to the Data Dredger – Dorothy


Day Three:                Wednesday 28

09:00 – 10:30            Math and statistics for journalists – Dorothy

10:30 – 10:45           Tea break

10:45 – 12:15            Finding interrelationships in data – Dorothy/Aggrey

12:15 – 01:15            How data informs my storytelling – Paul Wafula, The Standard

01:15 – 02:15           Lunch

02:15 – 03:45            Creating compelling visuals – Agnes

03:45 – 04:00           Tea break

04:00 – 05:30            Data visualization for journalism – Agnes

(Visualisation assignment)


Day Four:                 Thursday 29

09:00 – 10:00            Review assignment – Agnes/Eva

10:00 – 11:30            Creating maps with Maps Engine – Eva

11:30 – 12:00           Tea break

12:00 – 01:30            Long-form, multimedia storytelling (part one) – Dorothy/Eva

(Exercise & discussion)

01:30 – 02:30           Lunch

02:30 – 04:00            Interpreting quantitative research results: distinguishing good research from bad – Suleiman Asman, Innovations for Poverty Action

04:00 – 04:15           Tea break

04:15 – 05:00            Long-form, multimedia storytelling (part two) – Eva/Dorothy


Day Five:                  Friday 30

09:00 – 10:30            Recap – All trainers

10:30 – 11:00           Tea break

11:00 – 01:30            Story Mapping – All trainers

01:30 – 02:30           Lunch

02:30 – 05:00            Data Expedition – Eva/Dorothy/Agnes

05:00 – 05:15            Evaluation


Data cleaning Guide for Journalists

Data journalism workshops can make the data journalism process seem much faster and more straightforward than it really is. In reality, most data does not arrive organized and error-free; it is messy, and it needs to be cleaned before any kind of analysis begins. Data cleaning is the process data journalists use to detect and then correct or delete inaccurate, incomplete or erroneous data, with the aim of improving data quality. Examples of errors commonly found in data are:
1. Wrong date formats or incorrect dates like 30th February, 2013.
2. Unknown characters.
3. Missing data.
4. Spaces before and after values.
5. Data that is out of range, for example the age of a human being recorded as 879 years.
6. Inconsistent entries, for example the same name spelled in different ways.
7. Other errors.
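
Several of the errors listed above can be detected with a few lines of code. The following is a minimal Python sketch, not part of the original workshop material; the function names, the day/month/year date format and the 0 to 120 age range are assumptions chosen for illustration.

```python
from datetime import datetime


def is_valid_date(text, fmt="%d/%m/%Y"):
    """Return True if text is a real calendar date in the assumed format."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        # Raised for impossible dates such as 30 February.
        return False


def has_stray_spaces(text):
    """Return True if the value has spaces before or after it."""
    return text != text.strip()


def is_missing(text):
    """Return True if the value is absent or contains only spaces."""
    return text is None or text.strip() == ""


def is_plausible_age(age, low=0, high=120):
    """Return True if a human age lies inside the assumed plausible range."""
    return low <= age <= high


print(is_valid_date("30/02/2013"))
print(has_stray_spaces(" Kakamega "))
print(is_missing("   "))
print(is_plausible_age(879))
```

Each check only reports a problem; deciding whether to correct or delete the value remains an editorial judgment.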
Data cleaning is also known as:
1. Error Checking
2. Error Detection
3. Data Validation
4. Data Cleansing
5. Data Scrubbing
6. Error Correction

The process of data cleaning may include:
1. Format checks
2. Completeness checks
3. Reasonableness checks
4. Limit checks
5. Review of the data to identify outliers
6. Assessment of data by subject area experts (e.g. doctors assessing Kenya Health at a Glance data)
These processes usually result in flagging, documenting and subsequent checking and correcting of suspect records. In advanced data management, validation checks may also involve checking for compliance against applicable standards, rules, and conventions.
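
As an illustration of reviewing data for outliers, the sketch below flags values that lie far outside the middle half of a column. It uses the common 1.5 × interquartile range rule, which is an assumption made here and not a rule from the workshop; the ages are invented sample values.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    # Flag suspect values for checking; nothing is deleted here.
    return [value for value in values if value < low or value > high]


ages = [23, 31, 35, 28, 40, 37, 29, 879]  # invented sample data
print(flag_outliers(ages))
```

The function only flags suspect records, in line with the flag, document, then check and correct sequence described above.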
The general framework for data cleaning is:
1. Define and determine error types.
2. Search and identify error instances.
3. Correct the errors.
4. Document error instances and error types.
5. Modify data entry procedures (or the regular expressions used during data scraping) to reduce future errors.
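
The last two steps can be sketched in Python as follows. This is a hypothetical example: the pattern, the function name `parse_count` and the sample values are assumptions. A strict regular expression accepts only well-formed whole numbers at scraping time, and every rejected value is recorded in an error log together with its error type.

```python
import re

# Assumed pattern: a whole number, optionally with thousands separators ("1,234").
COUNT_PATTERN = re.compile(r"\d{1,3}(,\d{3})*|\d+")


def parse_count(raw, error_log):
    """Return the integer in raw, or None after logging the bad value."""
    text = raw.strip()
    if COUNT_PATTERN.fullmatch(text):
        return int(text.replace(",", ""))
    # Document the error instance and its type instead of guessing a value.
    error_log.append({"value": raw, "error_type": "not a whole number"})
    return None


log = []
scraped = ["1,234", " 56 ", "12O", "n/a"]  # "12O" ends with the letter O
cleaned = [parse_count(value, log) for value in scraped]
print(cleaned)
print(log)
```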
Data journalists often use these tools for data cleaning:
1. Open Refine.
2. Excel.
Advanced data cleaning may be done in SQL, Stata, SAS and other statistical applications. When errors are well documented and analyzed, the record helps data journalists and program managers prevent the same errors from happening again.

We shall go through the following steps to learn how to use Open Refine to clean data.
• Introduction
• Basic functionalities
• Advanced functionalities
• Summary

Initially developed by Google, OpenRefine is now maintained entirely by volunteers.
• OpenRefine is a desktop application (installed on our computers) that helps us understand and clean datasets.
• OpenRefine opens its interface in a web browser, but it works locally.
• Open Refine does not work on Internet Explorer.

What is Open Refine designed for?
• Understanding the dataset through filters and facets.
• Cleaning typos and adapting data formats.
• Deriving new data from the original data, e.g. generating a new column from a formula applied to existing columns.
• Reusing transformations: the cleaning steps can be saved as code, so that when a second dataset in the same format is imported, all the steps are run at once.
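
The idea of deriving a column and reusing saved steps can be sketched in Python. This is an analogy under stated assumptions, not OpenRefine's own mechanism: the column names, the step functions and the two small datasets are hypothetical.

```python
def trim(row):
    """Remove spaces before and after every value in the row."""
    return {key: value.strip() for key, value in row.items()}


def title_case_county(row):
    """Standardize the capitalization of the county name."""
    return {**row, "county": row["county"].title()}


def add_rate(row):
    """Derive a new column from two existing columns."""
    rate = int(row["cases"]) / int(row["population"]) * 1000
    return {**row, "cases_per_1000": round(rate, 1)}


# The saved "recipe": an ordered list of steps that can be replayed.
STEPS = [trim, title_case_county, add_rate]


def apply_steps(rows, steps=STEPS):
    """Run every saved step, in order, on every row."""
    for step in steps:
        rows = [step(row) for row in rows]
    return rows


# Two hypothetical datasets in the same format.
first = [{"county": " kakamega ", "cases": "120", "population": "60000"}]
second = [{"county": "NAIROBI", "cases": "300", "population": "100000 "}]
print(apply_steps(first))
print(apply_steps(second))
```

Because the steps are stored in one list, the second dataset is cleaned in exactly the same way as the first without repeating any manual work.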
What is Open Refine not designed for?
• Adding new information to a dataset.
• Making complex calculations (spreadsheet software such as MS Excel is better for this).
• Data visualization (there are other tools available to do that).
• Datasets with a very large number of columns, i.e. more than 80 (OpenRefine does column-based operations, so it would be tedious).

To understand how OpenRefine works, let’s look at an example:
1. Download and install Open Refine here.
2. Launch OpenRefine.
3. Find the project named: “F1Results2012-2003.google-refine.tar.gz”
4. Import the project into Refine.

Basic functionalities of Open Refine
Facets: These are like Excel filters but with counters.
• Text
• Numeric
• Timeline
• Custom (Facet by blank, Facet by error, etc.)

• Applying a filter enables us to work on the subset of data we are interested in.
• Adding a column based on another column lets us transform every value in that column at once.
• Split columns by a character separator. For example, split “Surname, Name” into the two columns “Surname” and “Name”.
Figure 1: The use of Facets, Text Filters and Clustering
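
A text facet is essentially a count of each distinct value in a column, and selecting one value filters the rows. The Python sketch below mimics that behavior; the function names and sample rows are hypothetical and this is not how OpenRefine is implemented.

```python
from collections import Counter


def text_facet(rows, column):
    """Count how many rows hold each distinct value of the column."""
    return Counter(row[column] for row in rows)


def filter_rows(rows, column, value):
    """Keep only the rows whose column equals the selected facet value."""
    return [row for row in rows if row[column] == value]


rows = [  # hypothetical sample rows
    {"county": "Kakamega", "year": "2012"},
    {"county": "Kakamega", "year": "2013"},
    {"county": "Nairobi", "year": "2013"},
    {"county": "", "year": "2013"},
]
facet = text_facet(rows, "county")
print(facet.most_common())
print(filter_rows(rows, "county", "Kakamega"))
```

The blank value appears in the counts as well, which is the idea behind a "Facet by blank".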

Figure 2: How to Split a column
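
For readers who want to see the logic behind splitting a column, here is a minimal Python sketch. The function name `split_column`, the column name `full_name` and the sample value are assumptions made for illustration.

```python
def split_column(row, column, separator, new_names):
    """Replace one column with several, splitting its value on a separator."""
    # Limit the number of splits so any extra text stays in the last column.
    parts = row[column].split(separator, len(new_names) - 1)
    parts = [part.strip() for part in parts]
    # Pad with blanks when the separator is missing from the value.
    parts += [""] * (len(new_names) - len(parts))
    new_row = {key: value for key, value in row.items() if key != column}
    new_row.update(zip(new_names, parts))
    return new_row


row = {"full_name": "Otieno, Jane"}  # hypothetical value
print(split_column(row, "full_name", ",", ["Surname", "Name"]))
```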

We can use Open Refine to:
• Rename/Remove columns.
• Execute common transformations.
• Remove white space.
• Data type conversion (number to text, etc.)
• Lowercase, uppercase, title case.
• Cut parts of a text (substring).
• Replace parts of a text (replace)
• Fill down adjacent cells
• Remove “matched” rows (after filtering some rows or selecting a value on a facet we can remove only the matched rows).
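
The transformations in this list correspond to simple operations on text and rows. The Python sketch below shows rough equivalents; it is only an analogy, and the function names and sample values are hypothetical.

```python
def fill_down(values):
    """Copy each non-blank value into the blank cells below it."""
    filled, last = [], ""
    for value in values:
        if value.strip():
            last = value
        filled.append(last)
    return filled


def remove_matching_rows(rows, column, value):
    """Drop the rows whose column equals value and keep the rest."""
    return [row for row in rows if row[column] != value]


name = "  kakamega COUNTY "
trimmed = name.strip()                              # remove white space
title_case = trimmed.title()                        # change to title case
first_part = title_case[:8]                         # cut part of a text
without_suffix = title_case.replace(" County", "")  # replace part of a text
as_number = int("42")                               # convert text to a number

years = fill_down(["2012", "", "", "2013", ""])
rows = [{"county": "Kakamega"}, {"county": "TOTAL"}, {"county": "Nairobi"}]
kept = remove_matching_rows(rows, "county", "TOTAL")
print(title_case, first_part, without_suffix, as_number)
print(years)
print(kept)
```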

Figure 3: How to edit cells through common transforms

NOTE: Most of these functions are found under common transforms.
Figure 4: How to remove all matching rows
Clustering: This helps to find similarities within texts in order to identify and standardize differences in the spelling and format of entries. For example, it can identify that “Kakamega,” “Kaka mega” and “Kakamega County” all refer to the same place. We can choose among different clustering algorithms, which range from finding very close matches to finding distant matches. OpenRefine does not merge clustered values automatically; instead it shows the clusters to the user, so in the end it is our decision whether the different entries should all have a uniform name.
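
To give a feel for how similar entries can be grouped, here is a simplified Python sketch. It is not one of OpenRefine's own clustering algorithms: it uses the standard library's `difflib.SequenceMatcher` with an assumed similarity threshold of 0.65, and it compares each value only with the first member of every existing group.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, drop punctuation and collapse runs of spaces."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def suggest_clusters(values, threshold=0.65):
    """Group values whose normalized text is similar to a group's first member."""
    clusters = []
    for value in values:
        key = normalize(value)
        for cluster in clusters:
            first = normalize(cluster[0])
            if SequenceMatcher(None, first, key).ratio() >= threshold:
                cluster.append(value)
                break
        else:
            # No existing group is similar enough, so start a new one.
            clusters.append([value])
    return clusters


names = ["Kakamega,", "Kaka mega", "Kakamega County", "Nairobi"]
print(suggest_clusters(names))
```

Like OpenRefine, the sketch only suggests groups. A lower threshold finds more distant matches but also more false ones, so a person should still review each group before merging.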

Figure 5: How to use clustering

Advanced functionalities to explore include:
• Obtaining new data through a web service.
• Retrieve coordinates based on address.
• Determine the language of a text.
• Get data from another project based on a common column (like VLOOKUP in MS Excel).
• Using “cell.cross”
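
The lookup idea behind fetching data from another project can be sketched in Python. This is an analogy only and does not use OpenRefine or its expression language; the function name `lookup_column`, the column names and the codes are hypothetical.

```python
def lookup_column(rows, other_rows, key, column):
    """Copy a column from other_rows into rows by matching on a common key."""
    index = {other[key]: other[column] for other in other_rows}
    # Rows with no match in the other table get a blank value.
    return [{**row, column: index.get(row[key], "")} for row in rows]


cases = [  # hypothetical main table
    {"county": "Kakamega", "cases": "120"},
    {"county": "Nairobi", "cases": "300"},
    {"county": "Unknown", "cases": "5"},
]
codes = [  # hypothetical lookup table with made-up codes
    {"county": "Kakamega", "code": "K01"},
    {"county": "Nairobi", "code": "N01"},
]
print(lookup_column(cases, codes, "county", "code"))
```

The match is exact, so the common column should be cleaned (trimmed and spelled consistently) in both tables before the lookup is run.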

Google tutorials:
1. Introduction
2. Data transformation
3. Data augmentation

• User manual
• Google Refine Expression Language (all the functions available for us to use on our transformations).