Cleaning and standardising geographic names in R with prep_geonames()

A hands-on workshop for systematic geographic name matching

R
Data cleaning
Geospatial
GIS
Author

Mohamed A Yusuf

Published

September 1, 2025

Important

Before you begin, we expect participants to have a basic working knowledge of R. If you are new to R or need a refresher, we recommend reviewing the earlier sessions on data visualization and data wrangling.

No prior experience with geographic data cleaning is required. However, familiarity with the previous mapping in R session will be useful. In this session, we focus on systematic approaches to geographic name standardisation. You will learn how to detect mismatches, apply string-distance algorithms, use interactive review for ambiguous cases, and build a reusable cache of corrections for future datasets.

Overview

Welcome back! In this post, we turn our attention to one of the most persistent challenges in public health data management: cleaning and standardising geographic names.

We’ll start with a quick look at why mismatched names create problems in analysis, reporting, and decision-making. From there, we’ll walk through a structured approach to name cleaning using the sntutils::prep_geonames() function. Along the way, we’ll explore common issues in real datasets, how string-distance algorithms can suggest likely matches, and how to integrate human review only where it is needed.

By the end of this session, you’ll understand how to move from manual, ad hoc fixes to a reproducible workflow that scales from hundreds to tens of thousands of records. You’ll also see how caching, auditing, and match statistics make the process faster, more transparent, and easier to trust.

Note

Learning Objectives

By the end of this session, you will be able to:

  • Understand the nature of geographic name inconsistencies and their impact on data quality.
  • Detect and standardise geographic names across multiple datasets.
  • Use interactive tools to review and resolve ambiguous matches.
  • Handle edge cases by correcting parent-level misassignments before matching.
  • Save and re-use corrections through a persistent caching system.
  • Validate the quality of name matching with summary statistics.
  • Produce simple visualisations to confirm that geographic names align with official boundaries.

The challenge of name matching

Geographic names are rarely consistent across datasets. The same administrative unit may appear with different spellings, formats, or even linked to the wrong parent. These problems are common in routine health data, survey microdata, and sometimes even in official shapefiles.

When names do not match, joins fail and indicators misalign. This leads to duplicate units in summaries, broken time series when names switch, and wrong denominators in coverage estimates. Analysts spend more time fixing data than analysing it, and programme teams lose trust in the results.

Two types of problems

It is important to distinguish between two categories:

1. Name inconsistencies. These are technical issues that can be addressed with systematic cleaning. Examples include:

  • Spelling variations (Kadunna vs Kaduna).
  • Case inconsistencies (KANO vs Kano).
  • Extra whitespace or hidden characters (L agos vs Lagos).
  • Abbreviations and aliases (FCT vs Federal Capital Territory).
  • Language differences (local vs English/French names).
  • Parent misassignments (district linked to the wrong region).
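Most of these technical issues can be reduced with simple text preprocessing before any matching. A minimal base-R sketch of the idea (illustrative only; prep_geonames() applies its own, more thorough preprocessing, including accent removal and punctuation handling):

```r
# Minimal name normalisation: case, repeated whitespace, stray spaces
normalise_name <- function(x) {
  x <- toupper(x)           # case inconsistencies: KANO vs Kano
  x <- gsub("\\s+", " ", x) # collapse repeated/hidden whitespace
  x <- trimws(x)            # strip leading/trailing spaces
  x
}

normalise_name(c("KANO", "Kano", "L agos ", "Lagos"))
#> [1] "KANO"   "KANO"   "L AGOS" "LAGOS"
```

Note that simple preprocessing cannot resolve everything: "L AGOS" still differs from "LAGOS", which is where string-distance matching comes in.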

2. Administrative changes. These are structural changes that cannot be solved by name cleaning alone. Examples include:

  • New districts created (splits).
  • Old districts merged or dissolved.
  • Boundaries re-drawn to shift populations between units.
  • Renaming linked to political or administrative reforms.

The first set of problems can be handled with a reproducible name-matching workflow. The second set requires authoritative boundary data and programme-level decisions on how to manage new or obsolete units. This section focuses only on the first category: fixing names.

Why manual cleaning fails

Manual cleaning is often the default approach, but it does not scale well as datasets grow. Common issues include:

  • Error-prone. Fatigue leads to inconsistent fixes.
  • Inconsistent. Different analysts handle the same issue differently.
  • Non-reproducible. Corrections exist only in local files or memory.
  • Time-consuming. Large datasets take hours or days to clean.
  • Lost knowledge. Corrections disappear when staff leave.
  • Unverifiable. Hard to track what was changed, when, or by whom.

Some teams try to improve this by coding row-by-row replacements. While more systematic than manual edits, this approach is still limited:

  • Hard-coded fixes apply only to one dataset.
  • Scripts become brittle and difficult to maintain.
  • Overlapping fixes spread across multiple files.
  • No single auditable record of decisions exists.
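As a sketch of this pattern, a hard-coded script often looks like the following (toy data; the specific replacements are hypothetical examples, not corrections from the DHS dataset):

```r
# Hard-coded, one-off fixes (anti-pattern): brittle and non-reusable
df <- data.frame(adm1 = c("Kadunna", "FCT", "Kano"))

df$adm1[df$adm1 == "Kadunna"] <- "Kaduna"
df$adm1[df$adm1 == "FCT"] <- "Federal Capital Territory"
# ...dozens more lines, rewritten from scratch for every new dataset
```

Each fix applies only to this one data frame; nothing is cached, audited, or shared with other analysts.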

The result is that each new dataset restarts the cycle of ad hoc corrections. The same names are fixed again and again, creating extra effort for analysts, limited reproducibility, and reduced confidence in outputs. Public health data systems need to be auditable—analysts should be able to show how a name was standardised, who made the decision, and when it was applied. Without a structured workflow, results differ each time, cannot be peer-reviewed, and do not build cumulative institutional knowledge.

Why structured approaches are needed

The challenges of name matching highlight the importance of moving beyond one-off fixes. Ad hoc edits may solve immediate problems but do not create reproducible or scalable workflows. A more systematic approach is needed—one that records decisions, reduces duplication of effort, and makes the process auditable.

Such approaches provide a reproducible way to handle recurring issues in name standardisation. They reduce the need for repeated ad hoc fixes and strengthen confidence in analytical outputs.

Step-by-step Guide

In this session we present prep_geonames(), which illustrates how a structured workflow can be applied in practice.

Key elements include:

  • Algorithms to suggest likely matches.
  • Interactive review to resolve ambiguous cases.
  • A persistent record or cache of accepted decisions.
  • Match statistics and exports to support transparency.
  • Simple validation checks or visualisations to confirm consistency.
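To illustrate the string-distance idea behind the suggested matches, base R's adist() computes edit distances that can rank candidate names. This is only a sketch; prep_geonames() uses its own matching internally:

```r
# Suggest the closest reference name for each raw name via edit distance
suggest_match <- function(raw, reference) {
  d <- adist(raw, reference, ignore.case = TRUE) # Levenshtein distances
  reference[apply(d, 1, which.min)]              # nearest reference name
}

reference <- c("KADUNA", "KANO", "LAGOS")
suggest_match(c("Kadunna", "kano", "Lagoss"), reference)
#> [1] "KADUNA" "KANO"   "LAGOS"
```

In practice a distance threshold and human review are still needed, since the nearest candidate is not always the correct one.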

As an example, we use the 2018 DHS dataset to map the modelled percentage of the population without basic water in Nigeria at the adm2 level. To ensure a reliable join, we align DHS administrative names with the WHO ADM2 shapefile through hierarchy-aware matching, cached decisions, and correction of parent misassignments. This process standardises spelling and boundary labels so that all DHS adm1 and adm2 entries match the WHO reference. We validate the result using unmatched-case checks and a map overlay, confirming a complete join with no missing polygons. The workflow is transparent, auditable, and reusable across DHS indicators and other datasets.

Step 1: Install packages

In the following steps we use prep_geonames(), which is part of sntutils, an R package by AHADI to support Subnational Tailoring (SNT) of malaria interventions. The package provides a number of helper functions for preparing, cleaning, and analysing data to support decision-making at district level and below.

To install sntutils and a number of other packages (dplyr, ggplot2, etc.), we first need the pak package. pak makes it easy to install R packages, including development versions from GitHub, along with their dependencies.

# 1) install pak
install.packages("pak")

# 2) install packages
pak::pkg_install(
  c(
    "dplyr",        # data manipulation
    "here",         # for relative file paths
    "ggplot2",      # plotting
    "cli",          # console alerts/messages
    "grid",         # unit sizing for plot theme elements
    "ahadi-analytics/sntutils" # for prep_geonames and other helpers
  ),
  dependencies = TRUE
)

Step 2: Load data

For this example we use modelled data from the 2018 Nigeria DHS, available through the DHS Local Data Mapping Tool. Each download includes indicator tables, population counts, uncertainty intervals, and shapefiles. Here we use the Admin 2 tables and shapefiles, specifically the estimates and confidence intervals for the population without basic water. The dataset has been slightly modified with intentional name inconsistencies to demonstrate the cleaning workflow.

# import data
nga_dhs <- sntutils::read(
  here::here("01_data/dhs/Admin2_NGDHS2018Table.xlsx")
) |>
  dplyr::select(
    adm0 = adm0_name,
    adm1 = adm1_name,
    adm2 = adm2_name,
    pop_no_basic_water_value = nobaswatv, # % without basic water
    pop_no_basic_water_ci_lower = nobaswatl, # % without basic water lower CI
    pop_no_basic_water_ci_upper = nobaswatu # % without basic water upper CI
  )

# check the data
dplyr::glimpse(nga_dhs)
Output
Rows: 774
Columns: 6
$ adm0                        <chr> "Nigeria", "Nigeria", "Nigeria", "Nigeria"…
$ adm1                        <chr> "Abia", "Abia", "Abia", "Abia", "Abia", "A…
$ adm2                        <chr> "Arochukwu", "Bende", "Ikwuano", "Ohafia",…
$ pop_no_basic_water_value    <dbl> 26.20, 27.94, 26.33, 26.05, 27.44, 17.98, …
$ pop_no_basic_water_ci_lower <dbl> 8.92, 11.58, 8.26, 10.87, 7.99, 4.30, 6.24…
$ pop_no_basic_water_ci_upper <dbl> 54.22, 48.40, 49.39, 48.69, 59.50, 42.94, …

We also need a reference shapefile of administrative boundaries. For this example we use publicly available WHO ADM2 data, which can be accessed from the WHO ArcGIS service; here we use sntutils to download it.

# get/load NGA shapefile (WHO)
nga_shp <- sntutils::download_shapefile(
  country_codes = "NGA",
  admin_level = "adm2",
  dest_path = here::here("01_data/shapefile")
)

# check shapefile
dplyr::glimpse(nga_shp)
Output
Rows: 774
Columns: 7
$ adm0_code  <chr> "NGA", "NGA", "NGA", "NGA", "NGA", "NGA", "NGA", "NGA", "NG…
$ adm0       <chr> "NIGERIA", "NIGERIA", "NIGERIA", "NIGERIA", "NIGERIA", "NIG…
$ adm1       <chr> "JIGAWA", "ZAMFARA", "BAYELSA", "LAGOS", "ABIA", "BAYELSA",…
$ adm2       <chr> "MIGA", "GUSAU", "NEMBE", "EPE", "UMUAHIA SOUTH", "SAGBAMA"…
$ start_date <date> 2000-01-01, 2000-01-01, 2000-01-01, 2000-01-01, 2000-01-01…
$ end_date   <date> 9999-12-31, 9999-12-31, 9999-12-31, 9999-12-31, 9999-12-31…
$ geom       <MULTIPOLYGON [°]> MULTIPOLYGON (((9.71926 12...., MULTIPOLYGON (…

Step 3: Check matches

Before running any cleaning, inspect how well your dataset matches the reference shapefile (lookup_data). Use sntutils::calculate_match_stats() to summarise matches by level.

The function is hierarchy-aware. To check or match at adm2, you must also provide adm1 and adm0. Column names must be identical in both datasets. The same rule applies at finer levels (e.g., adm3, settlements): you must include all higher levels. If your target data use adm0/adm1/adm2, your lookup_data must use the same names. The same rules apply later in prep_geonames().
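Because both data frames must share identical hierarchy column names, a quick base-R check before matching can catch naming problems early (using the objects loaded in Step 2):

```r
# Confirm both datasets carry the required hierarchy columns
required <- c("adm0", "adm1", "adm2")
setdiff(required, names(nga_dhs)) # character(0) means all present
setdiff(required, names(nga_shp)) # character(0) means all present
```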

# check matches
sntutils::calculate_match_stats(
  nga_dhs,
  lookup_data = nga_shp,
  level0 = "adm0",
  level1 = "adm1",
  level2 = "adm2"
)
Output
── ℹ Match Summary ─────────────────────────────────────────────────────────────
! Both sides have unmatched names; see per-level lines below.

Target data as base N                                                       
• adm0 (level0): 1 out of 1 matched                                         
• adm1 (level1): 36 out of 39 matched                                       
• adm2 (level2): 698 out of 774 matched                                     
Lookup data as base N                                                       
• adm0 (level0): 1 out of 1 matched                                         
• adm1 (level1): 36 out of 37 matched                                       
• adm2 (level2): 698 out of 774 matched                                     

All country names in nga_dhs matched those in nga_shp. At the adm1 level, 36 of the 39 distinct adm1 entries in the DHS data matched against the shapefile's 37 units (36 states plus the FCT). At the adm2 level, 698 of 774 districts matched, leaving 76 names to resolve.
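A simple way to see which specific units are unmatched is to compare hierarchy keys directly. A base-R sketch using the objects loaded above (keys are upper-cased so that case alone does not count as a mismatch):

```r
# Build "ADM1 | ADM2" keys so each district is checked within its state
dhs_keys <- unique(toupper(paste(nga_dhs$adm1, nga_dhs$adm2, sep = " | ")))
who_keys <- unique(toupper(paste(nga_shp$adm1, nga_shp$adm2, sep = " | ")))

setdiff(dhs_keys, who_keys) # DHS units with no WHO counterpart
setdiff(who_keys, dhs_keys) # WHO units missing from the DHS table
```

Inspecting these lists before matching helps distinguish simple misspellings from genuine structural differences.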

Step 4: Start matching

Step 4.1: Use the shapefile as the reference

We will harmonise the admin names in the target table (nga_dhs) to the official shapefile (nga_shp) with sntutils::prep_geonames(). The function is hierarchy-aware and proceeds top-down (adm0 → adm1 → adm2). At each level it applies robust text preprocessing (uppercasing, accent removal, whitespace normalisation, punctuation handling) and uses string-distance methods to suggest likely matches within the correct parent context (e.g., adm2 names are only compared within the same adm1).

Key benefits of prep_geonames():

  • Works across up to six levels (level0 to level5).
  • Restricts comparisons within parents (contextually correct candidates).
  • Caches decisions via cache_path so future runs automatically reuse prior corrections and can be shared across analysts for consistency.

# 1) set paths for the decision cache and unmatched-name export
cache_path <- here::here("01_data/cache/geoname_cleaning_nga.rds")
unmatched_path <- here::here("01_data/cache/nga_unmatched_adm2.rds")

# 2) Run interactive, hierarchy-based matching
nga_dhs_cleaned <- sntutils::prep_geonames(
  target_df = nga_dhs,   # data to be harmonised
  lookup_df = nga_shp,   # authoritative reference
  level0 = "adm0",
  level1 = "adm1",
  level2 = "adm2",
  cache_path = cache_path,
  interactive = TRUE
)

Target data as base N                                                       
• adm0 (level0): 1 out of 1 matched                                         
• adm1 (level1): 37 out of 37 matched                                       
• adm2 (level2): 774 out of 774 matched                                     
Lookup data as base N                                                       
• adm0 (level0): 1 out of 1 matched                                         
• adm1 (level1): 37 out of 37 matched                                       
• adm2 (level2): 774 out of 774 matched                                     
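With every adm1 and adm2 name matched, one final check is a map overlay: join the cleaned table onto the polygons and confirm no unit is left unfilled. A sketch assuming nga_shp is an sf object and that the cleaned names now follow the WHO spelling:

```r
# Join cleaned DHS values onto WHO polygons; NA fill would reveal gaps
nga_map <- dplyr::left_join(
  nga_shp,
  nga_dhs_cleaned,
  by = c("adm0", "adm1", "adm2")
)

ggplot2::ggplot(nga_map) +
  ggplot2::geom_sf(ggplot2::aes(fill = pop_no_basic_water_value)) +
  ggplot2::labs(
    fill = "% without basic water",
    title = "Population without basic water, Nigeria (DHS 2018, adm2)"
  ) +
  ggplot2::theme_void()
```

Any polygon drawn in the NA colour would indicate a unit that still failed to join.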

You can set interactive = FALSE to run in non-interactive mode, which lets headless runs proceed without prompts. When a cache is available, cached decisions are applied automatically; when it is absent, names are left unchanged and processing continues.
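The same call in non-interactive form might look like this (a sketch reusing the objects and cache path from the steps above):

```r
# Non-interactive run: cached decisions are applied automatically,
# and names without a cached decision are left unchanged
nga_dhs_cleaned <- sntutils::prep_geonames(
  target_df = nga_dhs,
  lookup_df = nga_shp,
  level0 = "adm0",
  level1 = "adm1",
  level2 = "adm2",
  cache_path = cache_path,
  interactive = FALSE # no prompts; suitable for scheduled pipelines
)
```

This makes the workflow usable in automated reporting pipelines once the cache has been built interactively.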

Below is a short video demonstrating the prep_geonames() interface and how it can be used to clean administrative names interactively.