Country Names

Introduction

The function clean_country() cleans a column containing country names and/or ISO 3166 country codes, and standardizes them in a desired format. The function validate_country() validates either a single country or a column of countries, returning True if the value is valid, and False otherwise. The countries/regions supported and the regular expressions used can be found on GitHub.

Countries can be converted to and from the following formats via the input_format and output_format parameters:

  • Short country name (name): “United States”

  • Official state name (official): “United States of America”

  • ISO 3166-1 alpha-2 (alpha-2): “US”

  • ISO 3166-1 alpha-3 (alpha-3): “USA”

  • ISO 3166-1 numeric (numeric): “840”

input_format can be set to “auto”, which infers the input format automatically. A tuple of input formats may also be used to indicate that the input may be in any of the given formats.
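As a rough illustration of what converting between these formats involves, the sketch below uses a small hand-written lookup table instead of the library's country data. The COUNTRIES table, the FORMATS tuple and the convert() helper are assumptions made for this example and are not part of dataprep; real inputs in the “name” and “official” formats are matched with regular expressions, which this sketch does not attempt.

```python
# Hypothetical lookup table: one row per country, one key per format.
# The values are taken from the examples in this guide.
COUNTRIES = [
    {"name": "United States", "official": "United States of America",
     "alpha-2": "US", "alpha-3": "USA", "numeric": "840"},
    {"name": "Turkey", "official": "Republic of Turkey",
     "alpha-2": "TR", "alpha-3": "TUR", "numeric": "792"},
    {"name": "Estonia", "official": "Republic of Estonia",
     "alpha-2": "EE", "alpha-3": "EST", "numeric": "233"},
]
FORMATS = ("name", "official", "alpha-2", "alpha-3", "numeric")


def convert(value, input_format="auto", output_format="name"):
    """Return value in output_format, or None when nothing matches."""
    # "auto" tries every format; a single string is treated as a 1-tuple.
    if input_format == "auto":
        candidates = FORMATS
    elif isinstance(input_format, str):
        candidates = (input_format,)
    else:
        candidates = tuple(input_format)

    # Compare case-insensitively and ignore surrounding whitespace.
    key = str(value).strip().lower()
    for row in COUNTRIES:
        for fmt in candidates:
            if row[fmt].lower() == key:
                return row[output_format]
    return None


print(convert("US", output_format="official"))  # United States of America
print(convert(" tr ", input_format=("name", "alpha-2")))  # Turkey
print(convert(233, input_format="numeric", output_format="alpha-3"))  # EST
print(convert("tr", input_format="alpha-3"))  # None
```

Restricting input_format narrows the set of columns that are searched, which is why “tr” is rejected when only alpha-3 codes are allowed.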

The strict parameter allows for control over the type of matching used for the “name” and “official” input formats.

  • False (default for clean_country()), search the input for a regex match

  • True (default for validate_country()), look for a direct match with a country value in the same format

The fuzzy_dist parameter sets the maximum edit distance (number of single character insertions, deletions or substitutions required to change one word into the other) allowed between the input and a country regex.

  • 0 (default), only countries that match a regex exactly are successfully cleaned

  • 1, countries at most 1 edit from matching a regex are successfully cleaned

  • n, countries at most n edits from matching a regex are successfully cleaned
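To make the notion of edit distance concrete, here is a minimal sketch of the standard Levenshtein distance between two strings. The edit_distance() helper is written for this example only; it compares against a literal country name, whereas the library measures the distance to a regex.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("cnada", "canada"))           # 1
print(edit_distance("cxnda", "canada"))           # 2
print(edit_distance("afghnitan", "afghanistan"))  # 2
```

Under this measure, “cnada” would be accepted with fuzzy_dist=1, while “cxnda” and “afghnitan” would need fuzzy_dist=2.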

Invalid parsing is handled with the errors parameter:

  • “coerce” (default): invalid parsing will be set to NaN

  • “ignore”: invalid parsing will return the input

  • “raise”: invalid parsing will raise an exception
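The three options can be pictured as a small per-value policy. The apply_errors() helper below is hypothetical, and the choice of ValueError is an assumption for this sketch; the guide does not state which exception type the library raises.

```python
import math


def apply_errors(value, parsed, errors="coerce"):
    """Pick the output for one value; parsed is None when parsing failed."""
    if parsed is not None:
        return parsed
    if errors == "coerce":
        return math.nan   # invalid input becomes NaN
    if errors == "ignore":
        return value      # invalid input is passed through unchanged
    if errors == "raise":
        # Exception type chosen for illustration only.
        raise ValueError(f"unable to parse value {value!r}")
    raise ValueError(f"unknown errors option {errors!r}")


print(apply_errors("tr", "Turkey"))            # Turkey
print(apply_errors("hello", None))             # nan
print(apply_errors("hello", None, "ignore"))   # hello
```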

After cleaning, a report is printed that provides the following information:

  • How many values were cleaned (the value must have been transformed).

  • How many values could not be parsed.

  • A summary of the cleaned data: how many values are in the correct format, and how many values are NaN.

The following sections demonstrate the functionality of clean_country() and validate_country().

An example dataset with country values

[1]:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "country": [
        "Canada", "foo canada bar", "cnada", "northern ireland", " ireland ",
        "congo, kinshasa", "congo, brazzaville", 304, "233", " tr ", "ARG",
        "hello", np.nan, "NULL"
    ]
})
df
[1]:
country
0 Canada
1 foo canada bar
2 cnada
3 northern ireland
4 ireland
5 congo, kinshasa
6 congo, brazzaville
7 304
8 233
9 tr
10 ARG
11 hello
12 NaN
13 NULL

1. Default clean_country()

By default, the input_format parameter is set to “auto” (the input format is determined automatically) and the output_format parameter is set to “name”. The fuzzy_dist parameter is set to 0 and strict is False. The errors parameter is set to “coerce” (invalid values are set to NaN).

[2]:
from dataprep.clean import clean_country
clean_country(df, "country")
Country Cleaning Report:
        8 values cleaned (57.14%)
        3 values unable to be parsed (21.43%), set to NaN
Result contains 9 (64.29%) values in the correct format and 5 null values (35.71%)
[2]:
country country_clean
0 Canada Canada
1 foo canada bar Canada
2 cnada NaN
3 northern ireland NaN
4 ireland Ireland
5 congo, kinshasa DR Congo
6 congo, brazzaville Congo Republic
7 304 Greenland
8 233 Estonia
9 tr Turkey
10 ARG Argentina
11 hello NaN
12 NaN NaN
13 NULL NaN

Note that “Canada” is not counted as cleaned in the report, since its cleaned value is the same as the input. Also, “northern ireland” is invalid because it is part of the United Kingdom. Kinshasa and Brazzaville are the capital cities of their respective countries.
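The figures in the report can be re-derived by hand from the table above. The sketch below is a reconstruction under stated assumptions, not the library's reporting code: the summarize() and is_null() helpers are hypothetical, and treating the string “NULL” as a null input is inferred from the counts in the report.

```python
import math

# (input, cleaned) pairs copied from the default clean_country() output above.
NAN = math.nan
rows = [
    ("Canada", "Canada"), ("foo canada bar", "Canada"), ("cnada", NAN),
    ("northern ireland", NAN), (" ireland ", "Ireland"),
    ("congo, kinshasa", "DR Congo"), ("congo, brazzaville", "Congo Republic"),
    (304, "Greenland"), ("233", "Estonia"), (" tr ", "Turkey"),
    ("ARG", "Argentina"), ("hello", NAN), (NAN, NAN), ("NULL", NAN),
]


def is_null(value):
    # Assumption: the string "NULL" counts as null, as the report suggests.
    if isinstance(value, float) and math.isnan(value):
        return True
    return value == "NULL"


def summarize(rows):
    # Cleaned: a non-null result that differs from the input.
    cleaned = sum(1 for raw, out in rows if not is_null(out) and raw != out)
    # Unparsed: a non-null input that produced a null result.
    unparsed = sum(1 for raw, out in rows if not is_null(raw) and is_null(out))
    correct = sum(1 for _, out in rows if not is_null(out))
    nulls = len(rows) - correct
    return cleaned, unparsed, correct, nulls


total = len(rows)
labels = ("cleaned", "unparsed", "correct", "null")
for label, count in zip(labels, summarize(rows)):
    print(f"{label}: {count} ({round(100 * count / total, 2)}%)")
# cleaned: 8 (57.14%)
# unparsed: 3 (21.43%)
# correct: 9 (64.29%)
# null: 5 (35.71%)
```

“Canada” contributes to the 9 correct values but not to the 8 cleaned ones, because its output equals its input.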

2. Input formats

This section demonstrates the supported country input formats.

name

If the input contains a match with one of the country regexes then it is successfully converted.

[3]:
clean_country(df, "country", input_format="name")
Country Cleaning Report:
        4 values cleaned (28.57%)
        7 values unable to be parsed (50.0%), set to NaN
Result contains 5 (35.71%) values in the correct format and 9 null values (64.29%)
[3]:
country country_clean
0 Canada Canada
1 foo canada bar Canada
2 cnada NaN
3 northern ireland NaN
4 ireland Ireland
5 congo, kinshasa DR Congo
6 congo, brazzaville Congo Republic
7 304 NaN
8 233 NaN
9 tr NaN
10 ARG NaN
11 hello NaN
12 NaN NaN
13 NULL NaN

official

Does the same thing as input_format="name".

[4]:
clean_country(df, "country", input_format="official")
Country Cleaning Report:
        4 values cleaned (28.57%)
        7 values unable to be parsed (50.0%), set to NaN
Result contains 5 (35.71%) values in the correct format and 9 null values (64.29%)
[4]:
country country_clean
0 Canada Canada
1 foo canada bar Canada
2 cnada NaN
3 northern ireland NaN
4 ireland Ireland
5 congo, kinshasa DR Congo
6 congo, brazzaville Congo Republic
7 304 NaN
8 233 NaN
9 tr NaN
10 ARG NaN
11 hello NaN
12 NaN NaN
13 NULL NaN

alpha-2

Looks for a direct match with an ISO 3166-1 alpha-2 country code, case insensitive and ignoring leading and trailing whitespace.

[5]:
clean_country(df, "country", input_format="alpha-2")
Country Cleaning Report:
        1 values cleaned (7.14%)
        11 values unable to be parsed (78.57%), set to NaN
Result contains 1 (7.14%) values in the correct format and 13 null values (92.86%)
[5]:
country country_clean
0 Canada NaN
1 foo canada bar NaN
2 cnada NaN
3 northern ireland NaN
4 ireland NaN
5 congo, kinshasa NaN
6 congo, brazzaville NaN
7 304 NaN
8 233 NaN
9 tr Turkey
10 ARG NaN
11 hello NaN
12 NaN NaN
13 NULL NaN

alpha-3

Looks for a direct match with an ISO 3166-1 alpha-3 country code, case insensitive and ignoring leading and trailing whitespace.

[6]:
clean_country(df, "country", input_format="alpha-3")
Country Cleaning Report:
        1 values cleaned (7.14%)
        11 values unable to be parsed (78.57%), set to NaN
Result contains 1 (7.14%) values in the correct format and 13 null values (92.86%)
[6]:
country country_clean
0 Canada NaN
1 foo canada bar NaN
2 cnada NaN
3 northern ireland NaN
4 ireland NaN
5 congo, kinshasa NaN
6 congo, brazzaville NaN
7 304 NaN
8 233 NaN
9 tr NaN
10 ARG Argentina
11 hello NaN
12 NaN NaN
13 NULL NaN

numeric

Looks for a direct match with an ISO 3166-1 numeric country code, ignoring leading and trailing whitespace. Works on integers and strings.

[7]:
clean_country(df, "country", input_format="numeric")
Country Cleaning Report:
        2 values cleaned (14.29%)
        10 values unable to be parsed (71.43%), set to NaN
Result contains 2 (14.29%) values in the correct format and 12 null values (85.71%)
[7]:
country country_clean
0 Canada NaN
1 foo canada bar NaN
2 cnada NaN
3 northern ireland NaN
4 ireland NaN
5 congo, kinshasa NaN
6 congo, brazzaville NaN
7 304 Greenland
8 233 Estonia
9 tr NaN
10 ARG NaN
11 hello NaN
12 NaN NaN
13 NULL NaN

(name, alpha-2)

A tuple containing any combination of input formats may be used to clean any of the given input formats.

[8]:
clean_country(df, "country", input_format=("name", "alpha-2"))
Country Cleaning Report:
        5 values cleaned (35.71%)
        6 values unable to be parsed (42.86%), set to NaN
Result contains 6 (42.86%) values in the correct format and 8 null values (57.14%)
[8]:
country country_clean
0 Canada Canada
1 foo canada bar Canada
2 cnada NaN
3 northern ireland NaN
4 ireland Ireland
5 congo, kinshasa DR Congo
6 congo, brazzaville Congo Republic
7 304 NaN
8 233 NaN
9 tr Turkey
10 ARG NaN
11 hello NaN
12 NaN NaN
13 NULL NaN

3. Output formats

This section demonstrates the supported output country formats.

official

[9]:
clean_country(df, "country", output_format="official")
Country Cleaning Report:
        8 values cleaned (57.14%)
        3 values unable to be parsed (21.43%), set to NaN
Result contains 9 (64.29%) values in the correct format and 5 null values (35.71%)
[9]:
country country_clean
0 Canada Canada
1 foo canada bar Canada
2 cnada NaN
3 northern ireland NaN
4 ireland Ireland
5 congo, kinshasa Democratic Republic of the Congo
6 congo, brazzaville Republic of the Congo
7 304 Greenland
8 233 Republic of Estonia
9 tr Republic of Turkey
10 ARG Argentine Republic
11 hello NaN
12 NaN NaN
13 NULL NaN

alpha-2

[10]:
clean_country(df, "country", output_format="alpha-2")
Country Cleaning Report:
        9 values cleaned (64.29%)
        3 values unable to be parsed (21.43%), set to NaN
Result contains 9 (64.29%) values in the correct format and 5 null values (35.71%)
[10]:
country country_clean
0 Canada CA
1 foo canada bar CA
2 cnada NaN
3 northern ireland NaN
4 ireland IE
5 congo, kinshasa CD
6 congo, brazzaville CG
7 304 GL
8 233 EE
9 tr TR
10 ARG AR
11 hello NaN
12 NaN NaN
13 NULL NaN

alpha-3

[11]:
clean_country(df, "country", output_format="alpha-3")
Country Cleaning Report:
        8 values cleaned (57.14%)
        3 values unable to be parsed (21.43%), set to NaN
Result contains 9 (64.29%) values in the correct format and 5 null values (35.71%)
[11]:
country country_clean
0 Canada CAN
1 foo canada bar CAN
2 cnada NaN
3 northern ireland NaN
4 ireland IRL
5 congo, kinshasa COD
6 congo, brazzaville COG
7 304 GRL
8 233 EST
9 tr TUR
10 ARG ARG
11 hello NaN
12 NaN NaN
13 NULL NaN

numeric

[12]:
clean_country(df, "country", output_format="numeric")
Country Cleaning Report:
        8 values cleaned (57.14%)
        3 values unable to be parsed (21.43%), set to NaN
Result contains 9 (64.29%) values in the correct format and 5 null values (35.71%)
[12]:
country country_clean
0 Canada 124
1 foo canada bar 124
2 cnada NaN
3 northern ireland NaN
4 ireland 372
5 congo, kinshasa 180
6 congo, brazzaville 178
7 304 304
8 233 233
9 tr 792
10 ARG 32
11 hello NaN
12 NaN NaN
13 NULL NaN

Any combination of input and output formats may be used.

[13]:
clean_country(df, "country", input_format="alpha-2", output_format="official")
Country Cleaning Report:
        1 values cleaned (7.14%)
        11 values unable to be parsed (78.57%), set to NaN
Result contains 1 (7.14%) values in the correct format and 13 null values (92.86%)
[13]:
country country_clean
0 Canada NaN
1 foo canada bar NaN
2 cnada NaN
3 northern ireland NaN
4 ireland NaN
5 congo, kinshasa NaN
6 congo, brazzaville NaN
7 304 NaN
8 233 NaN
9 tr Republic of Turkey
10 ARG NaN
11 hello NaN
12 NaN NaN
13 NULL NaN

4. strict parameter

This parameter allows for control over the type of matching used for “name” and “official” input formats. When False, the input is searched for a regex match. When True, matching is done by looking for a direct match with a country in the same format.

[14]:
clean_country(df, "country", strict=True)
Country Cleaning Report:
        5 values cleaned (35.71%)
        6 values unable to be parsed (42.86%), set to NaN
Result contains 6 (42.86%) values in the correct format and 8 null values (57.14%)
[14]:
country country_clean
0 Canada Canada
1 foo canada bar NaN
2 cnada NaN
3 northern ireland NaN
4 ireland Ireland
5 congo, kinshasa NaN
6 congo, brazzaville NaN
7 304 Greenland
8 233 Estonia
9 tr Turkey
10 ARG Argentina
11 hello NaN
12 NaN NaN
13 NULL NaN

“foo canada bar”, “congo, kinshasa” and “congo, brazzaville” are now invalid because they are not a direct match with a country in the “name” or “official” formats.
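The difference between the two matching modes can be sketched with Python's re module. The pattern and the set of accepted names below are deliberately simplistic stand-ins chosen for this example; the library's real regexes are more elaborate, and matches_canada() is a hypothetical helper.

```python
import re

# Illustrative pattern only, not the regex used by the library.
CANADA_PATTERN = re.compile(r"canada", re.IGNORECASE)
CANADA_NAMES = {"canada"}  # accepted values for a direct match


def matches_canada(value, strict=False):
    text = str(value).strip()
    if strict:
        # Direct match: the whole value must equal a known name.
        return text.lower() in CANADA_NAMES
    # Regex search: the name may appear anywhere inside the value.
    return CANADA_PATTERN.search(text) is not None


print(matches_canada("foo canada bar"))               # True
print(matches_canada("foo canada bar", strict=True))  # False
print(matches_canada("Canada", strict=True))          # True
```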

5. Fuzzy Matching

The fuzzy_dist parameter sets the maximum edit distance (number of single character insertions, deletions or substitutions required to change one word into the other) allowed between the input and a country regex. If an input is successfully cleaned by clean_country() with fuzzy_dist=0 then that input with one character inserted, deleted or substituted will match with fuzzy_dist=1. This parameter only applies to the “name” and “official” input formats.
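One way to picture fuzzy matching inside a longer string is approximate substring search: find the smallest edit distance between a name and any substring of the input. The sketch below does this with a standard dynamic-programming table. It is an assumption-laden illustration, not the library's implementation: min_substring_distance() and fuzzy_match() are hypothetical helpers, and a literal lower-case name stands in for the country regex.

```python
def min_substring_distance(pattern, text):
    """Smallest edit distance between pattern and any substring of text."""
    # prev[i]: best distance between pattern[:i] and a substring of the text
    # read so far that ends at the current position. Entry 0 is always 0,
    # so a match may start anywhere in text.
    prev = list(range(len(pattern) + 1))
    best = prev[-1]
    for ch in text:
        curr = [0]
        for i, pch in enumerate(pattern, start=1):
            curr.append(min(
                prev[i] + 1,                # extra character in text
                curr[i - 1] + 1,            # character missing from text
                prev[i - 1] + (pch != ch),  # substitution (free if equal)
            ))
        best = min(best, curr[-1])
        prev = curr
    return best


def fuzzy_match(pattern, text, fuzzy_dist=0):
    return min_substring_distance(pattern, text.lower()) <= fuzzy_dist


print(fuzzy_match("canada", "country: cnada", fuzzy_dist=1))       # True
print(fuzzy_match("indonesia", "foo indnesia bar", fuzzy_dist=1))  # True
print(fuzzy_match("canada", "cxnda", fuzzy_dist=1))                # False
print(fuzzy_match("canada", "cxnda", fuzzy_dist=2))                # True
```

Because the match may start and end anywhere, surrounding text such as “country: ” or “foo … bar” does not count toward the distance.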

fuzzy_dist=1

Countries at most one edit away from matching a regex are successfully cleaned.

[15]:
df = pd.DataFrame({
    "country": [
        "canada", "cnada", "australa", "xntarctica", "koreea", "cxnda",
        "afghnitan", "country: cnada", "foo indnesia bar"
    ]
})
clean_country(df, "country", fuzzy_dist=1)
Country Cleaning Report:
        7 values cleaned (77.78%)
        2 values unable to be parsed (22.22%), set to NaN
Result contains 7 (77.78%) values in the correct format and 2 null values (22.22%)
[15]:
country country_clean
0 canada Canada
1 cnada Canada
2 australa Australia
3 xntarctica Antarctica
4 koreea South Korea
5 cxnda NaN
6 afghnitan NaN
7 country: cnada Canada
8 foo indnesia bar Indonesia

fuzzy_dist=2

Countries at most two edits away from matching a regex are successfully cleaned.

[16]:
clean_country(df, "country", fuzzy_dist=2)
Country Cleaning Report:
        9 values cleaned (100.0%)
Result contains 9 (100.0%) values in the correct format and 0 null values (0.0%)
[16]:
country country_clean
0 canada Canada
1 cnada Canada
2 australa Australia
3 xntarctica Antarctica
4 koreea South Korea
5 cxnda Canada
6 afghnitan Afghanistan
7 country: cnada Canada
8 foo indnesia bar Indonesia

6. inplace parameter

When inplace=True, the given column is deleted from the returned dataframe. A new column containing the cleaned country values is added with a title in the format "{original title}_clean".
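The effect on the column layout can be sketched without pandas, using a plain dict of lists as a stand-in for the dataframe. The add_clean_column() helper is hypothetical and only mirrors the naming and deletion behaviour described above.

```python
def add_clean_column(table, column, cleaned, inplace=False):
    """table maps column titles to lists; a new table is returned."""
    result = dict(table)
    result[f"{column}_clean"] = list(cleaned)
    if inplace:
        del result[column]  # drop the original column from the result
    return result


table = {"country": ["canada", "cnada"]}
cleaned = ["Canada", "Canada"]
print(list(add_clean_column(table, "country", cleaned)))
# ['country', 'country_clean']
print(list(add_clean_column(table, "country", cleaned, inplace=True)))
# ['country_clean']
```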

[17]:
clean_country(df, "country", fuzzy_dist=2, inplace=True)
Country Cleaning Report:
        9 values cleaned (100.0%)
Result contains 9 (100.0%) values in the correct format and 0 null values (0.0%)
[17]:
country_clean
0 Canada
1 Canada
2 Australia
3 Antarctica
4 South Korea
5 Canada
6 Afghanistan
7 Canada
8 Indonesia

7. validate_country()

validate_country() returns True when the input is a valid country value; otherwise it returns False. Valid types are the same as for clean_country(). By default strict=True, as opposed to clean_country(), which has strict set to False by default. The default input_format is “auto”.
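The scalar-versus-column behaviour can be sketched as a thin wrapper around a per-value check. Everything here is a simplification made for the example: the KNOWN set is a tiny hand-picked sample that mixes all formats, only direct matches are attempted (as with strict=True), and validate() and is_valid() are hypothetical helpers.

```python
import math

# Hypothetical set of accepted values, lower-cased; the real data is far larger.
KNOWN = {"canada", "ireland", "tr", "arg", "304", "233"}


def is_valid(value):
    # Missing values are never valid.
    if isinstance(value, float) and math.isnan(value):
        return False
    return str(value).strip().lower() in KNOWN


def validate(values):
    """Return a bool for a scalar, or a list of bools for a list or tuple."""
    if isinstance(values, (list, tuple)):
        return [is_valid(v) for v in values]
    return is_valid(values)


print(validate("Canada"))  # True
print(validate(["foo canada bar", 304, " tr ", math.nan, "NULL"]))
# [False, True, True, False, False]
```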

[18]:
from dataprep.clean import validate_country

print(validate_country("switzerland"))
print(validate_country("country = united states"))
print(validate_country("country = united states", strict=False))
print(validate_country("ca"))
print(validate_country(800))
True
False
True
True
True

validate_country() on a pandas series

Since strict=True by default, the inputs “foo canada bar”, “congo, kinshasa” and “congo, brazzaville” are invalid since they don’t directly match a country in the “name” or “official” formats.

[19]:
df = pd.DataFrame({
    "country": [
        "Canada", "foo canada bar", "cnada", "northern ireland", " ireland ",
        "congo, kinshasa", "congo, brazzaville", 304, "233", " tr ", "ARG",
        "hello", np.nan, "NULL"
    ]
})

df["valid"] = validate_country(df["country"])
df
[19]:
country valid
0 Canada True
1 foo canada bar False
2 cnada False
3 northern ireland False
4 ireland True
5 congo, kinshasa False
6 congo, brazzaville False
7 304 True
8 233 True
9 tr True
10 ARG True
11 hello False
12 NaN False
13 NULL False

strict=False

For the “name” and “official” input formats, the input is searched for a regex match.

[20]:
df["valid"] = validate_country(df["country"], strict=False)
df
[20]:
country valid
0 Canada True
1 foo canada bar True
2 cnada False
3 northern ireland False
4 ireland True
5 congo, kinshasa True
6 congo, brazzaville True
7 304 True
8 233 True
9 tr True
10 ARG True
11 hello False
12 NaN False
13 NULL False

Specifying input_format

[21]:
df["valid"] = validate_country(df["country"], input_format="numeric")
df
[21]:
country valid
0 Canada False
1 foo canada bar False
2 cnada False
3 northern ireland False
4 ireland False
5 congo, kinshasa False
6 congo, brazzaville False
7 304 True
8 233 True
9 tr False
10 ARG False
11 hello False
12 NaN False
13 NULL False

Credit

The country data and regular expressions used are based on the country_converter project.