Country Names¶

Introduction¶

The function clean_country() cleans a column containing country names and/or ISO 3166 country codes, and standardizes them in a desired format. The function validate_country() validates either a single country or a column of countries, returning True if the value is valid, and False otherwise. The countries/regions supported and the regular expressions used can be found on GitHub.

Countries can be converted to and from the following formats via the input_format and output_format parameters:

Short country name (name): “United States”
Official state name (official): “United States of America”
ISO 3166-1 alpha-2 (alpha-2): “US”
ISO 3166-1 alpha-3 (alpha-3): “USA”
ISO 3166-1 numeric (numeric): “840”
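
As a rough illustration of how the five formats relate, the sketch below stores one record per country and converts between formats by lookup. The COUNTRIES table and the convert() helper are hypothetical and hold only two entries; they are not part of dataprep, which relies on a much larger dataset and on regular expressions for names.

```python
# Hypothetical lookup table: one record per country, keyed by format name.
COUNTRIES = [
    {"name": "United States", "official": "United States of America",
     "alpha-2": "US", "alpha-3": "USA", "numeric": "840"},
    {"name": "Canada", "official": "Canada",
     "alpha-2": "CA", "alpha-3": "CAN", "numeric": "124"},
]


def convert(value, input_format, output_format):
    """Return value rewritten in output_format, or None if no record matches."""
    key = str(value).strip().casefold()
    for record in COUNTRIES:
        if record[input_format].casefold() == key:
            return record[output_format]
    return None


print(convert("us", "alpha-2", "official"))  # United States of America
print(convert(124, "numeric", "alpha-3"))    # CAN
```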

Setting input_format to “auto” infers the input format automatically. A tuple of input formats may also be passed to indicate that each input may be in any of the listed formats.

The strict parameter allows for control over the type of matching used for the “name” and “official” input formats.

False (default for clean_country()): search the input for a regex match
True (default for validate_country()): look for a direct match with a country value in the same format
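
The difference between the two modes can be sketched with Python's re module. The pattern and the matches() helper below are hypothetical simplifications for a single country; dataprep's actual regular expressions are more elaborate.

```python
import re

# Hypothetical pattern for one country; the real regexes are more elaborate.
CANADA = re.compile(r"canada", re.IGNORECASE)


def matches(value, strict):
    """strict=True requires the whole (stripped) input to be the country name;
    strict=False accepts the pattern anywhere inside the input."""
    text = value.strip()
    if strict:
        return text.casefold() == "canada"
    return CANADA.search(text) is not None


print(matches("foo canada bar", strict=False))  # True
print(matches("foo canada bar", strict=True))   # False
print(matches(" Canada ", strict=True))         # True
```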

The fuzzy_dist parameter sets the maximum edit distance (number of single character insertions, deletions or substitutions required to change one word into the other) allowed between the input and a country regex.

0 (default): only countries that match a regex exactly are cleaned
1: countries at most 1 edit from matching a regex are cleaned
n: countries at most n edits from matching a regex are cleaned
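
To make the notion of edit distance concrete, here is a minimal sketch of the standard Levenshtein distance between two plain strings. It is only an illustration of the metric: the edit_distance() helper is not a dataprep function, and dataprep measures the distance to a regex match rather than to a literal name.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character insertions,
    deletions or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


print(edit_distance("cnada", "canada"))           # 1
print(edit_distance("cxnda", "canada"))           # 2
print(edit_distance("afghnitan", "afghanistan"))  # 2
```

Under this metric, “cnada” would be within reach at fuzzy_dist=1, while “cxnda” and “afghnitan” would need fuzzy_dist=2.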

Invalid parsing is handled with the errors parameter:

“coerce” (default): invalid parsing will be set to NaN
“ignore”: invalid parsing will return the input
“raise”: invalid parsing will raise an exception
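
The three options can be pictured as a small dispatch on the errors value. The handle_invalid() helper below is a hypothetical sketch, and the choice of ValueError as the exception type is an assumption rather than something the source specifies.

```python
import math


def handle_invalid(value, errors="coerce"):
    """Decide what to return for an input that could not be parsed."""
    if errors == "coerce":
        return math.nan  # replace the value with NaN
    if errors == "ignore":
        return value     # pass the input through unchanged
    if errors == "raise":
        # Assumed exception type; the real library may raise something else.
        raise ValueError(f"unable to parse value {value!r}")
    raise ValueError(f"unknown errors option {errors!r}")


print(handle_invalid("hello"))                   # nan
print(handle_invalid("hello", errors="ignore"))  # hello
```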

After cleaning, a report is printed that provides the following information:

How many values were cleaned (the value must have been transformed).
How many values could not be parsed.
A summary of the cleaned data: how many values are in the correct format, and how many values are NaN.
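
The counts in the report can be sketched as follows. The cleaning_report() helper is hypothetical, None stands in for NaN, and treating the string “NULL” as a null input (so that it is not counted as unparsed) is an assumption inferred from the example output in this document.

```python
def cleaning_report(inputs, outputs, null_inputs=(None, "NULL")):
    """Count cleaned, unparsed, valid and null values for one cleaning run.

    None stands in for NaN; treating "NULL" as a null input is an assumption.
    """
    pairs = list(zip(inputs, outputs))
    # Cleaned: a non-null result that differs from the input.
    cleaned = sum(1 for i, o in pairs if o is not None and o != i)
    # Unparsed: a null result for an input that was not already null.
    unparsed = sum(1 for i, o in pairs if o is None and i not in null_inputs)
    nulls = sum(1 for _, o in pairs if o is None)
    return {
        "cleaned": cleaned,
        "unparsed": unparsed,
        "correct_format": len(pairs) - nulls,
        "null": nulls,
    }


inputs = ["Canada", " ireland ", "hello", None, "NULL"]
outputs = ["Canada", "Ireland", None, None, None]
print(cleaning_report(inputs, outputs))
# {'cleaned': 1, 'unparsed': 1, 'correct_format': 2, 'null': 3}
```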

The following sections demonstrate the functionality of clean_country() and validate_country().

An example dataset with country values¶

[1]:

import pandas as pd
import numpy as np
df = pd.DataFrame({
    "country": [
        "Canada", "foo canada bar", "cnada", "northern ireland", " ireland ",
        "congo, kinshasa", "congo, brazzaville", 304, "233", " tr ", "ARG",
        "hello", np.nan, "NULL"
    ]
})
df

[1]:

	country
0	Canada
1	foo canada bar
2	cnada
3	northern ireland
4	ireland
5	congo, kinshasa
6	congo, brazzaville
7	304
8	233
9	tr
10	ARG
11	hello
12	NaN
13	NULL

1. Default `clean_country()`¶

By default, the input_format parameter is set to “auto” (the input format is determined automatically) and the output_format parameter is set to “name”. The fuzzy_dist parameter is set to 0 and strict is False. The errors parameter is set to “coerce” (invalid values are set to NaN).

[2]:

from dataprep.clean import clean_country
clean_country(df, "country")

Country Cleaning Report:
        8 values cleaned (57.14%)
        3 values unable to be parsed (21.43%), set to NaN
Result contains 9 (64.29%) values in the correct format and 5 null values (35.71%)

[2]:

	country	country_clean
0	Canada	Canada
1	foo canada bar	Canada
2	cnada	NaN
3	northern ireland	NaN
4	ireland	Ireland
5	congo, kinshasa	DR Congo
6	congo, brazzaville	Congo Republic
7	304	Greenland
8	233	Estonia
9	tr	Turkey
10	ARG	Argentina
11	hello	NaN
12	NaN	NaN
13	NULL	NaN

Note that “Canada” is not counted as cleaned in the report since its cleaned value is the same as the input. Also, “northern ireland” is invalid because it is part of the United Kingdom. Kinshasa and Brazzaville are the capital cities of their respective countries.

2. Input formats¶

This section demonstrates the supported country input formats.

name¶

If the input contains a match with one of the country regexes then it is successfully converted.

[3]:

clean_country(df, "country", input_format="name")

Country Cleaning Report:
        4 values cleaned (28.57%)
        7 values unable to be parsed (50.0%), set to NaN
Result contains 5 (35.71%) values in the correct format and 9 null values (64.29%)

[3]:

	country	country_clean
0	Canada	Canada
1	foo canada bar	Canada
2	cnada	NaN
3	northern ireland	NaN
4	ireland	Ireland
5	congo, kinshasa	DR Congo
6	congo, brazzaville	Congo Republic
7	304	NaN
8	233	NaN
9	tr	NaN
10	ARG	NaN
11	hello	NaN
12	NaN	NaN
13	NULL	NaN

official¶

Behaves the same as input_format="name".

[4]:

clean_country(df, "country", input_format="official")

Country Cleaning Report:
        4 values cleaned (28.57%)
        7 values unable to be parsed (50.0%), set to NaN
Result contains 5 (35.71%) values in the correct format and 9 null values (64.29%)

[4]:

	country	country_clean
0	Canada	Canada
1	foo canada bar	Canada
2	cnada	NaN
3	northern ireland	NaN
4	ireland	Ireland
5	congo, kinshasa	DR Congo
6	congo, brazzaville	Congo Republic
7	304	NaN
8	233	NaN
9	tr	NaN
10	ARG	NaN
11	hello	NaN
12	NaN	NaN
13	NULL	NaN

alpha-2¶

Looks for a direct match with an ISO 3166-1 alpha-2 country code, case-insensitive and ignoring leading and trailing whitespace.

[5]:

clean_country(df, "country", input_format="alpha-2")

Country Cleaning Report:
        1 values cleaned (7.14%)
        11 values unable to be parsed (78.57%), set to NaN
Result contains 1 (7.14%) values in the correct format and 13 null values (92.86%)

[5]:

	country	country_clean
0	Canada	NaN
1	foo canada bar	NaN
2	cnada	NaN
3	northern ireland	NaN
4	ireland	NaN
5	congo, kinshasa	NaN
6	congo, brazzaville	NaN
7	304	NaN
8	233	NaN
9	tr	Turkey
10	ARG	NaN
11	hello	NaN
12	NaN	NaN
13	NULL	NaN

alpha-3¶

Looks for a direct match with an ISO 3166-1 alpha-3 country code, case-insensitive and ignoring leading and trailing whitespace.

[6]:

clean_country(df, "country", input_format="alpha-3")

Country Cleaning Report:
        1 values cleaned (7.14%)
        11 values unable to be parsed (78.57%), set to NaN
Result contains 1 (7.14%) values in the correct format and 13 null values (92.86%)

[6]:

	country	country_clean
0	Canada	NaN
1	foo canada bar	NaN
2	cnada	NaN
3	northern ireland	NaN
4	ireland	NaN
5	congo, kinshasa	NaN
6	congo, brazzaville	NaN
7	304	NaN
8	233	NaN
9	tr	NaN
10	ARG	Argentina
11	hello	NaN
12	NaN	NaN
13	NULL	NaN

numeric¶

Looks for a direct match with an ISO 3166-1 numeric country code, ignoring leading and trailing whitespace. Works on integers and strings.

[7]:

clean_country(df, "country", input_format="numeric")

Country Cleaning Report:
        2 values cleaned (14.29%)
        10 values unable to be parsed (71.43%), set to NaN
Result contains 2 (14.29%) values in the correct format and 12 null values (85.71%)

[7]:

	country	country_clean
0	Canada	NaN
1	foo canada bar	NaN
2	cnada	NaN
3	northern ireland	NaN
4	ireland	NaN
5	congo, kinshasa	NaN
6	congo, brazzaville	NaN
7	304	Greenland
8	233	Estonia
9	tr	NaN
10	ARG	NaN
11	hello	NaN
12	NaN	NaN
13	NULL	NaN
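
Accepting both integers and strings amounts to normalizing the input to a code string before the lookup. The sketch below is a hypothetical illustration with a few codes only; padding short values with leading zeros (so that 32 is read as “032”) is an assumption of this sketch, not documented behavior.

```python
# Hypothetical subset of ISO 3166-1 numeric codes.
NUMERIC_TO_NAME = {
    "304": "Greenland",
    "233": "Estonia",
    "124": "Canada",
    "032": "Argentina",
}


def lookup_numeric(value):
    """Normalize an int or str to a three-digit code string and look it up."""
    text = str(value).strip()
    if not text.isdigit():
        return None
    # Assumption: short codes are left-padded with zeros before the lookup.
    return NUMERIC_TO_NAME.get(text.zfill(3))


print(lookup_numeric(304))      # Greenland
print(lookup_numeric(" 233 "))  # Estonia
print(lookup_numeric("tr"))     # None
```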

(name, alpha-2)¶

A tuple containing any combination of input formats may be used; an input is cleaned if it matches any of the given formats.

[8]:

clean_country(df, "country", input_format=("name", "alpha-2"))

Country Cleaning Report:
        5 values cleaned (35.71%)
        6 values unable to be parsed (42.86%), set to NaN
Result contains 6 (42.86%) values in the correct format and 8 null values (57.14%)

[8]:

	country	country_clean
0	Canada	Canada
1	foo canada bar	Canada
2	cnada	NaN
3	northern ireland	NaN
4	ireland	Ireland
5	congo, kinshasa	DR Congo
6	congo, brazzaville	Congo Republic
7	304	NaN
8	233	NaN
9	tr	Turkey
10	ARG	NaN
11	hello	NaN
12	NaN	NaN
13	NULL	NaN
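
One way to picture a tuple of input formats is as a list of lookups tried one after another. The TABLES mapping and the clean_one() helper below are hypothetical, and trying the formats in the order given is an assumption of this sketch.

```python
# Hypothetical per-format lookup tables with a few entries each.
TABLES = {
    "name": {"canada": "Canada", "ireland": "Ireland"},
    "alpha-2": {"tr": "Turkey", "ca": "Canada"},
    "alpha-3": {"arg": "Argentina"},
}


def clean_one(value, input_format):
    """Try each allowed input format in turn; return the first match or None."""
    formats = (input_format,) if isinstance(input_format, str) else input_format
    key = str(value).strip().casefold()
    for fmt in formats:
        match = TABLES[fmt].get(key)
        if match is not None:
            return match
    return None


print(clean_one(" tr ", ("name", "alpha-2")))  # Turkey
print(clean_one("ARG", ("name", "alpha-2")))   # None
print(clean_one("ARG", "alpha-3"))             # Argentina
```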

3. Output formats¶

This section demonstrates the supported output country formats.

official¶

[9]:

clean_country(df, "country", output_format="official")

Country Cleaning Report:
        8 values cleaned (57.14%)
        3 values unable to be parsed (21.43%), set to NaN
Result contains 9 (64.29%) values in the correct format and 5 null values (35.71%)

[9]:

	country	country_clean
0	Canada	Canada
1	foo canada bar	Canada
2	cnada	NaN
3	northern ireland	NaN
4	ireland	Ireland
5	congo, kinshasa	Democratic Republic of the Congo
6	congo, brazzaville	Republic of the Congo
7	304	Greenland
8	233	Republic of Estonia
9	tr	Republic of Turkey
10	ARG	Argentine Republic
11	hello	NaN
12	NaN	NaN
13	NULL	NaN

alpha-2¶

[10]:

clean_country(df, "country", output_format="alpha-2")

Country Cleaning Report:
        9 values cleaned (64.29%)
        3 values unable to be parsed (21.43%), set to NaN
Result contains 9 (64.29%) values in the correct format and 5 null values (35.71%)

[10]:

	country	country_clean
0	Canada	CA
1	foo canada bar	CA
2	cnada	NaN
3	northern ireland	NaN
4	ireland	IE
5	congo, kinshasa	CD
6	congo, brazzaville	CG
7	304	GL
8	233	EE
9	tr	TR
10	ARG	AR
11	hello	NaN
12	NaN	NaN
13	NULL	NaN

alpha-3¶

[11]:

clean_country(df, "country", output_format="alpha-3")

Country Cleaning Report:
        8 values cleaned (57.14%)
        3 values unable to be parsed (21.43%), set to NaN
Result contains 9 (64.29%) values in the correct format and 5 null values (35.71%)

[11]:

	country	country_clean
0	Canada	CAN
1	foo canada bar	CAN
2	cnada	NaN
3	northern ireland	NaN
4	ireland	IRL
5	congo, kinshasa	COD
6	congo, brazzaville	COG
7	304	GRL
8	233	EST
9	tr	TUR
10	ARG	ARG
11	hello	NaN
12	NaN	NaN
13	NULL	NaN

numeric¶

[12]:

clean_country(df, "country", output_format="numeric")

Country Cleaning Report:
        8 values cleaned (57.14%)
        3 values unable to be parsed (21.43%), set to NaN
Result contains 9 (64.29%) values in the correct format and 5 null values (35.71%)

[12]:

	country	country_clean
0	Canada	124
1	foo canada bar	124
2	cnada	NaN
3	northern ireland	NaN
4	ireland	372
5	congo, kinshasa	180
6	congo, brazzaville	178
7	304	304
8	233	233
9	tr	792
10	ARG	32
11	hello	NaN
12	NaN	NaN
13	NULL	NaN

Any combination of input and output formats may be used.¶

[13]:

clean_country(df, "country", input_format="alpha-2", output_format="official")

Country Cleaning Report:
        1 values cleaned (7.14%)
        11 values unable to be parsed (78.57%), set to NaN
Result contains 1 (7.14%) values in the correct format and 13 null values (92.86%)

[13]:

	country	country_clean
0	Canada	NaN
1	foo canada bar	NaN
2	cnada	NaN
3	northern ireland	NaN
4	ireland	NaN
5	congo, kinshasa	NaN
6	congo, brazzaville	NaN
7	304	NaN
8	233	NaN
9	tr	Republic of Turkey
10	ARG	NaN
11	hello	NaN
12	NaN	NaN
13	NULL	NaN

4. `strict` parameter¶

This parameter allows for control over the type of matching used for “name” and “official” input formats. When False, the input is searched for a regex match. When True, matching is done by looking for a direct match with a country in the same format.

[14]:

clean_country(df, "country", strict=True)

Country Cleaning Report:
        5 values cleaned (35.71%)
        6 values unable to be parsed (42.86%), set to NaN
Result contains 6 (42.86%) values in the correct format and 8 null values (57.14%)

[14]:

	country	country_clean
0	Canada	Canada
1	foo canada bar	NaN
2	cnada	NaN
3	northern ireland	NaN
4	ireland	Ireland
5	congo, kinshasa	NaN
6	congo, brazzaville	NaN
7	304	Greenland
8	233	Estonia
9	tr	Turkey
10	ARG	Argentina
11	hello	NaN
12	NaN	NaN
13	NULL	NaN

“foo canada bar”, “congo, kinshasa” and “congo, brazzaville” are now invalid because they are not a direct match with a country in the “name” or “official” formats.

5. Fuzzy Matching¶

The fuzzy_dist parameter sets the maximum edit distance (number of single character insertions, deletions or substitutions required to change one word into the other) allowed between the input and a country regex. If an input is successfully cleaned by clean_country() with fuzzy_dist=0 then that input with one character inserted, deleted or substituted will match with fuzzy_dist=1. This parameter only applies to the “name” and “official” input formats.

`fuzzy_dist=1`¶

Countries at most one edit away from matching a regex are successfully cleaned.

[15]:

df = pd.DataFrame({
    "country": [
        "canada", "cnada", "australa", "xntarctica", "koreea", "cxnda",
        "afghnitan", "country: cnada", "foo indnesia bar"
    ]
})
clean_country(df, "country", fuzzy_dist=1)

Country Cleaning Report:
        7 values cleaned (77.78%)
        2 values unable to be parsed (22.22%), set to NaN
Result contains 7 (77.78%) values in the correct format and 2 null values (22.22%)

[15]:

	country	country_clean
0	canada	Canada
1	cnada	Canada
2	australa	Australia
3	xntarctica	Antarctica
4	koreea	South Korea
5	cxnda	NaN
6	afghnitan	NaN
7	country: cnada	Canada
8	foo indnesia bar	Indonesia

`fuzzy_dist=2`¶

Countries at most two edits away from matching a regex are successfully cleaned.

[16]:

clean_country(df, "country", fuzzy_dist=2)

Country Cleaning Report:
        9 values cleaned (100.0%)
Result contains 9 (100.0%) values in the correct format and 0 null values (0.0%)

[16]:

	country	country_clean
0	canada	Canada
1	cnada	Canada
2	australa	Australia
3	xntarctica	Antarctica
4	koreea	South Korea
5	cxnda	Canada
6	afghnitan	Afghanistan
7	country: cnada	Canada
8	foo indnesia bar	Indonesia

6. `inplace` parameter¶

Setting inplace=True deletes the given column from the returned DataFrame. A new column containing the cleaned countries is added, with a title in the format "{original title}_clean".

[17]:

clean_country(df, "country", fuzzy_dist=2, inplace=True)

Country Cleaning Report:
        9 values cleaned (100.0%)
Result contains 9 (100.0%) values in the correct format and 0 null values (0.0%)

[17]:

	country_clean
0	Canada
1	Canada
2	Australia
3	Antarctica
4	South Korea
5	Canada
6	Afghanistan
7	Canada
8	Indonesia

7. `validate_country()`¶

validate_country() returns True when the input is a valid country value; otherwise it returns False. Valid types are the same as for clean_country(). By default strict=True, as opposed to clean_country(), which has strict set to False by default. The default input_format is “auto”.

[18]:

from dataprep.clean import validate_country

print(validate_country("switzerland"))
print(validate_country("country = united states"))
print(validate_country("country = united states", strict=False))
print(validate_country("ca"))
print(validate_country(800))

True
False
True
True
True

`validate_country()` on a pandas series¶

Because strict=True by default, the inputs “foo canada bar”, “congo, kinshasa” and “congo, brazzaville” are invalid: they don’t directly match a country in the “name” or “official” formats.

[19]:

df = pd.DataFrame({
    "country": [
        "Canada", "foo canada bar", "cnada", "northern ireland", " ireland ",
        "congo, kinshasa", "congo, brazzaville", 304, "233", " tr ", "ARG",
        "hello", np.nan, "NULL"
    ]
})

df["valid"] = validate_country(df["country"])
df

[19]:

	country	valid
0	Canada	True
1	foo canada bar	False
2	cnada	False
3	northern ireland	False
4	ireland	True
5	congo, kinshasa	False
6	congo, brazzaville	False
7	304	True
8	233	True
9	tr	True
10	ARG	True
11	hello	False
12	NaN	False
13	NULL	False

`strict=False`¶

For the “name” and “official” input formats, the input is searched for a regex match.

[20]:

df["valid"] = validate_country(df["country"], strict=False)
df

[20]:

	country	valid
0	Canada	True
1	foo canada bar	True
2	cnada	False
3	northern ireland	False
4	ireland	True
5	congo, kinshasa	True
6	congo, brazzaville	True
7	304	True
8	233	True
9	tr	True
10	ARG	True
11	hello	False
12	NaN	False
13	NULL	False

Specifying `input_format`¶

[21]:

df["valid"] = validate_country(df["country"], input_format="numeric")
df

[21]:

	country	valid
0	Canada	False
1	foo canada bar	False
2	cnada	False
3	northern ireland	False
4	ireland	False
5	congo, kinshasa	False
6	congo, brazzaville	False
7	304	True
8	233	True
9	tr	False
10	ARG	False
11	hello	False
12	NaN	False
13	NULL	False

Credit¶

The country data and regular expressions used are based on the country_converter project.
