URLs

Introduction

The function clean_url() cleans a DataFrame column containing URLs and extracts the important parameters, including the cleaned path, queries, and scheme. The function validate_url() validates either a single URL or a column of URLs, returning True if the value is valid and False otherwise.

clean_url() extracts the important features of each URL and creates an additional column containing them as key-value pairs. It extracts the following features:

  • scheme (string)

  • host (string)

  • cleaned path (string)

  • queries (key-value pairs)
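To make the four features concrete, here is a minimal sketch that derives them with Python's standard urllib.parse module. It is not how clean_url() is implemented; the helper name extract_url_features is hypothetical, and only the key names (scheme, host, url_clean, queries) are taken from the url_details output shown later on this page.

```python
from urllib.parse import parse_qsl, urlsplit


def extract_url_features(url):
    """Split a URL into scheme, host, cleaned URL (query string removed), and queries."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        # Rebuild the URL without the query string or fragment.
        "url_clean": f"{parts.scheme}://{parts.netloc}{parts.path}",
        # parse_qsl returns (key, value) pairs; dict() keeps the last value per key.
        "queries": dict(parse_qsl(parts.query)),
    }


features = extract_url_features(
    "http://www.facebookee.com/otherpath?auth=facebookeeauth&token=iwusdkc"
    "&not_token=hiThere&another_token=12323423"
)
print(features)
```

For this input the sketch produces the same four values that appear in the url_details cell for the second row of the example DataFrame.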

Remove authentication tokens: Sometimes we would like to remove sensitive information that is often contained in a URL, for example access tokens or user information. clean_url() provides an option to remove this information with the remove_auth parameter. The usage of all parameters is explained in depth in the sections below.

Invalid parsing is handled with the errors parameter:

  • “coerce” (default): invalid parsing will be set to NaN

  • “ignore”: invalid parsing will return the input

  • “raise”: invalid parsing will raise an exception

After cleaning, a report is printed that provides the following information:

  • How many values were cleaned (the value must have been transformed).

  • How many values could not be parsed.

  • A summary of the cleaned data: how many values are in the correct format, and how many values are NaN.
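The figures in the report can be cross-checked with simple arithmetic. The sketch below uses the counts from the 13-row example DataFrame used throughout this page (5 parsed values, 5 unparseable values, 3 values that were already null). The function report_counts is a hypothetical helper written for illustration, not part of dataprep.

```python
def report_counts(total, parsed, unparsed, errors="coerce"):
    """Reproduce the counts and percentages shown in the cleaning report."""
    already_null = total - parsed - unparsed
    # With errors="coerce", unparseable values become NaN and add to the null count.
    # With errors="ignore", they are left as-is, so only the original nulls remain.
    nulls = already_null + unparsed if errors == "coerce" else already_null

    def pct(count):
        return f"{count / total:.2%}"

    return {
        "parsed": pct(parsed),
        "unparsed": pct(unparsed),
        "nulls": nulls,
        "nulls_pct": pct(nulls),
    }


coerce_summary = report_counts(total=13, parsed=5, unparsed=5)
ignore_summary = report_counts(total=13, parsed=5, unparsed=5, errors="ignore")
print(coerce_summary)
print(ignore_summary)
```

This explains why the null count in the report is 8 (61.54%) with errors set to "coerce" but 3 (23.08%) with errors set to "ignore".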

The following sections demonstrate the functionality of clean_url() and validate_url().

[1]:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "url": [
        "random text which is not a url",
        "http://www.facebookee.com/otherpath?auth=facebookeeauth&token=iwusdkc&not_token=hiThere&another_token=12323423",
        "https://www.sfu.ca/ficticiouspath?auth=sampletoken1&studentid=1234&loc=van",
        "notaurl",
        np.nan,
        None,
        "https://www.sfu.ca/ficticiouspath?auth=sampletoken2&studentid=1230&loc=bur",
        "",
        {
            "not_a_url": True
        },
        "2345678",
        345345345,
        "https://www.sfu.ca/ficticiouspath?auth=sampletoken3&studentid=1231&loc=sur",
        "https://www.sfu.ca/ficticiouspath?auth=sampletoken1&studentid=1232&loc=van",
    ]
})
df
[1]:
url
0 random text which is not a url
1 http://www.facebookee.com/otherpath?auth=faceb...
2 https://www.sfu.ca/ficticiouspath?auth=samplet...
3 notaurl
4 NaN
5 None
6 https://www.sfu.ca/ficticiouspath?auth=samplet...
7
8 {'not_a_url': True}
9 2345678
10 345345345
11 https://www.sfu.ca/ficticiouspath?auth=samplet...
12 https://www.sfu.ca/ficticiouspath?auth=samplet...

1. default: clean_url()

By default, the parameters are set to inplace=False, split=False, remove_auth=False, report=True, and errors="coerce".

[2]:
from dataprep.clean import clean_url
df_default = clean_url(df, column="url")
df_default
URL Cleaning Report:
        5 values parsed (38.46%)
        5 values unable to be parsed (38.46%), set to NaN
Result contains 5 (38.46%) parsed key-value pairs and 8 null values (61.54%)
[2]:
url url_details
0 random text which is not a url NaN
1 http://www.facebookee.com/otherpath?auth=faceb... {'scheme': 'http', 'host': 'www.facebookee.com...
2 https://www.sfu.ca/ficticiouspath?auth=samplet... {'scheme': 'https', 'host': 'www.sfu.ca', 'url...
3 notaurl NaN
4 NaN NaN
5 None NaN
6 https://www.sfu.ca/ficticiouspath?auth=samplet... {'scheme': 'https', 'host': 'www.sfu.ca', 'url...
7 NaN
8 {'not_a_url': True} NaN
9 2345678 NaN
10 345345345 NaN
11 https://www.sfu.ca/ficticiouspath?auth=samplet... {'scheme': 'https', 'host': 'www.sfu.ca', 'url...
12 https://www.sfu.ca/ficticiouspath?auth=samplet... {'scheme': 'https', 'host': 'www.sfu.ca', 'url...

We can see that the new DataFrame df_default contains a new column, url_details. Its name follows the convention original_column_name_details (url_details in our case).

Now let us see what one of the cells in url_details looks like.

[3]:
df_default["url_details"][1]
[3]:
{'scheme': 'http',
 'host': 'www.facebookee.com',
 'url_clean': 'http://www.facebookee.com/otherpath',
 'queries': {'auth': 'facebookeeauth',
  'token': 'iwusdkc',
  'not_token': 'hiThere',
  'another_token': '12323423'}}

2. remove_auth parameter

Sometimes we need to remove sensitive information when parsing a URL. We can do this in clean_url() by setting the remove_auth parameter to True, or by passing a list of parameter names to be removed. Hence remove_auth can be a boolean value or a list of strings.

When remove_auth is set to the boolean value True, clean_url() looks for auth tokens based on the default list of token names (provided below) and removes them. When remove_auth is set to a list of strings, clean_url() takes the union of the user-provided list and the default list and uses it as the new set of token names to be removed.

[4]:
default_list = {
    "access_token",
    "auth_key",
    "auth",
    "password",
    "username",
    "login",
    "token",
    "passcode",
    "access-token",
    "auth-key",
    "authentication",
    "authentication-key",
}

Let's have a look at the same DataFrame and the two scenarios described above (by looking at the second row).

a. remove_auth = True (boolean)

[5]:
df_remove_auth_boolean = clean_url(df, column="url", remove_auth=True)
df_remove_auth_boolean["url_details"][1]
URL Cleaning Report:
        5 values parsed (38.46%)
        5 values unable to be parsed (38.46%), set to NaN
Removed 6 auth queries from 5 rows
Result contains 5 (38.46%) parsed key-value pairs and 8 null values (61.54%)
[5]:
{'scheme': 'http',
 'host': 'www.facebookee.com',
 'url_clean': 'http://www.facebookee.com/otherpath',
 'queries': {'not_token': 'hiThere', 'another_token': '12323423'}}

As we can see, the queries auth and token were removed from the result, but not_token and another_token were kept. This is because only auth and token appear in default_list. Also notice the additional line in the report stating how many queries were removed from how many rows.

b. remove_auth = list of strings

[6]:
df_remove_auth_list = clean_url(df, column="url", remove_auth=["another_token"])
df_remove_auth_list["url_details"][1]
URL Cleaning Report:
        5 values parsed (38.46%)
        5 values unable to be parsed (38.46%), set to NaN
Removed 7 auth queries from 5 rows
Result contains 5 (38.46%) parsed key-value pairs and 8 null values (61.54%)
[6]:
{'scheme': 'http',
 'host': 'www.facebookee.com',
 'url_clean': 'http://www.facebookee.com/otherpath',
 'queries': {'not_token': 'hiThere'}}

As we can see, the queries auth, token, and another_token were removed, but not_token was kept in the result. This is because the set of names to remove is the union of default_list and the user-defined list, and queries are removed based on that combined set.
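The union behaviour can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not dataprep's code: the helper strip_auth is hypothetical, the names in DEFAULT_AUTH_KEYS are copied from default_list above, and exact, case-sensitive matching of query names is assumed because it is consistent with not_token being kept in the outputs above.

```python
# Names copied from default_list above.
DEFAULT_AUTH_KEYS = {
    "access_token", "auth_key", "auth", "password", "username", "login",
    "token", "passcode", "access-token", "auth-key", "authentication",
    "authentication-key",
}


def strip_auth(queries, remove_auth):
    """Drop query parameters whose names are in the blocked set."""
    if isinstance(remove_auth, bool):
        # True: use the default names only. False: remove nothing.
        blocked = DEFAULT_AUTH_KEYS if remove_auth else set()
    else:
        # A list of names is combined with the defaults (set union).
        blocked = DEFAULT_AUTH_KEYS | set(remove_auth)
    return {key: value for key, value in queries.items() if key not in blocked}


queries = {
    "auth": "facebookeeauth",
    "token": "iwusdkc",
    "not_token": "hiThere",
    "another_token": "12323423",
}
with_defaults = strip_auth(queries, True)
with_extra = strip_auth(queries, ["another_token"])
print(with_defaults)
print(with_extra)
```

With True, only auth and token are dropped. With ["another_token"], the combined set also removes another_token, which matches the two results shown above.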

3. split parameter

The split parameter adds individual columns containing all the extracted features to the given DataFrame.

[7]:
df_remove_split = clean_url(df, column="url", split=True)
df_remove_split
URL Cleaning Report:
        5 values parsed (38.46%)
        5 values unable to be parsed (38.46%), set to NaN
Result contains 5 (38.46%) parsed key-value pairs and 8 null values (61.54%)
[7]:
url scheme host url_clean queries
0 random text which is not a url NaN NaN NaN NaN
1 http://www.facebookee.com/otherpath?auth=faceb... http www.facebookee.com http://www.facebookee.com/otherpath {'auth': 'facebookeeauth', 'token': 'iwusdkc',...
2 https://www.sfu.ca/ficticiouspath?auth=samplet... https www.sfu.ca https://www.sfu.ca/ficticiouspath {'auth': 'sampletoken1', 'studentid': '1234', ...
3 notaurl NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 None NaN NaN NaN NaN
6 https://www.sfu.ca/ficticiouspath?auth=samplet... https www.sfu.ca https://www.sfu.ca/ficticiouspath {'auth': 'sampletoken2', 'studentid': '1230', ...
7 NaN NaN NaN NaN
8 {'not_a_url': True} NaN NaN NaN NaN
9 2345678 NaN NaN NaN NaN
10 345345345 NaN NaN NaN NaN
11 https://www.sfu.ca/ficticiouspath?auth=samplet... https www.sfu.ca https://www.sfu.ca/ficticiouspath {'auth': 'sampletoken3', 'studentid': '1231', ...
12 https://www.sfu.ca/ficticiouspath?auth=samplet... https www.sfu.ca https://www.sfu.ca/ficticiouspath {'auth': 'sampletoken1', 'studentid': '1232', ...

4. inplace parameter

Replaces the original column with original_column_name_details.

[8]:
df_remove_inplace = clean_url(df, column="url", inplace=True)
df_remove_inplace
URL Cleaning Report:
        5 values parsed (38.46%)
        5 values unable to be parsed (38.46%), set to NaN
Result contains 5 (38.46%) parsed key-value pairs and 8 null values (61.54%)
[8]:
url_details
0 NaN
1 {'scheme': 'http', 'host': 'www.facebookee.com...
2 {'scheme': 'https', 'host': 'www.sfu.ca', 'url...
3 NaN
4 NaN
5 NaN
6 {'scheme': 'https', 'host': 'www.sfu.ca', 'url...
7 NaN
8 NaN
9 NaN
10 NaN
11 {'scheme': 'https', 'host': 'www.sfu.ca', 'url...
12 {'scheme': 'https', 'host': 'www.sfu.ca', 'url...

5. split and inplace

Replaces the original column with the individual feature columns produced by the split parameter.

[9]:
df_remove_inplace_split = clean_url(df, column="url", inplace=True, split=True)
df_remove_inplace_split
URL Cleaning Report:
        5 values parsed (38.46%)
        5 values unable to be parsed (38.46%), set to NaN
Result contains 5 (38.46%) parsed key-value pairs and 8 null values (61.54%)
[9]:
scheme host url_clean queries
0 NaN NaN NaN NaN
1 http www.facebookee.com http://www.facebookee.com/otherpath {'auth': 'facebookeeauth', 'token': 'iwusdkc',...
2 https www.sfu.ca https://www.sfu.ca/ficticiouspath {'auth': 'sampletoken1', 'studentid': '1234', ...
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 https www.sfu.ca https://www.sfu.ca/ficticiouspath {'auth': 'sampletoken2', 'studentid': '1230', ...
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
10 NaN NaN NaN NaN
11 https www.sfu.ca https://www.sfu.ca/ficticiouspath {'auth': 'sampletoken3', 'studentid': '1231', ...
12 https www.sfu.ca https://www.sfu.ca/ficticiouspath {'auth': 'sampletoken1', 'studentid': '1232', ...

6. errors parameter

  • “coerce” (default): invalid parsing will be set to NaN

  • “ignore”: invalid parsing will return the input

  • “raise”: invalid parsing will raise an exception

a. “coerce” (default)

This is the default value of the errors parameter. It sets the result of invalid parsing to NaN.

[10]:
df_remove_errors_default = clean_url(df, column="url")
df_remove_errors_default
URL Cleaning Report:
        5 values parsed (38.46%)
        5 values unable to be parsed (38.46%), set to NaN
Result contains 5 (38.46%) parsed key-value pairs and 8 null values (61.54%)
[10]:
url url_details
0 random text which is not a url NaN
1 http://www.facebookee.com/otherpath?auth=faceb... {'scheme': 'http', 'host': 'www.facebookee.com...
2 https://www.sfu.ca/ficticiouspath?auth=samplet... {'scheme': 'https', 'host': 'www.sfu.ca', 'url...
3 notaurl NaN
4 NaN NaN
5 None NaN
6 https://www.sfu.ca/ficticiouspath?auth=samplet... {'scheme': 'https', 'host': 'www.sfu.ca', 'url...
7 NaN
8 {'not_a_url': True} NaN
9 2345678 NaN
10 345345345 NaN
11 https://www.sfu.ca/ficticiouspath?auth=samplet... {'scheme': 'https', 'host': 'www.sfu.ca', 'url...
12 https://www.sfu.ca/ficticiouspath?auth=samplet... {'scheme': 'https', 'host': 'www.sfu.ca', 'url...

b. “ignore”

This returns the input value unchanged wherever parsing fails. Null inputs remain NaN.

[11]:
df_remove_errors_ignore = clean_url(df, column="url", errors="ignore")
df_remove_errors_ignore
URL Cleaning Report:
        5 values parsed (38.46%)
        5 values unable to be parsed (38.46%), left unchanged
Result contains 5 (38.46%) parsed key-value pairs and 3 null values (23.08%)
[11]:
url url_details
0 random text which is not a url random text which is not a url
1 http://www.facebookee.com/otherpath?auth=faceb... {'scheme': 'http', 'host': 'www.facebookee.com...
2 https://www.sfu.ca/ficticiouspath?auth=samplet... {'scheme': 'https', 'host': 'www.sfu.ca', 'url...
3 notaurl notaurl
4 NaN NaN
5 None NaN
6 https://www.sfu.ca/ficticiouspath?auth=samplet... {'scheme': 'https', 'host': 'www.sfu.ca', 'url...
7 NaN
8 {'not_a_url': True} {'not_a_url': True}
9 2345678 2345678
10 345345345 345345345
11 https://www.sfu.ca/ficticiouspath?auth=samplet... {'scheme': 'https', 'host': 'www.sfu.ca', 'url...
12 https://www.sfu.ca/ficticiouspath?auth=samplet... {'scheme': 'https', 'host': 'www.sfu.ca', 'url...

c. “raise”

This raises a ValueError when clean_url() encounters a value that cannot be parsed.
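The three error modes can be compared side by side with a small standard-library sketch. The helper parse_or_handle is hypothetical and greatly simplified: it treats a value as parseable only if it is a string with an http or https scheme and a host, which is an assumption and not dataprep's actual rule, and it makes no special case for null inputs.

```python
import math
from urllib.parse import urlsplit


def parse_or_handle(value, errors="coerce"):
    """Parse a URL, handling failures according to the errors mode."""
    if isinstance(value, str):
        parts = urlsplit(value)
        if parts.scheme in ("http", "https") and parts.netloc:
            return {"scheme": parts.scheme, "host": parts.hostname}
    # Reaching this point means the value could not be parsed.
    if errors == "raise":
        raise ValueError(f"unable to parse value {value!r}")
    if errors == "ignore":
        return value  # hand the input back unchanged
    return math.nan  # "coerce": replace the value with NaN


print(parse_or_handle("notaurl"))                   # coerce
print(parse_or_handle("notaurl", errors="ignore"))  # ignore
try:
    parse_or_handle("notaurl", errors="raise")
except ValueError as exc:
    print(f"raised: {exc}")
```

Because "raise" stops at the first unparseable value, it is best suited to columns that are expected to contain only valid URLs.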

7. report parameter

By default, report is set to True. When it is set to False, clean_url() does not print the statistics about the cleaning operations performed.

[12]:
df_remove_auth_boolean = clean_url(df, column="url", remove_auth=True, report=False)
df_remove_auth_boolean
[12]:
url url_details
0 random text which is not a url NaN
1 http://www.facebookee.com/otherpath?auth=faceb... {'scheme': 'http', 'host': 'www.facebookee.com...
2 https://www.sfu.ca/ficticiouspath?auth=samplet... {'scheme': 'https', 'host': 'www.sfu.ca', 'url...
3 notaurl NaN
4 NaN NaN
5 None NaN
6 https://www.sfu.ca/ficticiouspath?auth=samplet... {'scheme': 'https', 'host': 'www.sfu.ca', 'url...
7 NaN
8 {'not_a_url': True} NaN
9 2345678 NaN
10 345345345 NaN
11 https://www.sfu.ca/ficticiouspath?auth=samplet... {'scheme': 'https', 'host': 'www.sfu.ca', 'url...
12 https://www.sfu.ca/ficticiouspath?auth=samplet... {'scheme': 'https', 'host': 'www.sfu.ca', 'url...

8. validate_url()

validate_url() returns True when the input is a valid URL and False otherwise.

[13]:
from dataprep.clean import validate_url
print(validate_url({"not_a_url" : True}))
print(validate_url(2346789))
print(validate_url("https://www.sfu.ca/ficticiouspath?auth=sampletoken3&studentid=1231&loc=sur"))
print(validate_url("http://www.facebookee.com/otherpath?auth=facebookeeauth&token=iwusdkc&nottoken=hiThere&another_token=12323423"))

False
False
True
True
[14]:
df = pd.DataFrame({
    "url": [
        "random text which is not a url",
        "http://www.facebookee.com/otherpath?auth=facebookeeauth&token=iwusdkc&nottoken=hiThere&another_token=12323423",
        "https://www.sfu.ca/ficticiouspath?auth=sampletoken1&studentid=1234&loc=van",
        "notaurl",
        np.nan,
        None,
        "https://www.sfu.ca/ficticiouspath?auth=sampletoken2&studentid=1230&loc=bur",
        "",
        {
            "not_a_url": True
        },
        "2345678",
        345345345,
        "https://www.sfu.ca/ficticiouspath?auth=sampletoken3&studentid=1231&loc=sur",
        "https://www.sfu.ca/ficticiouspath?auth=sampletoken1&studentid=1232&loc=van",
    ]
})

df["validate_url"] = validate_url(df["url"])
df
[14]:
url validate_url
0 random text which is not a url False
1 http://www.facebookee.com/otherpath?auth=faceb... True
2 https://www.sfu.ca/ficticiouspath?auth=samplet... True
3 notaurl False
4 NaN False
5 None False
6 https://www.sfu.ca/ficticiouspath?auth=samplet... True
7 False
8 {'not_a_url': True} False
9 2345678 False
10 345345345 False
11 https://www.sfu.ca/ficticiouspath?auth=samplet... True
12 https://www.sfu.ca/ficticiouspath?auth=samplet... True
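For intuition about why the non-string and free-text values above come back as False, here is a rough standard-library approximation of such a check. The function looks_like_url is hypothetical, and its rule (a string with an http or https scheme and a non-empty host) is an assumption; validate_url() may accept or reject values that this sketch does not.

```python
from urllib.parse import urlsplit


def looks_like_url(value):
    """Return True only for strings with an http(s) scheme and a host."""
    if not isinstance(value, str):
        return False  # dicts, numbers, None, and NaN are never URLs
    try:
        parts = urlsplit(value)
    except ValueError:
        return False  # urlsplit rejects some malformed inputs
    return parts.scheme in {"http", "https"} and bool(parts.netloc)


samples = [
    {"not_a_url": True},
    2346789,
    "notaurl",
    "",
    None,
    "https://www.sfu.ca/ficticiouspath?auth=sampletoken3&studentid=1231&loc=sur",
]
results = [looks_like_url(sample) for sample in samples]
print(results)
```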