Overview

Dataset statistics

Number of variables: 1
Number of observations: 59079
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 31785
Duplicate rows (%): 53.8%
Total size in memory: 5.2 MiB
Average record size in memory: 91.7 B
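
The duplicate figures are consistent with counting every row that repeats an earlier row: 59079 observations minus the 27294 distinct values reported for the single variable gives 31785. The following is a minimal sketch of that arithmetic in Python, assuming that counting rule; `count_duplicate_rows` and `percent` are hypothetical helpers written for illustration, and the toy rows are not taken from the dataset.

```python
from collections import Counter


def count_duplicate_rows(rows):
    """Count rows that repeat an earlier row (first occurrences are not counted)."""
    counts = Counter(rows)
    return sum(n - 1 for n in counts.values())


def percent(part, total):
    """Return part/total as a percentage rounded to one decimal place."""
    return round(100.0 * part / total, 1)


# Toy rows standing in for a single-column dataset (illustrative only).
rows = ["<td>0</td>", "</tr>", "<td>0</td>", "<td>US</td>", "</tr>", "<td>0</td>"]
n_dupes = count_duplicate_rows(rows)

# Arithmetic check against the figures reported in this document.
n_observations = 59079
n_distinct = 27294
reported_duplicates = n_observations - n_distinct
reported_percent = percent(reported_duplicates, n_observations)
```

Under this assumption, the reported duplicate count and percentage follow directly from the observation and distinct counts.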

Variable types

CAT: 1

Reproduction

Analysis started: 2020-06-04 22:28:50.050734
Analysis finished: 2020-06-04 22:28:51.972652
Duration: 1.92 seconds
Version: pandas-profiling v2.8.0
Command line: pandas_profiling --config_file config.yaml [YOUR_FILE.csv]
Download configuration: config.yaml

Warnings

Dataset has 31785 (53.8%) duplicate rows (Duplicates)
<!DOCTYPE html> has a high cardinality: 27294 distinct values (High cardinality)

Variables


Categorical

HIGH CARDINALITY

Distinct count: 27294
Unique (%): 46.2%
Missing: 0
Missing (%): 0.0%
Memory size: 461.7 KiB
Value                           Count   Frequency (%)
<td>0</td>                      4631    7.8%
<td>2020-06-03 02:33:13</td>    3641    6.2%
</tr>                           3641    6.2%
<td>US</td>                     3036    5.1%
<td></td>                       1668    2.8%
<td>0.0</td>                    1329    2.2%
<td>1</td>                      734     1.2%
<td>2</td>                      427     0.7%
<td>3</td>                      343     0.6%
<td>4</td>                      257     0.4%
<td>Texas</td>                  235     0.4%
<td>5</td>                      204     0.3%
<td>7</td>                      185     0.3%
<td>6</td>                      182     0.3%
<td>Georgia</td>                163     0.3%
<td>8</td>                      153     0.3%
<td>9</td>                      144     0.2%
<td>Virginia</td>               133     0.2%
<td>12</td>                     126     0.2%
<td>Kentucky</td>               120     0.2%
<td>11</td>                     118     0.2%
<td>10</td>                     116     0.2%
<td>13</td>                     103     0.2%
<td>Missouri</td>               103     0.2%
<td>Illinois</td>               103     0.2%
Other values (27269)            37184   62.9%
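
A value/count/frequency table of this shape, with the tail collapsed into an "Other values" row, can be sketched with the Python standard library. This is an illustration only, not the code pandas-profiling runs; `frequency_table` is a hypothetical helper and the toy values are made up.

```python
from collections import Counter


def frequency_table(values, top_n):
    """Return (value, count, percent) rows for the top_n values plus an 'Other values' row."""
    counts = Counter(values)
    total = sum(counts.values())
    top = counts.most_common(top_n)
    rows = [(value, count, round(100.0 * count / total, 1)) for value, count in top]
    n_other_distinct = len(counts) - len(top)
    if n_other_distinct > 0:
        other_count = total - sum(count for _, count in top)
        label = f"Other values ({n_other_distinct})"
        rows.append((label, other_count, round(100.0 * other_count / total, 1)))
    return rows


# Toy data (illustrative only): 4 distinct values, 10 observations.
values = ["a"] * 5 + ["b"] * 3 + ["c", "d"]
table = frequency_table(values, top_n=2)

# Consistency check on the reported table: 25 listed values plus 27269 others.
n_distinct_reported = 25 + 27269
```

The last line checks that the 25 listed values and the 27269 collapsed ones add up to the 27294 distinct values reported for the variable.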
 

Length

Max length: 765
Median length: 29
Mean length: 34.66646355
Min length: 1
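
The four length statistics can be reproduced for any list of strings with the standard library. The sketch below assumes length is measured in Unicode code points (what Python's `len` returns for `str`); `length_summary` is a hypothetical helper and the sample strings are illustrative.

```python
from statistics import mean, median


def length_summary(values):
    """Return max, median, mean, and min string length (in code points)."""
    lengths = [len(v) for v in values]
    return {
        "max": max(lengths),
        "median": median(lengths),
        "mean": mean(lengths),
        "min": min(lengths),
    }


# Illustrative sample, not the profiled dataset.
sample = ["<head>", "</tr>", "<td>US</td>", "0"]
summary = length_summary(sample)
```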

Overview of Unicode Properties

Unique Unicode characters: 89
Unique Unicode categories: 13
Unique Unicode scripts: 2
Unique Unicode blocks: 4
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
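
As a rough illustration of using character properties to analyse text, the sketch below tallies general categories with Python's `unicodedata` module. It covers categories only: the standard library does not expose script or block properties, so those counts would need another source. `category_counts` and the `CATEGORY_NAMES` mapping are assumptions made for this example, and the sample strings are illustrative.

```python
import unicodedata
from collections import Counter

# Readable names for the general-category codes that appear in this report.
CATEGORY_NAMES = {
    "Zs": "Space Separator",
    "Ll": "Lowercase Letter",
    "Lu": "Uppercase Letter",
    "Nd": "Decimal Number",
    "Sm": "Math Symbol",
    "Po": "Other Punctuation",
    "Pd": "Dash Punctuation",
    "Pc": "Connector Punctuation",
    "Ps": "Open Punctuation",
    "Pe": "Close Punctuation",
    "Pf": "Final Punctuation",
    "So": "Other Symbol",
    "Sk": "Modifier Symbol",
}


def category_counts(values):
    """Count characters per Unicode general category across all strings."""
    counts = Counter()
    for value in values:
        for ch in value:
            code = unicodedata.category(ch)
            # Fall back to the raw two-letter code if no readable name is listed.
            counts[CATEGORY_NAMES.get(code, code)] += 1
    return counts


# Illustrative sample, not the profiled dataset.
sample = ["<td>US</td>", "<td>0.0</td>"]
counts = category_counts(sample)
```

In this sample the angle brackets fall under Math Symbol and the slash and period under Other Punctuation, matching how the report classifies them.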

Most occurring characters

Value               Count    Frequency (%)
(space)             815590   39.8%
t                   117954   5.8%
d                   115593   5.6%
<                   110410   5.4%
>                   110406   5.4%
/                   55609    2.7%
0                   48383    2.4%
3                   42375    2.1%
2                   39159    1.9%
-                   38905    1.9%
"                   38518    1.9%
1                   36530    1.8%
l                   29696    1.4%
e                   28909    1.4%
i                   28676    1.4%
n                   28344    1.4%
s                   27777    1.4%
6                   25157    1.2%
a                   24029    1.2%
4                   23839    1.2%
5                   23050    1.1%
7                   22292    1.1%
9                   21983    1.1%
8                   21784    1.1%
r                   19398    0.9%
Other values (64)   153694   7.5%
 

Most occurring categories

Value                   Count    Frequency (%)
Space Separator         815590   39.8%
Lowercase Letter        503963   24.6%
Decimal Number          304552   14.9%
Math Symbol             240134   11.7%
Other Punctuation       117294   5.7%
Dash Punctuation        38905    1.9%
Uppercase Letter        27407    1.3%
Connector Punctuation   188      < 0.1%
Open Punctuation        9        < 0.1%
Close Punctuation       9        < 0.1%
Other Symbol            6        < 0.1%
Modifier Symbol         2        < 0.1%
Final Punctuation       1        < 0.1%
 

Most frequent Math Symbol characters

Value   Count    Frequency (%)
<       110410   46.0%
>       110406   46.0%
=       19297    8.0%
+       21       < 0.1%
 

Most frequent Lowercase Letter characters

Value   Count    Frequency (%)
t       117954   23.4%
d       115593   22.9%
l       29696    5.9%
e       28909    5.7%
i       28676    5.7%
n       28344    5.6%
s       27777    5.5%
a       24029    4.8%
r       19398    3.8%
b       15684    3.1%
u       13005    2.6%
m       12614    2.5%
c       9746     1.9%
o       9744     1.9%
j       7616     1.5%
f       4474     0.9%
h       2244     0.4%
g       1740     0.3%
p       1608     0.3%
k       1226     0.2%
y       1054     0.2%
v       995      0.2%
w       762      0.2%
x       756      0.2%
z       255      0.1%
 

Most frequent Space Separator characters

Value     Count    Frequency (%)
(space)   815590   100.0%
 

Most frequent Other Punctuation characters

Value   Count   Frequency (%)
/       55609   47.4%
"       38518   32.8%
.       15701   13.4%
:       7351    6.3%
;       30      < 0.1%
&       25      < 0.1%
%       25      < 0.1%
#       13      < 0.1%
?       6       < 0.1%
!       5       < 0.1%
*       4       < 0.1%
·       2       < 0.1%
'       2       < 0.1%
…       2       < 0.1%
@       1       < 0.1%
 

Most frequent Dash Punctuation characters

Value   Count   Frequency (%)
-       38905   100.0%
 

Most frequent Decimal Number characters

Value   Count   Frequency (%)
0       48383   15.9%
3       42375   13.9%
2       39159   12.9%
1       36530   12.0%
6       25157   8.3%
4       23839   7.8%
5       23050   7.6%
7       22292   7.3%
9       21983   7.2%
8       21784   7.2%
 

Most frequent Uppercase Letter characters

Value   Count   Frequency (%)
L       7700    28.1%
C       4692    17.1%
S       3793    13.8%
U       3212    11.7%
M       980     3.6%
I       617     2.3%
N       526     1.9%
T       511     1.9%
G       494     1.8%
O       463     1.7%
W       433     1.6%
A       428     1.6%
D       427     1.6%
V       426     1.6%
B       424     1.5%
K       376     1.4%
H       372     1.4%
P       365     1.3%
R       289     1.1%
F       260     0.9%
J       217     0.8%
E       180     0.7%
Y       116     0.4%
Z       46      0.2%
Q       32      0.1%
 

Most frequent Connector Punctuation characters

Value   Count   Frequency (%)
_       188     100.0%

Most frequent Modifier Symbol characters

Value   Count   Frequency (%)
`       2       100.0%

Most frequent Other Symbol characters

Value   Count   Frequency (%)
↵       6       100.0%

Most frequent Open Punctuation characters

Value   Count   Frequency (%)
(       9       100.0%

Most frequent Close Punctuation characters

Value   Count   Frequency (%)
)       9       100.0%

Most frequent Final Punctuation characters

Value   Count   Frequency (%)
’       1       100.0%
 

Most occurring scripts

Value    Count     Frequency (%)
Common   1516690   74.1%
Latin    531370    25.9%
 

Most frequent Common characters

Value               Count    Frequency (%)
(space)             815590   53.8%
<                   110410   7.3%
>                   110406   7.3%
/                   55609    3.7%
0                   48383    3.2%
3                   42375    2.8%
2                   39159    2.6%
-                   38905    2.6%
"                   38518    2.5%
1                   36530    2.4%
6                   25157    1.7%
4                   23839    1.6%
5                   23050    1.5%
7                   22292    1.5%
9                   21983    1.4%
8                   21784    1.4%
=                   19297    1.3%
.                   15701    1.0%
:                   7351     0.5%
_                   188      < 0.1%
;                   30       < 0.1%
&                   25       < 0.1%
%                   25       < 0.1%
+                   21       < 0.1%
#                   13       < 0.1%
Other values (12)   49       < 0.1%
 

Most frequent Latin characters

Value               Count    Frequency (%)
t                   117954   22.2%
d                   115593   21.8%
l                   29696    5.6%
e                   28909    5.4%
i                   28676    5.4%
n                   28344    5.3%
s                   27777    5.2%
a                   24029    4.5%
r                   19398    3.7%
b                   15684    3.0%
u                   13005    2.4%
m                   12614    2.4%
c                   9746     1.8%
o                   9744     1.8%
L                   7700     1.4%
j                   7616     1.4%
C                   4692     0.9%
f                   4474     0.8%
S                   3793     0.7%
U                   3212     0.6%
h                   2244     0.4%
g                   1740     0.3%
p                   1608     0.3%
k                   1226     0.2%
y                   1054     0.2%
Other values (27)   10842    2.0%
 

Most occurring blocks

Value         Count     Frequency (%)
ASCII         2048049   > 99.9%
Arrows        6         < 0.1%
Punctuation   3         < 0.1%
None          2         < 0.1%
 

Most frequent ASCII characters

Value               Count    Frequency (%)
(space)             815590   39.8%
t                   117954   5.8%
d                   115593   5.6%
<                   110410   5.4%
>                   110406   5.4%
/                   55609    2.7%
0                   48383    2.4%
3                   42375    2.1%
2                   39159    1.9%
-                   38905    1.9%
"                   38518    1.9%
1                   36530    1.8%
l                   29696    1.4%
e                   28909    1.4%
i                   28676    1.4%
n                   28344    1.4%
s                   27777    1.4%
6                   25157    1.2%
a                   24029    1.2%
4                   23839    1.2%
5                   23050    1.1%
7                   22292    1.1%
9                   21983    1.1%
8                   21784    1.1%
r                   19398    0.9%
Other values (60)   153683   7.5%
 

Most frequent None characters

Value   Count   Frequency (%)
·       2       100.0%

Most frequent Arrows characters

Value   Count   Frequency (%)
↵       6       100.0%

Most frequent Punctuation characters

Value   Count   Frequency (%)
…       2       66.7%
’       1       33.3%
 

Missing values

Sample

First rows

Index   <!DOCTYPE html>
0       <html lang="en">
1       <head>
2       <meta charset="utf-8">
3       <link rel="dns-prefetch" href="https://github.githubassets.com">
4       <link rel="dns-prefetch" href="https://avatars0.githubusercontent.com">
5       <link rel="dns-prefetch" href="https://avatars1.githubusercontent.com">
6       <link rel="dns-prefetch" href="https://avatars2.githubusercontent.com">
7       <link rel="dns-prefetch" href="https://avatars3.githubusercontent.com">
8       <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">
9       <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">

Last rows

Index   <!DOCTYPE html>
59069   <div class="octocat-spinner my-6 js-details-dialog-spinner"></div>
59070   </details-dialog>
59071   </details>
59072   </template>
59073   <div class="Popover js-hovercard-content position-absolute" style="display: none; outline: none;" tabindex="0">
59074   <div class="Popover-message Popover-message--bottom-left Popover-message--large Box box-shadow-large" style="width:360px;">
59075   </div>
59076   </div>
59077   </body>
59078   </html>

Duplicate rows

Most frequent

Index   <!DOCTYPE html>                 count
46      <td>0</td>                      4631
399     <td>2020-06-03 02:33:13</td>    3641
1688    </tr>                           3641
1623    <td>US</td>                     3036
1028    <td></td>                       1668
25      <td>0.0</td>                    1329
360     <td>1</td>                      734
526     <td>2</td>                      427
643     <td>3</td>                      343
738     <td>4</td>                      257