Overview

Dataset statistics

Number of variables: 6
Number of observations: 16599
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 12
Duplicate rows (%): 0.1%
Total size in memory: 778.2 KiB
Average record size in memory: 48.0 B
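
The memory figures above can be cross-checked with simple arithmetic. The sketch below is only a plausibility check: the 8 bytes per cell (a float64 value or a 64-bit object pointer) and the 128-byte index are assumptions, not values stated in the report.

```python
# Cross-check of the reported memory figures (assumptions flagged inline).
N_ROWS = 16599        # "Number of observations" from the report
N_COLS = 6            # "Number of variables" from the report
N_DUPLICATES = 12     # "Duplicate rows" from the report
BYTES_PER_CELL = 8    # assumption: float64 value or 64-bit object pointer
INDEX_BYTES = 128     # assumption: size of a default integer range index

column_kib = N_ROWS * BYTES_PER_CELL / 1024
total_bytes = N_ROWS * N_COLS * BYTES_PER_CELL + INDEX_BYTES
total_kib = total_bytes / 1024
avg_record_bytes = total_bytes / N_ROWS
duplicate_pct = N_DUPLICATES / N_ROWS * 100

print(f"per-column size     : {column_kib:.1f} KiB")
print(f"total size          : {total_kib:.1f} KiB")
print(f"average record size : {avg_record_bytes:.1f} B")
print(f"duplicate rows      : {duplicate_pct:.1f}%")
```

Under these assumptions the rounded results agree with the total size, average record size, and duplicate percentage listed above.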

Variable types

NUM: 3
CAT: 3

Reproduction

Analysis started: 2020-08-04 23:54:40.257427
Analysis finished: 2020-08-04 23:54:48.190619
Duration: 7.93 seconds
Version: pandas-profiling v2.8.0
Command line: pandas_profiling --config_file config.yaml [YOUR_FILE.csv]
Download configuration: config.yaml

Warnings

Dataset has 12 (0.1%) duplicate rows (Duplicates)
Dates has a high cardinality: 498 distinct values (High cardinality)
Regions is highly correlated with States (High correlation)
States is highly correlated with Regions (High correlation)
States is uniformly distributed (Uniform)
Dates is uniformly distributed (Uniform)

Variables

States
Categorical

HIGH CORRELATION
UNIFORM

Distinct count: 33
Unique (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 129.7 KiB
Value              Count  Frequency (%)
Nagaland           503    3.0%
Assam              503    3.0%
Pondy              503    3.0%
Tripura            503    3.0%
Gujarat            503    3.0%
Maharashtra        503    3.0%
Jharkhand          503    3.0%
West Bengal        503    3.0%
Arunachal Pradesh  503    3.0%
Odisha             503    3.0%
Other values (23)  11569  69.7%

Length

Max length: 17
Median length: 7
Mean length: 7.363636364
Min length: 2

Regions
Categorical

HIGH CORRELATION

Distinct count: 5
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 129.7 KiB
Value  Count  Frequency (%)
NR     4527   27.3%
NER    3521   21.2%
WR     3018   18.2%
SR     3018   18.2%
ER     2515   15.2%

Length

Max length: 3
Median length: 2
Mean length: 2.212121212
Min length: 2

latitude
Real number (ℝ≥0)

Distinct count: 33
Unique (%): 0.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 23.17822023487879
Minimum: 8.900372741
Maximum: 33.45
Zeros: 0
Zeros (%): 0.0%
Memory size: 129.7 KiB

Quantile statistics

Minimum: 8.900372741
5-th percentile: 11.93499371
Q1: 19.82042971
Median: 23.83540428
Q3: 27.3333303
95-th percentile: 31.51997398
Maximum: 33.45
Range: 24.54962726
Interquartile range (IQR): 7.51290059

Descriptive statistics

Standard deviation: 6.146575264
Coefficient of variation (CV): 0.2651875425
Kurtosis: -0.4589125045
Mean: 23.17822023
Median Absolute Deviation (MAD): 3.76457641
Skewness: -0.5614781947
Sum: 384735.2777
Variance: 37.78038748
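
Several of the latitude statistics follow from one another, which gives a quick consistency check on the report. The sketch below is a minimal illustration that copies the reported values and recomputes the derived ones; the variable names are illustrative only.

```python
import math

# Values copied from the latitude statistics in this report.
n = 16599
mean = 23.17822023487879
std = 6.146575264
minimum, maximum = 8.900372741, 33.45
q1, q3 = 19.82042971, 27.3333303

# Derived statistics, recomputed from the values above.
cv = std / mean                  # coefficient of variation
variance = std ** 2              # variance is the squared standard deviation
value_range = maximum - minimum  # range
iqr = q3 - q1                    # interquartile range
total = mean * n                 # sum of all observations

# Compare against the figures printed in the report.
reported = {
    "cv": (cv, 0.2651875425),
    "variance": (variance, 37.78038748),
    "range": (value_range, 24.54962726),
    "iqr": (iqr, 7.51290059),
    "sum": (total, 384735.2777),
}
for name, (computed, printed) in reported.items():
    ok = math.isclose(computed, printed, rel_tol=1e-6)
    print(f"{name:8s} computed={computed:.6f} reported={printed} match={ok}")
```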
Histogram with fixed size bins (bins=10)
Common values

Value              Count  Frequency (%)
11.93499371        503    3.0%
31.10002545        503    3.0%
14.7504291         503    3.0%
26.7499809         503    3.0%
19.25023195        503    3.0%
12.57038129        503    3.0%
27.59998069        503    3.0%
27.10039878        503    3.0%
33.45              503    3.0%
20.26657819        503    3.0%
Other values (23)  11569  69.7%

Minimum 5 values

Value        Count  Frequency (%)
8.900372741  503    3.0%
11.93499371  503    3.0%
12.57038129  503    3.0%
12.92038576  503    3.0%
14.7504291   503    3.0%

Maximum 5 values

Value        Count  Frequency (%)
33.45        503    3.0%
31.51997398  503    3.0%
31.10002545  503    3.0%
30.71999697  503    3.0%
30.32040895  503    3.0%

longitude
Real number (ℝ≥0)

Distinct count: 32
Unique (%): 0.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 81.79453346151514
Minimum: 71.1924
Maximum: 94.21666744
Zeros: 0
Zeros (%): 0.0%
Memory size: 129.7 KiB

Quantile statistics

Minimum: 71.1924
5-th percentile: 73.0166178
Q1: 76.56999263
Median: 78.57002559
Q3: 88.32994665
95-th percentile: 94.11657019
Maximum: 94.21666744
Range: 23.02426744
Interquartile range (IQR): 11.75995402

Descriptive statistics

Standard deviation: 7.258428845
Coefficient of variation (CV): 0.08873977927
Kurtosis: -1.202815695
Mean: 81.79453346
Median Absolute Deviation (MAD): 3.93004435
Skewness: 0.5198699519
Sum: 1357707.461
Variance: 52.6847893
Histogram with fixed size bins (bins=10)
Common values

Value              Count  Frequency (%)
78.05000565        1006   6.1%
79.0193            503    3.0%
88.6166475         503    3.0%
92.72001461        503    3.0%
94.11657019        503    3.0%
79.83000037        503    3.0%
75.98000281        503    3.0%
94.21666744        503    3.0%
77.16659704        503    3.0%
74.63998124        503    3.0%
Other values (22)  11066  66.7%

Minimum 5 values

Value        Count  Frequency (%)
71.1924      503    3.0%
73.0166178   503    3.0%
73.16017493  503    3.0%
73.81800065  503    3.0%
74.63998124  503    3.0%

Maximum 5 values

Value        Count  Frequency (%)
94.21666744  503    3.0%
94.11657019  503    3.0%
93.95001705  503    3.0%
93.61660071  503    3.0%
92.72001461  503    3.0%

Dates
Categorical

HIGH CARDINALITY
UNIFORM

Distinct count: 498
Unique (%): 3.0%
Missing: 0
Missing (%): 0.0%
Memory size: 129.7 KiB
Value                Count  Frequency (%)
09/07/2019 00:00:00  66     0.4%
11/07/2019 00:00:00  66     0.4%
08/07/2019 00:00:00  66     0.4%
10/07/2019 00:00:00  66     0.4%
12/07/2019 00:00:00  66     0.4%
24/04/2019 00:00:00  33     0.2%
22/06/2019 00:00:00  33     0.2%
05/01/2019 00:00:00  33     0.2%
26/10/2019 00:00:00  33     0.2%
21/11/2019 00:00:00  33     0.2%
Other values (488)   16104  97.0%

Length

Max length: 19
Median length: 19
Mean length: 19
Min length: 19

Usage
Real number (ℝ≥0)

Distinct count: 3627
Unique (%): 21.9%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 103.00186155792517
Minimum: 0.3
Maximum: 522.1
Zeros: 0
Zeros (%): 0.0%
Memory size: 129.7 KiB

Quantile statistics

Minimum: 0.3
5-th percentile: 1.8
Q1: 6.7
Median: 64.4
Q3: 173.9
95-th percentile: 344.75
Maximum: 522.1
Range: 521.8
Interquartile range (IQR): 167.2

Descriptive statistics

Standard deviation: 116.0440556
Coefficient of variation (CV): 1.126620955
Kurtosis: 0.8018644468
Mean: 103.0018616
Median Absolute Deviation (MAD): 60.8
Skewness: 1.243323796
Sum: 1709727.9
Variance: 13466.22285
Histogram with fixed size bins (bins=10)
Common values

Value                Count  Frequency (%)
2.2                  328    2.0%
2.1                  315    1.9%
2.3                  207    1.2%
1.7                  178    1.1%
2                    158    1.0%
1.8                  156    0.9%
2.4                  151    0.9%
2.5                  129    0.8%
1.6                  119    0.7%
1.9                  115    0.7%
Other values (3617)  14743  88.8%

Minimum 5 values

Value  Count  Frequency (%)
0.3    1      < 0.1%
0.4    1      < 0.1%
0.5    5      < 0.1%
0.6    6      < 0.1%
0.7    9      0.1%

Maximum 5 values

Value  Count  Frequency (%)
522.1  1      < 0.1%
516.4  1      < 0.1%
515.8  1      < 0.1%
513.9  1      < 0.1%
513.6  1      < 0.1%

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1: -1 indicates total negative linear correlation, 0 indicates no linear correlation, and +1 indicates total positive linear correlation. Furthermore, r is invariant under separate changes in location and (positive) scale of the two variables, which implies that for a linear relationship the steepness of the line does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
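
The calculation described above can be sketched in a few lines of plain Python. This is an illustrative implementation, not the code the report itself runs; the function name `pearson_r` and the sample data are made up for the example.

```python
import math


def pearson_r(xs, ys):
    """Covariance of xs and ys divided by the product of their standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n factors of the covariance and the standard deviations cancel,
    # so plain sums of products and squares are enough.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when one variable is constant")
    return cov / math.sqrt(ss_x * ss_y)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
r = pearson_r(x, y)

# Shifting and positively rescaling either variable leaves r unchanged.
r_shifted = pearson_r([10 * v + 3 for v in x], [0.5 * v - 7 for v in y])
print(r, r_shifted)
```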

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better than Pearson's r at catching nonlinear monotonic correlations. Its value lies between -1 and +1: -1 indicates total negative monotonic correlation, 0 indicates no monotonic correlation, and +1 indicates total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
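
As a minimal sketch of that recipe, the code below ranks each variable and then applies the covariance-over-standard-deviations formula to the ranks. Giving tied values their average rank is an assumption of this example (a common convention), and the function names are illustrative.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson_r(xs, ys):
    """Pearson correlation, used here on the rank variables."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


def spearman_rho(xs, ys):
    """Pearson correlation of the rank variables of xs and ys."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # monotonic but nonlinear in x
print(spearman_rho(x, y), pearson_r(x, y))
```

On this monotonic but nonlinear relationship the rank-based coefficient reaches its maximum, while Pearson's r on the raw values stays below it.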

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1: -1 indicates total negative correlation, 0 indicates no correlation, and +1 indicates total positive correlation.

To calculate τ for two variables X and Y, one counts the concordant and discordant pairs of observations. τ is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
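
The pair-counting definition translates directly into code. The sketch below implements exactly that definition (often called tau-a); statistical libraries frequently apply a tie correction instead, so results may differ when the data contain ties. The function name is illustrative.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """(concordant - discordant) / total number of pairs, without tie correction."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:      # both variables move the same way
            concordant += 1
        elif direction < 0:    # the variables move in opposite ways
            discordant += 1
        # direction == 0 means a tie in x or y; it counts for neither side
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


# One swapped pair out of six: 5 concordant, 1 discordant.
tau = kendall_tau_a([1, 2, 3, 4], [1, 3, 2, 4])
print(tau)
```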

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available from the Phik project.

Cramér's V (φc)

Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been shown to be biased, even for large samples. We use the bias-corrected measure proposed by Bergsma in 2013.
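
A minimal sketch of a bias-corrected Cramér's V follows. It uses the correction commonly attributed to Bergsma (2013) as an assumption about what the report computes, not a copy of its code; the function name is illustrative, and the sample pairs borrow six state-to-region mappings from the sample rows in this report.

```python
import math
from collections import Counter


def cramers_v_corrected(xs, ys):
    """Bias-corrected Cramér's V for two sequences of category labels."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    r, k = len(row_totals), len(col_totals)
    if n < 2 or r < 2 or k < 2:
        return 0.0  # association is not meaningful for a constant variable

    # Pearson chi-squared statistic of the contingency table.
    chi2 = 0.0
    for a, row_n in row_totals.items():
        for b, col_n in col_totals.items():
            expected = row_n * col_n / n
            chi2 += (joint.get((a, b), 0) - expected) ** 2 / expected

    # Bias correction of phi^2 and of the table dimensions.
    phi2_corr = max(0.0, chi2 / n - (r - 1) * (k - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(r_corr - 1, k_corr - 1)
    return math.sqrt(phi2_corr / denom) if denom > 0 else 0.0


# Each state belongs to exactly one region, so the association is strong.
region_of = {"Punjab": "NR", "Haryana": "NR", "Assam": "NER",
             "Tripura": "NER", "Odisha": "ER", "Sikkim": "ER"}
states = list(region_of) * 20
regions = [region_of[s] for s in states]
v_dependent = cramers_v_corrected(states, regions)

# Two labels that vary independently of each other.
v_independent = cramers_v_corrected(["a", "a", "b", "b"] * 10,
                                    ["u", "v", "u", "v"] * 10)
print(v_dependent, v_independent)
```

A region that is fully determined by the state yields a value close to 1, which is consistent with the high-correlation warning raised for States and Regions.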

Missing values

Sample

First rows

   States        Regions  latitude   longitude  Dates                Usage
0  Punjab        NR       31.519974  75.980003  02/01/2019 00:00:00  119.9
1  Haryana       NR       28.450006  77.019991  02/01/2019 00:00:00  130.3
2  Rajasthan     NR       26.449999  74.639981  02/01/2019 00:00:00  234.1
3  Delhi         NR       28.669993  77.230004  02/01/2019 00:00:00  85.8
4  UP            NR       27.599981  78.050006  02/01/2019 00:00:00  313.9
5  Uttarakhand   NR       30.320409  78.050006  02/01/2019 00:00:00  40.7
6  HP            NR       31.100025  77.166597  02/01/2019 00:00:00  30.0
7  J&K           NR       33.450000  76.240000  02/01/2019 00:00:00  52.5
8  Chandigarh    NR       30.719997  76.780006  02/01/2019 00:00:00  5.0
9  Chhattisgarh  WR       22.090420  82.159987  02/01/2019 00:00:00  78.7

Last rows

       States             Regions  latitude   longitude  Dates                Usage
16589  Odisha             ER       19.820430  85.900017  05/12/2020 00:00:00  95.1
16590  West Bengal        ER       22.580390  88.329947  05/12/2020 00:00:00  110.4
16591  Sikkim             ER       27.333330  88.616647  05/12/2020 00:00:00  1.2
16592  Arunachal Pradesh  NER      27.100399  93.616601  05/12/2020 00:00:00  2.1
16593  Assam              NER      26.749981  94.216667  05/12/2020 00:00:00  20.3
16594  Manipur            NER      24.799971  93.950017  05/12/2020 00:00:00  2.5
16595  Meghalaya          NER      25.570492  91.880014  05/12/2020 00:00:00  5.8
16596  Mizoram            NER      23.710399  92.720015  05/12/2020 00:00:00  1.6
16597  Nagaland           NER      25.666998  94.116570  05/12/2020 00:00:00  2.1
16598  Tripura            NER      23.835404  91.279999  05/12/2020 00:00:00  3.3

Duplicate rows

Most frequent

   States             Regions  latitude   longitude  Dates                Usage  count
0  Arunachal Pradesh  NER      27.100399  93.616601  08/07/2019 00:00:00  1.4    2
1  Arunachal Pradesh  NER      27.100399  93.616601  12/07/2019 00:00:00  2.1    2
2  Meghalaya          NER      25.570492  91.880014  10/07/2019 00:00:00  4.1    2
3  Mizoram            NER      23.710399  92.720015  09/07/2019 00:00:00  1.4    2
4  Mizoram            NER      23.710399  92.720015  10/07/2019 00:00:00  1.4    2
5  Nagaland           NER      25.666998  94.116570  10/07/2019 00:00:00  1.8    2
6  Nagaland           NER      25.666998  94.116570  12/07/2019 00:00:00  2.1    2
7  Pondy              SR       11.934994  79.830000  12/07/2019 00:00:00  7.4    2
8  Sikkim             ER       27.333330  88.616647  10/07/2019 00:00:00  1.5    2
9  Tripura            NER      23.835404  91.279999  10/07/2019 00:00:00  2.9    2
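
A table like the one above can be produced by counting identical rows. The sketch below is a plain-Python illustration on a handful of rows taken from this report (two of them repeated to mimic a count of 2); it is not the report's own implementation.

```python
from collections import Counter

# (States, Regions, latitude, longitude, Dates, Usage)
mizoram = ("Mizoram", "NER", 23.710399, 92.720015, "09/07/2019 00:00:00", 1.4)
pondy = ("Pondy", "SR", 11.934994, 79.83, "12/07/2019 00:00:00", 7.4)
punjab = ("Punjab", "NR", 31.519974, 75.980003, "02/01/2019 00:00:00", 119.9)

rows = [mizoram, pondy, punjab, mizoram, pondy]

# Count every distinct row, then keep only those seen more than once.
counts = Counter(rows)
duplicates = {row: count for row, count in counts.items() if count > 1}

for row, count in sorted(duplicates.items()):
    print(*row, count)
```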