Inferential Analysis
For inferential analysis, two SARS COVID-19 datasets are provided that contain the number of cases and deaths due to COVID for different countries. As we are dealing with China, we filtered the datasets for cities in China. The provided COVID datasets are cumulative, so we first subtracted from each record its preceding value to obtain non-cumulative (daily) data. The datasets were then filtered, transposed and merged with the AQI dataset to obtain the final dataset for inferential analysis. In the air quality dataset, concentrations are measured every 6 hours, so there are 4 observations for each date, whereas the SARS datasets have one observation per day. Because WHO releases the data for COVID deaths and cases at the end of the day, we selected the AQI data recorded at 6 pm and merged it with the SARS dataset to perform the inferential tests.
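The preparation steps described above can be sketched in Python with pandas. This is only an illustration under stated assumptions: the column names, the sample values and the treatment of the first day (which has no preceding value) are hypothetical and are not taken from the provided datasets.

```python
import pandas as pd

# Hypothetical cumulative case counts (real column names are not given in the text).
covid = pd.DataFrame({
    "date": pd.to_datetime(["2020-02-01", "2020-02-02", "2020-02-03", "2020-02-04"]),
    "city": ["Beijing"] * 4,
    "cum_cases": [10, 14, 14, 21],
})

# Daily (non-cumulative) counts: each value minus the preceding one, per city.
covid = covid.sort_values(["city", "date"])
covid["cases"] = covid.groupby("city")["cum_cases"].diff()
# Assumption: the first day has no predecessor, so its cumulative value is kept.
covid["cases"] = covid["cases"].fillna(covid["cum_cases"]).astype(int)

# Hypothetical AQI readings taken every 6 hours (4 per date).
aqi = pd.DataFrame({
    "timestamp": pd.date_range("2020-02-01 00:00", periods=16,
                               freq=pd.Timedelta(hours=6)),
    "city": "Beijing",
    "aqi": list(range(100, 116)),
})

# Keep only the 18:00 (6 pm) reading so there is one AQI value per date.
aqi_6pm = aqi[aqi["timestamp"].dt.hour == 18].copy()
aqi_6pm["date"] = aqi_6pm["timestamp"].dt.normalize()

# Merge daily COVID counts with the 6 pm AQI reading on date and city.
merged = covid.merge(aqi_6pm[["date", "city", "aqi"]],
                     on=["date", "city"], how="inner")
print(merged[["date", "city", "cases", "aqi"]])
```

Selecting a single reading per day keeps the two sources at the same granularity, so each merged row pairs one daily count with one AQI value.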
​
Correlation between different variables - Pearson and Spearman Tests
We first checked the correlation between air quality and the number of COVID deaths and cases in two cities, 'Beijing' and 'Shanghai'. The Pearson test is used to analyse the relationship of AQI with Cases and Deaths, whereas the Spearman test is more reliable for analysing the relationship of AQI Category with Cases and Deaths, since AQI Category is an ordinal variable.
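As a rough illustration of the two tests, the sketch below uses SciPy on made-up values. The data and the numeric encoding of AQI Category as an ordered rank are assumptions for the example only and do not reproduce the report's results.

```python
from scipy import stats

# Hypothetical daily values; not the report's data.
aqi = [55, 80, 120, 160, 95, 210, 70, 130]
deaths = [0, 1, 0, 2, 1, 3, 0, 1]
# Assumption: AQI Category encoded as an ordered rank (1 = best air, 6 = worst).
aqi_category = [2, 2, 3, 4, 2, 5, 2, 3]

# Pearson: linear association between two numerical variables.
r, p_pearson = stats.pearsonr(aqi, deaths)

# Spearman: rank-based association, suited to an ordinal variable.
rho, p_spearman = stats.spearmanr(aqi_category, deaths)

alpha = 0.05
print(f"Pearson  r   = {r:.3f}, p = {p_pearson:.3f}, significant: {p_pearson < alpha}")
print(f"Spearman rho = {rho:.3f}, p = {p_spearman:.3f}, significant: {p_spearman < alpha}")
```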

For Beijing:
- AQI Category does not correlate with the number of COVID cases or deaths, as the p-values are > 0.05.
- AQI correlates with the number of deaths but not with the number of cases.
​​
For Shanghai:
- Neither AQI nor AQI Category correlates with the number of deaths or cases.
​
Does AQI really correlate with the deaths due to COVID in Beijing?
To answer this question, let's perform a linear regression on the variables AQI and Deaths.
​
Linear Regression


ŷ = a + b*x

where ŷ is the predicted (estimated) value of y (the dependent variable), x is the predictor, a is the intercept and b is the slope. Here we take AQI as the x-variable (predictor) and Deaths as the y-variable (dependent). The results show that the intercept estimate (a) is -0.00713 and the slope estimate (b) is 0.00026, so the regression equation becomes:

ŷ = -0.00713 + 0.00026*x
​
It predicts that for every one-unit increase in AQI, the predicted number of deaths increases by 0.00026.
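To make the meaning of the slope concrete, the short sketch below plugs the reported intercept and slope into the fitted line. The function name and the AQI values 100 and 101 are chosen only for illustration.

```python
# Coefficients reported in the text for Beijing (x = AQI, y = Deaths).
INTERCEPT = -0.00713
SLOPE = 0.00026


def predict_deaths(aqi: float) -> float:
    """Predicted daily deaths from the fitted line y_hat = a + b * x."""
    return INTERCEPT + SLOPE * aqi


# Raising AQI by one unit changes the prediction by the slope.
delta = predict_deaths(101) - predict_deaths(100)
print(f"predicted deaths at AQI 100: {predict_deaths(100):.5f}")
print(f"change per one-unit increase in AQI: {delta:.5f}")
```

The change is the same whatever starting AQI is used, because the model is a straight line.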
Also, r-square (r²) is 0.0099, which shows that only 0.99% of the variation in ‘Deaths’ can be explained by the variation in ‘AQI’, which is almost negligible. The p-value is greater than 0.05, suggesting that AQI is not a good predictor in the present case. The following fit plot also supports this interpretation.
​
We can therefore observe that the correlation between AQI and the number of deaths for ‘Beijing’ is a spurious correlation: the variables appear to be mathematically associated but not causally related.
So, we can conclude that there is no meaningful correlation between the air quality of the city and the number of confirmed COVID cases or deaths.
Identify Mean Difference of AQI among cities in China Before and During Lockdown - T-Test
We check the difference in mean AQI between two cities, ‘Beijing’ and ‘Shanghai’, before and during COVID. COVID-19 cases were first recorded around the start of 2020 and the pandemic has not yet ended, so 2020-2021 is taken as the period during COVID. AQI data from 2018 and 2019 are taken as the period before COVID. A two-sample t-test is used for the analysis. The descriptive analysis found that Beijing is more polluted than Shanghai, so our alternative hypothesis is that the mean AQI in Beijing is greater than that in Shanghai.
​
- During COVID:
​​
To check for a difference in mean AQI between the two cities, we performed a t-test, as the data in the two samples are independent, and obtained the following results:



The Kolmogorov p-value comes out to be < 0.05, so the data are not strictly normal. However, the QQ plots and histograms show that the data are approximately normally distributed, so we proceed with the t-test (as the population variance is unknown). Let
H0 be the null hypothesis and Ha the alternative hypothesis, with level of significance α = 0.05.
Step 1: Let H0: the variance of AQI in Beijing is the same as that in Shanghai during lockdown
σb^2 = σsh^2
and Ha: the variance of AQI in Beijing is not the same as that in Shanghai during lockdown
σb^2 ≠ σsh^2
First, we looked at the result of the F test for equal variances and found that the p-value is < 0.0001, which means that we have enough statistical evidence to reject the null hypothesis. Thus the two variances are not equal, and we use the ‘Satterthwaite’ results for further interpretation.
Step 2: Let H0: the mean AQI in Beijing is the same as the mean AQI in Shanghai during lockdown
μb = μsh
Ha: the mean AQI in Beijing is greater than the mean AQI in Shanghai during lockdown
μb > μsh
From the Satterthwaite results, the t value is 8.89 and the p-value is < 0.001, so we have enough statistical evidence to reject the null hypothesis and conclude that μb > μsh: the mean AQI in Beijing is greater than that in Shanghai during lockdown.
The box plots and histograms show the same result; the difference in means can be clearly seen in the box plots.
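The two steps can be sketched in Python with NumPy and SciPy. The samples are randomly generated stand-ins, not the report's AQI data. The F test here is a hand-written two-sided variance-ratio test, an assumed equivalent of the equal-variance test quoted above. The unequal-variance t-test is SciPy's Welch test, which uses the Satterthwaite degrees of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical AQI samples for the two cities; not the report's data.
beijing = rng.normal(loc=90, scale=40, size=300)
shanghai = rng.normal(loc=70, scale=20, size=300)

# Step 1: two-sided F test for equal variances (larger variance in the numerator).
var_b = np.var(beijing, ddof=1)
var_sh = np.var(shanghai, ddof=1)
if var_b >= var_sh:
    f_stat, dfn, dfd = var_b / var_sh, beijing.size - 1, shanghai.size - 1
else:
    f_stat, dfn, dfd = var_sh / var_b, shanghai.size - 1, beijing.size - 1
p_f = min(1.0, 2 * stats.f.sf(f_stat, dfn, dfd))
equal_var = p_f >= 0.05  # fail to reject equal variances only if p >= alpha

# Step 2: one-sided t-test for Ha: mu_b > mu_sh.
# With equal_var=False SciPy runs the Welch (Satterthwaite) version.
t_stat, p_t = stats.ttest_ind(beijing, shanghai,
                              equal_var=equal_var, alternative="greater")

print(f"F = {f_stat:.2f}, p = {p_f:.4g}, equal variances assumed: {equal_var}")
print(f"t = {t_stat:.2f}, one-sided p = {p_t:.4g}")
```

Letting the F test decide between the pooled and Welch forms mirrors the procedure in the text. Running Welch unconditionally is a common alternative.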
​
​
- Before COVID:
​​
To check for a difference in mean AQI between the two cities before COVID, we performed a t-test, as the data in the two samples are independent, and obtained the following results:



Again, the Kolmogorov p-value comes out to be < 0.05, so the data are not strictly normal. However, the QQ plots and histograms show that the data are approximately normally distributed, so we proceed with the t-test, using the same notation and level of significance (α = 0.05) as above.
​
Step 1: Let H0: the variance of AQI in Beijing is the same as that in Shanghai before lockdown
σb^2 = σsh^2
and Ha: the variance of AQI in Beijing is not the same as that in Shanghai before lockdown
σb^2 ≠ σsh^2
First, we looked at the result of the F test for equal variances and found that the p-value is < 0.0001, which means that we have enough statistical evidence to reject the null hypothesis. Thus the two variances are not equal, and we use the ‘Satterthwaite’ results for further interpretation.
​
Step 2: Let H0: the mean AQI in Beijing is the same as the mean AQI in Shanghai before lockdown
μb = μsh
Ha: the mean AQI in Beijing is greater than the mean AQI in Shanghai before lockdown
μb > μsh
From the Satterthwaite results, the t value is 5.57 and the p-value is < 0.001, so we have enough statistical evidence to reject the null hypothesis and conclude that μb > μsh: the mean AQI in Beijing is greater than that in Shanghai before lockdown.
The box plots and histograms show the same result; the difference in means can be clearly seen in the box plots.
​
Thus, the mean AQI in Beijing is higher than that in Shanghai, both before and during COVID.
Relation of AQI Categories with COVID confirmed cases among all cities - Chi-square test
The chi-square test is used to find the relation between two categorical variables. However, confirmed cases is a numerical variable, so we use HPBIN, which automatically sorts the values and divides their range into a defined number of bins/intervals. Here, COVID cases are binned into 3 groups using the pseudo-quantile HPBIN method, which gives the following cut-points:


From the cut-points, we convert the numerical variable ‘Cases’ into the categorical variable ‘Cases_cat’ with 3 groups:
- ‘No cases’, where Cases = 0
- ‘Few cases’, where Cases are between 1 and 3
- ‘Noticeable Cases’, where Cases are greater than 3.
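The mapping from ‘Cases’ to ‘Cases_cat’ can be written as a small Python function. This sketch only applies the three groups listed above; it does not reproduce how HPBIN derives the cut-points. The function name and sample counts are hypothetical.

```python
def categorize_cases(cases: int) -> str:
    """Map a daily case count to the three groups listed above."""
    if cases < 0:
        # Assumption: daily counts are non-negative after de-cumulating.
        raise ValueError("case count must be non-negative")
    if cases == 0:
        return "No cases"
    if cases <= 3:
        return "Few cases"
    return "Noticeable Cases"


daily_cases = [0, 2, 5, 1, 0, 12]  # hypothetical daily counts
cases_cat = [categorize_cases(c) for c in daily_cases]
print(cases_cat)
```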
​​
Now, before moving to the chi-square test, let
H0 be the null hypothesis and
Ha be the alternative hypothesis.
‘Table A’ is also used to interpret the results of the chi-square test.
- For Beijing:
​​
Let,
H0: AQI Category and confirmed cases are independent of each other for Beijing.
Ha: AQI Category and confirmed cases are not independent of each other for Beijing.
The row variable should be the dependent variable, so we define ‘Cases’ as the row variable and ‘AQI_Category’ as the column (independent) variable to obtain a two-way table. However, some expected frequencies come out to be < 5, which makes the chi-square test invalid. To make it valid, we grouped the AQI Categories 'Very Unhealthy', 'Unhealthy for Sensitive Groups' and 'Hazardous' as 'Beyond Unhealthy' and ran the test again, obtaining the following results:
After grouping the AQI Categories, Table C2 shows that the expected frequencies are all more than 5, which makes the chi-square test valid. From Table C3, the chi-square statistic (χ²) is 8.49 with 6 degrees of freedom (DF). From Table A, the decision point (DP) corresponding to DF = 6 is 12.59, and χ² is smaller than this value, so we do not have enough statistical evidence to reject H0. We conclude that there is no evidence of an association between AQI Category and confirmed COVID cases in Beijing, so we treat them as independent. Also, the p-value in Table C2 is 0.2 (> 0.05), which supports our interpretation.
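A chi-square test of independence with the expected-frequency check can be sketched with SciPy as follows. The 3 x 4 table of counts is invented for illustration (three case groups by four AQI columns, which gives DF = 6) and does not reproduce Tables C2 or C3.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical two-way table of day counts; not the report's data.
# Rows: 'No cases', 'Few cases', 'Noticeable Cases'.
# Columns: four AQI categories after grouping (assumed layout).
observed = np.array([
    [40, 35, 20, 15],
    [30, 32, 22, 16],
    [20, 25, 18, 17],
])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)

# Rule of thumb used in the text: expected counts should not fall below 5.
valid = bool((expected >= 5).all())

alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"chi-square = {chi2_stat:.3f}, DF = {dof}, p = {p_value:.3f}")
print(f"all expected counts >= 5: {valid}; decision: {decision}")
```

If `valid` is false, merging sparse categories and rerunning the test, as done above for Beijing, is one way to meet the expected-frequency condition.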
​
​
​
- For Shanghai:
​​
Let,
H0: AQI Category and confirmed cases are independent of each other for Shanghai.
Ha: AQI Category and confirmed cases are not independent of each other for Shanghai.
The row variable should be the dependent variable, so we define ‘Cases’ as the row variable and ‘AQI_Category’ as the column (independent) variable to obtain a two-way table, and got the following results:

We can see in Table C4 that no expected frequency is less than 5, which makes the chi-square test valid. From Table C5, the chi-square statistic (χ²) is 10.338 with 6 degrees of freedom (DF). From Table A, the decision point (DP) corresponding to DF = 6 is 12.59, and χ² is smaller than this value, so we do not have enough statistical evidence to reject H0. We conclude that there is no evidence of an association between AQI Category and confirmed COVID cases in Shanghai, so we treat them as independent. Also, the p-value in Table C5 is 0.11 (> 0.05). Thus, the p-value and the above bar and mosaic plots support our interpretation.
​
​
From the chi-square tests between AQI Category and confirmed cases in the two cities, Beijing and Shanghai, we found no evidence of an association and conclude that AQI Category and COVID cases can be treated as independent of each other in both cities.
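The decision point and p-values quoted for the two cities can be cross-checked from the chi-square distribution itself. The sketch below uses SciPy with the statistics reported above (8.49 and 10.338, both with DF = 6) and α = 0.05. It is a consistency check, not the procedure used to produce the tables.

```python
from scipy.stats import chi2

alpha, dof = 0.05, 6

# Critical value (decision point) of the chi-square distribution at alpha.
critical_value = chi2.ppf(1 - alpha, dof)
print(f"decision point for DF = {dof}: {critical_value:.2f}")

# Chi-square statistics reported in the text.
reported = {"Beijing": 8.49, "Shanghai": 10.338}
p_values = {}
for city, stat in reported.items():
    p_values[city] = chi2.sf(stat, dof)  # upper-tail probability
    decision = "reject H0" if stat > critical_value else "fail to reject H0"
    print(f"{city}: chi-square = {stat}, p = {p_values[city]:.3f}, {decision}")
```

Comparing the statistic with the decision point and comparing the p-value with α are equivalent, so the two checks always agree.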
​
All the tests performed above under inferential analysis lead us to conclude that COVID-19 does not have any actual relationship with the air quality of the cities in China.
