Summary
Introduction: Few studies have been conducted to construct a reliable predictive model for the differential diagnosis of severe and non-severe Coronavirus disease-2019 (COVID-19) in the early stages of the disease. This study aimed to compare the accuracy of linear discriminant analysis (LDA) and binary logistic regression (BLR), two statistical classification methods, in predicting COVID-19 severity using single laboratory data and calculated indexes such as the neutrophil-to-lymphocyte ratio (NLR) and systemic immune-inflammation index (SII).
Materials and Methods: We investigated 109 patients with confirmed COVID-19 pneumonia. Epidemiological, demographic, clinical, laboratory, and outcome data were obtained, and the patients were classified into two groups: mild group (42 patients) and severe group (67 patients).
Results: A comparison of the clinical data in the severe and non-severe groups showed significant differences in SpO2 and respiratory rate. In addition, significant differences in NLR, SII, white blood cell count, neutrophil count, mean corpuscular volume, mean corpuscular hemoglobin, lymphocyte count, erythrocyte sedimentation rate, lactate dehydrogenase, and blood urea nitrogen were found between the two groups. Moreover, there was only a small difference between the LDA and BLR models, and LDA was more appropriate for a smaller sample size.
Conclusion: Our predictive models could help clinicians to identify patients at risk of severe COVID-19. Such prediction can be performed using a simple blood test. LDA and BLR can be used to effectively classify patients with severe and non-severe COVID-19, even with violation of the normality assumption.
Introduction
The novel Coronavirus, named by the World Health Organization as Severe acute respiratory syndrome-Coronavirus-2, has spread worldwide[1]. Coronavirus disease-2019 (COVID-19) is an infectious disease with a high incidence that affects people differently and poses a threat to people’s life and health. Among patients with novel Coronavirus infections, approximately 81% had mild disease, 14% had severe disease, and 5% were critical cases[2], and severe illness often led to death, based on the available investigations[2, 3]. Critically ill patients have a high mortality and poor prognosis. Therefore, the early prediction of moderate or severe acute respiratory syndrome (ARS) is vital and can help clinicians to reduce the mortality rate[2, 4, 5].
The neutrophil-to-lymphocyte ratio (NLR) and systemic immune-inflammation index (SII), as indicators of inflammation and immune response, were calculated from a routine blood test[6, 7]. Neutrophil-to-lymphocyte ratio is known as a risk factor for mortality from infectious diseases, malignancies, intracerebral hemorrhage, and dermatomyositis[6, 8]. Systemic immune-inflammation index is a prognostic factor in some malignancies such as breast cancer, hepatocellular carcinoma, and esophageal squamous cell carcinoma[7, 9, 10]. It has been demonstrated that severe and critical ARS cases tend to have higher neutrophil counts and lower lymphocyte counts. Several studies suggest that NLR and SII are two independent predictors of COVID-19 progression in the early stages of the disease[3, 6, 7]. In addition, several single laboratory and clinical markers, including C-reactive protein, lymphocyte and neutrophil counts, creatine phosphokinase, erythrocyte sedimentation rate (ESR), urea, and creatinine, have been tested on patients with COVID-19, and can also be used for predicting the ARS severity[4, 11-14].
Thus far, studies on COVID-19 have focused on the epidemiology of the disease, clinical characteristics of patients, and the risk factors associated with mortality during hospitalization in critical COVID-19 cases. However, few studies have been conducted to predict progression among patients in the early stages of the disease, and these studies were based on calculated indexes or single blood test factors[4, 13, 15]. This study aimed to compare the accuracy of two statistical classifiers, binary logistic regression (BLR) and linear discriminant analysis (LDA), in predicting COVID-19 severity with single laboratory and clinical data or indexes calculated from blood tests (NLR and SII). Predicting the severity of disease conditions is a binary classification problem, and the choice of statistical method for fitting the data is a frequent question for researchers[16]. The two methods differ in their basic principles. While BLR makes no assumptions about the distribution of the explanatory data, LDA was developed for normally distributed explanatory variables. It is therefore reasonable to expect better results from LDA when the normality assumptions are met, whereas BLR is more appropriate in all other situations[17]. Therefore, BLR and LDA, the two most applicable statistical classifier techniques, were used for baseline prediction in this study because of the increasing interest in choosing between BLR and LDA for biological data analysis.
Methods
Data Collection
The participants of the present study were 109 patients diagnosed with COVID-19 pneumonia [confirmed by computed tomography (CT) and reverse transcriptase-polymerase chain reaction (RT-PCR)] in Baqiyatallah Hospital, Tehran, Iran, between February 20, 2020 and June 9, 2020. Epidemiological, demographic, clinical, laboratory, and outcome data were obtained from the Baqiyatallah laboratory computer system, electronic medical records, and interviews with patients. The patients were then divided into two groups according to the severity of the disease: a mild group (42 patients) and a severe group (67 patients).
Patients were classified as having severe or non-severe COVID-19; severe disease was defined by clinical signs of pneumonia with SpO2 <90% on room air or by treatment in the intensive care unit.
The proposal of the study was approved by the Research Ethics Committee, Baghiyatallah University of Medical Sciences, Tehran, Iran (coded: IR.BMSU.RETECH.REC.1399.094). All participants signed a written informed consent.
Statistical Analysis
Statistical analyses were performed using IBM Statistical Package for the Social Sciences Statistics Software (version 26; IBM, New York, USA). Quantitative data were presented as mean ± standard error. A p value of <0.05 was considered statistically significant. ANOVA and the t-test were used to compare groups and the means of two groups, respectively. For classification, we compared BLR analysis with LDA, and the receiver operating characteristic (ROC) curve was plotted for each model. In addition, we examined the predictive ability of two different sets of independent factors: the calculated laboratory indexes NLR and SII (SII defined as neutrophil × platelet/lymphocyte), and single hematological factors (selected based on ANOVA).
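The two calculated indexes can be illustrated with a short Python sketch. This is not the software used in the study (the analyses were run in SPSS); the function names and the example counts are hypothetical and serve only to show the arithmetic behind NLR and SII.

```python
def nlr(neutrophils: float, lymphocytes: float) -> float:
    """Neutrophil-to-lymphocyte ratio: neutrophil count / lymphocyte count."""
    if lymphocytes <= 0:
        raise ValueError("lymphocyte count must be positive")
    return neutrophils / lymphocytes


def sii(neutrophils: float, platelets: float, lymphocytes: float) -> float:
    """Systemic immune-inflammation index: neutrophil * platelet / lymphocyte."""
    if lymphocytes <= 0:
        raise ValueError("lymphocyte count must be positive")
    return neutrophils * platelets / lymphocytes


# Hypothetical counts in 10^9 cells/L; these are NOT values from the study.
example_nlr = nlr(neutrophils=6.0, lymphocytes=1.5)
example_sii = sii(neutrophils=6.0, platelets=200.0, lymphocytes=1.5)
print(example_nlr, example_sii)
```

Both indexes use the same three counts from a routine blood test, so they can be derived without any additional laboratory work.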
Binary Logistic Regression Analysis
Binary logistic regression is a type of regression used to predict the probability of the presence or absence of a particular disease, characteristic, or condition. A logistic regression model predicts a dependent variable by analyzing its relationship with one or more independent variables, and the predicted probability must lie between 0 and 1.
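As a minimal sketch of how a fitted logistic model turns predictor values into a probability, the following Python code applies the logistic function to a linear combination of predictors. The intercept, coefficients, and cutoff of 0.5 are hypothetical placeholders, not the estimates reported in this study.

```python
import math


def predicted_probability(intercept, coefficients, values):
    """Logistic model: p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))."""
    z = intercept + sum(b * x for b, x in zip(coefficients, values))
    # Two algebraically equivalent forms, chosen to avoid overflow in exp().
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))


def classify(probability, cutoff=0.5):
    """Assign group 1 (e.g. severe) when the probability reaches the cutoff."""
    return 1 if probability >= cutoff else 0


# Hypothetical coefficients for two predictors (assumed for illustration only).
p = predicted_probability(intercept=-2.0, coefficients=[0.3, 0.001], values=[4.0, 800.0])
print(round(p, 3), classify(p))
```

Whatever the predictor values, the logistic function keeps the output strictly between 0 and 1, which is why it can be read as a probability of group membership.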
Linear Discriminant Analysis
The discriminant analysis focuses on the association between multiple independent variables and a categorical dependent variable by forming a composite of the independent variables. This type of multivariate analysis can determine the extent to which any of the composite variables discriminates between two or more pre-existing groups of subjects, in addition to deriving a classification model for predicting the group membership of new observations. The simplest type of discriminant analysis is when the dependent variable has two groups. In this case, a linear discriminant function that passes through the means of the two groups (centroids) can be used to discriminate subjects between the two groups.
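The two-group case can be sketched in Python as follows. This is an illustrative implementation of Fisher's linear discriminant for two predictors with a pooled covariance matrix and equal prior probabilities (an assumption made here for simplicity); the data are toy values, not study data, and the function names are hypothetical.

```python
def mean_vector(rows):
    """Centroid of a group of (x1, x2) observations."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter(rows, mean):
    """Sums of squares and cross-products around the group centroid."""
    sxx = sum((r[0] - mean[0]) ** 2 for r in rows)
    sxy = sum((r[0] - mean[0]) * (r[1] - mean[1]) for r in rows)
    syy = sum((r[1] - mean[1]) ** 2 for r in rows)
    return sxx, sxy, syy


def fit_lda(group0, group1):
    """Return discriminant weights w and the cutoff midway between centroids."""
    m0, m1 = mean_vector(group0), mean_vector(group1)
    s0, s1 = scatter(group0, m0), scatter(group1, m1)
    dof = len(group0) + len(group1) - 2
    a, b, d = [(u + v) / dof for u, v in zip(s0, s1)]  # pooled covariance [[a, b], [b, d]]
    det = a * d - b * b
    if det == 0:
        raise ValueError("pooled covariance matrix is singular")
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    # w = inverse(pooled covariance) * (centroid1 - centroid0)
    w = [(d * diff[0] - b * diff[1]) / det, (a * diff[1] - b * diff[0]) / det]
    mid = [(m0[0] + m1[0]) / 2, (m0[1] + m1[1]) / 2]
    cutoff = w[0] * mid[0] + w[1] * mid[1]
    return w, cutoff


def predict(w, cutoff, x):
    """Assign group 1 when the discriminant score exceeds the cutoff."""
    return 1 if w[0] * x[0] + w[1] * x[1] > cutoff else 0


# Toy (neutrophil, lymphocyte) pairs; hypothetical, not taken from the study.
mild = [(2.0, 2.0), (3.0, 1.5), (2.5, 2.5), (3.5, 2.0)]
severe = [(6.0, 1.0), (7.0, 0.8), (6.5, 1.2), (7.5, 0.6)]
weights, cutoff = fit_lda(mild, severe)
print(predict(weights, cutoff, (7.0, 0.9)), predict(weights, cutoff, (2.5, 2.0)))
```

The discriminant score is the composite of the independent variables described above; a new observation is assigned to the group whose centroid lies on the same side of the cutoff.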
Receiver Operating Characteristic Curve
For each model, we plotted the corresponding ROC curve. A ROC curve graphically displays sensitivity against 100% minus specificity (the false positive rate) at several cutoff points. By plotting the ROC curves for two models on the same axes, we are able to determine which model is better for classification, namely, the model whose curve encloses the larger area beneath it.
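The construction of a ROC curve and its area can be sketched in plain Python. The labels and scores below are toy values chosen for illustration, and the trapezoidal rule used for the area is one common choice, assumed here rather than taken from the study's software.

```python
def roc_points(labels, scores):
    """Return (false positive rate, sensitivity) pairs, one per cutoff."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Sweep the cutoff from the highest score down to the lowest.
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data: 1 = severe, 0 = mild; scores are hypothetical model outputs.
labels = [0, 0, 1, 1]
model_a_scores = [0.1, 0.4, 0.35, 0.8]
model_b_scores = [0.1, 0.2, 0.7, 0.9]
auc_a = auc(roc_points(labels, model_a_scores))
auc_b = auc(roc_points(labels, model_b_scores))
print(auc_a, auc_b)
```

In this toy comparison the second set of scores separates the groups perfectly, so its curve encloses the larger area.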
Results
Demographics Data of Patients with Mild and Severe COVID-19
The present study included 109 patients with confirmed COVID-19 pneumonia (confirmed by CT and RT-PCR) who were admitted to Baqiyatallah Hospital, Tehran, Iran, between February 20, 2020 and June 9, 2020. These patients were divided into two groups according to their disease severity (mild or severe). Table 1 presents the demographic characteristics of the 109 patients. There were no significant differences in sex, age, or BMI between patients with severe and mild COVID-19. Patients in the severe group had lower SpO2 and were more likely to have comorbidities (Table 1). Other characteristics such as respiratory rate and blood pressure showed no significant difference between the two groups.
Laboratory Findings of Patients with Mild and Severe COVID-19
The patients with severe COVID-19 had higher white blood cell and neutrophil counts, mean corpuscular volume (MCV), and mean corpuscular hemoglobin (MCH) than those with mild COVID-19, with statistically significant differences (p<0.05) (Table 2). In contrast, lymphocyte count was significantly reduced in severe COVID-19 patients. Compared with the mild group, ESR, lactate dehydrogenase (LDH), and blood urea nitrogen were significantly increased in the severe group (Table 2). The calculated indexes NLR and SII also showed a significant difference between patients with mild and severe COVID-19 (Table 2).
Mathematical Analysis
From the laboratory parameters, five single haematologic factors were selected for the first classification: platelet count, MCV, MCH, neutrophil count, and lymphocyte count. These were compared with the two calculated indexes, NLR and SII, which formed the second classification. Both sets of variables were used in the discriminant and logistic regression analyses to determine the role of single and calculated haematologic data in the prediction of disease severity in COVID-19 patients.
Analysis by Binary Logistics Regression
BLR revealed that the two sets of predictors performed similarly: the overall correct prediction rate was 65.1% for the calculated indexes and 67.0% for the single haematologic factors (Table 3). Moreover, we observed that the Hosmer-Lemeshow goodness-of-fit test value was higher in the single haematologic factor classification.
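The overall correct prediction rate is read from a classification table that cross-tabulates observed and predicted groups. The following Python sketch shows the calculation; the counts are hypothetical round numbers chosen for illustration and do not correspond to Table 3.

```python
def classification_summary(tn, fp, fn, tp):
    """Summarize a 2x2 classification table (observed vs. predicted group).

    tn: mild predicted mild      fp: mild predicted severe
    fn: severe predicted mild    tp: severe predicted severe
    """
    total = tn + fp + fn + tp
    return {
        "overall_correct": (tn + tp) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts, NOT the study's classification table.
summary = classification_summary(tn=30, fp=10, fn=20, tp=40)
print({name: round(value, 3) for name, value in summary.items()})
```

Reporting sensitivity and specificity alongside the overall rate shows whether a model's errors fall mainly in the mild or the severe group.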
The role of predictors in explaining the outcome, using BLR, is reported in Tables 4 and 5.
The Wald statistic indicates the importance of each variable in predicting the dependent variable; a higher value indicates a more important predictor. Accordingly, SII was the more important predictor in the first model (Table 4), and neutrophil and platelet counts were the more important predictors in the second model (Table 5). In addition, the odds ratio (OR) of each predictor is reported as a measure of its association with the presence of the outcome.
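The Wald statistic and odds ratio follow directly from a coefficient and its standard error, as the Python sketch below illustrates. The coefficient and standard error are hypothetical, and the 95% interval uses the usual normal approximation (z = 1.96), which is an assumption of this sketch rather than a detail stated in the study.

```python
import math


def wald_statistic(coefficient, standard_error):
    """Wald chi-square for one predictor: (B / SE) squared."""
    return (coefficient / standard_error) ** 2


def odds_ratio(coefficient):
    """Odds ratio for a one-unit increase in the predictor: exp(B)."""
    return math.exp(coefficient)


def odds_ratio_interval(coefficient, standard_error, z=1.96):
    """Approximate 95% confidence interval for the odds ratio."""
    lower = math.exp(coefficient - z * standard_error)
    upper = math.exp(coefficient + z * standard_error)
    return lower, upper


# Hypothetical estimate (B = 0.5, SE = 0.25); not a value from Tables 4 or 5.
wald = wald_statistic(0.5, 0.25)
ratio = odds_ratio(0.5)
low, high = odds_ratio_interval(0.5, 0.25)
print(wald, round(ratio, 3), round(low, 3), round(high, 3))
```

An odds ratio above 1 means that higher values of the predictor are associated with higher odds of the outcome, and an interval that excludes 1 corresponds to a statistically significant predictor at the 5% level.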
Analysis by Linear Discriminant Analysis
The overall correct prediction rate was 67.9% for the calculated indexes and 68.8% for the single haematologic factors (Table 6); thus, as with the BLR technique, the LDA method revealed that the two sets of predictors performed similarly. Moreover, we observed that the Hosmer-Lemeshow goodness-of-fit test value was higher in the single haematologic factor classification.
The standardized discriminant function coefficients indicate the relative importance of the independent variables in predicting the dependent variable. In contrast to the BLR method, NLR was more important in LDA (Table 7). The large absolute value of the coefficient for neutrophils showed the greater discriminating ability of this factor (Table 8).
Analysis by Receiver Operating Characteristic Curve
As shown in Figure 1, the ROC curves of the aforementioned models clearly indicate that the logistic model is similar to the discriminant analysis model. In addition, Table 9, which presents the area under the ROC curve (AUC), shows no difference between the two techniques used in this study.
ROC: Receiver operating characteristic, COVID-19: Coronavirus disease-2019
Discussion
The current study provided demographic, laboratory, clinical, complication, and outcome data of hospitalized patients with non-severe and severe COVID-19 in Baqiyatallah Hospital, Tehran, Iran. More than 50% of the patients in the current study were classified as severe cases, which is consistent with Li et al.’s[18] study. In accordance with other studies[19, 20], our results showed that COVID-19 patients with comorbidities such as hypertension, coronary heart disease, diabetes, and chronic obstructive lung disease are more vulnerable to severe disease. Thus, due to the increased rate of mortality in severe patients, early identification is vital to increase the treatment efficiency of COVID-19 and reduce mortality.
Previous studies demonstrated that an increase in neutrophil counts and a decrease in lymphocyte counts were correlated with COVID-19 severity[21, 22]. The current study indicated that white blood cell count, MCV, MCH, neutrophil count, lymphocyte count, ESR, blood urea nitrogen, and LDH were significantly different between severe and non-severe patients with COVID-19. Therefore, neutrophil and lymphocyte counts, MCV, MCH, and platelets were selected as the single haematologic predictors of severity in this study and compared with NLR and SII as calculated haematologic factors. The present study compared logistic regression and LDA to investigate the accuracy of the applied classifications in order to make the choice between the methods easier. The methods do not differ in their functional forms. It seems that the advantages of LDA or BLR depend on the sample size and normality of variables[16, 17, 23]. Our results showed that logistic regression and discriminant analysis converged on similar results. Both methods estimated the same statistically significant coefficients. This finding is consistent with other studies that performed BLR and LDA to examine the effects of sample size. Pohar et al.[17] revealed that correct classification was achieved for a sample size of more than 50. However, they concluded that correct classification is sensitive to the assumption of normality, and that the LDA model performed better with a sufficiently large sample size. This report was confirmed by Musa et al.[16], who demonstrated that the differences between the two methods are negligible for a sample size of more than 100 members. ROC curve analysis and the AUC are considered another helpful measure for evaluating the quality of LDA and BLR models; in the present study, the ROC curves showed only a small difference in AUC between LDA and BLR. These results are in agreement with other studies which showed that the AUC was similar for both models[23].
As mentioned above, in the LDA and BLR analyses, single and calculated haematologic factors were associated with COVID-19 severity. The LDA and BLR analyses revealed only a small difference in the overall correct prediction rate between the two sets of predictors, although this rate was higher for the single haematologic factors (Tables 3, 6). Our data suggest that neutrophil count, lymphocyte count, MCV, MCH, and platelets should be considered to evaluate severity in patients with the novel Coronavirus, especially the differences in neutrophils between the severe and non-severe groups (Tables 5, 8). Although the role of the single-factor predictors in explaining the outcome was consistent between LDA and BLR, we observed a dissimilarity in the role of the calculated haematologic factors between the two methods. In contrast to the BLR method, NLR was more important in LDA (Tables 4, 7). Regarding the basic principles of the two methods, LDA was developed for normally distributed variables[17, 23]; thus, it is expected to be more appropriate when the normality assumptions are met.
The limitation of this study is the small sample size. Therefore, future studies should be performed with a larger number of participants.
Conclusion
Our prediction models could help clinicians to identify patients at risk of severe COVID-19 at an early stage, and this prediction can be made using a simple blood test. LDA and BLR can be used to effectively classify patients with severe and non-severe COVID-19, even with violation of the normality assumption.
Acknowledgment
The authors would like to thank the Clinical Research Development Unit of Baqiyatallah Hospital for all their support and guidance while carrying out this study.
Ethics
Ethics Committee Approval: The proposal of the study was approved by the Research Ethics Committee, Baghiyatallah University of Medical Sciences, Tehran, Iran (coded: IR.BMSU.RETECH.REC.1399.094, date: 19.04.2020).
Informed Consent: Consent form was filled out by all participants.
Peer-review: Externally and internally peer-reviewed.
Authorship Contributions
Surgical and Medical Practices: M.M.P., T.M., Z.E.N., Z.S., Concept: M.M.P., M.M.A., T.M., Z.S., Design: F.B., Z.R., Z.S., Data Collection or Processing: M.M.A., F.B., H.E.G., Z.R., Analysis or Interpretation: M.S., M.M.A., Literature Search: M.S., F.B., H.E.G., T.M., Writing: M.M.P., M.S., H.E.G., E.N.
Conflict of Interest: No conflict of interest was declared by the authors.
Financial Disclosure: The authors declared that this study received no financial support.