肝脏 (Chinese Hepatology) ›› 2023, Vol. 28 ›› Issue (4): 469-473.

• Non-alcoholic fatty liver disease •

Building a prediction model for non-alcoholic fatty liver disease based on machine learning

刘璐, 朱锦舟, 刘晓琳, 王超, 殷民月, 高静雯, 许春芳   

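  1. The First Affiliated Hospital of Soochow University, Jiangsu 215000
  • Received: 2022-06-19 Online: 2023-04-30 Published: 2023-08-29
  • Corresponding author: XU Chun-fang, Email: xcf601@163.com
  • Funding:
    National Natural Science Foundation of China, Young Scientists Fund (81900508, 82000540); Natural Science Foundation of Jiangsu Province, Youth Program (BK20190172); Suzhou Science and Technology Program (SKY2021038)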

Development of Prediction Models Based on Machine Learning for Non-alcoholic Fatty Liver Disease

LIU Lu, ZHU Jin-zhou, LIU Xiao-lin, WANG Chao, YIN Min-yue, GAO Jing-wen, XU Chun-fang   

  1. The First Affiliated Hospital of Soochow University, Jiangsu 215006, China
  • Received:2022-06-19 Online:2023-04-30 Published:2023-08-29
  • Contact: XU Chun-fang,Email:xcf601@163.com

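Abstract (translated from the Chinese): Objective To build a prediction model for non-alcoholic fatty liver disease using the automated machine learning (AutoML) framework of the H2O platform. Methods Data were collected from individuals examined at the health check-up center of the First Affiliated Hospital of Soochow University. Using structured clinical data and the H2O AutoML framework, models based on several machine learning algorithms were built to predict the onset of non-alcoholic fatty liver disease. ROC curves were plotted and confusion matrices constructed to evaluate model performance, and the important variables were visualized. Results Twenty-eight machine learning models were built automatically. The best model was a gradient boosting machine (GBM), with a Gini value of 0.80, R2 of 0.42, and LogLoss of 0.45. The five variables with the highest absolute importance in the model were triglyceride (95%CI: -1.053 ~ -0.887), aspartate aminotransferase (95%CI: -20.433 ~ -16.927), high-density lipoprotein (95%CI: 0.232 ~ 0.268), ferritin (95%CI: -80.533 ~ -68.607), and blood glucose (95%CI: -0.576 ~ -0.424). In the validation set, the best model, GBM, had a specificity of 0.818, a sensitivity of 0.715, and an AUC of 0.766, outperforming models based on XGBoost, logistic regression, random forest, and deep learning. Conclusion The machine learning model for non-alcoholic fatty liver disease offers a new diagnostic and therapeutic approach to screening patients with non-alcoholic fatty liver disease.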

Key words (translated from the Chinese): Non-alcoholic fatty liver disease, Automated machine learning, Prediction model

Abstract: Objective To develop prediction models for the incidence of non-alcoholic fatty liver disease (NAFLD) using the H2O automated machine learning (AutoML) framework. Methods A total of 4,105 subjects were recruited. H2O AutoML was applied to the data to develop various machine learning models for predicting NAFLD. The models were evaluated with ROC curves and confusion matrices and visualized with SHAP, LIME, and partial dependence plots. Results Twenty-eight machine learning models were fitted. The best model was a gradient boosting machine (GBM) (Gini 0.80, R2 0.42, LogLoss 0.45). The five most important variables were triglyceride (95%CI: -1.053~-0.887), aspartate aminotransferase (AST) (95%CI: -20.433~-16.927), high-density lipoprotein (HDL) (95%CI: 0.232~0.268), ferritin (95%CI: -80.533~-68.607), and blood glucose (95%CI: -0.576~-0.424). In the validation dataset, the area under the ROC curve (AUC) was 0.766, with a sensitivity of 0.715 and a specificity of 0.818, suggesting that the GBM model performed better than the XGBoost, logistic regression, random forest, and deep learning models. Conclusion The prediction model based on H2O AutoML shows promise and offers insights for screening NAFLD patients.
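The evaluation metrics named in the abstract (sensitivity, specificity, and AUC) can be illustrated with a minimal sketch. This is not the authors' code and does not use the study data: the labels, scores, the 0.5 threshold, and the function names `confusion_counts`, `sensitivity_specificity`, and `auc` are all assumptions made for illustration. AUC is computed here with the rank-based (Mann-Whitney) formulation, that is, the fraction of positive/negative pairs in which the positive case receives the higher score.

```python
def confusion_counts(y_true, y_score, threshold):
    """Return (tp, fp, tn, fn) when scores >= threshold are called positive."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity_specificity(tp, fp, tn, fn):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, y_score):
    """Rank-based AUC: share of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correctly ranked pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = NAFLD, 0 = no NAFLD; scores are model probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05]

tp, fp, tn, fn = confusion_counts(y_true, y_score, threshold=0.5)
sens, spec = sensitivity_specificity(tp, fp, tn, fn)
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"sensitivity={sens:.3f} specificity={spec:.3f} AUC={auc(y_true, y_score):.3f}")
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why an abstract of this kind typically reports all three.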

Key words: NAFLD, Automated machine learning (AutoML), Prediction model, Receiver operating characteristic curve (ROC), Confusion matrix, Shapley additive explanations (SHAP), Partial dependence plots (PDP)