A study of US expenditures on cancer treatment with data analysis and machine learning
Authors
Yue Wang

Annotation
Cancer is the second leading cause of death worldwide, and the cost of treating it is a major social issue in the United States. According to news reports, American cancer patients spent more than $21 billion on their care in 2019 (US News, 2021). This research analyzes national expenditures on cancer treatment from 2010 to 2020 using Python and available third-party libraries. A machine learning regression model was also trained and tested to predict the cost of cancer treatment over the next few years. Of the four regression algorithms applied (linear regression, lasso regression, random forest regression, and gradient boosting regression), gradient boosting regression fit the data best and was therefore used to produce the predictions, which are intended to inform the public and government officials.
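The comparison of the four regression algorithms could look roughly like the sketch below. It assumes scikit-learn and uses synthetic placeholder data, since the study's actual features, preprocessing, and hyperparameters are not given here. The column meanings (`year`, `site`, `phase`) and the cost formula are hypothetical, and the resulting scores say nothing about the study's own results.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic placeholder data (NOT the study's dataset).
rng = np.random.default_rng(0)
n = 400
year = rng.integers(2010, 2021, size=n)   # years 2010..2020
site = rng.integers(0, 5, size=n)         # hypothetical label-encoded cancer site
phase = rng.integers(0, 3, size=n)        # hypothetical label-encoded phase of care
X = np.column_stack([year, site, phase])

# Hypothetical cost target with a mild nonlinearity plus noise.
y = 50.0 * (year - 2010) + 300.0 * site + 200.0 * phase**2 + rng.normal(0, 50, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The four regression algorithms named in the abstract, with assumed settings.
models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Fit each model and score it on the held-out split with R^2.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

# Print the models from best to worst held-out R^2.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>18}: R^2 = {score:.3f}")
```

Scoring every model on the same held-out split keeps the comparison fair. Which algorithm comes out ahead depends on the data, so the ranking on this synthetic example need not match the one reported above.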
References:
- Watson, IBM. “Expenditures for Cancer Care - Dataset by Xprize Ai-Health.” Data.world, 19 July 2017, https://data.world/xprizeai-health/expenditures-for-cancer-care/workspace/project-summary?agentid=xprizeai-health&datasetid=expenditures-for-cancer-care.
- “Cancer Costs U.S. Patients $21 Billion a Year.” US News, 26 Oct. 2021, https://www.usnews.com/news/health-news/articles/2021-10-26/cancer-costs-us-patients-21-billion-a-year.
- Selby, Karen. “Americans Can't Keep Up with the High Cost of Cancer Treatment.” Mesothelioma Center - Vital Services for Cancer Patients & Families, 20 Aug. 2021, https://www.asbestos.com/featured-stories/high-cost-of-cancer-treatment/.
- “Financial Burden of Cancer Care.” Financial Burden of Cancer Care, 20 July 2021, https://progressreport.cancer.gov/after/economic_burden.
- American Cancer Society Cancer Action Network (ACS CAN). The Costs of Cancer, Oct. 2020, https://www.fightcancer.org/sites/default/files/National%20Documents/Costs-of-Cancer-2020-10222020.pdf. Accessed 16 Dec. 2021.
- Yadav, D. (2019, December 9). Categorical encoding using label-encoding and one-hot-encoder. Medium. Retrieved January 22, 2022, from https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd
- Editor, M. B. (n.d.). Regression analysis: How do I interpret R-squared and assess the goodness-of-fit? Minitab Blog. Retrieved January 22, 2022, from https://blog.minitab.com/en/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit
- Chakure, A. (2020, November 6). Random Forest and its implementation. Medium. Retrieved January 22, 2022, from https://medium.com/swlh/random-forest-and-its-implementation-71824ced454f
- Radečić, D. (2020, January 4). Top 3 methods for handling skewed data. Medium. Retrieved January 22, 2022, from https://towardsdatascience.com/top-3-methods-for-handling-skewed-data-1334e0debf45
- Perlato, A. (2020). Deal multicollinearity with lasso regression. Andrea Perlato. Retrieved January 22, 2022, from https://www.andreaperlato.com/mlpost/deal-multicollinearity-with-lasso-regression/