Accelerated hazard prediction based on age time-scale for women diagnosed with breast cancer using a deep learning method

Ramezani, Zahra; Charati, Jamshid Yazdani; Alizadeh-Navaei, Reza; Eslamijouybari, Mohammad

doi:10.1186/s12911-024-02725-7

Research
Open access
Published: 28 October 2024


BMC Medical Informatics and Decision Making volume 24, Article number: 314 (2024)


Abstract

Breast cancer is the most common cancer in women. Previous studies have investigated estimating and predicting proportional hazard rates and survival in breast cancer. This study deals with predicting the accelerated hazards (AH) rate based on age categories in breast cancer patients using deep learning methods. The AH model has a time-dependent structure whose rate changes according to time and variable effects. We collected data on 1225 female patients with breast cancer at the Mazandaran University of Medical Sciences. The patients' demographic and clinical characteristics, including family history, age, history of tobacco use, hysterectomy, first menstruation age, gravida, number of breastfeeding, disease grade, marital status, and survival status, were recorded. Initially, we predicted the age group of each patient: ≤ 40, 41–60, or ≥ 61 years. Then, we predicted the accelerated hazard value based on age category for each breast cancer patient through deep learning and examined the importance of variables using LightGBM. Improving clinical management and treatment of breast cancer requires advanced methods such as time-dependent AH calculation. When the behavioral effect is assumed to be a time-scale change between hazard functions, the AH model is more appropriate for randomized clinical trials. The results demonstrate that the proposed model performs well in predicting AH by age category based on breast cancer patients' demographic and clinical characteristics.


Introduction

Breast cancer is the most common cancer and a leading cause of cancer death in women [1,2,3]. Therefore, predicting AH and survival is essential for preventing, controlling, and treating cancers, including breast cancer. The purpose of using the accelerated hazard model is to identify the factors that influence hazard at a specific time, as increasing hazard leads to decreased survival.

The survival analysis model arises from the fact that the hazard rate, which represents the instantaneous risk of experiencing the event at a given time, can vary over time. The hazard rate can be affected by the presence or absence of certain risk factors, and these factors can change over time. Therefore, survival models need to account for the dynamic nature of both the covariates and the hazard rate.

One commonly used survival analysis model is the Cox proportional hazards model (CoxPH), which assumes that the hazard rate is proportional across different levels of the covariates. This model allows for the estimation of hazard ratios, which quantify the effect of each covariate on the hazard rate. However, the proportional hazards assumption may not always hold, and alternative models, such as time-varying covariate models, can be employed to handle non-proportional hazards. By considering the dynamic nature of the covariates and the hazard rate, these models provide valuable insights into the factors influencing survival probabilities. The AH model is time-dependent, and at the same time, the variables have a direct effect on the baseline hazard function [4, 5]. Under the AH model, the cumulative hazard function of individual $i$ with failure time ${T}_{i}$ is given by [6]:

$$H\left(t|{Z}_{i}, \beta \right)= {H}_{0}\left(t exp\left({\beta }^{T} {Z}_{i}\right)\right)exp\left(-{\beta }^{T} {Z}_{i}\right)$$

(1)

where ${H}_{0}$ is the baseline cumulative hazard function. In this model, the term ${H}_{0}\left(t \exp\left({\beta }^{T} {Z}_{i}\right)\right)$ specifies how the covariates ${Z}_{i}$ change the time scale of the baseline hazard function and thereby affect it directly. In contrast, in the proportional hazard model (2) [7, 8], the covariates have a constant multiplicative effect on the baseline hazard function, and the assumption of a constant effect may not be applicable or efficient.

$$H\left(t|{Z}_{i}, \beta \right)= {H}_{0}\left(t\right)exp\left(-{\beta }^{T} {Z}_{i}\right)$$

(2)
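To make the contrast between (1) and (2) concrete, the following minimal sketch evaluates both cumulative hazards for a given linear predictor ${\beta }^{T}{Z}_{i}$. The Weibull baseline, its shape and scale values, and all function names are illustrative assumptions and are not taken from the study.

```python
import math


def baseline_cumhaz(t, shape=2.0, scale=60.0):
    """Assumed Weibull baseline cumulative hazard H0(t) = (t / scale) ** shape."""
    return (t / scale) ** shape


def ah_cumhaz(t, lp, shape=2.0, scale=60.0):
    """Eq. (1): H(t | Z) = H0(t * exp(lp)) * exp(-lp), where lp = beta^T Z."""
    return baseline_cumhaz(t * math.exp(lp), shape, scale) * math.exp(-lp)


def ph_cumhaz(t, lp, shape=2.0, scale=60.0):
    """Eq. (2) as written in the text: H(t | Z) = H0(t) * exp(-lp)."""
    return baseline_cumhaz(t, shape, scale) * math.exp(-lp)


def survival(cumhaz):
    """Survival probability S(t) = exp(-H(t))."""
    return math.exp(-cumhaz)


# Compare the two models at a few ages for a hypothetical linear predictor of 0.3.
for age in (40, 60, 80):
    print(age, ah_cumhaz(age, 0.3), ph_cumhaz(age, 0.3))
```

With a linear predictor of zero, both functions reduce to the baseline. With a Weibull shape of 1 (a constant baseline hazard), the AH cumulative hazard equals the baseline for any linear predictor, because a time-scale change has nothing to act on when the hazard does not vary over time.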

Some models in the statistical literature, such as the accelerated failure-time (AFT) model, are not limited to constant proportionality or additivity and may offer greater flexibility for specific data types [9, 10]. For example, Peng and Dear [11], Sy and Taylor [12], and Lu and Ying [13] used a mixture cure model in which the failure-time distribution of patients under treatment is modeled with a proportional hazard model. They proposed semiparametric methods to estimate the model parameters. Wang et al., on the other hand, suggested a flexible mixture cure rate model with non-parametric spline forms for the treatment probability and the hazard rate function [14]. The accelerated failure time model has also been considered for modeling the failure time of untreated patients in the mixture cure model; Li and Taylor [15] and Zhang and Peng [16] called this the accelerated failure time mixture cure (AFTMC) model. The proportional hazards model [17] and the AFT model [18] are popular in survival analysis, where the survival function is $S\left(t\right)=\text{exp}(-H(t))$, because of their simple estimation methods and easy interpretation. However, their assumptions do not always hold.

Researchers consider the AH model a valuable addition to time-dependent hazard models, since the variables act directly on the baseline hazard [19]. One method for estimating the accelerated cumulative hazard is spline-based sieve maximum likelihood estimation [20, 21]. Chen and Wang [5] proposed the AH model, which provides more flexibility in modeling survival data than the traditional AFT model. While the two models may seem similar at first glance, the AH model introduces important modifications that enhance its versatility.

One of the primary advantages of the AH model is its ability to accommodate time-varying covariate effects. Unlike the AFT model, which assumes that the effect of covariates on the survival time remains constant over time, the AH model allows the covariate effects to change as time progresses. This flexibility is particularly valuable when analyzing complex datasets where the relationship between covariates and survival time may evolve. Additionally, the AH model provides a more intuitive interpretation of the covariate effects. In the AFT model, the coefficients act on the logarithm of the survival time, so their effect on the hazard can be difficult to interpret directly. In the AH model, the covariates rescale the time axis of the baseline hazard function, making it easier to see whether a covariate accelerates or decelerates the progression of the hazard.

This study deals with predicting AH by age categories using deep learning. Using age as a time-scale to predict hazard rates in breast cancer is a strategic choice due to various reasons. Firstly, breast cancer is heavily influenced by hormonal changes associated with ageing, making age a biologically relevant time-scale. Additionally, age is a common demographic variable used in clinical practice for risk assessment and treatment decisions. Age is also typically documented in medical records, making it easily accessible for analysis. While other time-scales could be used, age provides a simple and understandable approach. By focusing on age-based predictions, we aim to identify age-specific risk patterns to guide prevention and early detection strategies, especially with the rising incidence of breast cancer in younger women.

Studies have shown that age plays a significant role in breast cancer prognosis. It reflects changing risk patterns over a woman's lifespan and can modify the impact of other prognostic factors like tumor size or lymph node involvement. Age can also serve as a proxy for unmeasured variables like hormonal changes or genetic predisposition that can influence hazard rates. Therefore, incorporating age as a time-scale in prognostic models offers a more comprehensive assessment of patient prognosis. Another study published in Frontiers in Oncology also investigated the association of age at diagnosis with survival in breast cancer [22]. The study analyzed a large cohort of patients and examined how age influenced survival outcomes. By using age as a time-scale, the researchers were able to identify the relationship between age and survival, providing valuable insights into the prognostic implications of age in breast cancer. Several studies have consistently demonstrated that young age at diagnosis is associated with an unfavourable prognosis and a higher risk of cancer recurrence and metastasis. Incorporating age as a time-scale in predictive models allows for a more accurate understanding of the impact of age on breast cancer outcomes [23,24,25,26,27]. Therefore, the use of age as a time-scale for predicting hazard rates is crucial in breast cancer prognosis. Since there is a scaling change relationship between the baseline hazard functions, AH is the best option for prediction. Considering age as the disease diagnosis time and the age classification of each person, the prediction of AH of breast cancer patients is performed using deep learning methods. Several extensions of the model are also considered and real clinical trial data are used to assess the model's applicability. 

The sieve maximum likelihood estimation method based on a polynomial spline is used to estimate the cumulative accelerated hazard model for each patient with interval-censored data in the training dataset. The AH prediction illustrates the progression of hazards over time and is specific to each patient according to her characteristics. Finally, the AH is predicted using deep learning methods based on each patient's age category. This study describes the inferential features of the proposed model and reports the metrics used to evaluate the performance of AH prediction on breast cancer data.

Data and methodology

This study is an open cohort investigation in which data on 1225 female patients diagnosed with breast cancer were gathered at Mazandaran University of Medical Sciences from 2008 to 2018. The patients' characteristics, including family history, age, history of tobacco use, hysterectomy, first menstruation age, gravida, number of breastfeeding, disease grade, marital status, and survival status in 2022, were examined. The treatment type was not required because the AH is predicted based on the patients' age category: ≤ 40, 41–60, or ≥ 61 years old. Patient information was collected from the time of diagnosis until treatment or death. Records with missing survival time were removed. Table 1 presents the demographic and clinical information of the patients by age category. We use the Cox proportional hazard model to determine the effect of the variables. A proportional hazards model is commonly used in survival analysis when the assumption of proportional hazards is reasonable. This assumption implies that the hazard ratio between any two covariate profiles remains constant over time, so the model is appropriate when the effects of covariates on the hazard rate do not change over time.

Table 1 Clinical characteristics of breast cancer patients based on the age groups of ≤ 40, 41–60, and ≥ 61


On the other hand, an AH model is used when the proportional hazards assumption is violated, allowing covariate effects to vary over time for a more flexible hazard rate model. This study aims to determine if deep learning methods can accurately predict accelerated hazard rates for breast cancer patients in three age groups: ≤ 40, 41–60, and ≥ 61 years. This focus enhances our understanding of breast cancer prognosis and provides valuable insights for clinical practice and patient management. By considering age as a factor in disease onset, we applied deep learning to predict these hazard rates across the specified age groups.

The significance level is predetermined at 0.03. If the p-value is less than or equal to the significance level of 0.03, then the null hypothesis is rejected, indicating that there is sufficient evidence to support the alternative hypothesis.

In survival analysis, the AH model is used to analyze time-to-event data, where the hazard rate is assumed to change over time. The cumulative accelerated hazard function is a measure of the cumulative risk of an event occurring at a given time, taking into account the changing hazard rate. To predict the cumulative accelerated hazard function, we would typically need to fit an AH model to our data and then use the model to estimate the cumulative accelerated hazard function at specific time points.

Predicting AH involves estimating the hazard rate or cumulative hazard function for time-to-event data, where the hazard rate is not constant over time but instead changes with covariates.

Proposed method

This study suggests an AH model for situations with a time-scale change between the hazard functions, owing to its time-dependent structure. The AH function describes how the hazard progresses over time, and the model parameter indicates whether the hazard progression is accelerated or decelerated. The event time T, in this paper, refers to the age of the subject at the time of disease onset. It is challenging to see how the age of a subject at disease onset could be interval censored, given that the patient's age was recorded when she was sampled into the cohort under study. This raises questions about how the times of diagnosis were recorded or partially observed.

One issue is that we had access to only limited information about the exact age at which the diagnosis occurred. This could be due to various reasons, such as incomplete medical records or missing data. As a result, we might have been able to determine the age at diagnosis only within a certain range or interval. Alternatively, it is also plausible that the true event of interest in this study is not the age at diagnosis but the age at onset. In such cases, the age at onset might be estimated or inferred from other factors or symptoms. This could introduce uncertainty and make the event of interest interval censored. For AH prediction and age classification, the data are split into 70% for training and 30% for testing.
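The 70/30 split can be sketched as follows. This is a minimal illustration that assumes a simple random split of patient indices; the seed, the function name, and the absence of stratification by age group are assumptions, since the text does not state how the split was drawn.

```python
import random


def split_patient_ids(n_patients, train_percent=70, seed=0):
    """Shuffle patient indices and split them into training and testing sets."""
    ids = list(range(n_patients))
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    n_train = n_patients * train_percent // 100
    return ids[:n_train], ids[n_train:]


train_ids, test_ids = split_patient_ids(1225)
print(len(train_ids), len(test_ids))
```

For the 1225 patients in this cohort, such a split yields 857 training and 368 testing records.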

First, we estimate the AH from the patients' age and clinical characteristics. The sieve maximum likelihood estimation method based on a polynomial spline is applied to estimate the accelerated cumulative hazard model for each patient with interval-censored data in the training dataset [20]. A polynomial spline space is used to estimate the accelerated cumulative hazard function. The true event time (T), which is the age of the person at the time of disease onset, is only known to lie in one of the consecutive intervals ${\text{V}}_{j} <T \le {\text{V}}_{j+1}$. The observed time interval is denoted by (L, R).
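The interval notation can be illustrated with a short sketch that maps an unobserved event age to the interval (L, R) implied by a set of inspection ages ${\text{V}}_{j}$. Using the age-group boundaries 40 and 60 as inspection ages is an assumption made only for illustration; the text does not state which cut points define the censoring intervals.

```python
import bisect
import math


def censoring_interval(event_age, inspection_ages):
    """Return (L, R) such that L < event_age <= R.

    inspection_ages must be sorted in increasing order. L = 0 marks a
    left-censored observation and R = infinity a right-censored one.
    """
    idx = bisect.bisect_left(inspection_ages, event_age)
    left = inspection_ages[idx - 1] if idx > 0 else 0.0
    right = inspection_ages[idx] if idx < len(inspection_ages) else math.inf
    return left, right


for age in (35, 52, 68):
    print(age, censoring_interval(age, [40, 60]))
```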

Uniform B-splines are employed to approximate the uniform nonparametric function. After estimating the accelerated cumulative hazard model using the sieve maximum likelihood estimation method based on a polynomial spline for each patient, the values are added to the training dataset. Then, the study starts training the data using a deep neural network model.

The cumulative hazard function of an individual i is $H(t|{Z}_{i}, \beta ) = {H}_{0}(t exp({\beta }^{T} {Z}_{i})) exp(-{\beta }^{T} {Z}_{i})$. Let $\varphi=log({H}_{0})$ and $F\left(t,z\right)={e}^{\varphi \left(t\right)+{\beta }^{T}Z}$.

Let's examine the partial likelihood function of cumulative incidence functions as stated [28]:

$$L\left(\theta ,D\right)\propto \prod_{i=1}^{n}\left(\left\{\prod_{j=1}^{2}{\left[{F}_{j}\left({R}_{i};{Z}_{i},{\theta }_{j}\right)-{F}_{j}\left({L}_{i};{Z}_{i},{\theta }_{j}\right)\right]}^{{\delta }_{ij}}\right\}\times \left\{\prod_{j=1}^{2}{\left[{F}_{j}\left({R}_{i};{Z}_{i},{\theta }_{j}\right)\right]}^{{\delta }_{ij}^{\prime}}\right\}{\left[1-\sum_{j=1}^{2}{F}_{j}\left({L}_{i};{Z}_{i},{\theta }_{j}\right)\right]}^{1-{\delta }_{i}}\right)$$

(3)

The log-likelihood function can be expressed as shown [20]:

$$\begin{array}{c}l\left(\beta , {H}_{0}|O\right)=\sum_{i=1}^{n}\Big[{\delta }_{1i}\log\left\{1-\exp\left[-{H}_{0}\left({R}_{i}\exp\left({\beta }^{T}{X}_{i}\right)\right)\exp\left(-{\beta }^{T}{X}_{i}\right)\right]\right\}\\ +{\delta }_{2i}\log\left\{\exp\left[-{H}_{0}\left({L}_{i}\exp\left({\beta }^{T}{X}_{i}\right)\right)\exp\left(-{\beta }^{T}{X}_{i}\right)\right]-\exp\left[-{H}_{0}\left({R}_{i}\exp\left({\beta }^{T}{X}_{i}\right)\right)\exp\left(-{\beta }^{T}{X}_{i}\right)\right]\right\}\\ -{\delta }_{3i}{H}_{0}\left({L}_{i}\exp\left({\beta }^{T}{X}_{i}\right)\right)\exp\left(-{\beta }^{T}{X}_{i}\right)\Big]\end{array}$$

(4)
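As a rough illustration of (4), the sketch below evaluates the interval-censored log-likelihood for a given coefficient vector. It replaces the spline estimate of ${H}_{0}$ with a fixed Weibull cumulative hazard, which is an assumption made only to keep the example self-contained; the function names and the toy records are likewise hypothetical. An observation contributes the left-censored term when $L=0$, the interval-censored term when $L>0$ and $R<\infty$, and the right-censored term otherwise.

```python
import math


def weibull_cumhaz(t, shape=1.5, scale=60.0):
    """Assumed parametric stand-in for the baseline cumulative hazard H0."""
    return (t / scale) ** shape


def ah_cumhaz(t, lp, H0):
    """AH cumulative hazard H0(t * exp(lp)) * exp(-lp), with lp = beta^T X."""
    return H0(t * math.exp(lp)) * math.exp(-lp)


def log_likelihood(beta, data, H0=weibull_cumhaz):
    """Interval-censored AH log-likelihood; data holds (L, R, covariates) tuples."""
    total = 0.0
    for left, right, x in data:
        lp = sum(b * xi for b, xi in zip(beta, x))
        if left == 0:  # left-censored: event occurred before R
            total += math.log(1.0 - math.exp(-ah_cumhaz(right, lp, H0)))
        elif math.isfinite(right):  # interval-censored: event in (L, R]
            s_left = math.exp(-ah_cumhaz(left, lp, H0))
            s_right = math.exp(-ah_cumhaz(right, lp, H0))
            total += math.log(s_left - s_right)
        else:  # right-censored: event occurred after L
            total -= ah_cumhaz(left, lp, H0)
    return total


# Hypothetical records: (L, R, [covariate values]).
toy_data = [(0.0, 40.0, [1.0]), (40.0, 60.0, [0.0]), (60.0, math.inf, [1.0])]
print(log_likelihood([0.2], toy_data))
```

In the estimation procedure described in the text, ${H}_{0}$ is not fixed but represented by a spline, so the same function would be maximized jointly over the regression coefficients and the spline coefficients.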

The indicator function, denoted by $I(\cdot)$, is used to define ${\delta }_{1i}=I({L}_{i}=0)$, ${\delta }_{2i}=I\left({L}_{i}>0 ,{R}_{i}< \infty \right)$, and ${\delta }_{3i}=1-{\delta }_{1i}-{\delta }_{2i}$, which mark left-censored, interval-censored, and right-censored observations, respectively. To estimate the parameters $({\widehat{\beta }}_{n}, {\widehat{H}}_{n})$ using a semi-parametric sieve maximum likelihood method, splines are employed. The maximization of $l(\beta ,{H}_{0}|O)$ is subject to the constraint that $\upbeta \in \text{B}$ and that ${H}_{0}$ lies in some spline space. The B-spline approximation method is widely used to obtain a smooth estimate of an unknown function because of its maximum approximation order and minimal support. Specifically, B-splines use a combination of polynomial basis functions to generate piecewise polynomials on the interval $[a,b]$. Each of these polynomial basis functions is supported on only a few subintervals of $[a,b]$. A general B-spline space ${\gamma }_{n}^{m}$ of order $m$ can be defined as a set of linear combinations of basis splines:

$${\gamma }_{n}^{m}=\left\{{f}_{n}: {f}_{n}\left(t\right)=\sum_{j=1}^{{q}_{n}}{c}_{j}{s}_{j}^{\beta }\left(t\right)\right\},$$

(5)

where ${s}_{j}^{\beta }\left(t\right)$ is the $j$-th polynomial basis spline of order $m$, ${c}_{j}$ are the spline coefficients, and ${q}_{n}={k}_{n}+m$ is the number of spline basis functions used, with ${k}_{n}$ the number of interior knots. In survival models with interval-censored data, the B-spline technique is used to estimate the unknown cumulative hazard function. This method is applicable to PH-based models [29] and semi-parametric cure models [30, 31].
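The spline space in (5) can be sketched with the standard Cox-de Boor recursion. The following is a minimal illustration, not the authors' implementation: the clamped knot vector, the interior knots at 0.25, 0.5, and 0.75, the cubic order $m=4$, and the coefficient values are all assumptions chosen for the example.

```python
def clamped_knots(a, b, interior, order):
    """Knot vector with each boundary repeated `order` times (an assumed choice)."""
    return [a] * order + list(interior) + [b] * order


def bspline_basis(t, knots, order):
    """Evaluate all B-spline basis functions of the given order at t (Cox-de Boor)."""
    n_intervals = len(knots) - 1
    # Order-1 bases are indicators of the half-open knot intervals.
    b = [1.0 if knots[j] <= t < knots[j + 1] else 0.0 for j in range(n_intervals)]
    if t == knots[-1]:
        # Include the right endpoint in the last non-empty interval.
        for j in range(n_intervals - 1, -1, -1):
            if knots[j] < knots[j + 1]:
                b[j] = 1.0
                break
    for k in range(2, order + 1):
        new = []
        for j in range(len(knots) - k):
            left_den = knots[j + k - 1] - knots[j]
            right_den = knots[j + k] - knots[j + 1]
            # Terms with a zero-width support are dropped (0/0 treated as 0).
            left = (t - knots[j]) / left_den * b[j] if left_den > 0 else 0.0
            right = (knots[j + k] - t) / right_den * b[j + 1] if right_den > 0 else 0.0
            new.append(left + right)
        b = new
    return b


def spline_value(t, coefs, knots, order):
    """f_n(t) = sum_j c_j * s_j(t), as in (5)."""
    return sum(c * s for c, s in zip(coefs, bspline_basis(t, knots, order)))


order = 4                      # cubic B-splines
interior = [0.25, 0.5, 0.75]   # k_n = 3 interior knots
knots = clamped_knots(0.0, 1.0, interior, order)
coefs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # q_n = k_n + m = 7 coefficients
print(spline_value(0.3, coefs, knots, order))
```

Two properties are worth noting. The basis functions sum to one at every point of $[a,b]$, and a nondecreasing coefficient sequence yields a nondecreasing spline, which is the shape a cumulative hazard function must have.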

In these methods, the fixed interval $[\text{min}\left\{{L}_{i},{R}_{i}\right\},\text{max}\{{L}_{i},{R}_{i}I({R}_{i}<\infty )\}]$ is used to define the spline space. In the AH model, however, the baseline hazard is evaluated at rescaled values of ${L}_{i}$ and ${R}_{i}$ that depend on β, so a flexible approach is needed to account for this variation.

To estimate the cumulative hazard for a fixed β ∈ B, we define a polynomial spline space using a set of bounded intervals. The finite endpoints ${L}_{i}\text{exp}\left({\beta }^{T}{X}_{i}\right)$ and ${R}_{i}\text{exp}\left({\beta }^{T}{X}_{i}\right)I({R}_{i}<\infty )$ are ordered as ${\omega }_{1}^{\beta }<{\omega }_{2}^{\beta }<\cdots <{\omega }_{N}^{\beta }$. To maximize (6), we use a two-step algorithm that allows the spline basis to be updated at each iteration [20]. The algorithm starts with initial values of ${\widehat{\beta }}^{\left(0\right)}$ and ${\widehat{c}}^{\left(0\right)}$, chosen as 0 and 1, respectively. Substituting the spline representation (5) for ${H}_{0}$ in (4) gives the log-likelihood function:

$$\begin{array}{c}l\left(\beta , c|{\beta }^{\prime}, O\right)=\sum_{i=1}^{n}\Big[{\delta }_{1i}\log\left\{1-\exp\left[-\sum_{j=1}^{{q}_{n}}{c}_{j}{s}_{j}^{{\beta }^{\prime}}\left({R}_{i}\exp\left({\beta }^{T}{X}_{i}\right)\right)\exp\left(-{\beta }^{T}{X}_{i}\right)\right]\right\}\\ +{\delta }_{2i}\log\left\{\exp\left[-\sum_{j=1}^{{q}_{n}}{c}_{j}{s}_{j}^{{\beta }^{\prime}}\left({L}_{i}\exp\left({\beta }^{T}{X}_{i}\right)\right)\exp\left(-{\beta }^{T}{X}_{i}\right)\right]-\exp\left[-\sum_{j=1}^{{q}_{n}}{c}_{j}{s}_{j}^{{\beta }^{\prime}}\left({R}_{i}\exp\left({\beta }^{T}{X}_{i}\right)\right)\exp\left(-{\beta }^{T}{X}_{i}\right)\right]\right\}\\ -{\delta }_{3i}\sum_{j=1}^{{q}_{n}}{c}_{j}{s}_{j}^{{\beta }^{\prime}}\left({L}_{i}\exp\left({\beta }^{T}{X}_{i}\right)\right)\exp\left(-{\beta }^{T}{X}_{i}\right)\Big]\end{array}$$

(6)

where β is the regression coefficient in log-likelihood (6) and ${\beta }^{\prime}$ is the value of the coefficient used to construct the spline basis. The algorithm then alternates between two steps:

1) $\widehat{\upbeta }$ and $\widehat{c}$ are calculated by maximizing (6) with respect to β and $\text{c}$, with the spline space held fixed at ${\beta }{\prime}$.
2) The spline space is updated based on the re-estimated points ${L}_{i}\text{exp}\left({\widehat{\beta }}^{T}{X}_{i}\right)$ and ${R}_{i}\text{exp}\left({\widehat{\beta }}^{T}{X}_{i}\right)$.

In each iteration of the algorithm, we update the spline coefficients ${\widehat{c}}^{\left(m\right)}$ with fixed observation points ${L}_{i}\text{exp}\left({\widehat{\beta }}^{\left(m-1\right)T}{X}_{i}\right)$ and ${R}_{i}\text{exp}\left({\widehat{\beta }}^{\left(m-1\right)T}{X}_{i}\right)$. Based on ${\widehat{c}}^{\left(m\right)}$, we estimate ${\widehat{\beta }}^{\left(m\right)}$ by maximizing the log-likelihood function $l\left(\beta , {\widehat{c}}^{\left(m\right)}|{\widehat{\beta }}^{\left(m-1\right)}, \text{\rm O}\right)$ with respect to $\beta$. We repeat the process until convergence, which occurs when the absolute difference between the log-likelihoods of two consecutive iterations is less than a certain threshold ε (usually chosen as ${10}^{-6}$).

Maximizing the log-likelihood with respect to the model parameters is treated as an unconstrained optimization problem, for which the optim/optimize functions in R can be used. At convergence, the values of ${\widehat{\beta }}^{\left(m\right)}$ and ${\widehat{\beta }}^{\left(m-1\right)}$ are very close to each other, so the difference between ${s}_{j}^{{\widehat{\beta }}^{\left(m\right)}}$ and ${s}_{j}^{{\widehat{\beta }}^{\left(m-1\right)}}$ is small, and the difference between ${\sum }_{j=1}^{{q}_{n}}{\widehat{c}}_{j}^{(m)}{s}_{j}^{{\widehat{\beta }}^{\left(m\right)}}$ and ${\sum }_{j=1}^{{q}_{n}}{\widehat{c}}_{j}^{(m-1)}{s}_{j}^{{\widehat{\beta }}^{\left(m-1\right)}}$ is very small. At this point, the spline estimator ${\widehat{H}}_{n}$ converges.

After learning the AH of patients with different characteristics by a deep learning algorithm, the AH value of each breast cancer patient in the test dataset is predicted and the importance of variables is examined using the Light Gradient Boosting Machine (LightGBM) model.

The LightGBM is a widely used gradient boosting framework recognized for its efficiency and accuracy. It effectively handles large, feature-rich datasets, making it suitable for a range of machine learning tasks, including survival analysis.

LightGBM has demonstrated its versatility and effectiveness across various domains, particularly in predictive modeling. In solar radiation prediction, a study showed that LightGBM matched the predictive accuracy of support-vector regression (SVR) and outperformed other benchmarks [32]. In road traffic injury severity prediction, it achieved the highest classification accuracy among four boosting-based ensemble models [33]. For depression prediction, the LightGBM model reached an average accuracy of 82.74%, effectively identifying those with depression, underscoring its robustness [34]. In breast cancer diagnosis, it achieved a maximum accuracy of 99%, highlighting its potential in healthcare [35]. Additionally, in predicting gamma pass rates for intensity-modulated radiotherapy, LightGBM excelled in building a classification model, surpassing other machine learning algorithms [36]. Collectively, these studies illustrate LightGBM's effectiveness and applicability across diverse fields, affirming its reliability as a versatile machine learning model.

LightGBM is commonly used for machine learning tasks, particularly in the field of tabular data analysis. It is designed to be efficient and provides high-performance implementations of gradient boosting algorithms. In terms of inputs, LightGBM typically requires the following:

Training data: This includes a set of labeled examples, where each example consists of a set of features (input variables) and the corresponding label or target value (output variable).
Feature matrix: A matrix or dataframe containing the input features for both training and testing data. Each row represents an example, and each column represents a feature.
Optional parameters: LightGBM provides various hyperparameters that can be tuned to control the model's behavior, such as the learning rate, number of trees, maximum depth, and more.

As for the outputs, LightGBM generates predictions based on the trained model. These predictions can take the form of class probabilities, and the trained model also reports the importance of each feature. Data are analyzed using SPSS version 26, R, and Python version 3.7.

Findings

The participants (1225 female patients) were aged 24 to 93 years; 19.3% were ≤ 40, 58% were 41–60, and 22.6% were ≥ 61 years old, and they were classified into groups 1, 2, and 3, respectively. Among the breast cancer patients, 33.8% had a family history, 4.49% were single, 95.51% were married, and 2.04% had a history of tobacco use. Table 1 shows the clinical characteristics of the participants by age category, and Fig. 1 illustrates the correlation between variables in breast cancer patients.

To compare the effects of the variables on predicting the AH of breast cancer patients and to fit the best model to the data, the variables that affect time are first identified using the univariate model. Table 2 presents the estimated coefficients of the variables affecting age, together with the hazard ratios and 95% confidence intervals, based on the CoxPH model. The model's main advantage is that it accounts for many situations and can determine the weight of features. The results of fitting the Cox proportional hazards regression model to the data are presented in Table 2, where coef is the value of the coefficient, HR is the hazard ratio, and se (coef) is the standard error of the coefficient. In the current data, the highest hazard ratio is that of the hysterectomy variable, with HR = 1.116 and a coefficient of 0.110. The results for gravida suggest that the probability of breast cancer decreases as gravida increases. Figure 2 shows the error rate of breast cancer occurrence in terms of AH with maximum and minimum error limits.
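The relationship between the columns of Table 2 can be checked directly, since the hazard ratio is the exponential of the coefficient. The sketch below reproduces the reported hysterectomy value; the standard error of 0.05 used for the confidence interval is a hypothetical placeholder, because the text does not quote it.

```python
import math


def hazard_ratio(coef):
    """HR = exp(coef) for a Cox regression coefficient."""
    return math.exp(coef)


def hazard_ratio_ci(coef, se, z=1.96):
    """Approximate 95% confidence interval: exp(coef -/+ z * se)."""
    return math.exp(coef - z * se), math.exp(coef + z * se)


hr = hazard_ratio(0.110)               # coefficient reported for hysterectomy
lower, upper = hazard_ratio_ci(0.110, se=0.05)  # se is a hypothetical value
print(round(hr, 3), lower, upper)
```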

Table 2 The relationship between explanatory and demographic variables and risk factors based on the age of breast cancer onset using the Cox regression model


Figure 3 shows a significant relationship between the number of breastfeeding and developing breast cancer: the incidence rate of breast cancer decreases as the number of breastfeeding increases at older ages. In addition, this study investigates the effect of family history in the 3 age groups of these patients; 35% of the ≤ 40 group, 33% of the 41–60 group, and 34% of the ≥ 61 group had a family history. The slightly higher percentage of family history in the youngest age group suggests an increased cancer risk in high-risk families, requiring more care, annual control, and lifestyle changes.

LightGBM method

Before predicting the accelerated hazard, we examine the importance of factors influencing breast cancer using the LightGBM method and input the important factors into the model. LightGBM is a fast and high-performance gradient-boosting framework based on the decision tree algorithm for ranking, classification, and other machine-learning tasks [37, 38]. It is an ensemble method that combines multiple decision tree predictions by adding them together for the final prediction. In this section, LightGBM is used to rank the characteristics of breast cancer patients. This model is evaluated through 70% training and 30% testing. The average evaluation accuracy of the patients' characteristics is 0.763. Marital status, according to Fig. 4, is known as the least important factor in different age groups and therefore we remove it from the prediction model. Figure 4 illustrates the LightGBM performance and the characteristics' importance analysis result.
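The idea of adding tree predictions together and ranking features by how often they are used for splits can be sketched with a toy gradient-boosting loop. This is not LightGBM itself, which uses histogram-based, leaf-wise tree growth; it is a simplified stand-in with depth-1 trees (stumps) and squared-error loss, and the two-feature dataset is hypothetical. Counting splits per feature mirrors the "split" importance type that LightGBM reports by default.

```python
def fit_stump(X, residuals):
    """Return (sse, feature, threshold, left_value, right_value) of the best split."""
    best = None
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[f] <= thr]
            right = [r for row, r in zip(X, residuals) if row[f] > thr]
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, f, thr, lv, rv)
    return best


def boost(X, y, n_trees=50, learning_rate=0.1):
    """Additive ensemble: each stump is fitted to the current residuals."""
    pred = [sum(y) / len(y)] * len(y)
    split_counts = [0] * len(X[0])
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        if stump is None:  # no feature has two distinct values
            break
        _, f, thr, lv, rv = stump
        split_counts[f] += 1
        pred = [p + learning_rate * (lv if row[f] <= thr else rv)
                for p, row in zip(pred, X)]
    return pred, split_counts


# Hypothetical data: the target depends on feature 0 only; feature 1 is noise.
X = [[i, (i * 7) % 5] for i in range(20)]
y = [0.0 if i < 10 else 1.0 for i in range(20)]
pred, split_counts = boost(X, y)
print(split_counts)
```

A feature that is never chosen for a split, like the second feature here, ends up at the bottom of the ranking, which is the kind of evidence used in the text to drop marital status from the prediction model.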

Predicting the AH of each person with breast cancer using deep learning

A deep neural network model is implemented to predict the AH of breast cancer patients in each age group. For a patient with breast cancer, the inputs are family history, age, history of tobacco use, hysterectomy, first menstruation age, gravida, number of breastfeeding, disease grade, and marital status, and the output is the predicted AH for her age category. The data are trained using a deep neural network to predict the AH in each age group based on the characteristics of each patient. The mean absolute error loss function is used to optimize the algorithm. Figure 5 shows the trend of loss and accuracy for classifying the 3 age groups of ≤ 40, 41–60, and ≥ 61 years with a deep neural network model in the training and validation datasets.
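For reference, the mean absolute error loss averages the absolute differences between the predicted and target AH values. The sketch below is a plain-Python illustration with hypothetical numbers; in practice the loss would be supplied by the deep learning framework, which the text does not name.

```python
def mean_absolute_error(y_true, y_pred):
    """MAE = (1 / n) * sum(|y_true_i - y_pred_i|)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical target and predicted AH values for two patients.
print(mean_absolute_error([0.02, 0.04], [0.03, 0.02]))
```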

The evaluation metrics in this study are F1-score, recall, precision, and accuracy. The batch-size values of 32 and 64 have been considered and evaluated with the number of epochs of 500 and 800. The confusion matrix in Table 3 demonstrates that age groups have been classified with good accuracy.
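The four metrics can all be derived from a confusion matrix. The following sketch computes them per class for a three-group classification; the matrix values are hypothetical and are not those of Table 3, and the convention of rows as true groups and columns as predicted groups is an assumption.

```python
def classification_metrics(cm):
    """Return overall accuracy and per-class precision, recall, and F1.

    cm[r][c] is the number of samples of true class r predicted as class c.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(n)) / total
    per_class = []
    for k in range(n):
        tp = cm[k][k]
        predicted = sum(cm[r][k] for r in range(n))  # column total
        actual = sum(cm[k])                          # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class.append({"precision": precision, "recall": recall, "f1": f1})
    return accuracy, per_class


# Hypothetical confusion matrix for the groups <= 40, 41-60, and >= 61.
cm = [[5, 1, 0],
      [1, 8, 1],
      [0, 0, 4]]
accuracy, per_class = classification_metrics(cm)
print(accuracy, per_class)
```

Averaging the per-class values gives macro-averaged precision, recall, and F1; whether the averages in Table 4 are macro or weighted averages is not stated in the text.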

Table 3 Confusion Matrix for the classification of age groups


Table 4 demonstrates that the best classification was obtained with a batch size of 64 and 500 epochs, with a mean test accuracy of 0.964. Therefore, the age groups were classified from the patients' characteristics with high accuracy and precision; an accuracy of 0.981 for the tenfold network and a loss of 0.053 in 500 epochs were obtained.

Table 4 Average of recall, precision, F1-score, accuracy, and ranking of test accuracy in several different deep learning networks for age group classification


During deep network training with multiple hidden layers, the loss reached 0.0178 after 100 epochs. This is how the AH of the disease is determined using a deep learning model. The values of the variables used for prediction are randomly selected. For example, for a patient aged 68 years (third age group) with random values for the risk factors, the predicted AH based on the age group is 0.026.

The AH of the patients in the test dataset is predicted, based on their age, with an accuracy of 0.955 and a loss of 0.095. Predicting the AH of patients based on their characteristics and age group is an essential step towards analyzing time-dependent AH survival and managing the timing and severity of the disease. Extending this research will help to investigate more factors for predicting the AH of breast cancer and other cancers. In addition, this study could be applied and commercialized as an application that easily determines the hazards for each person upon diagnosis of the disease.

Discussion

This research is an epidemiological study to predict AH by age classification of breast cancer patients using deep learning. In previous studies, time-dependent AH prediction based on diagnosis age has not been investigated. Therefore, this section presents the results of previous studies on predicting breast cancer risk using machine learning methods [39, 40].

Recent studies have explored machine learning (ML) approaches for breast cancer risk prediction, showing promising results compared to traditional models. Alfian et al. [41] combined support vector machine and extra-trees classifiers, achieving 80.23% accuracy in early-stage diagnosis. Ming et al. [42] reported significantly higher predictive accuracy using ML methods (88–90%) compared to established models like BCRAT (62.40%) and BOADICEA (59.31%). A systematic review and meta-analysis by Gao et al. [43] (2022) found that ML-based models had a pooled AUC of 0.73, outperforming traditional risk factor-based models. Neural networks were the most common ML method, with a higher pooled AUC than non-neural network approaches. Incorporating imaging features further improved model performance. While these ML models show potential for enhancing breast cancer risk prediction, some studies, like Degnim et al. [44], have developed specialized models for high-risk subgroups, such as women with atypical hyperplasia.

Ghani et al. [45] applied machine learning models to a dataset of 116 participants to predict breast cancer, using tenfold cross-validation to evaluate the models. They used recursive feature elimination (RFE) to select features and compared different classification models, such as DT, KNN, NB, and ANN. The results demonstrated that ANN had the highest accuracy (up to 80%).

Stark et al. [46], in 2019, predicted breast cancer risk using machine learning models. At the 0.05 significance level, logistic regression, linear discriminant analysis, and neural network models with a more extensive set of inputs were significantly stronger than the Breast Cancer Risk Assessment Tool (BCRAT). The effectiveness of these models suggests that they can be used in risk classification tools. Such tools can improve the early detection of breast cancer by performing rapid screening and help reduce the incidence of breast cancer through preventive measures. In addition, Ming et al. [42] obtained 88.89% predictive accuracy for breast cancer using ML-adaptive boosting, 62.40% using ML random forest, and 62.40% for the US population using BCRAT. The prediction accuracy for the Swiss clinical sample was 90.17% using ML-adaptive boosting, 89.32% using the ML-Markov chain Monte Carlo generalized linear mixed model, and 59.31% using the Breast and Ovarian Analysis of Disease Incidence and Carrier Estimation Algorithm (BOADICEA). In 2005, Shen et al. [47] investigated the role of the detection method in predicting breast cancer survival, where breast cancer was detected through mammogram screening. The Health Insurance Plan (HIP) trial randomly divided approximately 62,000 women into screening and control groups. The two Canadian National Breast Screening Study (CNBSS) cohorts encompassed 44,970 women in the screening and 44,961 in the control groups. After determining the grade and other characteristics of the tumor, they compared the survival distributions by detection method using univariate and multivariate analysis in a Cox proportional hazard model. After adjusting for tumor size, lymph node status, and disease grade in the Cox proportional hazard model, the detection method remained an independent and statistically significant predictor of disease-specific survival.

The AH model is more compatible with the logic of randomization because it allows the groups in a randomized experiment to start with the same hazard. Therefore, AH prediction allows us to use a single parameter to reflect a complex phenomenon, so that the models have a time-dependent structure and the variables directly affect the baseline hazard function. AH prediction presents the progression of hazard and risk over time. Moreover, it may represent a general biological mechanism in which treatment type can accelerate or decelerate the failure time. In the present study, among the 1225 breast cancer patients, 66.2% had no family history, 95.51% were married, and 2.04% had a history of tobacco use. The evaluation metrics were F1-score, recall, precision, and accuracy. Batch sizes of 32 and 64 were considered and checked with 500 and 800 epochs. The findings revealed that the best classification of the 3 age groups (i.e., ≤ 40, 41–60, and ≥ 61) was obtained with a batch size of 64 and 500 epochs, with a mean test accuracy of 0.964, recall of 0.962, precision of 0.953, and F1-score of 0.973. The AH prediction was estimated based on the patient's age in the test dataset with an accuracy of 0.955 and a loss of 0.095. Considering age 68 in the third age group and randomized values for the risk factors, an AH prediction value of 0.026 was obtained. By developing the suggested method and investigating more factors, researchers can examine AH prediction for breast cancer and other diseases more accurately. Moreover, this study's results can be commercialized, and a smart application can easily determine the degree of risk and AH based on the patient's disease and characteristics.
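The randomization argument at the start of the preceding paragraph can be written out using the standard form of the accelerated hazards model introduced by Chen and Wang (2000), which appears in the reference list. The notation below (baseline hazard \lambda_0, covariate vector Z, coefficient vector \beta) is an assumed, conventional one and may differ from the symbols used elsewhere in this article; the proportional hazards form is shown only for contrast.

```latex
% Accelerated hazards (AH) model: covariates rescale time inside the baseline hazard
\lambda(t \mid Z) = \lambda_0\!\left(t\, e^{\beta^{\top} Z}\right)

% Proportional hazards (PH) model, for contrast: covariates scale the hazard itself
\lambda(t \mid Z) = \lambda_0(t)\, e^{\beta^{\top} Z}

% At t = 0 the AH hazard does not depend on Z
\lambda(0 \mid Z) = \lambda_0(0) \quad \text{for every } Z
```

Under the AH form, all groups share the hazard \lambda_0(0) at time zero, and the factor e^{\beta^{\top} Z} only changes how quickly the baseline hazard is traversed afterwards, which is the sense in which the hazard is accelerated or decelerated.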

Evaluation metrics used to assess the model's performance included F1 Score, Recall, Precision, and Accuracy. The model was evaluated with batch sizes of 32 and 64 and training conducted over 500 and 800 epochs. The confusion matrix showed that the model effectively classified age groups with high accuracy, indicating robust performance in predicting AH in breast cancer patients.

Strengths of the study include the novelty of using deep learning for AH prediction, the practicality of age-based classification, and the comparison with traditional survival models like Cox proportional hazards.

Like any other study, this study has some limitations. The registered patients form an open cohort, with age at disease onset taken as the time scale. Open cohort studies are often used in epidemiological research to examine the occurrence of diseases or health outcomes in a specific population over time. However, some features may not have been registered. For example, we examined accelerated hazard prediction without considering BMI, because the registered patient data did not include it and our emphasis was on other significant factors. Due to the lack of information on the history of tobacco use, it was excluded from the study. Furthermore, the prediction was checked using only the available data, because more breast cancer data with the desired characteristics could not be accessed; the prediction model can be developed further as more data become available. To address these limitations, future research should focus on data enrichment, closed cohort study design, imputation techniques, sensitivity analysis, and collaboration with other institutions.

Future research should focus on incorporating additional predictors such as BMI and genetic markers, conducting longitudinal studies to follow patients over time, and validating the model on independent datasets to assess its generalizability.

In conclusion, when examining breast cancer studies, it is crucial to assess the reported results in relation to other studies. These results can still be valuable in guiding future research and expanding the scope of breast cancer prediction models to include additional variables. In addition, this research can be applied and commercialized to develop an application that can easily determine the survival and hazards for each person upon diagnosis of the disease. Therefore, based on age and characteristics, artificial intelligence can provide a clear path to identifying each person's hazard rate for developing cancer.

Conclusion

This study discussed AH prediction by age categories based on the characteristics of breast cancer patients. First, the breast cancer data were assessed from a descriptive point of view. Then, the 3 age groups (≤ 40, 41–60, and ≥ 61) of breast cancer patients were classified with high accuracy (0.964). Improving clinical management and breast cancer treatment requires advanced methods such as time-dependent AH calculation. Unlike other research, this study suggested predicting the AH rate using deep learning methods based on the patient's age because of its time-dependent structure. This study obtained the AH prediction of each breast cancer patient in the test dataset with an accuracy of 0.955 and a loss of 0.095. The study results illustrated the appropriate performance of the proposed model for predicting AH according to the characteristics of the breast cancer patient.

Data availability

The data cannot be made openly available because of ethical and legal considerations. Non-identifying data are however available from the corresponding author upon reasonable request and with permission of Mazandaran University of Medical Sciences.

References

Spanhol FA, Oliveira LS, Petitjean C, Heutte L. A dataset for breast cancer histopathological image classification. IEEE Trans Biomed Eng. 2015;63:1455–62.
Peng C, Wu K, Chen X, Lang H, Li C, He L, et al. Migraine and risk of breast cancer: a systematic review and meta-analysis. Clin Breast Cancer. 2022;23:e122–30.
Almawi WY, Zidi S, Sghaier I, El-Ghali RM, Daldoul A, Midlenko A. Novel association of IGF2BP2 gene variants with altered risk of breast cancer and as potential molecular biomarker of triple negative breast cancer. Clin Breast Cancer. 2023;23:272–80.
Muse AH, Chesneau C, Ngesa O, Mwalili S. Flexible parametric accelerated hazard model: simulation and application to censored lifetime data with crossing survival curves. Math Comput Appl. 2022;27:104.
Chen YQ, Wang M-C. Analysis of accelerated hazards models. J Am Stat Assoc. 2000;95:608–18.
Zhang J, Peng Y, Zhao O. A new semiparametric estimation method for accelerated hazard model. Biometrics. 2011;67:1352–60.
Fine JP, Gray RJ. A proportional hazards model for the subdistribution of a competing risk. J Am Stat Assoc. 1999;94:496–509.
Finkelstein DM. A proportional hazards model for interval-censored failure time data. Biometrics. 1986;42:845–54.
Orbe J, Ferreira E, Núñez-Antón V. Comparing proportional hazards and accelerated failure time models for survival analysis. Stat Med. 2002;21:3493–510.
Parsa M, Van Keilegom I. Accelerated failure time vs Cox proportional hazards mixture cure models: David vs Goliath? Stat Pap. 2023;64:835–55.
Peng Y, Dear KBG. A nonparametric mixture model for cure rate estimation. Biometrics. 2000;56:237–43.
Sy JP, Taylor JMG. Estimation in a Cox proportional hazards cure model. Biometrics. 2000;56:227–36.
Lu W, Ying Z. On semiparametric transformation cure models. Biometrika. 2004;91:331–43.
Wang L, Du P, Liang H. Two-component mixture cure rate model with spline estimated nonparametric components. Biometrics. 2012;68:726–35.
Li C, Taylor JMG. A semi-parametric accelerated failure time cure model. Stat Med. 2002;21:3235–47.
Zhang J, Peng Y. A new estimation method for the semiparametric accelerated failure time mixture cure model. Stat Med. 2007;26:3157–71.
Cox DR. Regression models and life-tables. J R Stat Soc Ser B. 1972;34:187–202.
Cox DR, Oakes D. Analysis of survival data. Chapman and Hall/CRC; 2018.
Chen YQ. Accelerated hazards regression model and its adequacy for censored survival data. Biometrics. 2001;57:853–60.
Szabo Z, Liu X, Xiang L. Semiparametric sieve maximum likelihood estimation for accelerated hazards model with interval-censored data. J Stat Plan Inference. 2020;205:175–92.
Zhou J, Zhang J, Lu W. An expectation maximization algorithm for fitting the generalized odds-rate model to interval censored data. Stat Med. 2017;36:1157–71.
Xie Y, Deng Y, Wei S, Huang Z, Li L, Huang K, et al. Age has a U-shaped relationship with breast cancer outcomes in women: a cohort study. Front Oncol. 2023;13:1265304.
Chia S, Bryce C, Gelmon K. Effects of chemotherapy and hormonal therapy for early breast cancer on recurrence and 15-year survival: an overview of the randomised trials. Commentary. Lancet. 2005;365:1665–6.
Cardoso F, Kyriakides S, Ohno S, Penault-Llorca F, Poortmans P, Rubio IT, et al. Early breast cancer: ESMO Clinical Practice Guidelines for diagnosis, treatment and follow-up. Ann Oncol. 2019;30:1194–220.
Kang HW, Seo SP, Kim WT, Yun SJ, Lee S-C, Kim W-J, et al. Impact of young age at diagnosis on survival in patients with surgically treated renal cell carcinoma: a multicenter study. J Korean Med Sci. 2016;31:1976–82.
Sung H, Siegel RL, Rosenberg PS, Jemal A. Emerging cancer trends among young adults in the USA: analysis of a population-based cancer registry. Lancet Public Health. 2019;4:e137–47.
Brandt J, Garne JP, Tengrup I, Manjer J. Age at diagnosis in relation to survival following breast cancer: a cohort study. World J Surg Oncol. 2015;13:1–11.
Bakoyannis G, Yu M, Yiannoutsos CT. Semiparametric regression on cumulative incidence function with interval-censored competing risks data. Stat Med. 2017;36:3683–707.
Zhang Y, Hua L, Huang J. A spline-based semiparametric maximum likelihood estimation method for the Cox model with interval-censored data. Scand J Stat. 2010;37:338–54.
Hu T, Xiang L. Partially linear transformation cure models for interval-censored data. Comput Stat Data Anal. 2016;93:257–69.
Hu T, Xiang L. Efficient estimation for semiparametric cure models with interval-censored data. J Multivar Anal. 2013;121:139–51.
Chaibi M, Benghoulam EM, Tarik L, Berrada M, Hmaidi AE. An interpretable machine learning model for daily global solar radiation prediction. Energies. 2021;14:7367.
Sheng Y, Dong D, He G, Zhang J. How noise can influence experience-based decision-making under different types of the provided information. Int J Environ Res Public Health. 2022;19:10445.
Wang AX, Nguyen BP, Elliott T, Mbinta JF, Sporle A, Simpson CR. Early detection of depression using machine learning and social well-being survey data. In: 2024 16th International Conference on Computer and Automation Engineering(ICCAE). Melbourne: IEEE; 2024. p. 181–6.
Mohi Uddin KM, Biswas N, Rikta ST, Dey SK, Qazi A. XML-LightGBMDroid: a self-driven interactive mobile application utilizing explainable machine learning for breast cancer diagnosis. Eng Rep. 2023;5:e12666.
Ni Q, Zhu J, Chen L, Tan J, Pang J, Sun X, et al. Establishment and interpretation of the gamma pass rate prediction model based on radiomics for different intensity-modulated radiotherapy techniques in the pelvis. Front Phys. 2023;11:1217275.
Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, et al. Lightgbm: a highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst. 2017;30:3146–54.
Pokhrel P, Ioup E, Hoque MT, Abdelguerfi M, Simeonov J. A LightGBM based forecasting of dominant wave periods in oceanic waters. 2021;abs/2105.08721. https://api.semanticscholar.org/CorpusID:234778041.
Wang Y, Tang L, Chen P, Chen M. The role of a deep learning-based computer-aided diagnosis system and elastography in reducing unnecessary breast lesion biopsies. Clin Breast Cancer. 2022;23(3):e112–21.
Kwong A, Co M, Fukuma E. Prospective clinical trial on expanding indications for cryosurgery for early breast cancers. Clin Breast Cancer. 2023;23(4):363–8.
Alfian G, Syafrudin M, Fahrurrozi I, Fitriyani NL, Atmaji FTD, Widodo T, et al. Predicting breast cancer from risk factors using SVM and extra-trees-based feature selection method. Computers. 2022;11:136.
Ming C, Viassolo V, Probst-Hensch N, Chappuis PO, Dinov ID, Katapodi MC. Machine learning techniques for personalized breast cancer risk prediction: comparison with the BCRAT and BOADICEA models. Breast Cancer Res. 2019;21:1–11.
Gao Y, Li S, Jin Y, Zhou L, Sun S, Xu X, et al. An assessment of the predictive performance of current machine learning-based breast cancer risk prediction models: systematic review. JMIR Public Health Surveill. 2022;8:e35750.
Degnim AC, Winham SJ, Frank RD, Pankratz VS, Dupont WD, Vierkant RA, et al. Model for predicting breast cancer risk in women with atypical hyperplasia. J Clin Oncol. 2018;36:1840–6.
Ghani MU, Alam TM, Jaskani FH. Comparison of classification models for early prediction of breast cancer. In: 2019 International Conference on Innovative Computing (ICIC). Lahore: IEEE; 2019. p. 1–6.
Stark GF, Hart GR, Nartowt BJ, Deng J. Predicting breast cancer risk using personal health data and machine learning models. PLoS One. 2019;14:e0226765.
Shen Y, Yang Y, Inoue LYT, Munsell MF, Miller AB, Berry DA. Role of detection method in predicting breast cancer survival: analysis of randomized screening trials. J Natl Cancer Inst. 2005;97:1195–203.

Acknowledgements

The authors would like to thank Dr Ahmad Ostovar for sharing his valuable feedback.

Funding

This research received no funding.

Author information

Authors and Affiliations

Department of Epidemiology and Biostatistics, School of Health, Health Sciences Research Center, Addiction Institute, Mazandaran University of Medical Sciences, Sari, Iran
Zahra Ramezani & Jamshid Yazdani Charati
Gastrointestinal Cancer Research Center, Non-Communicable Diseases Research Institute, Mazandaran University of Medical Sciences, Sari, Iran
Reza Alizadeh-Navaei
Department of Hematology and Oncology, Gastrointestinal Cancer Research Center, Mazandaran University of Medical Sciences, Sari, Iran
Mohammad Eslamijouybari

Authors

Zahra Ramezani
Jamshid Yazdani Charati
Reza Alizadeh-Navaei
Mohammad Eslamijouybari

Contributions

Z. R. wrote the main manuscript text, performed research, conducted the analysis and python programming, implemented the computer code and supporting algorithms, performed visualization, formal analysis, data curation, validation, and conceptualization, interpreted the data, and revised the manuscript. J. Y. Ch. designed the research, contributed to the interpretation and edited the manuscript. R. A. N. contributed to data acquisition and reviewed the manuscript. M. E. reviewed the manuscript.

Corresponding author

Correspondence to Jamshid Yazdani Charati.

Ethics declarations

Ethics approval and consent to participate

Ethics approval for this study was provided by the Research Ethics Committee of Mazandaran University of Medical Sciences, Approval Number IR.MAZUMS.REC.14018820. All experiments and methods were performed in accordance with relevant guidelines and regulations. The study was performed in accordance with the ethical guidelines of the Declaration of Helsinki of the World Medical Association for medical studies. The research design procedures were reviewed and approved by the Research Ethics Committee of Mazandaran University of Medical Sciences. Informed consent was obtained from a parent/guardian and assent was obtained from all participants prior to undergoing research. All participants in this study consented for themselves and provided verbal consent. Verbal consent was approved because the research presented no more than minimal risk of harm to subjects and involved no procedures for which written consent is normally required outside of the research context. The verbal consent procedure was approved by the Research Ethics Committee of the Ministry of Health.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Ramezani, Z., Charati, J.Y., Alizadeh-Navaei, R. et al. Accelerated hazard prediction based on age time-scale for women diagnosed with breast cancer using a deep learning method. BMC Med Inform Decis Mak 24, 314 (2024). https://doi.org/10.1186/s12911-024-02725-7

Received: 14 March 2023
Accepted: 16 October 2024
Published: 28 October 2024
DOI: https://doi.org/10.1186/s12911-024-02725-7
