Longevity + Machine Learning

Infinit Plus: Predicting Biological Age with Biomarkers

I had the chance to work on a small project in an industry I really like, longevity, and in this post I will share the details of that journey.

I got referred by our CEO to a doctor who works in this field, to help him get a research project off the ground, with the potential of it turning into a product later. He had 15 years of biomarker data (blood tests, weight, height, and other measures), and he needed a predictive model to calculate the biological age of a client from these biomarkers.

He had some considerations: for example, only studying people in the 18-to-45 age range, and narrowing the biomarkers down to “Glucose”, “Cholesterol”, “Lymphocyte”, “Corpuscular”, “Height” and “Weight” for ease of processing. In my mind it was a very straightforward task, a simple regression model using supervised learning techniques, that’s it. So I asked for the data, and that’s when the challenges surfaced…

Data Collection

The first thing I do in data-related projects is a health check of the dataset, to make sure we have everything we need and it’s in good shape to work with.
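In pandas, that health check boils down to a couple of calls (the filename here is just a placeholder):

import pandas as pd

# Load the CSV the doctor assembled (hypothetical filename)
df = pd.read_csv("biomarkers.csv")

# Structural overview: column names, non-null counts, dtypes, memory usage
df.info()

# Fraction of missing values per column, worst offenders first
print(df.isna().mean().sort_values(ascending=False))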

This is the CSV file I have received:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113249 entries, 0 to 113248
Data columns (total 20 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   SEQN      113249 non-null  int64  
 1   RIDAGEYR  113249 non-null  float64
 2   RIAGENDR  113249 non-null  int64  
 3   Gender    113249 non-null  object 
 4   LBXTC     66304  non-null  float64
 5   LBXLYPCT  82744  non-null  float64
 6   LBXMCVSI  82926 non-null   float64
 7   LBXGLU    35009 non-null   float64
 8   LBXIN     9501  non-null   float64
 9   Cycle     113249 non-null  object 
 10  LB2DAY    544 non-null     float64
 11  LB2TC     542 non-null     float64
 12  LB2TCSI   542 non-null     float64
 13  LB2HDL    542 non-null     float64
 14  LB2HDLSI  542 non-null     float64
 15  LB2TR     284 non-null     float64
 16  LB2TRSI   284 non-null     float64
 17  LB2LDL    276 non-null     float64
 18  LB2LDLSI  276 non-null     float64
 19  LBXTST    21926 non-null   float64
dtypes: float64(16), int64(2), object(2)
memory usage: 17.3+ MB

Right at first sight, I noticed the challenges with this dataset:

  1. The column names were encoded.
  2. There was a LOT of missing data.

So I reached out to the doctor and told him about the health of the data. He said he was actually aware of it, gave me the mapping for the column names, and explained that the data collection process had changed in each cycle, so not everybody had their blood tested the same way. In total we had about 69% missing glucose, 41% cholesterol, 27% lymphocyte and corpuscular volume, and very high gaps in insulin (91%) and testosterone (81%).

Since he said he had assembled the CSV file himself, I got worried that a mistake might have crept in while importing the data, so I decided to ask for the data source so I could collect the data points with code and be sure about their quality.

The source of the data was the National Health and Nutrition Examination Survey (NHANES), which measures the health and nutrition of adults and children in the United States. NHANES is the only national health survey that includes health exams and laboratory tests for participants of all ages, so I was a bit relieved that we had a huge dataset in hand and that there was plenty of research and documentation about it. This actually allowed me to use other biomarkers than what was initially requested.

I gathered the data cycle by cycle, merged the tables into a single dataframe, and renamed the columns to the actual biomarker names.
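In condensed form, the per-cycle assembly looked roughly like this. NHANES ships each cycle as SAS transport (.XPT) files; the file names below belong to the 2017–2018 cycle, and which variable lives in which file varies across cycles, so treat it as a sketch:

import pandas as pd
from functools import reduce

# One NHANES cycle: demographics plus the lab/body-measure files we need.
# The "_J" suffix marks the 2017-2018 cycle; earlier cycles use other suffixes.
files = ["DEMO_J.XPT", "TCHOL_J.XPT", "GLU_J.XPT", "CBC_J.XPT", "BMX_J.XPT"]
tables = [pd.read_sas(f, format="xport") for f in files]

# Every NHANES table carries SEQN, the participant id, so we join on it
cycle = reduce(lambda a, b: a.merge(b, on="SEQN", how="outer"), tables)

# Map the encoded NHANES variable names to readable biomarker names
# (the remaining biomarkers come from other per-cycle files)
cycle = cycle.rename(columns={
    "RIDAGEYR": "age",
    "RIAGENDR": "gender",
    "LBXGLU": "glucose",
    "LBXTC": "cholesterol",
    "LBXLYPCT": "lymphocyte",
    "LBXMCVSI": "corpuscular",
    "LBXRDW": "red cell distribution width",
    "LBXWBCSI": "white blood cell count",
    "BMXWT": "weight",
    "BMXHT": "height",
    "BMXBMI": "bmi",
})

# Repeat per cycle, then stack all cycles into one dataframe:
# df = pd.concat(all_cycles, ignore_index=True)

This is what I got: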

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18179 entries, 0 to 18178
Data columns (total 17 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   SEQN                         18179 non-null  float64
 1   age                          18179 non-null  float64
 2   gender                       18179 non-null  float64
 3   glucose                      12454 non-null  float64
 4   cholesterol                  14362 non-null  float64
 5   lymphocyte                   16042 non-null  float64
 6   insulin                      5918  non-null  float64
 7   corpuscular                  16072 non-null  float64
 8   red cell distribution width  16072 non-null  float64
 9   white blood cell count       16072 non-null  float64
 10  albumin                      12458 non-null  float64
 11  alkaline phosphatase         12455 non-null  float64
 12  creatinine                   12456 non-null  float64
 13  protein                      12445 non-null  float64
 14  weight                       17965 non-null  float64
 15  height                       17083 non-null  float64
 16  bmi                          17060 non-null  float64
dtypes: float64(17)
memory usage: 2.4 MB

The good news was that I had much cleaner data with fewer gaps, but the sample size was much smaller (and reduced to a third after filtering to the age range), the “Testosterone” data was not found in the NHANES bank, and we still had 50% missing data on “Insulin”.


Data Cleaning

I quickly checked the correlations to see how much insulin is related to age, to decide how much I should invest in cleaning it. Judging by the heatmap, it didn’t answer my question: there were few strong correlations with most of the parameters.
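For reference, a heatmap like that takes only a few lines (assuming the merged dataframe from above):

import matplotlib.pyplot as plt
import seaborn as sns

# Pearson correlations between age and the biomarkers (SEQN is just an id)
corr = df.drop(columns=["SEQN"]).corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", center=0)
plt.title("Biomarker correlation heatmap")
plt.show()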

So I decided to just drop the rows with missing values (which gave me only 2200 samples to work with) and start training, to see what we were dealing with. I wanted to see how important insulin was in predicting age.

I used the XGBoost regressor, a tree-based approach; a rough sketch of that first pass is below.
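In condensed form, it looked roughly like this (the hyperparameters and the 80/20 split are illustrative):

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

# First pass: keep only complete rows (~2,200 samples survive)
data = df.dropna()
X = data.drop(columns=["SEQN", "age"])
y = data["age"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = XGBRegressor(n_estimators=300, learning_rate=0.05, random_state=42)
model.fit(X_train, y_train)

mse = mean_squared_error(y_test, model.predict(X_test))
print(f"Mean Squared Error: {mse:.2f}")

# Tree-based feature importances, to see where insulin lands
for name, score in sorted(zip(X.columns, model.feature_importances_),
                          key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")

And the results were awful: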

Mean Squared Error: 52.92

Which means the model was off by about 7.27 years on average (the square root of the MSE). And as the cherry on top, the importance of insulin turned out to be very high, even higher than corpuscular volume, which had the highest correlation with age:

So I decided not to drop the missing values, and instead try to fill the gaps first; this way I could use all of the data (almost doubling the sample size).

I didn’t use the usual practices, which basically fill in the mean or median of the values, because these biomarkers can be absolutely different from one person to another depending on their age and other parameters. So I chose to go with the iterative approach, which models each feature with missing values as a function of the other features and iteratively predicts the missing values.
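scikit-learn ships this as the (still experimental) IterativeImputer; a minimal sketch, leaving the age column out so the target doesn’t leak into the imputed features:

import pandas as pd
# enable_iterative_imputer must be imported first; the API is experimental
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

features = df.drop(columns=["SEQN", "age"])

# Each feature with gaps is regressed on the other features; the predictions
# fill the gaps, and the process repeats until the estimates stabilize
imputer = IterativeImputer(max_iter=10, random_state=42)
filled = pd.DataFrame(imputer.fit_transform(features),
                      columns=features.columns)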

This is the result I got with the cleaned dataset (with 5170 samples):

Mean Squared Error: 48.93

Which means the model was off by about 6.99 years on average, which is better than the previous try, but still not very good. So I thought I could do better: instead of the iterative approach, I would use an unsupervised approach with the KNN imputer.

In this method, for each missing value, we find the k-nearest samples (rows) using the other features, and impute the missing value as the mean (or weighted mean) of those neighbors.
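scikit-learn ships this one as KNNImputer; a minimal sketch (k=5 and distance weighting are illustrative choices):

import pandas as pd
from sklearn.impute import KNNImputer

features = df.drop(columns=["SEQN", "age"])

# For each gap, find the 5 rows most similar on the observed features and
# take a distance-weighted mean of their values for the missing feature.
# Note: distances use raw units, so scaling the features first can matter.
imputer = KNNImputer(n_neighbors=5, weights="distance")
filled = pd.DataFrame(imputer.fit_transform(features),
                      columns=features.columns)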

But surprisingly, the results got even worse than the first try:

Mean Squared Error: 57.80

Which means the model was off by about 7.60 years on average!!! The issue is that although this approach is simple, intuitive, and very fast for small to medium datasets, it may not capture complex feature relationships.

OK, Wait a minute...

While I was thinking about the ways I could reduce the error, it suddenly hit me:

In this dataset, the age column (which I’m trying to predict) is the chronological age of the individuals, and the goal is to calculate their biological age. So should I consider the model’s prediction as the biological age? If so, then what does the error even mean?

If my dilemma is not clear to you yet, let me put it this way:

What if the age calculated by the model is the actual biological age I’m looking for, and I’m ruining it by trying to minimize the error?!

I thought about this for two days straight, and I was completely lost. So I decided to double-check: I shared my code and results with an AI, explained the project, and asked it to evaluate my work and see if there was anything wrong with it.

It just gave me the usual recommendations, like adding more biomarkers to the study and acknowledging the high rate of missing values, and encouraged me to use cross-validation and optimization for better results. But it didn’t catch the actual circular trap I was in. Even after I raised the question and the concern I had, it just said:

“Your model gives you a proxy for biological age, with about ±7 years of precision. Refining that precision (by adding more biomarkers, reducing missingness, stacking models, etc.) shrinks the “noise band” so you can detect smaller biological‐age deviations with confidence.”

So I decided to actually explain the logical fallacy I was worried about (which, BTW, has a cool name: petitio principii, also known as begging the question), and then it finally got my point and confirmed:

“You’re not crazy—your intuition is spot on. By training a model to mimic chronological age, you risk smoothing over the very biological variation you care about. But if you treat the model’s prediction as biological age and its error as meaningful deviation, you can still extract useful insights.” (Duh)

Syncing up with the SME

I tried to do a sanity check before moving forward, and shared the progress and results I had so far with the doctor, to see whether I was on the right track. I decided to extract a SHAP chart (a way to visualize feature contributions to a model’s predictions) from the best model I had, so he could tell me whether the model’s use of the parameters made sense to begin with.
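Producing the chart is short (assuming the model and X_test from the training step above):

import shap

# TreeExplainer computes exact SHAP values for tree ensembles like XGBoost
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Beeswarm summary: one dot per individual per feature, colored by the
# feature's value, positioned by its push on the predicted age
shap.summary_plot(shap_values, X_test)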

In this chart, each dot represents an individual.

The x-axis shows the SHAP value, which is the impact of each feature on the model’s prediction. Red dots indicate higher biomarker values, while blue dots indicate lower values.

For example, higher cholesterol values push the prediction toward an older biological age (to the right), while lower values push it younger (to the left). The same effect is seen for glucose. Insulin shows a similar pattern: higher values push toward an older age, while lower values push toward a younger age.

For corpuscular volume, both red and blue dots appear on both sides, suggesting its effect is more individual-specific and context-dependent.

Features lower on the chart generally have less overall impact on the prediction than those higher up. The clustering of dots near zero is natural and reflects individuals with average biomarker values that don’t strongly shift the prediction either way (like gender, which makes perfect sense).

After these explanations, he confirmed that the model was broadly aligned with medical knowledge, so I was relieved that I was on the right track.

So What was the issue?

Once I shared my concern with the doctor, he simply confirmed that this is a very well-known issue, and in fact a profound dilemma at the heart of biomarker-based aging models. See, biological age is meant to capture physiological wear and tear, that is, how “old” a person’s body truly is. Our model will capture average age-related changes, which might not fully translate into health risk. That’s fine for calibration, but it can wash out the very biological differences we care about. Also, for us the useful quantity was the acceleration (how far someone sits above or below the expected age for their peers), not the raw number.
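In code, the acceleration is just the model’s residual. A minimal sketch (the age-trend correction at the end is a common practice in aging-clock work, not something we had implemented at this point):

import numpy as np

# The model's prediction plays the role of "biological age"; the residual
# against chronological age is the aging acceleration
predicted = model.predict(X)
acceleration = predicted - y   # positive: "older" than chronological age

# Regressors shrink toward the mean age (young people predicted too old,
# old people too young), so the residual is often de-trended on age
slope, intercept = np.polyfit(y, acceleration, deg=1)
adjusted_acceleration = acceleration - (slope * y + intercept)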

As we discussed this matter, we reached the conclusion that for better predictions we had three options, but each came with its own challenges:

  1. Including more modern data, such as DNA and genetic information. This wasn’t easily available, and besides, we wanted a model that predicts the acceleration of aging and gives users recommendations to slow it down, from a simple blood test.
  2. Adding behavioral metrics like diet, lifestyle habits, and mortality measures. But we had avoided these in the first place because we wanted to rely only on blood tests, not on self-reported data that could be inaccurate.
  3. Using a larger and richer dataset (such as the UK Biobank), which not only has more data but also includes repeated measurements from the same patients across different cycles (which becomes super helpful because it shows how biomarkers change as people grow old). The challenge with this option was that access required taking courses and passing exams (the doctor was already in the process), working within their lab environment (which left me out), and dealing with possible biases (for example, patients might only visit the Biobank when they are already unwell, which could skew the data).

Since the doctor mentioned he was trying to gain access to this dataset, but that it would take time, I decided to also do some research on my own to see what else is out there and how other people are doing it.
