How to improve a 5-class Diabetic Retinopathy model (APTOS 2019) – Mixed predictions across classes [P]

Hi everyone,

I’m a final-year Computer Engineering student building a Flask-based AI Diabetic Retinopathy Detection system. The web application itself is complete with patient management, authentication, dashboard, PDF report generation, prediction history, and AI inference.

The only issue I’m facing is with the AI model.

I’m using a 5-class Diabetic Retinopathy classifier trained on the APTOS 2019 dataset.

Classes:

No DR

Mild

Moderate

Severe

Proliferative DR

The model predicts all five classes, but the predictions are inconsistent.

Examples:

Moderate is sometimes classified as Severe or Proliferative.

Severe is often classified as Moderate or Proliferative and is rarely predicted correctly.

Some fundus images from outside the APTOS dataset produce completely unexpected results.

The model sometimes shows very high confidence (90%+) even when the prediction appears incorrect.
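To put numbers on the confusion described above, a confusion matrix plus quadratic weighted kappa (the metric the APTOS 2019 competition was scored with) is a reasonable starting point. The five grades are ordered, so confusing Severe with Moderate costs less than confusing Severe with No DR, and kappa reflects that. This is a minimal pure-Python sketch: the `y_true` and `y_pred` lists are made-up illustration data, not results from my model, and the grades are assumed to be encoded 0 to 4 in the order listed earlier.

```python
def confusion_matrix(y_true, y_pred, n_classes=5):
    """Rows are true grades, columns are predicted grades."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """Cohen's kappa where an error between grades i and j costs (i - j)^2."""
    cm = confusion_matrix(y_true, y_pred, n_classes)
    n = len(y_true)
    row_totals = [sum(r) for r in cm]
    col_totals = [sum(cm[i][j] for i in range(n_classes)) for j in range(n_classes)]
    observed = 0.0
    expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            observed += w * cm[i][j]
            # Agreement expected by chance from the marginal totals.
            expected += w * row_totals[i] * col_totals[j] / n
    return 1.0 - observed / expected


# Hypothetical labels for illustration only (0 = No DR ... 4 = Proliferative DR).
y_true = [0, 0, 1, 2, 2, 3, 3, 4]
y_pred = [0, 0, 1, 2, 3, 2, 4, 4]

cm = confusion_matrix(y_true, y_pred)
kappa = quadratic_weighted_kappa(y_true, y_pred)
for grade, row in enumerate(cm):
    print(grade, row)
print("quadratic weighted kappa:", round(kappa, 3))
```

In this toy data the Severe row (grade 3) has nothing on the diagonal, which is the same pattern as in the examples above. Note the function divides by zero if every label and every prediction fall in a single class.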

Things I’ve already tried:

Different pretrained models (including a ResNet50 trained on APTOS)

ResNet152 implementation

Correct preprocessing (RGB conversion, resizing, normalization)

Verified class mapping

Softmax confidence scores

Test-Time Augmentation (TTA)

Image quality validation

Top-3 predictions instead of only one prediction
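One thing not in the list above is calibration. Raw softmax scores from a deep network are often overconfident, and temperature scaling is a common post-hoc fix: divide the logits by a single scalar T fitted on a held-out validation set. The sketch below fits T with a simple grid search on negative log-likelihood. The logits and labels are invented for illustration, and the candidate grid (0.5 to 5.0) is an arbitrary assumption.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logits_batch, labels, temperature):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for logits, y in zip(logits_batch, labels):
        total -= math.log(softmax(logits, temperature)[y])
    return total / len(labels)


def fit_temperature(logits_batch, labels, candidates=None):
    """Pick the temperature with the lowest validation NLL (grid search)."""
    if candidates is None:
        candidates = [0.5 + 0.1 * k for k in range(46)]  # 0.5 .. 5.0
    return min(candidates, key=lambda t: mean_nll(logits_batch, labels, t))


# Hypothetical validation logits: two confident hits, two confident misses.
val_logits = [
    [9.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 8.0, 2.0, 1.0],
    [0.0, 0.0, 2.0, 1.0, 9.0],
    [0.0, 0.0, 7.0, 1.0, 0.0],
]
val_labels = [0, 3, 3, 2]

T = fit_temperature(val_logits, val_labels)
raw = softmax(val_logits[1])
calibrated = softmax(val_logits[1], T)
print("fitted T:", T)
print("top confidence before/after:", max(raw), max(calibrated))
```

Temperature scaling does not change which class is predicted, only how confident the score looks, so it addresses the 90%+ confidence on wrong predictions but not the misclassifications themselves.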

I’m trying to understand whether this is:

A domain shift problem between APTOS and other datasets?

A limitation of the pretrained model?

A preprocessing issue?

Class imbalance?

Or simply expected behavior in 5-class DR classification?
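For the class imbalance hypothesis, a quick check is to count the training labels per grade and derive inverse-frequency class weights. The sketch below uses the common "balanced" heuristic, weight = n / (n_classes * count). The `train_labels` list is a made-up toy distribution, not the real APTOS counts, so substitute the labels from your own training CSV.

```python
from collections import Counter


def inverse_frequency_weights(labels, n_classes=5):
    """Balanced class weights: n / (n_classes * count) for each class."""
    counts = Counter(labels)
    n = len(labels)
    return [
        n / (n_classes * counts[c]) if counts[c] else 0.0
        for c in range(n_classes)
    ]


# Toy labels for illustration only (0 = No DR ... 4 = Proliferative DR).
train_labels = [0] * 10 + [1] * 2 + [2] * 5 + [3] * 1 + [4] * 2

weights = inverse_frequency_weights(train_labels)
for grade, w in enumerate(weights):
    print(grade, train_labels.count(grade), round(w, 2))
```

If the Severe grade turns out to be the rarest class, that would be consistent with it rarely being predicted correctly. Weights like these can be passed to a weighted loss when fine-tuning, for example the `weight` argument of `torch.nn.CrossEntropyLoss` in PyTorch.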

I’m also considering using an ensemble (ResNet50 + EfficientNet + DenseNet), but it’s difficult to find compatible pretrained 5-class diabetic retinopathy models.
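For reference, the ensemble I have in mind is plain soft voting: average the per-class probabilities from each model and take the argmax. A minimal sketch is below. It assumes every model outputs five probabilities in the same class order, and the three probability vectors are invented placeholders rather than real model outputs.

```python
def soft_vote(prob_lists, weights=None):
    """Weighted average of per-model probability vectors for one image."""
    if weights is None:
        weights = [1.0] * len(prob_lists)
    total_weight = sum(weights)
    n_classes = len(prob_lists[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, prob_lists)) / total_weight
        for c in range(n_classes)
    ]


CLASS_NAMES = ["No DR", "Mild", "Moderate", "Severe", "Proliferative DR"]

# Hypothetical outputs of three models for the same fundus image.
resnet_probs = [0.05, 0.05, 0.30, 0.20, 0.40]
effnet_probs = [0.02, 0.08, 0.60, 0.25, 0.05]
densenet_probs = [0.03, 0.07, 0.50, 0.30, 0.10]

ensemble = soft_vote([resnet_probs, effnet_probs, densenet_probs])
predicted = ensemble.index(max(ensemble))
print(CLASS_NAMES[predicted], [round(p, 3) for p in ensemble])
```

Since the averaging only needs matching class order, the individual models would not have to come from the same source. They could be ImageNet-pretrained backbones fine-tuned separately on APTOS.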

I’d really appreciate advice from anyone who has worked on retinal image classification or medical AI.

My questions are:

  1. Is this level of class confusion common in diabetic retinopathy models?

  2. What preprocessing techniques made the biggest improvement for you (CLAHE, retinal cropping, illumination correction, etc.)?

  3. Has anyone significantly improved results using ensemble models?

  4. Are there any high-quality pretrained 5-class DR models that you’d recommend?

  5. If you were in my situation, what would be the first thing you’d investigate to improve prediction consistency?
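To make question 2 concrete, this is the kind of retinal cropping I mean: drop the black border around the fundus so that images from different cameras end up framed similarly before resizing. It is a rough NumPy sketch. The threshold of 10 is an arbitrary assumption, and the synthetic disk image stands in for a real fundus photo.

```python
import numpy as np


def crop_to_retina(image, threshold=10):
    """Crop an H x W x 3 uint8 image to the bounding box of non-dark pixels."""
    gray = image.mean(axis=2)
    mask = gray > threshold
    if not mask.any():
        return image  # nothing brighter than the threshold; leave untouched
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    return image[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]


# Synthetic stand-in for a fundus photo: a gray disk on a black background.
yy, xx = np.ogrid[:100, :140]
disk = (yy - 50) ** 2 + (xx - 70) ** 2 <= 40 ** 2
image = np.zeros((100, 140, 3), dtype=np.uint8)
image[disk] = 120

cropped = crop_to_retina(image)
print(image.shape, "->", cropped.shape)
```

A fixed threshold like this can fail on very dark or noisy images, so it would need checking against real samples from each dataset.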

Any suggestions, GitHub repositories, pretrained models, research papers, or personal experiences would be greatly appreciated.

Thanks in advance!

submitted by /u/Delicious_Corner_754