First Round
My team entered the RIS 2026 competition, where this year’s problem was classifying bacteria based on their spectra. The competition was split into two rounds. In each round we were given training and test data: the training data contained images with their classifications, and our task was to generate classifications for the images in the test data. In the first round we were given only RGB images of agar plates with bacteria on them. Using a convolutional neural network we managed to classify 50 % of the images correctly, which was enough for 2nd place out of 19.
The Problem
In the second round we were given 3D images in which the third dimension was the spectrum: the amount of light reflected at each wavelength. The images were cropped to contain only the bacteria, and the agar in the corners was set to the value -1 so that we could focus only on the spectrum (and the shape, if we wanted).
Our task was to classify which bacterium was in a given image. The possible bacteria were:
- Escherichia coli (Ecoli),
- Enterococcus faecalis (Efae),
- Klebsiella aerogenes (Kaer),
- Klebsiella pneumoniae (Kpne),
- Pseudomonas aeruginosa (Paer),
- Staphylococcus aureus (Saur),
- Staphylococcus epidermidis (Sepi),
- Streptococcus pyogenes (Spyo).
In our team we decided that each member would try to create his own model to see what works best. I decided to go with statistics, and my two teammates both went with AI models.
Spectrum comparison
First, I wanted to generate a spectrum: a list of the average brightness at each wavelength. This is an easy task. Next I wanted to see how much the spectrum differs from one bacterium to another, so I generated an average spectrum for each bacterium.
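The averaging step described above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the names `mean_spectrum` and `class_mean_spectra` are made up for this example, each image is assumed to be an array of shape (height, width, wavelengths), and masked agar pixels are assumed to hold -1 in every band.

```python
import numpy as np

def mean_spectrum(cube, mask_value=-1.0):
    """Average brightness per wavelength over all non-masked pixels.

    cube: array of shape (height, width, n_wavelengths).
    Assumption: masked pixels contain mask_value in every band.
    """
    cube = np.asarray(cube, dtype=float)
    # A pixel counts as valid if at least one band differs from the mask value.
    valid = np.any(cube != mask_value, axis=2)
    # Boolean indexing flattens the valid pixels to (n_valid, n_wavelengths).
    return cube[valid].mean(axis=0)

def class_mean_spectra(cubes, labels):
    """Average the per-image spectra of each class (one entry per bacterium)."""
    spectra = np.stack([mean_spectrum(c) for c in cubes])
    labels = np.asarray(labels)
    return {str(name): spectra[labels == name].mean(axis=0)
            for name in np.unique(labels)}
```

Averaging per image first and then per class gives every image the same weight regardless of how many pixels the colony covers; averaging over all pixels of a class at once would be an equally valid choice.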
As you can see, the spectrum is very similar from one bacterium to another; the main difference is in brightness. So I tried creating a first classification model: for each image, compute the deviation (the mean squared difference) between its spectrum and the average spectrum of each bacterium, and simply take the bacterium with the lowest deviation. That is our prediction. Mathematically:
$$ \sigma = \sum_{i = \lambda_{\min}}^{\lambda_{\max}} \frac{(x_i - x_{b i})^2}{\lambda_{\max} - \lambda_{\min}} $$
where \( x_i \) is the brightness at wavelength \( i \) of the image and \( x_{b i} \) is the average brightness at wavelength \( i \) over the images of a specific bacterium.
This simple model achieved 34 % correct classifications, which is not bad for its simplicity, but we can do a lot better.
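A minimal sketch of this lowest-deviation classifier might look as follows. The function names and the dictionary of reference spectra are assumptions made for the example, not the code used in the competition.

```python
import numpy as np

def deviation(spectrum, reference):
    """Sum of squared differences divided by (number of wavelengths - 1).

    With wavelength indices running from lambda_min to lambda_max, the
    divisor lambda_max - lambda_min equals len(spectrum) - 1.
    """
    spectrum = np.asarray(spectrum, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return np.sum((spectrum - reference) ** 2) / (len(spectrum) - 1)

def classify(spectrum, references):
    """Return the name of the reference spectrum with the lowest deviation.

    references: dict mapping a bacterium name to its average spectrum.
    """
    return min(references, key=lambda name: deviation(spectrum, references[name]))
```

Because the divisor is the same for every bacterium, it does not change which one wins; it only keeps the value comparable between spectra of different lengths.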
Derivative of the spectrum
If you look again at the first image, you can see that Kpne is an impostor among the other bacteria because it is not as bright at shorter wavelengths relative to longer ones. In other words, the spectrum of Kpne changes differently from those of the other bacteria: it has a different derivative. Calculating the derivative is as easy as subtracting neighbouring wavelengths:
$$ x_{d i} = x_{i + 1} - x_i $$
Here are derivatives for all bacteria:
The graphs are much more similar to one another, but they are also rougher and differ a lot in some small sections. If we try the same model as before with the derivatives of the spectra instead of the spectra themselves, we get an amazing 64 %. So subtracting each value from the next makes the model almost twice as good.
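In NumPy the discrete derivative above is a single call to `np.diff`, which returns \( x_{i + 1} - x_i \) for each neighbouring pair. The sketch below (the function name is an assumption) also shows one property that may help explain the improvement: a constant brightness offset cancels out in the difference.

```python
import numpy as np

def spectrum_derivative(spectrum):
    """Discrete derivative: difference between neighbouring wavelengths.

    The result has one element fewer than the input spectrum.
    """
    return np.diff(np.asarray(spectrum, dtype=float))

# Two spectra with the same shape but a different overall brightness offset
dim = np.array([1.0, 3.0, 6.0, 10.0])
bright = dim + 5.0

# Both have the same derivative, so the offset no longer affects the comparison.
same_shape = np.allclose(spectrum_derivative(dim), spectrum_derivative(bright))
```

Note that only an additive offset cancels this way; a multiplicative change in brightness would scale the derivative as well.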
Splitting the spectrum
If we compare only the first half of the spectrum, the score jumps from 64 % to 71 %. That means that some parts of the spectrum are more important than others, and some are completely useless or even harm the model. So if one half is better than the whole, what about one third, or one tenth, or the first tenth and the last tenth? Well, we can just split the spectrum into \( N \) parts and brute force which parts should be enabled and which should not. If we split the spectrum into 10 parts, each part is either used or not used, so there are \( 2^{10} \) possible combinations to try. In general there are \( 2^N \) combinations, where \( N \) is the number of parts. Because we have 184 different wavelengths and \( 184 = 2^3 \cdot 23 \), I decided to split the spectrum into 23 parts of 8 wavelengths each. Brute forcing all combinations on a supercomputer with 96 threads took 11 seconds and gave a result of 79 %.
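The brute-force search can be sketched as below. This is an illustration under assumptions, not the code that ran on the supercomputer: the function names are made up, labels are assumed to be integer indices into the array of reference spectra, and the number of features is assumed to be divisible by the number of parts. The squared error of every part is computed once, so evaluating one combination is only a sum over the enabled parts.

```python
import itertools
import numpy as np

def segment_errors(samples, references, n_segments):
    """Squared error for every (sample, class, segment) triple.

    samples: (n_samples, n_features), references: (n_classes, n_features).
    Assumption: n_features is divisible by n_segments.
    """
    samples = np.asarray(samples, dtype=float)
    references = np.asarray(references, dtype=float)
    diff2 = (samples[:, None, :] - references[None, :, :]) ** 2
    n_samples, n_classes, n_features = diff2.shape
    seg_len = n_features // n_segments
    return diff2.reshape(n_samples, n_classes, n_segments, seg_len).sum(axis=3)

def best_segment_mask(samples, labels, references, n_segments):
    """Try every on/off combination of segments and keep the most accurate one."""
    labels = np.asarray(labels)
    errors = segment_errors(samples, references, n_segments)
    best_mask, best_acc = None, -1.0
    for bits in itertools.product([False, True], repeat=n_segments):
        mask = np.array(bits)
        if not mask.any():
            continue  # an empty selection cannot classify anything
        total = errors[:, :, mask].sum(axis=2)   # error using enabled parts only
        acc = np.mean(total.argmin(axis=1) == labels)
        if acc > best_acc:
            best_mask, best_acc = mask, acc
    return best_mask, best_acc
```

With 23 parts the loop visits \( 2^{23} = 8\,388\,608 \) combinations, which is slow in a single pure-Python loop, so a real run would need to be split across threads or processes. Selecting the mask on the same data used to measure accuracy can also overfit, so a held-out split is worth considering.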
Some parts of the spectrum are the same for all bacteria except one. This causes some problems, and one solution is to split the model into 8 different models, one for each bacterium, each with a yes-or-no output. Most models were 95 - 99 % correct and one, Sepi, was 85 % correct. We can easily combine them by selecting the only yes output when exactly one model says yes. Most of the time that is the case, and in this subgroup the combined model scores 94 %. But what if there are several yes outputs, or none? For several, I wrote some rules by trial and error for which output should be selected. If there are no yes outputs, most of the time the actual result is Sepi. That is because Sepi is weird and sometimes even the Sepi model (the one with 85 %) does not detect it, so in that case the model just says Sepi. The final model scores 90 % on the training data.
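The combination logic described above could be sketched like this. The function name and the `tie_break` hook are hypothetical; the hand-written rules for several yes outputs are not given in this post, so the sketch only leaves a place to plug them in and otherwise falls back to a placeholder choice.

```python
def combine_binary_votes(votes, fallback="Sepi", tie_break=None):
    """Combine the yes/no outputs of the per-bacterium models.

    votes: dict mapping a bacterium name to True (yes) or False (no).
    tie_break: optional function that picks one name from several yes
               outputs (hypothetical hook for hand-written rules).
    """
    positives = [name for name, said_yes in votes.items() if said_yes]
    if len(positives) == 1:
        return positives[0]          # the only yes output wins
    if not positives:
        return fallback              # no model said yes: assume Sepi
    if tie_break is not None:
        return tie_break(positives)  # custom rules for several yes outputs
    return positives[0]              # placeholder only, not the author's rules

# Usage with made-up votes
single = combine_binary_votes({"Ecoli": False, "Kpne": True, "Sepi": False})
none = combine_binary_votes({"Ecoli": False, "Kpne": False, "Sepi": False})
```

Here `single` is `"Kpne"` and `none` is `"Sepi"`.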
Results
In the end we combined my statistical model with the AI model from one of my teammates, and our score on the test data was 86.8 %. That put us in 6th place. First place had 94.6 % and second 90.4 %. The scores for the separate models from each teammate were 85.6 % for my statistical model, 85.6 % for one teammate’s AI model and 85.0 % for the other teammate’s AI model.
Results from all teams were very close. We learned a lot and we will be going to this competition again!
