Data leakage pre/post-processing. #1660
Thibescobar started this conversation in General
Hello all,
I opened this discussion to ask about your practical experience and opinions on the data leakage that occurs when planning (patch size, CNN topology, etc.), pre-processing (spacing, CT normalization, etc.), and possibly post-processing are configured on the full dataset, i.e., outside the cross-validation folds.
I know that nothing beats external validation on data that we do not touch at all until the end of the project. But in practice it is difficult to obtain such large datasets, so we need to do the best we can with what we have.
In this context, I rely a lot on cross-validation, which does not include planning and pre-/post-processing inside its folds. I wonder whether this data leakage introduces a large optimistic bias, or whether, in your experience, it can be considered acceptable.
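To make the concern concrete, here is a minimal sketch of the two setups I am comparing: intensity-normalization statistics estimated once on all cases (where the leakage happens), versus re-estimated inside each fold (leakage-free). Everything here is schematic (synthetic volumes, made-up percentiles), not my actual pipeline:

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
# Synthetic stand-ins for CT volumes; in practice these would be the real cases.
volumes = [rng.normal(loc=40.0, scale=300.0, size=(16, 32, 32)) for _ in range(20)]

def intensity_stats(cases):
    """CT normalization statistics: clip percentiles plus mean/std."""
    voxels = np.concatenate([c.ravel() for c in cases])
    return {
        "p_low": np.percentile(voxels, 0.5),
        "p_high": np.percentile(voxels, 99.5),
        "mean": voxels.mean(),
        "std": voxels.std(),
    }

def normalize(case, stats):
    case = np.clip(case, stats["p_low"], stats["p_high"])
    return (case - stats["mean"]) / (stats["std"] + 1e-8)

# Variant 1 (current practice): statistics estimated once on ALL cases,
# so every validation fold has already influenced the pre-processing.
global_stats = intensity_stats(volumes)

for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(volumes):
    # Variant 2 (leakage-free): statistics re-estimated on the training split only.
    fold_stats = intensity_stats([volumes[i] for i in train_idx])

    stats = fold_stats  # swap for `global_stats` to reproduce the leaking setup
    train_cases = [normalize(volumes[i], stats) for i in train_idx]
    val_cases = [normalize(volumes[i], stats) for i in val_idx]
    # ... train on train_cases, evaluate on val_cases ...
```

My question is essentially how much the two variants differ in practice, given that patch size, spacing, and topology choices are far less case-specific than, say, fitting a classifier on the validation data.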
Details of my particular case:
More precisely, I am doing active learning. I have no labelled data at the start, so I label a few cases (with experts), train a model (M0), predict on more data, have the experts correct the predictions where needed, train a new model (M1), and so on.
At the end of the active-learning procedure, all my data are labelled, so I can train my final model (e.g., model M5). To assess the improvement gained by adding data, I merge cross-validation and testing at the end. For example, if I add 10 cases at each round:
…
From a purely methodological point of view this is dirty, I agree, but I found this approach the best suited to my context. However, this only holds if the CV data leakage can be neglected; otherwise, we cannot really tell whether we are measuring the improvement from adding more data or from leaking more...
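For reference, here is a rough sketch of the kind of loop I mean, with per-round cross-validation on the cases labelled so far. Everything is schematic: the round size, the number of rounds, and the `train_model` / `evaluate` stubs are placeholders, not my actual code, and this is not exactly the merged CV/test scheme described above:

```python
from sklearn.model_selection import KFold

def train_model(cases):      # placeholder for the real training code
    return {"n_train": len(cases)}

def evaluate(model, cases):  # placeholder metric (e.g. mean Dice on `cases`)
    return 0.0

unlabeled = list(range(100))  # case identifiers, not yet annotated
labeled = []
round_size = 10
mean_scores = []

for round_id in range(5):
    # Experts annotate the next batch: labelled from scratch in round 0,
    # corrected model predictions in later rounds (models M0, M1, ...).
    batch, unlabeled = unlabeled[:round_size], unlabeled[round_size:]
    labeled += batch

    # Cross-validate on everything labelled so far to track the effect of
    # adding data. Planning/pre-processing would typically see all of
    # `labeled`, which is exactly the leakage this question is about.
    scores = []
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True,
                                    random_state=round_id).split(labeled):
        model = train_model([labeled[i] for i in train_idx])
        scores.append(evaluate(model, [labeled[i] for i in val_idx]))
    mean_scores.append(sum(scores) / len(scores))

# mean_scores[r] is the CV score after round r: does its increase reflect
# the added data, or partly the shared planning/pre-processing?
```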
Thank you very much in advance for your comments and advice!
Have a good day!