Data leakage pre/post-processing. #1660
Thibescobar started this conversation in General
Hello all,
I opened this discussion to ask about your practical experience and opinions on the data leakage that occurs when planning (patch size, CNN topology, etc.), pre-processing (spacing, CT normalization, etc.), and possibly post-processing are configured on the full dataset, i.e., outside the cross-validation folds.
I know that nothing beats external validation on data that we do not touch at all until the end of the project. But in practice it is difficult to obtain such large datasets, so we need to do the best we can with what we have.
In this context, I rely a lot on cross-validation, which does not include planning and pre-/post-processing inside its folds. I wonder whether this data leakage introduces a large optimistic bias, or whether, in your experience, it can be considered acceptable.
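To make the concern concrete, here is a minimal sketch of the two setups I am comparing: intensity-normalization statistics estimated once on all cases (where the leakage happens), versus re-estimated inside each fold (leakage-free). Everything here is schematic (synthetic volumes, made-up percentiles), not my actual pipeline:

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
# Synthetic stand-ins for CT volumes; in practice these would be the real cases.
volumes = [rng.normal(loc=40.0, scale=300.0, size=(16, 32, 32)) for _ in range(20)]

def intensity_stats(cases):
    """CT normalization statistics: clip percentiles plus mean/std."""
    voxels = np.concatenate([c.ravel() for c in cases])
    return {
        "p_low": np.percentile(voxels, 0.5),
        "p_high": np.percentile(voxels, 99.5),
        "mean": voxels.mean(),
        "std": voxels.std(),
    }

def normalize(case, stats):
    case = np.clip(case, stats["p_low"], stats["p_high"])
    return (case - stats["mean"]) / (stats["std"] + 1e-8)

# Variant 1 (current practice): statistics estimated once on ALL cases,
# so every validation fold has already influenced the pre-processing.
global_stats = intensity_stats(volumes)

for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(volumes):
    # Variant 2 (leakage-free): statistics re-estimated on the training split only.
    fold_stats = intensity_stats([volumes[i] for i in train_idx])

    stats = fold_stats  # swap for `global_stats` to reproduce the leaking setup
    train_cases = [normalize(volumes[i], stats) for i in train_idx]
    val_cases = [normalize(volumes[i], stats) for i in val_idx]
    # ... train on train_cases, evaluate on val_cases ...
```

My question is essentially how much the two variants differ in practice, given that patch size, spacing, and topology choices are far less case-specific than, say, fitting a classifier on the validation data.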
Details of my particular case:
More precisely, I am doing active learning. I have no labelled data at the start, so I label a few cases (with experts), train a model (M0), predict on more data, have the experts correct the predictions where needed, train a new model (M1), and so on.
At the end of the active-learning procedure, all my data are labelled, so I can train my final model (e.g., model M5). To assess the improvement gained by adding data, I merge cross-validation and testing at the end. For example, if I add 10 cases at each round:
…
From a purely methodological point of view this is dirty, I agree, but I found this approach the best suited to my context. However, this only holds if the CV data leakage can be neglected; otherwise, we cannot really tell whether we are measuring the improvement from adding more data or from leaking more...
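For reference, here is a rough sketch of the kind of loop I mean, with per-round cross-validation on the cases labelled so far. Everything is schematic: the round size, the number of rounds, and the `train_model` / `evaluate` stubs are placeholders, not my actual code, and this is not exactly the merged CV/test scheme described above:

```python
from sklearn.model_selection import KFold

def train_model(cases):      # placeholder for the real training code
    return {"n_train": len(cases)}

def evaluate(model, cases):  # placeholder metric (e.g. mean Dice on `cases`)
    return 0.0

unlabeled = list(range(100))  # case identifiers, not yet annotated
labeled = []
round_size = 10
mean_scores = []

for round_id in range(5):
    # Experts annotate the next batch: labelled from scratch in round 0,
    # corrected model predictions in later rounds (models M0, M1, ...).
    batch, unlabeled = unlabeled[:round_size], unlabeled[round_size:]
    labeled += batch

    # Cross-validate on everything labelled so far to track the effect of
    # adding data. Planning/pre-processing would typically see all of
    # `labeled`, which is exactly the leakage this question is about.
    scores = []
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True,
                                    random_state=round_id).split(labeled):
        model = train_model([labeled[i] for i in train_idx])
        scores.append(evaluate(model, [labeled[i] for i in val_idx]))
    mean_scores.append(sum(scores) / len(scores))

# mean_scores[r] is the CV score after round r: does its increase reflect
# the added data, or partly the shared planning/pre-processing?
```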
Thank you very much in advance for your comments and advice!
Have a good day!