Re. the (Enterprise) DatabaseIntegration sample #579
Comments
Hi @Bartolomeus-649 - we welcome new samples from the community (see #86), and with your specific scenario in mind, it sounds like this may be a great opportunity for you to get involved! Feel free to open a PR on this sample with the changes you would like to see, or open a PR with a new sample if you would prefer.
@Bartolomeus-649 - Our database sample uses a single table in a relational database because, when training a model, it is strongly recommended to work from a "prepared dataset". In data science and ML you do need data preparation. Basically: "Data preparation is the process of gathering, combining, structuring and organizing data so it can be analyzed as part of data visualization, analytics and machine learning applications." There are several additional reasons as well.
The reasons provided above, and many more, are why in machine learning data scientists usually work with isolated and prepared datasets, commonly in the form of files (.csv, etc.). However, we want to provide a way for you to prepare data directly in a relational database and then train from it.

Trying to train a model from queries that join data from multiple business transactional tables can be very unpredictable and unreliable, so it is not a scenario we want to encourage. Instead, we recommend having "prepared data" tables in databases (you could also build multiple "prepared tables" by querying with a join, although that will perform worse).

Here are a few links showing that DATA PREPARATION usually takes most of a data scientist's working time. Data needs to be analyzed and prepared before training a model:

- https://www.dezyre.com/article/why-data-preparation-is-an-important-part-of-data-science/242
- https://www.datasciencecentral.com/profiles/blogs/in-big-data-preparing-the-data-is-most-of-the-work
- https://searchbusinessanalytics.techtarget.com/definition/data-preparation

If you have additional feedback derived from your experience training machine learning models, we'd love to hear about it. Thanks for your feedback. 👍
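To make the "prepared dataset" idea concrete, here is a minimal sketch of materializing a single flat training table from normalized transactional tables. It uses Python's `sqlite3` purely for portability; the table and column names are hypothetical, and the same SQL pattern would apply to SQL Server or any other relational database the sample targets.

```python
import sqlite3

# Illustrative sketch (hypothetical schema): flatten normalized
# transactional tables into one denormalized "prepared" table,
# which is the shape the single-table sample expects to train from.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders    (id INTEGER PRIMARY KEY, customer_id INTEGER,
                            amount REAL, churned INTEGER);
    INSERT INTO customers VALUES (1, 'EU'), (2, 'US');
    INSERT INTO orders VALUES (10, 1, 25.0, 0), (11, 1, 40.0, 0), (12, 2, 5.0, 1);

    -- Materialize one flat row per customer, ready for a data loader:
    CREATE TABLE training_data AS
    SELECT c.region,
           COUNT(o.id)    AS order_count,
           SUM(o.amount)  AS total_spent,
           MAX(o.churned) AS label
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id;
""")
rows = conn.execute("SELECT * FROM training_data ORDER BY region").fetchall()
print(rows)
```

The point is that the join, aggregation, and cleanup happen once, up front, so training reads a stable, flat table rather than a live multi-table query.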
@CESARDELATORRE so the sample has basically nothing to do with an enterprise environment then. Why is there a need to manually prepare data? Why not just point the ML library at the database, point out which field you want to predict, and have it figure out by itself which data it needs and prepare that data the way it wants?
It does, because it relates to the enterprise environment: enterprises use relational databases not just for supporting business applications (transactional data) but also for preparing and integrating data. Relational databases (such as SQL Server, Oracle, etc.) are very much part of enterprise environments, so you cannot say the sample has nothing to do with them.

Technically (using ML.NET and the libraries we're creating), you could do it the way you propose ("Just point the ML-Library to the database and then point out which field you want to predict, and then it figures out by itself which data it needs"), but chances are that many issues would surface from the data coming in 'live' with that approach. Sure, depending on the context "it could work the way you want it", but the odds of hitting problems are high when simply merging 'live data' from multiple environments/applications that is constantly evolving. It is not realistic for most cases, and that's why we don't want to encourage that approach. In short, the answer to the "why" you are asking is what I explained above.

Again, please read the links I provided above for further learning on data preparation as part of the machine learning and data science process.
I also want to highlight that for certain scenarios normalized data has a negative impact, and in many scenarios and architectures data is not normalized (meaning databases/models not in third normal form). For instance, when using CQRS (Command and Query Responsibility Segregation) you can have a non-normalized 'Read Model'; it is worth reading up on CQRS. This case of training an ML.NET model is comparable, in its needs and in its effectiveness for training performance, to the other reasons a CQRS architecture keeps a non-normalized 'Read Model'. Then, in addition to performance, in the machine learning process you almost always need to clean up and prepare data before training a model. That also fits well with this 'especially made' read model for the training process.
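A minimal sketch of the CQRS-style separation described above: the normalized write model stays untouched, while a separate denormalized read model is periodically rebuilt for a training job to query. All names here are hypothetical, and `sqlite3` stands in for a production database; in a real CQRS system the read model would typically be updated by events rather than a full rebuild.

```python
import sqlite3

# Write model: normalized transactional tables (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales    (id INTEGER PRIMARY KEY, product_id INTEGER, qty INTEGER);
    INSERT INTO products VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO sales VALUES (1, 1, 3), (2, 1, 2), (3, 2, 7);
""")

def refresh_read_model(conn):
    """Rebuild the denormalized read model from the write model.
    A periodic full rebuild is the simplest approximation of an
    event-driven CQRS projection."""
    conn.executescript("""
        DROP TABLE IF EXISTS product_sales_read;
        CREATE TABLE product_sales_read AS
        SELECT p.name, SUM(s.qty) AS total_qty
        FROM products p JOIN sales s ON s.product_id = p.id
        GROUP BY p.id;
    """)

refresh_read_model(conn)
read = conn.execute(
    "SELECT name, total_qty FROM product_sales_read ORDER BY name").fetchall()
```

The transactional tables keep serving the application while the read model gives the training process a stable, flat view, mirroring the "especially made" read model described above.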
Closing this issue since a response was provided and there has been no additional activity.
From the DatabaseIntegration sample
I have so far, during my 20+ years as a consultant, never come across a customer who has a single-table database for their enterprise data.
Either rename and describe this sample as "How to read data from a single table instead of a text file", or improve it to resemble a real-world enterprise scenario, because as it stands it is not even remotely related to what a real-world situation would look like for an enterprise and its data.
Some suggestions for what a sample aiming to show how ML.NET should/could be used in an enterprise environment should contain:
Bulk data is normalized into multiple related tables within a SQL database.
Some data needs to be looked up in external systems, for example a CRM system, which might require a REST call to get the data; sometimes it could also be in another SQL database.
The available data might not be 100% complete or consistent ...
There might not be any foreign keys defined between tables that still have a relationship the model may need to use.
Data is live: changes will occur during the development of the model, and it might be very, very difficult and/or expensive to create static copies of the data. It can even be impossible for data from external systems where licensing might prohibit caching and/or storing data locally.
Some data might be in external third-party systems that have a transaction cost associated with accessing it.
GDPR requires that the usage of personal data be documented and described. Also, all personal data of a person should be made available to that person in a standardized, cross-platform, machine-readable format.
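As an illustration of the "not 100% complete nor consistent" and "no foreign keys defined" points in the list above, here is a small sketch (Python with `sqlite3`, hypothetical table names) of how a LEFT JOIN plus COALESCE keeps rows whose reference has no match instead of silently dropping them:

```python
import sqlite3

# Hypothetical schema: tickets reference accounts, but no foreign key
# constraint is enforced, so dangling references can and do exist.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, segment TEXT);
    CREATE TABLE tickets  (id INTEGER PRIMARY KEY, account_id INTEGER);
    INSERT INTO accounts VALUES (1, 'SMB');
    -- ticket 101 references account 1; ticket 102 references a missing account
    INSERT INTO tickets VALUES (101, 1), (102, 99);
""")

# LEFT JOIN keeps the orphaned ticket; COALESCE fills in a sentinel value
# rather than a NULL, so downstream training code sees a complete column.
rows = conn.execute("""
    SELECT t.id, COALESCE(a.segment, 'UNKNOWN') AS segment
    FROM tickets t LEFT JOIN accounts a ON a.id = t.account_id
    ORDER BY t.id
""").fetchall()
```

An inner join here would have discarded ticket 102 entirely, silently biasing the training set; whether to impute, flag, or drop such rows is exactly the kind of data-preparation decision an enterprise-oriented sample would need to demonstrate.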