
Re. the (Enterprise) DatabaseIntegration sample #579

Closed

Bartolomeus-649 opened this issue Jul 20, 2019 · 6 comments

Comments

@Bartolomeus-649

From the DatabaseIntegration sample

Problem

Enterprise users have a need to use their existing data set that is in their company's database to train and predict with ML.Net. They need support to leverage their existing relational table schema, ability to read from the database directly, and to be aware of memory limitations as the data is being consumed.

I have so far, during my 20+ years as a consultant, never come across a customer who has a single table database for their Enterprise data.

Either rename and describe this sample as "How to read data from a single table instead of a text file", or improve it to resemble a real-world enterprise scenario, because as it is now, it is not even remotely related to what a real-world situation would look like for an enterprise and its data.

Some suggestions for what a sample that aims to show how ML.NET should/could be used in an enterprise environment should contain:

  1. Bulk data is normalized into multiple related tables within a SQL database.

  2. Some data needs to be looked up in external systems, for example a CRM system, which might require a REST call to fetch the data; sometimes it could also live in another SQL database.

  3. The available data might not be 100% complete or consistent (see the clean-up sketch after this list):

    1. Fields might be missing, contain no data, and/or contain only whitespace.
    2. Date formats might differ between systems (YYYY-MM-DD, MM/DD/YYYY, and so on).
    3. Numeric values, dates, and other data types might be stored as strings.
    4. Time zones for stored data might differ between systems; some could be UTC, others local time.
    5. Money might be stored in different currencies.
  4. There might not be any foreign keys defined between tables that are nevertheless related in a way the model needs to use.

  5. Data is live: changes will occur during development of the model, and it might be very difficult and/or expensive to create static copies of the data, or even impossible for data from external systems where licensing might prohibit caching and/or storing data locally.

  6. Some data might be in external third-party systems that have a transaction cost associated with accessing them.

  7. GDPR requires that the usage of personal data be documented and described. Also, all personal data about a person must be made available to that person in a standardized, cross-platform, machine-readable format.
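
To make point 3 concrete, a clean-up step for just its first three sub-points could look roughly like the sketch below (plain C#; every type, format, and method name here is made up for illustration and none of it comes from the sample):

```csharp
using System;
using System.Globalization;

// Hypothetical clean-up helpers for sub-points 3.1-3.3: empty/whitespace fields,
// mixed date formats, and numbers stored as strings.
public static class RowCleanup
{
    static readonly string[] KnownDateFormats = { "yyyy-MM-dd", "MM/dd/yyyy", "dd.MM.yyyy" };

    // Returns null when the field is missing, whitespace-only, or unparseable.
    public static DateTime? ParseDate(string raw)
    {
        if (string.IsNullOrWhiteSpace(raw))
            return null;

        // Treat values without an explicit offset as UTC and keep the result in UTC.
        return DateTime.TryParseExact(raw.Trim(), KnownDateFormats, CultureInfo.InvariantCulture,
            DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal, out var value)
            ? value
            : (DateTime?)null;
    }

    // Numeric values stored as strings (e.g. "1234.50").
    public static float? ParseAmount(string raw)
    {
        if (string.IsNullOrWhiteSpace(raw))
            return null;

        return float.TryParse(raw.Trim(), NumberStyles.Number, CultureInfo.InvariantCulture, out var value)
            ? value
            : (float?)null;
    }
}
```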

@bamurtaugh
Member

Hi @Bartolomeus-649 - we welcome new samples from the community (see #86), and with your specific scenario in mind, it sounds like this may be a great opportunity for you to get involved! Feel free to open a PR on this sample with the changes you would like to see, or open a PR with a new sample if you would prefer.

@CESARDELATORRE
Contributor

CESARDELATORRE commented Jul 22, 2019

@Bartolomeus-649 - Our database sample uses a single table in a relational database because, when training a model, it is strongly recommended to have a "prepared dataset".

In data science and ML you do need data preparation. Basically: "Data preparation is the process of gathering, combining, structuring and organizing data so it can be analyzed as part of data visualization, analytics and machine learning applications."

This is also because of the following reasons:

  1. Performance: if you train against a join over multiple tables, even though it can work, training performance/time can be much worse than when working against a single table.

  2. Consistency and predictability: a "prepared dataset" (potentially one table, although multiple tables are also possible) is recommended so there are no additional updates to the data while training. New updates could introduce errors or impact the model's accuracy. If you don't have "static data" for training, you won't be able to isolate errors or impacts coming from the data if it is constantly changing, as happens with a business transactional database.
    Also, without static prepared data you cannot properly analyze the data's distribution, determine whether you are dealing with a balanced or imbalanced dataset, and so on.

The reasons above, and many more, are why data scientists usually work with isolated, prepared datasets, commonly in the form of files (.csv, etc.). However, we want to provide a way for you to prepare data directly in a relational database and then train from it.

But trying to train a model from queries that join data from multiple business transactional tables can be very unpredictable and unreliable, so it is not a scenario we want to encourage. Instead we recommend having "prepared data" tables in databases (you could also build multiple "prepared tables" and query them with a join, although that will perform worse).
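
As a rough illustration of that recommendation (a minimal sketch, not the sample's actual code: the table name, column names, and connection string are assumptions, and the DatabaseLoader API shown was still in preview at the time), the idea is to materialize the joined and cleaned data into one prepared table up front and then train against that single table:

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;
using System.Data.SqlClient;

// Schema of a hypothetical, already-prepared table (joined and cleaned beforehand).
public class PreparedRow
{
    public float Label { get; set; }
    public float Feature1 { get; set; }
    public float Feature2 { get; set; }
}

public static class TrainFromPreparedTable
{
    public static ITransformer Run(string connectionString)
    {
        var mlContext = new MLContext();

        // Read from the single prepared table, not from a live transactional join.
        var loader = mlContext.Data.CreateDatabaseLoader<PreparedRow>();
        var source = new DatabaseSource(SqlClientFactory.Instance, connectionString,
            "SELECT Label, Feature1, Feature2 FROM dbo.TrainingPrepared");
        IDataView trainingData = loader.Load(source);

        // A simple regression pipeline over the prepared columns.
        var pipeline = mlContext.Transforms.Concatenate("Features", "Feature1", "Feature2")
            .Append(mlContext.Regression.Trainers.Sdca());

        return pipeline.Fit(trainingData);
    }
}
```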

Here are a few links showing that data preparation usually takes most of a data scientist's working time. Data needs to be analyzed and prepared before training a model:

https://www.dezyre.com/article/why-data-preparation-is-an-important-part-of-data-science/242

https://www.datasciencecentral.com/profiles/blogs/in-big-data-preparing-the-data-is-most-of-the-work

https://searchbusinessanalytics.techtarget.com/definition/data-preparation

If you have additional feedback from your experience training machine learning models, we'd love to hear it.

Thanks for your feedback. 👍

@Bartolomeus-649
Author

@CESARDELATORRE so the sample has basically nothing to do with an enterprise environment then.

Why is there a need to manually prepare data? Why not just point the ML library at the database, indicate which field you want to predict, and have it figure out by itself which data it needs and prepare that data however it needs to be prepared?

@CESARDELATORRE
Contributor

CESARDELATORRE commented Jul 23, 2019

It does, because enterprises use relational databases not just for supporting business applications (transactional data) but also for preparing and integrating data.

Relational databases (SQL Server, Oracle, etc.) are very much part of enterprise environments, so you cannot say the sample has nothing to do with an enterprise environment.

Technically (using ML.NET and the libraries we're creating), you could do it the way you propose ("just point the ML library at the database, indicate which field you want to predict, and have it figure out by itself which data it needs"), but chances are that many issues will arise from the data coming in 'live' with that approach.

Sure, depending on the context "it could work the way you want", but the chances of running into issues are high when simply merging constantly evolving 'live data' from multiple environments/applications. It is not realistic for most cases, and that's why we don't want to encourage that approach.
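
For reference, the closest ML.NET gets to "point it at the data and tell it which field to predict" is the AutoML API; a rough sketch could look like the following (the label column name and the time budget are placeholder assumptions, and, again, running this directly against constantly changing live data is exactly the situation we don't recommend):

```csharp
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

public static class PointAndTrain
{
    // liveData would come from whatever query you point at the database.
    public static ITransformer Run(MLContext mlContext, IDataView liveData)
    {
        // Let AutoML search featurization + trainer combinations for 60 seconds.
        var experiment = mlContext.Auto().CreateRegressionExperiment(maxExperimentTimeInSeconds: 60);

        // "AmountToPredict" is a placeholder label column; AutoML still inherits
        // every inconsistency present in the incoming data.
        ExperimentResult<RegressionMetrics> result =
            experiment.Execute(liveData, labelColumnName: "AmountToPredict");

        return result.BestRun.Model;
    }
}
```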

Basically, the answer to the WHY you are asking is what I explained above. In short:

  • Performance risks when using query joins to train a model
  • Unpredictable training when non-validated data comes from multiple sources (null/NaN values, imbalanced data, unclean data, etc.). In most situations data needs to be cleaned up before training a model (a minimal sketch follows below).
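
As a minimal sketch of that clean-up inside an ML.NET pipeline (the column names are placeholders I've chosen; the transforms themselves are standard ML.NET ones):

```csharp
using Microsoft.ML;
using Microsoft.ML.Transforms;

public static class CleanupPipeline
{
    // Replaces missing/NaN values, rescales a column, and assembles a feature vector.
    public static IEstimator<ITransformer> Build(MLContext mlContext)
    {
        return mlContext.Transforms.ReplaceMissingValues("Amount",
                    replacementMode: MissingValueReplacingEstimator.ReplacementMode.Mean)
            .Append(mlContext.Transforms.NormalizeMinMax("Amount"))
            .Append(mlContext.Transforms.Concatenate("Features", "Amount", "Quantity"));
    }
}
```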

Again, please read the links I provided above to learn more about data preparation as part of the machine learning and data science process.

@CESARDELATORRE
Contributor

CESARDELATORRE commented Jul 23, 2019

I also want to highlight that in certain scenarios normalized data has a negative impact, and in many scenarios and architectures data is not normalized (i.e. the databases/models are not in third normal form). For instance, when using CQRS (Command Query Responsibility Segregation) you can have a denormalized 'Read Model'.

Read about CQRS:
https://martinfowler.com/bliki/CQRS.html
https://docs.microsoft.com/en-us/azure/architecture/patterns/cqrs

Training an ML.NET model is comparable, in both its needs and the training performance it gains, to the reasons a CQRS architecture also keeps a denormalized 'Read Model'.

Then, in addition to performance, in the machine learning process you almost always need to clean up and prepare data before training a model. That also fits well with a read model made especially for the training process.

@luisquintanilla
Contributor

Closing this issue since a response was provided and there has been no additional activity.
