ML Model
Let’s dive deeper into the ML Model (MlModel) Entity shape and usage with the Python SDK.
Introduction
When we talk about Machine Learning tooling, some projects automatically come to mind: Mlflow, TFX, Kubeflow, SageMaker… The main goal of these solutions is to enable Data Scientists and Machine Learning practitioners to handle the models’ lifecycle. As a short summary (not taking into account all the great DevOps work and more specific features), this means:
- Prepare pipelines that extract the data from the sources & run the Feature Engineering.
- Train multiple models with different inputs and hyperparameters in external computes.
- Record all the experiments, their configurations and metrics.
- Serve the models or store them externally.
OpenMetadata & ML
What we are trying to build with OpenMetadata and ML Models is by no means a replacement for those tools or the information they provide. Data Science teams NEED structured experimentation and the metadata processing and discovery provided by the projects mentioned above. Instead, we want to look at ML Models through a different lens: where do ML Models fit within a Data Platform? While these “lifecycle tools” give the underlying view of the modelling process, we want to bring context and relationships. Some questions overlap with lifecycle tools, such as where (URL) the MlModel is hosted, but we are bringing in a product and platform management view:
- Where can we find the dashboards showing the performance evolution of our PROD algorithms? -> Trust & transparency
- What tables/sources are being used? -> Lineage and traceability
- Are we using sensitive data? -> Governance
Properties
Now that we have a clearer view of what we are trying to achieve, let’s jump into a deeper view of the `MlModel` Entity definition:
- The `name` and `algorithm` are the only required properties when creating an `MlModel`. We can just bring the top shell to the catalogue automatically and then choose which models we want to enrich.
- The `dashboard` is a reference to a Dashboard Entity present in OpenMetadata (what we call an `EntityReference`). This should be a dashboard we have built that shows the evolution of our performance metrics. Transparency on ML services is integral, which is why we have added it to the `MlModel` properties.
- A `server` with the URL where the model endpoint can be reached, to `POST` input data or retrieve a prediction.
- The `mlStore` specifies the location containing the model. Here we support passing a URI with the object (e.g., `pkl`) location and/or a Docker image in its repository.
- With `mlHyperParameters` we can add the `name`, `value` and `description` of the hyperparameters used.
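To make the shape of these properties concrete, here is a minimal sketch of the creation payload as a plain Python dict. Field names follow the entity definition above; the concrete values (model name, URLs, image repository) are illustrative assumptions, and the real SDK request classes are not shown in this excerpt.

```python
import json

# Sketch of an MlModel creation payload as a plain dict.
# Only name and algorithm are required; everything else is enrichment.
create_mlmodel = {
    "name": "forecast_sales",                    # hypothetical model name
    "algorithm": "RandomForestRegressor",
    "server": "http://models.company.com/forecast_sales",  # hypothetical endpoint
    "mlStore": {
        "storage": "s3://my-bucket/models/forecast_sales.pkl",      # hypothetical URI
        "imageRepository": "registry.company.com/forecast-sales",   # hypothetical image
    },
    "mlHyperParameters": [
        {"name": "n_estimators", "value": "200", "description": "Number of trees"},
        {"name": "max_depth", "value": "8", "description": "Tree depth limit"},
    ],
}

print(json.dumps(create_mlmodel, indent=2))
```

Only the top two keys are mandatory, so a first automated pass can register the shell and teams can enrich the rest later.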
Looking at this definition, a few questions might arise:
- What are my model features?
- How do I link these features to their origin/sources?
- What do you mean by “versioning”?
ML Features
When talking about the `MlModel` features and how to inform them, we have made an interesting distinction:
- On the one hand, there are the results of Feature Engineering as the predictors used to train the model,
- On the other, the actual sources that the predictors are based on.
The basic properties of an `MlFeature` are the `name`, the `dataType` and the `featureAlgorithm`. Now, the question is: are we being thorough in describing what is really going on? Will we be able to detect, at any point in the future, whether this model is based on biased predictors (such as gender)?
Here we want to give more depth. In order to do so, we can inform another property of the `MlFeature`: its `MlFeatureSources`. The idea is that any feature can be based on an arbitrary number of real/physical sources (e.g., the columns of a Table), and informing this field allows us to capture this relationship.
If we add this extra information, each feature will carry a list of `FeatureSource` entries, with values such as `name`, `dataType` (referring to the source type, not the categorical/numerical distinction of the `MlFeature`) and an `EntityReference` as the `dataSource`.
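As a sketch of this two-level structure, the following builds an enriched feature as plain dicts. The feature name, algorithm and table fully qualified name are illustrative assumptions, not values from the original document.

```python
# Sketch of an MlFeature enriched with its featureSources.
feature = {
    "name": "persona_creditscore",      # hypothetical feature name
    "dataType": "numerical",            # categorical/numerical distinction of the MlFeature
    "featureAlgorithm": "PCA",          # hypothetical feature-engineering step
    "featureSources": [
        {
            "name": "credit_score",
            "dataType": "integer",      # type of the physical source, not of the feature
            "dataSource": {             # EntityReference to the real source
                "type": "table",
                "fullyQualifiedName": "my_service.my_db.customers",  # hypothetical FQN
            },
        }
    ],
}

# With featureSources informed, the physical origins can be recovered:
source_tables = {
    src["dataSource"]["fullyQualifiedName"]
    for src in feature["featureSources"]
}
print(source_tables)
```

Any feature may list several sources, so a single engineered predictor can point back to every column it was derived from.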
Thanks to this extra piece of information, we can now run the following method on our MlModel instance:
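The exact SDK method is not reproduced in this excerpt. As an illustration only, a hypothetical helper that walks a model’s features (here plain dicts) and collects every referenced data source might look like this:

```python
# Hypothetical helper, not the real SDK method: gather the
# EntityReferences backing a model's features via featureSources.
def related_entities(mlmodel: dict) -> list:
    """Collect the fully qualified names of the sources behind each feature."""
    refs = []
    for feature in mlmodel.get("mlFeatures", []):
        for source in feature.get("featureSources", []):
            ref = source.get("dataSource")
            if ref is not None:
                refs.append(ref["fullyQualifiedName"])
    return refs

model = {
    "name": "forecast_sales",  # hypothetical model
    "mlFeatures": [
        {
            "name": "persona_creditscore",
            "featureSources": [
                {"dataSource": {"type": "table",
                                "fullyQualifiedName": "my_service.my_db.customers"}}
            ],
        }
    ],
}

print(related_entities(model))  # ['my_service.my_db.customers']
```

This is the kind of traversal that powers lineage and governance questions such as “are we using sensitive data?”.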
Versioning
One of the key features of OpenMetadata is the ability to version our Entities. Being able to keep track of schema changes or any community updates, with an alerting system on top, can be life-saving. In the Solution Design, we discussed the versioning example of how a Table reacts to adding or deleting columns in terms of its version. How do we manage versions for our MlModels? Given the experimental nature of the ML field, it does not make sense to receive an alert when playing around with the model features. Therefore, we trigger major version changes only when any of the following properties are updated:
- Changes in the algorithm might mean uneven results when comparing predictions (or, if handled poorly, different outcomes).
- Changes to the server can enable or break integrations with the ML prediction systems.
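The rule above can be sketched as a small version-bump function. This is a hedged illustration of the policy described in the text, assuming the two breaking properties are `algorithm` and `server` and that versions follow a simple `major.minor` scheme; it is not OpenMetadata’s actual versioning code.

```python
# Assumption drawn from the text: only these properties force a major bump.
MAJOR_FIELDS = {"algorithm", "server"}

def next_version(current: str, changed_fields: set) -> str:
    """Bump major on breaking changes, minor otherwise (major.minor scheme)."""
    major, minor = (int(part) for part in current.split("."))
    if changed_fields & MAJOR_FIELDS:
        return f"{major + 1}.0"
    return f"{major}.{minor + 1}"

print(next_version("0.3", {"mlFeatures"}))  # feature experimentation -> 0.4
print(next_version("0.3", {"algorithm"}))   # breaking change -> 1.0
```

Feature experimentation stays a quiet minor bump, while algorithm or server changes surface loudly as a new major version.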