Glossary#

MLRun terms#

MLRun terms

Description

Feature set

A group of features that are ingested together and stored in logical group. See Feature sets.

Feature vector

A combination of multiple Features originating from different Feature sets. See Creating and using feature vectors.

HTTPRunDB

API for wrapper to the internal DB in MLRun. See mlrun.db.httpdb.HTTPRunDB.

hub

Used in code to reference the MLRun Function Hub.

MLRun function

An abstraction over the code, extra packages, runtime configuration and desired resources which allow execution in a local environment and on various serverless engines on top of K8s. See MLRun serverless functions and Creating and using functions.

MLRun Function Hub

A collection of pre-built MLRun functions avilable for usage. See MLRun Function Hub.

MLRun project

A logical container for all the work on a particular activity/application that include functions, workflow, artifacts, secrets, and more, and can be assigned to a specific group of users. See Projects.

mpijob

One of the MLRun batch runtimes that runs distributed jobs and Horovod over the MPI job operator, used mainly for deep learning jobs. See MLRun MPIJob and Horovod runtime.

Nuclio function

Subtype of MLRun function that uses the Nuclio runtime for any generic real-time function. See Nuclio real-time functions and Nuclio documentation.

Serving function

Subtype of MLRun function that uses the Nuclio runtime specifically for serving ML models or real-time pipelines. See Real-time serving pipelines (graphs) and Model serving pipelines.

storey

Asynchronous streaming library for real time event processing and feature extraction. Used in Iguazio’s feature store and real-time pipelines. See storey.transformations - Graph transformations.

Iguazio (V3IO) terms#

Name

Description

Consumer group

Set of consumers that cooperate to consume data from some topics.

Key Value (KV) store

Type of storage where data is stored by a specific key, allows for real-time lookups.

V3IO

Iguazio real-time data layer, supports several formats including KV, Block, File, Streams, and more.

V3IO shard

Uniquely identified data sets within a V3IO stream. Similar to a Kafka partition.

V3IO stream

Streaming mechanism part of Iguazio’s V3IO data layer. Similar to a Kafka stream.

Standard ML terms#

Name

Description

Artifact

A versioned output of a data processing or model training jobs, can be used as input for other jobs or pipelines in the project. There are various types of artifacts (file, model, dataset, chart, etc.) that incorporate useful metadata. See Artifacts.

DAG

Directed acyclic graph, used to describe workflows/pipelines.

Feature engineering

Apply domain knowledge and statistical techniques to raw data to extract more information out of data and improve performance of machine. learning models

EDA

Exploratory data analysis. Used by data scientists to understand dataset via cleaning, visualization, and statistical tests.

ML pipeline

Pipeline of operations for machine learning. It can include loading data, feature engineering, feature selection, model training, hyperparameter tuning, model validation, and model deployment.

Feature

Data field/vector definition and metadata (name, type, stats, etc.). A dataset is a collection of features.

MLOps

Set of practices that reliably and efficiently deploys and maintains machine learning models in production. Combination of Machine Learning and DevOps.

Dataframe

Tabular representation of data, often using tools such as Pandas, Spark, or Dask.

ML libraries / tools#

Name

Description

Dask

Flexible library for parallel computing in Python. Often used for data engineering, data science, and machine learning.

Keras

An open-source software library that provides a Python interface for artificial neural networks. Keras acts as an interface for the TensorFlow library.

KubeFlow pipeline

Platform for building and deploying portable, scalable machine learning (ML) workflows based on Docker containers.

PyTorch

An open source machine learning framework based on the Torch library, used for applications such as computer vision and natural language. processing

Sklearn

Open source machine learning Python library. Used for modelling, pipelines, data transformations, feature engineering, and more.

Spark

Open source parallel processing framework for running large-scale data analytics applications across clustered computers. Often used for data engineering, data science, and machine learning.

TensorFlow

A Google developed open-source software library for machine learning and deep learning.

TensorBoard

TensorFlow’s visualization toolkit, used for tracking metrics like loss and accuracy, visualizing the model graph, viewing histograms of weights, biases, or other tensors as they change over time, etc.

XGBoost

Optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. Implements machine learning algorithms under the Gradient Boosting framework.

Back to top