Embeddings#

Auto-Generator#

class EmbeddingGenerator(**kwargs: str)[source]#

Bases: object

Factory class for creating embedding generators based on use case.

Direct instantiation is not supported: the constructor always raises an error directing users to the from_use_case factory method.

Raises:

OSError – Always raised to prevent direct instantiation.

Parameters:

kwargs (str)

static from_use_case(use_case: str | NLPUseCases | CVUseCases | TabularUseCases, **kwargs: object) BaseEmbeddingGenerator[source]#

Create an embedding generator for the specified use case.

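Parameters:
  • use_case (str | NLPUseCases | CVUseCases | TabularUseCases) – Use case that selects which generator to create.

  • **kwargs (object) – Additional keyword arguments.

Return type: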

BaseEmbeddingGenerator
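The factory-only pattern described above can be sketched as follows. This is a minimal, self-contained illustration and not the library's implementation: FactoryOnlyGenerator and StubGenerator are hypothetical names, the enum is redefined locally, and resolving a string use case by enum member name is an assumption (the source does not say which strings from_use_case accepts).

```python
from __future__ import annotations

from enum import Enum


class NLPUseCases(Enum):
    # Redefined locally so the sketch runs on its own.
    SEQUENCE_CLASSIFICATION = 1
    SUMMARIZATION = 2


class StubGenerator:
    """Hypothetical stand-in for a BaseEmbeddingGenerator subclass."""

    def __init__(self, use_case: NLPUseCases, **kwargs: object) -> None:
        self.use_case = use_case
        self.kwargs = kwargs


class FactoryOnlyGenerator:
    """Hypothetical class that is only usable through its factory method."""

    def __init__(self, **kwargs: str) -> None:
        # Always raise, so callers are steered toward from_use_case.
        raise OSError("Use FactoryOnlyGenerator.from_use_case(...) instead.")

    @staticmethod
    def from_use_case(use_case: str | NLPUseCases, **kwargs: object) -> StubGenerator:
        # Assumption: a string is looked up by enum member name.
        if isinstance(use_case, str):
            use_case = NLPUseCases[use_case]
        # Extra keyword arguments are forwarded to the concrete generator.
        return StubGenerator(use_case, **kwargs)


generator = FactoryOnlyGenerator.from_use_case(
    NLPUseCases.SUMMARIZATION, model_name="some-model"
)
```

Because from_use_case is static, it never calls the raising constructor; it builds and returns the concrete generator directly.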

classmethod list_default_models() DataFrame[source]#

Return a pandas.DataFrame of default models for each use case.

Return type:

DataFrame

classmethod list_pretrained_models() DataFrame[source]#

Return a pandas.DataFrame of all available pretrained models.

Return type:

DataFrame

Use Cases#

class UseCases[source]#

Bases: object

Container grouping all use case enums for embedding generators.

CV#

alias of CVUseCases

NLP#

alias of NLPUseCases

STRUCTURED#

alias of TabularUseCases

class NLPUseCases(value, names=_not_given, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: Enum

Enum representing supported NLP use cases for embedding generation.

SEQUENCE_CLASSIFICATION = 1#

SUMMARIZATION = 2#

class CVUseCases(value, names=_not_given, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: Enum

Enum representing supported computer vision use cases for embedding generation.

IMAGE_CLASSIFICATION = 1#

OBJECT_DETECTION = 2#

class TabularUseCases(value, names=_not_given, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: Enum

Enum representing supported tabular/structured data use cases for embedding generation.

TABULAR_EMBEDDINGS = 1#
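To show how the UseCases container and its aliases fit together, here is a self-contained sketch that mirrors the documented members. The classes are redefined locally for illustration only; in practice they would be imported from the library.

```python
from enum import Enum


class NLPUseCases(Enum):
    SEQUENCE_CLASSIFICATION = 1
    SUMMARIZATION = 2


class CVUseCases(Enum):
    IMAGE_CLASSIFICATION = 1
    OBJECT_DETECTION = 2


class TabularUseCases(Enum):
    TABULAR_EMBEDDINGS = 1


class UseCases:
    # Each attribute is an alias of the corresponding enum class.
    NLP = NLPUseCases
    CV = CVUseCases
    STRUCTURED = TabularUseCases


selected = UseCases.NLP.SUMMARIZATION
print(selected.name, selected.value)  # prints "SUMMARIZATION 2"

# Members of different enums never compare equal, even when their values match.
print(UseCases.NLP.SEQUENCE_CLASSIFICATION == UseCases.CV.IMAGE_CLASSIFICATION)  # prints "False"
```

Since the numeric values repeat across the three enums, comparisons should use the enum members themselves rather than their integer values.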

NLP Generators#

class EmbeddingGeneratorForNLPSequenceClassification(model_name: str = DEFAULT_NLP_SEQUENCE_CLASSIFICATION_MODEL, **kwargs: object)[source]#

Bases: NLPEmbeddingGenerator

Embedding generator for NLP sequence classification tasks.

Initialize the sequence classification embedding generator.

Parameters:
  • model_name (str) – Name of the pre-trained NLP model.

  • **kwargs (object) – Additional arguments for model initialization.

generate_embeddings(text_col: Series, class_label_col: Series | None = None) Series[source]#

Obtain embedding vectors from your text data using pre-trained large language models.

Parameters:
  • text_col (Series) – A pandas Series containing the texts to embed, one per row.

  • class_label_col (Series | None) – If provided, the sentence “The classification label is <class_label>” is appended to each text in text_col.

Returns:

A pandas Series containing the embedding vectors.

Return type:

Series
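The effect of class_label_col on the input text can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: append_class_labels is a hypothetical helper, plain lists stand in for pandas Series so the example needs no third-party packages, and the single space used as a separator is assumed.

```python
from __future__ import annotations


def append_class_labels(texts: list[str], labels: list[str] | None = None) -> list[str]:
    """Hypothetical helper: append the label sentence to each text."""
    if labels is None:
        # No label column passed: texts are used unchanged.
        return list(texts)
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    # Assumption: the sentence is joined to the text with a single space.
    return [
        f"{text} The classification label is {label}"
        for text, label in zip(texts, labels)
    ]


augmented = append_class_labels(
    ["The battery lasts all day.", "The screen cracked in a week."],
    ["positive", "negative"],
)
```

The augmented strings, rather than the raw texts, would then be what the model embeds.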

class EmbeddingGeneratorForNLPSummarization(model_name: str = DEFAULT_NLP_SUMMARIZATION_MODEL, **kwargs: object)[source]#

Bases: NLPEmbeddingGenerator

Embedding generator for NLP text summarization tasks.

Initialize the text summarization embedding generator.

Parameters:
  • model_name (str) – Name of the pre-trained NLP model.

  • **kwargs (object) – Additional arguments for model initialization.

generate_embeddings(text_col: Series) Series[source]#

Obtain embedding vectors from your text data using pre-trained large language models.

Parameters:

text_col (Series) – A pandas Series containing the texts to embed, one per row.

Returns:

A pandas Series containing the embedding vectors.

Return type:

Series

Computer Vision Generators#

class EmbeddingGeneratorForCVImageClassification(model_name: str = DEFAULT_CV_IMAGE_CLASSIFICATION_MODEL, **kwargs: object)[source]#

Bases: CVEmbeddingGenerator

Embedding generator for computer vision image classification tasks.

Initialize the image classification embedding generator.

Parameters:
  • model_name (str) – Name of the pre-trained vision model.

  • **kwargs (object) – Additional arguments for model initialization.

class EmbeddingGeneratorForCVObjectDetection(model_name: str = DEFAULT_CV_OBJECT_DETECTION_MODEL, **kwargs: object)[source]#

Bases: CVEmbeddingGenerator

Embedding generator for computer vision object detection tasks.

Initialize the object detection embedding generator.

Parameters:
  • model_name (str) – Name of the pre-trained vision model.

  • **kwargs (object) – Additional arguments for model initialization.

Tabular Generators#

class EmbeddingGeneratorForTabularFeatures(model_name: str = DEFAULT_TABULAR_MODEL, **kwargs: object)[source]#

Bases: NLPEmbeddingGenerator

Embedding generator for tabular feature data using prompt-based LLM encoding.

Initialize the tabular features embedding generator.

Parameters:
  • model_name (str) – Name of the pre-trained NLP model for tabular data.

  • **kwargs (object) – Additional arguments for model initialization.

Raises:

ValueError – If model_name is not in the list of supported models.

generate_embeddings(df: DataFrame, selected_columns: list[str], col_name_map: dict[str, str] | None = None, return_prompt_col: bool = False) Series | tuple[Series, Series][source]#

Obtain embedding vectors from your tabular data.

Prompts are generated from your selected_columns and passed to a pre-trained large language model for embedding vector computation.

Parameters:
  • df (DataFrame) – Pandas DataFrame containing the tabular data. Only the columns listed in selected_columns are used.

  • selected_columns (list[str]) – Columns used to construct the prompt that is passed to the LLM.

  • col_name_map (dict[str, str] | None) – Mapping from selected column names to more descriptive names. This helps the LLM understand the features better.

  • return_prompt_col (bool) – If set to True, an extra pandas Series will be returned containing the constructed prompts. Defaults to False.

Returns:

A pandas Series containing the embedding vectors and, if return_prompt_col is set to True, a pandas Series containing the prompts created from tabular features.

Return type:

Series | tuple[Series, Series]
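The prompt construction step can be sketched for a single row as follows. The sketch is illustrative only: build_prompt is a hypothetical helper, a dict stands in for one DataFrame row, and the "The <name> is <value>." template is an assumption, since the source does not document the exact prompt wording.

```python
from __future__ import annotations


def build_prompt(
    row: dict[str, object],
    selected_columns: list[str],
    col_name_map: dict[str, str] | None = None,
) -> str:
    """Hypothetical helper: turn the selected columns of one row into a prompt."""
    col_name_map = col_name_map or {}
    sentences = []
    for col in selected_columns:
        # Use the more descriptive name when one is provided.
        name = col_name_map.get(col, col)
        # Assumed template; the library's actual wording may differ.
        sentences.append(f"The {name} is {row[col]}.")
    return " ".join(sentences)


row = {"age": 42, "st": "CA", "customer_id": 7}
prompt = build_prompt(row, ["age", "st"], {"st": "state of residence"})
print(prompt)  # prints "The age is 42. The state of residence is CA."
```

Columns left out of selected_columns (customer_id here) never reach the prompt, which is why identifiers and other non-informative fields are usually excluded.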

static list_pretrained_models() DataFrame[source]#

Return a pandas.DataFrame of available pretrained tabular models.

Return type:

DataFrame