Embeddings
Auto-Generator
- class EmbeddingGenerator(**kwargs: str)
Bases: object
Factory class for creating embedding generators based on use case.
Instantiating this class directly raises an error that directs users to the from_use_case factory method.
- static from_use_case(use_case: str | NLPUseCases | CVUseCases | TabularUseCases, **kwargs: object) → BaseEmbeddingGenerator
Create an embedding generator for the specified use case.
- Parameters:
use_case (str | NLPUseCases | CVUseCases | TabularUseCases)
kwargs (object)
- Return type:
BaseEmbeddingGenerator
- classmethod list_default_models() → DataFrame
Return a pandas.DataFrame of default models for each use case.
- Return type:
DataFrame
- classmethod list_pretrained_models() → DataFrame
Return a pandas.DataFrame of all available pretrained models.
- Return type:
DataFrame
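The factory pattern described for EmbeddingGenerator (direct construction is refused, and from_use_case picks a concrete generator) can be sketched as follows. This is a minimal illustration, not the library's implementation: GeneratorFactory, BaseGenerator, and the two concrete classes are hypothetical stand-ins, the choice of TypeError and ValueError is an assumption, and string use cases are left out because this reference does not describe their format.

```python
from enum import Enum


class NLPUseCases(Enum):
    # Mirrors the member values listed in this reference.
    SEQUENCE_CLASSIFICATION = 1
    SUMMARIZATION = 2


class BaseGenerator:
    """Hypothetical stand-in for BaseEmbeddingGenerator."""

    def __init__(self, **kwargs: object) -> None:
        # Keep the keyword arguments so the example can show they are forwarded.
        self.options = kwargs


class SequenceClassificationGenerator(BaseGenerator):
    """Hypothetical stand-in for a concrete generator."""


class SummarizationGenerator(BaseGenerator):
    """Hypothetical stand-in for a concrete generator."""


class GeneratorFactory:
    """Hypothetical factory; not the library's actual code."""

    _REGISTRY = {
        NLPUseCases.SEQUENCE_CLASSIFICATION: SequenceClassificationGenerator,
        NLPUseCases.SUMMARIZATION: SummarizationGenerator,
    }

    def __init__(self, **kwargs: object) -> None:
        # Assumed error type: the reference only says that an error is raised.
        raise TypeError("Use GeneratorFactory.from_use_case(...) instead.")

    @staticmethod
    def from_use_case(use_case: NLPUseCases, **kwargs: object) -> BaseGenerator:
        try:
            generator_cls = GeneratorFactory._REGISTRY[use_case]
        except KeyError:
            raise ValueError(f"Unsupported use case: {use_case!r}") from None
        # Remaining keyword arguments go straight to the concrete generator.
        return generator_cls(**kwargs)


generator = GeneratorFactory.from_use_case(
    NLPUseCases.SUMMARIZATION, model_name="example-model"
)
print(type(generator).__name__, generator.options)
```

A registry dictionary keeps the mapping from use case to class in one place, so supporting another use case means adding one entry rather than another branch.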
Use Cases
- class UseCases
Bases: object
Container grouping all use case enums for embedding generators.
- CV
alias of CVUseCases
- NLP
alias of NLPUseCases
- STRUCTURED
alias of TabularUseCases
- class NLPUseCases(value, names=_not_given, *values, module=None, qualname=None, type=None, start=1, boundary=None)
Bases: Enum
Enum representing supported NLP use cases for embedding generation.
- SEQUENCE_CLASSIFICATION = 1
- SUMMARIZATION = 2
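The relationship between the UseCases container and the NLPUseCases enum can be sketched with the standard library alone. The two classes below are local re-creations for illustration, built only from the member names and values listed above; they are not imported from the library.

```python
from enum import Enum


class NLPUseCases(Enum):
    # Same names and values as listed in this reference.
    SEQUENCE_CLASSIFICATION = 1
    SUMMARIZATION = 2


class UseCases:
    # A class attribute that simply points at the enum class ("alias of").
    NLP = NLPUseCases


member = UseCases.NLP.SEQUENCE_CLASSIFICATION
print(member.name, member.value)

# Lookup by value and by name both return the same singleton member.
by_value = NLPUseCases(2)
by_name = NLPUseCases["SUMMARIZATION"]
print(by_value is by_name is UseCases.NLP.SUMMARIZATION)
```

Because the container attribute is an alias, UseCases.NLP.SUMMARIZATION and NLPUseCases.SUMMARIZATION are the same object and can be compared with `is`.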
NLP Generators
- class EmbeddingGeneratorForNLPSequenceClassification(model_name: str = DEFAULT_NLP_SEQUENCE_CLASSIFICATION_MODEL, **kwargs: object)
Bases: NLPEmbeddingGenerator
Embedding generator for NLP sequence classification tasks.
Initialize the sequence classification embedding generator.
- Parameters:
model_name (str)
kwargs (object)
- generate_embeddings(text_col: Series, class_label_col: Series | None = None) → Series
Obtain embedding vectors from your text data using pre-trained large language models.
- Parameters:
text_col (Series) – A pandas Series containing the different pieces of text.
class_label_col (Series | None) – If this column is passed, the sentence “The classification label is <class_label>” will be appended to the text in the text_col.
- Returns:
A pandas Series containing the embedding vectors.
- Return type:
Series
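The effect of class_label_col in generate_embeddings for sequence classification can be illustrated with a small helper that builds the augmented text before it would be embedded. This is a sketch under assumptions: append_class_labels is a hypothetical name, and the single-space separator and trailing period are guesses, since this reference only gives the sentence “The classification label is <class_label>”. Plain lists are used in place of pandas Series to keep the example dependency-free.

```python
from typing import List, Optional, Sequence


def append_class_labels(
    texts: Sequence[str], labels: Optional[Sequence[object]] = None
) -> List[str]:
    """Return the texts, each followed by its label sentence when labels are given."""
    if labels is None:
        # No label column passed: the text is used unchanged.
        return list(texts)
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    # Assumed formatting: one space before the sentence and a closing period.
    return [
        f"{text} The classification label is {label}."
        for text, label in zip(texts, labels)
    ]


texts = ["Great battery life.", "Screen cracked in a week."]
labels = ["positive", "negative"]

augmented = append_class_labels(texts, labels)
for line in augmented:
    print(line)
```

Appending the label sentence lets the embedding reflect both the text and its class, which is why the column is optional and off by default.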
- class EmbeddingGeneratorForNLPSummarization(model_name: str = DEFAULT_NLP_SUMMARIZATION_MODEL, **kwargs: object)
Bases: NLPEmbeddingGenerator
Embedding generator for NLP text summarization tasks.
Initialize the text summarization embedding generator.
- Parameters:
model_name (str)
kwargs (object)
- generate_embeddings(text_col: Series) → Series
Obtain embedding vectors from your text data using pre-trained large language models.
- Parameters:
text_col (Series) – A pandas Series containing the different pieces of text.
- Returns:
A pandas Series containing the embedding vectors.
- Return type:
Series
Computer Vision Generators
- class EmbeddingGeneratorForCVImageClassification(model_name: str = DEFAULT_CV_IMAGE_CLASSIFICATION_MODEL, **kwargs: object)
Bases: CVEmbeddingGenerator
Embedding generator for computer vision image classification tasks.
Initialize the image classification embedding generator.
Tabular Generators
- class EmbeddingGeneratorForTabularFeatures(model_name: str = DEFAULT_TABULAR_MODEL, **kwargs: object)
Bases: NLPEmbeddingGenerator
Embedding generator for tabular feature data using prompt-based LLM encoding.
Initialize the tabular features embedding generator.
- Parameters:
model_name (str)
kwargs (object)
- Raises:
ValueError – If model_name is not in supported models list.
- generate_embeddings(df: DataFrame, selected_columns: list[str], col_name_map: dict[str, str] | None = None, return_prompt_col: bool = False) → Series | tuple[Series, Series]
Obtain embedding vectors from your tabular data.
Prompts are generated from your selected_columns and passed to a pre-trained large language model for embedding vector computation.
- Parameters:
df (DataFrame) – Pandas DataFrame containing the tabular data. Only the columns listed in selected_columns are used.
selected_columns (list[str]) – Columns to be considered to construct the prompt to be passed to the LLM.
col_name_map (dict[str, str] | None) – Mapping between selected column names and a more verbose description of the name. This helps the LLM understand the features better.
return_prompt_col (bool) – If set to True, an extra pandas Series will be returned containing the constructed prompts. Defaults to False.
- Returns:
A pandas Series containing the embedding vectors and, if return_prompt_col is set to True, a pandas Series containing the prompts created from tabular features.
- Return type:
Series | tuple[Series, Series]
- static list_pretrained_models() → DataFrame
Return a pandas.DataFrame of available pretrained tabular models.
- Return type:
DataFrame
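The prompt construction that the tabular generate_embeddings describes (selected columns, optionally renamed through col_name_map, turned into text for the language model) can be sketched as below. This is an illustration under assumptions: build_prompts is a hypothetical helper, the "The <name> is <value>." template is a guess rather than the library's documented format, and rows are plain dictionaries standing in for DataFrame records.

```python
from typing import Dict, List, Mapping, Optional, Sequence


def build_prompts(
    rows: Sequence[Mapping[str, object]],
    selected_columns: Sequence[str],
    col_name_map: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Build one text prompt per row from the selected columns only."""
    names = col_name_map or {}
    prompts = []
    for row in rows:
        # Use the verbose description when one is mapped, else the raw column name.
        parts = [
            f"The {names.get(col, col)} is {row[col]}."  # assumed template
            for col in selected_columns
        ]
        prompts.append(" ".join(parts))
    return prompts


rows = [
    {"customer_id": 7, "age": 42, "plan": "premium"},
    {"customer_id": 8, "age": 29, "plan": "basic"},
]

prompts = build_prompts(
    rows,
    selected_columns=["age", "plan"],
    col_name_map={"plan": "subscription plan"},
)
for prompt in prompts:
    print(prompt)
```

Columns outside selected_columns (customer_id here) never reach the prompt. Inspecting such prompts is the purpose of return_prompt_col, which returns them alongside the embedding vectors.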