language_processor¶
- class virtualitics_sdk.assets.language_processor.LanguageProcessor(model_name, asset_label, asset_name, document_identifier, narrative_feature, feature_names=None, pipeline_task=None, description=None, version=0, metadata=None, remove_model_after=True, seed=None, **kwargs)¶
Bases: Asset
NLP pipeline responsible for extracting features such as entities and events from raw text. It extends Asset so that it can be easily stored and retrieved, and it provides a simple interface to ingest documents and store their metadata.
- Parameters:
  - model_name (str) – The spaCy model used under the hood
  - asset_label (str) – Asset parameter
  - asset_name (str) – Asset parameter
  - document_identifier (str) – The dataset feature to use as the document ID
  - narrative_feature (str) – The dataset feature that contains the text to process
  - feature_names (Optional[List[str]]) – The list of dataset features to store together with the features extracted from each doc
  - pipeline_task (List[str]) – List of task names to use inside the pipeline
  - description (Optional[str]) – Asset parameter
  - version (int) – Asset parameter
  - metadata – Asset parameter
  - remove_model_after (bool) – Remove the model after the ingestion process
  - seed (Optional[int]) – numpy seed value
EXAMPLE:
    # Imports
    from virtualitics_sdk.assets.language_processor import LanguageProcessor
    ...

    # Example usage
    selected_model = "en_core_web_lg"
    pipeline_config = ["Event Extraction", "Entity Extraction", "Corpus Statistics"]
    feature_names = [
        "Unnamed: 0",
        "DATE RAISE ANNOUNCED",
        "COMPANY",
        "AMOUNT",
        "HQ Location",
        "TOP INVESTORS (in this round)",
        "LINK",
        "Website",
        "Round ",
        "Category",
        "NOTES",
        "Expansion Plans",
        "Founder First Name",
        "Founder Last Name",
        "Founder LinkedIn",
        "Founder Twitter",
        "Founder AngelList",
        "Unnamed: 16",
    ]
    asset_label = '17181175-32-1f4a-442c-8f91-4d32e2b905fd_lp'
    asset_name = '_lp'
    id_col = 'COMPANY'
    narr_col = 'COMPANY'
    nlp_module = LanguageProcessor(
        model_name=selected_model,
        pipeline_task=pipeline_config,
        feature_names=feature_names,
        asset_label=asset_label,
        asset_name=asset_name,
        document_identifier=id_col,
        narrative_feature=narr_col,
    )
- DATASET_UUID = 10¶
- available_models = {'en_core_web_lg'}¶
- static available_tasks(model)¶
Returns the tasks (registered components) available with the specified model.
- Parameters:
  model (str) – The model name
- Return type:
  List
- Returns:
  List of available tasks
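For example, a minimal sketch of checking the registered tasks before building a pipeline config; the model name comes from available_models above, and the tasks in the comment are the ones used in the class example, not necessarily the full set:

    from virtualitics_sdk.assets.language_processor import LanguageProcessor

    # Query the tasks (registered components) available for a supported model.
    tasks = LanguageProcessor.available_tasks("en_core_web_lg")
    print(tasks)  # e.g. includes "Event Extraction", "Entity Extraction", "Corpus Statistics"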
- static entities2features(entities)¶
Transforms the entities df into features that can be passed to the tf-idf vectorizer.
- Parameters:
  entities (DataFrame) – LanguageProcessor entities table
- Return type:
  Optional[DataFrame]
- Returns:
  The same df with the output features
- static events2features(events)¶
Transforms the events df into features that can be passed to the tf-idf vectorizer.
- Parameters:
  events (DataFrame) – LanguageProcessor events table
- Return type:
  Optional[DataFrame]
- Returns:
  The same df with the output features
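A minimal sketch of feeding both tables through these helpers, assuming an already-ingested LanguageProcessor instance named nlp_module (as in the class example) and using get_table (documented below) to fetch the internal tables:

    # Fetch the internal tables produced during ingestion.
    entities = nlp_module.get_table("entities")
    events = nlp_module.get_table("events")

    # Turn each table into features ready for a tf-idf vectorizer.
    # Either call may return None (the return type is Optional[DataFrame]).
    entity_features = LanguageProcessor.entities2features(entities)
    event_features = LanguageProcessor.events2features(events)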
- get_doc_ids_and_dates()¶
- Returns:
List of tuples of format (doc_number, date)
- get_doc_nlp_features()¶
Extracts computed features from the docs. It also returns the list of doc names.
- Return type:
  Tuple[List[str], List[str]]
- Returns:
  A list of the doc names and a list of the extracted features
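A minimal usage sketch, again assuming an ingested nlp_module:

    # Doc names and extracted features are parallel lists.
    doc_names, doc_features = nlp_module.get_doc_nlp_features()
    for name, features in zip(doc_names, doc_features):
        print(name, features)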
- get_doc_node_attributes(doc_number)¶
Returns the base information of the doc identified by doc_number.
- Parameters:
doc_number – The doc name
- Return type:
Dict
- Returns:
a dict with the format {'feature1': 'val1', 'feature2': 'val2'}
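A minimal sketch, assuming an ingested nlp_module; with the class example configuration, doc names are values of the COMPANY column:

    # Look up the base attributes of the first ingested doc.
    first_doc = nlp_module.get_doc_numbers()[0]
    attrs = nlp_module.get_doc_node_attributes(first_doc)
    # attrs looks like {'feature1': 'val1', 'feature2': 'val2'}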
- get_doc_numbers()¶
- Returns:
A list with doc names
- get_single_doc_nlp_features(doc_number)¶
Produces a Counter of string -> count for entities and events.
- get_table(table)¶
Useful if you want to access a specific LanguageProcessor internal table. Available tables:
- doc_data, which contains the original dataset features
- entities
- events
- Parameters:
  table (str) – The table name
- Return type:
  DataFrame
- Returns:
  The requested table as a pd.DataFrame
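For example, assuming an ingested nlp_module:

    # Access each of the internal tables by name.
    doc_data = nlp_module.get_table("doc_data")  # original dataset features
    entities = nlp_module.get_table("entities")
    events = nlp_module.get_table("events")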
- ingest(data, flow_metadata=None, extract_doc_bin=False, starting_progress=None, target_progress=None)¶
Runs the whole pipeline on the provided pandas DataFrame. For validation purposes, it checks that the init params provided are present inside the data df. For documentation of the store, starting_progress, and target_progress params, see the StepProgressTqdm docs.
- Parameters:
  - data (DataFrame) – Mandatory. The data used to feed the pipeline.
  - flow_metadata (Optional[FlowMetadata]) – The flow metadata necessary to create a store interface.
  - extract_doc_bin (bool) – If true, the method returns a DocBin.
  - starting_progress (Union[int, float, None]) – Used to init a StepProgressTqdm. Mandatory if a store is provided.
  - target_progress (Union[int, float, None]) – Used to init a StepProgressTqdm.
- Return type:
  Optional[DocBin]
- Returns:
  A DocBin object, if requested.
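A minimal sketch, assuming the nlp_module from the class example and a pandas DataFrame that contains the document_identifier, narrative_feature, and feature_names columns (the CSV path is hypothetical):

    import pandas as pd

    data = pd.read_csv("companies.csv")  # hypothetical file with the columns above

    # Run the whole pipeline; request the spaCy DocBin back.
    doc_bin = nlp_module.ingest(data, extract_doc_bin=True)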
- initialize_model(model_name)¶
Initializes the internal spaCy model with the provided model name and saves it as an instance attribute.
- Parameters:
  model_name (str)
- virtualitics_sdk.assets.language_processor.counter_to_str_repr(c)¶
Constructs a list of tokens from a counter, with each token appearing as many times as its count in the counter.
- Parameters:
  c (Counter)
- Return type:
  List[str]
- Returns:
  List of tokens
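A minimal sketch of the expansion behavior (the token order within the output list is not specified here):

    from collections import Counter
    from virtualitics_sdk.assets.language_processor import counter_to_str_repr

    c = Counter({"acquisition": 2, "funding": 1})
    tokens = counter_to_str_repr(c)
    # tokens contains "acquisition" twice and "funding" once,
    # e.g. ["acquisition", "acquisition", "funding"]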