entity_extractor
V3
- class predict_backend.ml.nlp.components.entity_extractor.NerSummarization(nlp)
Bases: SqlCompliant, PandasCompliant
spaCy custom component responsible for extracting, cleaning, and counting NER entities from spaCy docs. It sets ents_summary as a custom extension on the doc object. It also implements PandasCompliant so that it is compatible with the persistence handler.
- assigns = ['doc._.ents_summary']
- beautiful_name = 'Entity Extraction'
- static create_component(nlp, name, depends_on)
Factory function. Create a new instance of this component.
- Parameters:
nlp (Any) – The spaCy nlp object
name (str) – Unused; present only as an example of a registered parameter
depends_on (List[str]) – Unused; present only as an example of a registered parameter
- Returns:
A new instance of this component
- depends_on = ['ner', 'tagger', 'attribute_ruler']
- feature_columns = ['entity']
- classmethod get_default()
- Returns:
The default configuration for the component
- classmethod get_df_table_name()
- Return type:
str
- Returns:
The table name of the component
- classmethod get_feature_columns()
- Return type:
List
- Returns:
The pandas column names of the features it extracts
- classmethod get_sql_table_name()
- Return type:
str
- Returns:
The name of the table that should contain the information extracted by this component.
- classmethod init_dataframe()
- Return type:
DataFrame
- Returns:
An empty dataframe with the necessary columns to store the information extracted by this component.
- classmethod init_extension()
Register the doc extension used by this component.
- classmethod init_sql_db()
- Return type:
List[str]
- Returns:
A list with the necessary columns to store the information extracted by this component.
- name = 'virtualitics_ner_summary'
- pandas_cols = ['doc_number', 'entity', 'count']
- requires = ['doc.ents']
- sql_cols = ['rowid SERIAL', 'doc_number text', 'entity text', 'count integer']
- table_name = 'entities'
- classmethod to_df(doc, idx_col)
- Parameters:
doc – spaCy doc object.
idx_col (str) – The column to use as idx.
- Return type:
DataFrame
- Returns:
A DataFrame with the ents_summary information attached to the document.
- classmethod to_dict(doc, idx_col)
- Parameters:
doc – spaCy doc object.
idx_col (str) – The column to use as idx.
- Return type:
List[Dict]
- Returns:
A list of dicts with the ents_summary information attached to the document.
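As an illustration of the record shape implied by pandas_cols, the following minimal sketch counts entity strings and emits one dict per distinct entity. It is not the component's implementation: summary_to_records is a hypothetical helper, and it assumes that ents_summary boils down to per-entity counts and that the document identifier is passed in directly.

```python
from collections import Counter
from typing import Dict, List

# Column names taken from the documented pandas_cols attribute.
PANDAS_COLS = ["doc_number", "entity", "count"]


def summary_to_records(doc_number: str, entities: List[str]) -> List[Dict]:
    """Hypothetical helper: count entity strings, one record per distinct entity."""
    counts = Counter(entities)  # keeps first-seen order of the entities
    return [
        dict(zip(PANDAS_COLS, (doc_number, entity, count)))
        for entity, count in counts.items()
    ]


records = summary_to_records("doc-1", ["Acme", "Paris", "Acme"])
```

A list of records in this shape can be handed to a DataFrame constructor or iterated to build database rows.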
- classmethod transform_sql(doc, idx_col, value_placeholder='%s')
- Parameters:
doc – The spacy doc object
idx_col (str) – The column to use as id
value_placeholder – The type of placeholder to use (it can vary based on the client you’re using)
- Return type:
Tuple[str, List]
- Returns:
A string paired with a list of elements ready to be inserted into a database
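To show why value_placeholder matters, here is a minimal sketch of building a parameterized INSERT together with its flat list of values. The statement shape is an assumption, not the documented output of transform_sql; build_insert is a hypothetical helper, and the table and column names are taken from the table_name and sql_cols attributes above.

```python
from typing import Dict, List, Tuple

TABLE_NAME = "entities"
# rowid SERIAL is omitted on purpose: the database is assumed to fill it in.
INSERT_COLS = ["doc_number", "entity", "count"]


def build_insert(records: List[Dict], value_placeholder: str = "%s") -> Tuple[str, List]:
    """Hypothetical helper: build a parameterized multi-row INSERT."""
    if not records:
        return "", []
    # One "(ph, ph, ph)" group per record, never the raw values themselves.
    row = "(" + ", ".join([value_placeholder] * len(INSERT_COLS)) + ")"
    statement = (
        f"INSERT INTO {TABLE_NAME} ({', '.join(INSERT_COLS)}) "
        f"VALUES {', '.join([row] * len(records))}"
    )
    # Flatten the records in the same column order as the statement.
    values = [rec[col] for rec in records for col in INSERT_COLS]
    return statement, values


statement, values = build_insert(
    [{"doc_number": "doc-1", "entity": "Acme", "count": 2}],
    value_placeholder="?",
)
```

Drivers differ in the placeholder they accept: psycopg2 uses %s, while the standard-library sqlite3 module uses ?, which is why the placeholder is worth exposing as a parameter.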
- predict_backend.ml.nlp.components.entity_extractor.clean_items(item)
- predict_backend.ml.nlp.components.entity_extractor.concat_ner_and_chunks(entities, chunks, remove_pos=True)
Take entities and chunks and merge them into a single result. It also resolves conflicts between overlapping spans by keeping the larger one.
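The merge rule described here can be sketched as follows. This is an illustration under stated assumptions, not the function's actual code: spans are represented as (start, end, text) tuples with an exclusive end, merge_spans is a hypothetical name, and the remove_pos flag is not modelled.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, text); end is exclusive (assumed)


def merge_spans(entities: List[Span], chunks: List[Span]) -> List[Span]:
    """Hypothetical sketch: merge two span lists, keeping the longer span on overlap."""
    # Visit the longest spans first so they claim their range before shorter ones.
    candidates = sorted(entities + chunks, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept: List[Span] = []
    for start, end, text in candidates:
        # Keep the span only if it is disjoint from every span already kept.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, text))
    return sorted(kept)  # back to document order


merged = merge_spans(
    entities=[(0, 2, "New York"), (5, 6, "Monday")],
    chunks=[(0, 3, "New York City"), (8, 10, "the meeting")],
)
```

In this example the chunk "New York City" overlaps the shorter entity "New York", so only the chunk survives; the non-overlapping spans pass through unchanged.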
- predict_backend.ml.nlp.components.entity_extractor.segment_overlaps(a, b)
Check if two segments overlap.
- Parameters:
a (Tuple[int, int]) – First segment
b (Tuple[int, int]) – Second segment
- Return type:
bool
- Returns:
True if the two segments overlap
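An overlap test of this kind can be written in one comparison. The sketch below assumes half-open segments (start inclusive, end exclusive); the documentation does not say whether the real function treats touching endpoints as overlapping, so that choice and the name segments_overlap_sketch are assumptions.

```python
from typing import Tuple


def segments_overlap_sketch(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
    """Sketch: True if half-open segments [a0, a1) and [b0, b1) share a position."""
    # Two segments overlap exactly when each one starts before the other ends.
    return a[0] < b[1] and b[0] < a[1]


overlapping = segments_overlap_sketch((0, 3), (2, 5))
touching = segments_overlap_sketch((0, 3), (3, 5))
```

With inclusive endpoints the comparisons would use <= instead of <, in which case segments that merely touch would count as overlapping.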