corpus_stats
- class predict_backend.ml.nlp.components.corpus_stats.CorpusStats(nlp)
Bases: PandasCompliant
spaCy custom component that extracts statistics from the data. It sets corpus_stats as a custom extension on the doc object. It also implements PandasCompliant and SqlCompliant, which makes it compatible with the various persistence handlers.
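As a rough illustration of the kind of per-document statistics involved, here is a minimal plain-Python sketch. It is not the component's implementation: the helper name `sketch_corpus_stats` is hypothetical, the input is a pre-tokenized list of sentences rather than a spaCy doc, and the meaning of each statistic is only guessed from the names listed under feature_columns.

```python
from typing import Dict, List


def sketch_corpus_stats(sentences: List[List[str]]) -> Dict[str, float]:
    """Hypothetical helper: compute a few statistics guessed from their column names."""
    tokens = [tok for sent in sentences for tok in sent]
    n_sents = len(sentences)
    n_token = len(tokens)
    # Assumption: characters are counted per token, ignoring whitespace.
    n_char = sum(len(tok) for tok in tokens)
    return {
        "n_sents": n_sents,
        "n_token": n_token,
        "n_char": n_char,
        # Guard against empty input to avoid dividing by zero.
        "avg_tok_x_sent": n_token / n_sents if n_sents else 0.0,
        "avg_char_x_sent": n_char / n_sents if n_sents else 0.0,
        # Assumption: uniqueness is case-insensitive.
        "unique_words": len({tok.lower() for tok in tokens}),
    }


stats = sketch_corpus_stats([["The", "cat", "sat", "."], ["The", "dog", "ran", "."]])
```

The remaining columns (n_ents, n_events, pos_freqs, average_dep_depth) would need the output of the ner, tagger and parser pipes named in depends_on, so they are left out of this sketch.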
- assigns = ['doc._.stats']
- beautiful_name = 'Corpus Statistics'
- static create_component(nlp, name, depends_on)
Factory function. Create a new instance of this component.
- Parameters:
nlp (Any) – The spaCy nlp object.
name (str) – Unused; present only as an example of a registered parameter.
depends_on (List[str]) – Unused; present only as an example of a registered parameter.
- Returns:
A new instance of the CorpusStats component
- depends_on = ['ner', 'tagger', 'parser']
- feature_columns = ['n_sents', 'n_token', 'n_char', 'avg_tok_x_sent', 'avg_char_x_sent', 'n_ents', 'n_events', 'unique_words', 'pos_freqs', 'average_dep_depth']
- classmethod get_default()
- Returns:
The default configuration for the component
- classmethod get_df_table_name()
- Return type:
str
- Returns:
The table name of the component.
- classmethod get_feature_columns()
- Return type:
List
- Returns:
The pandas column names of the features this component extracts.
- classmethod init_dataframe()
- Return type:
DataFrame
- Returns:
An empty dataframe with the necessary columns to store the information extracted by this component.
- classmethod init_extension()
Register the doc extension used by this component.
- name = 'virtualitics_corpus_stats'
- pandas_cols = ['doc_number', 'n_sents', 'n_token', 'n_char', 'avg_tok_x_sent', 'avg_char_x_sent', 'n_ents', 'n_events', 'unique_words', 'pos_freqs', 'average_dep_depth']
- requires = ['doc.ents']
- table_name = 'stats'
- classmethod to_df(doc, idx_col)
- Parameters:
doc – Spacy doc object.
idx_col (str) – The column to use as the index.
- Return type:
DataFrame
- Returns:
A DataFrame with the corpus_stats information attached to a document.
- classmethod to_dict(doc, idx_col)
- Parameters:
doc – Spacy doc object.
idx_col (str) – The column to use as the index.
- Return type:
List
[Dict
]- Returns:
Extract a dict with the corpus_stats information attached to a document
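The documented pandas_cols value is feature_columns with doc_number prepended. The sketch below reproduces that relationship in plain Python and builds an empty record keyed by those columns. The helper `blank_record` is hypothetical, and it is only an assumption that the dicts returned by to_dict use exactly these keys.

```python
from typing import Any, Dict

# Values copied from the documented class attributes.
FEATURE_COLUMNS = [
    "n_sents", "n_token", "n_char", "avg_tok_x_sent", "avg_char_x_sent",
    "n_ents", "n_events", "unique_words", "pos_freqs", "average_dep_depth",
]
PANDAS_COLS = ["doc_number"] + FEATURE_COLUMNS


def blank_record(doc_number: int) -> Dict[str, Any]:
    """Hypothetical helper: one empty row keyed by PANDAS_COLS."""
    record: Dict[str, Any] = dict.fromkeys(PANDAS_COLS)  # every value starts as None
    record["doc_number"] = doc_number
    return record


record = blank_record(0)
```

A list of such records, one per document, is the kind of structure that could be loaded into a DataFrame with the columns described under init_dataframe.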