significant_words module

class wayward.significant_words.SignificantWordsLM(documents: Iterable[Iterable[str]], lambdas: Tuple[numpy.floating, numpy.floating, numpy.floating], thresh: int = 0)

Bases: wayward.parsimonious.ParsimoniousLM

Language model that consists of three sub-models:

Corpus model: represents term probabilities in a (large) background collection;
Group model: parsimonious term probabilities in a group of documents;
Specific model: represents the same group, but is biased towards terms that occur with a high frequency in single docs, and a low frequency in others.
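As an illustration of how three sub-models can be combined, the sketch below mixes per-term log probabilities with three weights. It assumes a plain linear mixture; the helper name `log_mixture` and the example numbers are hypothetical and are not part of wayward.

```python
import math


def log_mixture(log_p_corpus, log_p_group, log_p_specific, lambdas):
    """Return log(sum_i lambda_i * p_i) for one term, computed in log space.

    Hypothetical helper: assumes a linear mixture of the three sub-models.
    """
    log_ps = (log_p_corpus, log_p_group, log_p_specific)
    # Skip components with zero weight or zero probability (log p = -inf).
    parts = [
        math.log(weight) + log_p
        for weight, log_p in zip(lambdas, log_ps)
        if weight > 0 and log_p > -math.inf
    ]
    if not parts:
        return -math.inf
    # Log-sum-exp trick: factor out the largest term for numerical stability.
    largest = max(parts)
    return largest + math.log(sum(math.exp(p - largest) for p in parts))


# Made-up probabilities of one term under the corpus, group and specific models.
mixed = log_mixture(math.log(0.5), math.log(0.2), math.log(0.1), (0.5, 0.3, 0.2))
```

Working in log space matches the way the attributes below are stored (as log probabilities) and avoids underflow for rare terms.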

References

M. Dehghani, H. Azarbonyad, J. Kamps, D. Hiemstra, and M. Marx (2016). Luhn Revisited: Significant Words Language Models. Proc. CIKM’16.

Parameters:
documents (iterable over iterable of str terms) – All documents that should be included in the corpus model.
lambdas (3-tuple of float) – Weight of corpus, group, and specific models. Will be normalized if the weights in the tuple don’t sum to one.
thresh (int) – Don’t include words that occur fewer than thresh times.
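The two preprocessing steps described for `lambdas` and `thresh` can be pictured roughly as follows. This is a minimal sketch under assumptions, not the library's code: `normalize_lambdas` and `build_vocab` are hypothetical helpers, and wayward may handle rare terms differently (for example by keeping them in the vocabulary with a zero count).

```python
from collections import Counter


def normalize_lambdas(lambdas):
    """Scale the three weights so that they sum to one (hypothetical helper)."""
    total = sum(lambdas)
    if total <= 0:
        raise ValueError("lambdas must have a positive sum")
    return tuple(weight / total for weight in lambdas)


def build_vocab(documents, thresh=0):
    """Map each term occurring at least `thresh` times to a numeric index."""
    counts = Counter(term for doc in documents for term in doc)
    vocab = {}
    for term, count in counts.items():
        if count >= thresh:
            vocab[term] = len(vocab)  # next free index
    return vocab


docs = [["a", "b", "a"], ["b", "c"]]
weights = normalize_lambdas((2, 1, 1))   # weights no longer need to sum to one
vocab = build_vocab(docs, thresh=2)      # "c" occurs only once and is dropped
```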

vocab

Mapping of terms to numeric indices

Type:	dict of term -> int

p_corpus

Log probability of terms in background model (indexed by vocab)

Type:	array of float

p_group

Log probability of terms in the last processed group model (indexed by vocab)

Type:	array of float

p_specific

Log probability of terms in the last processed specific model (indexed by vocab)

Type:	array of float

lambda_corpus

Log probability (weight) of corpus model for documents

Type:	array of float

lambda_group

Log probability (weight) of group model for documents

Type:	array of float

lambda_specific

Log probability (weight) of specific model for documents

Type:	array of float

See also

wayward.parsimonious.ParsimoniousLM: one-sided parsimonious model

fit_parsimonious_group(document_group: Iterable[Iterable[str]], max_iter: int = 50, eps: float = 1e-05, lambdas: Optional[Tuple[numpy.floating, numpy.floating, numpy.floating]] = None, fix_lambdas: bool = False, parsimonize_specific: bool = False, post_parsimonize: bool = False, specific_estimator: Callable[[Sequence[numpy.ndarray]], numpy.ndarray] = <function mutual_exclusion>) → Dict[str, float]

Estimate a document group model, and parsimonize it against fixed corpus and specific models. The documents may be unseen, but any terms that are not in the vocabulary will be ignored.

Parameters:
document_group (iterable over iterable of str terms) – All documents that should be included in the group model.
max_iter (int, optional) – Maximum number of iterations of EM algorithm to run.
eps (float, optional) – Epsilon: convergence threshold for EM algorithm.
lambdas (3-tuple of float, optional) – Weight of corpus, group, and specific models. Will be normalized if the weights in the tuple don’t sum to one.
fix_lambdas (bool, optional) – Fix the weights of the three sub-models (i.e. don’t estimate lambdas as part of the M-step).
parsimonize_specific (bool, optional) – Bias the specific model towards uncommon terms before applying the EM algorithm to the group model. This generally results in a group model that stands out less from the corpus model.
post_parsimonize (bool, optional) – Bias the group model towards uncommon terms after applying the EM algorithm. This may be used to compensate when the frequency of common terms varies widely between the documents in the group.
specific_estimator (callable, optional) – Function that estimates the specific terms model based on the document term frequencies of the doc group.
Returns:	t_p_map – Dictionary of terms and their probabilities in the group model.
Return type:	dict of term -> float
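To make the roles of `max_iter`, `eps` and the fixed corpus and specific models concrete, here is a simplified EM sketch in plain Python. It assumes fixed weights (comparable to `fix_lambdas=True`), plain probabilities rather than log probabilities, and a linear three-way mixture. The function `em_group_model` and all numbers are hypothetical; this is not wayward's implementation.

```python
def em_group_model(tf, p_corpus, p_specific, lambdas, max_iter=50, eps=1e-5):
    """Estimate a group model by EM; corpus and specific models stay fixed.

    tf maps each term to its frequency in the document group.
    """
    l_corpus, l_group, l_specific = lambdas
    total = sum(tf.values())
    # Start from the maximum-likelihood estimate of the group model.
    p_group = {term: count / total for term, count in tf.items()}
    for _ in range(max_iter):
        # E-step: expected number of occurrences explained by the group model.
        expected = {}
        for term, count in tf.items():
            from_group = l_group * p_group[term]
            mix = (l_corpus * p_corpus.get(term, 0.0)
                   + from_group
                   + l_specific * p_specific.get(term, 0.0))
            expected[term] = count * from_group / mix if mix > 0 else 0.0
        # M-step: renormalize the expected counts into a distribution.
        norm = sum(expected.values())
        updated = {term: value / norm for term, value in expected.items()}
        change = max(abs(updated[term] - p_group[term]) for term in tf)
        p_group = updated
        if change < eps:  # converged
            break
    return p_group


tf = {"the": 10, "model": 5, "luhn": 5}
p_corpus = {"the": 0.9, "model": 0.09, "luhn": 0.01}
p_specific = {"the": 0.1, "model": 0.8, "luhn": 0.1}
t_p_map = em_group_model(tf, p_corpus, p_specific, lambdas=(0.4, 0.4, 0.2))
```

In this toy input, "the" is well explained by the corpus model and "model" by the specific model, so probability mass in the group model shifts towards "luhn".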

group_top(k: int, document_group: Iterable[Iterable[str]], **kwargs) → List[Tuple[str, float]]

Get the top k terms of a document_group and their probabilities. This is a shortcut to retrieve the top terms found by fit_parsimonious_group().

Parameters:
k (int) – Number of top terms to return.
document_group (iterable over iterable of str terms) – All documents that should be included in the group model.
kwargs – Optional keyword arguments for fit_parsimonious_group().
Returns:	t_p – Terms and their probabilities in the group model.
Return type:	list of (str, float)
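Selecting the top k terms from a term-to-probability mapping, such as the one returned by fit_parsimonious_group(), can be sketched with the standard library. The helper `top_terms` and the example mapping are hypothetical; how wayward orders ties is not specified here.

```python
import heapq


def top_terms(t_p_map, k):
    """Return the k (term, probability) pairs with the highest probability."""
    return heapq.nlargest(k, t_p_map.items(), key=lambda item: item[1])


# Made-up group model probabilities.
example_map = {"luhn": 0.565, "the": 0.3, "model": 0.135}
t_p = top_terms(example_map, 2)
```

With wayward installed, the documented equivalent would be a call such as `lm.group_top(10, document_group)`, where `lm` is a `SignificantWordsLM` instance and any extra keyword arguments are passed on to fit_parsimonious_group().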