parsimonious module

class wayward.parsimonious.ParsimoniousLM(documents: Iterable[Iterable[str]], w: numpy.floating, thresh: int = 0)

Bases: object

Language model for a set of documents.

Constructing an object of this class fits a background model on the given documents. The top method can then be used to fit document-specific models, including models for unseen documents, provided they share the vocabulary of the background corpus.

References

D. Hiemstra, S. Robertson, and H. Zaragoza (2004). Parsimonious Language Models for Information Retrieval. Proc. SIGIR ’04.

Parameters:	documents (iterable over iterable of str terms) – All documents that should be included in the corpus model.
	w (float) – Weight of the document model; the corpus model receives weight 1 - w.
	thresh (int) – Don’t include words that occur fewer than thresh times.

vocab

Mapping of terms to numeric indices

Type:	dict of term -> int

p_corpus

Log probability of terms in background model (indexed by vocab)

Type:	array of float

p_document

Log probability of terms in the last processed document model (indexed by vocab)

Type:	array of float
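The relationship between vocab and p_corpus can be pictured with a small pure-Python sketch. This is an illustration under stated assumptions, not wayward's actual implementation: the helper name build_background_model is hypothetical, plain lists stand in for NumPy arrays, and it is assumed that terms below thresh keep their vocabulary index but receive zero probability (log probability of negative infinity).

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def build_background_model(
    documents: Iterable[Iterable[str]], thresh: int = 0
) -> Tuple[Dict[str, int], List[float]]:
    """Return (vocab, p_corpus) for a tokenized corpus (hypothetical helper)."""
    vocab: Dict[str, int] = {}
    counts: Counter = Counter()
    for doc in documents:
        for term in doc:
            # Assign the next free index to each previously unseen term.
            index = vocab.setdefault(term, len(vocab))
            counts[index] += 1

    # Assumption: terms seen fewer than `thresh` times are zeroed out,
    # but they still occupy a slot in the vocabulary.
    freqs = [counts[i] if counts[i] >= thresh else 0 for i in range(len(vocab))]
    total = sum(freqs)
    if total == 0:
        raise ValueError("no terms left after applying thresh")

    # Log relative frequency; zeroed terms get log(0) = -inf.
    p_corpus = [math.log(f / total) if f > 0 else float("-inf") for f in freqs]
    return vocab, p_corpus


docs = [["a", "b", "a"], ["a", "c"]]
vocab, p_corpus = build_background_model(docs, thresh=2)
```

With thresh=2, only "a" survives in this toy corpus, so it carries the entire probability mass and the other two entries of p_corpus are negative infinity.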

get_term_probabilities(log_prob_distribution: numpy.ndarray) → Dict[str, float]

Align a term distribution with the vocabulary, and transform the term log probabilities to linear probabilities.

Parameters:	log_prob_distribution (array of float) – Log probability of terms which is indexed by the vocabulary.
Returns:	t_p_map – Dictionary of terms and their probabilities in the (sub-)model.
Return type:	dict of term -> float
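As a rough illustration of the alignment described above, the following sketch maps each term to the exponential of the log probability stored at its vocabulary index. The function name term_probabilities and the toy data are assumptions made for this example, and a plain list replaces the NumPy array; the library's own method may differ in detail.

```python
import math
from typing import Dict, Sequence


def term_probabilities(
    vocab: Dict[str, int], log_probs: Sequence[float]
) -> Dict[str, float]:
    """Map each term to exp(log probability) at its vocabulary index."""
    return {term: math.exp(log_probs[index]) for term, index in vocab.items()}


# Toy vocabulary and log distribution (hypothetical values).
vocab = {"cat": 0, "dog": 1, "eel": 2}
log_probs = [math.log(0.5), math.log(0.5), float("-inf")]
probs = term_probabilities(vocab, log_probs)
```

A log probability of negative infinity maps to a linear probability of exactly 0.0, so terms excluded from a model simply show up with zero mass.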

top(k: int, d: Iterable[str], max_iter: int = 50, eps: float = 1e-05, w: Optional[numpy.floating] = None) → List[Tuple[str, float]]

Get the top k terms of a document d and their log probabilities.

Uses the Expectation Maximization (EM) algorithm to estimate term probabilities.

Parameters:	k (int) – Number of top terms to return.
	d (iterable of str terms) – Terms that make up the document.
	max_iter (int, optional) – Maximum number of iterations of EM algorithm to run.
	eps (float, optional) – Epsilon: convergence threshold for EM algorithm.
	w (float, optional) – Weight of document model; overrides value given to `ParsimoniousLM`.
Returns:	t_p – Terms and their log probabilities in the parsimonious model.
Return type:	list of (str, float)
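The EM estimation can be sketched in pure Python, following the E-step and M-step given by Hiemstra et al. (2004). This is a minimal sketch under stated assumptions rather than the library's code: the function name parsimonious_top is hypothetical, the background model is passed as a dict of linear probabilities, w is assumed to lie strictly between 0 and 1, convergence is assumed to mean that the largest per-term change falls below eps, and the sketch returns linear probabilities instead of log probabilities to keep the arithmetic easy to follow.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def parsimonious_top(
    k: int,
    doc: Iterable[str],
    p_corpus: Dict[str, float],
    w: float,
    max_iter: int = 50,
    eps: float = 1e-5,
) -> List[Tuple[str, float]]:
    """Top-k terms by linear probability under a parsimonious document model."""
    # Terms missing from the background model (or with zero mass there) are ignored.
    tf = Counter(t for t in doc if p_corpus.get(t, 0.0) > 0.0)
    total = sum(tf.values())
    if total == 0:
        return []

    # Start from the maximum-likelihood document model.
    p_doc = {t: n / total for t, n in tf.items()}

    for _ in range(max_iter):
        # E-step: expected count of each term attributed to the document model.
        e = {
            t: tf[t] * w * p_doc[t] / ((1.0 - w) * p_corpus[t] + w * p_doc[t])
            for t in tf
        }
        # M-step: renormalise the expected counts into a distribution.
        norm = sum(e.values())
        new_p = {t: e[t] / norm for t in tf}
        change = max(abs(new_p[t] - p_doc[t]) for t in tf)
        p_doc = new_p
        if change < eps:  # assumed convergence criterion
            break

    return sorted(p_doc.items(), key=lambda tp: tp[1], reverse=True)[:k]


# Toy background model and document (hypothetical values).
p_corpus = {"the": 0.6, "of": 0.3, "retrieval": 0.1}
doc = ["the", "the", "of", "retrieval", "retrieval"]
result = parsimonious_top(3, doc, p_corpus, w=0.1)
```

In this toy example "retrieval" is far more frequent in the document than the background model predicts, so the iterations shift probability mass toward it and away from the common terms "the" and "of".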