DIRT – Discovery of Inference Rules from Text

ABSTRACT

In this paper, we propose an unsupervised method for discovering inference rules from text, such as "X is author of Y ≈ X wrote Y", "X solved Y ≈ X found a solution to Y", and "X caused Y ≈ Y is triggered by X". Inference rules are extremely important in many fields such as natural language processing, information retrieval, and artificial intelligence in general. Our algorithm is based on an extended version of Harris' Distributional Hypothesis, which states that words that occurred in the same contexts tend to be similar. Instead of using this hypothesis on words, we apply it to paths in the dependency trees of a parsed corpus.

1. INTRODUCTION

Text is the most significant repository of human knowledge. Many algorithms have been proposed to mine textual data. Most of them focus on document clustering [13], identifying prototypical documents [20], or finding term associations [14] and hyponym relationships [9]. We propose an unsupervised method for discovering inference rules, such as "X is author of Y ≈ X wrote Y", "X solved Y ≈ X found a solution to Y", and "X caused Y ≈ Y is triggered by X". Inference rules are extremely important in many fields such as natural language processing, information retrieval, and artificial intelligence in general.

For example, consider the query to an information retrieval system: "Who is the author of the 'Star Spangled Banner'?" Unless the system recognizes the relationship between "X wrote Y" and "X is the author of Y", it would not necessarily rank the sentence

  Francis Scott Key wrote the "Star Spangled Banner" in 1814.

higher than the sentence

  …comedian-actress Roseanne Barr sang her famous shrieking rendition of the "Star Spangled Banner" before a San Diego …

We call "X wrote Y ≈ X is the author of Y" an inference rule. In previous work, such relationships have been referred to as paraphrases or variants [24]. In this paper, we use the term inference rule because we also want to include relationships that are not exactly paraphrases, but are nonetheless related and are potentially useful to information retrieval systems. For example, "X caused Y ≈ Y is blamed on X" is an inference rule even though the two phrases do not mean exactly the same thing.

Traditionally, knowledge bases containing such inference rules are created manually. This knowledge engineering task is extremely laborious. More importantly, building such a knowledge base is inherently difficult since humans are not good at generating a complete list of rules. For example, while it is quite trivial to come up with the rule "X wrote Y ≈ X is the author of Y", it seems hard to dream up a rule like "X manufactures Y ≈ X's Y factory", which can be used to infer that "Chrétien visited Peugeot's newly renovated car factory in the afternoon" contains an answer to the query "What does Peugeot manufacture?"

Most previous efforts on knowledge engineering have focused on creating tools for helping knowledge engineers transfer their knowledge to machines [6]. Our goal is to automatically discover such knowledge.

In this paper, we present an unsupervised algorithm, DIRT, for Discovery of Inference Rules from Text. Our algorithm is a generalization of previous algorithms for finding similar words [10][15][19]. Algorithms for finding similar words assume the Distributional Hypothesis, which states that words that occurred in the same contexts tend to have similar meanings [7]. Instead of applying the Distributional Hypothesis to words, we apply it to paths in dependency trees. Essentially, if two paths tend to link the same sets of words, we hypothesize that their meanings are similar. Since a path represents a binary relationship, we generate an inference rule for each pair of similar paths.

The remainder of this paper is organized as follows. In the next section, we review previous work. In Section 3, we define paths in dependency trees and describe their extraction from a parsed corpus. Section 4 presents the DIRT system. A comparison of our system's output with manually generated paraphrase expressions is shown in Section 5. Finally, we conclude with a discussion of future work.
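To make the intuition concrete, the following toy sketch applies a single inference rule of the form "X wrote Y ≈ X is the author of Y" to a surface string. This is only an illustration of how a retrieval system could use such a rule; DIRT's actual rules relate dependency paths, not surface templates, and every name in this snippet is hypothetical.

```python
import re

def apply_rule(sentence, lhs, rhs):
    """Hypothetical helper: if `sentence` matches the template `lhs`
    (e.g. "X wrote Y"), return it rewritten with `rhs`
    (e.g. "X is the author of Y"); otherwise return None."""
    # Turn the left-hand template into a regex capturing the X and Y fillers.
    pattern = re.escape(lhs).replace("X", "(?P<x>.+)", 1).replace("Y", "(?P<y>.+)", 1)
    m = re.fullmatch(pattern, sentence)
    if m is None:
        return None
    return rhs.replace("X", m.group("x")).replace("Y", m.group("y"))

print(apply_rule("Francis Scott Key wrote the Star Spangled Banner",
                 "X wrote Y", "X is the author of Y"))
# Francis Scott Key is the author of the Star Spangled Banner
```

A real system would apply such rules over the dependency-path representation defined in Section 3, where the X and Y variables are the slot fillers of a path.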
2. Previous Work

Most previous work on variant recognition and paraphrase has been done in the fields of natural language generation, text summarization, and information retrieval.

The generation community has focused mainly on rule-based text transformations in order to meet external constraints such as length and readability [11][18][22]. Dras [4] described syntactic paraphrases using a meta-grammar with a synchronous Tree Adjoining Grammar (TAG) formalism.

Copyright ACM 2001 1-58113-391-x/01/08…$5.00
In multi-document summarization, paraphrasing is important to avoid redundant statements in a summary. Given a collection of similar sentences (a theme) with different wordings, it is difficult to identify similar phrases that report the same fact. Barzilay et al. [3] analyzed 200 two-sentence themes from a corpus and extracted seven lexico-syntactic paraphrasing rules. These rules covered 82% of syntactic and lexical paraphrases, which cover 70% of all variants. The rules are subsequently used to identify common statements in a theme by comparing the predicate-argument structure of the sentences within the theme.

In information retrieval, it is common to identify phrasal terms from queries and generate their variants for query expansion. It has been shown that such query expansion does promote effective retrieval [1][2]. Morphological variant query expansion was treated by Sparck Jones and Tait [24] and Jacquemin [12].

In [21], Richardson extracted semantic relationships (e.g., hypernym, location, material and purpose) from dictionary definitions using a parser and constructed a semantic network. He then described an algorithm that uses paths in the semantic network to compute the similarity between words. In a sense, our algorithm is a dual of Richardson's approach. While Richardson used paths as features to compute the similarity between words, we use words as features to compute the similarity of paths.

Many text mining algorithms aim at finding association rules between terms [14]. In contrast, the output of our algorithm is a set of associations between relations. Term associations usually require human interpretation. Some of them are considered to be …

3. Extraction of Paths

The inference rules discovered by DIRT are between paths in dependency trees. In this section, we introduce dependency trees and define paths in trees. Finally, we describe an algorithm for extracting paths from a parsed corpus.

3.1 Dependency Trees

A dependency relationship [8] is an asymmetric binary relationship between a word called head, and another word called modifier. The structure of a sentence can be represented by a set of dependency relationships that form a tree. A word in the sentence may have several modifiers, but each word may modify at most one word. The root of the dependency tree does not modify any word. It is also called the head of the sentence.

For example, Figure 1 shows the dependency tree for the sentence "John found a solution to the problem", generated by a broad-coverage English parser called Minipar1 [16]. The links in the diagram represent dependency relationships. The direction of a link is from the head to the modifier in the relationship. Labels associated with the links represent types of dependency relations. Table 1 lists a subset of the dependency relations in Minipar outputs.

Table 1. A subset of dependency relations in Minipar outputs.

  RELATION   DESCRIPTION           EXAMPLE
  appo       appositive            the CEO, John
  gen        genitive              John's dog
  mod        adjunct modifier      tiny hole
  nn         noun-noun modifier    station manager
  subj       subject               John loves Mary.

Figure 1. Example dependency tree (for the sentence "John found a solution to the problem").

Minipar parses newspaper text at about 500 words per second on a Pentium-III 700MHz with 500MB memory. Evaluation with the manually parsed SUSANNE corpus [23] shows that about 89% of the dependency relationships in Minipar outputs are correct.

3.2 Paths in Dependency Trees

In the dependency trees generated by Minipar, each link between two words represents a direct semantic relationship. A path allows us to represent indirect semantic relationships between two content words. We name a path by concatenating dependency relationships and words along the path, excluding the words at the two ends. For the sentence in Figure 1, the path between John and problem is named: N:subj:V←find→V:obj:N→solution→N:to:N (meaning "X finds solution to Y"). The reverse path can be written as: N:to:N←solution←N:obj:V←find→V:subj:N. The root of both paths is find.

A path begins and ends with two dependency relations. We call them the two slots of the path: SlotX on the left-hand side and SlotY on the right-hand side. The words connected by the path are the fillers of the slots. For example, John fills the SlotX of N:subj:V←find→V:obj:N→solution→N:to:N and problem fills the SlotY. The reverse is true for N:to:N←solution←N:obj:V←find→V:subj:N. In a path, dependency relations that are not slots are called internal relations. For example, find→V:obj:N→solution is an internal relation.

We impose a set of constraints on the paths to be extracted from a corpus, because:

• most meaningful inference rules involve only paths that satisfy these constraints;
• the constraints significantly reduce the number of distinct paths and, consequently, the amount of computation required for computing similar paths (described in Section 4.3); and
• the constraints alleviate the sparse data problem because long paths tend to have very few occurrences.

The constraints are:

• slot fillers must be nouns, because slots correspond to variables in inference rules and we expect the variables to be instantiated by entities;
• any dependency relation that does not connect two content words (i.e. nouns, verbs, adjectives or adverbs) is excluded from a path. E.g. in Figure 1, the relation between a and solution is excluded;
• the frequency count of an internal relation must exceed a threshold.

Table 2. Sample slot fillers for two paths extracted from a newspaper corpus.

  "X finds a solution to Y"    "X solves Y"

1 Available at www.cs.ualberta.ca/~lindek/minipar.htm.
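The path-naming convention of Section 3.2 can be sketched in a few lines of code. The dependency tree below is hand-coded for the example sentence of Figure 1 (it is an assumed toy structure, not Minipar's actual output format), and ASCII `<-`/`->` stand in for the left/right arrows the paper prints inside path names:

```python
# Toy dependency tree for "John found a solution to the problem":
# edges are (head, relation, modifier); POS tags are assumed, one per lemma.
POS = {"John": "N", "find": "V", "solution": "N", "problem": "N",
       "a": "Det", "the": "Det"}
EDGES = [("find", "subj", "John"),
         ("find", "obj", "solution"),
         ("solution", "det", "a"),
         ("solution", "to", "problem"),
         ("problem", "det", "the")]

def name_path(edges, pos, start, end):
    """Name the path between `start` and `end`, excluding the two end words
    (Section 3.2). Assumes the head chains of both words meet at the tree
    root, as here; a general version would find the lowest common ancestor."""
    parent = {mod: (head, rel) for head, rel, mod in edges}
    def chain(word):                       # climb from `word` to the root
        links = []
        while word in parent:
            head, rel = parent[word]
            links.append((head, rel, word))
            word = head
        return links
    up, down = chain(start), chain(end)
    tokens = []
    for head, rel, mod in up:              # ascending (SlotX) side
        tokens += [f"{pos[mod]}:{rel}:{pos[head]}", head]
    root_index = len(tokens) - 1           # position of the shared root word
    for head, rel, mod in reversed(down):  # descending (SlotY) side
        tokens += [f"{pos[head]}:{rel}:{pos[mod]}", mod]
    tokens.pop()                           # drop the end word itself
    out = tokens[0]
    for i in range(1, len(tokens)):        # '<-' up to the root, '->' after it
        out += ("<-" if i <= root_index else "->") + tokens[i]
    return out

print(name_path(EDGES, POS, "John", "problem"))
# N:subj:V<-find->V:obj:N->solution->N:to:N
print(name_path(EDGES, POS, "problem", "John"))
# N:to:N<-solution<-N:obj:V<-find->V:subj:N
```

Note that the determiners a and the never appear in the output: their det links lie off the path between the two nouns, so a relation that does not connect two content words is never traversed.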
For example, consider the sentence:

  They had previously bought bighorn sheep from Comstock.

The paths extracted from this sentence and their meanings are:

(a) N:subj:V←buy→V:from:N ≡ X buys something from Y
…
(c) N:subj:V←buy→V:obj:N→sheep→N:nn:N ≡ X buys Y sheep
(d) N:nn:N←sheep←N:obj:V←buy→V:from:N ≡ X sheep is bought from Y

An inverse path is also added for each one above.

4. Discovering Inference Rules from Text

A path is a binary relation between two entities. In this section, we present an algorithm, called DIRT, to automatically discover the inference relations between such binary relations.

4.1 Underlying Assumption

Most algorithms for computing word similarity from a text corpus are based on a principle known as the Distributional Hypothesis [7]. The idea is that words that tend to occur in the same contexts tend to have similar meanings. Previous efforts differ in their representation of the context and in their formula for computing the similarity between two sets of contexts. Some algorithms use the words that occurred in a fixed window of a given word as its context while others use the dependency relationships of a given word as its context [15]. Consider the words duty and responsibility. There are many contexts in which both of these words can fit. For example:

• duty can be modified by adjectives such as additional, administrative, assigned, assumed, collective, congressional, constitutional, …; so can responsibility;
• duty can be the object of verbs such as accept, articulate, assert, assign, assume, attend to, avoid, become, breach, …; so can responsibility.

Based on these common contexts, one can statistically determine that duty and responsibility have similar meanings.

In the algorithms for finding word similarity, dependency links are treated as contexts of words. In contrast, our algorithm for finding inference rules treats the words that fill the slots of a path as a context for the path. We make the assumption stated in the following extension of the Distributional Hypothesis:

Extended Distributional Hypothesis: If two paths tend to occur in similar contexts, the meanings of the paths tend to be similar.

For example, Table 2 lists a set of example pairs of words connected by the paths N:subj:V←find→V:obj:N→solution→N:to:N ("X finds a solution to Y") and N:subj:V←solve→V:obj:N ("X solves Y"). As can be seen from the table, there are many overlaps between the corresponding slot fillers of the two paths. By the Extended Distributional Hypothesis, we can then claim that the two paths have similar meanings.

4.2 Triples

To compute the path similarity using the Extended Distributional Hypothesis, we need to collect the frequency counts of all paths in a corpus and the slot fillers for the paths. For each instance of a path p that connects two words w1 and w2, we increase the frequency counts of the two triples (p, SlotX, w1) and (p, SlotY, w2). We call (SlotX, w1) and (SlotY, w2) features of path p. Intuitively, the more features two paths share, the more similar they are.

We use a triple database (a hash table) to accumulate the frequency counts of all features of all paths extracted from a parsed corpus. An example entry in the triple database for the path

  N:subj:V←pull→V:obj:N→body→N:from:N

("X pulls body from Y") is shown in Figure 2. The first column of numbers in Figure 2 represents the frequency counts of a word filling a slot of the path and the second column of numbers is the mutual information between the slot and the slot filler. Mutual information measures the strength of the association between a slot and a filler. We explain mutual information in detail in Section 4.3. The triple database records the fillers of SlotX and SlotY separately. Looking at the database, one would be unable to tell which SlotX filler occurred with which SlotY filler in the corpus.

Figure 2. An example entry in the triple database for the path "X pulls body from Y".

4.3 Similarity between Two Paths

Once the triple database is created, the similarity between two paths can be computed in the same way that the similarity between two words is computed in [15]. Essentially, two paths have high similarity if there are a large number of common features. However, not every feature is equally important. For example, the word he is much more frequent than the word sheriff. Two paths sharing the feature (SlotX, he) is less indicative of their similarity than if they shared the feature (SlotX, sheriff). The similarity measure proposed in [15] takes this into account by computing the mutual information between a feature and a path.

We use the notation |p, SlotX, w| to denote the frequency count of the triple (p, SlotX, w), |p, SlotX, *| to denote Σw |p, SlotX, w|, and |*, *, *| to denote Σp,s,w |p, s, w|.

Following [15], the mutual information between a path slot and its filler can be computed as:

  mi(p, Slot, w) = log( (|p, Slot, w| × |*, Slot, *|) / (|p, Slot, *| × |*, Slot, w|) )    (1)

The similarity between a pair of slots, slot1 = (p1, s) and slot2 = (p2, s), is defined as:

  sim(slot1, slot2) = Σw∈T(p1,s)∩T(p2,s) ( mi(p1, s, w) + mi(p2, s, w) ) / ( Σw∈T(p1,s) mi(p1, s, w) + Σw∈T(p2,s) mi(p2, s, w) )

where p1 and p2 are paths, s is a slot, and T(pi, s) is the set of words that fill the s slot of path pi.

The similarity between a pair of paths p1 and p2 is defined as the geometric average of the similarities of their SlotX and SlotY slots:

  S(p1, p2) = sqrt( sim(SlotX1, SlotX2) × sim(SlotY1, SlotY2) )    (2)

where SlotXi and SlotYi are path pi's SlotX and SlotY slots.

4.4 Finding the Most Similar Paths

The discovery of inference rules is made by finding the most similar paths of a given path. The challenge here is that there are a large number of paths in the triple database. The database used in our experiments contains over 200,000 distinct paths. Computing the similarity between every pair of paths is obviously impractical. Given a path p, our algorithm for finding the most similar paths of p proceeds as follows:

(a) Retrieve all the paths that share at least one feature with p and call them candidate paths. This can be done efficiently by storing for each word the set of slots it fills in.

(b) For each candidate path c, count the number of features shared by c and p. Filter out c if the number of its common features with p is less than a fixed percent (we used 1%) of the total number of features for p and c. This step effectively uses a simpler similarity formula to filter out some of the paths, since computing mutual information is more costly than counting the number of features. This idea has previously been used in Canopy [17].

(c) Compute the similarity between p and the candidates that passed the filter using equation (2) and output the paths in descending order of their similarity to p.

Table 3 lists the top-20 most similar paths to "X solves Y" generated by DIRT. Most of the paths can be considered as paraphrases of the original expression.

Table 3. The top-20 most similar paths to "X solves Y".
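Sections 4.2 and 4.3 can be condensed into a small sketch. The snippet below builds a miniature triple database and computes the mutual information and slot/path similarities on it. The paths, fillers, and counts are invented for illustration (a real run uses millions of triples and chooses counts where mi stays informative), and the candidate filtering of Section 4.4 is omitted:

```python
import math
from collections import defaultdict

triples = defaultdict(int)              # (path, slot, word) -> frequency count

def add_instance(path, w1, w2):
    """Count one corpus instance of `path` connecting w1 (SlotX) to w2 (SlotY)."""
    triples[(path, "SlotX", w1)] += 1
    triples[(path, "SlotY", w2)] += 1

# Invented toy instances; the overlap in fillers mimics Table 2.
for x, y in [("committee", "problem"), ("committee", "problem"), ("team", "crisis")]:
    add_instance("X finds a solution to Y", x, y)
for x, y in [("committee", "problem"), ("committee", "problem"), ("expert", "mystery")]:
    add_instance("X solves Y", x, y)
for x, y in [("virus", "disease")] * 3 + [("stress", "anxiety")] * 2:
    add_instance("X causes Y", x, y)

def count(p="*", s="*", w="*"):
    """|p, s, w| with '*' as a wildcard, e.g. count(p, s) gives |p, s, *|."""
    return sum(c for (tp, ts, tw), c in triples.items()
               if p in ("*", tp) and s in ("*", ts) and w in ("*", tw))

def mi(p, s, w):                        # equation (1): slot-filler association
    return math.log((count(p, s, w) * count("*", s, "*")) /
                    (count(p, s, "*") * count("*", s, w)))

def fillers(p, s):                      # T(p, s): the words filling slot s of p
    return {tw for (tp, ts, tw) in triples if tp == p and ts == s}

def slot_sim(p1, p2, s):                # Lin's slot similarity
    shared = fillers(p1, s) & fillers(p2, s)
    num = sum(mi(p1, s, w) + mi(p2, s, w) for w in shared)
    den = (sum(mi(p1, s, w) for w in fillers(p1, s)) +
           sum(mi(p2, s, w) for w in fillers(p2, s)))
    return num / den

def path_sim(p1, p2):                   # equation (2): geometric average of slots
    return math.sqrt(slot_sim(p1, p2, "SlotX") * slot_sim(p1, p2, "SlotY"))

print(round(path_sim("X finds a solution to Y", "X solves Y"), 3))   # 0.318
print(path_sim("X finds a solution to Y", "X causes Y"))             # 0.0
```

With real counts, mi can be negative for uninformative fillers such as he; the toy counts here were chosen to keep it positive, and the paths sharing no fillers correctly receive similarity zero.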
5. Experimental Results

We performed an evaluation of our algorithm by comparing the inference rules it generates with a set of human-generated paraphrases of the first six questions in the TREC-8 Question-Answering Track, listed in Table 4. TREC (Text REtrieval Conference) is a U.S. government sponsored competition on information retrieval held annually since 1992. In the Question-Answering Track, the task for participating systems is to find answers to natural-language questions like those in Table 4.

Table 4. First six questions from TREC-8.

  Q1. Who is the author of the book, "The Iron Lady: A Biography of Margaret Thatcher"?
  Q2. What was the monetary value of the Nobel Peace Prize in 1989?
  Q3. What does the Peugeot company manufacture?
  Q4. How much did Mercury spend on advertising in 1993?
  Q5. What is the name of the managing director of Apricot Computer?
  Q6. Why did David Koresh ask the FBI for a word processor?

Table 5. Evaluation of Top-40 most similar paths.
5.1 Results

We used Minipar to parse about 1GB of newspaper text (AP Newswire, San Jose Mercury, and Wall Street Journal). Using the methods discussed in Section 3, we extracted 7 million paths from the parse trees (231,000 unique) and stored them in a triple database.

The second column of Table 5 shows the paths that we identified from the TREC-8 questions. For some questions, more than one path was identified. For others, no path was found. We compare the output of our algorithm with a set of manually generated paraphrases of the TREC-8 questions made available at ISI2. We also extracted paths from the manually generated paraphrases. For some paraphrases, an identical path is extracted. For example, "What things are manufactured by Peugeot?" and "What products are manufactured by Peugeot?" both map to the path "X is manufactured by Y". The number of paths for the manually generated paraphrases of the TREC-8 questions is shown in the third column of Table 5.

2 Available at http://www.isi.edu/~gerber/Variations2.txt

For each of the paths p in the second column of Table 5, we ran the DIRT algorithm to compute its Top-40 most similar paths using the triple database. We then manually inspected the outputs and classified each extracted path as correct or incorrect. A path p' is judged correct if a sentence containing p' might contain an answer to the question from which p was extracted. Consider question Q3 in Table 4, where we have p = "X manufactures Y" and we find p' = "X's Y factory" as one of p's Top-40 most similar paths. Since "Peugeot's car factory" might be found in some corpus, p' is judged correct. Note that not all sentences containing p' necessarily contain an answer to Q3 (e.g., "Peugeot's … factory" gives the location of a Peugeot factory in France). The fourth column in Table 5 shows the number of Top-40 most similar paths classified as correct and the fifth column gives the intersection between columns three and four. Finally, the last column in Table 5 gives the percentage of correctly classified paths.

5.2 Observations

There is very little overlap between the automatically generated paths and the paraphrases, even though the percentage of correct paths in DIRT outputs can be quite high. This suggests that finding potentially useful inference rules is very difficult for humans as well as machines. Table 6 shows some of the correct paths among the Top-40 extracted by our system for two of the TREC-8 questions. Many of the variations generated by DIRT that are correct paraphrases are missing from the manually generated variations. It is difficult for humans to recall a list of paraphrases. However, given the output of our system, humans can easily identify the correct inference rules. Hence, at the least, our system would greatly ease the manual construction of inference rules for an information retrieval system.

Table 6. Paths found for two of the six questions in TREC-8 and the variations discovered manually and by DIRT (in part): X makes Y; X produce Y; X is in Y business; Y is manufactured by X; Y is provided by X; Y is X's product; Y is product from X; Y is X product; Y is product made by X; Y is example of X product; X is manufacturer of Y; find Y in X's product line; find Y in X …

The performance of DIRT varies a great deal for different paths. Usually, the performance for paths with verb roots is much better than for paths with noun roots. A verb phrase typically has more than one modifier, whereas nouns usually take a smaller number of modifiers. When a word takes fewer than two modifiers, it will not be the root of any path. As a result, paths with noun roots occur less often than paths with verb roots, which explains the lower performance with respect to paths with noun roots.

In Table 5, DIRT found no correct inference rules for Q2. This is due to the fact that Q2 does not have any entries in the triple database.

6. Conclusion and Future Work

Better tools are necessary to tap into the vast amount of textual data that is growing at an astronomical pace. Knowledge about inference relationships in natural language expressions would be extremely useful for such tools. To the best of our knowledge, this is the first attempt to discover such knowledge automatically from a large corpus of text. We introduced the Extended Distributional Hypothesis, which states that paths in dependency trees have similar meanings if they tend to connect similar sets of words. Treating paths as binary relations, our algorithm is able to generate inference rules by searching for similar paths. Our experimental results show that the Extended Distributional Hypothesis can indeed be used to discover very useful inference rules, many of which, though easily recognizable, are difficult for humans to recall.

Many questions remain to be addressed. One is to recognize the polarity in inference relationships. High similarity values are often assigned to relations with opposite polarity. For example, "X worsens Y" has one of the highest similarities to "X solves Y" according to equation (2). In some situations, this may be helpful; in others it may cause confusion.

Another is to extend paths with constraints on the inference rule's variables. For example, instead of generating a rule "X manufactures Y ≈ X's Y factory", we may want to generate a rule with an additional clause: "X manufactures Y ≈ X's Y factory, where Y is an artifact". The "where" clause can potentially be discovered by generalizing the intersection of the SlotY fillers of the two paths.

7. ACKNOWLEDGMENTS

The authors wish to thank the reviewers for their helpful comments. This research was partly supported by Natural Sciences and Engineering Research Council of Canada grant …

8. REFERENCES

[1] Anick, P. G. and Tipirneni, S. 1999. The Paraphrase Search Assistant: Terminological Feedback for Iterative Information Seeking. In Proceedings of SIGIR-99. pp. 153-159. Berkeley, CA.
[2] Arampatzis, A. T., Tsoris, T., Koster, C. H. A., and van der Weide, T. P. 1998. Phrase-based information retrieval. Information Processing & Management, 34(6):693-707.
[3] Barzilay, R., McKeown, K., and Elhadad, M. 1999. Information Fusion in the Context of Multi-Document Summarization. In Proceedings of ACL-99. College Park, Maryland.
[4] Dras, M. 1999. A Meta-level Grammar: Redefining Synchronous TAGs for Translation and Paraphrase. In Proceedings of ACL-99. pp. 80-97. College Park, Maryland.
[5] Feldman, R., Fresko, M., Kinar, Y., Lindell, Y., Liphstat, O., Rajman, M., Schler, Y., and Zamir, O. 1998. Text Mining at the Term Level. In Proceedings of the 2nd European Symposium on Principles of Data Mining and Knowledge Discovery. pp. 65-73. Nantes, France.
[6] Hahn, U. and Schnattinger, K. 1998. Towards Text Knowledge Engineering. In Proceedings of AAAI-98/IAAI-98. pp. 524-531. Menlo Park, CA.
[7] Harris, Z. 1985. Distributional Structure. In: Katz, J. J. (ed.), The Philosophy of Linguistics. New York: Oxford University Press. pp. 26-47.
[8] Hays, D. 1964. Dependency Theory: A Formalism and Some Observations.
[9] Hearst, M. 1992. Automatic Acquisition of Hyponyms from Large Text Corpora. In Proceedings of ACL-92. Nantes, France.
[10] Hindle, D. 1990. Noun Classification from Predicate-Argument Structures. In Proceedings of ACL-90. pp. 268-275. Pittsburgh.
[11] Iordanskaja, L., Kittredge, R., and Polguère, A. 1991. Natural Language Generation in Artificial Intelligence and Computational Linguistics. Kluwer. Boston, MA.
[12] Jacquemin, C., Klavans, J. L., and Tzoukermann, E. 1999. NLP for Term Variant Extraction: A Synergy of Morphology, Lexicon, and Syntax. In Natural Language Information Retrieval, T. Strzalkowski, editor. pp. 25-74. Kluwer. Boston, MA.
[13] Larsen, B. and Aone, C. 1999. Fast and Effective Text Mining Using Linear-Time Document Clustering. In Proceedings of KDD-99. pp. 16-22. San Diego, CA.
[14] Lin, S. H., Shih, C. S., Chen, M. C., et al. 1998. Extracting Classification Knowledge of Internet Documents with Mining Term Associations: A Semantic Approach. In Proceedings of SIGIR-98. Melbourne, Australia.
[15] Lin, D. 1998. Extracting Collocations from Text Corpora. Workshop on Computational Terminology. pp. 57-63. Montreal, Canada.
[16] Lin, D. 1993. Principle-Based Parsing Without Overgeneration. In Proceedings of ACL-93. pp. 112-120. Columbus, OH.
[17] McCallum, A., Nigam, K., and Ungar, L. H. 2000. Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching. In Proceedings of KDD-2000. Boston, MA.
[18] Meteer, M. M. and Shaked, V. 1988. Strategies for Effective Paraphrasing. In Proceedings of COLING-88. pp. 431-436. Budapest.
[19] Pereira, F., Tishby, N., and Lee, L. 1993. Distributional Clustering of English Words. In Proceedings of ACL-93. pp. 183-190. Columbus, Ohio.
[20] Rajman, M. and Besançon, R. 1997. Text Mining: Natural Language Techniques and Text Mining Applications. In Proceedings of the Seventh IFIP 2.6 Working Conference on Database Semantics (DS-7).
[21] Richardson, S. D. 1997. Determining Similarity and Inferring Relations in a Lexical Knowledge Base. Ph.D. Thesis. The City University of New York.
[22] Robin, J. 1994. Revision-Based Generation of Natural Language Summaries Providing Historical Background. Ph.D. Dissertation. Columbia University.
[23] Sampson, G. 1995. English for the Computer: The SUSANNE Corpus and Analytic Scheme. Clarendon Press. Oxford, England.
[24] Sparck Jones, K. and Tait, J. I. 1984. Automatic Search Term Variant Generation. Journal of Documentation, 40(1):50-66.