The product homomorphism method and its applications to learning logic programs
Learning logic programs, also referred to as inductive logic programming (ILP), is a relatively young area of artificial intelligence. The theoretical study of the efficient learnability of logic programs is motivated by the increasing importance of ILP in practical applications. Since the general ILP learning problem is known to be computationally intractable, one of the challenging problems in this field is to establish positive and negative theoretical results about the efficient learnability of different classes of logic programs in the formal learning models of computational learning theory. In this work we present a general method, called the product homomorphism method, that can be used to develop PAC-learning algorithms for different classes of logic programs. Our approach makes use of the fact that least general generalizations of first-order clauses can be represented as products. This relationship was already mentioned in the seminal papers of Plotkin, but he did not study its computational aspects. We show that the concept classes considered in this work are closed under nonempty intersection; thus, for a set S+ of positive examples one can consider the least concept containing S+, also referred to as the concept generated by S+. This result implies that for a given set S- of negative examples there is a hypothesis consistent with S+ and S- if and only if S- and the concept generated by S+ are disjoint. Our main result is that the concept generated by S+ can be characterized by a certain homomorphism from the product of |S+| copies of the background knowledge. Products and homomorphisms thus enable the use of algebraic and combinatorial tools in certain cases and provide the following general method for developing efficient ILP learning algorithms: (1) find a characterization for the existence of a certain homomorphism, and (2)
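To make the product construction concrete, the following is a minimal Python sketch, assuming the background knowledge is given as a set of directed edges (ground atoms of one binary predicate). `graph_product` builds the direct product of several copies of the background graph, and `has_homomorphism` decides by brute force whether a partial map extends to an edge-preserving homomorphism. The brute-force search is exponential in general; the point of the characterization results summarized above is precisely that for restricted background classes this question can be decided efficiently. Function names and the exact representation are our illustration, not the paper's.

```python
from itertools import product as cart

def graph_product(edge_sets):
    """Direct product of digraphs given as sets of (u, v) edges.

    A product node is a tuple with one component per factor graph;
    there is an edge between two product nodes iff every component
    pair is an edge of the corresponding factor.
    """
    result = set()
    for combo in cart(*edge_sets):
        us = tuple(u for u, v in combo)
        vs = tuple(v for u, v in combo)
        result.add((us, vs))
    return result

def has_homomorphism(src_edges, dst_edges, partial):
    """Brute-force check: can `partial` (a dict, node -> node) be
    extended to an edge-preserving map from src to dst?"""
    nodes = sorted({n for e in src_edges for n in e} - set(partial))
    dst_nodes = {n for e in dst_edges for n in e}

    def consistent(m):
        # Every src edge whose endpoints are already mapped must land
        # on a dst edge.
        return all((m[u], m[v]) in dst_edges
                   for u, v in src_edges if u in m and v in m)

    def search(i, m):
        if not consistent(m):
            return False
        if i == len(nodes):
            return True
        return any(search(i + 1, {**m, nodes[i]: d}) for d in dst_nodes)

    return search(0, dict(partial))
```

For instance, with background path a -> b -> c and positive examples {b, c} for a unary target, one asks whether the product of two copies admits a homomorphism back to the background graph sending the node (b, c) to a candidate element; here such a map exists for b but not for a, since a has no incoming edge.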
give an efficient algorithm that translates this characterization into the language of nonrecursive Horn clauses. We present several classes of logic programs for which this method can indeed be used to obtain positive learnability results in Valiant's PAC model of learning. In these cases we assume that the background knowledge consists of ground atoms of a single binary background predicate, and thus it can be represented as a directed graph. We study the cases where this graph is a set of disjoint directed paths, a forest, a set of disjoint directed cycles, or a function graph (i.e., each node has outdegree 1). We show that for these classes of logic programs the product homomorphism method outlined above yields efficient hypothesis finding algorithms, i.e., algorithms that find a nonrecursive definite Horn clause consistent with a given set of examples. To obtain efficient learning algorithms in the PAC model, one also needs upper bounds on the sample size, which can be derived by estimating the Vapnik-Chervonenkis dimension of the concept classes. A polynomial upper bound for this quantity follows as a byproduct of our characterization results; we then improve it to bounds that are sharp up to order of magnitude. We note that the above restrictions on the background knowledge are related to the notion of determinateness, which is often used to obtain positive ILP learnability results. In particular, background knowledge consisting of paths or cycles implies determinateness, whereas in the case of forests or function graphs the clauses are not necessarily determinate. Another important difference is that we do not assume any bound on the depth of the clauses considered. When the target definition may consist of multiple clauses, the problem becomes more difficult.
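The four restricted background classes can all be recognized from degree and acyclicity conditions on the edge set. The sketch below, again assuming the background knowledge is a set of directed edges, checks each shape; the orientation convention for forests (edges pointing from child to parent, so each node has at most one outgoing edge) is our assumption for illustration.

```python
def degrees(edges):
    """Node set plus in- and out-degree maps of a digraph."""
    nodes = {n for e in edges for n in e}
    outd = {n: 0 for n in nodes}
    ind = {n: 0 for n in nodes}
    for u, v in edges:
        outd[u] += 1
        ind[v] += 1
    return nodes, ind, outd

def is_acyclic(edges):
    """Kahn's algorithm: True iff the digraph has no directed cycle."""
    nodes, ind, _ = degrees(edges)
    adj = {n: [] for n in nodes}
    for u, v in edges:
        adj[u].append(v)
    stack = [n for n in nodes if ind[n] == 0]
    seen = 0
    while stack:
        n = stack.pop()
        seen += 1
        for v in adj[n]:
            ind[v] -= 1
            if ind[v] == 0:
                stack.append(v)
    return seen == len(nodes)

def background_kinds(edges):
    """Which of the four restricted shapes the background graph has."""
    nodes, ind, outd = degrees(edges)
    out_le1 = all(d <= 1 for d in outd.values())
    in_le1 = all(d <= 1 for d in ind.values())
    dag = is_acyclic(edges)
    return {
        "disjoint_paths": out_le1 and in_le1 and dag,
        "forest": out_le1 and dag,  # child -> parent orientation (assumption)
        "disjoint_cycles": all(d == 1 for d in outd.values())
                           and all(d == 1 for d in ind.values()),
        "function_graph": all(d == 1 for d in outd.values()),
    }
```

Note the containments this makes visible: a set of disjoint paths is in particular a forest, and a set of disjoint cycles is in particular a function graph, matching the progression of classes studied above.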
We consider the problem of hypothesis finding in this setting, which turns out to be polynomially solvable if the background knowledge is a union of paths and the target predicate is unary, but NP-complete if the background knowledge may be a forest or a union of cycles (with the target still unary). To illustrate how the product homomorphism method can be used to derive a domain-specific efficient learning algorithm for a real-world problem, we consider its application to the protein secondary structure prediction problem. We first show how to map the primary structures of proteins to colored path graphs. Using the product homomorphism method, we then give an efficient hypothesis finding algorithm for the class of logic programs where the ground atoms in the background knowledge form a set of disjoint colored directed paths. Since finding a consistent hypothesis consisting of multiple clauses is NP-complete, we show how to approximate the solution by combining domain-specific local search with the efficient single-clause hypothesis finding algorithm obtained via the product homomorphism method.
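The first step of the application, mapping a primary structure to a colored path, is straightforward to sketch: residue positions become nodes, each node is colored by its amino acid, and edges connect consecutive residues. The concrete encoding below is our illustration of this idea, not necessarily the exact representation used in the work.

```python
def primary_to_colored_path(sequence):
    """Encode a protein's primary structure as a colored directed path.

    Returns (colors, edges): `colors` maps residue position i to the
    amino acid letter at that position, and `edges` links consecutive
    positions. (Illustrative encoding; details are an assumption.)
    """
    colors = {i: aa for i, aa in enumerate(sequence)}
    edges = [(i, i + 1) for i in range(len(sequence) - 1)]
    return colors, edges

# e.g. primary_to_colored_path("MKV")
#   -> ({0: 'M', 1: 'K', 2: 'V'}, [(0, 1), (1, 2)])
```

A database of proteins then yields a set of disjoint colored directed paths, which is exactly the background class for which the single-clause hypothesis finding algorithm above applies.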