Query by Browsing

Alan Dix and Andrew Patrick
At time of publication: HCI Group and Dept. of Computer Science,
University of York

Full reference:

A. Dix and A. Patrick (1994). Query By Browsing. Proceedings of IDS'94: The 2nd International Workshop on User Interfaces to Databases, Ed. P. Sawyer. Lancaster, UK, Springer Verlag. 236-248.
https://alandix.com/academic/papers/QbB-IDS94/ (html)

Abstract

The paper describes Query-by-Browsing (QbB), which uses machine learning techniques to generate a database query from examples of interesting records supplied by the user. Unlike Query-by-Example, in QbB the user's principal focus is on the actual data rather than on a schema for the database query. The system employs a dual representation, both displaying a textual form of the inferred query and highlighting the selected records. This is important so that the user can recognise whether the inferred query is indeed what is required. It also fulfils an important tutorial rôle, as the user can learn the form of textual queries.


1. Introduction

Most commercial database systems now include some form of Query-by-Example (QBE) (Zloof 1975). However, this paper argues that QBE is not really 'by example' at all. A new form of querying process is described, Query-by-Browsing (QbB). This works by allowing the user to browse a database and select examples of records. The system then uses machine learning techniques to generate the query from the user's examples.

The basic concept of QbB was described as part of a previous paper (Dix 1992). However, the purpose of that paper was to highlight the potential user interface pitfalls of what appears to be a straightforward application of existing technology. Also, at that time, no implementation of QbB existed.

This paper describes the technical details of a prototype version of QbB and some of the problems observed. The paper has a dual thesis. On the one hand, we want to show that QbB is a viable technique. On the other hand, we want to show that the application of artificial intelligence techniques to the user interface is more than the adoption of technical solutions, but requires careful crafting of both the user interaction and the 'intelligent' algorithms.

The rest of this section describes related work, including traditional Query-by-Example. Section 2 describes the external interface of Query-by-Browsing, which includes a dual representation of queries; Section 3 gives the rationale for this key feature of the design. QbB uses inductive learning to generate a query, the details of which are covered in Section 4. Finally, Section 5 discusses some of the problems encountered and some potential solutions. Many of these problems had been anticipated, confirming the importance of the issues raised in (Dix 1992).


Related work

There have been several systems which use automatic learning to assist in the browsing of large databases, such as on-line news services. In particular, latent semantic indexing can be used to produce measures of similarity between free-text records and hence help guide a reader's browsing (Foltz 1990). The closest work to QbB is perhaps Knowbot (Sharif Heger and Koen 1991), which uses a connectionist approach to generate measures of similarity between records of nuclear accident reports. It is built over a standard relational model of data and is designed to be applied to the large incident databases compiled by nuclear regulatory boards. The differences between this and QbB are discussed later.

In a previous paper (Dix 1992), one of the authors used QbB as an example of the problems of intelligent interfaces. What appears to be a simple technical fix to a problem has, on deeper reflection, many potential problems at the user interface. These are worst when 'black box' techniques, such as neural networks, are used, as the user then does not know why the system responds as it does. Even where, as in QbB, a more comprehensible learning technique is used, one expects problems in managing the dialogue between user and system. The real problem is that AI is not just artificial intelligence, it is non-human - an alien intelligence. Seen in this light, it is not surprising that communication problems arise.


Query not by example

The details of Query-by-Example vary among different database systems, but are essentially as follows.

The process starts with some sort of template giving the headings of the entries in the desired table of results. For example, it might consist of the headings Department, Name and Salary. These headings might correspond to attributes from a single database relation, or might be obtained from several as the result of a join. The specification of this template is itself a non-trivial process, but for the purpose of this paper the user interface for this phase will not be discussed further.

Given such a template, the user must then fill in the slots to produce a sort of archetypal line of the listing:

Name        Department       Salary
            = 'Accounts'     > 15000

Formulating this query is clearly much simpler for the naïve user than the equivalent textual query:

SELECT Department, Name, Salary
WHERE Department = 'Accounts'
      and Salary > 15000

However, it still demands that the user thinks in terms of queries. In both textual queries and QBE, the user has to formulate the desired listing in abstract terms and be able to predict from the query what records will be in the final listing. As many database systems are not incremental in displaying the results of queries, the user may have to wait for the whole query to be processed before being able to see whether the resulting listing is as required.

In short, QBE is little more than an alternative format for a standard SQL-like textual query, and is not really 'by example' at all.


2. Query by Browsing - the interface

Like QBE, Query-by-Browsing assumes that the user has already defined an appropriate template. However, rather than starting off with a blank form for the user to fill in, QbB generates a complete on-screen listing of all records. In a large database this listing can be generated incrementally as the user browses through it, so the cost need not be prohibitive. The user then browses the listing and marks those records which are required and those which are not. In Figure 1, the user has ticked the records of interest and put a cross against others. Only a representative sample of records has been marked by the user.


Figure 1. Query-by-Browsing - user ticks interesting records


At some point the system infers the pattern underlying the user's selections. It then highlights all the records which correspond to the pattern and displays a textual form of the query it has inferred. At this point the user has merely to verify that the query the system has generated is correct and, if so, to confirm it with the click of a button. This is shown in Figure 2.


Figure 2. Query-by-Browsing - system highlights inferred selection


Note that the user has not had to generate an abstract query. Instead, from examples of relevant records the system has generated the query. The user only has to recognise that it is correct.

Unlike standard Query-by-Example, Query-by-Browsing really is 'by example'.


Continuing interaction

When the system has inferred a query, the user has four options:

  1. Accept the query by clicking the 'yes' button. This completes the query construction process; the listing is then reduced so that only the selected records are displayed.
  2. Ignore the query and continue entering ticks and crosses. The system may at some stage change the presented query when it has been given more examples to work on.
  3. Reject the query by clicking the 'no' button, thus telling the system that it has got it completely wrong. The user can then continue to enter ticks and crosses as in the previous option. However, the system has the added clue that the user's intended query is nothing like the one it presented!
  4. Refine the query by clicking the 'Yes but ...' button. This then allows the user to give more guidance as to what is right and what is wrong with the query. The simplest such refinements are 'too many' and 'too few'. 'Too many' tells the system that it has selected all the desired records, but some more besides; the user will thereafter only enter negative examples from the selected records and the system will retain the query but try to make it stronger. 'Too few' is similar, but the user enters more positive examples and the system weakens the query. A sketch of the whole interaction loop is given below.
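
To make this dialogue concrete, the following Python sketch shows one possible shape for the interaction loop. The names used (infer_query, present, get_user_action) are ours, not part of any described implementation: the paper specifies the dialogue, not an API.

def qbb_session(records, infer_query, present, get_user_action):
    """Drive the tick/cross dialogue until the user accepts a query."""
    marks = {}          # record id -> True (tick) or False (cross)
    rejected = set()    # queries the user has answered 'no' to
    hint = None         # None, 'too many' or 'too few'
    while True:
        query = infer_query(records, marks, rejected, hint)
        if query is not None:
            present(query)            # highlight matches, show query text
        action, detail = get_user_action()
        if action == 'yes':           # option 1: accept - we are done
            return query
        elif action == 'mark':        # option 2: another tick or cross
            record_id, wanted = detail
            marks[record_id] = wanted
        elif action == 'no':          # option 3: never show this query again
            rejected.add(query)
        elif action == 'yes but':     # option 4: refine, 'too many'/'too few'
            hint = detail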

So long as the query is right, this interaction is quite smooth. However, as predicted in (Dix 1992), it is much more difficult to design a graceful interaction when the query is wrong.


3. Dual representation

In Figure 2, the system's inferred selection is shown in two ways: by highlighting and by the display of the inferred query. The reasons for this dual representation are described in detail below.

Note, however, that the precise form of highlighting and of query display is not critical. For example, we could substitute the textual display of the query with a QBE template that has been filled in by the system, or with a natural language rendering of the query.


Precise and imprecise domains

In other domains where intelligent systems have been used to aid the browsing or searching of databases, the matching has only been required to be useful, not exact. The users of on-line news services cannot possibly read all the available information. Any facility to prioritise and present more relevant articles will be helpful. Even if an automatic aid sometimes fails to present articles which are wanted, or presents those which are not, it is still better than nothing!

Even in the safety-critical domain of nuclear accidents, Knowbot can afford to be imprecise. As with the news service, the user cannot expect to see all accident reports, and thus the interface is successful if the user views more relevant reports than before. However, the critical nature of this application favours more inclusive selections, and this is reflected in the Knowbot interface. The user initially makes a 'query' by selecting keywords. The system then selects all records which match the keywords, but in addition it displays records which are semantically close to the user's original selection.

In both cases the only disadvantage of displaying extra records is that they act as noise for the reader. In neither case does the user have a fixed idea of which records are 'right'; there is only a fuzzy notion of more or less interesting records.

Contrast this with generating a standard database query. Now there is a precise set of records which are to be selected. For example, if we want to select all the employees with certain characteristics in order to give them a pay rise, it is critical that we select exactly those who match the criteria; a fuzzy match is not acceptable. In addition, the user needs to be able to verify that the match is as desired.

Hence we see the importance of displaying the query. Even if the system has correctly inferred the user's desired selection, the user must know that it has. Merely browsing the highlighted records would tell the user that the query is correct for the records which have been seen, but unless the user browses the whole database (in which case, why use a query at all?) it is not possible to be sure that the query is exactly as required. Furthermore, if the query is to be reused as the database changes, even an exhaustive search of the database is insufficient to confirm its correctness. For example, in Figure 2, one cannot tell from the highlighted selection alone whether the selected salary is '> 15000' or '> 15500'.


Recognition vs. generation

The astute reader may wonder at the apparent contradiction. On the one hand, we say that users find queries complex and hence need some new method (QbB) of generating queries really by examples. On the other hand, we also assert that the user needs to be able to verify the correctness of the query.

However, there is a great difference between the effort and knowledge needed to generate a query and that needed to recognise that a query is indeed what is wanted. This difference is important at various levels.

Even though it is easier to recognise a query than to generate one, there are still residual problems. The user may be confused by the meaning of logical connectives, especially where the natural language meaning differs from the mathematical meaning. Also, it is well known that people find it hard to understand statements involving negation, especially double negatives:

"Select all employees except those where the department is 'Accounts' and the manager is not 'Thomas'"

It is here that the dual representation becomes important. If the user is uncertain about the interpretation, a glance at a few key records can confirm which reading is correct.


Learning

Last, but not least, dual representation can perform an important tutoring rôle. By seeing the query presented as well as the selection highlighted, the naïve user can begin to learn the appearance of textual or QBE queries, and eventually perhaps move on to become a 'power user'.


4. Inductive learning - the internals

The current implementation of QbB uses a variant of Quinlan's ID3 algorithm (Quinlan 1986a). The ID3 algorithm takes a number of example data values drawn from different classes and builds a decision tree which can be used to sort unseen examples into their appropriate classes. That is, it learns how to recognise the class of a data value. In the case of QbB there are only two classes: wanted (ticked) and not wanted (crossed).

ID3 is used because its data model, being based on frame-and-slot models as used in artificial intelligence, is very similar to the tuples of relational databases. Also, the decision trees generated are relatively easy to represent as database queries.

In the original ID3 each attribute had only a finite number of possible values; for example, a colour attribute might be in the set { red, blue, green } or a rating in the set { 1, 2, 3, 4, 5, 6 }. However, in common with other variants of ID3, QbB uses a richer range of data types: attributes can in addition have general string values or be integers. These were chosen as representative of most other kinds of data type. For example, dates and real numbers behave very similarly to integers in that they can be tested both for equality and for order. Figure 3 shows an example data set suitable for the algorithm.


A database for a bank might have the following attributes:

forename         string
surname          string
title            enumerated: { Dr, Mrs, Mr, Ms, Miss }
balance          integer
overdraft-limit  integer

A typical set of records would be:

   forename   surname    title   balance   overdraft-limit
1. Jo         Parkin     Ms          575         0
2. John       McCawber   Mr         -101      -100
3. Liz        Windsor    Mrs       27629     -9999
4. Peter      Coney      Dr         -730     -1000
5. Gladys     Green      Ms          -15         0
6. Bob        Maxwell    Mr        -7329     -5000

Figure 3. Example bank database


In the decision tree built by the algorithm, each internal node contains a decision based on the values of a datum's attributes, and the branches are labelled by the possible outcomes of that decision. In the original ID3 the branches were labelled by the possible values of an attribute; for example, a decision might be of the form 'colour = ?' with branches for 'red', 'blue' and 'green'. In QbB all decisions are binary (yes/no) so that the different forms of decision can be treated uniformly. The leaves of the tree are labelled by the classes, which in the case of QbB are simply wanted and not wanted.


Figure 4. Example decision tree


Consider the decision tree in Figure 4, which can be used to classify the records from Figure 3. It represents a query for all customers who have exceeded their overdraft limit and are at least 100 pounds overdrawn, excluding anyone called 'Maxwell'. (Perhaps Maxwell owns the bank.) Underneath each leaf node is the list of records which are classified at that node.
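
The paper does not give an implementation, but a decision tree of this kind is easy to represent directly. The following Python sketch (structure and names of our own devising) encodes Figure 4's tree, classifies records with it, and renders it textually by collecting the conditions along each path to a 'wanted' leaf - the same tree-to-query translation needed for the dual representation of Section 3.

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    wanted: Optional[bool] = None                 # set on leaves only
    test: Optional[Callable[[dict], bool]] = None
    description: str = ''
    yes: Optional['Node'] = None
    no: Optional['Node'] = None

def classify(node, record):
    """Walk from the root, following yes/no branches, to a leaf."""
    while node.wanted is None:
        node = node.yes if node.test(record) else node.no
    return node.wanted

def to_query(node, conds=()):
    """Render the tree textually as the paths to the 'wanted' leaves."""
    if node.wanted is not None:
        return [' and '.join(conds)] if node.wanted else []
    return (to_query(node.yes, conds + (node.description,)) +
            to_query(node.no, conds + ('not (' + node.description + ')',)))

# Figure 4: exceeded overdraft limit, at least 100 pounds overdrawn,
# and not called Maxwell.
FIGURE_4 = Node(
    test=lambda r: r['balance'] < r['overdraft-limit'],
    description='balance < overdraft-limit',
    no=Node(wanted=False),
    yes=Node(
        test=lambda r: r['balance'] > -100,
        description='balance > -100',
        yes=Node(wanted=False),
        no=Node(
            test=lambda r: r['surname'] == 'Maxwell',
            description='surname = Maxwell',
            yes=Node(wanted=False),
            no=Node(wanted=True))))

# Record 2 of Figure 3 is wanted; record 6 (Maxwell) would not be.
print(classify(FIGURE_4, {'balance': -101, 'overdraft-limit': -100,
                          'surname': 'McCawber'}))       # True
print(' or '.join(to_query(FIGURE_4)))
# balance < overdraft-limit and not (balance > -100) and not (surname = Maxwell)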

Notice that the decisions allowed include comparisons of attributes with constants, 'balance > -100' and 'surname = Maxwell', and comparisons between attributes, 'balance < overdraft-limit'. Variants of the ID3 algorithm have normally included only the former. However, the use of inter-attribute matching adds considerable power. We have seen this for numeric comparison, but equality tests between strings can also be important. For example, the bank's auditor might want to check records where large overdrafts were authorised by sub-managers with the same surname as the customer!
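
As an illustration, candidate binary decisions of these kinds might be generated along the following lines. This is our own sketch of an assumed detail: equality tests against each observed value for strings and enumerations, threshold tests between adjacent observed values for integers, and order comparisons between pairs of integer attributes. (Here thresholds are simple midpoints; as noted in Section 5, the actual implementation prefers round numbers.)

def candidate_decisions(records, fields):
    """Return (description, predicate) pairs for binary decisions."""
    tests = []
    for f, kind in fields.items():
        values = sorted({r[f] for r in records})
        if kind in ('enumerated', 'string'):
            for v in values:                      # attribute = constant
                tests.append((f + " = " + str(v),
                              lambda r, f=f, v=v: r[f] == v))
        elif kind == 'integer':
            for lo, hi in zip(values, values[1:]):
                t = (lo + hi) // 2                # threshold between values
                tests.append((f + " > " + str(t),
                              lambda r, f=f, t=t: r[f] > t))
    integers = [f for f, k in fields.items() if k == 'integer']
    for i, a in enumerate(integers):              # attribute vs attribute
        for b in integers[i + 1:]:
            tests.append((a + " < " + b,
                          lambda r, a=a, b=b: r[a] < r[b]))
    return tests

fields = {'surname': 'string', 'balance': 'integer',
          'overdraft-limit': 'integer'}
records = [{'surname': 'Parkin', 'balance': 575, 'overdraft-limit': 0},
           {'surname': 'McCawber', 'balance': -101, 'overdraft-limit': -100}]
for description, _ in candidate_decisions(records, fields):
    print(description)
# surname = McCawber, surname = Parkin, balance > 237,
# overdraft-limit > -50, balance < overdraft-limit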

The ID3 algorithm builds the tree top down. Imagine the user has marked records 2 and 6 as interesting and the rest as uninteresting. The algorithm first looks at all single decisions. If one of these simple decisions splits the data perfectly into the two classes, it will be chosen. In fact, for the sample data set, the decision 'title = Mr' would do this and would lead to a very simple tree. However, let us assume that the learning algorithm has been forbidden from using gender information. In this case, the algorithm looks at the imperfect decisions and chooses the best as measured by an entropy-based measure. Here the decision 'balance < overdraft-limit' splits the data so that the 'no' partition has three uninteresting records and the 'yes' partition has two interesting records and one uninteresting one. This is not far from perfect and it is thus chosen as the first decision.

The algorithm then considers each branch. Down the 'no' branch all the records are from the same class, so it becomes a leaf node classifying records as uninteresting. The 'yes' branch is more complex as it has mixed records. The algorithm is applied recursively to these three records (2, 5 and 6), looking for another simple decision to split them. The decision 'balance > -100' does this perfectly and is thus chosen, leading to two new leaf nodes. The resulting tree is a simplified version of Figure 4 without the last test. A sketch of this process is given below.
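
The worked example can be reproduced in a few lines of Python. This is a minimal ID3-style sketch, not the paper's implementation: the candidate decisions are written out by hand (omitting 'title = Mr', as in the text) and ties in the entropy score are broken by list order.

import math

RECORDS = {
    1: {'balance':   575, 'overdraft': 0,     'surname': 'Parkin'},
    2: {'balance':  -101, 'overdraft': -100,  'surname': 'McCawber'},
    3: {'balance': 27629, 'overdraft': -9999, 'surname': 'Windsor'},
    4: {'balance':  -730, 'overdraft': -1000, 'surname': 'Coney'},
    5: {'balance':   -15, 'overdraft': 0,     'surname': 'Green'},
    6: {'balance': -7329, 'overdraft': -5000, 'surname': 'Maxwell'},
}
WANTED = {2, 6}                     # the records the user has ticked
DECISIONS = [                       # hand-written candidates, no 'title = Mr'
    ('balance < overdraft-limit', lambda r: r['balance'] < r['overdraft']),
    ('balance > -100',            lambda r: r['balance'] > -100),
    ('surname = Maxwell',         lambda r: r['surname'] == 'Maxwell'),
]

def entropy(ids):
    """Entropy of the wanted/unwanted mix among the given record numbers."""
    p = sum(1 for i in ids if i in WANTED) / len(ids)
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p)
                                         + (1 - p) * math.log2(1 - p))

def build(ids):
    """Recursively build a (decision, yes-subtree, no-subtree) tree."""
    if len({i in WANTED for i in ids}) == 1:      # pure: make a leaf
        return 'wanted' if ids[0] in WANTED else 'not wanted'
    def score(decision):                          # weighted child entropy
        _, test = decision
        yes = [i for i in ids if test(RECORDS[i])]
        no = [i for i in ids if not test(RECORDS[i])]
        if not yes or not no:
            return float('inf')                   # split does nothing
        return (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(ids)
    name, test = min(DECISIONS, key=score)
    yes = [i for i in ids if test(RECORDS[i])]
    no = [i for i in ids if not test(RECORDS[i])]
    return (name, build(yes), build(no))

print(build(list(RECORDS)))
# ('balance < overdraft-limit',
#  ('balance > -100', 'not wanted', 'wanted'),
#  'not wanted')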


5. Problems, alternatives and future developments

As we said in the introduction, there are two reasons for producing a prototype of Query-by-Browsing. First, we think Query-by-Browsing is a potentially powerful technique for naïve database users. Second, we want to investigate the potential problems outlined in (Dix 1992) concerning the interaction between human users and 'intelligent' systems. The problems we have encountered are thus as valuable as the successes!

The discussion of these problems and potential future directions is divided into those principally concerned with the learning algorithms and those concerned with the user interaction. However, the reader will notice that the separation is not perfect and indeed, one would expect that the design of an effective system would involve an intimate connection between the interface and the 'intelligent' engine within.


Types of decision

We saw in the example above how the naïve application of inductive learning could come up with a spurious decision tree based on the title of the customer. Note that this is not a wrong answer: it does explain the examples. In general, there can be several different decision trees which explain the same example set. In the context of an interactive session this is not critical, as the user can always add more examples. However, if the initial decisions are really obscure this can both confuse the user and reduce the user's confidence in the system. Even experiments using more knowledge-rich learning algorithms have found that automatically generated classification criteria are both different from human criteria and hard for the human reader to comprehend (Hanson 1990). This is a very difficult problem, but some of the very worst cases can be addressed.

One reason for the production of obscure trees is that the number of examples from browsing is relatively small compared with those generally expected by inductive learning systems. Indeed, one of Quinlan's papers includes the phrase 'large collections of examples' in its title (Quinlan 1979)! The small number of examples means that by the time one gets to the leaves of a complex tree there may be only one or two records in each class on which to base the learning. Obviously these may be distinguished by virtually anything. One solution to this is for the system to refrain from presenting a query until it has a certain level of confidence in its solution.

A more interactive solution is for the system to assign a confidence to each leaf of the tree. A possible measure to use for this is the 'gain ratio' described in (Quinlan 1988). The system can then specifically mark those records (say with a question mark) which it would be most useful for the user to classify. Alternatively, it could explicitly ask the user 'do you want this record?'. Hopefully a small number of records classified in the uncertain areas of the tree will serve to build confidence and generate a robust query. This is in fact very close to the interaction of a user consulting a database expert who is constructing the query: the expert would explicitly probe the boundaries of the query in order to check that it was as required. A sketch of the gain ratio calculation is given below.
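
As a concrete illustration, the gain ratio of a split can be computed as the information gain normalised by the entropy of the split itself. The sketch below follows that standard definition; how QbB would map such per-split scores onto per-leaf confidences is our assumption, not a detail given in the paper.

import math

def class_entropy(counts):
    """Entropy of a count vector, e.g. a (wanted, unwanted) pair."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def gain_ratio(parent, children):
    """parent: (wanted, unwanted); children: one such pair per branch."""
    total = sum(parent)
    remainder = sum(sum(c) / total * class_entropy(c) for c in children)
    gain = class_entropy(parent) - remainder           # information gain
    split_info = class_entropy([sum(c) for c in children])
    return gain / split_info if split_info else 0.0

# First split of the worked example: 2 wanted + 4 unwanted records split
# into a 'yes' branch of (2 wanted, 1 unwanted) and a 'no' branch of (0, 3).
print(gain_ratio((2, 4), [(2, 1), (0, 3)]))            # about 0.46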

In order to make the queries more 'human', the inductive learning algorithm can be biased towards certain types of query. For example, some of Quinlan's work on ID3 has investigated biasing the algorithm towards degenerate trees - effectively conjunctive queries - and towards other forms of more comprehensible query (Quinlan 1986b). ID3 derives originally from the Concept Learning System, and other algorithms developed in this area learn different sorts of rules, often over more complex domains. These include purely conjunctive and disjunctive formulas and formulas based on 'internal disjunctive concepts' (Haussler 1990) - that is, rules which are conjunctions of terms of the type:

'department = Accounts or Computing'.

Whereas ID3 will return any tree which separates the classes, some algorithms will give the most specific rule which covers the positive examples - that is, a rule which includes everything that is true of all of them. The query shown in Figure 2 is of this form. The examples shown can be classified by the criterion 'salary > 15000' alone, but the most specific rule also includes the fact that all the positive examples are in the accounts department:

'department = Accounts and salary > 15000'.

In order to make ID3 give this query, negative examples had to be given in other departments (not shown in the screen shot). In principle, a system using most specific criteria can learn mainly from positive examples, which would probably lead to a more natural interaction style for the user.

In addition, we may want to add some knowledge to the learning process. For example, in a banking system we may bias the system to favour queries which depend on monetary fields over those using personal attributes such as title or address. In (Dix 1992) it was argued that inadvertent discrimination is a danger of example-based approaches. QbB clearly portrays what decisions are being made, which reduces (but does not eliminate) this problem. Despite these dangers, it is likely that such knowledge should take the form of a bias rather than a prohibition. For example, a bank's market research may have shown that its male middle-earning customers are likely to purchase redundancy insurance, and so it may want to send insurance brochures to that group. (Whether such direct mailing is socially acceptable is, of course, questionable in itself.) There has been little work on hybrid learning systems which combine rich knowledge with relatively 'blind' techniques such as inductive learning (Michalski and Kodratoff 1990). However, fairly simple techniques should be sufficient to improve the performance of QbB.

In fact, the current implementation does employ some minimal general knowledge: when making integer comparisons it chooses nice numbers. For example, in Figure 2, the examples include a ticked record with salary £16000 and a crossed one with £14500. Rather than choosing the midpoint of these, £15250, the algorithm instead chooses the round number £15000.
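
One plausible way to choose such a round number is to take the coarsest decimal step that yields a threshold strictly between the crossed and ticked values, falling back to the midpoint. This particular rule is our assumption; the paper says only that the implementation chooses nice numbers.

def nice_threshold(lo, hi):
    """Return a round number t with lo < t < hi (integers, lo < hi)."""
    step = 10 ** len(str(hi - lo))       # start coarser than the gap
    while step >= 1:
        t = (lo // step + 1) * step      # smallest multiple of step above lo
        if t < hi:
            return t
        step //= 10
    return (lo + hi) // 2                # no round number fits: use midpoint

print(nice_threshold(14500, 16000))      # 15000, not the midpoint 15250
print(nice_threshold(-101, -15))         # -100, the threshold of Figure 4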

Finally, in this area it is worth emphasising the unsuitability of neural networks as the primary mechanism for QbB. Although they can be highly effective as classifiers, the nature of their decision making is such that they must be regarded as 'black box' methods. It would be virtually impossible to generate a readable query from standard networks, thus ruling out a dual representation. That is, they are possibly acceptable in imprecise domains, but not where the user needs confidence that the classification performed by the system is exactly as required.


Complex interaction

The difficulties with interaction come when the system gets the query wrong. The 'no' button is particularly problematic as it gives no indication of what is wrong. Such an answer would be rather rude if delivered to a human expert, but perhaps one of the advantages of a computer assistant is that one can be as rude as one likes! At present the 'no' answer means that the system never displays that precise query again. However, this might involve a long period of inactivity, as the additional examples that the user enters may all agree with the incorrect query. Because the inductive learning algorithm is deterministic, the system will keep coming up with the same tree until an example is given which disagrees with it. There are two ways out of this impasse. The first is to encourage the user to select helpful records by continuing to show the highlighting of the incorrect query; this suggests marking which of the highlighted records are not wanted and which of the unhighlighted ones are. The other option is to alter the algorithm so that it finds a different tree to explain the examples. One way to achieve this is to make the algorithm slightly random, so that repeated invocations are likely to give different trees, as sketched below. Although deliberately making the interface non-deterministic sounds rather extreme, there are situations where it can be extremely effective (Dix 1991).
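
One simple way to introduce such randomness (our sketch, not the paper's design) is to score all candidate decisions as usual but choose randomly among those within a small tolerance of the best, so that a rejected tree is unlikely to be regenerated.

import random

def pick_decision(scored, tolerance=0.05, rng=random):
    """Choose among near-best decisions; lower scores are better."""
    best = min(score for score, _ in scored)
    near_best = [d for score, d in scored if score <= best + tolerance]
    return rng.choice(near_best)

# The two tied decisions from the worked example of Section 4:
scored = [(0.459, 'balance < overdraft-limit'),
          (0.459, 'balance > -100'),
          (0.601, 'surname = Maxwell')]
print(pick_decision(scored))    # either of the first two, at random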

QbB is an example of a system where autonomous intelligent agents interact with the user. General interface guidelines for such systems are given in (Dix, Finlay et al. 1994). In particular, it is important that it is clear which interface objects (windows, buttons, etc.) are under the control of the user, which belong to the agents, and which are under shared control. We can apply this analysis to the present QbB system.

The listing window is a shared object where the user controls the marking and the agent (inductive learning component) controls the highlighting. The labelled buttons are a communication medium from the user to the agent and the query window is a form of communication from the agent to the user (not to forget that the shared listing window is also a medium of communication).

However, if we want to allow continuity between the novice's example-based interface and an expert's query-based interface, this suggests that the query window should also be shared and that the user ought to be able to make contributions there as well as in the listing.

For example, imagine the query generated in Figure 2 was correct except that the criterion the user was really after was a salary over £15500. Rather than trying to find examples to force the system to the correct value, the user might find it easier simply to edit the query. However, in such cases the system would have to remember which parts of the query had been generated by the user and which by the system, so that subsequent work by the inductive learning agent does not overwrite the user's work. This could be achieved by marking some of the tree's nodes as fixed.


6. Summary

We have described Query-by-Browsing, a novel database query mechanism where the system generates the query. As opposed to traditional Query-by-Example, QbB really does work by example. The dual representation of the system's query, by highlighting the selected records and by a textual representation, ensures that the user understands both the specific and the general implications of the query. In addition, the dual representation may serve an important tutoring rôle for the naïve user.

The current prototype exposes various shortcomings, several of which had been predicted in previous work. It serves as an exemplar of the general design problems where interfaces include autonomous intelligent agents. The prototype demonstrates the care and attention to detail which is needed in the design of such systems. In particular, it emphasises the close connection between the nature of the underlying algorithms and the details of user interaction.

We believe that Query-by-Browsing is a promising method which we hope will one day be included in many database systems alongside QBE and SQL queries.


7. References

  1. A. Dix (1992). Human issues in the use of pattern recognition techniques. In Neural Networks and Pattern Recognition in Human Computer Interaction, Eds. R. Beale and J. Finlay. Ellis Horwood. pp. 429-451.
  2. A. Dix, J. Finlay and J. Hassell (1994). Environments for cooperating agents: Designing the interface as medium. In CSCW and Artificial Intelligence, Eds. J. Connolly and E. Edmonds. Springer Verlag. pp. 23-37.
  3. A. J. Dix (1991). Formal Methods for Interactive Systems. Academic Press.
  4. P. W. Foltz (1990). Using latent semantic indexing for information filters. Proceedings of the Conference on Office Information Systems. pp. 40-47.
  5. S. J. Hanson (1990). Conceptual clustering and categorisation: Bridging the gap between induction and causal models. In Machine Learning: An Artificial Intelligence Approach. Morgan Kaufmann. pp. 235-268.
  6. D. Haussler (1990). Applying Valiant's learning framework to AI concept-learning problems. In Machine Learning: An Artificial Intelligence Approach. Morgan Kaufmann. pp. 641-669.
  7. R. S. Michalski and Y. Kodratoff (1990). Research in machine learning: recent progress, classification of methods, and future directions. In Machine Learning: An Artificial Intelligence Approach. Morgan Kaufmann. pp. 3-30.
  8. J. R. Quinlan (1979). Discovering rules by induction from large collections of examples. In Expert Systems in the Micro-Electronic Age, Ed. D. Michie. Edinburgh University Press. pp. 168-201.
  9. J. R. Quinlan (1986a). Induction of decision trees. Machine Learning, 1(1), pp. 81-106.
  10. J. R. Quinlan (1986b). Simplifying decision trees. International Journal of Man-Machine Studies, 27(4), pp. 221-234.
  11. J. R. Quinlan (1988). Decision trees and multi-valued attributes. In Machine Intelligence 11. Oxford University Press. pp. 305-318.
  12. A. Sharif Heger and B. V. Koen (1991). Knowbot: An adaptive database interface. Nuclear Science and Engineering, 107, pp. 142-157.
  13. M. M. Zloof (1975). Query by example. Proc. AFIPS National Computer Conf. 44. AFIPS Press, New Jersey. pp. 431-438.


Alan Dix 5/7/98