Alan Dix and Andrew Patrick
At time of publication: HCI Group and Dept. of Computer Science, University of York
Full reference:
A. Dix and A. Patrick (1994). Query By Browsing. Proceedings of IDS'94: The 2nd International Workshop on User Interfaces to Databases, Ed. P. Sawyer. Lancaster, UK, Springer Verlag. 236-248.
https://alandix.com/academic/papers/QbB-IDS94/ (html)
See also:
https://alandix.com/academic/topics/qbb/
Abstract
The paper describes Query-by-Browsing (QbB), which uses machine learning techniques to generate a database query from examples of interesting records supplied by the user. Unlike Query-by-Example, in QbB the user's principal focus is on the actual data rather than on a schema for the database query. The system employs dual representation, displaying both a textual form of the inferred query and highlighting the selected records. This is important so that the user can recognise whether the inferred query is indeed what is required. It also fulfils an important tutorial rôle as the user can learn the form of textual queries.
1. Introduction
Most commercial database systems now include some form of Query-by-Example (QBE) (Zloof 1975). However, this paper argues that QBE is not really 'by example' at all. A new form of querying process is described, Query-by-Browsing (QbB). This works by allowing the user to browse a database and select examples of records. The system then uses machine learning techniques to generate the query from the user's examples.
The basic concept of QbB was described as part of a previous paper (Dix 1992). However, the purpose of that paper was to highlight the potential user interface pitfalls of what appears to be straightforward application of existing technology. Also at that time no implementation of QbB existed.
This paper describes the technical details of a prototype version of QbB and some of the problems observed. The paper has a dual thesis. On the one hand, we want to show that QbB is a viable technique. On the other hand, we want to show that the application of artificial intelligence techniques to the user interface is more than the adoption of technical solutions, but requires careful crafting of both the user interaction and the 'intelligent' algorithms.
The rest of this section will describe some related work including traditional Query-by-Example. Section 2 describes the external interface of Query-by-Browsing. This includes a dual representation of queries and Section 3 gives the rationale for this key feature of the design. QbB uses inductive learning to generate a query, the details of which are covered in Section 4. Finally Section 5 discusses some of the problems encountered and some potential solutions. Many of these problems had been anticipated and confirmed the importance of many of the issues raised in (Dix 1992).
There have been several systems which use automatic learning to assist in the browsing of large databases, such as on-line news services. In particular, latent semantic indexing can be used to produce measures of similarity between free text records and hence help guide a reader's browsing (Foltz 1990). The closest work to QbB is perhaps KNOWBOT (Sharif Heger and Koen 1991). This uses a connectionist approach to generate similarity between records of nuclear accident reports. This is built over a standard relational model of data and is designed to be applied to large incident databases compiled by nuclear regulatory boards. The differences between this and QbB are discussed later.
In a previous paper (Dix 1992), one of the authors used QbB as an example of the problems of intelligent interfaces. What appears to be a simple technical fix to a problem has, upon deeper reflection, many potential problems at the user interface. These are worst when 'black box' techniques, such as neural networks, are used, as the user in this case does not know why the system makes its responses. Even where, as in QbB, a more comprehensible learning technique is used, one expects problems managing the dialogue between user and system. The real problem is that AI is not just artificial intelligence; it is non-human - an alien intelligence. Seen in this light, it is not surprising that communication problems arise.
The details of Query-by-Example vary among different database systems, but are essentially as follows. The process starts with some sort of template giving the headings of the entries in the desired table of results. For example, it might consist of the headings Department, Name and Salary. These headings might correspond to attributes from a single database relation, or might be obtained from several as the result of a join. The specification of this template is itself a non-trivial process, but for the purpose of this paper the user interface for this phase will not be discussed further.
Given such a template, the user must then fill in the slots to produce a sort of archetypal line of the listing.
Name      Department      Salary
          = 'Accounts'    > 15000
Formulating this query is clearly much simpler for the naïve user than the equivalent textual query:
SELECT Department, Name, Salary WHERE Department = 'Accounts' and Salary > 15000
However, it still demands that the user thinks in terms of queries. In both textual queries and QBE, the user has to formulate the desired listing in abstract terms and be able to predict from the query what records will be in the final listing. As many database systems are not incremental in displaying the results of queries, the user may have to wait for the whole query to be processed before being able to see whether the resulting listing is as required.
In short, QBE is little more than an alternative format for a standard SQL-like
textual query, and is not really 'by example' at all.
2. Query by Browsing - the interface
Like QBE, Query-by-Browsing assumes that the user has already defined
an appropriate template.
However, rather than starting off with a blank form for the user to fill in,
QbB generates a complete on screen listing of all records.
In a large database, this listing can be generated incrementally as the
user browses through the listing, so the cost need not be prohibitive.
The user then browses the listing and marks those records which are
required and those that are not.
In Figure 1, the user has ticked the records of interest and
has put a cross against others.
Only a representative sample of the records has been marked by the user.
Figure 1. Query-by-Browsing - user ticks interesting records
At some point the system infers the pattern underlying the user's selections.
It then highlights all those records which correspond to the pattern
and displays a textual form of the query it has inferred.
At this point, the user has merely to verify that the query the system has
generated is correct and if so confirm it by the click of a button.
This is shown in Figure 2.
Figure 2. Query-by-Browsing - system highlights inferred selection
Note that the user has not had to generate an abstract query.
Instead, from examples of relevant records the system
has generated the query.
The user only has to recognise that it is correct.
Unlike standard Query-by-Example, Query-by-Browsing really is 'by example'.
Continuing interaction
When the system has inferred a query, the user has four options:
So long as the query is right, this interaction is quite smooth.
However, as predicted in (Dix 1992), it is much more
difficult to design a graceful interaction when the query is wrong.
3. Dual representation
In Figure 2, the system's inferred selection is shown
in two ways, by highlighting and by the display of the inferred
query.
Some of the reasons for this dual representation are described in detail
below, but in short:
Note however, that the form of highlighting and of query display
is not critical.
For example, we could substitute the textual display of the query
with a QbB template that has been filled in by the system,
or with a natural language rendering of the query.
Precise and imprecise domains
In similar domains where intelligent systems have been used to
aid in the browsing or searching of databases, the matching has
only been required to be useful, not exact.
The users of on-line news services cannot possibly read all the
available information.
Any facility to prioritise and present more relevant articles will be
helpful.
Even if an automatic aid sometimes fails to present articles which are wanted
or presents those which are not, it is still better than nothing!
Even in the safety critical domain of nuclear accidents,
Knowbot can still afford to be imprecise.
As with the news service, the user cannot expect to see all accident reports
and thus the interface is successful if the user views more relevant reports
than before.
However, the critical nature of this application favours more inclusive
selections and this is reflected in the Knowbot interface.
The user originally makes a 'query' by selecting keywords.
The system then selects all those records which match
the keywords, but in addition it displays records which are semantically
close to the user's original selection.
In both cases the only disadvantage of displaying extra records is that they
act as noise for the reader,
and in neither case does the user have a fixed idea of which records
are 'right'; there is only a fuzzy notion of more or less interesting records.
Contrast this with generating a standard database query.
Now there is a precise set of records that are to be addressed.
For example, if we want to select all the employees with certain characteristics
in order to give them a pay rise, it is critical that we select
exactly those which match the criteria; a fuzzy match is not acceptable.
In addition, the user needs to be able to verify that the match is as desired.
Hence we see the importance of displaying the query.
Even if the system has correctly inferred the user's desired selection,
the user must know that it has.
Merely browsing the highlighted records would tell the user that
the query is correct for the records which have been seen,
but unless the user browses the whole database (in which case why use a query!)
it is not possible to be sure that the query is exactly as required.
Furthermore, if the query is to be reused as the database changes, even an
exhaustive search of the database is insufficient to confirm the correctness
of the query.
For example, in Figure 2, one cannot tell from the highlighted
selection whether the selected salary criterion is '> 1500' or '> 1550'.
Recognition vs. generation
The astute reader may wonder at the apparent contradiction.
On the one hand, we say that users find queries complex and hence need some
new method (QbB) of generating queries really by examples.
On the other hand, we also assert
that the user needs to be able to verify the correctness of the query.
However, there is a great difference between the effort and knowledge
needed to generate a query and that needed to recognise that
a query is indeed what is wanted.
This difference is important at various levels.
Even though it is easier to recognise a query than to generate it,
there are still residual problems.
The user may be confused as to the meaning of logical connectives,
especially where the natural language meaning differs from the mathematical
meaning.
Also it is well known that people find it hard to understand statements
including negation, especially double negatives:
''Select all employees except those where department is 'Accounts' and the manager is not 'Thomas'''
It is here that the dual representation becomes important.
If the user is uncertain about the interpretation a glance at a few key
records can confirm which interpretation is correct.
Learning
Last, but not least, dual representation can perform an important
tutoring rôle.
By seeing the query presented as well as the selection highlighted,
the naïve user can begin to learn the appearance of
textual or QBE queries, and eventually perhaps move
on to become a 'power user'.
4. Inductive learning - the internals
The current implementation of QbB uses a variant of Quinlan's ID3
algorithm (Quinlan 1986a).
The ID3 algorithm takes a number of examples of data values taken from
different classes and builds a decision tree which can be used to sort unseen
examples into their appropriate classes.
That is, it learns how to recognise the class of a data value.
In the case of QbB we only have two classes: wanted (ticked) and not wanted
(crossed).
ID3 is used because its data model, being based on frame-and-slot models
as used in artificial intelligence, is very similar to the tuples
of relational databases.
Also, the decision trees generated are relatively easy to represent as database
queries.
In the original ID3 each attribute had only a finite number of
possible values, for example, a colour attribute might be in the set
{ red, blue, green } or a rating in the set { 1, 2, 3, 4, 5, 6 }.
However, in common with other variants of ID3, QbB uses a richer range
of data types.
Attributes can in addition have general string values or be integers.
These were chosen as representative of most other kinds of data type.
For example, dates or real numbers behave very similarly to integers
in that they can be compared for equality or inequality.
Figure 3 shows an example data set suitable for the algorithm.
A database for a bank might have the following attributes:

forename         string
surname          string
title            enumerated: { Dr, Mrs, Mr, Ms, Miss }
balance          integer
overdraft-limit  integer

A typical set of records would be:

   forename  surname   title  balance  overdraft-limit
1. Jo        Parkin    Ms       575        0
2. John      McCawber  Mr      -101     -100
3. Liz       Windsor   Mrs    27629    -9999
4. Peter     Coney     Dr      -730    -1000
5. Gladys    Green     Ms       -15        0
6. Bob       Maxwell   Mr     -7329    -5000

Figure 3. Example bank database
In the decision tree built by the algorithm each internal node contains
a decision based on the values of attributes of a datum.
The branches are labelled by the possible choices of that decision.
In the original ID3 the branches were labelled by the possible values of an
attribute, for example, a decision might be of the form 'colour = ?'
with branches for 'red', 'blue' and 'green'.
In QbB all the branches are binary (yes/no) decisions so that all the different
forms of decisions can be treated equally.
The leaves of the tree are labelled by the classes and in the case of QbB these
are simply wanted and not wanted.
Figure 4. Example decision tree
Consider the decision tree in Figure 4
which can be used to classify the records from Figure 3.
This represents a query for all customers who have exceeded their overdraft
limit and are at least 100 pounds overdrawn excluding anyone called 'Maxwell'.
(Perhaps Maxwell owns the bank.)
Underneath each leaf node is the list of records which are classified
at that node.
Notice that the decisions allowed include comparisons of attributes
with constants: 'balance > -100' and 'surname = Maxwell',
and between attributes: 'balance < overdraft-limit'.
Variants of the ID3 algorithm have normally included only the former.
However, the use of inter-attribute matching adds considerable power.
We have seen this for numeric comparison, but also equality tests between
strings can be important.
For example, the bank's auditor might want to check records where
large overdrafts were authorised by sub-managers with the same surname as
the customer!
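As noted above, such trees lend themselves to a direct representation, and the paths to 'wanted' leaves can be read off as a textual query. The following Python sketch is purely illustrative (the node structure and rendering are our own choices, not the prototype's code); it encodes the Figure 4 tree and renders it as a WHERE-style condition:

```python
# Illustrative QbB-style decision tree (not the prototype's actual code).
# Internal nodes hold a binary decision; leaves say wanted / not wanted.

class Leaf:
    def __init__(self, wanted):
        self.wanted = wanted

class Node:
    def __init__(self, test, sql, yes, no):
        self.test = test   # function: record -> bool
        self.sql = sql     # textual form of the decision
        self.yes = yes     # subtree for records where the test holds
        self.no = no       # subtree for the rest

def classify(tree, record):
    """Walk the tree to decide whether a record is wanted."""
    while isinstance(tree, Node):
        tree = tree.yes if tree.test(record) else tree.no
    return tree.wanted

def to_query(tree, path=()):
    """Read the query off the tree: one conjunction per path that
    ends at a 'wanted' leaf, the whole query being their disjunction."""
    if isinstance(tree, Leaf):
        if not tree.wanted:
            return []
        return [" AND ".join(path) if path else "TRUE"]
    return (to_query(tree.yes, path + (tree.sql,)) +
            to_query(tree.no, path + ("NOT (" + tree.sql + ")",)))

# The tree of Figure 4: overdrawn beyond the limit, by at least
# 100 pounds, and not called Maxwell.
fig4 = Node(lambda r: r["balance"] < r["overdraft-limit"],
            "balance < overdraft-limit",
            Node(lambda r: r["balance"] > -100, "balance > -100",
                 Leaf(False),
                 Node(lambda r: r["surname"] == "Maxwell",
                      "surname = 'Maxwell'",
                      Leaf(False), Leaf(True))),
            Leaf(False))

record2 = {"surname": "McCawber", "balance": -101, "overdraft-limit": -100}
print(classify(fig4, record2))   # record 2 is wanted: True
print(" OR ".join(to_query(fig4)))
```

Rendering the query as a disjunction of root-to-leaf paths is only one simple choice; a real system would want to simplify the resulting expression (e.g. turning 'NOT (balance > -100)' into 'balance <= -100').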
The ID3 algorithm builds the tree top down.
Imagine the user has marked records 2 and 6 as interesting and
marked the rest uninteresting.
The algorithm first looks at all single decisions.
If one of these simple decisions splits the data perfectly into the two classes,
that one will be chosen.
In fact, for the sample data set, the decision 'title = Mr' would
do this and would lead to a very simple tree.
However, let's assume that the learning algorithm has been forbidden from using
gender information.
In this case, the algorithm examines the non-perfect decisions and looks
for the best as measured by an entropy-based measure.
In this case, the decision 'balance < overdraft-limit' splits the data
so that the 'no' partition has three uninteresting records and the
'yes' partition has two interesting and one uninteresting record.
This is not far from perfect and is thus chosen as the first decision.
The algorithm then considers each branch.
Down the 'no' branch all the records are from the same class and therefore
it becomes a leaf node classifying as uninteresting.
The 'yes' branch is more complex as it has mixed records.
The algorithm is then applied recursively to these three records (2, 5 and 6)
and looks for another simple decision to split them.
The decision 'balance > -100' does this perfectly and is thus chosen
leading to two new leaf nodes.
The resulting tree is a simplified version of Figure 4 without
the last test.
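The build process just described can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the prototype's implementation: candidate decisions are supplied ready-made, only the two decisions from the worked example are offered, and there is no pruning or confidence handling.

```python
import math

def entropy(examples):
    """Shannon entropy (bits) of the wanted/not-wanted mix."""
    n = len(examples)
    result = 0.0
    for cls in (True, False):
        p = sum(1 for _, wanted in examples if wanted == cls) / n
        if p > 0:
            result -= p * math.log2(p)
    return result

def best_decision(examples, decisions):
    """Choose the decision leaving the least weighted entropy."""
    n = len(examples)
    def remainder(decision):
        _, test = decision
        yes = [e for e in examples if test(e[0])]
        no = [e for e in examples if not test(e[0])]
        return sum(len(part) / n * entropy(part)
                   for part in (yes, no) if part)
    return min(decisions, key=remainder)

def build(examples, decisions):
    """Top-down, ID3-style tree building (no pruning, no confidence
    handling -- purity is the only stopping rule)."""
    classes = {wanted for _, wanted in examples}
    if len(classes) == 1:
        return classes.pop()            # leaf: wanted / not wanted
    name, test = best_decision(examples, decisions)
    yes = [e for e in examples if test(e[0])]
    no = [e for e in examples if not test(e[0])]
    if not yes or not no:               # no decision separates them
        return sum(1 for e in examples if e[1]) * 2 > len(examples)
    return (name, build(yes, decisions), build(no, decisions))

# Records 1-6 of Figure 3 (only the numeric fields matter here);
# records 2 and 6 are the ones the user has ticked.
records = [({"balance": 575, "overdraft-limit": 0}, False),
           ({"balance": -101, "overdraft-limit": -100}, True),
           ({"balance": 27629, "overdraft-limit": -9999}, False),
           ({"balance": -730, "overdraft-limit": -1000}, False),
           ({"balance": -15, "overdraft-limit": 0}, False),
           ({"balance": -7329, "overdraft-limit": -5000}, True)]

decisions = [("balance < overdraft-limit",
              lambda r: r["balance"] < r["overdraft-limit"]),
             ("balance > -100", lambda r: r["balance"] > -100)]

print(build(records, decisions))
# ('balance < overdraft-limit', ('balance > -100', False, True), False)
```

Run on records 1-6 with records 2 and 6 ticked, this reproduces the tree described above: 'balance < overdraft-limit' at the root with 'balance > -100' beneath its 'yes' branch.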
5. Problems, alternatives and future developments

As we said in the introduction,
there are two reasons for producing a prototype of Query-by-Browsing.
First we think Query-by-Browsing is a potentially powerful technique
for naïve database users.
However, we also want to investigate the potential problems
outlined in (Dix 1992) concerning the interaction between human users
and 'intelligent' systems.
The problems we have encountered are thus as valuable as the successes!
The discussion of these problems and potential future directions
is divided into those principally concerned with the learning algorithms
and those concerned with the user interaction.
However, the reader will notice that the separation is not perfect
and indeed, one would expect that the design of an effective system would
involve an intimate connection between the interface and the 'intelligent'
engine within.
Types of decision
We saw in the example above
how the naïve application of inductive learning
could come up with a spurious decision tree based on the title of the customer.
Note that this is not a wrong answer: it does explain the examples.
In general, there can be several different decision trees to explain
the same example set.
In the context of an interactive session this is not critical
as the user can always add more examples.
However, if the initial decisions are really obscure this can be both
confusing for the user and reduce the user's confidence in the system.
Even experiments using more knowledge rich learning algorithms have found that
automatically generated classification criteria are both different from
human criteria and hard for the human reader to comprehend (Hanson 1990).
This is a very difficult problem, but some of the very worst cases
can be addressed.
One reason for the production of obscure trees is that the number of examples
from browsing is relatively small compared with those generally expected
by inductive learning systems.
Indeed, one of Quinlan's papers includes the phrase
''large collections of examples'' in its title (Quinlan 1979)!
The small number of examples means that by the time one gets to the leaves of a
complex tree there may be only one or two records in each class on which
to base the learning.
Obviously these may be distinguished by virtually anything.
One solution to this is for the system to refrain from presenting a query
until it has a certain level of confidence in its solution.
A more interactive solution is for the system to assign confidence to
each leaf of a tree.
A possible measure to use for this is the 'gain ratio' as described in
(Quinlan 1988).
The system can then specifically mark those records (say with a question mark)
indicating to the user which records it would be useful for the user to
classify.
Alternatively, it could specifically ask the user 'do you want this record'.
Hopefully a small number of records classified in the uncertain areas of the
tree will serve to build a confidence and generate a robust query.
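The gain-ratio measure itself is standard: the information gain of a split, normalised by the entropy of the partition sizes. A small sketch follows; how the prototype would turn this into a per-leaf confidence is left open, so the example data below is illustrative only.

```python
import math

def info(counts):
    """Entropy (bits) of a class distribution given as counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def gain_ratio(parent, partitions):
    """Quinlan's gain ratio: the information gain of a split,
    normalised by the entropy of the partition sizes themselves
    (which penalises decisions that fragment the data)."""
    n = sum(parent)
    gain = info(parent) - sum(sum(p) / n * info(p) for p in partitions)
    split_info = info([sum(p) for p in partitions])
    return gain / split_info if split_info else 0.0

# Six records, two of them ticked; 'balance < overdraft-limit' splits
# them 3/3, the 'yes' side holding both ticked records plus one other.
print(round(gain_ratio([2, 4], [[2, 1], [0, 3]]), 3))   # 0.459
```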
In fact, this sounds very close to the interaction
of a user consulting a database expert who is constructing the query.
The expert would explicitly probe the boundaries of the query in order to
check that it was as required.
In order to make the query more 'human', the inductive learning
algorithm can be biased towards certain types of query.
For example, some of Quinlan's work on ID3 has investigated the biasing
of the algorithm towards degenerate trees - that is effectively conjunctive
queries and other forms of more comprehensible query (Quinlan 1986b).
ID3 is derived originally from the Concept Learning System and there
have been other algorithms developed in this area which learn different
sorts of rules, often including more complex domains.
This includes purely conjunctive and disjunctive formulas and
formulas based on 'internal disjunctive concepts' (Haussler 1990)
- that is, rules which are conjunctions of terms of the type:
'department = Accounts or Computing'.
Whereas ID3 will return any tree which separates the classes,
some algorithms will give the most specific rule to specify the
positive examples.
That is a rule which includes everything which is true about all of
the examples.
The query shown in Figure 2 is of this form.
The examples shown can be classified by the criterion 'salary > 1500',
but the most specific rule also includes the fact that all the positive
examples are also in the accounts department:
'department = Accounts and salary > 1500'
In order to make ID3 give this query, negative examples had to be given
in other departments (not shown in the screen shot).
In principle, a system using most specific criteria can learn mainly from
positive examples, which would probably lead to a more natural interaction
style for the user.
In addition, we may want to add some knowledge to the learning process.
For example, in a banking system we may bias the system to favour queries
dependent on monetary fields over those using personal attributes such as title
or address.
In (Dix 1992) it was argued that inadvertent discrimination
was a danger of the use of example based approaches.
QbB clearly portrays what decisions are being made, which reduces
(but does not eliminate) this problem.
Despite these dangers, it is likely that knowledge should be in the form of
a bias rather than a prohibition.
For example, a bank's market research may have shown that its male
middle-earning customers are likely to purchase redundancy insurance and
so it may want to send insurance brochures to that group. (Whether
such direct mailing is socially acceptable is, of course,
questionable in itself.)
There has been little work on such hybrid learning systems which employ a
combination of rich knowledge with relatively 'blind' techniques such as
inductive learning (Michalski and Kodratoff 1990).
However, fairly simple techniques should be sufficient to improve the
performance of QbB.
In fact, the current implementation does employ some minimal general knowledge
in that when making integer comparisons it chooses nice numbers.
For example, in Figure 2, the examples have a ticked
record with salary £1600 and a crossed one of £1450.
Rather than choosing the midpoint of these, £1525, the algorithm
instead chooses the round number £1500.
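One plausible way to implement this nice-number choice is to prefer multiples of the largest power of ten that fits between the two example values, breaking ties towards the midpoint. The prototype's actual rule is not documented here, so the heuristic below is only a guess at the flavour of the technique:

```python
def nice_threshold(lo, hi):
    """Choose a 'round' threshold t with lo <= t < hi, so that the test
    'value > t' separates a crossed record at lo from a ticked one at hi.
    Heuristic (our guess, not the prototype's rule): prefer a multiple of
    the largest power of ten that fits, breaking ties towards the midpoint."""
    assert lo < hi
    mid = (lo + hi) / 2
    power = 10 ** 9
    while power > 1:
        first = -(-lo // power) * power   # smallest multiple of power >= lo
        candidates = list(range(first, hi, power))
        if candidates:
            return min(candidates, key=lambda t: abs(t - mid))
        power //= 10
    return lo                             # no round number in range

# The Figure 2 example: crossed record at 1450, ticked record at 1600.
print(nice_threshold(1450, 1600))   # 1500, rather than the midpoint 1525
```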
Finally, in this area it is worth emphasising the unsuitability of
neural networks as the primary mechanism for QbB.
Although they can be highly effective as classifiers, the nature of
their decision making is such that they must be regarded as 'black box'
methods.
It would be virtually impossible to generate a readable query from standard
networks, thus ruling out a dual representation.
That is, they are possibly acceptable in imprecise domains, but not where
the user needs confidence that the classification performed by the system
is exactly as required.
The difficulties with interaction come when the system gets the wrong
query.
The 'no' button is particularly problematical as it gives no indication
of what is wrong.
Such an answer would be rather rude if delivered to a human expert, but perhaps
one of the advantages of a computer assistant is that one can be as rude as one
likes!
At present the 'no' answer means that the system never displays that precise
query again.
However, this might involve a long period of inactivity as the additional examples
that the user enters may all agree with the incorrect query.
As the inductive learning algorithm is deterministic the system will keep on
coming up with the same tree until an example is given which disagrees with it.
There are two ways out of this impasse.
The user can be encouraged to select helpful records by continuing to show
the highlighting of the incorrect query.
This will suggest marking which of the highlighted records aren't wanted and
which of the unhighlighted ones are wanted.
The other option is to alter the algorithm so that it finds a different tree
to explain the examples.
One way to achieve this is to make the algorithm slightly random, so that
repeated invocations are likely to give different trees.
Although deliberately making the interface non-deterministic sounds rather
extreme, there are situations when it can be extremely effective
(Dix 1991).
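A minimal way to introduce such randomness is to perturb only the choice of split: instead of always taking the single best-scoring decision, choose randomly among those whose score is close to the best. The sketch below is our illustration of the idea, not the prototype's code; scores are the weighted-entropy values of Section 4, lower being better.

```python
import random

def pick_decision(scored, tolerance=0.05, rng=random):
    """Choose randomly among the decisions whose score is within
    `tolerance` of the best, so that repeated invocations of the
    learner can produce different but similarly plausible trees."""
    best = min(score for score, _ in scored)
    near_best = [name for score, name in scored if score <= best + tolerance]
    return rng.choice(near_best)

# Scores for the worked example: two decisions tie, one is far worse.
scored = [(0.46, "balance < overdraft-limit"),
          (0.46, "balance > -100"),
          (0.92, "surname = 'Maxwell'")]
print(pick_decision(scored))   # either of the two near-best decisions
```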
QbB is an example of a
system where autonomous intelligent agents interact with the user.
General interface guidelines for such systems are given in (Dix, Finlay et al. 1994).
In particular, it is important that it is clear which interface objects
(windows, buttons, etc.)
are under the control of the user, which belong to the agents, and which
are under shared control.
We can apply this analysis to the present QbB system.
The listing window is a shared object where the user controls
the marking and the agent (inductive learning component) controls the
highlighting.
The labelled buttons are a communication medium from the user to the agent
and the query window is a form of communication from the agent to the user
(not to forget that the shared listing window is also a medium of
communication).
However, if we want to allow continuity between the novice, example based
interface and an expert, query based interface, it suggests that the
query window should also be shared and that the user ought to be able to
make contributions there as well as to the listing.
For example, imagine the query generated in Figure 2 was correct
except that the criterion the user was really after was those with salary
over £1550.
Rather than trying to find examples to force the system to the correct
value, the user might find it easier to simply edit the query.
However, in such cases the system would have to remember which bits of the
query had been generated by the user and which by the system so that
subsequent work by the inductive learning agent does not overwrite the user's
work.
This could be achieved by marking some of the tree's nodes as fixed.
We have described Query-by-Browsing, a novel database query mechanism
where the system generates the query.
As opposed to traditional Query-by-Example, QbB really does work by example.
The dual representation of the system's query by highlighting the selected
records and by textual representation ensures that the user understands both
the specific and general implications of the query.
In addition, dual representation may serve an important tutoring rôle for
the naïve user.
The current prototype exposes various shortcomings, several of which had been
predicted in previous work.
It serves as an exemplar of the general design problems where interfaces
include autonomous intelligent agents.
The prototype demonstrates the care and attention to detail which is needed
in the design of such systems.
In particular, it emphasises the close connection between the nature of the
underlying algorithms and the details of user interaction.
We believe that Query-by-Browsing is a promising
method which we hope will one day be included in many database systems alongside
QBE and SQL queries.
Alan Dix 5/7/98