the plague of bugs

Like some Biblical locust swarm, every attempt to do anything is thwarted by the dead weight of innumerable bugs! This time I was trying … and failing … to upload a Word file into Google docs. I uploaded the docx file and it said the file was unreadable, tried saving it as .doc, and when that failed created an rtf file. Amazingly from a 1 Meg word file the rtf was 66 Meg, but very very slowly Google docs did upload the file and when it was eventually all uploaded …

To be fair the same document imports pretty badly into Pages (all the headings disappear).  I think this is because it is originally a 2003 Word file and gets corrupted when the new Word reads it.

Now I have griped before about backward compatibility issues for Word, and in general about lack of robustness in many leading products, and to add to my woes, for the last month or so (I guess after a software update) Word has decided not to show its formatting menus on an opened document unless I first hide them, then show them, and then maximise the window. Mostly these things are annoying, sometimes really block work, and always waste time and destroy the flow of work.

However, rather than grousing once again (well I already have a bit), I am trying to make sense of this.  For some time it has become apparent that software is fundamentally breaking down, in that with every new version there is minimal new useful functionality, but more bugs.  This may be simply issues of scale, of the training of programmers, or of the nature of development processes.  Indeed in the talk I gave a bit over a  year ago to PPIG, “as we may code“, I noted that coding in th 21st Century seems to be radically different, more about finding tricks and community know-how and less about problem solving.

Whatever the reason, I don’t think the Biblical plague of bugs is simply due to laziness or indifference on the part of large vendors such as  Microsoft and Adobe, but is symptomatic of a deeper crisis in software development, certainly where there is a significant user interface.

Maybe this is simply an inevitable consequence of scale, but more optimistically I wonder if there are new ways of coding, new paradigms or new architectural models.  Can 2010 be the decade when software is reborn?

why software need never hang

Over 20 years ago I wrote “The Myth of the Infinitely Fast Machine“, about the way software developers effectively assume that everything on the machine side of human interaction happens instantly. Often interaction is programmed in a turn-taking style:

  1. wait for user action
  2. process the event
  3. display changes
  4. back to step 1

This assumption of instant (or at least infinitely fast) response at step 2 often ignores network delays, disk IO or heavy computation. This tends to work fine on a high-spec development or test machine, with a fast network and clean install of all system software … but when the software hits a real machine, a few years old, untidy system, slow network … things fall to pieces.

So 20 years later (as I described in my post last week) I am sitting watching the spinning rainbow ball as Word struggles to save a document (over an hour now, I think I will need to kill it). To be fair I think the root ’cause’ of the problem … or at least one problem … may be the printer as the Cannon printer driver has never worked properly on an Intel Mac (maybe new driver when I upgrade to Leopard?) and perhaps some change in the rest of the system (maybe the Office install) has tipped it over into not working at all.

As far as I can tell Word then decides to ask the printer things in order to set the margins properly when saving the document, and then gets stuck. I found a post on a Microsoft forum about a different print related problem and the ‘helpful’ tech support from MS simply said “not our fault, re-install everything”.

So to recap:

  • user asks Word to save – probably the most critical operation in the system, or the system auto-saves, again to ensure safety against crashes, so really critical
  • Word decides it needs information from the printer (although it has been displaying the page to the users using some existing information on page properties).
  • Word asks for info from the printer driver of the currently selected printer
  • if the printer doesn’t respond Word hangs and blocks all user interaction

However, the printer driver may be third party, may be connecting to a shared printer hanging off a different network, or in the case of a laptop on a network currently disconnected from the computer … and any resulting delay is not the fault of the developers of Word??!

The annoying thing is that such ‘hanging’ delays need never happen.

Basically there are four main causes for delays:

  1. ordinary computation takes a long time due to it being too complex for the available hardware
  2. unbounded internal computation -for example iterative algorithms
  3. waiting for external resources (disk, network, etc.)
  4. bugs that lead to the system going crazy (effectively case 2 by accident!)

Type 1 will surface during testing and may require re-design of the interaction, but is simply ‘slow’ rather than ‘hanging’. Typically it leads to things gradually getting slower as the document or data gets larger or more complicated. This requires standard profiling and optimisation.

Type 4 is hard to deal with – bugs do happen. However, the majority of the problems I’m experiencing in Word at the moment are not a failure of this kind as Word does, most of the time, eventually complete without crashing.

Types 2 and 3, especially the latter, should be detected and then dealt with in the design of the user interface.

Some real-time programming languages have ways of automatically working out how long code will take to run in order to be able to assert “this will respond within a 10 ms interrupt cycle”. However, this is hard, even for relatively simple embedded systems; so not practical for complex operating systems or user interfaces.

However, a simpler version of the above is possible. Certain system functions invoke external resources such as the disk, or the network. If any function or method in your own application invokes one of these system functions, then it could potentially hang – and should be documented to say so or return some sort of ‘promise’: “I’ve started to do X, please check back later to see if it is ready”. Of course the methods that call these themselves need to be documented as potentially hanging … and so forth.

If the response to any form of user interaction ends up calling a potentially hanging function, then it is in danger of having a delay of type 3 above. However, so long as this is known, it can be dealt with at the user interface level by spawning a thread to do the work so that some form of progress indicator or at least “Cancel” button can be active – it should never ‘hang’.

This marking of functions as potentially ‘hanging’ could be done by programmers themselves, but equally can be automated as a form of static analysis, simply starting with a known set of hanging system functions and recursively ‘colouring’ functions that call them. This kind of automated checking should be standard practice in any large software project.

The type 2 hanging is a little more complicated. The ADA programming language has a ‘safe’ subset that only allows loops where the bounds are fixed at compile time. This is probably too restrictive for complex software, but certainly any loop with unknown limits could be flagged. If as part of a code walk through or similar practice it is decided that the loop is ‘safe’ it can be annotated as such, otherwise, just like the case of system calls, the system can propagate the fact that certain functions may have unbounded computation and then the UI adjusted accordingly.

For small bespoke software development I can be forgiving, but for large vendors like Microsoft, Apple or Adobe, there is no excuse for this form of culpable failure.

… but I have a bad feeling that in 20 years time I may be writing again …

[[ News flash – 1.5 hours later Word has finished saving the document! … 14 pages obviously hard work. … but then it has hung again 🙁 ]]

pain, tears and office 2008

Some weeks ago I upgraded Microsoft Office to Office 2008 (yes it does still have menus on the Mac!), and life since has been constant trouble.

OK first there are ‘minor’ niggles like it eating 1/2 my screen space in huge tool bars replicated at the top of every window, or eveytime I read in an Excel spreadsheet it telling me that old macros no longer work … actually I don’t use Excel macros, but f you do and have lots of spreadsheets that use them what then? … and don’t get me started in the fact that I can no longer cut and paste directly between Word and Dreamweaver.

… and then, just over 2 weeks ago, I was at the AVI conference and, as one does, writing the slides for the presentation the day before. I had produced all the diagrams for the presentation in Powerpoint and then copied them into Word, so thought it would be easy – start with the Powerpoint file with all the diagrams in it and add a few words around them – after all pictures always best. However, this was reckoning without Office 2008. The figures had been produced in PPT 2004, and when I opened them in Office 2008 half the images just disappeared. I tried opening in the old version of office, but it simply crashed every time I tried to update a file, I assume the Office 2008 install broke the old Office 2004 install in some way. In desperation I tried cutting and pasting the slides between PPT 2004 and PPT 2008, but that failed (I guess because Powerpoint thought it was pasting back into itself!). Eventually I managed to get the crucial images by cutting and pasting via a third program.

But the reason I am blogging now, rather than doing the pile of work that I need to do, is that Word has decided that about every 10 minutes it needs a 15 minute break and disappears into a little spinning rainbow – it does eventually come back, but only after several cups of tea.

To be fair most of the problems seem to be with compatibility mode … but surely backward compatibility is not so difficult … after all we have a lot of old files out here .. or if they can’t code it properly simply produce one-off converters rather than pretending to work when they don’t!

But the spinning disk has at last stopped … so back to another 10 minutes work before it halts again.