Saturday 19 December 2015

Ofsted and the Parable of the Red Beads


The government’s proposed reforms of children’s services in England assign a pivotal role to the inspectorate Ofsted. If a local authority’s children’s services department is rated ‘inadequate’ by Ofsted, it will now be given six months to improve or risk being taken over. That’s drastic stuff, so there has never been a better time to think very hard about how valid and reliable Ofsted inspections are.

To help do just that, I have developed a thought experiment based on the red bead game used by the quality guru Dr. W. Edwards Deming as a teaching aid in the seminars and lectures he gave across the world until his death in 1993. Dr. Deming used the game to demonstrate that even with identical methods and tools there will always be variation in results, and that this variation often has nothing to do with what individuals and groups actually contribute to delivering a particular process.

My thought experiment adapts the red bead game as follows:

Imagine you have 150 pots, each one corresponding to a local authority in England. In each pot you place 5000 beads, 4000 of which are white and 1000 of which are red. The beads represent ‘cases’ or ‘service episodes’. The white beads are examples of acceptable or good practice and the red ones are examples of poor practice. So 1 in every 5 cases (20%) is substandard. [1]

Now simulate the activity of an inspector by randomly extracting 50 beads from each pot and examining what you get [2]. You will be very lucky indeed to find that each extract contains 40 white beads and 10 red ones (corresponding to the overall proportion of 20% red beads in the pot). On the contrary, you are highly likely to see quite a lot of variation in the white/red proportion of each extract. In some extracts the number of red beads will be well below 10 and in others it will be considerably higher.
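For anyone who would rather run the thought experiment than imagine it, here is a minimal sketch in Python. The pot size, the number of red beads and the extract size are the figures given above; the function names and the random seed are my own illustrative choices, and the beads are drawn without replacement, which is an assumption about how the extraction would be done.

```python
import random

POTS = 150            # one pot per local authority, as in the text
BEADS_PER_POT = 5000  # cases in each pot
RED_PER_POT = 1000    # 20% substandard (the guesstimate from note [1])
SAMPLE_SIZE = 50      # beads extracted by the "inspector"


def draw_red_count(rng, total=BEADS_PER_POT, red=RED_PER_POT, sample=SAMPLE_SIZE):
    """Draw `sample` beads without replacement and count the red ones."""
    pot = [1] * red + [0] * (total - red)  # 1 = red bead, 0 = white bead
    return sum(rng.sample(pot, sample))


def simulate(seed=0, pots=POTS):
    """Return the red-bead count for one extract from each identical pot."""
    rng = random.Random(seed)  # seeded so a run can be repeated exactly
    return [draw_red_count(rng) for _ in range(pots)]


if __name__ == "__main__":
    counts = simulate()
    print("fewest red beads in an extract:", min(counts))
    print("most red beads in an extract:", max(counts))
    print("extracts with exactly 10 red:", counts.count(10))
```

Every pot in this sketch is filled identically, so any spread in the counts it prints comes from the sampling alone.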

Results for the first 10 pots might look like this:

Pot    No. (%) red
A       5 (10)
B      15 (30)
C      11 (22)
D      19 (38)
E       2 (4)
F      17 (34)
G       8 (16)
H      23 (46)
I       5 (10)
J      18 (36)


This variation cannot be ascribed to anything that is going on inside the pots (because we know that we put 4000 white and 1000 red beads into each one and that they have just stayed there until they were extracted). So it would be very wrong indeed to ascribe to any particular pot a description such as “too many reds” or “too much poor practice” or “inadequate”. And it would be very wrong to conclude that pots D, F, H and J should be made subject to special measures while those responsible for pots E, A and I should be lauded for their outstanding performance!
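The size of this chance variation can be worked out rather than guessed. As a sketch, assuming the beads are drawn without replacement, the number of red beads X in an extract follows a hypergeometric distribution. The symbols N (beads in the pot), K (red beads in the pot) and n (beads extracted) are my own notation; the numbers are the ones used in the thought experiment.

```latex
\begin{align*}
\mathbb{E}[X] &= n\,\frac{K}{N} = 50 \times 0.2 = 10,\\
\operatorname{Var}(X) &= n\,\frac{K}{N}\Bigl(1-\frac{K}{N}\Bigr)\frac{N-n}{N-1}
  = 50 \times 0.2 \times 0.8 \times \frac{4950}{4999} \approx 7.92,\\
\sigma_X &= \sqrt{\operatorname{Var}(X)} \approx 2.8 .
\end{align*}
```

So a standard deviation of roughly 2.8 beads, a little under 6 percentage points, arises from sampling alone. With an extract of 20, the figure mentioned in note [2], the same formula gives about 1.8 beads, or roughly 9 percentage points.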

But, I hear you ask, perhaps Ofsted has taken steps, in the way it designs its inspections and selects its samples, to minimise the natural variation that occurs in the red bead game? Perhaps they use clever statistics to ensure that their results are valid? Well, perhaps they do, but there is no evidence of it. I have scoured the Ofsted website for anything which suggests that they have thought about the red bead problem. And I have written to them and pursued them with a Freedom of Information Act request to find out if they use statistical techniques to try to ensure inspections are valid. The reply I received gives no indication that they do. [2]

But it is not really up to me to justify Ofsted’s methods. It is up to them. In 2012 Professor Dylan Wiliam, of the University of London’s Institute of Education, challenged Ofsted to evaluate the reliability of its school inspections and publish the findings, asking: “If two inspectors inspect the same school, a week apart, with no communication between them, would they come to the same ratings?” (Times Educational Supplement 03/02/12).

I don’t know whether Prof. Wiliam got an answer, but I can’t find one that has been published. Maybe in 2016 Ofsted could answer a similar question for me: “How can Ofsted be sure that the variation between different local authorities, revealed in its inspections of children’s services in England, is due to differences in performance rather than just due to chance?”
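One conventional way of approaching that question, offered only as an illustration and not as a description of anything Ofsted does, is the kind of control chart Dr. Deming used with the red beads: a p-chart with limits three standard deviations either side of the overall proportion. The sketch below assumes every extract is the same size; the function names and the counts in the usage example are hypothetical.

```python
import math


def p_chart_limits(red_counts, sample_size):
    """Return (overall proportion, lower limit, upper limit) for a p-chart."""
    p_bar = sum(red_counts) / (len(red_counts) * sample_size)
    sigma = math.sqrt(p_bar * (1 - p_bar) / sample_size)
    lower = max(0.0, p_bar - 3 * sigma)  # a proportion cannot fall below 0
    upper = min(1.0, p_bar + 3 * sigma)  # or rise above 1
    return p_bar, lower, upper


def outside_limits(red_counts, sample_size):
    """Return the indices of extracts whose proportion falls outside the limits."""
    _, lower, upper = p_chart_limits(red_counts, sample_size)
    return [
        i for i, count in enumerate(red_counts)
        if not lower <= count / sample_size <= upper
    ]


if __name__ == "__main__":
    # Hypothetical red-bead counts from eight extracts of 50 beads each.
    hypothetical_counts = [9, 12, 7, 14, 10, 11, 6, 13]
    print(p_chart_limits(hypothetical_counts, 50))
    print(outside_limits(hypothetical_counts, 50))
```

On this approach, only results outside the limits would be treated as signalling something other than chance; everything inside them would be regarded as the ordinary variation of the process.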

If Ofsted cannot answer that question in a convincing way it should not be in the business of inspecting children’s social care and the government should certainly not be assigning a pivotal role to Ofsted in its so-called ‘reforms’.

Notes

[1] I have no evidence that 1 in 5 cases is in fact substandard, although it seems to me to be a reasonable 'guesstimate', especially in view of the fact that Ofsted finds such a large number of authorities ‘inadequate’ or ‘requiring improvement’. I have tried, without success, to discover if Ofsted is able to estimate what the proportion of substandard cases is in the entire ‘population’ of the cases they have reviewed in (say) the last 10 years.

[2] Ofsted’s ‘Inspection Handbook’ speaks of ‘tracking’ no more than 30 children during an inspection and ‘auditing’ a ‘sample’ of 20 case files. I could find no detailed information in this document about how the cases are chosen.