Information Quality problems impact reporting in US Education

Via The Miami Herald comes a story that highlights a number of impacts of poor quality information in key processes.

In Oklahoma, schools are placed on an “improvement list” if they fail to meet standards for two consecutive years. Once on the list, a school must show progress in improving standards for two years before it can be taken off the list. This can have implications for funding and access to resources as well. Some Oklahoma school districts are, it is reported, concerned that they don’t make the grade against Federal requirements.

Problems with the quality of demographic data in electronic testing performed by Pearson have affected the publication of the reports against which schools are graded. These will now be available a full month late, being released in September rather than August as expected, which will affect the ability of school boards to respond effectively to their report cards.

Other problems reported on top of the missed deadlines include errors in the printing of report cards to be sent to parents.

Oklahoma’s Superintendent of Schools, Janet Barresi, has described the impacts of poor quality data in this process as a “ripple effect” that is “imposing an unacceptable burden on school districts” and has called for Pearson’s contract to be reviewed. Pearson are engaging an independent third party to help verify the accuracy and validity of the scoring data (in which they remain confident).

Oklahoma is not the first State where data issues have been a problem.

  • In 2010, in Florida, Pearson was penalised $14.7 million and had to ramp up staffing levels and make changes to systems as a result of information quality problems that led to delays. The problems there related to the matching of student records.
  • In 2010 in Wyoming, Pearson also had to pay penalties arising from problems with the testing, ranging from data going missing to other administrative problems such as improperly calibrated protractors.

This video from the Data Quality Campaign, a US Non-Profit working to improve standards of data quality in the US Education system, highlights the value of good quality and timely information in this important sector:

Space triggers early release of Scottish Exam Results

It was widely reported yesterday that students in Scotland who had signed up for SMS notification of their results had received them a day early, giving them a jump on their less technically minded compatriots and competitors and causing stress and distress to students, parents, educators, civil servants, and politicians.

The Scottish Education Authorities began a root cause investigation as the secrecy and security of the examination results system seemed to have been compromised.

Right now I suspect you are settling in for a tale of hackers and Jason Bourne-like derring-do. Well, here at IQTrainwrecks we never get that lucky. After all, this is a blog that looks at information and data quality problems.

According to The Register the root cause of this problem is good old data interchange and exchange across organisations.

  1. A template spreadsheet was used to perform the data interchange between the Scottish Education Authority and AQL, the company which provides the SMS gateway and related processing to the Education Authority on a pro bono basis. A batch template is used rather than an on-line interface as the service is only used once a year.
  2. The template was populated and saved in a later version of Microsoft Excel
  3. The process of populating and saving the spreadsheet appended a white space to the end of each date stamp (the date on which the SMS was to be sent).
  4. The ETL process interpreted the “DATE” field as text (which it was, thanks to that errant space) and rejected the field on load. Luckily, AQL had developed error handling for situations where a date field couldn’t be loaded and applied a default… the day of the file load (which was the day before the messages were due to go out). This failure mode is sketched in code just after the list.
  5. As a result the SMS system read the file and sent the messages a full day early.
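As a minimal sketch of that failure mode, in Python rather than whatever AQL’s ETL actually runs on, and with an illustrative date format and values:

```python
from datetime import date, datetime

def parse_send_date(raw_value: str) -> date:
    """Parse the 'send date' column from the batch template.

    Mirrors the behaviour described above: a strict parse that treats any
    non-conforming value as text, falling back to the load date when the
    parse fails. The column format is an assumption for illustration.
    """
    try:
        # A clean "2013-08-06" parses; "2013-08-06 " (with the appended
        # white space) raises ValueError, so the value is effectively text.
        return datetime.strptime(raw_value, "%Y-%m-%d").date()
    except ValueError:
        # The error handling applies a default: the day of the file load,
        # which in this case was the day before results day.
        return date.today()

print(parse_send_date("2013-08-06"))   # the intended send date
print(parse_send_date("2013-08-06 "))  # falls back to today, a day early
```

The point is not the exact format but the silent fallback: a value rejected by the load did not stop the process, it simply picked up a convenient default.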
One way of looking at this is that a technical information management issue meant a gaggle of Scottish students got to go celebrating a full 24 hours early and are now making life-changing decisions about their future education with the biggest hangovers of their young lives. Which will obviously end well.

Another way of looking at it is that this highlights the importance of proper standards and defined information flows, particularly where the cycle time of the process is long and the frequency of the operation is low. What kind of “pre-flight” checks could have been built into the governance to prevent this? What assumptions were being made that should have been challenged?

It could be YOU (and 44,999 others)

The Irish National Lottery had an embarrassment last week when their Bank Holiday promotion draw went awry.

As part of a special draw for the August Bank Holiday weekend, the Lottery were offering a Jaguar XK convertible as an additional prize to the person who won the jackpot.

Unfortunately, due to apparent “human error”, the National Lottery Company informed anyone who checked their numbers on-line and had matched any combination of numbers that they had won the car, even if the monetary value of the prize was as little as €5.00. They hadn’t, but the story still made headline news. Some outlets report that disgruntled non-winners are considering legal action.

It is important to have validation checks in place on reports and publication of data, particularly where that data would be of value or could be relied upon to the detriment of another person.
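As an illustration only, using entirely hypothetical field names rather than the National Lottery’s actual data model, a pre-publication check along these lines would have held the message back before any player saw it:

```python
def validate_notification(is_jackpot_winner: bool, prize_message: str) -> None:
    """Block any results message that mentions the car prize unless the
    record is actually flagged as the jackpot winner.

    Both parameters are hypothetical stand-ins for whatever the real
    notification record contains.
    """
    if "jaguar" in prize_message.lower() and not is_jackpot_winner:
        raise ValueError("Message claims the car prize for a non-jackpot "
                         "win; hold publication for review.")

# A €5 winner whose message mentions the Jaguar gets caught here.
try:
    validate_notification(False, "Congratulations, you have won a Jaguar XK!")
except ValueError as err:
    print(f"Blocked: {err}")
```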

Google Maps inaccuracies

We spotted this on Gawker.com. From my experience using Google Maps, it rings true (I recently was sent 15 miles out of my way on a trip in rural Ireland).

It seems that Google Maps has plotted the location of a tourist attraction in New Jersey right at the end of a driveway to a private residence. So, on the 4th of July weekend, the owners of the property had to fend off increasingly irate visitors who were looking for the lake and wound up in a private driveway.

So, the data is inaccurate and of poor quality. Have Google responded to their error and replotted the location of the tourist attraction at the lake? Not yet, according to the story on Gawker.

Green Card, Red Faces

The United States Government is being sued in a massive class-action suit representing Green Card applicants from over 30 countries which alleges that the United States unfairly denied 22,000 people a Green Card due to a computer blunder.

This story is reported in the Irish Times and the Wall Street Journal.

It is not in the remit of this blog to debate the merits of awarding working visas on the basis of a random lottery, but this is precisely what the Green Card system is, offering places to 50,000 people each year based on a random selection of applications submitted over a 30-day period. According to the WSJ:

In early May, the State Department notified 22,000 people they were chosen. But soon after, it informed them the electronic draw would have to be held again because a computer glitch caused 90% of the winners to be selected from the first two days of applications instead of the entire 30-day registration period.

Many of these 22,000 people are qualified workers who had jobs lined up contingent on their getting the Green Card. The WSJ cites the example of a French neuropsychology PhD holder (who earned her PhD in the US) who had a job offer contingent on her Green Card.

The root causes that contributed to this problem are:

  1. The random sampling process did not pull records evenly from the entire 30-day period; the sampling was weighted towards the first two days, with 90% of the “winners” drawn from applications submitted in those two days.
  2. There was no review of the sampling process and outputs before the notifications were sent to the applicants and published by the State Department. It appears there was a time lag in the error being identified and the decision being taken to scrap the May Visa Lottery draw.

The first error looks like a possible case of a poorly designed sampling strategy in the software. The regulations governing the lottery draw require that there be a “fair and random sampling” of applicants. As 90% of the selected applicants were drawn from the first two days, the implication is that the draw was not fair enough or was not random enough. At the risk of sounding a little clinical, however, fair and random do not always go hand in hand when it comes to statistical sampling.

If the sampling strategy was to pool all the applications into a single population (N) and then randomly pull 50,000 applicants (a sample of size n), then all applicants had a statistically equal chance of being selected, and the fact that the sampling pulled records from the same narrow date range is an interesting correlation or coincidence. Indeed, the date of application would be irrelevant to the sampling extraction, as everyone would be in one single population. Of course, that depends to a degree on the design of the software that created the underlying data set (were identifiers assigned randomly or sequentially before the selection/sampling process began, etc.).
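A minimal sketch of that pooled strategy, in Python with a made-up population (this is not the State Department’s code), shows why the date of application should wash out:

```python
import random
from collections import Counter

# Hypothetical population: (application_id, day_submitted) pairs spread
# evenly across a 30-day registration window.
applications = [(f"APP-{i:07d}", (i % 30) + 1) for i in range(600_000)]

# Pool everything into one population and draw 50,000 uniformly at random.
rng = random.Random(2011)
winners = rng.sample(applications, 50_000)

# With a genuinely uniform draw, each day supplies roughly 1/30th of the
# winners; 90% coming from the first two days would be wildly improbable.
share_first_two = sum(1 for _, day in winners if day <= 2) / len(winners)
print(f"{share_first_two:.1%} of winners came from days 1 and 2")  # about 6.7%
print(Counter(day for _, day in winners).most_common(3))
```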

This is more or less how your local State or National lottery works… a defined sample of balls is pulled randomly, creating an identifier which is associated with a ticket you have bought (i.e. the numbers you have picked). You then have a certain statistical chance of a) having your identifier pulled and b) being the only person with that identifier in that draw (or else you have to share the winnings).

If the sampling strategy was to pull a random sample of 1666.6667 records from each of the 30 days, that is a different approach. Each person has the same chance as anyone else who applied on the same day, with each day contributing the same number of selected applicants. Of course, it raises the question of what to do with the rounding difference carried across the 30 days (equating to 20 people) while still being fair and random (a mini-lottery, perhaps).
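A sketch of that per-day strategy, again purely illustrative and reusing the made-up (application_id, day_submitted) pairs from the previous sketch, with the 20-person rounding difference handled by a mini-lottery over the days:

```python
import random
from collections import defaultdict

def draw_per_day(applications, winners=50_000, days=30, seed=None):
    """Stratified draw: an equal quota from each application day, with the
    rounding remainder (50,000 does not divide evenly by 30) allocated by
    a mini-lottery over the days. Illustrative only.
    """
    rng = random.Random(seed)
    by_day = defaultdict(list)
    for app_id, day in applications:
        by_day[day].append((app_id, day))

    base_quota, remainder = divmod(winners, days)            # 1,666 each, 20 left over
    lucky_days = set(rng.sample(sorted(by_day), remainder))  # which days get one extra

    selected = []
    for day, apps in by_day.items():
        quota = base_quota + (1 if day in lucky_days else 0)
        selected.extend(rng.sample(apps, quota))
    return selected
```

Every day contributes 1,666 or 1,667 winners, so no single day can dominate, but whether that is “fairer” than the pooled draw is exactly the kind of statistical argument the class action may turn on.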

Which raises the question: if the approach was the “random in a given day” sampling strategy, why was the software not tested before the draw to ensure that it was working correctly?

In relation to the time lag between publication of the results and identification of the error, this suggests a broken or missing control process in the validation of the sampling to ensure that it conforms to the expected statistical model. Again, in such a critical process it would not be unreasonable to have extensive checks, but the checking should be done BEFORE the results are published.
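Whatever the sampling strategy, a control of roughly this shape, run before any notification is issued, would have caught a draw that was 90% weighted towards two days. This is a sketch only; the tolerance is an arbitrary illustrative value, and a real control would be specified against the statistical model the regulations require:

```python
from collections import Counter

def check_draw_before_publication(winners, days=30, tolerance=0.25):
    """Refuse to publish if any application day is badly over- or
    under-represented relative to the uniform expectation of 1/days.

    'winners' is assumed to be a list of (application_id, day) pairs, as
    in the earlier sketches.
    """
    counts = Counter(day for _, day in winners)
    expected_share = 1 / days
    for day in range(1, days + 1):
        share = counts.get(day, 0) / len(winners)
        if abs(share - expected_share) > tolerance * expected_share:
            raise RuntimeError(f"Day {day} supplied {share:.1%} of winners "
                               f"(expected about {expected_share:.1%}); "
                               "hold publication and investigate.")
```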

Given the basis of the Class Action suit, expect to see some statistical debate in the evidence being put forward on both sides.

Unhealthy Healthcare data

For a change (?) it is nice (?) to see stories about healthcare IQ Trainwrecks that don’t necessarily involve loss of life, injury, tears, or trauma.

Today’s Irish Examiner newspaper carries a story of the financial impacts of poor quality data in healthcare administration. At a time when the budgets for delivery of healthcare in Ireland are under increasing pressure due to the terms of the EU/IMF bailout, it is essential that the processes for making payments operate efficiently. It seems they do not (a sketch of one possible cross-check follows the list):

  1. Staff who retired from one role and then re-entered the Health Service in a different role continued to be paid pensions (HSE South)
  2. An absence of controls meant staff who were being paid pension entitlements while on sick leave continued to receive those payments when they returned to work (HSE South)
  3. Pensions were calculated on incorrect bases for staff who were on secondment to, or shared with, other agencies (HSE South)
  4. Inaccurate data about the ages of dependents resulted in overpayments of death-in-service benefits (HSE South)
  5. “Inappropriate” filing systems were “needlessly incurring wastage of scarce resources” (HSE Dublin/Mid Leinster)
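As sketched below, the first two failures come down to a missing cross-check between two data sets the HSE already holds. The set-based extracts and the staff ID key are illustrative assumptions, not the HSE’s actual systems:

```python
def pension_payroll_overlap(pension_payees: set[str],
                            active_payroll: set[str]) -> set[str]:
    """Return staff identifiers appearing on both the pension run and the
    active payroll, for review before any payment issues.

    The inputs stand in for extracts from the real pension and payroll
    systems, keyed on a hypothetical staff ID.
    """
    return pension_payees & active_payroll

# Anyone returned is either a retiree re-hired into a new role or a staff
# member back at work after paid sick leave: the two scenarios flagged for
# HSE South above.
print(pension_payroll_overlap({"S1001", "S1002"}, {"S1002", "S1003"}))  # {'S1002'}
```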


Poor quality information costs between 10% and 35% of turnover in the average organisation, so the HSE may not be too bad. But the pattern of failed controls and processes producing poor quality data that leads to financial impacts is all too familiar.

Electoral finger flub causes kerfuffle

Via the twitters and google comes this story from Oh Canada about the unforeseen confluence of an election, the adoption of new technology (QR codes), and a careless finger flub that has resulted in a bit of embarrassment for a Liberal Party candidate.

This is the comedic counterpoint to our story last month of the finger flub that resulted in death and lawyers.

It seems that staffers working for candidate Justin Trudeau fat-fingered the creation of the QR code being used on his posters. Instead of the code containing a URL for the Liberal Party, they hit the “U” key, creating a URL that sent people to a “lifestyle” site promoting the use of lubricants in sexual activity.
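A simple pre-print gate, sketched below with a hypothetical domain whitelist, is all it would have taken to catch a one-character slip before it went onto thousands of posters:

```python
from urllib.parse import urlparse

# Illustrative whitelist; a real campaign would maintain its own.
APPROVED_DOMAINS = {"liberal.ca", "www.liberal.ca"}

def check_poster_url(url: str) -> None:
    """Refuse to generate a QR code for any URL whose host is not on the
    campaign's approved domain list."""
    host = urlparse(url).netloc.lower()
    if host not in APPROVED_DOMAINS:
        raise ValueError(f"'{host}' is not an approved campaign domain; "
                         "do not print this QR code.")

check_poster_url("http://www.liberal.ca/")       # fine
try:
    check_poster_url("http://www.luberal.ca/")   # the fat-fingered URL
except ValueError as err:
    print(f"Caught before printing: {err}")
```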

Sadly, Luberal.ca has been taken down at the request of the party, and it seems that they may be in discussions to buy the domain name from the current owner. The candidate has tweeted about the issue, and staff have been dispatched to replace the offending QR code with a corrected version.

All of which adds up to cost and resource headaches for an election candidate who probably had other things planned for his staff to be doing at this stage in the campaign.

Of course, we remain slightly concerned that, given that it is April 1st, this may be too good a story to be true. But in that case, take it as a parable of what could happen, not necessarily a report of what did!

Gas by-products give a pain in the gut

Courtesy of Lwanga Yonke comes this great story about how the choice of unit of measure for reporting, particularly for regulatory reporting or Corporate Social Responsibility reports, can be very important.

The natural gas industry’s claim that it is making great strides in reducing the polluted wastewater it discharges to rivers is proving difficult to assess because of inconsistent reporting and a big data entry error in the system for tracking contaminated fluids.

The issue:

Back in February, the natural gas industry in the US released statistics which appeared to show that it had managed to recycle at least 65% of the toxic waste brine that is a by-product of natural gas production. Unfortunately, the data input was a little askew, thanks to one company which had reported data back to the State of Pennsylvania using the wrong unit of measure, confusing barrels with gallons.

For those of us who aren’t into the minutiae of natural gas extraction, the Wall Street Journal helpfully points out that there are 42 gallons in a barrel. So, by reporting 5.2 million barrels of wastewater recycled instead of the 5.2 million gallons that were actually recycled, the helpful data entry error overstated the recycling success by a factor of 42.

Which is, coincidentally, the answer to Life, the Universe and Everything.
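The arithmetic, as a throwaway sketch using the figures from the story:

```python
GALLONS_PER_BARREL = 42

reported_barrels = 5_200_000   # the figure as keyed in, in the wrong unit
actual_gallons = 5_200_000     # what was actually recycled

implied_gallons = reported_barrels * GALLONS_PER_BARREL
print(f"Implied: {implied_gallons:,} gallons; actual: {actual_gallons:,} gallons; "
      f"overstated by a factor of {implied_gallons // actual_gallons}")  # 42
```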

According to the Wall Street Journal, it may be impossible to accurately identify the rate of waste water recycling in the natural gas industry in the US.

Not counting Seneca’s bad numbers — and assuming that the rest of the state’s data is accurate — drillers reported that they generated about 5.4 million barrels of wastewater in the second half of 2010. Of that, DEP lists about 2.8 million barrels going to treatment plants that discharge into rivers and streams, about 460,000 barrels being sent to underground disposal wells, and about 2 million barrels being recycled or treated at plants with no river discharge.

That would suggest a recycling rate of around 38 percent, a number that stands in stark contrast to the 90 percent recycling rate claimed by some industry representatives. But Kathryn Klaber, president of the Marcellus Shale Coalition, an industry group, stood by the 90 percent figure this week after it was questioned by The Associated Press, The New York Times and other news organizations.

The WSJ article goes on to point out that there is a lack of clarity about what should actually be reported as recycled waste water, and that there are issues with the tracking and reporting of discharges of waste water from gas extraction.

At least one company, Range Resources of Fort Worth, Texas, said it hadn’t been reporting much of its recycled wastewater at all, because it believed the DEP’s tracking system only covered water that the company sent out for treatment or disposal, not fluids it reused on the spot.

Another company that had boasted of a near 100 percent recycling rate, Cabot Oil & Gas, also Houston-based, told The AP that the figure only included fluids that gush from a well once it is opened for production by a process known as hydraulic fracturing. Company spokesman George Stark said it didn’t include different types of wastewater unrelated to fracturing, like groundwater or rainwater contaminated during the drilling process by chemically tainted drilling muds.

So, a finger flub on data entry, combined with a lack of agreement on the meaning and usage of data in the industry and gaps in the regulation and enforcement of standards, means that there is, as of now, no definitive right answer to the question “How much waste water is recycled from gas production in Pennsylvania?”

What does your gut tell you?


Calculation errors cast doubt on TSA Backscatter safety

It was reported in the past week on Wired.com and CNN that the TSA in the United States is to conduct extensive radiation safety tests on its recently introduced backscatter full body scanners (affectionately known as the “nudie scanner” in some quarters).

An internal review of the previous safety testing which had been done on the devices revealed a litany of

  • calculation errors,
  • missing data and
  • other discrepancies on paperwork

In short, Information Quality problems. A TSA spokesperson described the issues to CNN as being “record keeping errors”.

The errors affected approximately 25% of the scanners in operation, which Wired.com identifies as being from the same manufacturer, and included errors in the calculation of the radiation exposure that occurs when passing through the machine. The calculations were out by a factor of 10.

Wired.com interviewed a TSA spokesperson and they provided the following information:

Rapiscan technicians in the field are required to test radiation levels 10 times in a row, and divide by 10 to produce an average radiation measurement. Often, the testers failed to divide results by 10.
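In other words, the required procedure is a simple mean of ten readings, and the reported failure mode was returning the sum instead. A trivial sketch (the reading values are made up, not real survey data):

```python
def average_radiation(readings: list[float]) -> float:
    """Average of the ten required survey readings."""
    return sum(readings) / len(readings)

readings = [0.005] * 10               # ten illustrative readings
print(average_radiation(readings))    # 0.005: the correct average
print(sum(readings))                  # 0.05: the 10x overstatement when nobody divides
```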

For their part, the manufacturer is redesigning the form used by technicians conducting tests to avoid the error in the future. Also, it appears from documentation linked to from the Wired.com story that the manufacturer spotted the risk of calculation error in December 2010.

Here at IQTrainwrecks.com we are not nuclear scientists or physicists or medical doctors (at least not at the moment) so we can’t comment on whether the factor of 10 error in the calculations is a matter for any real health concern.

But the potential health impacts of radiation exposure are often a source of concern for people. Given the public disquiet in the US and elsewhere about the privacy implications and other issues surrounding this technology, any errors which cast doubt on the veracity and trustworthiness of the technology, its governance and management, and the data on which decisions to use it are based will create headlines and headaches.


The Wrong Arm of the (f)Law

Courtesy of Steve Tuck and Privacy International comes this great story from the UK of how a simple error, if left uncorrected, can result in significantly unwelcome outcomes. It is also a cautionary tale for those of us who might think that flagging a record as being “incorrect” or inaccurate might solve the problem… such flags are only as good as the policing that surrounds them.
Matthew Jillard lives on Repton Road in a suburb of Birmingham. In the past 18 months he has been raided over 40 times by the police. During Christmas week he was raided no fewer than 5 times, with some “visits” taking place at 3am and 5am, disturbing him, his family, his family’s guests, his neighbours, his neighbours’ guests…
According to Mr Jillard,
9 times out of 10 they are really apologetic.
Which suggests that 1 time out of 10 the visiting police might be annoyed at Mr Jillard for living at the wrong address (??)
The root cause: The police are confusing Mr Jillard’s address with a house around the corner on Repton Grove.
[Embedded map: scroll to the right to find Repton Grove.]
[Image: Clancy Wiggum from The Simpsons. Not a spokesman for West Midlands Police.]
Complaints to the police force in question have been met with apologies and assurances that the police have had training on how important it is to get the address right for a search. Some officers have blamed their Sat Nav for leading them astray.
Given the cost to the police of mounting raids, getting it wrong 40 times will be putting a dent in their budget. The cost of putting right any damage done to Mr Jillard’s home in the course of the incorrect raids (which have included kicking in his door at 3am on Christmas Day) will also be mounting up.
The police have said that “measures” have been taken to prevent Mr Jillard’s home being raided, including putting a marker against his address on the police computer systems. None of these measures appear to have stopped the raids, which come at an average frequency of more than one a fortnight (40 raids in 18 months).
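A marker on a record is only useful if something actually checks it before action is taken. A hypothetical sketch of the kind of gate that would make such a flag effective (the address data and the crude normalisation are illustrative, not real police systems):

```python
# Addresses carrying a "previously raided in error" marker, normalised.
FLAGGED_ADDRESSES = {"repton road birmingham"}

def normalise(address: str) -> str:
    """Crude normalisation so trivially different spellings still match."""
    return " ".join(address.lower().replace(",", " ").split())

def confirm_raid_address(warrant_address: str) -> None:
    """Hard stop before dispatch if the warrant address carries a marker."""
    if normalise(warrant_address) in FLAGGED_ADDRESSES:
        raise RuntimeError("Address is flagged as previously raided in error; "
                           "require senior sign-off before dispatch.")

try:
    confirm_raid_address("Repton Road, Birmingham")
except RuntimeError as err:
    print(f"Dispatch halted: {err}")
```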
This Trainwreck highlights the impact of apparently simple errors in data:
  1. Mr Jillard’s home is being disturbed without cause on a frequent basis
  2. His neighbours must be increasingly suspicious of him, what with the police calling around more often than the milkman
  3. The police force is incurring costs and wasting manpower with a continuing cycle of fruitless raids.
  4. The real targets of the raids are now probably aware that the police are looking for them and will have moved their activities away from Repton Grove.