
Quality Assurance


Standards and verification

Modern technology and methods allow extensive bodies of ‘pre-digital’ data, including academic, scientific and official papers, to be converted, described and stored for delivery to a wider audience than ever before. Quality Assurance (QA) is a key part of digitisation projects such as the creation of histpop by the Online Historical Population Reports (OHPR) project.

Why quality assurance?

Quality Assurance is necessary on websites such as histpop. It ensures that all scanned images are legible and that misspellings are corrected in the OCR text, so that data are not missed when a search is being carried out. As histpop also includes tables in a spreadsheet format created from the original images, it is vital that the numerical data are correct and clear.

The generation of this material has been undertaken by a number of partners/suppliers who have been required to meet specified levels of quality. The assurance that these levels of quality have been achieved is necessary to ensure that users of histpop do not experience difficulties when they use the website. To achieve this objective a team of quality assessors has been employed to carry out rigorous quality assurance for all image, textual and tabular material received in an electronic format from partner suppliers.


The first stage of the Quality Assurance process was to ensure that the scanned images matched those in the original volumes, i.e. that the number of images corresponded to the number of pages in the printed volumes, and that their file names corresponded to the page numbers of the originals. Only then was the image quality considered. This involved, firstly, comparing the margins of an image with those of the original; if these did not match, the image was returned for re-scanning. Secondly, the image was checked for legibility. If any information was illegible, if the page was crooked, or if a scanning operator's hand had been scanned accidentally, the image was returned for re-scanning. If the problem was due to the quality of the original volume, then sourcing another copy of that volume was the only way to resolve it.
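The count and file-name check described above could be sketched as follows. This is only an illustration, not the project's actual tooling: the function name `check_scan_batch` and the file-naming scheme (a page number immediately before the file extension, e.g. `page_0003.tif`) are assumptions made for the example.

```python
import re

def check_scan_batch(file_names, expected_pages):
    """Compare scanned image file names against the expected page numbers.

    Assumes (hypothetically) that each file name ends with the page
    number followed by an extension, e.g. 'page_0003.tif'.
    """
    pattern = re.compile(r"(\d+)\.\w+$")
    found = set()
    unparsed = []
    for name in file_names:
        match = pattern.search(name)
        if match:
            found.add(int(match.group(1)))
        else:
            # No page number could be read from this file name.
            unparsed.append(name)
    expected = set(expected_pages)
    return {
        "missing": sorted(expected - found),      # pages with no image
        "unexpected": sorted(found - expected),   # images with no page
        "unparsed": sorted(unparsed),
    }

# Hypothetical batch for a four-page volume.
report = check_scan_batch(
    ["page_0001.tif", "page_0002.tif", "page_0004.tif", "cover.tif"],
    range(1, 5),
)
print(report)
```

A batch would pass this first check only if all three lists in the report were empty; margin and legibility checks would still need a human comparison with the original volume.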

As an example of the kinds of problems encountered, here are two consecutive pages: one perfectly generated and the other untrimmed, without margin cropping, with unreadable and obscured text and including a scanning operator's thumb impression.

The right-hand image has been re-scanned and cropped correctly before being included on the website. Hopefully, all such instances of 'problem' pages have been resolved. However, because some of the original volumes used for re-scanning have become brittle through time or damaged through use, some less than perfect images will be found on this site. Even so, the quality assurance processes in place mean that the images that appear on the website are of the best quality possible.


The second stage of Quality Assurance entailed the checking of c.50,000 pages of text obtained by OCR (Optical Character Recognition). OCR is an automatic, computerised method of reading text. However, mistakes occur, mainly when characters or sequences of characters look alike, e.g. 'c' and 'e'. Therefore, every OCR text page has been checked manually against the scanned image for misspellings and missing text. If the problems were excessive, the scanned page was sent back to the partner suppliers. However, if a variant spelling or misspelling occurred in the original volume, it was retained in the OCR text.
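The checking described above was done manually. Purely as an illustration of how look-alike characters produce errors, the sketch below flags words that are not in a word list and suggests known words reachable by swapping confusable characters. The confusion table, the word list and the function names are all assumptions for the example, not part of the project's process.

```python
from itertools import product

# Hypothetical table of characters that OCR may confuse with each other.
CONFUSABLE = {"c": "ce", "e": "ec", "l": "l1", "1": "1l"}

def suggest(word, vocabulary):
    """Return known words obtained by swapping confusable characters."""
    options = [CONFUSABLE.get(ch, ch) for ch in word]
    candidates = {"".join(chars) for chars in product(*options)}
    return sorted(c for c in candidates if c in vocabulary and c != word)

def flag_suspects(text, vocabulary):
    """Map each word missing from the vocabulary to its suggestions."""
    suspects = {}
    for token in text.lower().split():
        word = token.strip(".,;:()'\"")
        if word and word not in vocabulary:
            suspects[word] = suggest(word, vocabulary)
    return suspects

# Hypothetical word list and OCR output.
vocabulary = {"census", "of", "the", "parish", "return"}
suspects = flag_suspects("Census of thc parish rcturn.", vocabulary)
print(suspects)
```

Any suggestion would still have to be confirmed against the scanned image, since a spelling that appears in the original volume is retained rather than corrected.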

This Quality Assurance process is vital in ensuring that users find the data they need when performing searches on the website. Therefore, textual pages have been checked to a very high standard. It should be noted, however, that it was not possible to correct many pages of OCR'd tabular material to the same standard, so some problems may be encountered when searching.


Another Quality Assurance process involved checking the downloadable machine-readable tables to verify the accuracy of the tabular data and to carry out selective correction of these data. Most of the tabular material has been 'double re-keyed', which means that two independent electronic transcriptions have been made and then compared by the supplier. Most errors were found in this way, but some queries remained because of indistinct scanning. These were highlighted by the supplier and corrected where possible by the quality assessors, who then checked a sample of tables against the scanned images, and recorded and corrected any further mistakes. If they found many errors, the entire batch was checked. If a sum in an original table was incorrect, this was highlighted and a note was made of the change or error. In all cases scanned images have been retained and can be viewed alongside the machine-readable table.
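The comparison of two independent transcriptions and the check of printed sums can be sketched as below. This is a minimal example under stated assumptions: tables are represented as lists of rows of integers with the same shape in both keyings, and the function names `compare_keyings` and `check_totals` are hypothetical rather than the supplier's actual software.

```python
def compare_keyings(first, second):
    """List the cells where two transcriptions of a table disagree.

    Assumes both keyings have the same number of rows and columns.
    Returns (row, column, first_value, second_value) tuples.
    """
    differences = []
    for r, (row_a, row_b) in enumerate(zip(first, second)):
        for c, (a, b) in enumerate(zip(row_a, row_b)):
            if a != b:
                differences.append((r, c, a, b))
    return differences

def check_totals(rows, printed_totals):
    """List the columns whose printed total differs from the computed sum.

    Returns (column, printed_total, computed_total) tuples.
    """
    computed = [sum(column) for column in zip(*rows)]
    return [
        (c, printed, calc)
        for c, (printed, calc) in enumerate(zip(printed_totals, computed))
        if printed != calc
    ]

# Hypothetical table keyed twice; one cell was read differently.
keying_a = [[120, 135], [98, 102]]
keying_b = [[120, 185], [98, 102]]
differences = compare_keyings(keying_a, keying_b)
print(differences)

# Hypothetical printed totals row; the second total does not add up.
mismatches = check_totals(keying_a, [218, 247])
print(mismatches)
```

Each disagreement or mismatched total is only a query to be resolved against the scanned image: a wrong sum may be an error in the original volume, in which case it is highlighted and noted rather than silently changed.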

All of these Quality Assurance processes contribute to a high standard of searchable data accessible on the histpop website.