Hello all,
the indexing by Ancestry has caused much amusement with English and Welsh censuses long before they turned to Scotland. However, I am certain that indexing is done by humans and OCR is not to blame. It would appear that little or no training takes place and quality-control is missing.
OCR has been used for a long time with special characters, such as those printed on the bottom of cheques, and it is becoming more successful with typewritten fonts. PDAs (hand-held computers) also use OCR but generally they have to "learn" a person's writing and the person must write each character separately. The technology used in Palm Pilots seemed better than the Compaq PDA that I formerly used, where anything other than Ladybird-style letters caused problems. Even the best OCR software will have severe problems with cursive handwriting; and lots of lines, such as on census forms, would add to the woe.
If OCR were any good with handwriting, it would be used. As FreeBMD puts it, with my apologies for the dreadful conversion of an acronym into a verb:
"Handwritten records
No-one thinks that OCRing handwritten records is feasible."
Regards,
John
OCR madness .....
Moderator: Global Moderators
-
LesleyB
- Posts: 8184
- Joined: Fri Mar 18, 2005 12:18 am
- Location: Scotland
-
SarahND
- Site Admin
- Posts: 5647
- Joined: Thu Apr 27, 2006 12:47 am
- Location: France
Could it be that the untrained transcribers write it out by hand in printed letters, and the OCR is reading that??? Or even that they are typing them out on old-fashioned typewriters and the result is scanned by the OCR program? I find it hard to believe that humans would make the sort of mistakes that we are seeing. Granted, the job may be given to people who are used to other alphabets, who are being paid next to nothing and that by the job, rather than the hour, so no incentive to spend more time on it than absolutely necessary... It could be any combination of things, but I STILL can't believe that it is simply human error. Sorry, John
Regards,
Sarah
-
SarahND
- Site Admin
- Posts: 5647
- Joined: Thu Apr 27, 2006 12:47 am
- Location: France
-
grannysrock
- Posts: 472
- Joined: Wed Mar 02, 2005 9:21 am
- Location: Belgium
They have the technology - I couldn't find anything on their own website about using OCR on census returns ( they admit to it for newspapers obviously )
but I found this while googling:
http://www.integratedsolutionsmag.com/i ... e&aid=5356
ShAlL i CoMpArE tHeE tO sOmE tIcHmEaL PhRoSe.
etc
Sally
Newhaven-DRYBURGH,NICOLL,HUNTER(+Alloa) ; Lesmahagow-MITCHELL,LAMB, BARR, BROWN,CALLAN; Comrie-MCDOUGALL, MCEWEN, MCLAREN, BRYSON; BEW - PRINGLE, FISHER,SPENCE;Edzell-MIDDLETON,DORWARD;
Edin.-JOHNSTON, MONTGOMERY;Fife-SIME, FORRESTER, WANLESS
-
Russell
- Posts: 2559
- Joined: Sat Dec 24, 2005 5:59 pm
- Location: Kilbarchan, Renfrewshire
Sorry folks but
Maybe the kind you store in a drawer until needed.
I'm still giggling so much I can't think of an answer - sensible or otherwise.
I think they should be framed so they can just be admired
Who cares what they are supposed to mean.
Russell
PS can Archibald Phryman not be the local fish and chippie ?
is actually a recipe for a type of porridge.
Quote:
Tichmeal Phrose
Is it maybe a writing style?
Working on: Oman, Brock, Miller/Millar, in Caithness.
Roan/Rowan, Hastings, Sharp, Lapraik in Ayr & Kirkcudbrightshire.
Johnston, Reside, Lyle all over the place !
McGilvray(spelt 26 different ways)
Watson, Morton, Anderson, Tawse, in Kilrenny
-
DavidWW
- Posts: 5057
- Joined: Sat Dec 11, 2004 9:47 pm
That's the solution, except it's not quite "R X H", and I'm uncertain about the "Pay" bit. In fact, the occupation description for the Head of Household spills over into yet a further person's Rank, Profession, or Occupation line.
grannysrock wrote:
mmm... Here is Joanna's husband's (Walter W Lennox) occupation:
Joanna W Lennox in Hamilton is (of The Faculty Of Phrjeicitian Boycon)
of the Faculty of Physicians & Surgeons - as the Glasgow Royal College was known, - only slight problem is "Joanna" !
Assist Surgeon R X H Pay Licentiate)
Without seeing the original, I'd say OCR allocated her a bit of her husband's job! And I thought job sharing was a recent thing ...
That's me for today, - there's a few rather obvious ones that I'll do my best to double-check if I have any spare moments when recording the "NRH Challenge" for BBC on Tuesday at NRH.
grannysrock wrote:
I agree with David's and Sarah's suggestions - pity - I'd got a nice vision of the Royalhairy regiment - kilted of course.
I'm still puzzling with Sovaneno - I'm sure I've got a bottle of it somewhere and I don't know what Tichmeal Phrose is either.
Sally
Vide supra Sporran's comments, but I just cannot believe that "*******" was transcribed as Sovaneno, nor "*********" as Phranaker, unless the person involved was an absolute and complete, 100% untrained, English fluency doubtful, dunderheid.
John Lindsay is going to be the difficult man, as everything that follows his two word occupation is a very faint note, possibly added later, and I'm not convinced that it applies to the occupation, but possibly to the schedule as a whole, - whether or not it will be possible to make out all the words involved on a look at the microfilm, or even the original enumeration book remains to be seen.
Lastly, a new Scottish county for you - "Resburlishire"
And lastly, lastly, - this has been my first major foray into the Ancestry Scottish censuses.
What has particularly disturbed and worried me in relation to 2 or 3 of the entries in the list concerned here is the difficulty involved in linking to the SP index via more than one of the entries for the Ancestry list involved for that page. OK, there could be an element of unfamiliarity with the Ancestry search form, but I'm discombobulated, to say the least, as a result of this first experience, and this despite the glaringly obvious names of others on the same page on the actual images.
David
-
DavidWW
- Posts: 5057
- Joined: Sat Dec 11, 2004 9:47 pm
Thanks Sally !!
grannysrock wrote:
They have the technology - I couldn't find anything on their own website about using OCR on census returns ( they admit to it for newspapers obviously )
but I found this while googling:
http://www.integratedsolutionsmag.com/i ... e&aid=5356
ShAlL i CoMpArE tHeE tO sOmE tIcHmEaL PhRoSe.
etc
Sally
And I just have to copy the whole reference, for very obvious reasons !!
Speeding Up Document Capture
Electronic document imaging and validation yield big benefits for AncestryDPS.
Integrated Solutions, November 2006
Written by: Julie Ritzer Ross
AncestryDPS is part of Provo, UT-based MyFamily.com, Inc., which bills itself as a network for connecting families via the Internet by providing consumers with online access to information about their family trees. AncestryDPS processes 100,000 document pages per day, including birth/death records, military records, community directories, immigration records, and census records.
Not long ago, the company began looking for a document capture solution to replace a manual data entry procedure that was excessively time-consuming and, given the volume of pages handled daily, increasingly impractical. “We had staff who would manually key in information and process documents by hand, but considering the complexity of many of our projects, as well as the effort involved, we knew this wasn’t the best approach,” says Shawn Reid, AncestryDPS’ development director.
‘NO GO’ FOR OFF-THE-SHELF OPTIONS
Reid and his team first looked at several off-the-shelf options, none of which proved suitable. “By this point, we had, in preparation for moving away from manual methods, developed several homegrown add-on solutions to suit the special requirements and challenges presented by the type of documents we handle,” Reid explains. One such tool defined settings for brightness, contrast, and other parameters and applied these settings to batches of images. However, the configuration of the off-the-shelf products allowed for neither an interface with the proprietary modules nor other customization.
AncestryDPS then turned to integrator DoxTek, which recommended that the company implement the Ascent platform and INDICIUS solution from Kofax because of these products’ abilities to process and classify unstructured (e.g. handwritten census forms), semistructured (e.g. printed birth certificates containing some handwritten data), and structured (e.g. telephone directories) documents. The availability of an open application program interface (API) from the vendor helped clinch the deal.
In tandem with DoxTek and the Kofax Professional Services Team, AncestryDPS developed several more custom modules and, via the API, interfaced them with the new technology. For example, the integrator built an audit module designed to assess the accuracy of data keyed by offshore operators before that data flows electronically into the new system. Another custom module converts images to a required JPEG2000 compression format prior to their publication on the Ancestry.com Web site. It also converts images to a format required by INDICIUS.
All modules reside on AncestryDPS’ main server. Documents are scanned with a variety of hardware, such as the Kurta optical character recognition (OCR) robotic scanner. A custom module parses all data entries according to company-defined parameters and imports them to Ascent Capture, which digitizes entire documents or extracted data and uses OCR, intelligent character recognition (ICR), and optical mark recognition (OMR) to recognize machine- and hand-printed text. Customized image processing, review, and conversion are performed using the company’s proprietary tools.
The digitized documents or data are then imported to INDICIUS, which classifies, corrects, and validates AncestryDPS’ digitized data and images. The final data is then output to a file structure for import to the appropriate Web site.
IMPROVE DOCUMENT PROCESSING TIME BY 85%
According to Reid, automating document capture via the system is helping the company reap big gains on efficiency and accuracy. For example, AncestryDPS recently faced the challenge of capturing 72 million entries from British Telecom telephone directories published between 1880 and 1984. While it took operators 20 minutes to manually enter one page of a directory (about 300 listings), capturing the data via OCR and validating it using the new solution required 3 to 4 minutes.
“The Ascent platform has improved the processing of our documents by 85%,” Reid notes. “And adding the INDICIUS component means we can automatically capture both handwritten and printed information from all types of documents with a high degree of validation, ensuring that business processes run as smoothly as possible and allowing us to better serve the customers who access our Web site.”
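As a rough check on the figures quoted in the article, here is a minimal Python sketch that works out the time saved per directory page from the article's own numbers (20 minutes by hand, 3 to 4 minutes with OCR plus validation). The function name `percent_time_saved` is made up for illustration and does not come from the article or from Kofax.

```python
def percent_time_saved(manual_minutes: float, automated_minutes: float) -> float:
    """Percentage reduction in per-page processing time."""
    return (manual_minutes - automated_minutes) / manual_minutes * 100


# Figures quoted in the article: 20 minutes manually, 3 to 4 minutes automated.
for automated in (3, 4):
    saved = percent_time_saved(20, automated)
    print(f"{automated} min per page: about {saved:.0f}% less time")
```

On those numbers the saving comes out at roughly 80% to 85% per page, so the quoted "85%" matches the 3-minute case. It is a measure of speed only and says nothing about how accurate the captured data is.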
QED, or whit
David
-
SarahND
- Site Admin
- Posts: 5647
- Joined: Thu Apr 27, 2006 12:47 am
- Location: France
While out at the woodpile fetching wood for the stove just now, it suddenly occurred to me what this reminds me of...
Back in the early 70's, long before OCR readers, long before email and before most people in India had telephones in their homes, I chanced to send a telegram to Leonard Dart at 18 Rajdoot Marg, New Delhi
By the time it got there, it was addressed to:
Psonurf Fach
18 Cujfomt Marg
... and yet IT GOT THERE!!!!!
So, go figure that one out. Note that whenever the correct letter remains, it always remains in the right place in the word. I'm thinking it's something wrong with OUR brains that are somehow incapable of making that intuitive leap that was all in a day's work for the telegram carrier in Delhi.
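For anyone who wants to see which letters survived in place, here is a small Python sketch that lines up the sent and received versions and lists the letters that match position for position. The helper name `positional_matches` is purely illustrative, and the sketch only compares strings of equal length character by character; it is not a claim about how the telegram was actually garbled.

```python
def positional_matches(sent: str, received: str) -> list[tuple[int, str]]:
    """Return (index, letter) pairs where both strings have the same letter at the same position."""
    return [
        (i, a)
        for i, (a, b) in enumerate(zip(sent.lower(), received.lower()))
        if a == b and a != " "  # ignore the space between the two names
    ]


pairs = [
    ("Leonard Dart", "Psonurf Fach"),
    ("Rajdoot", "Cujfomt"),
]
for sent, received in pairs:
    kept = "".join(letter for _, letter in positional_matches(sent, received))
    print(f"{sent} -> {received}: letters kept in place = {kept}")
```

By this count, only "o", "n", "r" and "a" survive from the name and "j", "o" and "t" from the street, each in its original position.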
Sarah
-
LesleyB
- Posts: 8184
- Joined: Fri Mar 18, 2005 12:18 am
- Location: Scotland
Hi Sally
With regard to the link you posted above...did you understand the article? If so, you have waaaaay more patience and willing brain cells than I do by this time on a Sunday afternoon ! I dipped into a paragraph ...and promptly gave up:
Quote:
AncestryDPS then turned to integrator DoxTek, which recommended that the company implement the Ascent platform and INDICIUS solution from Kofax because of these products’ abilities to process and classify unstructured (e.g. handwritten census forms), semistructured (e.g. printed birth certificates containing some handwritten data), and structured (e.g. telephone directories) documents. The availability of an open application program interface (API) from the vendor helped clinch the deal.
eh??
I think they may already have invented "Tichmeal Phrose" or gobbledegook as it is also known....
Best wishes
Lesley