Vintage Newspapers: Untapped Source for E-book Publishers (Part 3)

I’ve been writing about mining old public-domain newspapers for e-book content. (See Part 1 and Part 2.) Those papers constitute a virtually untapped and free-to-use trove of weird news, forgotten history and even the occasional piece of good literature.

Long before digitized newspaper archives became readily available and searchable, I loved poring through those yellowed pages of newsprint or scrolling through microfilmed copies of them. All of it fascinated me — the ads, the news stories (history’s first draft, they’ve been called), and even the crime reports and scandals, which confirm that people haven’t changed much.

It’s one of my odd enthusiasms, not shared by everyone. However, enough people do share it to make an audience for e-books compiled from extracts of yesteryear’s papers.

I’ve already sketched the process for coming up with a theme for a book and searching digital archives for your content. Now I want to address the most laborious part: transcribing that content.

There’s no way around it. You’d better be a damned good typist if you embark on this kind of book, or be willing to pay someone who is. You will spend hours sitting and banging away on your keyboard as you copy one story after another from your screen to a text file.

(You probably don’t literally bang your keyboard, but I do. Blame it on my learning to type on a big manual Underwood. In my last office job people would ask me why I pounded the keys so hard. The youngsters were especially mystified, and I’m sure I annoyed everyone.)

This is why I said earlier that it’s a pain in the butt to create this kind of book, as fun and in some ways easy as it is, too. A pain in the fingers, also.

What about OCR?

You might think Optical Character Recognition, or OCR, is the answer. Just let your computer do the typing!

OCR programs are terrible with old newspapers, though. Faded or muddied type, variously sized headlines and subheads, the column layout format, the pictures and ads — all play havoc with their algorithms. At best the OCR output will need heavy correcting; at worst it will be hash.

Most of the digital newspaper archives already offer OCR-created text versions of their holdings. But don’t get excited. Here, for example, is the LOC’s text version of the first part of a New York Tribune story on the Titanic, from April 17, 1912:

I’KICjTj UaNr.? I ???> 1 EL8?WPERE TITO CEMS.
TIT?NICS DEAD, 1,342, INCLUDING CAPTAIN
AND ALL OFFICERS EXCEPT FOUR-868 SURVIVE
su
DETAILS OF CRASH
Story of Wreck Credited to Brit?
ish Steamer Bruce, Alleged
to Have Overheard
Wireless.
TITANIC MAKING 18 KNOTS
Bottom Said to Have Been Ripped
Off from Bow to Amidships
?Perfect Order Report?
ed Turned to Panic
as Liner Sinks.
Ell John’s, N F.. April 16? A mor?? or
lees detailed storj of collision <>f the
Titanic ?a ith an iceberg Sunday nlfhl
?nd of her sinking la ? unent here to?
day, although the version is n??t cred?
ited Th?. -ii’-n .? ?if tho story is the
British Bt??amer Bruce, which wa
?l-is port on March 1*9 <-?nd is ntm on
h?-*r ?raj ‘” Sydney, S. B. she la
posed to h.*\ o picked up 1>> wireless the
story from other hipe which were near
Titanic an?l from other vessels
rhlch took u; the thread ??- they re?
…i?-.-?i ?t from Intercepted ?wlreleaa mea?
??g? s.
“icoordlng to this account, the Titanic
-??-.iniing at the rate of eighteen
kfioK- when ?-he hit the berg, and that
the Impact was so terrifh? as alnv’St t.?
tear the ship asunder. The d.-ckincs
were l?r??k?^n through and the bulkhea?ls
forming the watertight c-omp-axtmenta
. .I in from the boa t?> nearly
amidships, it is .???lid. The story has it
that th” force of the collision sm;ish??’l
il <?L the b,.ats and nil the upp<ar
n.-.rks to ple.es.

I’ve seen worse and I’ve seen better, but this example is fairly typical.

I’ve tried downloading newspaper page images as PDFs or JPGs and running OCR on them myself. The output is rarely any better.
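
For anyone comfortable with a little scripting, a rough gauge of how garbled a passage is can help decide whether to correct the OCR text or simply retype the story. The following is a minimal Python sketch, not a tested tool. The function name `suspect_ratio` and the rule it uses (a word is "suspect" if it has stray symbols in the middle or at the front) are assumptions made for illustration, and the rule will miss misspellings made entirely of letters, such as "storj" or "nlfhl".

```python
import re

# A token counts as "clean" if it is letters/digits, optionally joined by a
# single internal hyphen, apostrophe, period or comma, with ordinary
# punctuation allowed at either end. This pattern is an illustrative guess.
TOKEN_OK = re.compile(
    r"[\"'“‘(\[]*"
    r"[A-Za-z0-9]+"
    r"(?:[-'’.,][A-Za-z0-9]+)*"
    r"[.,;:!?\"'”’)\]]*"
)


def suspect_ratio(text):
    """Return the share of whitespace-separated tokens that look garbled."""
    tokens = text.split()
    if not tokens:
        return 0.0
    suspects = [t for t in tokens if not TOKEN_OK.fullmatch(t)]
    return len(suspects) / len(tokens)


if __name__ == "__main__":
    sample = "ited Th?. -ii’-n .? ?if tho story is the"
    print(f"{suspect_ratio(sample):.0%} of tokens look garbled")
```

What counts as "too garbled to bother with" is a judgment call. The number is only useful for comparing one page or archive against another.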

Interestingly, a few of the major online newspaper repositories — the Library of Virginia’s Virginia Chronicle and the California Digital Newspaper Collection are two I’m aware of — have ongoing projects in which volunteers check and correct the machine-outputted OCR. Where this has been done, the cleaned-up text is quite good. The problem is that most of the newspapers even in these archives have yet to be touched by human clean-up crews.

How about Dictation?

The other possibility for avoiding all that typing is dictating the stories you want to use to your computer.

One of my colleagues who creates Kindle books using vintage newspapers as her source, Karen Ballentine, told me she made a valiant effort to use the acclaimed Dragon NaturallySpeaking program for this purpose. She said that between all the training she had to put the program through and the still-less-than-perfect results, it wasn’t worth it. She went back to typing everything.

There are now several free dictation programs available, which you use through your browser and which seem to do about as good a job as premium programs such as Dragon. One of these is a part of Google Docs now (you have to be in Google Chrome to use it). Here are the instructions for it.

Another one I’ve tried is called simply Online Dictation. Here is the text output of the same Titanic story, which I just now dictated:

St Johns hears details of Crash

Story of Rick credited to British steamer Bruce, alleged to have overheard Wireless. 

Titanic making 18 knots

Bottom said to have been ripped off from bound to amidships perfect order reported turned to panic as liner sinks. 

St John’s, NF, April 16th. A more or less detailed story of collision of the Titanic with an iceberg Sunday night and of her sinking is current here today. The source of the story is the British steamer Bruce, which was in this port on March 19th and is now on her way to Sydney, n. S. She is supposed to have picked up by wireless the story from other ships which were near the Titanic and from other vessels which took up the thread as they received it from intercepted Wireless messages.

According to this account, the Titanic was steaming at the rate of 18 knots when she hit the bird, and that the impact was so terrible is almost to tear the ship of thunder. Were broken through and the bulkheads forming the watertight compartments were crushed in from the back nearly amidships, it is said. The story has it that the force of the Collision smashed several of the boats and all the upper Works to pieces.

Clearly, this is light years ahead of anything OCR can accomplish. However, there are still problems, such as odd capitalizations, annoying word substitutions (“Rick” for “wreck,” “bound” for “bow,” “bird” for “berg,” “of thunder” for “asunder”) and dropouts (It should say, “The deckings were broken through …”). If you’re dictating a long story, tidying up will still be tedious.
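
Some of that tidying can be pointed out by a script, even if it can't be done by one. Below is a minimal Python sketch that lists capitalized words appearing in the middle of a sentence, unless they are on a list of names you expect to see. The function name `flag_stray_capitals` and the `known_names` list are hypothetical, made up for this example. It only flags words for a human to check and changes nothing itself.

```python
def flag_stray_capitals(text, known_names):
    """List mid-sentence capitalized words that are not in known_names."""
    flagged = []
    sentence_start = True  # the first word of the text starts a sentence
    for token in text.split():
        # Remove surrounding punctuation so "Bruce," is compared as "Bruce".
        word = token.strip(".,;:!?\"'“”‘’()")
        if (
            word
            and word[0].isupper()
            and not sentence_start
            and word != "I"
            and word not in known_names
        ):
            flagged.append(word)
        # The next word starts a sentence if this token ends one.
        sentence_start = token.rstrip("\"'”’)").endswith((".", "!", "?"))
    return flagged


if __name__ == "__main__":
    line = ("Story of Rick credited to British steamer Bruce, "
            "alleged to have overheard Wireless.")
    print(flag_stray_capitals(line, {"British", "Bruce"}))
```

On the subhead from the dictated story, this flags both "Wireless" (an odd capitalization) and "Rick" (a word substitution that happens to be capitalized). It will not catch lowercase substitutions such as "bird" for "berg", so a careful read-through against the original page is still needed.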

Back to the Keyboard …

There you have it. If you’ve a mind to publish a vintage newspaper book similar to what I and a few others have been doing, be prepared to exercise your hands — a lot. I hate to end this series on that negative note, but I want to be completely honest about the work involved.

Or, perhaps you’ll come up with a workaround for the typing that I hadn’t considered. If you do, please let me know. I — and my fingers — will thank you.

(Be sure to read Part 1 and Part 2 of this series if this is a subject that interests you.)