ASCII-to-Rader

This idea needs some help. Or maybe the hook. I don’t know:

With a digitally scanned copy of a Gregg Dictionary, it seems to me that the only major hurdle (besides copyright?) to having a program that translates type into Charles Rader’s Gregg would be a method of batch-analysing the dictionary page images into sets of type-to-outline pairs. With a table matching each typed word to an outline image, writing a translating program would be straightforward.

The method would be brutal (compared to something like psetus’ real-time translator, Greggory); the static database would be irritating to some degree. But the Simplified dictionary, at least, has a pretty big lexicon. Gaps could be filled in by a skilled writer to allow for most of the ordinary prose in your Stephen King novels. Don’t bother trying to convert Finnegans Wake, of course.

Can anyone imagine a method of doing that kind of batch-programmed image parsing? I can imagine the mechanics of converting from word to image, but I don’t know anything about image manipulation.

-Derek

(by routine-sibling for everyone)

 


  1. Derek,

    I've been thinking along the same lines, even going so far as scanning a book and using (gasp) MS Paint to cut out the words I wanted.

    It would be just great if we had a program that turned words into complete shorthand outlines. Would it be hard to replace a word w/ a graphic?

  2. In principle, it's easy to substitute type with a graphic; any powerful markup language does it: TeX, LaTeX, Troff, even html (although html isn't as programmable).

    I should say that by "easy", I mean straightforward and inevitable—not doable in one sitting. But within my own grasp, at least, given sufficient time. I'd be surprised if there weren't others on the list more adept at their particular markup language than I am with LaTeX, though. (…Can you tell I'm really hoping someone else wants to do that part?)

    Getting the type-word to outline table seems like the big hurdle to me; it *has* to be programmed. Unless someone wants to spend 300-pages-times-15-minutes-per-page (75 hours) pointing and clicking it together.

    -Derek

  3. Derek,

    I have to tell you I'd be willing to do the 75 hours of point and click if I could read some of my favourites (post copyright expiration, of course) in Gregg DJS. It's my biggest shortfall at the moment, reading, and from what I've read, it seems a fairly common shortfall indeed.

    Billy

  4. Wow, that's commitment! But I have to believe there's a way that would be easier on one's sanity. I had an optical character recognition ("OCR") program (like PaperPort, or something) when I was running Windows that made attempts at distinguishing between graphics and text. Maybe that's a start?

    In the meantime, here's a proof of concept:

    http://www3.telus.net/familyvonthomas/gregg/txt2gregg_html-out.html

    That horribly moused Gregg is the output of this text file:

    http://www3.telus.net/familyvonthomas/gregg/test_text2.txt

    put through this *10-line* sed script:

    http://www3.telus.net/familyvonthomas/gregg/txt2gregg_html.sed

    which just turns the words in its input into references to this hacked-together Gregg dictionary:

    http://www3.telus.net/familyvonthomas/gregg/dictionary/

    I left "paper," "tied," and "strings" out of the dictionary just to give an idea of how the system might handle gaps (with an alt tag).
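    For the curious, the heart of such a script can be sketched in one substitution. (This is a hypothetical reconstruction, not the actual txt2gregg_html.sed linked above, and the dictionary/ path is an assumption.)

```shell
# Every run of letters becomes an <img> pointing into the dictionary
# directory; the alt text keeps the page readable when an outline is
# missing. "&" in the replacement is the whole matched word.
printf 'the cat sat\n' |
  sed -E 's#[A-Za-z]+#<img src="dictionary/&.gif" alt="&">#g'
```

    With a word like "sat" this emits `<img src="dictionary/sat.gif" alt="sat">`, which a browser renders as the outline when the gif exists and as plain text when it doesn't.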

    The technically inclined among you might smell the problems already:

    1) "A 30,000-gif file database?!"

  5. …continued…

    1) "A 30,000-gif file database?!"

    Honestly, I really don't know how this method would scale. But plain text can be chopped up into RAM-manageable pieces, and the sed script only works on one line at a time anyway. "Page"-sized output could be generated a unit at a time this way, but I don't know about novel-sized.
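    To make the chunking concrete, here's a hedged sketch: a single word-to-img rule stands in for the full sed script linked above, and all the filenames are made up.

```shell
# Convert a long text one chunk at a time so memory stays bounded:
# split the input into 250-line pieces, run each through the word
# rule, and append the results into one HTML file.
rm -rf /tmp/gregg-chunks && mkdir -p /tmp/gregg-chunks && cd /tmp/gregg-chunks
seq 1 1000 | sed 's/.*/some plain text/' > novel.txt   # stand-in input
split -l 250 novel.txt chunk.                          # four 250-line pieces
for c in chunk.*; do
  sed -E 's#[A-Za-z]+#<img src="dictionary/&.gif" alt="&">#g' "$c" >> novel.html
done
```

    The output has the same line count as the input, so novel-sized texts should just mean more chunks, not more memory.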

    2) Typesetting

    I think this is all in the way the outlines are extracted and located on their gif canvas. Transparency is also possible with gifs for colliding outlines.

    3) Phrasing

    Sed uses regular expressions to pick out patterns; common phrases could be included in the dictionary.
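    One wrinkle worth sketching: if a phrase rule inserts its <img> markup first, a later word rule will happily re-match the letters inside that markup. A letterless token in between avoids it. (Hypothetical paths; the @1@ token is just an illustration.)

```shell
# Phrase pass -> letterless token, then the per-word pass, then
# expand the token into the phrase outline's markup.
printf 'I would have thought so\n' |
  sed -E -e 's#I would have#@1@#g' \
         -e 's#[A-Za-z]+#<img src="dictionary/&.gif" alt="&">#g' \
         -e 's#@1@#<img src="phrases/i-would-have.gif" alt="I would have">#g'
```

    Each phrase in the dictionary would get its own token and final expansion rule; common phrases go first so they win over their constituent words.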

  6. I've been putting some thought into the same kind of thing. What you're proposing – a large database of words and phrases – seems like the best way of tackling it.

    As I see it, the main snag is the large number of possible phrases. The dictionary (at least, the one I have – Anniversary 1930) lists only single words.

    To produce something close to "real" Gregg, you'd need quite a large number of phrases. This isn't a problem for the common ones ("I would have," "he did not," etc.), which you could just scan from 5000 Most Used Shorthand Forms, but when you get into verbs, it starts to be a problem, since even uncommon ones would ideally include the various forms with each of the pronouns – so a verb like "reenlist" would have to be indexed as "reenlist," "I reenlist," "you reenlist," "it reenlists," etc. – as well as "to reenlist". A database could easily handle the number of forms required, but creating those forms would be a big job. Not impossible, but big.

    Most of these forms would have to be created by hand, unless there was an automatic method for sticking words together. But, of course, if you had that, you wouldn't need to bother with all the dictionary scans – you could make do with just a few hundred letters and letter-groups, and have the program create them on the fly.

  7. I wouldn't mind if the program had more than 30,000 entries, I have multi-GB games on my computer after all!

    It might be possible to get both a better filesize and a better image by using PNGs rather than GIFs. Scan the outlines in using only black and white, remove the white, and shrink the filesize using pngcrush. I'm not sure though, it's just a suggestion.

  8. DoonKhan:
    > …the main snag is the large number of possible phrases…

    Exactly: brutal method, right? But it's pragmatic. Remember Zipf's Law; a critical mass of Gregg outlines would give an exciting and probably helpful word-based translation of *most* prose. Common phrases could be thrown into the image bin, like you say.
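    Zipf's Law is easy to check on any sample text with standard tools; the top of the resulting frequency list is where the first scans would pay off most. (The sample input here is made up.)

```shell
# Break a text into one word per line, lowercase it, and rank the
# words by frequency. Scan the top of this list first.
printf 'the cat and the dog and the bird\n' > /tmp/sample.txt
tr -cs 'A-Za-z' '\n' < /tmp/sample.txt | tr 'A-Z' 'a-z' |
  sort | uniq -c | sort -rn | head -5
```

    Run on a real corpus, the same pipeline would show a small set of words covering most of the running text, which is exactly what makes the brutal method viable.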

    > …when you get into verbs, it starts to be a problem, since even uncommon ones would ideally include the various forms with each of the pronouns …creating those forms would be a big job. Not impossible, but big …

    Maybe impossible; isn't the number of properly formed phrases indenumerable?

    > …Most of these forms would have to be created by hand, unless there was an automatic method for sticking words together…

    Earlier on I was thinking about handling this problem with typesetting (using TeX) and some simple grammatical rules. But I can't imagine the solution clearly enough yet (and I'm not holding my breath).

    In contrast, with a method as simple as what I've got above, gaps could be filled in just by adding an image to a directory. And if there was public access to that image collection, gaps may get filled in quicker. Remaining ones are handled legibly if not elegantly. But as long as we're just *mapping* (as opposed to really *transcribing*), we're faced with *Heaps' Law*: little gaps in our dictionary will persistently show up. (Thanks Wikipedia!)
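    Those persistent Heaps'-Law gaps are at least cheap to find mechanically. A sketch, with a stand-in dictionary directory:

```shell
# List the words of an input text that have no .gif outline yet in
# the dictionary directory, i.e. the gaps still needing a writer.
rm -rf /tmp/dict && mkdir -p /tmp/dict
touch /tmp/dict/the.gif /tmp/dict/cat.gif      # stand-in dictionary
printf 'the cat sat\n' > /tmp/in.txt
tr -cs 'A-Za-z' '\n' < /tmp/in.txt | tr 'A-Z' 'a-z' | sort -u > /tmp/wanted
ls /tmp/dict | sed 's/\.gif$//' | sort > /tmp/have
comm -23 /tmp/wanted /tmp/have                 # prints the gap: sat
```

    Publishing this missing-word list alongside the public image collection would let anyone with a pen and a scanner pick off the gaps.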

    I won't try to convince anyone that the output would be *real* Gregg, but for a beginner like me, at least, it would be beneficial. And a lot more enjoyable than "…Mr. John's meeting with the Atlanta staff will end at 4 o'clock, at which point he will go to the train station…"

    CarleyMcNinch1:
    > …may tie in another thread…resurrecting the Gregg Writer?

    I'm not sure whether that question is for John, or me, the OP. What are your thoughts?

    Orlee:
    > I wouldn't mind if the program had more than 30,000 entries, I have multi-GB games on my computer after all!

    I did the math, actually, and at 250 bytes a gif, 30,000 is still well under 8MB. Breaking the one directory up into 26 (A-Z) directories might speed up seek times, too.
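    The sharding itself is a small shell job; a sketch with hypothetical paths:

```shell
# Move each gif of a flat dictionary into a per-letter subdirectory
# (a/, b/, ...) keyed on the first letter of its filename.
rm -rf /tmp/flatdict && mkdir -p /tmp/flatdict
touch /tmp/flatdict/apple.gif /tmp/flatdict/avid.gif /tmp/flatdict/banana.gif
for f in /tmp/flatdict/*.gif; do
  b=$(basename "$f")
  d=/tmp/flatdict/$(printf '%.1s' "$b")   # first letter -> subdirectory
  mkdir -p "$d" && mv "$f" "$d/"
done
```

    The sed script's img paths would then just gain the extra letter directory.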

    > It might be possible to get both a better filesize and a better image by using PNGs rather than GIFs…

    Sounds like you know more about image formats than me. I just read an MS help site that said:

    Use GIF files for:

    * Images that contain transparent areas.
    * A limited number of colors, such as 256 or less.
    * Colors in discrete areas.
    * Black-and-white images.
    * A small-size image, such as a button on a site.
    * Images in which sharpness and edge clarity are important, such as line drawings or cartoons.

    Incidentally, for anyone concerned, I've faxed McGraw-Hill Ryerson to make sure they don't mind this happening to the Simplified Dictionary, whose copyright is still in effect, I think. I'll re-post if I hear back from them.

    -Derek

  9. If the GIFs you're producing now are that small, then there's no reason for the extra effort; skipping the PNG compression step will save time in completing the program.

  10. As I have mentioned before, I think the best solution for this is similar to that used for languages like Japanese and Chinese. These languages have double-byte character sets, which are accessed using a front-end processor, commonly called an "IME" or "Input Method Editor." One types in a sentence, and the editor parses it and provides a sentence with candidate characters, based upon a complex set of rules. The sentence can be reviewed, and candidates can be selected and changed (using a drop-down menu) prior to "finalizing" the sentence. The IME can also "learn" your own choices, and improve its first candidates over time.

    The application of this technology to Gregg shorthand should be obvious. Since the process involves conversion, one could also select various input styles. For instance, one could convert standard keyboard input to Gregg, or by switching modes, one could choose to convert abbreviated keyboard inputs.

    Elegant, but of course it would require a relatively high investment of time and energy compared to the number of folks who would benefit by it. The advantage would be the ability to easily create new Gregg documents, and to produce study and review materials using current desktop publishing technology.

    Cheers, JSW

  11. Again, this outline-mapping method is *not* supposed to be the best solution—just an acceptable one requiring minimal investment.

    I'm completely unfamiliar with the Chinese or Japanese languages, stenomouse. However, the solution you're suggesting sounds like it requires a closed character set (an "alphabet"; something like 13500 characters for modern Chinese, the internet tells me) for its object. The point of the phrasing problem referred to above is just that *as a set of outline-units,* Gregg is *not* a closed set—the number of possible phrase outlines is indenumerable. The only closed set we've got in our object domain is the Gregg alphabet; so applying the IME method would still require a program that joined up the alphabetical elements properly. But maybe you already knew that.

    -Derek

  12. Derek:
    > Again, this outline-mapping method is *not* supposed to be the best solution—just an acceptable one requiring minimal investment.
    > I'm completely unfamiliar with the Chinese or Japanese languages, stenomouse. However, the solution you're suggesting sounds like it requires a closed character set (an "alphabet"; something like 13500 characters for modern Chinese, the internet tells me) for its object.

    Actually, Chinese has something like 60,000 total characters, but perhaps Big5 only maps 13,000 or so. I'd be somewhat surprised if that were the case.

    > The point of the phrasing problem referred to above is just that *as a set of outline-units,* Gregg is *not* a closed set—the number of possible phrase outlines is indenumerable.

    Yet, as in the case of Chinese characters, some outlines are statistically more likely than others. I suspect that while the potential number of outlines is very great, the actual number of outlines utilized isn't really "indenumerable."

    > The only closed set we've got in our object domain is the Gregg alphabet; so applying the IME method would still require a program that joined up the alphabetical elements properly.

    First, a very large number of outlines could be stored on code pages, just like Chinese characters. Extensions/expansions of the same could be developed over time, depending upon need and usefulness. The individual elements of the Gregg alphabet itself would appear as a part of this code page, just as Japanese includes both native phonetic elements (hiragana/katakana) and kanji on its code pages.

    However, perhaps you are correct that handling this programmatically might be best. Again, the same theory behind the IME should be adaptable to this. That is, joining up the alphabetical elements is based upon established rules. An IME could provide outline candidates based upon those rules, much as it is possible to construct a Chinese character from its constituent elements, or much like an IME presents a candidate sentence from the input.

    Here is how an IME works: http://www.microsoft.com/globaldev/handson/user/IME_Paper.mspx#EGAA

    > But maybe you already knew that.

    Could be! 🙂

    JSW

  13. Indeed. Maybe we should start a "Gregg, Linguistics, Set theory, and Computational Models of Translation" thread sometime.

    *The next action* (for anyone with a scanner and interest) is scanning that dictionary. When I get my scanner back up I'll start on the Simplified edition.

    Then it's just ripping those gifs out of the dictionary, saving each file as the word name itself—just as in my example directory above.

    Once there's a good bunch of words in the directory (call it the online dictionary), just run some favourite text through the sed script above, and there you have it. Add more outlines to your directory, see more in your html output. Later, add some phrase outlines, let me define them in the sed script, see them in your html; fix bugs as we go, repeat…

    I think I can add a little more helpful detail for that process. I'm still eagerly awaiting a smart (programmed) way of getting the outlines out of the book (anybody…still?), but if we *must* use brute point-and-click force, a couple of tiny methodological points should be kept in mind when ripping the gifs, and some conventions are probably in order—especially if the online dictionary eventually gets input from more than one person (it would be nice to spread out those 75 hours!). I'll try to write a little thing in the next day or two and post it as http://www3.telus.net/familyvonthomas/gregg/modus.html

    Soon I suppose we should take the discussion off-list, too, as it's getting pretty specific.

    -Derek

  14. I think an important convention is that the person who scans it in pays attention to the angling of the pages and characters. A list of complete, scanned pages wouldn't hurt either.

    As for the discussion surrounding the Japanese and Chinese orthographies, I can say that Japanese uses Chinese characters (kanji), but shows verb tenses and noun cases on kanji (representing the bases of words) with its syllabary (hiragana, based on the Chinese grass script); grammatical number doesn't exist in the language. In total, the Japanese only use about 6,000 kanji, and their newspapers use 2,000. The Japanese orthographies benefit, as far as programming goes, from their language's regularity and low number of phonemes.

  15. Thanks, Orlee; I've ordered a $1.98 copy of the dictionary to de-bind and put through my scanner's auto-feed to ensure a square, consistent scan of everything.

    As I mentioned, I have some other thoughts about the dictionary creating process (see http://www3.telus.net/familyvonthomas/aggregate/modus.html ). But in deference to the rest of the group here I'd like to move this now kind-of specialized discussion to a new mailing list: http://groups.yahoo.com/group/aggregate_users email: aggregate_users@yahoogroups.com . (I'm calling this thing "Aggregate" now.) I'd really like to hear any and all your thoughts and suggestions on this, so please join the list—even if you just want the updates!

    I will repost to this thread if and when Aggregate becomes general Gregg interest again.

    -Derek

  16. Months ago I bought a cheap DJ dictionary to remove the binding and feed through the scanner.

    Clearly, this thread has been focussing on Simplified — will we eventually expand to other series?

    Also, what program are you creating the gif files in? I use (I can just hear the gasps) Paint, which doesn't seem to offer me the ability to render the background transparent. Or does it?

    It may be a good idea to scan the shorthand portions of texts as well — the dictionaries have holes, some of which can be plugged with mouse-drawn outlines, but likely could be plugged with an image from the book: in my Simplified dictionary there are seem, seemed, seemingly, seemliness, and seemly, but no seems. It seems to me that we might need it and others.

    Billy (sidhetaba)

  17. Would you mind re-posting that message to aggregate_users@yahoogroups.com, Billy? Even if there's only a few of us there, it would be nice to chat about gifs, transparency, resolutions, scanners, etc. for a little while without the "keep posts Gregg-related" stipulation.

    Aggregate Home: http://groups.yahoo.com/group/aggregate_users

    Subscribe: aggregate_users-subscribe@yahoogroups.com

    Post message: aggregate_users@yahoogroups.com

    Unsubscribe: aggregate_users-unsubscribe@yahoogroups.com

    List owner (that's me!): aggregate_users-owner@yahoogroups.com

    Cheers!

    -Derek

  18. I've mentioned that there's no deadline for this project, but as
    my personal responsibilities have changed quite a bit and the weeks
    continue to pass, I've discovered that time for these tasks is just not
    going to avail itself. I've decided to take the project off my desk,
    with sincerest appreciation to those who've encouraged it thus far.

    If anyone is keen and with spare time enough to move this forward better
    than I've been able to, I'd be very happy to see him or her take over
    the Yahoo group; just email me.

    A quick summary of where the project is at for anyone interested. First,
    we have set up a simple, non-technical way of extracting the outlines into
    a digital index; using an image editor (such as IrfanView), we "cut out,"
    then drag and drop an outline into its matching word-directory in our
    huge English dictionary directory (currently still available through the
    link on the Yahoo web page). We have a very simple and functional program
    (written in sed) that converts any plain digital text into Gregg outlines;
    it can match words and all the common phrases one would expect to see
    in a published Gregg text. These are the things that remain on the
    "administrative" todo list:

    1) create and post a "sign up" table for people to list what portion of
    which dictionary they would like to edit into the directory.

    2) set up an email address (or anonymous ftp) for a central collection of
    image submissions.

    3) read up on ImageMagick and set up a simple way to batch-process and
    standardize the size and colour of the outline images.
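    That third item might start as a dry run like this; -trim, -threshold, and -resize are standard ImageMagick options, but the paths and sizes here are guesses, and the echo keeps it from actually running anything.

```shell
# Print the ImageMagick command that would trim whitespace, force
# black-and-white, and normalize the height of each scanned outline.
# Drop the echo to run the conversions for real.
rm -rf /tmp/rawscans && mkdir -p /tmp/rawscans /tmp/outlines
touch /tmp/rawscans/aback.png                  # stand-in scan
for f in /tmp/rawscans/*.png; do
  w=$(basename "$f" .png)
  echo convert "$f" -trim -threshold 50% -resize x48 "/tmp/outlines/$w.gif"
done
```

    A batch pass like this is what would keep everyone's submissions a consistent size and colour, however many people end up contributing.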

    Finally, there was some earlier discussion about the limitation of
    phrasing for a dictionary-based approach to digital transcription,
    ("Phrases can be transcribed easily enough using the same regular
    expressions that I used to match words (in the sed script), but whither
    the Gregg phrase outlines—especially for phrases like 'I don't like
    computers'?"); I'd like to get one more thought in on this. I've found
    that Gregg texts recommend against making outlines for phrases whose parts
    can't be joined easily, phrases whose words lead the outline away from the
    line of writing, and *uncommon phrases* (unless they're special to the
    writer's field). In other words, the set of all *sensible* phrases is
    pretty well-limited; a dictionary like Aggregate's could easily include
    enough of them to produce the kind of well-transcribed English a Gregg
    reader would expect—not just word-for-word mapping. Gregg phrase books,
    including that very set of sensible phrase outlines, are well published,
    and so could be used along with the ordinary dictionaries to source
    digital outlines.

    Cheers,

    -Derek

  19. Wow! I just now realized there was a Yahoo group on this subject—I thought everyone had forgotten about computerizing Gregg.

    By the way, I would consider this discussion Gregg-related, and appropriate in this section, but if you are ever unsure, you can post in the Anything Goes section—any subject is allowed there. Since this particular project is such a huge one, I think it was a great idea to create the dedicated discussion board. I am adding it to the links section.

    Also, in case anyone didn't realize, SH List Demo.zip is available for download in the Documents section—a prototype Gregg software made by Tyler.

    __________________________
    Shorthand: isn't it about time?
