Content generation techniques

Alrighty. Few people despise content creation as much as I do. While I do have a trick or two I'll be hoarding for myself, I thought I'd share the basic methods of content generation, and the things you should look into if you want to build an engine that takes it up to the next level.

  1. The Text Jumble
    Scraped text from different sources, recombined in randomly sized blocks (2-6 words), is a decent way to create text, so long as no one will ever see it (cloaking), and so long as the search engines don't get any better at recognizing proper language patterns. There's not much to say on this one, so I'll leave it for now. It's the basic entry level of content generation, and rarely makes a coherent sentence.
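
For the curious, the jumble can be sketched in a few lines of Python. This is a rough sketch, not a finished tool: the function name `text_jumble`, its parameters, and the idea of passing in a seeded `random.Random` are my own assumptions for illustration.

```python
import random

def text_jumble(sources, n_blocks, rng=None, min_len=2, max_len=6):
    """Recombine randomly sized word blocks (2-6 words by default) from scraped texts."""
    if rng is None:
        rng = random.Random()
    # Split every source into words and drop sources too short to supply a block.
    token_lists = [s.split() for s in sources]
    token_lists = [t for t in token_lists if len(t) >= min_len]
    if not token_lists:
        return ""
    out = []
    for _ in range(n_blocks):
        words = rng.choice(token_lists)                          # pick a source at random
        size = rng.randint(min_len, min(max_len, len(words)))    # block size, 2-6 words
        start = rng.randint(0, len(words) - size)                # random position in the source
        out.extend(words[start:start + size])
    return " ".join(out)
```

Calling `text_jumble(my_scrapes, 50)` with a list of scraped strings gives you 50 blocks glued together, and it reads about as badly as you'd expect.
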
  2. Markov/String Permutations
    I've been told this is the concept behind Markov chains, so I'll call it that. I know the concept better than I know the name.
    This is a common way of generating semi-coherent text, and has the greatest potential for the future, if made more intelligent. Here's the concept.
    1. Scrape lots and lots of data on a given topic. Arrange it by sentence.
    2. Select a random start word from your sentences.
    3. Search through your data for all the words that come AFTER that word, and append one of them at random.
    4. Rinse, lather, repeat.
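
The four steps above can be sketched roughly like this in Python. The names `build_followers` and `generate` are made up for the example, and I'm assuming the start word is picked from words that have at least one follower, so the chain can always take a first step.

```python
import random
from collections import defaultdict

def build_followers(sentences):
    """Map each word to the list of words seen directly after it."""
    followers = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        for current, nxt in zip(words, words[1:]):
            followers[current].append(nxt)   # duplicates kept, so common pairs get picked more
    return followers

def generate(sentences, max_words=20, rng=None):
    """Walk the follower table from a random start word."""
    if rng is None:
        rng = random.Random()
    followers = build_followers(sentences)
    if not followers:
        return ""
    word = rng.choice(sorted(followers))     # step 2: random start word
    out = [word]
    # Steps 3 and 4: append a random follower until we hit a dead end or the word limit.
    while len(out) < max_words and word in followers:
        word = rng.choice(followers[word])
        out.append(word)
    return " ".join(out)
```

Because the follower lists keep duplicates, a word that often follows another is proportionally more likely to be chosen, which is most of what makes the output semi-coherent.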

    The end result is a bunch of text that is somewhat coherent, but still obviously generated. So let's take a look at how to expand on this, and make it more readable.

    • Fix capitalization. This is an easy one to do.
    • Maintain proper tense. If you do not doing this, then your content had easily looked like this did.
      • Either load up a dictionary of verb forms, or do your best at adding/removing the proper endings. Sometimes searching Google for your attempt can reveal the proper spelling for irregular words.
    • If you're feeling especially zesty, note the type of each word (verb, adjective, noun, etc.). Note which combinations of these normally make sense, and try to recreate those patterns.
    • Break up by paragraph as well as sentence. That way, you can weight the randomization not only by whether the word is there, but also by whether its paragraph is similar to the one you've already created. If you don't have enough data for this though, you'll end up with very, very similar lines.
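
Since the capitalization fix is the easy one, here's a minimal sketch of it in Python. The function name `fix_capitalization` is just for illustration, and it assumes you're fine with lowercasing everything first, which means proper nouns lose their capitals unless you restore them from a name list afterwards.

```python
import re

def fix_capitalization(text):
    """Lowercase everything, then capitalize sentence starts and the pronoun 'I'."""
    text = text.strip().lower()
    # Capitalize the first letter of the text and any letter that follows ., ! or ? plus whitespace.
    text = re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )
    # Restore the standalone pronoun "i", including contractions such as "i'm".
    text = re.sub(r"\bi\b", "I", text)
    return text
```

Run it over the generator output as a last pass, after all the word swapping is done.
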
  3. The Synonym Switch
    This solution is the one many people logically arrive at first. All it involves is looking up synonyms in an online thesaurus, and swapping out the current word for another one.
    HOWEVER, be warned. There are a lot of synonyms no longer in use, or rarely used. As a result, your text can come out very footprintable, and sounding as if a cross between a thug and Shakespeare wrote it. A good way to offset this effect is to search for each candidate word on Google (store this in a database, so you only have to do it once and can space out search times), and record the number of hits. Weight the algorithm deciding which word to swap in according to how many results each one got. This will help you get only the more common synonyms.
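
The hit-count weighting could look something like this in Python. Treat it as a sketch: `pick_synonym`, `synonym_switch`, and the two dictionaries are hypothetical names, and I'm assuming the thesaurus lookups and hit counts have already been fetched and cached. Words with zero recorded hits are never swapped in.

```python
import random

def pick_synonym(word, synonyms, hit_counts, rng):
    """Pick a replacement for `word`, weighted by cached search hit counts."""
    # Keep only synonyms that actually showed up in search results.
    candidates = [s for s in synonyms.get(word, []) if hit_counts.get(s, 0) > 0]
    if not candidates:
        return word                                  # nothing usable, keep the original
    weights = [hit_counts[s] for s in candidates]    # more hits means more likely
    return rng.choices(candidates, weights=weights, k=1)[0]

def synonym_switch(text, synonyms, hit_counts, rng=None):
    """Swap every word that has a usable synonym."""
    if rng is None:
        rng = random.Random()
    return " ".join(pick_synonym(w, synonyms, hit_counts, rng) for w in text.split())

# Made-up numbers purely for illustration, not real search results.
synonyms = {"big": ["large", "brobdingnagian"]}
hit_counts = {"large": 1000, "brobdingnagian": 0}
print(synonym_switch("a big house", synonyms, hit_counts))
```

With those made-up counts, the archaic option has zero hits and gets filtered out, so the only possible swap is "large".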

Combining the Processes
Combining these (errr, #2 and #3) is a pretty decent way to create unique content without the hassle of writing. However, both are quite CPU intensive, so don't say I didn't warn you!

The largest issue with writing proper text automatically is that it's hard to scale, because there's never enough source content. To supplement website scrapes, make use of all the free text out there. Project Gutenberg is a good resource (17k books with expired US copyrights, free to download). CliffsNotes. Wikipedia. Random ebooks on eMule/BitTorrent. RSS. Newspaper articles. Even scrapes of lyrics sites. There's a lot of organized data out there, and all of it can be used. Give it a go.

