Archive for the ‘Social Media’ Category

Creating compelling content in the Web 5.0 world

April 30th, 2009

Whoa, there. Web 5.0?

Okay, so I made up web 5.0. Actually, I detest the numbered generations we’ve applied to the web. The main problem I have with these terms is that they imply a linear progression. They suggest that we are going to abandon the interactive web, Web 2.0, for the semantic web, Web 3.0. Obviously we aren’t. I doubt anyone would even suggest it. Web developers will continue to use both. Hence Web 5.0 (do the maths).

I’m going to drop the term now – it was just a joke. The modern World Wide Web is, in fact, much more than just the three so-called generations – although clearly they are very important. I can identify three main concepts (not technologies) which are facilitating the current evolution of the web:

  • Interactivity (2.0)
  • Semantic understanding (3.0)
  • Commoditization (the Cloud)

Nothing groundbreaking there. And we, as users, are certainly seeing more and more of these big three in our daily use of the web.

Interactivity is fairly obvious. I think the biggest revolution in interactive content came about as Wikipedia took off. Undoubtedly the most expansive (centralized) base of knowledge the world has ever seen – and written by volunteers, members of the public. It really is a staggering collaborative achievement. Then there’s blogging, micro-blogging, social networking, professional networking, content discovery (digg, etc), pretty much anything you might want to contribute, you can.

Semantic understanding is a little trickier to see. That’s hardly surprising as it is so much newer and far less understood. Believe the hype, though. The semantic web is coming and it will change everything (everything web related, that is). If you don’t believe me try googling for “net income IBM”. You should see something like this:

[Image: Google results using RDF info]

That top result is special. It’s special because it’s the answer; it’s what you were looking for. No need to trawl through ten irrelevant pages to find the data – it’s just there. Google managed to display this data because IBM published it as part of an RDF document. Search for the same information about Amazon, who don’t publish it that way, and there’s no such luck. (That particular example was given by Ellis Mannoia in a great Web 3.0 talk at Internet World this week – so thanks Ellis.)

That leaves us with commoditization. Specifically, the commoditization of functionality from a developer’s point of view. This concept is largely, although not exclusively, linked to the Cloud. The term “the Cloud” is used broadly to describe services made available over the internet. GMail, for example, is email functionality in the cloud. Users don’t need to install anything to use GMail (bar a web client); they just use it when they want, from any computer. Many of the Cloud services out there are available as APIs, and that leads to the commoditization of functionality. Say I want to add a mapping application to my web site to show my audience where I am. A few years ago that would have been a significant amount of development work. These days it’s trivial – you just make a call to the GoogleMaps API. And so map functionality becomes a commodity.

The point of this post, however, is that these are not mutually exclusive concepts. There is no reason why you cannot combine semantic understanding with Cloud computing, or UGC, or both. Quite the opposite: combining the three should be the goal.

There are problems, however. Utilizing Cloud computing requires a certain amount of adherence to standards – fitting in to an API. And semantic understanding (and meta data, in general) takes time to accrue. In general those two constraints don’t work well with Web 2.0 functionality.

Let me give an example: if a user contributes a comment to an article, they probably won’t take the time to add the meta data required for semantic understanding to be achieved. In the same way, if they don’t give their location you can’t show them as a pin on GoogleMaps.

However, semantic understanding is (IMHO) more than just the use of RDF documents. Tools like Nstein’s Text Mining Engine can be used to create a semantic footprint describing a piece of text. I’ve talked, in previous posts, about using the data gleaned by the TME in imaginative and experimental ways. Take the example above. If a user were to post a comment about a talk they attended, the TME could extract not only the concepts of the comment but also data like the location of the subject. That semantic understanding can be used to programmatically call the GoogleMaps API to add a new pin to your map.

And there you have it. Semantic understanding of interactive content used to harness the power of Cloud computing. One of the most important benefits of the TME, for me, is the flexibility it affords you. If you know that you can get access to that kind of information it opens up all kinds of possibilities. Exploring some of these possibilities has to be the focus for making a brand stand out against the plethora of content suppliers and aggregators available; for improving the user’s experience and gaining their loyalty.

So it’s time to stop thinking about Web 2.0 or Web 3.0 and start thinking about the technology and techniques available and how they can be used to the greatest effect.

Author: chris Categories: Semantic web, Social Media

How long is a (piece of) string?

April 24th, 2009

I recently posted an article about a workflow script I cooked up for automatically tweeting about an article when it gets published via Nstein’s WCM (here). Basically, the script to which the article referred was leveraging data from Nstein’s Text Mining Engine (TME) to create concise but still descriptive tweets. As a brief reminder of that post, the script was using a computer generated summary and adding hash-tags extracted from the text to create a micro-blog like this:

I’ve made use of the TME’s concept and entity extraction features to create hash-tags. #tweet #nsteinswcm http://tinyurl.com/d3ozzn

It seems to be an idea which the industry finds interesting (judging by my Twitter account and the comments on the article). Sarah Bourne’s (@sarahebourne) offer – in particular – I could not pass up. Sarah, who is the Chief Technology Strategist for the Commonwealth of Massachusetts (@massgov), had suggested that I try my micro-blogging bot on some of the MassGov content from their Twitter stream. So I did…

Well, as one comment in the last entry (by “Rob”) alluded to, no matter how relevant my tweet is it still needs to comply with the 140 character limit set by Twitter. This seemed to be presenting some problems with the MassGov content. A big part of the problem was that the subjects of the Massachusetts articles were often political; they tend to have long sentences with complex subject matter and feature lots of relatively long words (“Massachusetts” for example). So although pertinent hash-tags and relevant teasers were being generated, sometimes these were still over the limit.

The way my bot dealt with this situation was by using progressively more aggressive truncation techniques. At the light end of the scale it might swap all occurrences of “with” for “w/”, “and” for “&”, etc. After each pass the tweet’s character count gets remeasured; if it’s still too long the next truncation technique is applied. Ultimately, if all else fails, the tweet is truncated by removing words from the end until it no longer exceeds the limit.
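The escalation described above can be sketched roughly as follows. This is a minimal illustration in Python, not the bot’s actual code: the function name fit_tweet, the split into a “teaser” and a “suffix” (hash-tags plus URL, which are never cut), the “w/” abbreviation and the habit of closing a cut teaser with a full stop are all assumptions made for the example.

```python
import re

TWEET_LIMIT = 140

# Lightest rewrites first; each pair is (pattern, shorter replacement).
# The choice of abbreviations here is an assumption for illustration.
ABBREVIATIONS = [
    (r"\bwith\b", "w/"),
    (r"\band\b", "&"),
]


def fit_tweet(teaser, suffix, limit=TWEET_LIMIT):
    """Shorten `teaser` step by step until teaser + suffix fits in `limit`."""

    def compose(text):
        return f"{text} {suffix}"

    # Nothing to do if the tweet already fits.
    if len(compose(teaser)) <= limit:
        return compose(teaser)

    # Pass 1..n: apply one abbreviation, then remeasure before going further.
    for pattern, short in ABBREVIATIONS:
        teaser = re.sub(pattern, short, teaser)
        if len(compose(teaser)) <= limit:
            return compose(teaser)

    # Last resort: drop words from the end of the teaser until it fits.
    words = teaser.split()
    while len(words) > 1:
        words.pop()
        shortened = " ".join(words).rstrip(".,;:") + "."
        if len(compose(shortened)) <= limit:
            return compose(shortened)

    # Even a one-word teaser does not fit: fall back to the suffix alone.
    return suffix[:limit]
```

Ordering the passes from least to most destructive means a tweet only loses words when the cheaper substitutions have already failed.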

Obviously, this can lead to the very problem the original post was discussing: ending up with automatically generated tweets which do not describe the article they are plugging. Now, the bot I created makes this situation far less common, no doubt – but not impossible. Adding hash-tags guarantees a level of meaning which would otherwise be impossible to achieve with an automated system, and that makes up for truncated sentences to some extent. However, I was not satisfied. Here’s an example of a tweet which was too long:

Attorney General Martha Coakley Sponsors Legislation to Enhance Victim compensation Assistance. #massachusetts #compensation http://tinyurl.com/6ht573

In fact it’s 9 characters too long. Now the bot would have truncated it to this:

Attorney General Martha Coakley Sponsors Legislation to Enhance Victim compensation. #massachusetts #compensation http://tinyurl.com/6ht573

As it turns out, that wasn’t too destructive, but I might not have been so lucky.

That tweet had given me an idea, though. The inspiration? TinyURL.

I don’t use TinyURL.com when I’m tweeting. These days who does? Twhirl (or Seesmic) is my Twitter client and when I want to shorten a URL it offers me a list of services to use. I always make the same choice: “is.gd”. The reason is pretty obvious – their domain name is 6 characters shorter.

Okay, so a bit of a no-brainer there then. Switch my bot’s shortening service to “is.gd”, save at least 6 characters per tweet. But that wasn’t really the point. I would never have used TinyURL so why had I programmed my bot to? What was I thinking?

Well the truth of the matter is this: I wasn’t. I’d used the TinyURL API before and so just stuck it into the code. So I started thinking about what else I might have done wrong. Or, more specifically, I started to think about how I tweeted (in the flesh, as it were) and if my bot was doing as good a job.

Once I started down that train of thought one big difference struck me: where possible I use inline hash-tags. If the keyword you are tagging already exists in the post then you are not adding meaning, per se. You may be emphasizing that word and you may also be starting a trend for replies and retweets. Therefore, it stands to reason that you can use the hash-tag inline and not waste space by duplicating the word.

So, having made those changes to the program I republished the MassGov article. This time my bot tweeted:
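Here is a small sketch of that inline-tagging idea, again in Python and again only an illustration: the function names are made up for the example, and it assumes the keywords arrive as plain strings. A keyword that already appears in the teaser is tagged in place (first occurrence only); anything else is appended as a trailing hash-tag.

```python
import re


def to_hashtag(keyword):
    """Lower-case the keyword and strip everything that is not a letter or digit."""
    return "#" + re.sub(r"[^a-z0-9]", "", keyword.lower())


def apply_hashtags(teaser, keywords):
    """Tag keywords inline where they occur in the teaser, append the rest."""
    trailing = []
    for keyword in keywords:
        tag = to_hashtag(keyword)
        # Match the keyword as a whole word or phrase, ignoring case, and
        # skip text that is already part of a hash-tag or a longer word.
        pattern = re.compile(
            r"(?<![#\w])" + re.escape(keyword) + r"(?!\w)", re.IGNORECASE
        )
        teaser, replaced = pattern.subn(lambda _match: tag, teaser, count=1)
        if not replaced:
            trailing.append(tag)
    return " ".join([teaser, *trailing])
```

For a keyword that is found inline, the only cost is the single “#” character, instead of a space plus the whole tag again at the end of the tweet.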

Initiative will help municipalities pursue clean #energy projects make best use of federal stimulus funds. #massachusetts http://is.gd/uggP

Much better. It actually transpires that (perhaps unsurprisingly) these inline tags occur pretty frequently in the tweets. I’ve republished a selection now; here are the tweets:

Officials “flex” highway stimulus funds to support “net zero” transit center. #transportation #greenfield http://is.gd/ugns

Attorney General Martha Coakley Sponsors Legislation to Enhance Victim #compensation Assistance. #massachusetts http://is.gd/ugtL

Patrick Administration Credits Dropout Prevention Efforts for Improvement. #student #malden http://is.gd/ugs7

#patrickadministration Receives $1 Million Grant to Support Expanded Services for People with #traumaticbraininjuries. http://is.gd/ugie

Welcome to DCR Park Server Day. #volunteer #capecod http://is.gd/ugqD

Costs to Employers ThirdLowest #oregon Survey Reports Under Patrick Administration Rates Have. #compensationrates http://is.gd/ugwn

The results there are, I think, pretty good. Out of the seven articles I’ve republished only the last one has needed to be truncated.

My bot isn’t perfect and it won’t create faultless tweets every time. It is, however, a huge improvement over the traditional blind truncation. My conclusion – from the previous post, the discussion around it and the experiments I have carried out – is that Twitter automation has too many benefits for it not to be used by online publishers but will (probably) never get it right 100% of the time. What we’ve accomplished here, so far, is a much higher and more consistent level of readability and relevancy and a much reduced need to truncate teasers. I’m sure there are many techniques I could implement to improve the results (and I may do in the future) but for now there is just one more change I’m going to make…

As I mentioned at the beginning of this article (and in the previous one) this experiment has been done using the workflow engine in Nstein’s WCM. It’s a scripted state transition engine, so when I published articles they were also passed to the Twitter-bot for it to create a tweet. The change I am going to make is this: create a new, “Needs tweeting”, workflow state. Then, in the minority of cases where the bot cannot tweet about an article without truncating the teaser, it passes the responsibility on to a human twitterer.
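The routing decision itself is simple enough to sketch. The snippet below is plain Python and does not use the WCM’s real workflow API, which isn’t shown here: the function name and the “Tweeted” state are hypothetical, and only the “Needs tweeting” state comes from the description above.

```python
TWEET_LIMIT = 140


def next_state(teaser, tags, url, limit=TWEET_LIMIT):
    """Return (state, tweet). The tweet is None when a human must write it."""
    tweet = " ".join([teaser, *tags, url])
    if len(tweet) <= limit:
        # Fits untouched: the bot can post it as-is.
        return "Tweeted", tweet
    # Would need truncating: hand the article over to a human twitterer.
    return "Needs tweeting", None
```

The point of the design is that the bot never posts a damaged teaser; it either posts a complete one or asks for help.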

There are a huge (really, really huge) number of things that we can accomplish with the TME. Some of the key ones, like SEO, have already been taken to very high standards, but we are only scratching the surface of possible uses. Ideas and experiments, such as this one, are key to our industry’s growth. From my point of view accomplishing automation in 85% of cases and a high level of quality in 100% would be a fantastic accomplishment. Let’s face it: in this day and age information has been commoditized so quality becomes the only differentiator between publishers. Quality is what attracts an audience and certainly what keeps them… even on Twitter.

Author: chris Categories: CMS, Twitter

Asimov’s 4th law: A robot will not tweet.

April 22nd, 2009

Well, that might be a bit extreme. At least if they do they should put in a bit more effort.

Perhaps I need to explain my problem here. The complaint I have concerns automatic tweets – popular with bloggers and online publishers in general. Extremely impersonal, often unhelpful snippets drawing the audience’s attention to a new article or blog entry. Here’s an example:

[news] Pepsi drinkers join the dots: Anyone buying a Pepsi Max soft drink over the next few w.. http://tinyurl.com/5qu3w3

- @guardianmedia

Ok, so it’s pretty obvious what’s wrong with this tweet. The article the Guardian Media is trying to promote is about a campaign by Pepsi which uses QR codes on the side of their cans – not that you’d have known from the tweet.

The problem is they’ve used a witty headline, not a descriptive one. In itself that is fine. Like many online publishers, however, the Guardian have opted against manually tweeting and have integrated (presumably) their CMS with Twitter. More specifically, the tweet is a concatenation of the article’s title and the beginning of the text. It just so happens that neither of those blocks of text mentions QR codes.

There is a lot to be said for automation, though. It’s not just that this system saves the author of the article or blog time. It also ensures consistency – all articles get posted. And, to be fair, most of the time these posts are okay…

…not always though. Personally, I’ve stopped following the Guardian Media on Twitter (and Scientific American) because these badly formed tweets annoy me way too much. Take the article above, for example. A human author might tweet something like this:

Pepsi launch campaign using QR codes on cans. Drinkers get access to secret content through phone browser.

That sums up the article much better, with 33 characters spare for the URL. I’d be far more likely to read the article having read that tweet, as I think QR codes are interesting (I’m a bit of a geek) and appreciate imaginative marketing.

So what’s the answer? Is there a way to achieve the normalization and efficiency of an automated system while being a good Twitterer? Well yes, I think there is.

I’ve been playing with the workflow engine in Nstein’s WCM and have written a nifty little Twitter-bot. Its secret is its ability to understand content. Nstein also produce a text mining engine (TME) which is ingrained into the WCM right down to the core. This means that semantic data about an article is always easily accessible. I’ve used this automatically extracted meta data in two ways for my bot.

Firstly, I’ve made use of the TME’s concept and entity extraction features to create hash-tags. For those who don’t know, a hash-tag is a piece of meta-data associated with a tweet. They are prefixed with a hash (#) character and are generally alphanumeric. A lot of automated tweets now use hash-tags with varying degrees of success. @northamptonrfc (the rugby team I support), for example, tags all tweets with “#rugby”. Well I never. The correct use of hash-tags (IMHO) is to:

  1. Add relevant meta data to a tweet which adds meaning.
  2. Create a trend to follow (essentially a thread across all Twitter users).

In order to meet those criteria the tag needs to be meaningful. It stands to reason. In the Pepsi example above two tags spring to mind: “#pepsi” and “#qrcode”. Including 2 spaces that makes an extra 15 characters which can (relatively) easily be fitted in before the TinyURL. Nstein’s TME would, undoubtedly, have picked these concepts out.

“QR Code” is what the TME refers to as a complex concept, that is, a phrase. “Pepsi” is an entity, specifically an organisation name. A simple regex can transform these strings into hash-tags. Using this technique the bot immediately adds a great deal of meaning to the tweet.
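For anyone curious, a regex along these lines would do the job. This is a guess at the sort of transformation involved rather than the bot’s actual pattern, and the function name is invented for the example: lower-case the string, strip anything that is not a letter or a digit, and prefix a hash.

```python
import re


def to_hashtag(term):
    """Turn an extracted concept or entity name into a hash-tag."""
    return "#" + re.sub(r"[^a-z0-9]", "", term.lower())


# The two terms from the Pepsi example.
tags = [to_hashtag("Pepsi"), to_hashtag("QR Code")]
extra_characters = sum(len(tag) + 1 for tag in tags)  # +1 for each leading space
```

Here tags comes out as “#pepsi” and “#qrcode”, and extra_characters is the 15 characters mentioned above.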

The second way in which I’ve leveraged the meta data extracted by the TME is using NSummarizer. This cartridge takes a document, splits it into sentence components, rates each component on its relevance to the article and returns the best scoring one(s) as a brief summary of the document. This is a really useful tool for getting around the issue of having a first sentence which is not (particularly) descriptive of the article as a whole.
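To give a feel for that split, score and pick pipeline, here is a toy stand-in in Python. It is emphatically not NSummarizer’s algorithm, which isn’t described here: the relevance score below (average document-wide frequency of a sentence’s non-stopwords), the tiny stopword list and the function name are all assumptions chosen to keep the sketch short.

```python
import re
from collections import Counter

# A deliberately tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def content_words(sentence):
    """Lower-cased words of a sentence, minus stopwords."""
    return [w for w in re.findall(r"[a-z0-9']+", sentence.lower()) if w not in STOPWORDS]


def summarize(text, max_sentences=1):
    """Return the best-scoring sentence(s) as a brief summary of the text."""
    # 1. Split the document into sentences.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # 2. Count how often each content word occurs across the whole document.
    frequency = Counter(w for s in sentences for w in content_words(s))

    # 3. Rate a sentence by the average frequency of its content words.
    def score(sentence):
        words = content_words(sentence)
        return sum(frequency[w] for w in words) / len(words) if words else 0.0

    # 4. Return the highest-rated sentence(s).
    return sorted(sentences, key=score, reverse=True)[:max_sentences]
```

Even this crude scoring tends to skip a witty opening line in favour of a sentence that repeats the article’s main terms, which is exactly the property the bot needs.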

So, does it work? Well, I’ve used this blog as a test. Here’s the resultant tweet:

I’ve made use of the TME’s concept and entity extraction features to create hash-tags. #tweet #nsteinswcm http://tinyurl.com/d3ozzn

Personally, I count that as a success.

Author: chris Categories: CMS, Twitter