Archive for the ‘Software’ Category

Publishing from a Content Hub

March 30th, 2010
Web CMS

Working as part of a sales team, one of the questions that I’m asked again and again – by my management as well as the Marketing department – is “who are your biggest competitors?” For a Web content management system or text analytic tool (Nstein’s WCM and TME respectively), that’s a fairly easy question to answer. In the DAM space, however, because of Nstein’s particular focus upon the Publishing industry the answer is less clear.

[Figure: A simplified example of a publishing workflow.]

[Figure: Content Hub workflow. With assets stored in a central repository, all systems and processes have direct access to them.]

In fact, over the last couple of years Nstein has been positioning its DAM offering as a strategic centre-point for publishing workflows – Content Hub seems to be the prevailing (if slightly uninspired) label for this kind of system. Essentially, a Content Hub is a DAM with integration points so that all assets which come into the wider system (the company, publication, etc.) are ingested straight into it; all content which is created internally is written directly into it; and then, all systems which utilize, display, edit or distribute content do so from the Hub directly. This is not a new model – it is sometimes referred to as a single version of the truth – but it often represents significant change and significant challenges in environments which have naturally developed around a (fairly) linear workflow. Magazines, in particular, as well as any breaking news publications, tend to have an A-to-B style workflow which involves filtering incoming media, bringing it together as a publication of some description and then publishing it out. By repositioning the processes and applications along such a workflow around a central Hub, dependencies and bottlenecks are broken down and assets, and access to them, become standardized. As a consequence of this shift, efficiency improves, asset re-use is encouraged and assets, their rights and usage information are better tracked. And by creating packages of content, independent of both source and output channel, features can be efficiently published on multiple channels (such as Print and Web) and new properties can be created cheaply with lower risk.

So, coming back to the original question, the DAM space doesn’t present that many competitors for Nstein (although there are, of course, a few) as few DAM systems have the out-of-the-box capabilities required by the vertical – handling extended metadata, transforming images, re-encoding video, printing contact sheets, managing page content, &c. In fact, the biggest competition in these cases comes squarely from Print Editorial System vendors who would, like us, endorse a Content Hub approach except with their CMS at the centre of the publishing universe.

In some ways both sets of vendors – DAM and Editorial System – are using the same arguments. One version of the truth, certainly. Single workflow and security. To some extent the multiple-channel publishing argument would also be used by both; certainly most Print Editorial Systems come with some option to publish a Web site as well.

These two approaches to the same Content Hub strategy raise a couple of key questions: what is the difference between the two solutions and how do those differences affect the buyer?

The former question is the simplest to answer: a DAM-based Hub disassociates itself from the editing and creation of products, whereas an Editorial System is strongly tied in to the production process. Take the creation of a newspaper, for example. The collaborative effort needed to construct a modern edition in an efficient and reliable manner relies heavily upon Editorial Systems to manage the agglomeration of the content and design in real time. The question is: should that System be the hub or a spoke?

How do these differences affect the buyer? What are the relative merits of the approaches? These questions are the ones which are being debated and rely upon strategic visions that the publisher may just not share. However, from my point of view, here are the main points.

On the plus side for the Editorial Systems, as they are so connected to the production process, they can offer advanced and specific functionality, tying in closely with DTP tools and offering collaborative working features which a DAM cannot compete with.

That strength, however, is also the biggest weakness for the Editorial Systems. By abstracting themselves from the production process the DAMs become far more agile. We can look at a fairly simple example of this in publishing the same content to both print and the Web, a process which should, by now, be a commodity. At its simplest this task should work smoothly in any Print Editorial System; text and images from a print feature are transformed into Web pages and published online. What happens, though, when other media are introduced? Most Print Editorial Systems that I have seen struggle to (or cannot) display and edit video. Maybe they can store it, but the advanced features available for print content are gone, as are many simple features such as previewing and usage tracking. Now in many cases, the Print Editorial System may be coupled with a Web CMS (potentially from the same vendor) which does feature better handling of video, but in that scenario there are now two production points. That means compromised security, more staff training, more convoluted audit trails. Then add audio, Flash, or any other format of content that the publisher may use – online or elsewhere – and the problem is magnified.

One solution for the Editorial Systems would be to develop the extra functionality required to handle these formats to the same level as the print content which they are familiar with. The obvious problem with that is the effort and resources required to build and maintain such a suite. So by steering clear of the production process the DAM-based systems can handle content in a channel-agnostic fashion.

Particularly when one looks at the creativity in digital media these days, the strength of agility should be clear. There are the obvious ones: Facebook apps, QR codes, iPad channels, etc. There are also some less well adopted mediums.

In October 2008 Hearst released a special edition Esquire (sponsored by Ford) featuring an e-ink, animated front-cover. Bauer last week released an issue of Grazia featuring Florence (and the Machine) dancing in an augmented reality world activated by pointing your webcam/iPhone at the cover. This was pretty disappointing in comparison with many other AR examples (such as the great GE ones) because the real page was not displayed – more on that in a future post. While neither of those examples was particularly well implemented, they definitely show signs of what could become mainstream technologies in the future. Adding the functionality to manage the production of publications including these kinds of technologies into Editorial Systems is a far-fetched prospect. Not only is the investment significant and the road to maturity slow, but if a technology ultimately fails to gain mainstream adoption the investment becomes a wasted one. For that reason companies that rely upon an Editorial System at the core of their business have to wait until new technologies reach general acceptance to embrace them, and lose the ability to stay ahead of the curve – at least without excessive risk. In those cases, as with more mundane ones, the channel-agnostic and content-agnostic DAM systems project their flexibility directly on to the publications which use them.

That’s not to say that there are not downsides to using the DAM as the Hub. In particular, collaborative working cannot be handled to the depth that the Editorial Systems manage without their level of detail and understanding of the specifics. And in both cases there are overlaps in functionality; most Editorial Systems have some kind of repository, for example, and many top tier DAM systems integrate well with DTP tools.

Inevitably, those two questions drive towards the ultimate conclusion of the debate: “Which would make a better Content Hub, an Editorial System or a DAM?” I won’t attempt to answer that directly, as I’m obviously biased towards the solution I sell and know the most about, but I will encourage debate from those who have an opinion…

Open Source v traditional Software (ding, ding, ding)

May 10th, 2009

At the tail end of last month I spent two days attending talks at the yearly Internet World exhibition. I always enjoy listening to speakers and the quality was, by and large, very good. On the final day CMS Watch (@cmswatch) hosted a panel discussion in the Content Management theatre entitled: “Open Source v Traditional Software”. It was a strange title, I thought, as the line, for many vendors, between open and closed source becomes more and more vague. This blending was, however, represented in the panel, which included Stephen Morgan (@stephen_morgan) of Squiz – a commercial open source vendor.

On the whole the panel was very good and the debate interesting. The open source contingent argued eloquently for the benefits of spreading knowledge throughout the community and of the response times to bug fixes compared with the release cycles of proprietary software. One of Stephen’s responses when asked for reasons to go with an open source system, however, struck me as – at best – ill-conceived.

Stephen had argued that as a customer of a closed source software retailer you fall, entirely, to their mercy in terms of functional changes. The assertion was that when you – as a customer – have access to source code you can modify it to suit your needs. Conversely, he claimed that changes to a closed source solution could only be requested, may never happen and would be subject to a lengthy release cycle even if they were implemented.

Now I’m sorry but that is just not the case, as I told the panel once the discussion was opened to the audience. The software I work with, Nstein’s WCM, features an expansive and well-designed extension framework to do just what Stephen was referring to. In fact, I went further and put the polemic to the panel that hacking core source code is obviously not desirable and severely hinders an application’s upgrade path. Stephen countered with the fact that changes made to the code-base can be submitted to Squiz (or almost any other open source software maintainer, for that matter) and may be committed into the core application.

Before I start a holy war here (and a succession of flames in this site’s comments) I would like to state my position on open source: I love it. I love the concept. I love free software. I love the freedom to modify and distribute software. Basically, I get it. I’m a huge fan of Linux and at the end of the day a PHP programmer. Just yesterday, I spent my Saturday contributing PHPTs (that’s PHP tests, for non-geeks) with the PHP London user group. I really do dig open source. Also, for the record, I thought Stephen Morgan represented his brand and community very well and I enjoyed his commentary; this is not meant to be a personal attack ;-) .

In fact, this post is not criticizing open source software at all. The discussion here, as far as I am concerned, is about best practices. Okay, sure, one can modify the source code to an open source project and that change may be incorporated into the software. May be incorporated; probably won’t be. And with closed source software that option is not available – you have less choice. But that is, I think, a good thing.

At least the prelude to a good thing. Software evolves, like all technology, and the beautiful simplicity of Darwinian evolution applies. It’s survival of the fittest. If we, at Nstein, were to compete with open source CMS projects with a solution which was not customisable, which had no mechanism for modification, we would have died out. The fact is we make a vast amount of customisation possible – we’ve had to. Because we don’t encourage customers to delve into the core source code (it’s a PHP app so they can if they really want) we’ve had to employ other methods. Extensible object models built around best practices derived from industry experience. Plug-in frameworks. Generic extension frameworks. If one of our customers cannot extend or change something that they need to, the chances are that another client will at some point want that same, absent flexibility. So, through good design practices we have constructed a system which clients can (and do) modify, yet when they decide to upgrade to the next point release it is a trivial process.

Now, I’m not saying that open source software is poorly designed. I’m writing this piece now on WordPress – a fantastic example of an open source project – which features an extremely rich and well documented plug-in framework. The sheer number of plug-ins and themes available for WordPress is a testament to the system. And, as with Nstein’s software, when I upgrade WordPress all of my extensions still work (at least 95%, or more, of the time).

I doubt anyone would disagree with the merits of a plug-in based system. My interest, however, is in this question: how much of a temptation is there to hack open source software? I know I’ve done it in the past. I’ve heard a number of times that Drupal upgrades are nigh on impossible due to the nature of the inevitable customisations a Web content management system requires. I’m not in a position to answer that question authoritatively, and I won’t attempt to. I would like to stir the debate up though. So, thoughts, please….

Author: chris Categories: CMS, Open Source

How long is a (piece of) string?

April 24th, 2009

I recently posted an article about a workflow script I cooked up for automatically tweeting about an article when it gets published via Nstein’s WCM (here). Basically, the script to which the article referred was leveraging data from Nstein’s Text Mining Engine (TME) to create concise but still descriptive tweets. As a brief reminder of that post, the script was using a computer generated summary and adding hash-tags extracted from the text to create a micro-blog like this:

I’ve made use of the TME’s concept and entity extraction features to create hash-tags. #tweet #nsteinswcm http://tinyurl.com/d3ozzn

It seems to be an idea which the industry finds interesting (judging by my Twitter account and the comments on the article). Sarah Bourne’s (@sarahebourne) offer – in particular – I could not pass up. Sarah, who is the Chief Technology Strategist for the Commonwealth of Massachusetts (@massgov), had suggested that I try my micro-blogging bot on some of the MassGov content from their Twitter stream. So I did…

Well, as one comment in the last entry (by “Rob”) alluded to, no matter how relevant my tweet is it still needs to comply with the 140 character limit set by Twitter. This seemed to be presenting some problems with the MassGov content. A big part of the problem was that the subjects of the Massachusetts articles were often political; they tend to have long sentences with complex subject matter and feature lots of relatively long words (“Massachusetts” for example). So although pertinent hash-tags and relevant teasers were being generated, sometimes these were still over the limit.

The way my bot dealt with this situation was by using progressively more aggressive truncation techniques. At the light end of the scale it might swap all occurrences of “with” for “w/”, “and” for “&”, etc. After each pass the tweet’s character count gets remeasured; if it’s still too long the next truncation technique is applied. Ultimately, if all else fails, the tweet is truncated by removing words from the end until it no longer exceeds the limit.
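For the curious, here is a minimal Python sketch of that loop. It is an illustration only, not the bot's actual code (which lived in a WCM workflow script): the function name `shorten`, the `ABBREVIATION_PASSES` table and the second, harsher pass are all assumptions made up for the example.

```python
import re

TWEET_LIMIT = 140  # Twitter's limit at the time of writing

# Hypothetical abbreviation passes, mildest first.
ABBREVIATION_PASSES = [
    [(r"\bwith\b", "w/"), (r"\band\b", "&")],
    [(r"\bpeople\b", "ppl"), (r"\bbecause\b", "b/c")],
]


def shorten(teaser: str, suffix: str, limit: int = TWEET_LIMIT) -> str:
    """Shorten `teaser` until "teaser suffix" fits within `limit` characters.

    `suffix` holds whatever must survive intact (hash-tags and the URL).
    """
    def assemble(text: str) -> str:
        return f"{text} {suffix}".strip()

    # Apply one pass at a time, re-measuring before each.
    for substitutions in ABBREVIATION_PASSES:
        if len(assemble(teaser)) <= limit:
            break
        for pattern, replacement in substitutions:
            teaser = re.sub(pattern, replacement, teaser, flags=re.IGNORECASE)

    # Last resort: drop words from the end of the teaser.
    words = teaser.split()
    while words and len(assemble(" ".join(words))) > limit:
        words.pop()
    return assemble(" ".join(words))
```

Calling `shorten(teaser, "#massachusetts http://is.gd/ugtL")` leaves a teaser that already fits untouched, and only falls back to chopping words when the abbreviation passes are not enough.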

Obviously, this can lead to the very problem the original post was discussing: ending up with automatically generated tweets which do not describe the article they are plugging. Now, the bot I created makes this situation far less common, no doubt – but not impossible. Adding hash-tags guarantees a level of meaning which would otherwise be impossible to achieve with an automated system, and that makes up for truncated sentences to some extent. However, I was not satisfied. Here’s an example of a tweet which was too long:

Attorney General Martha Coakley Sponsors Legislation to Enhance Victim compensation Assistance. #massachusetts #compensation http://tinyurl.com/6ht573

In fact it’s 9 characters too long. Now the bot would have truncated it to this:

Attorney General Martha Coakley Sponsors Legislation to Enhance Victim compensation. #massachusetts #compensation http://tinyurl.com/6ht573

As it turns out, that wasn’t too destructive but I may not have been so lucky.

That tweet had given me an idea, though. The inspiration? TinyURL.

I don’t use TinyURL.com when I’m tweeting. These days who does? Twhirl (or Seesmic) is my Twitter client and when I want to shorten a URL it offers me a list of services to use. I always make the same choice: “is.gd”. The reason is pretty obvious – their domain name is 6 characters shorter.

Okay, so a bit of a no-brainer there then. Switch my bot’s shortening service to “is.gd”, save at least 6 characters per tweet. But that wasn’t really the point. I would never have used TinyURL so why had I programmed my bot to? What was I thinking?

Well the truth of the matter is this: I wasn’t. I’d used the TinyURL API before and so just stuck it into the code. So I started thinking about what else I might have done wrong. Or, more specifically, I started to think about how I tweeted (in the flesh, as it were) and if my bot was doing as good a job.

Once I started down that train of thought one big difference struck me: where possible I use inline hash-tags. If the keyword you are tagging already exists in the post then you are not adding meaning, per se. You may be emphasizing that word and you may also be starting a trend for replies and retweets. Therefore, it stands to reason that you can use the hash-tag inline and not waste space by duplicating the word.
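A rough Python sketch of the inline tagging idea follows. The function name `tag_inline` and the matching rules are assumptions for illustration; the real bot worked from TME output inside the WCM, which is not shown here.

```python
import re
from typing import Iterable


def tag_inline(teaser: str, keywords: Iterable[str]) -> str:
    """Tag each keyword where it already occurs in the teaser.

    Keywords that do not occur in the text are appended as trailing tags.
    """
    trailing = []
    for keyword in keywords:
        # Hash-tags are lower-case alphanumerics only: "QR Code" -> "#qrcode".
        tag = "#" + re.sub(r"[^0-9a-z]", "", keyword.lower())
        # Whole-phrase match, ignoring case, skipping anything already tagged.
        pattern = re.compile(
            r"(?<![#\w])" + re.escape(keyword) + r"(?!\w)", re.IGNORECASE
        )
        if pattern.search(teaser):
            teaser = pattern.sub(tag, teaser, count=1)
        else:
            trailing.append(tag)
    return " ".join([teaser] + trailing)
```

Each inline hit saves the length of the keyword plus a space, which on a 140 character budget is often the difference between truncating and not.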

So, having made those changes to the program I republished the MassGov article. This time my bot tweeted:

Initiative will help municipalities pursue clean #energy projects make best use of federal stimulus funds. #massachusetts http://is.gd/uggP

Much better. It actually transpires that (perhaps unsurprisingly) these inline tags occur pretty frequently in the tweets. I’ve republished a selection now; here are the tweets:

Officials “flex” highway stimulus funds to support “net zero” transit center. #transportation #greenfield http://is.gd/ugns

Attorney General Martha Coakley Sponsors Legislation to Enhance Victim #compensation Assistance. #massachusetts http://is.gd/ugtL

Patrick Administration Credits Dropout Prevention Efforts for Improvement. #student #malden http://is.gd/ugs7

#patrickadministration Receives $1 Million Grant to Support Expanded Services for People with #traumaticbraininjuries. http://is.gd/ugie

Welcome to DCR Park Server Day. #volunteer #capecod http://is.gd/ugqD

Costs to Employers ThirdLowest #oregon Survey Reports Under Patrick Administration Rates Have. #compensationrates http://is.gd/ugwn

The results there are, I think, pretty good. Out of the seven articles I’ve republished only the last one has needed to be truncated.

My bot isn’t perfect and it won’t create faultless tweets every time; however, it is a huge improvement over the traditional blind truncation. My conclusion – from the previous post, the discussion around it and the experiments I have carried out – is that Twitter automation has too many benefits for it not to be used by online publishers but will (probably) never be perfect 100% of the time. What we’ve accomplished here, so far, is a much higher and more consistent level of readability and relevancy and a much reduced frequency of the need to truncate teasers. I’m sure there are many techniques I could implement to improve the results (and I may do in the future) but for now there is just one more change I’m going to make…

As I mentioned at the beginning of this article (and in the previous one) this experiment has been done using the workflow engine in Nstein’s WCM. It’s a scripted state transition engine, so when I published articles they were also passed to the Twitter-bot for it to create a tweet. The change I am going to make is this: create a new, “Needs tweeting”, workflow state. Then in the minority of cases where the bot cannot tweet about an article without truncating the teaser it passes the responsibility on to a human twitterer.
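In outline, that routing decision might look like the Python sketch below. To be clear, this is not the WCM workflow engine's scripting interface, which isn't shown in this post; the function name `route_article` and the "Tweeted" state name are assumptions, and only "Needs tweeting" comes from the text above.

```python
from typing import List, Optional, Tuple

TWEET_LIMIT = 140


def route_article(
    teaser: str, tags: List[str], url: str, limit: int = TWEET_LIMIT
) -> Tuple[str, Optional[str]]:
    """Return (next workflow state, tweet text or None).

    If the untruncated tweet fits, the bot posts it; otherwise the
    article is parked for a human to write the tweet.
    """
    tweet = " ".join([teaser, *tags, url])
    if len(tweet) <= limit:
        return "Tweeted", tweet
    return "Needs tweeting", None
```

The point of the design is that the bot never publishes a teaser it had to cut short; the awkward minority goes to a person instead.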

There are a huge (really, really huge) number of things that we can accomplish with the TME. Some of the key ones, like SEO, have already been taken to very high standards, but we are only scratching the surface of possible uses. Ideas and experiments, such as this one, are key to our industry’s growth. From my point of view accomplishing automation in 85% of cases and a high level of quality in 100% would be a fantastic accomplishment. Let’s face it: in this day and age information has been commoditized so quality becomes the only differentiator between publishers. Quality is what attracts an audience and certainly what keeps them… even on Twitter.

Author: chris Categories: CMS, Twitter

Asimov’s 4th law: A robot will not tweet.

April 22nd, 2009

Well, that might be a bit extreme. At least if they do they should put in a bit more effort.

Perhaps I need to explain my problem here. The complaint I have concerns automatic tweets – popular with bloggers and online publishers in general. Extremely impersonal, often unhelpful snippets drawing the audience’s attention to a new article or blog entry. Here’s an example:

[news] Pepsi drinkers join the dots: Anyone buying a Pepsi Max soft drink over the next few w.. http://tinyurl.com/5qu3w3

- @guardianmedia

Ok, so it’s pretty obvious what’s wrong with this tweet. The article the Guardian Media is trying to promote is about a campaign by Pepsi which uses QR codes on the side of their cans – not that you’d have known from the tweet.

The problem is they’ve used a witty headline, not a descriptive one. In itself that is fine. Like many online publishers, however, the Guardian have opted against manually tweeting and have integrated (presumably) their CMS with Twitter. More specifically, the tweet is a concatenation of the article’s title and the beginning of the text. It just so happens that neither of those blocks of text mentions QR codes.
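To make the failure mode concrete, here is a small Python sketch of that kind of blind concatenation. It is a guess at the general approach, not the Guardian's actual integration; the function name `naive_tweet`, the `[news]` prefix handling and the placeholder article text are all invented for the example.

```python
def naive_tweet(title: str, body: str, url: str,
                limit: int = 140, prefix: str = "[news] ") -> str:
    """Concatenate title and opening text, then cut blindly to fit."""
    text = f"{prefix}{title}: {body}"
    room = limit - len(url) - 1  # leave one space before the URL
    if len(text) > room:
        text = text[: room - 2] + ".."  # chop mid-word if need be
    return f"{text} {url}"


# Placeholder article: the real subject only turns up late in the text.
title = "A witty headline"
body = "Scene-setting opening sentence. " * 5 + "The actual subject is QR codes."
tweet = naive_tweet(title, body, "http://tinyurl.com/5qu3w3")
print(tweet)
print("QR" in tweet)  # the subject never makes it into the tweet
```

Whatever sits in the first hundred or so characters is all the reader gets, whether or not it says what the article is about.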

There is a lot to be said for automation, though. It’s not just that this system saves the author of the article or blog time. It also ensures consistency – all articles get posted. And, to be fair, most of the time these posts are okay…

…not always though. Personally, I’ve stopped following the Guardian Media on twitter (and Scientific American) because these badly formed tweets annoy me way too much. Take the article above, for example. A human author might tweet something like this:

Pepsi launch campaign using QR codes on cans. Drinkers get access to secret content through phone browser.

That sums up the article much better, with 33 characters spare for the URL. I’d be far more likely to read the article having read that tweet, as I think QR codes are interesting (I’m a bit of a geek) and appreciate imaginative marketing.

So what’s the answer? Is there a way to achieve the normalization and efficiency of an automated system while being a good Twitterer? Well yes, I think there is.

I’ve been playing with the workflow engine in Nstein’s WCM and have written a nifty little Twitter-bot. Its secret is its ability to understand content. Nstein also produce a text mining engine (TME) which is ingrained into the WCM right down to the core. This means that semantic data about an article is always easily accessible. I’ve used this automatically extracted meta data in two ways for my bot.

Firstly, I’ve made use of the TME’s concept and entity extraction features to create hash-tags. For those who don’t know, a hash-tag is a piece of meta-data associated with a tweet. They are prefixed with a hash (#) character and generally are alphanumeric. A lot of automated tweets now use hash-tags with varying degrees of success. @northamptonrfc (the rugby team I support), for example, tags all tweets with “#rugby”. Well I never. The correct use of hash-tags (IMHO) is to:

  1. Add relevant meta data to a tweet which adds meaning.
  2. Create a trend to follow (essentially a thread across all Twitter users).

In order to meet those criteria the tag needs to be meaningful. It stands to reason. In the Pepsi example above two tags spring to mind: “#pepsi” and “#qrcode”. Including 2 spaces that makes an extra 15 characters which can (relatively) easily be fitted in before the TinyURL. Nstein’s TME would, undoubtedly, have picked these concepts out.

“QR Code” is what the TME refers to as a complex concept, that is, a phrase. “Pepsi” is an entity, specifically an organisation name. A simple regex can transform these strings into hash-tags. Using this technique the bot immediately adds a great deal of meaning to the tweet.
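Something along these lines would do it. This is a Python sketch rather than the bot's own code, and the function name `to_hashtag` is made up for the example:

```python
import re


def to_hashtag(term: str) -> str:
    """Lower-case the term and strip everything that isn't a letter or digit."""
    return "#" + re.sub(r"[^0-9a-z]+", "", term.lower())


print(to_hashtag("QR Code"))  # #qrcode
print(to_hashtag("Pepsi"))    # #pepsi
```

Spaces and punctuation simply disappear, so a multi-word concept collapses into a single tag.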

The second way in which I’ve leveraged the meta data extracted by the TME is using NSummarizer. This cartridge takes a document, splits it into sentence components, rates each component on its relevance to the article and returns the best scoring one(s) as a brief summary of the document. This is a really useful tool for getting around the issue of having a first sentence which is not (particularly) descriptive of the article as a whole.
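As a toy illustration of that split, score and pick shape, here is a Python sketch. It is emphatically not NSummarizer's algorithm, which isn't described here; plain word frequency stands in for the relevance score, and the function name `summarize` and the stop-word list are assumptions.

```python
import re
from collections import Counter
from typing import List

# A deliberately tiny stop-word list, just for the example.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "on",
             "is", "it", "for", "with", "that", "this"}


def _words(text: str) -> List[str]:
    return [w for w in re.findall(r"[a-z0-9']+", text.lower())
            if w not in STOPWORDS]


def summarize(document: str, count: int = 1) -> List[str]:
    """Return the `count` sentences that best represent the document."""
    # 1. Split into sentence components.
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    # 2. Rate each sentence: average document-wide frequency of its words.
    frequency = Counter(_words(document))

    def score(sentence: str) -> float:
        tokens = _words(sentence)
        return sum(frequency[t] for t in tokens) / len(tokens) if tokens else 0.0

    # 3. Return the best-scoring one(s).
    return sorted(sentences, key=score, reverse=True)[:count]
```

Even this crude version will skip a scene-setting first sentence in favour of one that uses the article's recurring terms, which is the behaviour the teaser needs.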

So, does it work? Well I’ve used this blog as a test, here’s the resultant tweet:

I’ve made use of the TME’s concept and entity extraction features to create hash-tags. #tweet #nsteinswcm http://tinyurl.com/d3ozzn

Personally, I count that as a success.

Author: chris Categories: CMS, Twitter