Journal tags: wikipedia

Play me off

One of the fun fringe events at Build in Belfast was The Standardistas’ Open Book Exam:

Unlike the typical quiz, the Open Book Exam demands the use of iPhones, iPads, Androids—even Zunes—to avail of the internet’s wealth of knowledge, required to answer many of the formidable questions.

Team Clearleft came joint third. Initially it was joint fourth but an obstreperous Andy Budd challenged the scoring.

Now one of the principles of this unusual pub quiz was that cheating was encouraged. Hence the encouragement to use internet-enabled devices to get to Google and Wikipedia as quickly as the network would allow. In that spirit, Andy suggested a strategy of “running interference.”

So while others on the team were taking information from the web, I created a Wikipedia account to add misinformation to the web.

Again, let me stress, this was entirely Andy’s idea.

The town of Clover, South Carolina ceased being twinned with Larne and became twinned with Belfast instead.

The world’s largest roller coaster became 465 feet tall instead of its previous 456 feet (requiring a corresponding change to a list page).

But the moment I changed the entry for Keyboard Cat to alter its real name from “Fatso” to “Freddy” …BAM! Instant revert.

You can mess with geography. You can mess with measurements. But you do. Not. Mess. With. Keyboard Cat.

For some good clean Wikipedia fun, you can always try wiki racing:

To Wikirace, first select a page off the top of your head. Using “Random page” works well, as well as the featured article of the day. This will be your beginning page. Next choose a destination page. Generally, this destination page is something very unrelated to the beginning page. For example, going from apple to orange would not be challenging, as you would simply start at the apple page, click a wikilink to fruit and then proceed to orange. A race from Jesus Christ to Subway (restaurant) would be more of a challenge, however. For a true test of skill, attempt Roman Colosseum to Orthographic projection.

Then there’s the simple pleasure of getting to Philosophy:

Some Wikipedia readers have observed that clicking on the first link in the main text of a Wikipedia article, and then repeating the process for subsequent articles, usually eventually gets you to the Philosophy article.

Seriously. Try it.
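Under the hood, the “getting to Philosophy” game is just a repeated first-link crawl. Here’s a rough sketch of the link-extraction step using only Python’s standard library. The HTML sample is made up, and the rules are simplified: real players also skip italicised links and infoboxes, not just links inside parentheses.

```python
from html.parser import HTMLParser

class FirstLinkFinder(HTMLParser):
    """Find the first internal wiki link in article HTML, skipping
    links that appear inside parentheses (a simplified version of
    the usual 'first link' rules)."""
    def __init__(self):
        super().__init__()
        self.paren_depth = 0
        self.first_href = None

    def handle_data(self, data):
        # Track parenthesis nesting in the surrounding text.
        self.paren_depth += data.count("(") - data.count(")")

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.first_href is None and self.paren_depth <= 0:
            href = dict(attrs).get("href", "")
            if href.startswith("/wiki/"):
                self.first_href = href

def first_wiki_link(article_html):
    parser = FirstLinkFinder()
    parser.feed(article_html)
    return parser.first_href

sample = ('<p>An <b>apple</b> (from <a href="/wiki/Old_English">Old English</a>) '
          'is a fruit of the <a href="/wiki/Tree">tree</a> Malus domestica.</p>')
print(first_wiki_link(sample))  # /wiki/Tree
```

Feed the result back in as the next page to fetch and repeat; most chains converge on Philosophy.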

Using socially-authored content to provide new routes through existing content archives

Rob Lee is talking about making the most of user-authored (or user-generated) content. In other words, content written by you, Time’s person of the year.

Wikipedia is the poster child. It’s got lots of WWILFing: What Was I Looking For? (as illustrated by XKCD). Here’s a graph entitled Mapping the distraction that is Wikipedia, generated from a Greasemonkey script that tracks link paths.

Rob works for Rattle Research who were commissioned by the BBC Innovation Labs to do some research into bringing WWILFing to the BBC archive.

Grab the first ten internal links from any Wikipedia article and you will get ten terms that really define that subject matter. The external links at the end of an article provide interesting departure points. How could this be harnessed for BBC news articles? Categories are a bit flat. Semantic analysis is better but it takes a lot of time and resources to generate that for something as large as the BBC archives. Yahoo’s Term Extractor API is a handy shortcut. The terms extracted by the API can be related to pages on Wikipedia.
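Yahoo’s Term Extractor API did the heavy lifting here, but the basic idea is easy to sketch. This toy stand-in just counts word frequency after stopword filtering; a real extractor returns multi-word noun phrases, and the story text below is invented for illustration:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are",
             "for", "on", "that", "with", "as", "it", "by", "was", "said"}

def extract_terms(text, n=5):
    """Naive stand-in for a term-extraction API: return the n most
    frequent non-stopword tokens in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [term for term, _ in counts.most_common(n)]

story = ("Sales of organic food in the UK rose sharply, with organic "
         "produce and organic meat leading the growth, supermarkets said.")
print(extract_terms(story, 3))
```

The extracted terms can then be matched against Wikipedia page titles to find departure points.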

Look at this news story on organic food sales. The “see also” links point to related stories on organic food but don’t encourage WWILFing. The BBC is a bit of an ivory tower: it has lots of content that it can link to internally but it doesn’t spread out into the rest of the Web very well.

How do you decide what would be interesting terms to link off with? How do you define “interesting”? You could use Google page rank or Technorati buzz for the external pages to decide if they are considered “interesting”. But you still need contextual relevance. That’s where del.icio.us comes in. If extracted terms match well to tags for a URL, there’s a good chance it’s relevant (and del.icio.us also provides information on how many people have bookmarked a URL).
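That matching step can be sketched as a simple relevance score: term/tag overlap weighted by bookmark count. All the candidate URLs, tags and counts below are made up; del.icio.us would supply the real data.

```python
def relevance(extracted_terms, bookmark_tags, bookmark_count):
    """Score a candidate URL: Jaccard overlap between the article's
    extracted terms and the tags people gave the URL, weighted by
    how many people bookmarked it."""
    terms = {t.lower() for t in extracted_terms}
    tags = {t.lower() for t in bookmark_tags}
    if not terms or not tags:
        return 0.0
    overlap = len(terms & tags) / len(terms | tags)
    return overlap * bookmark_count

terms = ["organic", "food", "farming", "sales"]
candidates = {
    "http://example.com/organic-growth": (["organic", "food", "business"], 120),
    "http://example.com/apple-history":  (["apple", "history"], 300),
}
best = max(candidates, key=lambda url: relevance(terms, *candidates[url]))
print(best)  # http://example.com/organic-growth
```

Note how the heavily bookmarked but irrelevant page loses to the contextually relevant one.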

So that’s what they did. They called it “muddy boots” because it would create dirty footprints across the pristine content of the BBC.

The “muddy boots” links for the organic food article point to articles on other news sites that are genuinely interesting for this subject matter.

Here’s another story, this one from last week about the dissection of a giant squid. In this case, the journalist has provided very good metadata. The result is that there’s some overlap between the “see also” links and the “muddy boots” links.

But there are problems. An article on Apple computing brings up a “muddy boots” link to an article on apples, the fruit. Disambiguation is hard. There are also performance problems if you are relying on an external API like del.icio.us’s. Also, try to make sure you recommend outside links that are written in the same language as the originating article.

Muddy boots was just one example of using some parts of the commons (Wikipedia and del.icio.us). There are plenty of others out there like Magnolia, for example.

But back to disambiguation, the big problem. Maybe the Semantic Web can help. Sources like Freebase and DBpedia add more semantic data to Wikipedia. They also pull in data from Geonames and MusicBrainz. DBpedia extracts the disambiguation data (for example, on the term “Apple”). Compare terms from disambiguation candidates to your extracted terms and see which page has the highest correlation.
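That correlation check can be sketched in a few lines: score each candidate sense by the overlap between its associated terms and the article’s extracted terms, and pick the winner. The candidate term lists below are invented; in practice they’d come from DBpedia’s disambiguation data.

```python
def pick_sense(article_terms, senses):
    """Disambiguate: return the candidate sense whose associated terms
    best overlap (Jaccard) with the article's extracted terms."""
    article = {t.lower() for t in article_terms}
    def score(sense_terms):
        s = {t.lower() for t in sense_terms}
        return len(article & s) / len(article | s)
    return max(senses, key=lambda name: score(senses[name]))

article_terms = ["iphone", "mac", "software", "cupertino"]
senses = {
    "Apple Inc.":    ["computer", "iphone", "mac", "software", "cupertino"],
    "Apple (fruit)": ["fruit", "tree", "orchard", "cider"],
}
print(pick_sense(article_terms, senses))  # Apple Inc.
```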

But why stop there? Why not allow routes back into our content? For example, having used DBpedia to determine that your article is about Apple, the computer company, you could add an hCard for the Apple company to that article.

If you’re worried about the accuracy of commons data, you can stop worrying. It looks like Wikipedia is more accurate than traditional encyclopedias. It has authority, a formal review process and other tools to promote accuracy. There are also third-party services that will mark revisions of Wikipedia articles as being particularly good and accurate.

There’s some great commons data out there. Use it.

Rob is done. That was a great talk and now there’s time for some questions.

Brian asks if they looked into tying in non-text content. In short, no. But that was mostly for time and cost reasons.

Another question, this one about the automation of the process. Is there still room for journalists to spend a few minutes on disambiguating stories? Yes, definitely.

Gavin asks about data as journalism. Rob says that this is particularly relevant for breaking news.

Ian’s got a question. Journalists don’t have much time to add metadata. What can be done to make it easier? Is it an interface issue? Rob says we can try to automate as much as possible to keep the time required to a minimum. But yes, building things into the BBC CMS would make a big difference.

Someone questions the wisdom of pushing people out to external sources. Doesn’t the BBC want to keep people on their site? In short, no. By providing good external references, people will keep coming back to you. The BBC understand this.