Tagging is an inherent part of blogging. You not only write a post but also tell people what it is associated with. This is pretty convenient for the readers, but not very convenient for me as a blogger. Initially, I added the tags manually; then Lorelle’s Tagging Bookmarklet article showed a better way of doing it. It also points to a couple of Greasemonkey scripts for Mozilla Firefox users.
Today I came across a post about the Yahoo! Term Extractor API by Nate Koechley. This can result in something that benefits not only the readers but also the bloggers. In addition to ensuring that no terms are missed, it can fully automate the discovery of related posts/articles on tag-based services like Technorati. And, coming from Yahoo!, it is very much usable from PHP, which should make it compatible with WordPress.
UPDATE
Term extraction is currently offered as a web service by Yahoo!. This means that we can programmatically submit the content of our post to this web service and get back the terms found in it. For example, in WordPress we can do this on the publish_post action. Since the identification of terms is delegated to the Term Extractor web service, it is completely automated. Once we have the terms, we can tell the different tag-based services, which may themselves be web services, to associate the post with those terms. The power of this approach is in automating the entire activity of tagging a post.
To carry this out, we have to get an application ID, which Yahoo! web services use to identify us, and then use Representational State Transfer (REST) to form a query that submits our content to the Term Extractor web service. The output is available in the JavaScript Object Notation (JSON) format. We can then parse this reply and submit the tags to the tag-based services.
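The steps above can be sketched roughly as follows. This is only an illustration in Python, not a working plugin: the endpoint is a placeholder, the parameter names (appid, context, output) and the ResultSet/Result reply shape are my recollection of the service's interface and should be treated as assumptions, and all function names are made up for this example. It builds the request body, parses a JSON reply, and formats the terms as a tag line.

```python
import json
from urllib.parse import urlencode

# Placeholder endpoint (assumption): replace with the URL from Yahoo!'s documentation.
ENDPOINT = "https://example.invalid/termExtraction"


def build_request_body(app_id, post_text):
    """Form-encode the fields we assume the service expects."""
    return urlencode({
        "appid": app_id,       # application ID that identifies us (assumed name)
        "context": post_text,  # the post content to analyse (assumed name)
        "output": "json",      # ask for a JSON reply (assumed name)
    })


def parse_terms(reply_text):
    """Pull the list of terms out of a JSON reply.

    Assumes a reply shaped like {"ResultSet": {"Result": [...]}}.
    """
    data = json.loads(reply_text)
    result = data.get("ResultSet", {}).get("Result", [])
    if isinstance(result, str):  # tolerate a single term returned as a string
        result = [result]
    return [term.strip().lower() for term in result if term.strip()]


def technorati_tag_line(terms):
    """Format the terms like the tag line at the end of this post."""
    return "Technorati tags: " + ", ".join(terms)


if __name__ == "__main__":
    body = build_request_body("my-app-id", "Tagging a blog post")
    print("POST", ENDPOINT)
    print(body)
    sample_reply = '{"ResultSet": {"Result": ["Term Extractor", "yahoo"]}}'
    print(technorati_tag_line(parse_terms(sample_reply)))
```

In WordPress itself the same steps would run in PHP from the publish_post action; the sketch deliberately leaves out the HTTP call so that nothing here depends on the service being reachable.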
Technorati tags: tag, blogging, bookmarklet, wordpress, greasemonkey, firefox, term extractor, yahoo, technorati, php, web+service, rest, json, representational state transfer, javascript object notation
Copyright Abhijit Nadgouda


February 14th, 2006 at 9:31 pm
The Term Extractor sounds great. I’ll take a look at it, but can you explain how to get it to work with wordpress.com blogs? That will help us understand its “power”.
February 15th, 2006 at 12:38 am
Lorelle, I have updated the post to answer your question.
November 17th, 2006 at 3:51 am
Hi everybody!
TermExtractor, my master's thesis, is online at http://lcl2.di.uniroma1.it.
TermExtractor is a FREE and high-performing software package for terminology extraction. The software helps a web community extract and validate relevant terms in its domain of interest by submitting an archive of domain-related documents in any format.
TermExtractor extracts the terminology consensually referred to in a specific application domain. The software takes a corpus of domain documents as input, parses the documents, and extracts a list of “syntactically plausible” terms (e.g. compounds, adjective-nouns, etc.).
Document parsing assigns greater importance to terms with text layouts (title, bold, italic, underlined, etc.). Two entropy-based measures, called Domain Relevance and Domain Consensus, are then used. Domain Consensus is used to select only the terms that are consensually referred to throughout the corpus documents. Domain Relevance is used to select only the terms that are relevant to the domain of interest; it is computed with reference to a set of contrastive terminologies from different domains. Finally, extracted terms are further filtered using Lexical Cohesion, which measures the degree of association of all the words in a terminological string. Accepted file formats are: txt, pdf, ps, dvi, tex, doc, rtf, ppt, xls, xml, html/htm, chm, wpd and also zip archives.
I would like you to participate in the TermExtractor evaluation task. The results of your evaluation will be put in a paper (I enclose a draft). Please contact me if you want to participate (this is very important for me!).
MANY THANKS!!!
–
Francesco Sclano
home page: http://lcl2.di.uniroma1.it/~sclano
msn: francesco_sclano@yahoo.it
skype: francesco978