Term Extractor For Tags

Tagging is an inherent part of blogging. You not only write a post but also tell people what it is associated with. This is convenient for readers, but not so convenient for me as a blogger. Initially I added tags manually; then Lorelle’s Tagging Bookmarklet article showed a better way of doing it. It also points to a couple of Greasemonkey scripts for Mozilla Firefox users.

Today I came across a post by Nate Koechley about the Yahoo! Term Extractor API. This can result in something that benefits not only readers but also bloggers. In addition to ensuring that no terms are missed, it can fully automate the discovery of related posts and articles on tag-based services like Technorati. And coming from Yahoo!, it is very much usable from PHP, and therefore compatible with WordPress!


Term extraction is currently offered as a web service by Yahoo!. This means that we can programmatically submit the content of a post to this web service and get back the terms found in it. In WordPress, for example, we can do this on the publish_post action. Since the identification of terms is delegated to the Term Extractor web service, it is completely automated. Once we have the terms, we can tell the different tag-based services, each of which may itself be a web service, to associate the post with those terms. Its power lies in automating the entire activity of tagging a post.
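To make the flow concrete, here is a minimal sketch of what could run when a post is published. In WordPress this would be PHP hooked to publish_post; Python is used here only to show the steps. The names on_publish, technorati_tag_links and extract_terms are hypothetical, and the tag URL shape (http://technorati.com/tag/ followed by the term, with rel="tag" on the link) is an assumption about how Technorati picks up tags.

```python
import html
from urllib.parse import quote


def technorati_tag_links(terms):
    """Render each term as a rel="tag" link (assumed Technorati URL shape)."""
    links = []
    for term in terms:
        # Percent-encode the term so spaces and punctuation are URL-safe.
        url = "http://technorati.com/tag/" + quote(term)
        links.append('<a href="%s" rel="tag">%s</a>' % (url, html.escape(term)))
    return links


def on_publish(content, extract_terms):
    """Hypothetical publish hook: extract terms, then append tag links.

    extract_terms is any callable that takes the post text and returns a
    list of terms, for example a wrapper around a term extraction service.
    """
    terms = extract_terms(content)
    if not terms:
        # Nothing was extracted, so leave the post untouched.
        return content
    tag_line = "<p>Technorati tags: " + ", ".join(technorati_tag_links(terms)) + "</p>"
    return content + "\n" + tag_line


# Usage with a stand-in extractor that returns fixed terms.
tagged = on_publish("A post about REST.", lambda text: ["rest", "web service"])
```

Passing the extractor in as a function keeps the tagging step independent of whichever term extraction service is actually used.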

To carry this out, we have to get an application ID, which Yahoo! web services use to identify us, and then use Representational State Transfer (REST) to form a query that submits our content to the Term Extractor web service. The output is in JavaScript Object Notation (JSON) format. We can then parse this reply and submit the tags to the tag-based services.
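The request and the parsing step might look roughly like the sketch below. The endpoint URL, the parameter names (appid, context, output) and the shape of the JSON reply are assumptions that should be checked against Yahoo!'s own documentation, and the service may not be available at that address. The sketch only builds the request body and parses a sample reply; it makes no network call.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the Yahoo! Term Extractor web service.
TERM_EXTRACTION_URL = (
    "http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction"
)


def build_request_body(app_id, content):
    """Return the form-encoded POST body for a term extraction request."""
    return urlencode({
        "appid": app_id,     # application ID that identifies us (assumed name)
        "context": content,  # the text of the post (assumed name)
        "output": "json",    # request a JSON reply (assumed name)
    })


def parse_terms(reply_text):
    """Pull the list of terms out of a JSON reply (assumed reply shape)."""
    result = json.loads(reply_text).get("ResultSet", {}).get("Result")
    if not result:
        return []
    # Be defensive: a single term might arrive as a bare string.
    if isinstance(result, str):
        return [result]
    return list(result)


# A made-up reply in the assumed shape, used instead of a live call.
sample_reply = '{"ResultSet": {"Result": ["term extractor", "web service"]}}'
terms = parse_terms(sample_reply)
```

The body from build_request_body would be sent as an HTTP POST to TERM_EXTRACTION_URL, and the reply text handed to parse_terms.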

Technorati tags: rest

Copyright Abhijit Nadgouda

Discussion

  1. Lorelle VanFossen said:

    The Term Extractor sounds great. I’ll take a look at it, but can you explain how to get it to work with WordPress.com blogs? That will help us understand its “power”.

  2. Abhijit Nadgouda said:

    Lorelle, I have updated the post to answer your question.

  3. Francesco Sclano said:

    Hi everybody!
    TermExtractor, my master thesis, is online at the
    address http://lcl2.di.uniroma1.it.

    TermExtractor is a FREE and high-performing software package for Terminology
    Extraction. The software helps a web community to
    extract and validate relevant domain terms in their
    interest domain, by submitting an archive of
    domain-related documents in any format.

    TermExtractor extracts the terminology consensually
    referred to in a specific application domain. The
    software takes as input a corpus of domain documents,
    parses the documents, and extracts a list of
    “syntactically plausible” terms (e.g. compounds,
    adjective-nouns, etc.).
    Document parsing assigns greater importance
    to terms with text layouts (title, bold, italic,
    underlined, etc.). Two entropy-based measures, called
    Domain Relevance and Domain Consensus, are then used.
    Domain Consensus is used to select only the terms
    which are consensually referred to throughout the corpus
    documents. Domain Relevance is used to select only the
    terms which are relevant to the domain of interest; it
    is computed with reference to a set of contrastive
    terminologies from different domains.
    Finally, extracted terms are further filtered using
    Lexical Cohesion, which measures the degree of
    association of all the words in a terminological
    string. Accepted file formats are: txt, pdf, ps, dvi,
    tex, doc, rtf, ppt, xls, xml, html/htm, chm, wpd and
    also zip archives.

    I’d like you to participate in the TermExtractor
    evaluation task. The result of your evaluation will be
    put in a paper (I enclose a draft). Please contact me
    if you want to participate (this is very important for


    Francesco Sclano
    home page: http://lcl2.di.uniroma1.it/~sclano
    msn: francesco_sclano@yahoo.it
    skype: francesco978

