text mining

Qualified as an Advanced GATE Certified Text Analyst

In May I attended the GATE (General Architecture for Text Engineering) training course, taking the advanced Track 3 and Track 4.


  • Track 1: Introduction to GATE and Text Mining

    • Module 1: Introduction to GATE Developer: GATE's application development
      environment
    • Module 2: Information Extraction and ANNIE: Our open-source IE system
    • Module 3: Introduction to JAPE: GATE's powerful rule-writing engine
    • Module 4: GATE Teamware: Web-based Collaborative Annotation Environment


  • Track 3: Advanced GATE

    • Module 9: Semantic Annotation with GATE
    • Module 10: Advanced GATE Applications: complex applications, writing
      multilingual IE, business intelligence, evaluation
    • Module 11: Machine Learning with GATE
    • Module 12: Sentiment Analysis


  • Track 4: Semantic Technology

    • Module 13: Semantic Technology and Linked Open Data: Basics, Tools, and Applications


To become GATE Certified, you have to pass at least 3 modules; to become GATE Advanced Certified, you have to pass at least 6. I have now passed all 7 of the modules I have taken, which qualifies me for the title of GATE Advanced Text Analyst for GATE Version 6.2.

My name will soon appear on GATE's Hall of Fame.

Report: Web classification of Digital Libraries using GATE Machine Learning

This is the report that accompanies my Machine Learning work in GATE on discriminating Digital Library web sites from all other web content. The presentation for the same work can be found here.

Web classification of Digital Libraries using GATE Machine Learning

This is the talk for a project completed for a graduate course on data mining at the Syracuse University School of Information Studies. The professor was Howard Turtle, and he guided me in completing a project in text mining using the GATE software.

The talk has a twofold purpose: 1) to learn about the natural language processing suite known as GATE (General Architecture for Text Engineering), and to teach my colleagues about it, especially with regard to its Machine Learning (ML) capabilities; and 2) to use the GATE architecture to classify web documents into two groups: sites that function as digital libraries (DL) and all other sites (non-DL).

I also wrote a paper discussing the natural language processing capabilities and machine learning algorithms used for web classification with GATE. The paper covers the studies I completed, and their results, in more detail.

Additionally, I will be attending a training course to become an Advanced GATE Certified Text Analyst from May 16 to 20 at the University of Sheffield's Computer Science Department.