This talk presents a project completed for a graduate course on data mining at the Syracuse University School of Information Studies. The professor, Howard Turtle, guided me through a text mining project using the GATE software.
The talk has a twofold purpose: 1) to learn about, and teach my colleagues about, the natural language processing suite known as GATE (General Architecture for Text Engineering), especially with regard to its Machine Learning (ML) capabilities; and 2) to use the GATE architecture to classify web documents into two groups: sites that function as digital libraries (DL) and all other sites (non-DL).
I also wrote a paper discussing the natural language processing capabilities and machine learning algorithms used for web classification with GATE. The paper covers the studies and their results in more detail.
Additionally, from May 16 to 20 I will attend a training course in the University of Sheffield's Computer Science Department to become an Advanced GATE Certified Text Analyst.