3.4 Techniques/Tools Used
3.4.5 Development Tools
A major part of this research involved programming, so one of the first tasks was to select a programming language and appropriate data storage for the results.
The decision on data storage was made easier by the fact that one of the project requirements was to integrate the results of the new system with the database behind the CRM package used, which is a Microsoft (MS) SQL Server 2000 database.
Of the many programming languages available, the choice was narrowed to PHP, the supervisors' recommendation, and the .NET framework, with which the author was most familiar. PHP was initially chosen for the development of the search engine for the following reasons:
PHP is free and open-source, which means that new releases and patches come out regularly. Languages that are not open-source, such as VB.NET, follow rigid processes for fixing bugs, so roll-outs or patch releases can be considerably delayed.
PHP was designed with the Web in mind; it is therefore very efficient and offers a rich selection of built-in functions for dealing with various aspects of the Web. Some of these functions, e.g. get_meta_tags(), have no direct equivalent in .NET and need to be implemented through many lines of code.
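To give a sense of what implementing such a function by hand involves, the following is a minimal Python sketch that loosely mirrors the behaviour of get_meta_tags(). It is an illustration only and not part of the prototype: the names MetaTagParser and extract_meta_tags are hypothetical, and the sketch does not reproduce every detail of the PHP function (for example, it scans the whole document rather than stopping at the end of the head element).

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling this.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            # Keys are lower-cased so lookups do not depend on the page's casing.
            self.meta[name.lower()] = content


def extract_meta_tags(html):
    """Hypothetical helper: return a dict of meta tag names to their content."""
    parser = MetaTagParser()
    parser.feed(html)
    return parser.meta
```

Calling extract_meta_tags() on a page's HTML returns a dictionary keyed by the lower-cased meta tag names, which is roughly the shape of result that the PHP function provides in a single call.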
Both the search engine and the classification system were developed in PHP. Although both systems were efficient at the beginning, limitations of PHP started to become apparent, particularly when handling connections and communications with MS SQL Server.
An error message (Figure 3.2) occurred during the crawling of a large site comprising over 4800 web pages. The initial suspicion was a memory leak that was consuming the available Random Access Memory (RAM) and forcing the system to fall back on virtual memory to store temporary data during execution. However, after many tests and error-catching procedures, the idea of a memory leak was rejected.
Figure 3.2: Server Error Message
Further research into the problem showed that many other programmers had experienced similar problems and that it could be an MS SQL Server and PHP connectivity issue, in which PHP does not close connections cleanly and memory is therefore not released after use. The solutions posted on Internet forums that were tried proved unsuccessful. It was noticed, however, that after the error occurred a simple browser refresh resumed the program; because the program was written to record all visited sites, the refresh restarted it exactly where it had been interrupted. It was also observed that the error message recurred after around 700 links had been visited. An automatic refresh was therefore built into the program, which refreshed the web page at a rate determined in the database settings (e.g. every 700 links).
Extensive testing of the new functionality showed no further errors, and the program ran uninterrupted.
The above does not identify the cause of the problem; however, it does resolve the issue. This solution also strengthens the program: whatever the number of web pages being crawled, the system will always complete the crawling process, even on machines with limited RAM, as each refresh clears all previous memory allocations.
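The resume-after-refresh idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the prototype's code: it is written in Python rather than PHP, SQLite stands in for MS SQL Server, each call to crawl_batch plays the role of one page load between refreshes, and the names open_store, crawl_batch, fetch_links and the links table are all hypothetical.

```python
import sqlite3


def open_store(seed_url, path=":memory:"):
    """Create the (hypothetical) links table and register the starting URL."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links ("
        "id INTEGER PRIMARY KEY, url TEXT UNIQUE, visited INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute("INSERT OR IGNORE INTO links (url) VALUES (?)", (seed_url,))
    conn.commit()
    return conn


def crawl_batch(conn, fetch_links, batch_size=700):
    """Visit up to batch_size pending links, then return.

    Progress is committed after every link, so a later call resumes exactly
    where this one stopped. Returns True while unvisited links remain.
    """
    processed = 0
    while processed < batch_size:
        row = conn.execute(
            "SELECT url FROM links WHERE visited = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return False  # nothing left to crawl
        url = row[0]
        for found in fetch_links(url):
            # The UNIQUE constraint ensures already-known links are skipped.
            conn.execute("INSERT OR IGNORE INTO links (url) VALUES (?)", (found,))
        conn.execute("UPDATE links SET visited = 1 WHERE url = ?", (url,))
        conn.commit()
        processed += 1
    remaining = conn.execute(
        "SELECT COUNT(*) FROM links WHERE visited = 0"
    ).fetchone()[0]
    return remaining > 0
```

A driver would call crawl_batch repeatedly until it returns False, releasing resources between calls. Because the record of visited links lives in the database rather than in memory, the batch size (700 in the text) only affects how often the process restarts, not which pages end up being crawled.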
The final stage of the project, the development of the automated extraction system, called for a reassessment of the strengths and weaknesses of PHP and the .NET framework. The decision to enhance genetic programming principles to optimise web extraction required a language powerful enough to manage thousands of population generations and large volumes of data. The PHP problem described above, together with the reasons listed below, justified changing programming languages and opting for the .NET framework, specifically VB.NET, for the last part of the project:
VB.NET is fully object-oriented and as such supports inheritance, encapsulation, polymorphism, etc. PHP is a scripting language and, although powerful, is not a pure object-oriented language.
The .NET Class Library (or Base Class Library) is available to all the .NET languages, resulting in a consistent model regardless of the programming language used.
Debugging is very efficient, as .NET supports runtime diagnostics that help to track down bugs and determine how well an application is performing. Visual Studio .NET also provides excellent debugging features, such as breakpoints and tracing of sections of the code, making programming easier and faster. PHP offers none of these, so debugging is time-consuming.
PHP is less consistent than VB.NET and lacks standardised structures for exception and error handling, leaving programmers to code error-handling techniques themselves. (PHP 5 has begun to address this issue and now supports a form of the Try…Catch and Throw structure.)
.NET offers straightforward application deployment and maintenance. The installation process requires only that the application and its components are copied into a directory on the target machine.
.NET has improved the way that code is shared between applications by introducing the concept of the assembly, which replaces the traditional DLL. Assemblies are the .NET unit of deployment, versioning and security, and different versions of an assembly can exist side by side.
The development of the WIE part of the project in VB.NET, specifically the part of the system dealing with the extraction of training course names, required the reuse of the previously developed NB approach for the fitness evaluation of the evolved Regular Expressions (see section 4.4.2.5). This made it necessary to convert the NB system to VB.NET so that it was compatible with the GP system. Results from testing both the PHP and the VB.NET versions of the classification part of the prototype are compared in chapter 5.
3.5 SUMMARY
This chapter investigated the different methodologies currently available and justified the choice of those most appropriate for this project. The specific research methods and development tools adopted in this research were also discussed.
The main aspects of the project are explained in detail in the following chapter.
4 RESEARCH AND DEVELOPMENT
4.1 INTRODUCTION
This chapter concentrates on the design and development work, focusing on both the back end (database structure) and front end (program) of the prototype. Technical details of the two main approaches chosen, Naïve Bayes Networks and Genetic Programming, are also presented. Diagrams are used to better illustrate the system's functionality.
[Diagram: the Web; Crawler, Trainer, Classifier, Indexer and GP components (Stages 1 to 4); Database with Stored Procedures. Outcomes of WIR stages: 1, links; 2, features and probabilities; 3, categorised links. Outcome of WIE stage: successful REs, successful genotypes, course details.]
Figure 4.1: Overall System
ATM currently uses its CRM package as the front end to all details about its clients, including their needs and behaviours, as well as the different courses available. The prototype developed as part of this research therefore serves as a mediator between the Web and CRM, collecting course information from training websites and feeding it to CRM through ATM's database, whilst guaranteeing the accuracy of the courses already stored in the database. Figure 4.1 shows the different stages involved in the prototype together with the outcomes from each stage.