3. Personal Name Occupation: Each occupation can belong to one or more personal name concepts.
4. Personal Name Surface Form: Each personal name surface form can belong to one or more persons.
The database contains five tables: Entity_Data, EntityTree, Category, Entity_SurfaceForm and Entity_Relations as described below.
1. Entity_Data(ENid, ENname) is a collection of personal names that are extracted from a knowledge base.
2. EntityTree(ENid, root, Tree) is a collection of personal name concepts. A tree is a set of category IDs (Cids). For example, the personal name concept (1, 0, {0, 4, 41, 44}) means that the personal name with ID 1 has root node concept 0 and the category set {0, 4, 41, 44}.
3. Category(Cid, Cname, Cparent, lft, rgt) is a collection of occupations. Cid is the primary key, Cname is the occupation's name, Cparent is the parent category of this node, and lft and rgt are the left and right values calculated using the Modified Preorder Tree Traversal (MPTT) algorithm.
4. Entity_SurfaceForm(Pname, ENids) is a collection of terms that are used to refer to personal names. Pname is a term and ENids is the set of people referred to by that term. For example, in the personal surface form (Aaron Brown, {21, 22, 81754, 90459}), Aaron Brown is the Pname and {21, 22, 81754, 90459} is the set of ENids.
5. Relations(ENid, relation, objID) is a collection of personal name relationships, including hasChild and IsMarried. objID is a set of ENids.
The database is used to identify and disambiguate personal names mentioned within a web document. The tables Entity_SurfaceForm and Entity_Data are used in the searcher component to generate a set of candidate entities for each mentioned name. The four tables Entity_Data, EntityTree, Category and Relations are used in the disambiguator component to resolve lexical ambiguity. The data used in the searcher component and in the disambiguator component are kept separate.
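The role of the lft and rgt columns in Category can be illustrated with a minimal Python sketch. It assumes a standard MPTT numbering (a depth-first walk that assigns lft on entry and rgt on exit), under which a category is a descendant of another exactly when its (lft, rgt) interval lies strictly inside the other's. The category IDs 0, 4, 41 and 44 come from the EntityTree example above; the parent links, the category names, and the helper names assign_mptt and is_descendant are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Category:
    cid: int
    cname: str
    cparent: Optional[int]  # None marks the root category
    lft: int = 0
    rgt: int = 0


def assign_mptt(categories: Dict[int, Category], root: int) -> None:
    """Fill in lft/rgt with a depth-first walk starting at the root."""
    children: Dict[Optional[int], List[int]] = {}
    for cat in categories.values():
        children.setdefault(cat.cparent, []).append(cat.cid)

    counter = 0

    def visit(cid: int) -> None:
        nonlocal counter
        counter += 1
        categories[cid].lft = counter      # numbered on entry
        for child in sorted(children.get(cid, [])):
            visit(child)
        counter += 1
        categories[cid].rgt = counter      # numbered on exit

    visit(root)


def is_descendant(node: Category, ancestor: Category) -> bool:
    """True when node's interval is strictly nested inside ancestor's."""
    return ancestor.lft < node.lft and node.rgt < ancestor.rgt


# Hypothetical hierarchy: names and parent links are assumptions.
cats = {
    0: Category(0, "person", None),
    4: Category(4, "artist", 0),
    41: Category(41, "painter", 4),
    44: Category(44, "sculptor", 4),
}
assign_mptt(cats, root=0)
```

With this numbering, checking whether one occupation falls under another is a pair of integer comparisons instead of a walk up the Cparent chain.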
6.3 Work Flow of the Process
The flowcharts shown in Figure 6.4 and Figure 6.5 explain the sequence of steps and the logic used to handle our problems. The system starts by taking a URL as input and then passes it through multiple processes; the final result is a personal name linked to a real-world entity or to the NIL value. After that, the NIL values are processed to browse possible people using the BingAPI. The system work flow is described below.
1. The process starts when a user submits a URL.
2. The extractor passes the URL to AlchemyAPI through the Internet. AlchemyAPI returns the personal names in XML format. The extractor queries the mentioned names in the XML document. If the extractor returns more than two people, the system goes to the personal name transformation process; otherwise, the process finishes.
3. A set of personal names is transformed into a uniform format using CFG rules and a personal name dictionary. The set of transformed names is passed on to generate a set of candidate entities. The unrecognised mentioned names go through the NIL value predicting process.
4. The searcher matches mentioned names against the personal name surface forms using the Jaro-Winkler text similarity function. The process makes one of two decisions: 1) if a mentioned name matches a personal name surface form, go to the process that checks the total number of candidate entities; 2) if it does not match a personal name surface form, go to the process that matches the standard name against the personal name surface forms.
5. The searcher matches a standard name against the personal name surface forms by calculating the Jaro-Winkler similarity score between the transformed name and each term in the surface form collection. For each standard name, every surface form with a similarity score above 0.97 is generated as a candidate. The threshold of 0.97 comes from our experimental results: our aim in generating candidate entities is to balance precision and recall, and 0.97 is the effective point at which a personal name containing a single-letter error can still become a candidate entity.
6. After the candidate entities are assigned to the mentioned name, the candidate generation process makes one of two decisions: 1) if a mentioned name has no candidate entity, go to generate the NIL value; 2) if a mentioned name has a set of candidate entities, go to the process that checks the total number of candidate entities.
7. When counting the number of candidate entities for each mentioned name, the process makes one of two decisions: 1) if the number of candidate entities = 1, go to the entity linking process; 2) if the number of candidate entities > 1, go to the compute SPTM process.
8. The SPTM process calculates the similarity score using the SPTM algorithm. The candidate with the highest score is selected and passed to the entity linking process.
9. The set of linked entities is evaluated by considering the root node of each real-world entity. The system returns the NIL value for a mentioned name whose root node differs from that of the people collection. The system then displays the final results.
10. The NIL mentions are processed to predict possible people using BingAPI, and the system displays the top ten links that are related to those possible people.
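Steps 4 to 7 above can be sketched in Python as follows. This is a minimal illustration rather than the system's actual code: it assumes the standard Jaro-Winkler formulation (prefix scale 0.1, common prefix capped at four characters), collapses the two matching passes of steps 4 and 5 into a single lookup, and uses hypothetical names (generate_candidates, route, and the labels "NIL", "LINK" and "SPTM"). The Aaron Brown entry is taken from the Entity_SurfaceForm example; the second entry is invented so that the single-candidate branch can be exercised.

```python
from typing import Dict, Set


def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, scale: float = 0.1) -> float:
    """Jaro score boosted by the length of the common prefix (max 4)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * scale * (1 - score)


def generate_candidates(name: str, surface_forms: Dict[str, Set[int]],
                        threshold: float = 0.97) -> Set[int]:
    """Union of ENids of every surface form scoring above the threshold."""
    candidates: Set[int] = set()
    for pname, enids in surface_forms.items():
        if jaro_winkler(name, pname) > threshold:
            candidates |= enids
    return candidates


def route(candidates: Set[int]) -> str:
    """Decisions of steps 6 and 7: NIL, direct linking, or SPTM."""
    if not candidates:
        return "NIL"
    if len(candidates) == 1:
        return "LINK"
    return "SPTM"


SURFACE_FORMS = {
    "Aaron Brown": {21, 22, 81754, 90459},  # example from the text
    "Abigail Adams": {7},                   # hypothetical entry
}
```

For instance, route(generate_candidates("Aaron Brwn", SURFACE_FORMS)) sends the misspelt mention to the SPTM step, because the dropped letter still scores above 0.97 against Aaron Brown and that surface form has four candidates. Under this standard formulation, whether a particular single-letter error clears the threshold depends on the length of the name and on the kind of error.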