ScanR evaluation and reference guide version 1.0.0.3
ScanR™
Contents
INTRODUCTION
INSTALLATION
HARDWARE REQUIREMENTS
SOFTWARE PREREQUISITES
ACCOUNT REQUIREMENTS
WINDOWS SERVER INSTALLATION
WINDOWS CLIENT INSTALLATION
ADDING OR UPDATING A LICENSE
RUNNING A SCAN OF A FILE SHARE
CREATE A NEW CONFIGURATION
CHOOSE THE RULES FOR THE CONFIGURATION
UNDERSTANDING THE TYPES OF RULES AND HOW THEY WORK
MATCH A TERM RULE
MATCH A PATTERN RULE
MATCH A TERM AND PATTERN WITHIN A PROXIMITY
NATURAL LANGUAGE PROCESSING (ARTIFICIAL INTELLIGENCE) RULES
RUNNING A SCAN
EXAMINING THE RESULTS
EXPORTING THE RESULTS
RUNNING AN INCREMENTAL SCAN
RESETTING SCAN
ADDING A NEW RULE
UNDERSTANDING RULE WEIGHTS
ADDING RULE EXCLUSIONS
EXCLUDING A DOMAIN FROM THE EMAIL RULE
CHANGING THE PROXIMITY OF A RULE
SCANR OPTIONS
FILE SIZE THRESHOLD
FILE CHARACTERS THRESHOLD
OCR LANGUAGE
OCR IMAGE FILES
WORD BREAKERS
SCANNING SHAREPOINT CONTENT
PERFORMANCE OPTIMIZATION CONSIDERATIONS
ENVIRONMENT
CONFIGURATIONS
RULES
OPTIONS
REPORTING
SUBJECT ACCESS REQUEST (SAR) REPORTING
TO LOG AN INCOMING SAR
TO CREATE A SAR SEARCH
RUNNING A SAR REPORT
APPENDICES
APPENDIX A – SUPPORTED FILE TYPES
APPENDIX B – SUPPORTED DOCUMENT SYSTEMS
APPENDIX C – RULES
Introduction
Welcome to the evaluation and reference guide for ScanR. ScanR is an application that identifies sensitive information inside electronic documents. It connects to document repositories, extracts the text from each file and passes the text through a set of rules. Once a scan is completed, the results can be examined within the application, exported to a file, or queried by an analysis tool connected directly to the database containing the results.
This document shows you the features and usage of ScanR using a sample set of documents, and provides reference information for future use.
For help or technical questions please contact [email protected]
Installation
ScanR is installed on a local machine in your environment. It does not require a connection to the internet and does not send any information outside of your organisation. ScanR stores configuration and reporting data in a local instance of SQL Compact Edition. The software can be installed on any modern operating system (including Windows 10) for evaluation purposes. For a production installation, the following minimum hardware is recommended, running Windows Server 2008 R2 or later.
Hardware requirements
Deployment type and scale | RAM | Processor | Hard disk space
Up to one million documents stored on File Shares or SharePoint | 8 GB | 64-bit, 4 cores | 80 GB for system drive
Over one million documents stored on File Shares or SharePoint | 16 GB | 64-bit, 4 cores | 200 GB for system drive
Software prerequisites
The following software packages are required prior to the installation of ScanR.
Package | Download location | Size
.NET 4.5.2 or later | https://www.microsoft.com/en-gb/download/details.aspx?id=42642 | 66.8 MB
SQL CE 4.0 x64 | https://www.microsoft.com/en-gb/download/details.aspx?id=17876 |
Account requirements
For audit and security purposes it is recommended that the ScanR software uses a dedicated account.
Scenario | Requirement
Installation | Local administrator privileges are required to install the software.
File Share Scanning | Read access to all content that is required to be scanned.
SharePoint Scanning | Read access to all content that is required to be scanned.
The software connects to SharePoint using the client-side object model (CSOM) and does not require any software to be installed on the SharePoint farm or tenant. SharePoint online (Office 365) and Server 2010, 2013 and 2016 are supported.
Windows Server Installation
The software is provided as a single MSI file. Installation steps:
• Ensure the minimum hardware requirements are met
• Ensure the prerequisite software packages are installed
• Using an account with local administration rights, unzip and run the provided ScanR.msi file
A shortcut to the application will be added to the desktop.
Windows Client installation
The application is designed to run on a server but can be installed on a Windows client machine for evaluation purposes. There are two issues to be aware of when testing:
You need to manually run the application as the administrator
• Ensure the prerequisite software packages are installed
• Unzip and run the ScanR.msi file
• Open File Explorer and browse to ..\Program Files (x86)\TermSet\ScanR
• Right click on ScanR.exe and select Run as Administrator
You may not be able to see mapped drives without changing a setting
There is a known issue with some Windows clients where programs are not able to access mapped drives by default (https://support.microsoft.com/kn-in/help/3035277/mapped-drives-are-not-available-from-an-elevated-prompt-when-uac-is-co)
You can ensure mapped drives are visible in ScanR by taking the following steps:
• Run Registry Editor (regedit.exe) and locate the following key: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System
• Create a new DWORD entry with the name EnableLinkedConnections and value 1
Adding or updating a license
If you are evaluating the software, then a trial license will already be installed. To run the software in production you will need to add the license key issued to you. There are two types of ScanR license:
Type | Duration | Volume of content * | Restrictions
Trial | 7 days | Unlimited | 1 in 3 files will be randomly skipped from the scanning process
Standard | 1, 2 or 3 years | 1TB – 10000TB depending on licence | None
* The volume of content is the total size of all the content that has been scanned by the application. You can run unlimited scans of the same content source; only files that are new or modified since the last scan are added to the total.
You can verify your license type and usage by selecting File -> License from the top menu bar of the application. To add a production license, select File -> License from the top menu bar of the application and click Upgrade or Register then:
(1) Enter your Serial number
(2) Click Activate by Entering a code
(3) Click copy to clipboard
(4) E-mail the details to [email protected]
(5) You will receive your activation code
The activation code will bind your installation to the machine you installed it on. If you need to move the licence to a different machine e-mail [email protected]
When running a production license, you will be notified when over 80% of the duration of the license or the volume of content is used.
Running a scan of a file share
In this section of the guide we shall detail all aspects of scanning files stored on a file share.
Important! Please don’t skip the following exercises - we know how tempting it is to jump in and scan your own documents. These steps will give you a full understanding of using ScanR.
Download the set of sample documents that we shall use for the rest of this guide. To download the documents:
(1) Download from http://bit.ly/2BNseZB
(2) Unzip the documents to a folder called SampleDocuments.
Create a new configuration
A configuration tells ScanR the location of the files you wish to scan. Here are the steps:
(1) On the application home page click the Scan icon
(2) Click on the + Add Config button.
(3) Define the configuration
• Give your configuration a name
• Leave the platform as Fileshare
• Click Browse and find the SampleDocuments folder where you unzipped the sample files
• Click Save
• Click Cancel
Note: If the files you wish to scan are located on a network drive, you will need to map a local drive before ScanR can browse to the location, or you can enter a UNC path.
Choose the rules for the configuration
Creating the configuration defines where the files to be scanned are located. We now need to add some rules to define what information to look for within the text of the files.
Here are the steps to add a rule:
(1) Click on the edit icon (the pencil icon on the configurations screen) of the configuration you defined
(2) Click on Edit Rules
On the left-hand panel are all the available rules and on the right-hand side are the rules selected for the configuration. For a full description of each rule please see appendix C.
(3) Add the following rules to the configuration by clicking the rule on the left panel and then clicking the >> button:
• BLOODGROUPS
• BANKACCOUNTS
• SORTCODES
• CREDITCARDNUMBERS
• CREDITCARDCCV
• CREDITCARDEXPIRE
• EMAIL
• PERSON
• ORGANIZATION
(4) Click Save
(5) Click Update
Understanding the types of rules and how they work
ScanR has four types of rules to help you identify sensitive information within a document:
• Match a term
• Match a pattern
• Match a term and pattern within a proximity
• Natural language processing (artificial intelligence) rules
Match a term rule
Rules that look for a specific word or phrase are the simplest type of rule. The rule has a list of words or phrases to look for and if they are found within the document then the rule is considered true. When a rule is considered true, each of the matches will be added to the report.
The rule BLOODTYPE is an example of a rule that matches a list of terms. To examine the rule:
(1) Edit the configuration
(2) Click Edit Rules
Figure 3. A term matching rule
We can see the list of terms to be matched and it is straightforward to add or remove terms to the list.
Note that in the Patterns section the value -ANY- is specified. This indicates that the rule only matches against the specified list of terms.
Note: Terms are not case sensitive. You can add and remove terms from the list as required, including multiple language terms and synonyms.
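The behaviour described above can be sketched in a few lines of Python. This is only an illustration of case-insensitive term matching, not ScanR's implementation: the helper name match_terms, the sample text and the term list are all assumptions made for the example.

```python
import re

def match_terms(text, terms):
    """Return every term from `terms` that occurs in `text`, ignoring case."""
    matches = []
    for term in terms:
        # (?<!\w) and (?!\w) stop a term from matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            matches.append(term)
    return matches

# Made-up sample text and terms, for illustration only.
sample = "Patient notes: Blood Group O positive, rhesus factor recorded."
terms = ["blood group", "blood type", "rhesus"]

found = match_terms(sample, terms)
rule_is_true = bool(found)  # the rule is true if at least one term was found
print(found, rule_is_true)
```

A rule of this kind is true as soon as one term is found, and each matched term would then be listed in the report.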
Match a pattern rule
Rules that match a pattern within the text of a document are a highly flexible method for discovering sensitive information that is in a structured format (for example passport or driving license numbers). By describing the format of the matching content, the rule is true for any information that matches the pattern. When a rule is considered true, each of the matches will be added to the report.
Note: ScanR uses regular expressions to define patterns. Although they initially appear quite cryptic, there are lots of useful tools and tutorials to help. We recommend using http://www.ultrapico.com/expresso.htm, which is a free tool to build and test your own regular expressions.
The EMAIL rule is an example of a rule that matches a pattern. To examine the rule:
(1) Edit the configuration
(2) Click Edit Rules
(3) In the right-hand panel click on the EMAIL rule
Figure 4. A pattern matching rule
We can see this time there is a value defined in the Patterns section (this is a regular expression that matches all valid e-mail addresses). Note that in the Terms section the value -ANY- is specified. This indicates that the rule only matches against the specified pattern.
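As a rough illustration of pattern matching, the Python sketch below collects every piece of text that fits a regular expression. The e-mail expression used here is a deliberately simplified stand-in, not the one shipped with ScanR, and the helper name match_pattern and sample text are assumptions.

```python
import re

# Simplified e-mail pattern for illustration only; not ScanR's own expression.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def match_pattern(text, pattern):
    """Return every substring of `text` that matches the compiled `pattern`."""
    return [m.group(0) for m in pattern.finditer(text)]

sample = "Contact jane.doe@example.org or sales@example.co.uk for details."
emails = match_pattern(sample, EMAIL_PATTERN)
print(emails)
```

Because the rule describes a format rather than a fixed list, it finds addresses it has never seen before, which is what makes pattern rules suited to structured data.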
Match a term and pattern within a proximity
You may have guessed already that this type of rule is a combination of the previous two rules, that is, a rule that requires both a term and a pattern to be matched for the rule to be true. We also introduce another concept, proximity, into this type of rule. Proximity defines how near the term and the pattern have to be to each other. We will explore this in more detail a little later.
The rule BANKACCOUNT is an example of a rule that requires both a term and pattern to match for the rule to be true. To examine the rule:
(1) Edit the configuration
(2) Click Edit Rules
(3) In the right-hand panel click on the BANKACCOUNT rule
Figure 5. A term and pattern matching rule with proximity
We can see this time there is a value defined in the pattern section (this is a regular expression that matches all valid bank account numbers) and a list of terms that also need to be within the text of the documents.
In the example of the BANKACCOUNT rule, the pattern will match any nine-digit number (there is no further validation the pattern can do, as bank account numbers are simply nine-digit numbers). To reduce the number of false-positive matches, we add terms to provide further clues that it is a bank account number.
By adding proximity (in this case 200 characters) we further reduce the number of false-positive results by specifying that the term and the pattern need to be within 200 characters of each other.
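The combination of term, pattern and proximity can be sketched as follows. This is a minimal Python illustration under stated assumptions: the guide does not say exactly how ScanR measures the distance, so the sketch assumes it is the number of characters between the end of one match and the start of the other. The function name and sample texts are hypothetical.

```python
import re

def match_with_proximity(text, terms, pattern, proximity):
    """Return pattern matches that have any term within `proximity` characters."""
    term_spans = []
    for term in terms:
        for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            term_spans.append((m.start(), m.end()))
    hits = []
    for m in re.finditer(pattern, text):
        for start, end in term_spans:
            # Assumed distance measure: gap between the two spans (0 if they touch).
            gap = max(m.start() - end, start - m.end(), 0)
            if gap <= proximity:
                hits.append(m.group(0))
                break
    return hits

NINE_DIGITS = r"\b\d{9}\b"
near_text = "Account number: 123456789"
far_text = "Account details follow. " + "x " * 150 + "123456789"

print(match_with_proximity(near_text, ["account"], NINE_DIGITS, 200))
print(match_with_proximity(far_text, ["account"], NINE_DIGITS, 200))
print(match_with_proximity(far_text, ["account"], NINE_DIGITS, 500))
```

In the second text the nine-digit number sits more than 300 characters from the term, so it only counts as a match once the proximity is widened.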
Natural language processing (artificial intelligence) rules
ScanR includes three special rules that do not use patterns or terms to discover information. Instead, these rules use a form of artificial intelligence known as natural language processing.
Rule name | Description | Example matches
PERSON | Matches the names of people | Bill Gates, Dr J. Smith
ORGANIZATION | Matches the names of companies and organizations | Microsoft Limited, World Health Organisation
LOCATION | Matches location type data such as towns, cities and countries | Seattle, United Kingdom
Natural language processing uses a machine learning model that has been trained to recognize entities, in our case people, organizations and locations. Whilst the results will include some false positives, these three rules effectively discover many potentially sensitive types of information that would be very difficult to discover using other methods.
Running a scan
We shall now run the scan of the SampleDocuments folder using the configuration and rules that you selected.
(1) Simply click the run icon on the main configuration screen.
Examining the results
Once the scan has completed click the Exit button and then from the main configurations screen click on Report. You will see the results as follows:
Figure 6. The results grid
(1) Find a document where the score is not zero and double click on the row
(2) You will see all the matching rules for the document
Note: If you are using a trial version, 1 in 3 files are randomly skipped. You can reset the scan if you wish to see a different set of results.
The columns in the results are described below:
Column | Description
Status icon | A green tick indicates the file has been read successfully. Other icons indicate that a file has been skipped as it is unchanged since the last scan, or that the file encountered an error
Folder path | The full path of the folder
File name | The file name
Message | The details of the scanning process for each file
Score | The total score of the file. The score is the sum of the weights of each rule that was true for the document
Duration | The total time it took to process the file
You can access the results at any time by clicking on the Report icon from the main configurations screen.
Clicking on the column headers allows you to order the results by any value you wish.
Clicking on the View Diagnostics link will show details of any files that ScanR was unable to process.
Exporting the results
Clicking the export button from the results grid will export the results to a .CSV file. This can be loaded directly into Excel or a text editor for analysis. You can also export the results from the main configuration screen by clicking the Export icon.
Note: If the scan was against a large volume of data with a large number of rules, then the result sets can be very large and exporting may take a significant amount of time to complete.
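An exported file can also be processed with a short script. The Python sketch below lists the highest scoring files. The column headers "Folder path", "File name" and "Score" are assumptions based on the results grid; check the header row of your own export before relying on them. The sample rows are made up.

```python
import csv
import io

def top_scoring_files(csv_file, minimum_score=1):
    """Return (path, score) pairs scoring at least `minimum_score`, highest first."""
    reader = csv.DictReader(csv_file)
    rows = []
    for row in reader:
        score = int(row["Score"])  # assumed column name
        if score >= minimum_score:
            path = row["Folder path"] + "\\" + row["File name"]  # assumed column names
            rows.append((path, score))
    return sorted(rows, key=lambda item: item[1], reverse=True)

# Made-up sample standing in for an exported results file.
sample_export = io.StringIO(
    "Folder path,File name,Score\n"
    "C:\\SampleDocuments,a.docx,30\n"
    "C:\\SampleDocuments,b.txt,0\n"
    "C:\\SampleDocuments,c.pdf,10\n"
)
results = top_scoring_files(sample_export)
for path, score in results:
    print(score, path)
```

With a real export you would pass an opened file instead of the in-memory sample, for example open(path, newline="") with whichever encoding the export uses.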
Running an incremental scan
When re-running a scan of files, ScanR will only re-scan new files or files that have been modified since the last scan. To test this functionality, follow these steps:
(1) Open one of the documents in the SampleDocuments folder and add some text then save and close the document
(2) Run the scan again
You will notice the scan runs very quickly, as only the modified file is re-read, and when it completes only one file is reported as read.
(3) Exit and click on the Report button
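The idea behind an incremental scan can be sketched in Python as below. The guide does not state how ScanR detects changed files, so this sketch assumes a comparison of each file's last-modified time against the time of the previous scan. The function name files_to_rescan and the temporary sample files are hypothetical.

```python
import os
import tempfile
import time

def files_to_rescan(folder, last_scan_time):
    """Return files under `folder` modified after `last_scan_time` (epoch seconds)."""
    changed = []
    for root, _dirs, names in os.walk(folder):
        for name in names:
            path = os.path.join(root, name)
            if os.path.getmtime(path) > last_scan_time:
                changed.append(path)
    return sorted(changed)

with tempfile.TemporaryDirectory() as folder:
    old_path = os.path.join(folder, "unchanged.txt")
    new_path = os.path.join(folder, "edited.txt")
    for path in (old_path, new_path):
        with open(path, "w") as handle:
            handle.write("sample")
    now = time.time()
    os.utime(old_path, (now - 1000, now - 1000))  # last touched before the previous scan
    os.utime(new_path, (now, now))                # edited after the previous scan
    changed = [os.path.basename(p) for p in files_to_rescan(folder, now - 500)]

print(changed)
```

Only the edited file is returned, which mirrors the single file reported as read in the exercise above.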
Resetting scan
Resetting scan means that the previous scan is removed, and all files are read.
• Click Reset on the main configurations screen
• Once completed, click the Report icon to view the results
You will notice that all the files were re-read.
Important: Any time you update the rules in a configuration you will need to run a full scan for the files to be re-read taking the new changes into account. Because the previous scan is deleted, the reset scan will count against your licence usage.
Adding a new rule
ScanR allows you to add your own rules to your configurations. In this exercise we will define a rule to identify an employee number. In our example an employee number starts with either “A” or “B” followed by six digits. An example employee number would be A123456.
(1) Create a new document called EmployeeNumber in the SampleDocuments folder
(2) Add the following text to the document (you can paste from the below)
This is a sample document for our EMPLOYEENUMBER rule.
Employee number: A123456 should match
Employee number: C123456 should not match
(3) Save and close the document
(4) Edit the configuration
(5) Edit the rules
(6) Click New
(7) Set the rule name EMPLOYEENUMBER
(8) Set weight to 10
(9) For the pattern enter [AB]\d{6}
(10) Click the Add Term button
(11) Click Add Term and add the term Employee
(12) Leave all other settings and click Save
(13) Add the rule by highlighting EMPLOYEENUMBER and clicking the >> button
(14) Click Save
Note: Although this guide does not cover creating your own regular expressions, let’s explain the pattern you entered in step 9. Putting characters between two square brackets means "match any of these characters", so [AB] means either a capital A or B will match. \d means match a digit, and the {6} means match it six times. So, the whole pattern will match any text that starts with an A or B followed by six digits.
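You can try the pattern outside ScanR before adding it to a rule. The Python sketch below runs [AB]\d{6} against the sample text from step 2. Python's re module is used purely for illustration; ScanR's own regular expression engine may differ in small details.

```python
import re

EMPLOYEE_PATTERN = re.compile(r"[AB]\d{6}")

sample = (
    "This is a sample document for our EMPLOYEENUMBER rule.\n"
    "Employee number: A123456 should match\n"
    "Employee number: C123456 should not match\n"
)

matches = EMPLOYEE_PATTERN.findall(sample)
print(matches)
```

Only A123456 is returned: C123456 fails because C is not in the [AB] character class.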
To check your new rule has worked:
(1) Run the scan
(2) Examine the results (notice only the new document was scanned)
(3) Click on the EMPLOYEENUMBER row and you should see your matched value
Figure 7. Matching employee number from our custom rule
Understanding rule weights
Each rule has an associated weight which is an integer that allows you to create an overall score of the sensitive data contained within a document. When a rule is true against a document the rule weight is added to the overall score of the document. The weight is only added once regardless of how many times the rule matches within a document.
Tip: Although it is optional, if the total of the weights of all rules within a configuration equals 100 then it will be easy to report on the overall scores, as they will range from zero to one hundred.
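The scoring described above amounts to a simple sum, sketched here in Python. The rule weights and match counts are illustrative values chosen to total 100, not ScanR defaults, and the function name document_score is hypothetical.

```python
def document_score(rule_weights, match_counts):
    """Sum the weight of each rule that matched at least once in the document."""
    return sum(
        weight
        for rule, weight in rule_weights.items()
        if match_counts.get(rule, 0) > 0
    )

# Illustrative weights totalling 100; not ScanR defaults.
weights = {"EMAIL": 20, "BANKACCOUNT": 50, "PERSON": 30}
# EMAIL matched seven times, PERSON once, BANKACCOUNT not at all.
counts = {"EMAIL": 7, "PERSON": 1}

score = document_score(weights, counts)
print(score)
```

The seven EMAIL matches contribute the EMAIL weight only once, so the document scores 20 + 30 = 50.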
Adding rule exclusions
An exclusion to a rule is where the conditions of a rule match but you wish to exclude a value from the results. In other words, they are the exceptions to a rule. There are two common uses for exclusions:
• You wish to exclude data that relates to your own organization (for example your postcode)
• The artificial intelligence has incorrectly identified a PERSON, LOCATION or ORGANIZATION and you wish to exclude it from the results
To test using a rule exclusion, do the following:
(1) Edit the configuration
(2) Edit the rules
(3) Edit the EMAIL rule
(4) Click Add Exclusion
(5) Enter [email protected]
(6) Click OK
(7) Click Update
(8) Reset the scan
If you now open the results and click on the results row for the EMAIL document, you will see that [email protected] is no longer returned as a match despite meeting the criteria of the rule.
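Conceptually, an exclusion is a filter applied after a rule has matched. The Python sketch below illustrates the idea. It assumes exclusions are compared case-insensitively, which the guide does not confirm, and the addresses and function name are made up for the example.

```python
def apply_exclusions(matches, exclusions):
    """Drop matched values that appear in the exclusion list (assumed case-insensitive)."""
    excluded = {value.lower() for value in exclusions}
    return [m for m in matches if m.lower() not in excluded]

# Made-up values: two addresses matched the rule, one is excluded.
email_matches = ["info@example.org", "jane@example.org"]
remaining = apply_exclusions(email_matches, ["Info@Example.org"])
print(remaining)
```

The excluded address still satisfies the rule's pattern; it is simply removed from the reported matches.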
Excluding a domain from the EMAIL rule
A common scenario is that an organisation would like to match all e-mail addresses in a document but exclude its own domain. You can achieve this by editing the EMAIL rule.
(1) Edit the configuration
(2) Edit the rules
(3) Edit the EMAIL rule
The pattern below is the regular expression that captures all email addresses:
\w+@(?!excluded\.com)(\w+\.\w+(\.\w+)?)
(4) Replace excluded\.com with the domain that you wish to exclude from the scanning results, for example contoso\.com or yourdomain\.co\.uk
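The (?!...) part of the pattern is a negative lookahead: it rejects a match if the text immediately after the @ begins with the excluded domain. The Python sketch below builds the guide's pattern for a given domain and tests it. The helper name email_pattern_excluding and the sample addresses are assumptions, and Python's re module stands in for ScanR's regular expression engine.

```python
import re

def email_pattern_excluding(domain):
    """Build the guide's e-mail pattern with `domain` excluded by a negative lookahead."""
    # re.escape turns "contoso.com" into "contoso\.com" so the dot is literal.
    return re.compile(r"\w+@(?!" + re.escape(domain) + r")(\w+\.\w+(\.\w+)?)")

pattern = email_pattern_excluding("contoso.com")
sample = "Mail alice@contoso.com or bob@example.org"
kept = [m.group(0) for m in pattern.finditer(sample)]
print(kept)
```

One thing to be aware of: the lookahead only checks how the domain starts, so an unrelated domain beginning with the same text (for example contoso.community.org) would also be excluded.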
Changing the proximity of a rule
In the sample document set there are two documents for testing proximity using the CREDITCARDCVV rule. If we edit the configuration rules and select the CREDITCARDCVV rule, we can see the following settings:
Figure 9. CREDITCARDCVV rule has a default proximity of 50
Notice that currently, the proximity value is set to 50. This means that for the rule to be true, the pattern and term must be within 50 characters of each other. In this case it means that one of the terms “CCV”, “CVC” or “CVV” needs to be found within 50 characters of a three-digit number.
In our SampleDocuments folder we have two files that test the CREDITCARDCVV rule. The first file, _CCVPROXTESTPASS, contains the following text.
Figure 10. The contents of the _CCVPROXTESTPASS file
In this case, as the terms CCV, CVC and CVV are all within 50 characters of the pattern matches 123, 321 and 987, they should all match.
The second file, _CCVPROXTESTFAIL, contains the following text.
Figure 11. The contents of the _CCVPROXTESTFAIL file
With the default proximity setting of 50 characters, the rule will be false as the term CVC is 400 characters away from the matching pattern 123.
Testing the default proximity rules
Step 1. Reset the scan and re-run the configuration
Step 2. Click the Report icon and find the results for the two _CCVPROXTEST files
You will notice the first document, _CCVPROXTESTFAIL, has a score of 0, meaning that no matches were found, whereas _CCVPROXTESTPASS has a score of 10, meaning the rule was true.
Step 3. Click on the second result to see the matching entries for the rule (note that each valid CVV number is separated by a # character).
Figure 12. The matching CVV values
Changing the proximity of a rule
In this exercise we will change the proximity of the CREDITCARDCVV rule so that the _CCVPROXTESTFAIL file will now score the rule as true.
Step 1. Edit the configuration rule for CREDITCARDCVV
Step 2. Change the proximity value to 500 using the slider
Step 3. Reset the scan and re-run the configuration
Step 4. Click the Report icon and find the results for the two _CCVPROXTEST files
Step 5. Click on each filename and notice that both results now match the rule.
ScanR options
You can open the ScanR options screen from File -> Options on the top menu.
Figure 14. The ScanR options screen
File Size Threshold
Specifies the maximum size of files to be scanned. Any files over the specified limit will not be read and will be logged as “Skipped due to exceeding maximum file size”. The default maximum file size is 2 MB. Please refer to Performance optimization considerations for more details on this option.
File Characters Threshold
Specifies the maximum number of characters to extract from the file. The default value is 500,000. Please refer to Performance optimization considerations for more details on this option.
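The effect of the two thresholds can be sketched in Python as follows. This is an illustration under assumptions rather than ScanR's actual logic: it assumes 1 MB means 1024 * 1024 bytes, that oversized files are skipped before any text is extracted, and that text beyond the character limit is simply cut off. The function name prepare_text and the "Read" message are hypothetical.

```python
MAX_FILE_SIZE_BYTES = 2 * 1024 * 1024  # 2 MB default (assuming 1 MB = 1024 * 1024 bytes)
MAX_CHARACTERS = 500_000               # default File Characters Threshold

def prepare_text(file_size_bytes, extracted_text):
    """Return (text_to_scan, message) after applying both thresholds."""
    if file_size_bytes > MAX_FILE_SIZE_BYTES:
        # Oversized files are not read at all.
        return None, "Skipped due to exceeding maximum file size"
    # Assumed behaviour: keep only the first MAX_CHARACTERS characters.
    return extracted_text[:MAX_CHARACTERS], "Read"

skipped_text, skipped_message = prepare_text(3 * 1024 * 1024, "some text")
truncated_text, read_message = prepare_text(1000, "x" * 600_000)
print(skipped_message, len(truncated_text))
```

Under these assumptions, sensitive data that appears after the first 500,000 characters of a very long document would not be seen by the rules, which is the trade-off between speed and accuracy that the thresholds control.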
OCR Language
OCR Image files
Setting this switch to ON will result in image files (BMP, TIF, TIFF, JPG) being passed through optical character recognition to attempt to extract text from the image. This feature is useful where scanned documents are stored in image formats. By default, this feature is not selected as there is a performance overhead in processing image files.
Word breakers
Word breakers are characters that define the start or end of words in a block of text. Although most words are separated by spaces, there are occasions where other characters can denote the start or end of a word. Both terms and patterns use the word breakers to maximize the accuracy of matches.
For example, consider the following text:
Credit card number:1234-1234-1234-1234.
Both the colon and full stop are defined as valid word breaking characters and so we would match the credit card pattern in the above example.
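The role of word breakers can be illustrated with a short Python sketch. The breaker set below is hypothetical (the real set is defined on the ScanR options screen), and the credit card expression is a simplified stand-in for the shipped rule.

```python
import re

# Hypothetical breaker set; the real one is configured on the options screen.
WORD_BREAKERS = " \t\r\n:.,;"

def find_with_word_breakers(text, pattern, breakers=WORD_BREAKERS):
    """Find `pattern` only where it is bounded by a breaker or the start/end of the text."""
    not_breaker = "[^" + re.escape(breakers) + "]"
    # The match must not be preceded or followed by a non-breaker character.
    bounded = "(?<!" + not_breaker + ")(?:" + pattern + ")(?!" + not_breaker + ")"
    return re.findall(bounded, text)

CARD = r"\d{4}(?:-\d{4}){3}"  # simplified pattern for illustration
print(find_with_word_breakers("Credit card number:1234-1234-1234-1234.", CARD))
print(find_with_word_breakers("ref1234-1234-1234-1234x", CARD))
```

The first call finds the number because it sits between a colon and a full stop. The second finds nothing because the digits are embedded in a longer run of characters with no breaker on either side.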
Scanning SharePoint content
Scanning documents stored in SharePoint Online or a SharePoint server is very similar to file share scanning. The only differences are in the definition of the configuration.
(1) Click on New Configuration from the main screen
(2) In the platform select 365 or 2010, 2013 or 2016.
The following fields are available:
Column | Description
Config Name | The name for the configuration
URL | Any valid URL for a site collection, site or subsite
Include subsites (checkbox) | The scan will iterate through all subsites and libraries starting from the top URL supplied. When this option is selected you cannot specify a library name.
Username | For SharePoint Online, a username in the format … For SharePoint servers the username can be in the format [email protected] or domain\username
Password | A valid user password
Performance optimization considerations
When scanning hundreds of thousands or several million files there are several considerations to improve
performance and balance the speed and accuracy of the results. The overall speed of the scanning will depend on many elements including:
• The hardware and network
• The type and size of the documents
• The type and number of rules you run
Environment
• More memory and processors will speed up the overall scanning time.
• Faster disk I/O and LAN or Internet speeds will speed up the overall scanning time
Configurations
Breaking large document repositories into many configurations is best practice. For example, if you have a file share with one million documents, consider creating several configurations, perhaps by department or folder structure. This affords several significant advantages:
• Each configuration will complete quicker and is less likely to fail due to external influences (network outages, power outages, maintenance etc.)
• You can run incremental scans quicker and at different times and frequencies
• Each configuration can have different rules
• The results are faster and easier to work with
• The results can be securely shared with the correct teams
• BI \ Analysis tools will be more effective with subsets of data that can be aggregated
Rules
The more rules added to a configuration, the longer it will take to process each document. Some guidelines are:
• Only add rules for information you need to discover (adding all of the rules to see what we can find will result in slow processing times and a lot of noise in the results)
• Test the rules on a small set of documents until you are confident they are working as desired
• When creating your own rules make them as specific as possible and wherever possible include both terms and patterns
Options
Using the File Size Threshold option to exclude very large files will reduce the chance of 1% of your files taking up 90% of the time it takes to complete the scan.
Reporting
The reporting section of the application allows you to produce reports with a full audit trail. To access the reporting from the scanning screens:
(1) Click the HOME PAGE button (on the bottom of the screen)
(2) Click the REPORTS icon
Subject Access request (SAR) reporting
This section of the guide describes how ScanR can assist in responding to Subject Access requests (SARs). The ICO define a subject access request as follows:
This right, commonly referred to as subject access, is created by section 7 of the Data Protection Act. It is most often used by individuals who want to see a copy of the information an organisation holds about them. However, the right of access goes further than this, and an individual who makes a written request and pays a fee is entitled to be:
• told whether any personal data is being processed;
• given a description of the personal data, the reasons it is being processed, and whether it will be given to any other organisations or people;
• given a copy of the information comprising the data; and
• given details of the source of the data (where this is available).
Source: https://ico.org.uk/for-organisations/guide-to-data-protection/principle-6-rights/subject-access-request/
To log an incoming SAR
(1) Click on the Add SAR button
Figure 16. Add a new Subject Access request
(2) Give the SAR a unique name
(3) Record the date the SAR was received
(4) If you know the data usage and third party sharing details you can enter them now, or alternatively you can run the SAR search and review the data before completing these details
(5) Click Update
To create a SAR search
From the main SAR Dashboard:
(1) Click Edit
(2) Select the configurations (scanning sources) that you would like to search. Clicking the very top selection box will select all configurations
(3) Click Add Search Criteria
Figure 17. Create search criteria for a SAR
The Add Search Criteria screen allows you to define the parameters of the SAR. Typically, it would begin with the name (and perhaps alternate ways the name may be stored) being used for the search. Although the default is set to PERSON, any rule could be chosen for searching.
The search criteria can be combined with AND / OR statements to build more complex queries. In the example shown in figure 17, we are looking for documents that contain a PERSON named ‘GARETH MOON’ or a PERSON named ‘G MOON’ that also has a POSTCODE of ‘WR14 1NA’
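The way AND / OR criteria combine can be sketched in Python. The example assumes the query in figure 17 is grouped as (PERSON 'GARETH MOON' OR PERSON 'G MOON') AND POSTCODE 'WR14 1NA', and that values are compared case-insensitively; neither detail is confirmed by the guide. The document names, the stored matches and the helper names are made up for illustration.

```python
def has_value(doc, rule, value):
    """True if `rule` matched `value` in the document (assumed case-insensitive)."""
    return value.lower() in {v.lower() for v in doc.get(rule, [])}

def sar_query(doc):
    # Assumed grouping: (PERSON = GARETH MOON OR PERSON = G MOON) AND POSTCODE = WR14 1NA
    person = has_value(doc, "PERSON", "GARETH MOON") or has_value(doc, "PERSON", "G MOON")
    return person and has_value(doc, "POSTCODE", "WR14 1NA")

# Made-up scan results: rule name -> values discovered in each document.
documents = {
    "letter.docx": {"PERSON": ["Gareth Moon"], "POSTCODE": ["WR14 1NA"]},
    "invoice.pdf": {"PERSON": ["G Moon"], "POSTCODE": ["SW1A 1AA"]},
    "memo.txt": {"PERSON": ["Jane Smith"], "POSTCODE": ["WR14 1NA"]},
}

hits = [name for name, doc in documents.items() if sar_query(doc)]
print(hits)
```

Only the letter satisfies both halves of the query: the invoice has the right person but a different postcode, and the memo has the postcode but a different person.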
Running a SAR report
From the main SAR dashboard:
(1) Click the Run icon
Figure 18. Create search criteria for a SAR
The SAR report will load a list of all documents that match the search criteria. Clicking on the file name allows you to view the source file. Clicking on the details shows you the data discovered during the scan.
Exporting the SAR report
Once you are ready to export the SAR report, you have two options:
• Export to Excel – this report is designed for internal use (for example to review the documents identified) and contains a list of the files and links to each
• Export to Word – this report is designed to be sent to the person who requested the information
(1) From the Run SAR screen, select the export format and click EXPORT.
If you wish to customise the report template you can edit the following document ...\Program files (x86)\TermSet\ScanR\SAR Template.docx
Appendices
Appendix A – supported file types
Extension Description
BMP Image file (OCR)
CSV Comma separated format
DOC \ DOCX Word Files
HTML \ HTM Web pages
JPG Image file (OCR)
MSG Outlook message file (including embedded attachments)
PDF PDF files (all versions)
PPT \ PPTX PowerPoint files
TXT Text files
TIFF \ TIF Image file (OCR)
XLS \ XLSX Excel files
XML XML files
Appendix B – supported document systems
System
File Shares
SharePoint server 2010
SharePoint server 2013
SharePoint server 2016
SharePoint Online (Office 365)
Appendix C – Rules
Rule Description
AUSTRAILIANMEDICAL Australia Medical Account Number
AUSTRAILIAPASSPORT Australia Passport Number
AUSTRIALTAX Australia Tax File Number
AUSTRIAVAT Austria VAT number
BANKACCOUNT Banking account numbers nine digits
BELGIUMVAT Belgium VAT number
BELGUIMNI Belgium National Identity Number
BLOOD GROUPS Blood groups
BRAZILIDCARDNEW Brazil National ID Card (RG) New Format
BRAZILIDCARDOLD Brazil National ID Card (RG) Old Format
BRAZILLEGALENTITY Brazil Legal Entity Number (CNPJ)
BUGARIAVAT Bulgaria VAT Number
CANADAPASSPORT Canada Passport Number
CANADASOCIALINS Canada Social Insurance Number
CORATIAPERSONALID Croatia Personal Identification (OIB) Number
CREDITCARD Credit Card numbers
CREDITCARDCVV CVV number for credit cards
CREDITEXPIREDATE Expire date for credit cards
CROATIAID Croatia Identity Card Number
CROATIOVAT Croatian VAT Number
CYPRUSVAT Cyprus VAT Number
CZECHID Czech National Identity Card Number
DATEOFBIRTH Date of Birth
DEANUMBER Drug Enforcement Agency (DEA) Number
DENMARKVAT Denmark VAT Number
EMAIL Email Addresses
ETHNICITY Ethnic groups
FINLANDNATID Finland National ID
FINLANDPASSPORT Finland Passport Number
FINLANDVAT Finland VAT Number
FRANCEINSEE France Social Security Number (INSEE)
FRANCEVAT France VAT Number
FRENCHPASSPORT French Passport
GERMANDRIVINGLICIENCE German Driving licence number
GERMANID German Identity Card Number since November 2010
GERMANPASSPORT German Passport number
GERMANVAT German VAT Number
GREECENATIONALID Greece National ID Card (Old)
GREECENATIONALIDNEW Greece National ID Card (New)
GREECEVAT Greece VAT Number
HONGKONGID Hong Kong Identity Card (HKID) Number
IBAN International Banking Account Number (IBAN)
INDIAUNIQUE India Unique Identification (Aadhaar) Number
INDOID Indonesia Identity Card (KTP) Number
IPADDRESS IP Addresses
IRELANDPNEW Ireland Personal Public Service (PPS) Number (New)
IRELANDPSOLD Ireland Personal Public Service (PPS) Number (Old)
IRELANDVAT Ireland VAT number
ISRAELNATID Israel National ID
ISREALBANKACCOUNT Israel Bank Account Number
ITALYDRIVINGLICENCE Italy Driver's License Number
ITALYVAT Italy VAT number
JANPANDRIVING Japan Driving Licence Number
JAPANSIN Japan Social Insurance Number (SIN)
LOCATION Address and location information
LUXVAT Luxembourg VAT number
NETHBSN Netherlands Citizen's Service (BSN) Number
NETHVAT Netherlands VAT Number
NORWAYIDNUMBER Norwegian citizen ID number
ORGANIZATION Company and organisation names
PERSON Names of people
PHILIPID Philippines Unified Multi-Purpose ID Number
POLANDID Poland Identity Card
POLANDPASSPORT Poland Passport
PORTUGALCITZ Portugal Citizen Card Number
PORTUGALVAT Portugal VAT Number
SAFRICAID South Africa Identification Number
SAUDIID Saudi Arabia National ID
SEXUALORIENTATION Sexual Orientation descriptions
SINGANRIC Singapore National Registration Identity Card (NRIC) Number
SKORREARN South Korea Resident Registration Number
SORTCODE Banking sort codes in the format NN-NN-NN
SPAINSSN Spain Social Security Number (SSN)
SPAINVAT Spanish VAT Number
SWEDENID Sweden National ID
SWEDENPASS Sweden Passport Number
SWEDENVAT Sweden VAT number
TAWAINARC Taiwan Resident Certificate (ARC/TARC) Number
TAWAINNATID Taiwan National ID
UKCARREGISTRAIONPOST2001 UK Vehicle registrations from 2001 onwards
UKCELLPHONE UK Cell phone number
UKCHILDBENEFITREFERENCE UK Child Benefit number
UKDRIVINGLICENCE UK Driving licence number
UKELECROLL UK Electoral Roll Number
UKGOVSEC UK government security classifications
UKNATIONALINUSRANCENUMBER UK National Insurance number
UKNHS UK National Health Service Number
UKVAT UK VAT Number
UKVIN UK Vehicle Identification Number
USSSN US Social Security Number (SSN)
USTIN US Individual Taxpayer Identification Number (ITIN)
USVIN US Vehicle identification
UKSTREETADDRESS UK street addresses
USTELEPHONE US Telephone numbers
USSTATEANDZIP US Abbreviated State name and ZIP code
Appendix D – Regular expression definitions
Symbol Meaning
c Match the literal character c once, unless it is one of the special characters.
^ Match the beginning of a line.
. Match any character that isn't a new line.
$ Match the end of a line.
| Logical OR between expressions.
() Group sub-expressions.
[] Define a character class.
* Match the preceding expression zero or more times.
+ Match the preceding expression one or more times.
? Match the preceding expression zero or one time.
{n} Match the preceding expression n times.
{n,} Match the preceding expression at least n times.
{n,m} Match the preceding expression at least n times and at most m times.
\d Match a digit.
\D Match a character that is not a digit.
\w Match an alphanumeric character, including the underscore.
\W Match a character that is not an alphanumeric character or underscore.
\s Match a whitespace character (any of \t, \n, \r, or \f).
\S Match a non-whitespace character.
\n New line.
\r Carriage return.
\f Form feed.