sh config param /module/config/sourceone/removeduplicates
CLI Command Line
Interface
The CLI is a traditional command line interface that allows direct communication with the IS1200 “backend” using the set of commands defined in the IS1200 Command Line Interface Reference Guide.
Concepts Search The standard IS1200 software supports keyword exploration.
However, in the initial stages of the legal discovery process (often called eDiscovery), keyword search alone may not be as concise or as time-efficient as required by standard legal timetables.
Concepts augments standard keyword searching by automatically suggesting filters based on the results of a current search. By default it looks for concepts based on persons, countries, noun groups,
organizations, company names, and products.
Concepts Search is an optional module that requires an additional license key for each IS1200 cluster node. See the IS1200 Concepts Search User and Configuration Guide for complete details.
conceptfinder Ruleset The conceptfinder ruleset is an assignment ruleset that extracts the concepts listed in the Review/Analysis Results Grouping Concepts pane, which is only available when a valid Concepts license is installed on the IS1200. The conceptfinder ruleset must be used in deep classifications to get the best results in Review/Analysis from the Concepts heading of the Results Grouping pane.
The ConceptFinder_DWF assignment ruleset combines the conceptfinder ruleset and the DocsWithoutFullText ruleset. See “DocsWithoutFullText Assignment Ruleset” on page 64 for more details.
connectors Connectors are IS1200 optional modules that allow an IS1200 to work with repository types beyond the standard CIFS and NFS
repositories. See “optional modules” on page 70 for more details.
Optional module connectors require separate licenses to be purchased and installed on all nodes of an IS1200 cluster. For a complete list of optional modules available, see the Introduction chapter of any IS1200 User Guide.
Some connectors, such as the Microsoft Exchange Server Connector, require agents. Agents are additional server platforms, usually Windows servers, that provide the additional CPU cycles and network staging the IS1200 needs to work with the repository types they connect to.
All connectors have their own user guides which can be accessed from the Kazeon Documentation link on the IS1200 Manager page
(https://<yourIS1200Name>/manager).
Container file/object A file (object) that contains other files (sub-objects), such as a ZIP, TAR, JAR, PST, or NSF file. The container file is often called the “parent” and the contained objects are called “children”. Container objects should not be confused with files that have embedded objects, such as Microsoft Word files that have embedded charts or graphics (OLE).
Custodian A legal term used by Legal Service Providers (LSP) and other legal personnel to describe the owners or responsible parties for electronic documents pertinent (responsive) to a legal matter.
D
Data A file of any type and size, such as a short email, a word processor document, or a large spreadsheet.
Datamap A report that lists the electronic storage locations of all possible sources of relevant ESI. This can include standard file servers, groupware servers, and email servers (and their backup and archive systems), as well as custodians’ desktop and laptop computers.
Data-Mount The NFS file system that is accessed by the IS1200 to parse data and extract metadata.
Data Server The file server that exports an NFS or CIFS file system so that the
Data-Share The CIFS file system to be accessed by the IS1200 to extract metadata.
Data Repository A networked file system registered with the IS1200 so it can be classified, searched, and reported on. Data repositories created on the IS1200 itself (sometimes called localdatafs) are strongly
discouraged!
Data Verification Builds on Auditing and is only available when system auditing is enabled. For job services like Actionable Services Copy or Move, Legal Hold Copy, and Single Step Collections, Data Verification generates an audit trail proving that files were not altered during these actions. This is especially valuable in eDiscovery situations.
Complete details are available in the Auditing and Data Verification chapter of any IS1200 User Guide.
Deduplication A process that identifies file or email object and sub-object duplicates based on their digest values (See “Digest Values” on page 63 for details).
In the 4.7.0 and prior versions of the IS1200 software, deduplication was only available for export actions (Actionable Services such as Download, Legal Export, and Copy). This allowed exporting only the unique files and email objects from a set of search results. With IS1200 version 4.8.0, deduplication is expanded and is automatically applied during case collections and processing, so that deduplicated search results can be displayed. Note that when deduplication is applied to the display of search results, duplicates are only suppressed from display; in exported file sets, however, duplicates are physically removed.
Deduplication is available only in the ECS version of IS1200 and is applicable only in case context.
The search results view is configurable as a deduplication or non-deduplication view. This makes it possible to see whether an object in the search results has duplicates, and which objects in the search results are duplicates of the original.
Besides the automatic deduplication of collections and processing, deduplication may also be started manually from the IS1200's case dashboard.
Deduplication reports describing how a particular job or service applied deduplication are available. The reports can be accessed from the IS1200 case dashboard as well as from web search. Reports can list all results, only unique (deduplicated) results, or percentages of unique and duplicates.
Reduplication is a process that allows the duplicates of unique files to be identified so tagging processes can apply metadata tags to the unique files as well as all of their copies. Legal Tags reduplication can be done after documents are added to the case.
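The relationship between deduplication and reduplication described above can be sketched in a few lines of Python. This is only an illustration, not the IS1200 implementation: the function names, the (path, digest) input format, and the rule that the first object seen with a given digest counts as the original are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def deduplicate(objects: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (path, digest) pairs by digest.

    Returns a mapping from each unique (first-seen) path to the list of
    its duplicate paths. "First seen wins" is an assumption of this sketch.
    """
    original_for_digest: Dict[str, str] = {}
    copies: Dict[str, List[str]] = {}
    for path, digest in objects:
        if digest not in original_for_digest:
            original_for_digest[digest] = path
            copies[path] = []
        else:
            copies[original_for_digest[digest]].append(path)
    return copies


def reduplicate_tag(unique_path: str, tag: str,
                    copies: Dict[str, List[str]]) -> Dict[str, str]:
    """Apply a tag to a unique file and to every one of its copies."""
    return {path: tag for path in [unique_path, *copies[unique_path]]}


results = [("/share/a.doc", "d1"), ("/share/b.doc", "d2"), ("/backup/a.doc", "d1")]
copies = deduplicate(results)
print(list(copies))                                   # unique results only
print(reduplicate_tag("/share/a.doc", "Responsive", copies))
```

Listing only the keys of the mapping corresponds to a deduplicated view, while the full mapping keeps the duplicates available for tagging.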
Differential Classifications
Differential classifications do not re-classify all file objects in the selected repositories. Instead, they examine the metadata from previous crawls, and if there is no previous metadata (indicating the object is new since the last classification) or the metadata has changed (based on atime or mtime changes), then the object is parsed and its metadata re-populated in the database.
Note: System classification configuration settings default to using mtime to determine if files have changed for differential classifications. If atime is desired instead, see the Using atimes for Differential Crawls section of the Configuration Files and Utilities appendix of any IS1200 User Guide for details on resetting the default to atime.
Additionally, atime may be applied only to selected classifications by initiating them from the Command Line Interface. See the add service deep-classification command and the crawl-atime-check-enabled option in the IS1200 Command Line Interface Reference Guide for details.
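The decision a differential classification makes for each object can be sketched as follows. This is a minimal Python illustration of the rule described above, not IS1200 code: the FileTimes record, the needs_reclassification function, and the use_atime parameter are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileTimes:
    """Timestamps recorded for one file object, in seconds since the epoch."""
    mtime: float
    atime: float


def needs_reclassification(previous: Optional[FileTimes],
                           current: FileTimes,
                           use_atime: bool = False) -> bool:
    """Return True if the object should be parsed again."""
    if previous is None:
        # No metadata from an earlier crawl: the object is new.
        return True
    if use_atime:
        # Optional behaviour: treat a changed access time as a change.
        return current.atime != previous.atime
    # Default behaviour: compare modification times only.
    return current.mtime != previous.mtime


stored = FileTimes(mtime=100.0, atime=150.0)
read_since = FileTimes(mtime=100.0, atime=200.0)   # read, but not modified
print(needs_reclassification(None, stored))
print(needs_reclassification(stored, read_since))
print(needs_reclassification(stored, read_since, use_atime=True))
```

A file that has only been read since the last crawl is skipped under the mtime default but re-parsed when the atime check is enabled, which is why the choice between the two affects crawl time.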
Digest Values Digests are numerical values calculated based on file and email content and are unique for all unique objects. Digest values allow file objects to be compared very quickly. Digests are calculated during basic and deep classifications or during collections or processing when indexing is enabled.
Digests are calculated differently for standard files, emails, and container objects. For standard files, a physical digest is computed for the entire file much like a hash value.
For email objects, just the subject, the message content (including attachments), and certain specific addresses are combined and an email digest value is calculated from the combination. Container objects, like ZIP or PST files, and their sub-objects have digests calculated both as complete objects and as individual sub-objects.
Note: Calculating email digests requires access to the email object's fullText and only classifications that include the fullText rule can produce email digests. Emails classified without the fullText rule receive the same physical digest that other files do. Consequently, identical emails on different repositories, one classified with and one without the fullText rule, will not be identified as duplicates.
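The difference between a physical digest and an email digest can be illustrated with a short Python sketch. It is not the IS1200 algorithm: the use of SHA-256, the exact fields combined, the address normalization, and the function names are assumptions chosen only to show why the two digest types are not comparable.

```python
import hashlib
from typing import Iterable


def physical_digest(raw: bytes) -> str:
    """Digest of the entire stored object, as for a standard file."""
    return hashlib.sha256(raw).hexdigest()


def email_digest(subject: str, body: str,
                 attachments: Iterable[bytes],
                 addresses: Iterable[str]) -> str:
    """Digest built only from selected email fields (hypothetical choice)."""
    parts = [subject.encode("utf-8"), body.encode("utf-8")]
    parts.extend(attachments)
    # Normalize and sort addresses so their order does not affect the digest.
    parts.extend(a.encode("utf-8")
                 for a in sorted(addr.strip().lower() for addr in addresses))
    h = hashlib.sha256()
    for part in parts:
        # Length-prefix each part so field boundaries stay unambiguous.
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()


# Two stored copies of the same message that differ only in a routing header.
copy_a = b"Received: server-a\r\nSubject: Q3\r\n\r\nSee attached."
copy_b = b"Received: server-b\r\nSubject: Q3\r\n\r\nSee attached."
print(physical_digest(copy_a) == physical_digest(copy_b))

digest_a = email_digest("Q3", "See attached.", [b"%PDF"], ["amy@example.com"])
digest_b = email_digest("Q3", "See attached.", [b"%PDF"], ["Amy@Example.com"])
print(digest_a == digest_b)
```

Because the two functions hash different inputs, a copy that received a physical digest and a copy that received an email digest never match, which mirrors the situation described in the note above.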
Domino Server (Lotus) A Lotus server providing groupware solutions and storage.
Domino XML Language (DXL)
A Lotus version of eXtensible Markup Language (XML) used to import and export Lotus email files.
DocsWithoutFullText Assignment Ruleset
Some file objects, such as graphics files (for example, .jpeg, .gif, or .bmp files), contain no text and hence have no fullText extracted by the FullTextRuleset. See “fullText” on page 66 for more details. In legal cases, these files may still contain responsive information, but not textual information that can be located by text searches. The DocsWithoutFulltext assignment ruleset identifies these files and adds the metadata tag and value “DocWithoutFulltext=true” to all files that contain no searchable text. This allows these files to be easily searched for later and inspected for legal responsiveness by non-search methods.
The ConceptFinder_DWF assignment ruleset combines the DocsWithoutFullText ruleset with the conceptfinder ruleset. See “conceptfinder Ruleset” on page 60 for more details.
Note: Parent file objects that don’t contain text (such as .zip, .tar, and .pst files) are not tagged with the DocWithoutFulltext tag.
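The tagging behaviour of this ruleset can be sketched in Python. This is only an illustration of the logic described in this entry: on the IS1200 the rule is expressed as an assignment ruleset, and the function name, the container extension list, and the treatment of whitespace-only text as "no text" are assumptions of the example.

```python
import os
from typing import Dict, Optional

# Parent (container) types named in the note above; the real list may be longer.
CONTAINER_EXTENSIONS = {".zip", ".tar", ".pst"}


def assignment_tags(filename: str, full_text: Optional[str]) -> Dict[str, str]:
    """Return the tags this sketch would assign to one classified object."""
    extension = os.path.splitext(filename)[1].lower()
    if extension in CONTAINER_EXTENSIONS:
        # Parent container objects are never tagged.
        return {}
    if full_text is None or not full_text.strip():
        # No searchable text was extracted (assumption: blank counts as none).
        return {"DocWithoutFulltext": "true"}
    return {}


print(assignment_tags("logo.bmp", None))
print(assignment_tags("memo.doc", "Quarterly results attached."))
print(assignment_tags("archive.zip", None))
```

A later search on the DocWithoutFulltext tag then returns exactly the objects that text searches cannot reach.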
Documentum Server (EMC)
The EMC Documentum server manages business content including documents, photos, video, medical images, e-mail, Web pages, fixed content, XML-tagged documents, and so on. The Documentum core is a repository that stores content securely under compliance rules and appears as a unified environment, even though content may reside on multiple servers and physical storage devices within a distributed environment.
E
eDiscovery The process of reviewing electronic files to determine their relevance and responsiveness to a legal matter or case.
eDiscovery Case Manager
An IS1200 tab that facilitates eDiscovery for Legal Service Providers.
Electronic Discovery Reference Model (EDRM)
The EDRM was a project created to provide standards and guidelines for the electronic discovery market. The model defines a common, flexible, and extensible framework for the development, selection, evaluation, and use of electronic discovery products and services.
Enterprise Vault A Symantec networked repository for archived email.
eth1, eth2 Most IS1200 platforms require two Ethernet connections for proper deployment. These connections are called eth1 and eth2, must each have a unique IP address, and must be Gigabit (1 Gb/sec) or faster connections. Additionally, all network segments between eth1 and all registered metadata and data repositories must be Gigabit.
eth1 is used to communicate between the IS1200 and its registered repositories. The IS1200 hostname should be DNS mapped to the eth1 IP address.
eth2 must be connected to a private network between the IS1200 nodes and is used to coordinate and balance system wide operations.
The eth2 IP address should not be DNS mapped.
Extended Attributes User-defined keywords that are extracted during data classification.
Extraction Rules Extraction rules are a type of classification rule. They extract
user-defined keywords (custom metadata) to add to the metadata file.
Extraction rules are grouped into Extraction Rule Sets (ERSs). See the Policies: Classification, Extraction and Assignment Rules chapter of any IS1200 User Guide for more details.
Exchange Server (Microsoft)
A Microsoft server designed to store and manage email.
F
Federation A defined group of member-clusters on a Federation server that can be managed, searched, and reported on as a group. Member-clusters are referred to as Federated clusters.
Federation Server A single-node IS1200 server, with a Federation license, that allows consolidated searching and reporting of up to eight Federated member-clusters of its defined Federation.
Filer A file server that exports its file systems using NFS or CIFS protocol.
fullText fullText is the “content” portion of a file, for example the textual content of word processing files and the message body of emails.
fulltext is an extraction rule that is used to save file textual content as metadata to the Search Index during classifications. It saves up to 10 megabytes of content by default. This default may be changed, but it is not recommended. Fulltext extraction is required by
Review/Analysis for the Previewer pane to work and to generate Concepts in the Results Grouping pane.
fulltext is extracted differently for container objects and sub-objects, and for files with embedded objects.
Container objects (such as ZIP or PST files) and their sub-objects are classified individually and the fulltext of the parent container file, and for each child sub-object, is extracted and added to the relevant metadata repository separately.
Files with embedded objects (such as a Microsoft Word file with an embedded spreadsheet) are classified together. The fulltext of the embedded object is included in the fulltext of its parent object and not collected separately.
For more details on fullText, see Chapter 1 of the IS1200 Metadata Reference Guide.
G
Groupware Collaborative software designed to help people involved in common tasks achieve their goals. Incorporates services such as email,
calendaring, text chat, wiki, web-sharing, document control, and advanced search.
H
Hash Values Hash values are used to compare one file with another for duplicates.
An extremely simplified description of hashing is that the numeric values of all bytes in a file are added into a grand total. The chances of two different files yielding the same result (hash value) are extremely small, so hash values can be used to identify duplicate files, or to compare files with the same name to decide if they have been modified.
Computing a hash on an entire file is called a full-hash, and computing a hash on a portion of the file is called a partial-hash. A partial-hash may be used to increase classification speed, and hashing can be turned off altogether for the same reason.
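The trade-off between a full-hash and a partial-hash can be shown with a small Python sketch. It does not reproduce the IS1200 hashing scheme: SHA-256, the 4096-byte partial-hash window, and the function names are assumptions made for the example. The byte_sum function implements the simplified description above literally, to show why practical hash functions are more elaborate than a plain total.

```python
import hashlib


def byte_sum(data: bytes) -> int:
    """The simplified description taken literally: add up all byte values."""
    return sum(data)


def full_hash(data: bytes) -> str:
    """Hash computed over the entire content."""
    return hashlib.sha256(data).hexdigest()


def partial_hash(data: bytes, limit: int = 4096) -> str:
    """Hash computed over only the first `limit` bytes (faster, less exact)."""
    return hashlib.sha256(data[:limit]).hexdigest()


# A plain total cannot tell reordered content apart; a real hash function can.
print(byte_sum(b"ab") == byte_sum(b"ba"))
print(full_hash(b"ab") == full_hash(b"ba"))

# Two files that differ only after the partial-hash window.
file_a = b"x" * 5000 + b"A"
file_b = b"x" * 5000 + b"B"
print(partial_hash(file_a) == partial_hash(file_b))
print(full_hash(file_a) == full_hash(file_b))
```

The partial-hash reads less data and is therefore quicker, but files that differ only beyond the hashed portion are reported as identical, so it trades accuracy for speed.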
I
identity A single entry in the Identity Vault database. The identity contains a single username and password that the IS1200 can retrieve when it needs to access a registered data or metadata repository or another server, such as an authentication service.
Identity Vault An encrypted database of usernames and passwords the IS1200 uses to store the credentials used to access registered data repositories, send email notifications, and work with authentication services.
Information Center Server
The standard IS1200 server offers clustering as a scalable solution for classifying, searching, and reporting on registered network
repositories. While clustering is ideal for scaling to large numbers of files on a LAN, it is not a viable solution for WANs. Enterprises with multiple IS1200 clusters deployed, or with IS1200 clusters deployed in remote offices, need the ability to set up and manage unified reports and searches across all their clusters. The IS1200 Information Center server provides this solution.
Each Federation server supports one federation. A federation may have up to eight clusters (with four nodes each) included in it. Once a federation is established, it becomes a central management point allowing classifications, searches, and reports to be set up or managed on all of the federation's members from the Information Center server.
See the IS1200 Information Center User and Configuration Guide for complete details.
Intelligent Platform Management Interface (IPMI)
IS1200 clusters may contain more than one node. Normally each node communicates with the others to share information and workload.
The IS1200 appliance includes an Intelligent Platform Management Interface (IPMI) to shut down nodes when individual nodes or software errors would degrade the overall cluster performance. The IPMI is an autonomous micro-controller—installed in all cluster nodes—used by the cluster’s “leader” node to power down nodes with errors or performance problems. The IPMI requires its own unique IP address, but communicates over the eth1 port, see “eth1,
K
Kazeon EVAgent An IS1200 service, installed on the Enterprise Vault server, that allows the IS1200 to directly open and access Enterprise Vault email for classification services.
Kaz-mount The NFS file system on which the IS1200 stores metadata; that is, the IS1200 metadata repository.
Kazeon Query Language (KQL)
A programming language used in classification and assignment rules to identify files that should receive specified metadata tags.
KQL Reserved Words The KQL language reserves the following words. Consequently, they cannot be searched for or used as tags or aliases.
"ADD", "ALL", "ALTER", "AND", "ANY", "AS", "ASC", "AVG",
"BETWEEN", "BY", "CASCADE", "CHECK", "COLUMN", "COUNT",
"DESC", "DISTINCT", "ESCAPE", "EXISTS", "FROM", "FULL",
"GRANT", "GROUP", "HAVING", "IN", "INTO", "IS", "JOIN", "KEY",
"LEFT", "LIKE", "MAX", "MIN", "NOT", "NULL", "ON", "OR",
"ORDER", "OUTER", "REVOKE", "RIGHT", "SELECT", "SET", "SUM",
"UNION", "UNIQUE", "UPDATE", "VALUES", "VIEW", "WHERE"
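A proposed tag or alias name can be checked against the list above before it is used. The following Python sketch is one way to do that outside the IS1200; the function names are hypothetical, and treating the comparison as case-insensitive is an assumption, since this glossary does not say whether KQL matches reserved words case-sensitively.

```python
# The reserved words exactly as listed above.
KQL_RESERVED_WORDS = frozenset("""
ADD ALL ALTER AND ANY AS ASC AVG
BETWEEN BY CASCADE CHECK COLUMN COUNT
DESC DISTINCT ESCAPE EXISTS FROM FULL
GRANT GROUP HAVING IN INTO IS JOIN KEY
LEFT LIKE MAX MIN NOT NULL ON OR
ORDER OUTER REVOKE RIGHT SELECT SET SUM
UNION UNIQUE UPDATE VALUES VIEW WHERE
""".split())


def is_reserved(word: str) -> bool:
    """True if the word is reserved (case-insensitive by assumption)."""
    return word.strip().upper() in KQL_RESERVED_WORDS


def rejected_names(candidates):
    """Return the candidate tag or alias names that collide with the list."""
    return [name for name in candidates if is_reserved(name)]


print(rejected_names(["Custodian", "Group", "Matter", "key"]))
```

Running such a check before defining tags or aliases avoids discovering the conflict later, when a rule or search fails.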
Kaz-server The file server where the metadata repository is located.
Kaz-share The CIFS file system on which the IS1200 stores metadata.
Kaz Schema Defines the set of metadata fields used to build a Search Index for registered data repositories (file systems).
L
Legal Hold Files placed on legal hold are either copied to a secure secondary location where they can be preserved for later use, or are locked in their original locations against further change until a legal matter is resolved.
Legal Service Provider (LSP)
A lawyer or trained legal professional who provides legal services for a fee.
Local Refers to the local resources (usually the metadata repository) of the Federation server.
localdatafs A data repository created on the IS1200 itself. This practice is not recommended.
localkazfs A metadata repository created on the IS1200 itself. This practice is not recommended.
Logging rule Logging rules audit user actions on files such as file access, creation, modification, and deletion.
M
Manifest Reports Manifests are reports that summarize the results of an IS1200 job or service. Manifests are produced for Collections (from either
Administration or the Case Mgmt) and for some Actionable Services.
Collection Manifests summarize which files were, or were not, collected during a collection. Actionable Service Manifests reconcile Actionable Services object counts with the object counts of the search results they are performed on, because processes such as deduplication can cause the two counts to differ. The reports detail the number of differences and the reasons for them. For more information, see Manifests in the IS1200 Web-Search User Guide.
Note: Collection manifests are available ONLY for collections done from v4.6.0 or later; earlier versions did not generate collection manifests.
Member-cluster Any of the clusters registered to a particular Federation.
Metadata Data about data. Metadata is used to search for information and to create reports. Metadata can be file system metadata or custom metadata that the IS1200 extracts from files during classification. File system metadata includes file type and file path, extracted during basic classification. Custom metadata is generated during deep classification.
Metadata Repository A registered repository the IS1200 uses exclusively to record the metadata extracted during classification services on the registered data repository the metadata repository is mapped to.
The primary metadata repository is the host of the repository registration database, the report results database, Environment Discovery job results, Auditing and Data Verification databases, and miscellaneous databases the cluster requires for routine operation.
Collectively these are called the Cluster Data Base.
Metadata repositories created on the IS1200 itself (sometimes called localkazfs) are strongly discouraged!
N
Namespaces IS1200 software, versions 4.0 and higher, organizes metadata fields into a hierarchy defined by namespaces. Namespaces group similar sets of tags; for example, all the file-level tags such as FileType, FileSize, aTime, and cTime are grouped together in the System namespace. See the IS1200 Metadata Reference Guide for complete details.
Network File System (NFS)
A protocol used primarily by Unix-based computers for accessing computer systems and filers over a network.
Network Information System (NIS)
A network naming, administration, and authentication system for smaller networks that was developed by Sun Microsystems and is