Confidential document detection is a key activity in data leakage prevention. Once a document is marked as confidential, data leakage from that document can be prevented. Confidential terms are significant terms that indicate confidential content in a document. This paper presents a confidential terms detection method that uses a language model with the Dirichlet prior smoothing technique. Clusters are generated for the training dataset documents (confidential and non-confidential). A language model is created separately for the confidential and the non-confidential documents. The non-confidential language model in a cluster is expanded using similar clusters, which helps to identify confidential content in non-confidential documents.
In 2009, the Verizon Business RISK team published a Data Breach Investigations Report [2] that analyzed 90 data breaches occurring in 2008. According to the nonprofit consumer organization Privacy Rights Clearinghouse [3], a total of 227,052,199 records containing confidential personal information were involved in security breaches in the United States between January 2005 and May 2008. Organizations require a set of laws and rules to protect their confidential information. Some of these laws are the Sarbanes-Oxley Act (SOX) [4], HIPAA [5], and the Gramm-Leach-Bliley Act [6]. Each of these laws focuses on a specific type of business information. Some recent leakage incidents are selected from [4]. All of these data leakage incidents indicate that organizations should focus more on their security …
Existing confidential document detection methods do not check non-confidential documents, yet an intruder may send confidential data through non-confidential documents [8]. Fig. 1 illustrates the confidential terms detection method [8].
Fig. 1 Confidential Terms Identification.
The confidential terms identification procedure is detailed in the following algorithm [8].
Algorithm for Confidential Terms Identification.
Input:
C - Confidential documents.
N - Non-confidential documents.
Output:
CR - Clusters and confidential terms.
1. T ← C ∪ N (combination of C and N documents).
2. CR ← Apply clustering on T.
3. For each cluster c in CR
4. Find_Confidential_Terms(c).
5. End for.
6. Return CR.
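The steps above can be sketched in Python as follows. This is a minimal illustration, not the implementation from [8]: the greedy Jaccard-based clustering, the threshold value, and the count-based stand-in for Find_Confidential_Terms are all assumptions chosen to keep the example self-contained, and every function name is hypothetical.

```python
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization on lowercased text.
    return text.lower().split()


def jaccard(a, b):
    # Jaccard similarity between two token sets.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_documents(docs, threshold=0.2):
    # Hypothetical stand-in for "Apply clustering on T": a greedy
    # single-pass clustering that compares each document with the
    # first document (seed) of every existing cluster.
    clusters = []
    for doc in docs:
        tokens = set(tokenize(doc[0]))
        for cluster in clusters:
            seed_tokens = set(tokenize(cluster[0][0]))
            if jaccard(tokens, seed_tokens) >= threshold:
                cluster.append(doc)
                break
        else:
            clusters.append([doc])
    return clusters


def find_confidential_terms(cluster):
    # Placeholder scoring (assumption): a term is kept when it occurs
    # more often in the cluster's confidential documents than in its
    # non-confidential documents.
    conf, non = Counter(), Counter()
    for text, is_confidential in cluster:
        (conf if is_confidential else non).update(tokenize(text))
    return {term for term, n in conf.items() if n > non[term]}


def identify_confidential_terms(C, N, threshold=0.2):
    # Step 1: T <- C U N, keeping a label for each document.
    T = [(d, True) for d in C] + [(d, False) for d in N]
    # Step 2: CR <- clustering on T.
    CR = cluster_documents(T, threshold)
    # Steps 3-6: find confidential terms per cluster and return them.
    return [(cluster, find_confidential_terms(cluster)) for cluster in CR]
```

Calling `identify_confidential_terms(C, N)` with two lists of document strings returns one `(cluster, terms)` pair per cluster, which mirrors the output CR of the algorithm.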
Steps required to find the confidential key terms:
1. Unsupervised Clustering.
2. Find Confidential Score for Each Term.
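As an illustration of step 2, the sketch below scores terms with Dirichlet prior smoothing, which estimates a term probability as (count of the term + μ · background probability) / (total tokens + μ). This section does not define the confidential score itself, so the ratio of the confidential to the non-confidential smoothed probability used here is an assumption, as are the function names, the choice of background model, and the default μ.

```python
from collections import Counter


def dirichlet_prob(term, counts, total, background_prob, mu):
    # Dirichlet prior smoothing: (c(w) + mu * p(w | background)) / (n + mu)
    return (counts[term] + mu * background_prob) / (total + mu)


def confidential_scores(conf_tokens, non_conf_tokens, mu=2000.0):
    # Separate unigram counts for the confidential and the
    # non-confidential documents of one cluster.
    conf = Counter(conf_tokens)
    non = Counter(non_conf_tokens)
    # Assumption: the background model is the union of both sets.
    background = conf + non
    n_conf, n_non = sum(conf.values()), sum(non.values())
    n_all = n_conf + n_non
    scores = {}
    for term in conf:
        p_bg = background[term] / n_all
        p_conf = dirichlet_prob(term, conf, n_conf, p_bg, mu)
        p_non = dirichlet_prob(term, non, n_non, p_bg, mu)
        # Assumed score: values above 1 mean the term is relatively
        # more likely under the confidential language model.
        scores[term] = p_conf / p_non
    return scores
```

Smoothing keeps the non-confidential probability above zero for terms seen only in confidential documents, so the ratio stays finite. A larger μ pulls both models toward the background distribution and flattens the scores toward 1.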
2.1 Unsupervised Clustering
Unsupervised clustering requires two types of documents: C, the confidential documents, and N, the non-confidential documents. The clustering process groups documents (confidential and non-confidential) into different