A Comparative Analysis of Information Hiding Techniques for
Copyright Protection of Text Documents

Milad Taleby Ahvanooey 1*, Qianmu Li 1*, Hiuk Jae Shim 2
1 School of Computer Science and Engineering, Nanjing University of Science and Technology, China
2 School of Computer and Software, Nanjing University of Information Science and Technology, China
Taleby@njust.edu.cn, qianmu@njust.edu.cn, waitnual@gmail.com
Abstract:
With the ceaseless use of the web and other online services, it has become remarkably simple to copy, share, and transmit digital media over the Internet. Since text is one of the main available data sources and the most widely used digital medium on the Internet, a significant part of websites, books, articles, daily papers, etc. consists of plain text. Therefore, copyright protection of plain text remains an open issue that must be improved in order to provide proof of ownership and obtain the desired accuracy. During the last decade, digital watermarking and steganography techniques have been used as alternatives to prevent tampering, distortion, and media forgery, and also to protect both copyright and authentication. This paper presents a comparative analysis of information hiding techniques, especially those which focus on modifying the structure and content of digital texts. Herein, the characteristics of various text watermarking and text steganography techniques are highlighted along with their applications. In addition, various types of attacks are described and their effects are analyzed in order to highlight the advantages and weaknesses of current techniques. Finally, some guidelines and directions are suggested for future work.
Keywords: Information hiding; text watermarking; copyright protection; copy control; forgery prevention.
*Corresponding Authors
Milad Taleby Ahvanooey, Prof. Qianmu Li, School of Computer Science and Engineering, Nanjing University
of Science and Technology, Nanjing, P.O. Box 210094, China.
Email: Taleby@njust.edu.cn, Qianmu@njust.edu.cn.
1. Introduction
Following the progressive growth of the Internet and the advancement of online services, digital publishing has become an essential topic, and next-generation organizations and offices (e.g., institutions, publishers, etc.) seem to be going paperless. Nowadays, various studies are in progress to realize and organize ideas such as e-commerce, e-government, and online libraries. Digital publishing has many advantages, but it also faces some fundamental threats such as illegal use of copyrighted documents, manipulation of data, and redistribution of such information [1-3]. In this case, protective solutions providing copyright protection, integrity, authenticity, and confidentiality are essential to prevent forgery and plagiarism problems. Downloading and manipulating a copyrighted text and then reusing it without any control is easy these days; hence, copyright management is necessary to protect such information against modification and reproduction [4-5], [10].
Digital text watermarking is a data hiding technique which conceals a signature or copyright information, called a watermark, inside a cover text in an imperceptible way. Separating the hidden watermark from the cover text is very difficult because the watermark is invisible to everyone except the original owner. Recently, text watermarking has not drawn much attention from cyber security experts and researchers. The much lower capacity of text to retain data, compared to other digital media such as image, audio, and video, might be one of the main reasons. However, there are a number of reasons why we should pay more attention to it. Firstly, text is still a major form of universally applicable digital media. In other words, text is an important part of communication between people compared to other media. Secondly, text watermarking has no clear evaluation criteria to analyze its efficiency [6-10].
The different categories of information security systems are depicted in Figure 1. Cryptography and information hiding are security systems that are used to protect data
from deceivers, crackers, hackers, and spies. Commonly, most malicious users leave traces such as cuts, manipulations, and infections [6]. Cryptography scrambles a plain text into cipher text, a process which is reversible without data loss. The goal of cryptography is to prevent unauthorized access to secret information by scrambling its content. On the other hand, information hiding is a powerful security technique which hides secret data in a cover medium (e.g., text, image, audio, or video) so that the trace of the embedded data is completely unnoticeable. Cryptography and information hiding are similar in that both are utilized to protect sensitive information. However, imperceptibility is the difference between the two techniques, i.e., information hiding is concerned with how to hide information unnoticeably. Generally, information hiding can be further categorized into steganography and watermarking. The aim of steganography is to hide a secret message in a cover medium in order to transmit the secret information; therefore, the main concern is how to conceal the secret information without raising suspicion, i.e., steganography needs to conceal the fact that a message is hidden. Watermarking is concerned with hiding a small amount of data in digital files such that the hidden data is robust to alterations and adjustments. In other words, watermarking aims to protect the intellectual property of digital media against unauthorized copying or access by embedding a watermark (visible or invisible) in the cover medium, which remains with the data and can be used whenever there is any query about the originality of the media (e.g., the hidden watermark refers to the original owner) [2-10].
FIGURE 1: Different categories of information security systems. [Diagram: Information Security Systems branch into Cryptography (White-Box, Black-Box, and Gray-Box Cryptography) and Information Hiding (Steganography and Watermarking), with further branches by category (Linguistic, Technical), working domain (Spatial, Frequency), and type of document (Text, Image, Video, Audio; text being either Structural or Linguistic).]
Over the last two decades, many information hiding techniques have been proposed in terms of text watermarking and text steganography for copyright protection [16-19], proof of ownership [20-28], and copy control and authentication [29-36]. Although the aim of steganography is different, it can also be used for the copyright protection of digital texts, like watermarking.
The main contributions of this paper are summarized as follows.
• We present a brief overview of the existing literature on text watermarking categories, architecture, applications, attacks, and evaluation criteria.
• We summarize some information hiding techniques which focus on altering the structure and content of the cover text in order to hide secret information.
• We provide a comparative analysis of the summarized techniques and evaluate their efficiency based on the specified criteria.
The rest of the paper is organized as follows. In Section 2, we review the text watermarking literature and related studies. In Section 3, we introduce individual text watermarking methods and analyze them based on evaluation criteria, and
we give some suggestions for future work. Finally, Section 5 draws the conclusions.

2. Literature review
In what follows, we describe the existing literature on text watermarking, including its architecture, the Unicode standard, text watermarking categories, applications, evaluation criteria, and attacks.

2.1 Text Watermarking Architecture
As shown in Figure 2, digital text watermarking includes two main phases, namely, watermark embedding and watermark extraction.
• Watermark Embedding: the embedding phase of a text watermarking algorithm includes three stages. The first stage generates a watermark string which includes the owner's name or other information (e.g., author, publisher, etc.). In the second stage, the watermark string is converted to a binary string, which is modified by a hash function according to an optional key, and then an invisible watermark string is generated for embedding into special locations in the cover text. Finally, it is inserted into special locations where the watermark string will not be affected by attacks [2], [4], [6], [10].
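The first two embedding stages described above can be illustrated with a minimal Python sketch. It is not the scheme of any cited work: the function names (`to_bits`, `from_bits`, `keystream`, `secure_bits`), the sample owner string, and the choice of XOR-masking the bits with the SHA-256 digest of the key are assumptions made only for illustration, since the text does not specify how the hash function and the key are combined.

```python
import hashlib

def to_bits(watermark: str) -> str:
    """Encode the watermark string as UTF-8 and return its bits as text."""
    return "".join(f"{byte:08b}" for byte in watermark.encode("utf-8"))

def from_bits(bits: str) -> str:
    """Inverse of to_bits: rebuild the watermark string from its bits."""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")

def keystream(key: str, n_bits: int) -> str:
    """Hypothetical keyed mask: repeat the SHA-256 digest bits of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    digest_bits = "".join(f"{byte:08b}" for byte in digest)
    repeats = n_bits // len(digest_bits) + 1
    return (digest_bits * repeats)[:n_bits]

def secure_bits(bits: str, key: str) -> str:
    """XOR the bits with the keyed mask; applying it twice restores them."""
    mask = keystream(key, len(bits))
    return "".join("1" if b != m else "0" for b, m in zip(bits, mask))

owner = "Author: A. Writer"              # sample watermark string (made up)
plain = to_bits(owner)                   # stage 1 and 2: string to binary
secured = secure_bits(plain, key="k1")   # stage 2: keyed modification
recovered = from_bits(secure_bits(secured, key="k1"))
```

The third stage, inserting the secured bits at locations in the cover text, depends on the chosen watermarking technique and is therefore left out of this sketch.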
FIGURE 2. Digital text watermarking (Embedding & Extraction) architecture. [Diagram: the original (cover) text document and a key enter digital watermark embedding (watermark generation, watermark securing, and watermark inserting); the watermarked text is shared on the communication channel, where it is exposed to attacks (alterations, distortions, redistributions); digital watermark extraction with a key (watermark extraction, rehash and detection, and watermark authentication) feeds a watermarking authenticator, which yields an authenticated text document (yes) or a tampered text document (no).]
• Text Document Attacks: nowadays, most users can easily access various digital text files such as articles, books, online news, etc. Due to the availability of open access to these text documents, unauthorized attacks such as copying, alteration, distortion, and redistribution are simultaneously on the rise. Therefore, text watermarking can be used as a security tool to prove the originality and accuracy of text documents [10], [36].
• Watermark Extraction: generally, watermarked text documents are shared via communication channels such as the Web, email, or social media over the Internet. Obviously, it is essential to authenticate the originality of the text documents. Two different terms are used for this phase, i.e., extraction and detection. Although authors often regard both as similar functions in some literature, we can distinguish them in this way: extraction discovers the watermark string from the watermarked document and authenticates its integrity, while watermark detection verifies the existence of the watermark string in the watermarked text [4], [6], [10], [11].
2.2 Unicode Standard
In digital text processing systems, the Unicode standard has been defined to process and display digital texts from 1987 until now. Basically, all operating systems and writing software systems have to support the Unicode standard for the representation of digital texts. The Unicode standard is a universal character encoding system designed to support the worldwide display, processing, and interchange of texts in different languages and technical disciplines. In addition, it also supports historical and classical letters in many languages. This standard is compatible with the latest version of ISO/IEC 10646:2017 and has the same characters and codes as ISO/IEC 10646. As of June 2017, the latest version of Unicode is 10.0.0, which is maintained by the Unicode Consortium.
It includes three encoding forms, UTF-8, UTF-16, and UTF-32, and Unicode allows for 17 planes, each of 65,536 possible characters (or 'code points'). This gives a total of 1,114,112 possible code points for digits, letters, symbols, and a huge number of current characters in various languages around the world. Currently, the most commonly used encodings are UTF-8, UTF-16, and the now-outdated UCS-2. UTF-8 uses one byte for any ASCII character, all of which have the same code values in both ASCII and UTF-8, and up to four bytes for other characters. UCS-2 uses one 16-bit code unit (two bytes) for each character but cannot encode every character in the current Unicode standard. UTF-16 extends UCS-2, using one 16-bit unit for the characters which were representable in UCS-2 and two 16-bit units (four bytes) for each of the remaining characters [16], [33-39].
In the Unicode standard, there are special characters used to control special entities, such as the zero-width joiner, zero-width non-joiner, special spaces (or white spaces), etc. Practically, they have no written symbol (i.e., they are non-printing characters) in digital text processing systems. If a social media platform employs the Unicode standard to process digital texts in different languages, then the Unicode control characters are rendered transparently; otherwise, they may generate some unconventional symbols [11].
In some existing literature, researchers have utilized the Unicode control characters to hide secret data in a cover text, where they provide imperceptible embedding or only a few changes to the cover [12], [16-19], [24], [27], [64].
As depicted in Table 1, all the Unicode special spaces have different widths and no written symbol (color) in digital text processing (i.e., we inserted these spaces between double quotation marks and changed their color to show their width).
TABLE 1: Unicode special space characters [27], [39]
Unicode Hex Code   HTML Code   Name                        Written Symbol
U+0020             &#x0020;    Space                       “ ”
U+00A0             &#x00A0;    No-Break-Space              “ ”
U+200A             &#x200A;    Hair-Space                  “ ”
U+2000             &#x2000;    En-Quad                     “ ”
U+2002             &#x2002;    En-Space                    “ ”
U+2003             &#x2003;    Em-Space                    “ ”
U+2001             &#x2001;    Em-Quad                     “ ”
U+2004             &#x2004;    Three-Per-Em-Space          “ ”
U+2005             &#x2005;    Four-Per-Em-Space           “ ”
U+2006             &#x2006;    Six-Per-Em-Space            “ ”
U+2007             &#x2007;    Figure-Space                “ ”
U+2008             &#x2008;    Punctuation-Space           “ ”
U+2009             &#x2009;    Thin-Space                  “ ”
U+202F             &#x202F;    Narrow-No-Break-Space       “ ”
U+205F             &#x205F;    Medium-Mathematical-Space   “ ”
U+3000             &#x3000;    Ideographic-Space           “ ”
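The encoding sizes described in Section 2.2 can be checked with Python's built-in codecs. The sketch below is only illustrative: the helper name `encoded_sizes` and the sample characters (an ASCII letter, two spaces from Table 1, and one character outside the Basic Multilingual Plane) are our own choices and are not taken from the cited works.

```python
# Total code points: 17 planes of 65,536 code points each.
TOTAL_CODE_POINTS = 17 * 65_536

def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character needs in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),  # big-endian, no byte-order mark
        "utf-32": len(ch.encode("utf-32-be")),
    }

samples = {
    "LATIN CAPITAL LETTER A": "\u0041",
    "NO-BREAK SPACE": "\u00A0",
    "HAIR SPACE": "\u200A",
    "GRINNING FACE (outside the BMP)": "\U0001F600",
}

for name, ch in samples.items():
    print(f"U+{ord(ch):04X} {name}: {encoded_sizes(ch)}")
```

The ASCII letter needs one byte in UTF-8, whereas a special space such as Hair-Space needs three, so hiding data with such characters increases the file size even though nothing visible changes.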
TABLE 2: Unicode zero-width control characters [12], [16], [39]
Unicode Hex Code   HTML Code   Name                         Text Written Symbol
U+200B             &#x200B;    Zero-Width-Space             No symbol, no width
U+200C             &#x200C;    Zero-Width-Non-Joiner        No symbol, no width
U+200D             &#x200D;    Zero-Width-Joiner            No symbol, no width
U+200E             &#x200E;    Left-To-Right-Mark           No symbol, no width
U+202D             &#x202D;    Left-To-Right-Override       No symbol, no width
U+202E             &#x202E;    Right-To-Left-Override       No symbol, no width
U+202A             &#x202A;    Left-To-Right-Embedding      No symbol, no width
U+202B             &#x202B;    Right-To-Left-Embedding      No symbol, no width
U+202C             &#x202C;    Pop-Directional-Formatting   No symbol, no width
U+180E             &#x180E;    Mongolian-Vowel-Separator    No symbol, no width
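To illustrate how the characters in Table 2 can carry hidden data, the following Python sketch embeds a bit string after the first space of a cover text and extracts it again. The mapping of bit 0 to Zero-Width-Non-Joiner and bit 1 to Zero-Width-Joiner, the embedding position, and the function names are assumptions for illustration; the sketch does not reproduce any specific method from the cited literature.

```python
ZWNJ = "\u200C"  # Zero-Width-Non-Joiner, assumed here to encode bit 0
ZWJ = "\u200D"   # Zero-Width-Joiner, assumed here to encode bit 1
BIT_TO_CHAR = {"0": ZWNJ, "1": ZWJ}
CHAR_TO_BIT = {ZWNJ: "0", ZWJ: "1"}

def embed(cover: str, bits: str) -> str:
    """Insert the invisible bit sequence right after the first space."""
    hidden = "".join(BIT_TO_CHAR[b] for b in bits)
    head, sep, tail = cover.partition(" ")
    if not sep:  # no space in the cover: append at the end instead
        return cover + hidden
    return head + sep + hidden + tail

def extract(text: str) -> str:
    """Collect the bits encoded by zero-width characters, in order."""
    return "".join(CHAR_TO_BIT[ch] for ch in text if ch in CHAR_TO_BIT)

def strip_watermark(text: str) -> str:
    """Simulate a removal attack that deletes the zero-width characters."""
    return "".join(ch for ch in text if ch not in CHAR_TO_BIT)

cover = "I Love an Apple."
marked = embed(cover, "0110")
```

In a Unicode-aware viewer the watermarked string is expected to look like the cover text, although it is four characters longer. Deleting the zero-width characters, as a removal attack would, restores the cover text and destroys the watermark.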
As shown in Table 2, the zero-width characters are totally invisible. We have tested all of these characters with Java programs in Docx, txt, and HTML files (some of the zero-width characters are blocked in Gmail, but they can be used in Web watermarking). In practice, when the zero-width characters are used to hide secret data in the cover text, the default encoding of the cover text must be one of the Unicode encodings, such as UTF-8, UTF-16, or UTF-32. In case of attack, if a malicious user copies a text which includes some zero-width characters into a new host file that uses a Unicode encoding, then these characters will be preserved and leave an invisible trace in the text.
Otherwise, they will show up as unsupported characters and raise suspicion about the existence of hidden information.

2.3 Text Watermarking Categories
During the past two decades, much research has been carried out based on structural (format-based) watermarking, linguistic watermarking, scanned-image watermarking, and the frequency of words in the cover text. Herein, we consider those methods which focus on modifying the structure and content of the cover text. In terms of text processing, watermarking techniques are divided into two main categories, linguistic and structural. The linguistic technique is concerned with the special features of the text content that can be changed in a specific language, while the structural technique is concerned with the layout or format of the cover text that can be modified [10], [23]. However, some researchers have classified text watermarking techniques based on the features of the methods, such as robust, fragile, invisible, and visible [9], [40].
• Linguistic (Natural Language) is divided into two types: syntactic and semantic. Syntactic text watermarking involves altering the structure of the text without significantly changing the original meaning of the cover text. Obviously, text documents consist of several sentences, words, verbs, nouns, prepositions, adverbs, adjectives, and so on. There are various syntactic compositions in the sentences of a text, which are determined by the language and its particular conventions [41]. Semantic text watermarking is a language-based technique which focuses on the semantic arrangement of the cover text, such as the spelling of words, synonyms, acronyms, etc., in order to conceal a watermark string. The advantage of this technique is that it provides protection against retyping attacks or the use of OCR programs; however, it alters the original meaning of the text content [43].
• Structural (Format-based) involves altering the layout of the text based on the Unicode or ASCII encoding without changing the sentences or words. The structural approach alters word spacing, line spacing, font style, text color, and similar features [44], [45].
Fig. 3 depicts two examples of hiding the watermark bits in an example sentence by using the linguistic and the structural approaches.
Linguistic (Topkara et al., 2006 [43]):
  Original Text: I Love an Apple.
  Watermark bit: 1 or 0
  Watermarked Text: I Like an Apple.
Structural (Por et al., 2012 [27]):
  Original Text: I Love an Apple.
  Watermark bits: “01 10 11”
  Watermarked Text: I love an Apple.
FIGURE 3. Comparison between linguistic and structural techniques
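The linguistic example in Fig. 3 can be sketched as a synonym substitution that encodes one bit per substitutable word. The synonym table, the convention that the first synonym encodes 0 and the second encodes 1, and the function names are hypothetical; this is a simplified illustration and not the algorithm of Topkara et al. [43].

```python
# Hypothetical synonym pairs: index 0 encodes bit "0", index 1 encodes bit "1".
SYNONYMS = [("Love", "Like"), ("big", "large")]
LOOKUP = {word: (pair, bit) for pair in SYNONYMS for bit, word in enumerate(pair)}
PUNCTUATION = ".,;!?"

def embed(cover: str, bits: str) -> str:
    """Replace each substitutable word by the synonym that encodes the next bit."""
    remaining = list(bits)
    words = []
    for word in cover.split(" "):
        core = word.rstrip(PUNCTUATION)
        if core in LOOKUP and remaining:
            pair, _ = LOOKUP[core]
            # Keep any trailing punctuation that followed the original word.
            word = pair[int(remaining.pop(0))] + word[len(core):]
        words.append(word)
    if remaining:
        raise ValueError("cover text has too few substitutable words")
    return " ".join(words)

def extract(text: str) -> str:
    """Read one bit from every word that appears in the synonym table."""
    bits = []
    for word in text.split(" "):
        core = word.rstrip(PUNCTUATION)
        if core in LOOKUP:
            bits.append(str(LOOKUP[core][1]))
    return "".join(bits)

watermarked = embed("I Love an Apple.", "1")
```

In this sketch the capacity is limited by the number of words that have an entry in the synonym table, and each substitution changes the wording of the cover text.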
As Fig. 3 shows, the linguistic technique changes the text content and the structural technique alters the layout of the text. In Section 3, we explain the two approaches in detail.
2.4 Text Watermarking Applications
Text watermarking techniques are applicable in many applications. The following points are the most important watermarking applications.
• Digital Copyright Protection (Proof of Ownership): text watermarking provides passive protection tools for digital documents so that the text content cannot be illegally copied or replicated. For example, if someone copies a watermarked document/file (e.g., PDF, Docx, LaTeX, RTF, and so on), then the reversibility of watermarking techniques can be used to prove the ownership of the copied documents [7], [10].
• Access Control (Copy Control): currently, publishers and content providers are seeking more reliable ways to control copying of or access to their valuable documents, and simultaneously, they want to make the documents accessible on the Internet in order to obtain more revenue. Text watermarking is a desirable technique for online systems that provide access control to prevent illegal copying or restrict the number of times the original text can be copied [7], [46].
• Tamper Proofing: these days, a huge number of text documents are available online for users to buy or read. Therefore, these documents are prone to a number of attacks (e.g., unauthorized access, copying, redistribution, and so on). In this case, text watermarking can be used as a fragile tool for tamper proofing of the watermarked texts against attacks. In general, a fragile watermark is embedded into text documents, and if any type of alteration has been made, then the watermark can no longer be detected [10], [23].
• Text Content Authentication: the online publishing of articles and newspapers in the form of plain text documents has brought several issues related to authenticating the integrity of these documents. Text watermarking can be applied as an authentication tool to verify the integrity of plain text documents [46].
• Forgery Detection (Prevention): plagiarism and reproduction of text documents are serious forgery
activities and are rapidly increasing. Text watermarking can be used as a forgery detection tool by embedding a watermark in the original text before online publishing. Thus, it can prove the plagiarism and reproduction of the watermarked texts [7], [10].

2.5 Text Watermarking Evaluation Criteria
There are many things to be considered when researchers design a watermarking algorithm. However, common criteria can easily be found in recently proposed algorithms: invisibility, robustness, embedding capacity, and security, which implies that an ideal watermarking algorithm should be secure and robust against attacks. In order to score highly on these criteria, researchers should consider the application of the method (e.g., fragile or robust). A suitable algorithm should provide optimum trade-offs between the evaluation criteria according to the application requirements of the method [4], [6-10].
In the following, we introduce five evaluation criteria, some of which include formulas for analyzing the efficiency of watermarking algorithms.
• Invisibility (Imperceptibility): the trace of embedding a watermark in the cover text must be invisible, and the watermark must be extractable by the corresponding watermarking algorithm. In other words, invisibility refers to how many perceptual changes are made in the cover text after embedding a watermark. Practically, it cannot be measured numerically. The best way of measuring the degree of imperceptibility is to compare the original cover text with the watermarked text [10], [16].
• Embedding Capacity (EC): the number of watermark bits which can be concealed in a cover text is termed the embedding capacity. This criterion can be measured numerically in units of bits per location (BPL). A location is a specific position in the cover text where the algorithm can embed the watermark string (e.g., spaces between words, after special punctuation marks, etc.). Even if a watermarking algorithm provides a large embedding capacity, it is not desirable for copyright protection if it alters the cover text profoundly [7], [26].
$$EC = BPL \times \text{Total Locations} \tag{1}$$
• Robustness: many attacks may happen to watermarked texts while they are shared on the communication channel; these are referred to as hazards which could distort (or damage) the watermark [10]. In general, malicious users may also randomly manipulate or distort the embedded watermark in the watermarked text, rather than destroy or delete it. Moreover, any kind of distortion may occur deliberately or even unintentionally. A robust text watermarking algorithm makes the watermark extremely hard to alter or remove. The distortion robustness (DR) can be measured numerically by the distortion probability [4], [5].
- Distortion Probability (DP): this is the probability that a proportion of the watermark bits (WB) has been lost. Malicious users can manipulate or alter the watermarked text such that the WB may not be extracted. A lower DP leads to greater robustness of the watermarking algorithm. There is no specific formula for calculating the probability of distortion in the existing literature. We aim to provide a benchmark analysis of watermarking techniques which depends on the embedding locations of the method in the cover text [17], [48].
Let us suppose that the number of embedding positions (e.g., space characters used to embed a watermark string) in the cover text is D, the total number of characters in the cover text is T, and S is the number of sample files. Then the average distortion robustness can be calculated as follows.
$$DR = \frac{1}{S} \sum_{i=1}^{S} (1 - DP_i) \tag{2}$$
where $DP(WB) = \frac{D}{T}$, $1 < D < T$, $T \in \mathbb{N}$, and $D \in \mathbb{N}$.
• Security: it prevents attackers from detecting the watermark visually or from deleting the watermark from the watermarked text by providing a certain level of security [6], [7]. In fact, this measure depends on three other criteria: invisibility, embedding capacity, and robustness. The text watermarking algorithm should provide optimum trade-offs among these criteria. Moreover, if it provides a large capacity and the trace of embedding is totally imperceptible, then the security of the algorithm is given by the above robustness formula [10], [16].
• Computational Cost: this is one of the least significant criteria for next-generation computers. However, some text documents can have many pages; therefore, computationally less complex text watermarking approaches are preferred. Obviously, long documents require more software or hardware resources, i.e., higher computational cost. In general, less complex algorithms are used on resource-limited systems such as mobile devices, embedded microprocessors, etc. [7], [10].

2.6 Text Watermarking Attacks
Currently, the availability of open access and online publishing of valuable documents (e.g., books, articles,
newspapers, etc.) has exposed them to new breeds of plagiarism and forgery attacks. Malicious users can access plain text or even protected documents by means of unauthorized software tools. The attacks on watermarked text documents can be divided into five categories: tampering attacks, estimation-based attacks, retyping attacks, reformatting attacks, and copy and paste attacks [4], [10], [36], [48-51].
I. Tampering Attacks: this kind of attack includes three types: removal, insertion, and reordering attacks.
• Removal (Deletion): in this attack, the malicious user attempts to delete the watermark string completely from the watermarked text without affecting the original text content. Even if the attack cannot remove the watermark string completely, it almost destroys it [10], [50].
• Insertion or Distortion: after copying, the attacker's aim is to alter the original text by randomly removing some words and manipulating the copied text. Sometimes, malicious users try to remove the authors' names or related information from the original text. Afterward, they insert new information into the copied text in order to claim ownership. In some literature, this type of attack is called a geometric attack [48].
• Reordering (Re-building): another way of tampering is that malicious users change the order of words and sentences to produce a new version of the document by paraphrasing its content. Thus, the watermark string may be lost, and its detection or extraction fails [10], [49].
II. Estimation-based Attacks: in this kind of attack, the attackers must have some preliminary knowledge about text watermarking and the characteristics of text processing. Estimation-based attacks include removal attacks, ambiguity attacks, and copy attacks [28], [51].
• Estimate of the original text: since the watermark string is extra, independent data in the watermarked text, attackers may design an extraction algorithm to produce a new document without a watermark string. Thus, they try to estimate the relation between the watermark string and the original text and write an algorithm to extract the original text without the watermark such that the original text content is not changed [10], [36].
• Ambiguity (Reverse): this attack aims to confuse the detector by estimating a forged watermark from the watermarked text. Therefore, it causes ambiguity in the ownership of the watermarked text. If the watermark is approximately estimated, then the attacker can remove the watermark from the watermarked text [51].
• Copy attacks: in this attack, the aim of the attacker is to estimate a watermark and extract it from, or apply it to, the target watermarked text in order to claim ownership of the copied text. As the definition implies, the attacker needs to perform a removal attack, using the estimated watermark (e.g., previous statistical knowledge of watermark locations), to extract the watermark [7], [51].
III. Retyping Attacks: sometimes, content providers protect their text documents so that the text content is read-only and no one can copy even a small part of it. In this case, malicious users retype the target text [10].
IV. Reformatting Attacks: most malicious users copy the target texts from Web sites or articles into their own files and may try to change the font style, font color, and so on. Such modifications, which change the layout of the text without changing its content, are called reformatting attacks [6].
V. Copy & Paste Attacks: this is one of the most common attacks, in which malicious users copy the whole text and paste it into their own files (e.g., a simple copy of the watermarked text into another file).

3. Text Watermarking Existing Techniques

Text watermarking techniques have various strategies and schemes which depend on the applications of the methods. In other words, the aim of watermarking determines whether the algorithm should be a fragile or a robust tool, and thus whether it can be used to prove the integrity or the originality of the text. Practically, developing a robust algorithm is not easy and requires balancing multiple criteria. Currently, only a few text watermarking techniques have been introduced; hence, it is hard to find literature addressing their limitations. In this section, we introduce some related works which focus on altering the structure and content of the text in order to embed hidden information [6], [10]. From the text processing point of view, text watermarking algorithms can be classified into one of the categories in Figure 4, namely: linguistic (natural language) and structural (format-based).
FIGURE 4. Different types of text hiding techniques. [Diagram: Text Watermarking branches into Linguistic (Semantic, Syntactic) and Structural (Zero-Width, Open Space, Features, Word Shift, Line Shift).]
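Because the techniques in this section are compared using the criteria of Section 2.5, a short Python sketch of Equations (1) and (2) may be helpful. The function names and the sample numbers are made up for illustration; DP is computed as D/T, where D is the number of embedding positions and T is the total number of characters of one sample file.

```python
def embedding_capacity(bits_per_location: float, total_locations: int) -> float:
    """Equation (1): EC = BPL x Total Locations."""
    return bits_per_location * total_locations

def distortion_probability(d: int, t: int) -> float:
    """DP = D / T, with 1 < D < T as required in Section 2.5."""
    if not 1 < d < t:
        raise ValueError("expected 1 < D < T")
    return d / t

def distortion_robustness(samples: list) -> float:
    """Equation (2): average of (1 - DP_i) over the S sample files."""
    total = sum(1 - distortion_probability(d, t) for d, t in samples)
    return total / len(samples)

# Made-up sample files given as (D, T) pairs.
files = [(10, 100), (20, 100), (30, 100)]
dr = distortion_robustness(files)   # average of 0.9, 0.8, and 0.7
ec = embedding_capacity(2, 15)      # 2 bits at each of 15 locations
```

With these made-up inputs, a method that touches fewer positions relative to the text length obtains a lower DP and therefore a higher DR, which matches the statement that a lower DP means greater robustness.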
3.1 Linguistic (Natural Language):
This type of watermarking technique modifies the content of a text document to hide a watermark binary string. In recent years, a few natural-language-based algorithms have been introduced. As explained in Section 2.3, the semantic or syntactic analysis of the text content is used to embed the watermark bits in Natural Language (NL) watermarking. It generally changes the structures of the text, including nouns, adjectives, verbs, prepositions, pronouns, idioms, synonyms, and any available objects, to conceal the watermark. Moreover, this type of technique is designed to maintain the original meaning of the cover text while changing the semantics or the syntax of its content [6], [32].
Topkara et al. (2006a) provided a new syntactic watermarking technique based on the syntax of the cover text, especially in the English language, which performs syntactic sentence paraphrasing. In this work, the original sentence is analyzed by the XTAG parser and then sent for feature verification. Finally, the embedding algorithm inserts the watermark bits into the sentences by paraphrasing their contents [42].
Topkara et al. (2006b) presented another, semantic, watermarking method which embeds the watermark by synonym substitution in English text documents. This method utilizes heuristic measures of quality based on conformity to a language model. While there are many ways to produce a substitution for a word, the algorithm prioritizes them according to a quantitative resilience measure and uses them in the order of the priority list. In this research, the authors attempted to increase the capacity and reduce the distortions against attacks in the watermarked text [43].
Meral et al. (2009) proposed a morphosyntax-based NL technique which embeds a watermark (binary string) based on a syntax tree in Turkish text documents. The algorithm embeds watermark bits under the control of WordNet (or a dictionary) to prevent semantic drops (e.g., altering the meaning of the original text). In this technique, the watermark bits are embedded by altering the changeable sentences in the cover text. These alterations include conjunct order change, adverb displacement, and so on. The direction of the words (forward or backward) indicates whether the watermark bit is “1” or “0”. However, the capacity of this technique is low; its achievable capacity is almost one bit per sentence [52], [53].
Kim (2008) suggested a syntactic text

current bit does not match with the movement bit of the target constituent, the method moves the syntactic constituent in the syntactic tree). Finally, the algorithm produces the watermarked text from the modified syntactic tree. The disadvantage of this method is that it is only applicable to agglutinative languages such as Korean and Turkish; moreover, text reordering may change the meaning of the original text [41].
Kim et al. (2010) proposed another NL watermarking algorithm based on syntactic displacement and morphological division in Korean text documents. The authors exploited the fact that most languages allow the displacement of syntactic adverbials within a sentence. Moreover, they claim that the proposed method does not change the general meaning of the sentences, but in practice it alters the meaning of the text slightly [54].
Halvani et al. (2013) proposed four methods to hide the watermark bits by either lexical or syntactic alteration, which are designed for the German language. The first syntactic transformation applies enumeration modulation (EM) to embed the watermark bits using the grammatical rule of “constituent movements”. The second method uses conjunction modulation (CM), which is based on a grammar rule (constituent movement) and focuses only on two nouns separated by an optimal conjunction. The third method is based on prefix expansion, which modifies the negations of words. The fourth method utilizes a lexical transformation to insert the watermark bits by altering words. The alteration is based on three grammatical rules: repeated letters, connected anglicisms, and inflected adjectives. The advantage of these methods is their compatibility with some other languages such as Spanish, English, or French. However, these methods also have the same problem as other NL algorithms: they tend to change the original meaning [55].
Mali et al. (2013) introduced a novel NL watermarking method. In this approach, English grammatical rules are applied to produce watermark bits. The algorithm generates watermark bits based on a combination of the total conjunctions, pronouns, modal verbs, and author ID found in the cover text. Then, the watermark bits are encrypted with AES. This algorithm was designed for the verification of web page text. In addition, a receiver can authenticate the watermark by the extraction algorithm. Since this
watermarking for the Korean language text documents. method modifies grammatical rules, it also changes the
This work consisted of four stages. First, it creates a original meaning of sentences [56].
syntactic dependency tree of the cover text. Second, it LU et al. (2009) introduced a new watermarking
selects target syntactic constituents to move words. technique, which embeds the watermark bits into the
Third, the algorithm embeds watermark bits (If the pragmatics properties of cover text by rewriting
8
sentences. The method avoids syntactic and semantic In this study, we analyze the efficiency of
analysis of text content and utilizes a transformation summarized techniques in terms of evaluation criteria
templates based on special pragmatic rules by part-of- that are explained in section 2. In addition, we
speech (PaS) tags order in the Chinese language. It considered a rating factor for the NL techniques in
classifies sentences into subsets for embedding the terms of their criteria: for example, low, medium, and
watermark bits. For example, if the current subset is high scale for the capacity; low, modest, and high for
even, then the sentence represents a bit “1”, otherwise, the robustness; imperceptible, middle, and visible for
i.e., the current subset is odd, then sentence indicates a the invisibility. The rate of each technique is estimated
bit “0”. The authors aimed to paraphrase sentences based on its embedding method. Language
without altering the original meaning of the Chinese compatibility refers to the specific language to which
text. However, this work is relatively weak against the corresponding method can be applied.
tampering attacks [57]. To demonstrate the hiding process of above
Practically, the NL watermarking is complicated techniques, we implemented them on highlight
since not every language supports syntactic or examples (e.g., this process only embeds one bit in the
semantic alterations. Moreover, most of algorithms cover text), which are depicted in Table 3. Moreover,
cannot be applied to sensitive documents because the the comparative analysis of the evaluated techniques in
original meaning or even word choice of text can be terms of criteria, are summarized in Table 4.
altered to some extent.
TABLE 3. Implementation of NL techniques on the highlight examples

Algorithm | Original Cover Text | Watermarked Text
(Topkara et al., 2006a) [42] | I love an apple. | My favorite fruit is apple.
(Topkara et al., 2006b) [43] | I love an apple. | I like an apple.
(Meral et al., 2009) [53] | I love an apple. / Bir elmayı seviyorum | I like an apple. / Bir elmayı severim
(Kim, 2008) [41] | I like apples in autumn. / 나는 가을철에 사과를 좋아한다. | (in autumn) (I) (apples) (like) / 가을철에 나는 사과를 좋아한다.
(Kim et al., 2010) [54] | (the departure) (was delayed) / 출항이 지연되었다. | (the departure) (was delayed) / 출항이 지연이 되었다.
(Halvani et al., 2013) [55] | I love an apple. / Ich liebe einen Apfel | I like an apple. / Ich mag einen Apfel
(Mali et al., 2013) [56] | You could go to school. | You should go to school.
(LU et al., 2009) [57] | Tom's leg is injured by falling. / 汤姆的腿[被]摔伤了。 | Tom fell down and his leg is injured. / 汤姆[被]摔了，他的腿摔伤了。
TABLE 4. A comparative analysis of NL techniques

Algorithm | Embedding Capacity (1 bit per) | Invisibility | DR | Language Compatibility
(Topkara et al., 2006a) [42] | Sentence | Imperceptible | Low | English
(Topkara et al., 2006b) [43] | Synonym of words | Imperceptible | Low | English
(Meral et al., 2009) [53] | Synonym of words | Imperceptible | Low | Turkish
(Kim, 2008) [41] | Synonym of words | Imperceptible | Low | Korean
(Kim et al., 2010) [54] | Displacement of adverbs | Imperceptible | Low | Korean
(Halvani et al., 2013) [55] | Synonym of words | Imperceptible | Low | German
(Mali et al., 2013) [56] | Grammatical words | Imperceptible | Low | English
(LU et al., 2009) [57] | Sentence | Imperceptible | Low | Chinese
As shown in Table 4, all the NL based techniques modify the cover text contents to hide the watermark bits. This type of alteration therefore needs rules or locations for finding the target words when paraphrasing the text. In each technique, the authors use predefined dictionaries or ‘Wordnet’ to find and replace the target words, which causes high computational cost because an extra dictionary is required. Moreover, the NL based techniques are mostly incompatible with different languages. They perfectly protect the watermarked texts against retyping attacks, but have low robustness against tampering attacks.
3.2 Structural (Format based):
This type of text watermarking alters the structural layout or properties of the text in order to hide the watermark bits. As we already explained in Section 2.3, the structural layout consists of the spaces between paragraphs, lines, and words, curved letters, letter extensions, and characters with diacritical marks. Any other property can be utilized to change the layout or the format of the cover text in an unnoticeable way. Recently, researchers have introduced various techniques that modify the text layout to carry the embedded watermark bits.
Bender et al. (1996) presented the first open space data hiding technique, which uses white space in text documents. The white space based embedding algorithm considers three different locations: inter-word spaces, inter-sentence spaces, and end-of-line spaces. In the case of inter-word spacing, the algorithm inserts additional spaces between two words: for example, two spaces between words represent a bit value of ‘1’ in the watermark bits, while a single space represents a bit value of ‘0’. For inter-sentence spacing, a ‘0’ is represented by one space between sentences, and a ‘1’ by two spaces. In the case of end-of-line spaces, two spaces are inserted to represent one bit (per line), four spaces represent two bits, six spaces represent three bits, and so on. This technique is fully applicable to different languages; however, its disadvantage is low capacity, since only one or two bits per location can be embedded. Moreover, this method is not able to preserve the embedded bits against tampering and retyping attacks [44].
Brassil et al. (1999) proposed another watermarking technique based on modifying the appearance of different elements of the cover text. The method considers three different structural layouts: line-shift, word-shift, and text formatting. In this work, the embedding algorithm inserts watermark bits by shifting a word (left or right) or a line (up or down), or by changing the height of the corresponding characters. The extraction algorithm analyzes the lines or words of the watermarked text (scanned image) to detect the direction of the movements. Even though reformatting the digital watermarked text causes the watermark bits to be lost, this work provides a new perspective for structure-based text watermarking techniques.
Lee and Tsai (2008) introduced a data hiding approach for secure communication through Web text documents. The algorithm embeds a secret message between words using special spaces. In this study, the method converts a secret message to a binary string based on ASCII codes. It then embeds the binary string into the Web document by replacing the spaces between words with one of nine special spaces according to a 3-bit group coding; the spaces are listed in Table 5.
TABLE 5. Special spaces based 3-bits group coding [58]

No. | Name | Reference type | Code type | Code inserted in HTML | Bits-Encode
1 | (Normal) space | Normal Space | ASCII | Typed space (with 20h inserted) | 000
2 | (Normal) space | Numeric character reference | Unicode | | 001
3 | (Normal) space | Numeric character reference | Unicode | 2 | 010
4 | (Normal) space | Numeric character reference | Unicode | &#x32 | 011
5 | Non-break space | Numeric character reference | Unicode | | 100
6 | Non-break space | Numeric character reference | Unicode | | 101
7 | Non-break space | Numeric character reference | Unicode | &#160 | 110
8 | Non-break space | Character entity reference | HTML | | 111
9 | Non-break space | Character entity reference | HTML | &nbsp | Unused
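As a quick check of two of the codes that survive in Table 5, the sketch below decodes them with Python's standard `html.unescape`. Using this function as a stand-in for a browser's HTML parser is an assumption of the example, not part of the scheme in [58]. It suggests that `&#160` resolves to the no-break space U+00A0, that `&#x32` resolves to code point U+0032 (the digit "2"), and that an unterminated reference can absorb the digits of the word that follows it.

```python
import html

# "&#160" is the decimal reference for U+00A0 (no-break space).
nbsp = html.unescape("&#160")

# "&#x32" is the hexadecimal reference for U+0032, the digit "2";
# a normal space (U+0020) would be written "&#x20".
digit = html.unescape("&#x32")

# Without a terminating ";", the hex digits of the next word are read as
# part of the reference: "&#x32" followed by "21" is parsed as U+3221.
collided = html.unescape("Apr. &#x3221, 2017")

print(repr(nbsp), repr(digit), collided)
```

Terminating each reference with ";" would keep the following digits out of the code point, although the reference would still denote whatever character its code point names.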
The authors provided secret communication on web pages by using undefined characters in Unicode or ASCII (e.g., “&#x32”, “&#160”), which can produce unpleasant text in the output. For example, let “Apr. 21, 2017” be the original text and “&#x32” be the special space for hiding the bits “010”; then the watermarked text might be “Apr. &#x3221, 2017”, and a browser will show “Apr.㈡, 2017” (we tested the sample in HTML). Lee and Tsai presented a new data hiding scheme that replaces the between-word locations by using different codings; however, the algorithm was evaluated on Internet Explorer version 6, where the embedded message was invisible in the output. During our test, we utilized the latest versions of common web browsers such as Chrome and Firefox, and this algorithm showed unpleasant characters between words in the output text after embedding [58].
Cheng et al. (2010) proposed a robust watermarking algorithm. The method utilizes the color feature of the cover text in order to embed the watermark bits based on a watermark fragments and regrouping strategy. This technique has advantages such as improved robustness, capacity, and invisibility; however, it is vulnerable to highlighting words in PDF or MS Word files, since highlighting words in the watermarked text will definitely change the text color. Consequently, it does not provide robustness against reformatting attacks [59].
Gutub et al. (2007) suggested a new text watermarking method using the Kashida, or extension, feature of the Arabic language. A Kashida (extension character) is used to adjust text by changing word length; however, it does not change the meaning at all (e.g., ســلام = سلام). A Kashida is added before or after characters containing points (pointed characters) to hide a bit ‘1’, and before or after characters without points (un-pointed characters) to hide a bit ‘0’ [60]. Gutub et al. (2010) also introduced a new Kashida based watermarking method which employs a special pattern for embedding the watermark bits. It improves embedding capacity by adding one Kashida after any compatible character to represent a bit ‘0’, and two Kashidas for a bit ‘1’. This algorithm is designed to provide proof of
ownership and authentication for Web text documents [61].
Alginahi et al. (2013) presented a new Kashida based watermarking approach for hiding secret data in Arabic text. In this work, a bit ‘1’ is represented by one Kashida before specific characters (ء، ا، أ، إ، آ، ئ، د، ذ، ر، ز، و، ؤ), and a bit ‘0’ by omitting it [25]. Later, Alginahi et al. (2014) utilized two sets of characters according to their frequency of occurrence in Arabic digital texts. In this study, the authors proposed two methods (A and B) that embed the watermark bits by adding one Kashida for a bit ‘1’ and omitting the Kashida for a bit ‘0’ at special locations in the text. The embedding locations are after 14 characters with high frequency (ا، ل، ي، و، م، ن، ت، ر، ب، ع، ه، د، س، ک) for method A and after 15 characters with low frequency (غ، ز، ث، ش، ظ، ط، ض، خ، ذ، ص، ج، ح، ق، ف، ة) for method B [26].
Al-Nofaei et al. (2016) proposed a Kashida based steganography technique for Arabic digital texts. This method improves on hiding data with the Kashida character in Arabic text documents by also using the whitespaces between words. In practice, this method provides high imperceptibility and better capacity compared to other techniques; however, its robustness against tampering and retyping attacks is low [69].
All the Kashida based hiding techniques [25], [26], [60], [61], [65-70] provide high imperceptibility and optimum capacity, and can be applied to text documents in the Arabic, Persian, and Urdu languages, but they cannot retain the watermark bits against tampering and retyping attacks.
Chou et al. (2012) suggested a reversible data hiding scheme for HTML files based on adding specific space characters between words in the cover text. In this method, English sentences are divided into several textual segments, and every segment includes some blank characters (between-word locations). Then, a Cartesian product strategy is utilized to create pairs of spaces, and the blank characters are replaced by the new pairs of spaces (according to the watermark bits) [62]. This method improves on the embedding capacity of the method introduced in [58] (i.e., five bits per location). However, it generates large gaps between words and is vulnerable to tampering and retyping attacks.
Por et al. (2012) introduced a data hiding method called UniSpaCh, which employs Unicode special spaces in order to hide secret information in Microsoft Word files. The method utilizes specific locations such as inter-word, inter-sentence, end-of-line, and end-of-paragraph spaces to conceal a secret message [27]. In addition, combinations of two spaces are utilized for embedding the secret bits, as depicted in Table 6.
TABLE 6. Binary classification model in UniSpaCh [27]
Group A: representation scheme for inter-word spacing and inter-sentence spacing
Symbol | Spaces | Sequence
“ ” | Normal | 00
“  ” | Thin + Normal | 01
“  ” | Six-Per-Em + Normal | 10
“  ” | Hair + Normal | 11
Group B: representation scheme for end-of-line and inter-paragraph spacing
Symbol | Spaces | Sequence
“ ” | Hair | 00
“ ” | Six-Per-Em | 01
“ ” | Punctuation | 10
“ ” | Thin | 11
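The Group A scheme of Table 6 can be sketched as follows. This is a minimal illustration rather than the UniSpaCh implementation: the code points are the standard Unicode values for the named spaces, while the function names, the restriction to inter-word gaps, and the `n_bits` length argument are assumptions made for the example. The cover text is assumed to be separated by single normal spaces.

```python
# Standard Unicode code points for the spaces named in Table 6 (Group A).
NORMAL = "\u0020"
THIN = "\u2009"
SIX_PER_EM = "\u2006"
HAIR = "\u200A"

# Two bits per inter-word gap, as in Group A of Table 6.
GROUP_A = {
    "00": NORMAL,
    "01": THIN + NORMAL,
    "10": SIX_PER_EM + NORMAL,
    "11": HAIR + NORMAL,
}
DECODE_A = {pattern: bits for bits, pattern in GROUP_A.items()}


def embed(cover: str, bits: str) -> str:
    """Replace each inter-word gap with the Group A pattern of the next bit pair."""
    words = cover.split(NORMAL)
    if len(bits) % 2 or len(bits) > 2 * (len(words) - 1):
        raise ValueError("bit string must have even length and fit the gaps")
    pairs = [bits[i:i + 2] for i in range(0, len(bits), 2)]
    out = [words[0]]
    for i, word in enumerate(words[1:]):
        # Gaps beyond the payload keep a plain normal space.
        gap = GROUP_A[pairs[i]] if i < len(pairs) else NORMAL
        out.append(gap + word)
    return "".join(out)


def extract(stego: str, n_bits: int) -> str:
    """Read two bits from every gap, then keep the first n_bits."""
    bits = []
    for chunk in stego.split(NORMAL)[:-1]:
        last = chunk[-1:]  # a special space, if any, sits just before the normal space
        gap = last + NORMAL if last in (THIN, SIX_PER_EM, HAIR) else NORMAL
        bits.append(DECODE_A[gap])
    return "".join(bits)[:n_bits]


cover = "Tom's leg is injured by falling."
stego = embed(cover, "0110110001")
print(extract(stego, 10))
```

With five inter-word gaps, the example sentence carries 10 bits, which is the figure reported for this method on the same sentence in Table 12.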
The merit of this method is that a combination of spaces provides more embedding capacity than the previous methods. However, this method also cannot avoid generating unpleasant gaps in the watermarked text and is vulnerable to tampering and retyping attacks [27].
Mir (2014) proposed another open space based text watermarking method using the structural properties of the HTML language. The algorithm utilizes a hash function to generate watermark bits; the hashed watermark bits are then converted to an invisible string using three special space characters (i.e., U+202F, U+205F, and U+200A), and the invisible string is embedded in the <meta> tag of an HTML file. The author claimed that the method can be applied to multilingual text files and provides high robustness. However, there are disadvantages: those characters generate unpleasantly large gaps between words, and the method is also vulnerable to tampering, copy & paste, and retyping attacks [15].
Taleby Ahvanooey and Tabasi (2014) introduced an invisible watermarking technique that adds hidden Unicode characters to Microsoft Word files. In this method, a watermark string is first converted to 8-bit ASCII codes. Each pair of bits of the binary watermark sequence is represented by a zero-width character, as shown in Table 7. Then, the zero-width characters are embedded after special punctuation characters (e.g., dot (.), comma (,), semicolon (;), etc.). In this work, the researchers aim to protect the watermark against tampering attacks by hiding the watermark string many times in the original cover text. In order to delimit each copy of the watermark string, a zero-width character (LRE or RLE) is inserted after each embedded watermark string according to the language of the text (English or Persian).
TABLE 7. Binary classification model in [20]

Character Name | Two Bit Classification | The Character written symbol | Character Unicode Code
Zero-width Space | 00 | No symbol and Width | U+200B
Zero-Width-Joiner | 01 | No symbol and Width | U+200D
Zero-width-Non-Joiner | 10 | No symbol and Width | U+200C
Mongolian-Vowel-Separator | 11 | No symbol and Width | U+180E
LRE or RLE | End of the Watermark Message | No symbol and Width | U+202A or U+202B
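The two-bit coding of Table 7 can be illustrated with the sketch below. It is not the implementation of [20]: the code points are the standard Unicode values for the character names in the table, and the punctuation set, the function names, and the omission of the LRE/RLE end marker and of repeated embedding are simplifying assumptions.

```python
# Bit pairs mapped to the zero-width characters named in Table 7
# (standard Unicode code points for those names).
ZW = {
    "00": "\u200B",  # zero-width space
    "01": "\u200D",  # zero-width joiner
    "10": "\u200C",  # zero-width non-joiner
    "11": "\u180E",  # Mongolian vowel separator
}
ZW_INV = {char: bits for bits, char in ZW.items()}

# Assumed set of punctuation characters used as embedding locations.
PUNCT = set(".,;:!?'")


def to_bits(watermark: str) -> str:
    """Convert an ASCII watermark string to its 8-bit binary sequence."""
    return "".join(f"{byte:08b}" for byte in watermark.encode("ascii"))


def embed(cover: str, bits: str) -> str:
    """Insert one zero-width character (two bits) after each punctuation mark."""
    if len(bits) % 2:
        raise ValueError("bit string must have even length")
    pairs = [bits[i:i + 2] for i in range(0, len(bits), 2)]
    out, used = [], 0
    for ch in cover:
        out.append(ch)
        if ch in PUNCT and used < len(pairs):
            out.append(ZW[pairs[used]])
            used += 1
    if used < len(pairs):
        raise ValueError("cover text has too few punctuation characters")
    return "".join(out)


def extract(stego: str) -> str:
    """Collect the bit pairs of all zero-width characters in reading order."""
    return "".join(ZW_INV[ch] for ch in stego if ch in ZW_INV)


stego = embed("Tom's leg is injured by falling.", "0110")
print(extract(stego))
```

The example sentence offers two locations (the apostrophe and the final dot) and therefore 4 bits, which agrees with the value listed for this method in Table 12.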
Moreover, the extraction algorithm compares the expected length of the watermark bits with the extracted watermark bits and the location of LRE or RLE to check the accuracy of watermark detection. This work was designed to authenticate e-text and prove the ownership of Microsoft Word files. Even though the method provides a high degree of invisibility and optimum robustness, it has low capacity and shares the disadvantage of other structural methods of being vulnerable to retyping attacks [20]. Later, Taleby Ahvanooey et al. (2015) proposed a method which improves the embedding capacity (16 bits per location) over the previous work by selecting different embedding locations in the cover text, e.g., after special punctuation characters and between blank lines and paragraphs [21].
Alotaibi and Elrefaei (2016) designed a new watermarking technique to conceal secret information in Arabic text documents. This method groups characters according to the dotting feature of the Arabic alphabet, as depicted in Table 8 [63].
TABLE 8. Two grouping of Arabic letters [63]
Pointed Letters: ب ت ث ج خ ذ ز ش ة ض ظ غ ف ق ن ي
Un-pointed Letters: ا ح د ر س ص ط ع ک ل م ه و
In this study, the authors utilized a pseudo-space to mark the watermark bits according to pointed and un-pointed letters. The pseudo-space (ZWNJ: zero-width non-joiner, “U+200C”) is a zero-width character which separates joined letters in Persian/Arabic and has neither a written symbol nor width. If it is located between two joinable letters, they will be separated (e.g., می‌خواهم = میخواهم). In order to hide the watermark bits, this algorithm considers the combination of the space between words and the letter before it. If the watermark bit is zero and the letter is un-pointed (for simplicity, {0, un-pointed}), then the pseudo-space is embedded. If {1, un-pointed}, no pseudo-space is embedded. In the case of {1, pointed}, the pseudo-space is embedded. If {0, pointed}, no pseudo-space is embedded. This method provides high invisibility; however, it has relatively low capacity (one bit per pseudo-space) and is also vulnerable to tampering and retyping attacks [63].
Another invisible watermarking approach, which belongs to the structural watermarking category, was suggested by Taleby Ahvanooey et al. (2016). In this technique, the watermark message (the web page’s URL) is converted to a binary string, and the string is further encoded by a hash function. Then the hashed bits are embedded using invisible zero-width characters, as shown in Tables 9 and 10. This method hides a watermark string at the end of each sentence, which can be used as a tool to prove the ownership of web text documents [16].
TABLE 9. Unicode zero-width control character symbols in [16]

HTML Hex Code Unicode Hex Code Unicode Char Name
Right-to-Left Override U+202E ‮
Left -to- Right Override U+202D ‭
Pop Directional U+202C ‬
Formatting
Right-to-Left Override U+202E ‮
TABLE 10. Unicode groups pattern binary in [16]

Zero-Width Group Embedding HTML Code 3 Bit Classification
‮ ‮ ‮ 000
‭ ‭‮ 001
‮ ‭ ‮ 010
‭‮ ‮ 011
‬ ‮‮ 100
‬ ‮‬ 101
‬ ‭‭ 110
‬ ‭‮ 111
This technique was designed to protect Web pages against forgery and plagiarism attacks, and it provides high invisibility, high capacity, and optimum robustness. Moreover, it is applicable to multilingual text documents. Since the algorithm embeds the invisible watermark once after each dot character (.) or at the end of each sentence in the cover text, it is robust to tampering attacks, but it is still vulnerable to retyping attacks.
Alotaibi and Elrefaei (2017) proposed two open space based watermarking algorithms for Arabic texts [24]. In the first method, the dotting feature presented in [63] is utilized to improve the capacity of the previous work. In order to mark the watermark bits, the pseudo-space (ZWNJ) is embedded before and after
normal space depending on whether the character is pointed or un-pointed. In the second method, as shown in Table 11, four space characters are embedded beside the normal space.
TABLE 11. Space characters used in [24]
Character Name | Hex Code | Space Text-Face
Pseudo-space (ZWNJ) | U+200C | No width and no face
Thin Space | U+2009 | “ ”
Hair Space | U+200A | “ ”
Zero-Width Space | U+200B | No width and no face
In addition, each group of 4 bits from the watermark bits (binary sequence) is embedded by the corresponding space characters in order: the 1st bit is represented by the pseudo-space, the 2nd bit by the thin space, the 3rd bit by the hair space, and the 4th bit by the zero-width space. The presence of one of the four space characters means a bit ‘1’, and its absence a bit ‘0’. For example, if only the zero-width space is found between words, then it represents ‘0001’. The second method can be applied to multilingual text documents because the space character is one of the writing structures. This method suffers from low robustness because it embeds up to four spaces beside each normal space in the cover text. For example, if an attacker alters or deletes a part of the text (including normal spaces), the extraction algorithm fails to recover the whole watermark string, because a normal space without any other space character refers to the four bits “0000” in the watermark bits. Moreover, the authors claimed that their methods have high imperceptibility, but they used two spaces with different widths, which creates larger gaps between words in the watermarked text [24].
Rizzo et al. (2016) presented a text watermarking technique which is able to embed a password based watermark in Latin-based texts. This technique blends the original text and a user password through a hash function in order to compute the watermark. Then, it employs homoglyph Unicode characters and special spaces in order to embed the watermark bits in the cover text. The authors claimed that this technique can hide a 64-bit watermark in a short text of only 46 characters and that it provides high imperceptibility and high capacity. However, it is vulnerable to reformatting (e.g., changing the font of the watermarked text causes the watermark bits to be lost), tampering, and retyping attacks [34]. Because it relies on homoglyph Unicode characters, this method has low robustness against all the conventional attacks. Later on, Rizzo et al. [11] utilized the same method [34] to embed a watermark string in social media platforms.
Currently, the structural watermarking category is not greatly preferred, since the watermarked documents are not robust enough against conventional attacks such as insertion, removal, reformatting, and reordering. In addition, sometimes even a simple text conversion (e.g., webpage to doc file) causes watermark detection by structural techniques to fail. However, the structural techniques clearly provide imperceptibility and higher embedding capacity.
To demonstrate the hiding process of structural techniques, we implemented them on highlight examples, which are depicted in Table 12. Herein, implementation means evaluation of the selected techniques based on their embedding methods. Almost all the structural techniques provide high imperceptibility and better embedding capacity compared to the NL techniques.
TABLE 12. Implementation of structural techniques on highlight examples

Algorithm Original Text Watermarked Text Embedded bits
Bender et al., 1996 [44] Tom's leg is injured by falling. Tom's leg is injured by falling. 5
Lee and Tsai, 2008 [58] Tom's leg is injured by falling. Tom's leg is &#x32injured by 15
&#160falling.
Cheng et al., 2010 [59] 我喜欢一个苹果我喜欢一个苹果 12
I like an apple. I like an apple.
Chou and et al., 2012 Tom's leg is injured by falling. Tom's leg is injured by falling. 25
[62]
Gutub et al. 2010 [61] ‫متى‬, ‫استعبدتم‬, ‫ الناس‬, ‫ وقد‬, ‫ ولدتهم‬, ‫ امهاتهم‬,‫احرارا‬ ‫مـتى‬, ‫اسـتـعـبدتم‬, ‫ الناس‬, ‫ وقـد‬, ‫ ولـدتـهم‬, ‫ امـهاتهم‬,‫احرارا‬ 12
Por et al., 2012 [27] Tom's leg is injured by falling. Tom's leg is injured by falling. 10
Taleby and Tabasi, 2014 Tom's leg is injured by falling. Tom's leg is injured by falling. 4
[20]
Taleby Ahvanooey et al., Tom’s leg is injured by falling. Tom᠎'s leg is injured by falling. 32
2015 [21]
Taleby Ahvanooey et al., Tom’s leg is injured by falling. Tom’s leg is injured by falling. Total bits of
2016 [16] watermark
Alotaibi and Elrefaei, Tom’s leg is injured by falling. Tom’s leg is injured by falling. 20
2017 [24]
Rizzo et al.2016 [34] All the World All the  World 10
Rizzo et al.2017 [11]
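The first row of Table 12 can be reproduced with a minimal sketch of the inter-word variant described for Bender et al. [44], in which a single space encodes ‘0’ and a double space encodes ‘1’. The function names and the `n_bits` length argument are assumptions of this example; the inter-sentence and end-of-line variants are not covered.

```python
import re


def embed(cover: str, bits: str) -> str:
    """Encode one bit per inter-word gap: single space = '0', double space = '1'."""
    words = cover.split()
    if len(bits) > len(words) - 1:
        raise ValueError("not enough inter-word gaps for the payload")
    out = [words[0]]
    for i, word in enumerate(words[1:]):
        bit = bits[i] if i < len(bits) else "0"
        out.append(("  " if bit == "1" else " ") + word)
    return "".join(out)


def extract(stego: str, n_bits: int) -> str:
    """Read each run of spaces back as a bit and keep the first n_bits."""
    gaps = re.findall(r" +", stego)
    return "".join("1" if len(gap) == 2 else "0" for gap in gaps)[:n_bits]


cover = "Tom's leg is injured by falling."
stego = embed(cover, "10110")
print(repr(stego), extract(stego, 5))

# A retyping (or whitespace-normalizing) attack restores single spaces,
# so every extracted bit collapses to '0'.
retyped = " ".join(stego.split())
print(extract(retyped, 5))
```

The sentence has five inter-word gaps, hence the 5 embedded bits listed in the table, and the last two lines show why the text describes this method as unable to survive retyping.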
To have a fair comparison between structural techniques, we considered those techniques which can be applied to multilingual text documents. The Kashida based techniques are excluded because they focus on a specific feature of the Arabic language and can be applied only to Arabic, Persian, and Urdu texts.
In addition, we evaluated the selected techniques in terms of the criteria by implementing them on a simulated dataset. This dataset was made by randomly copying two sentences from each of the referenced websites, as depicted in Table 13. The detailed structures of the copied texts are summarized in Table 14.
Table 13. Text document examples

References Name Copied Text Content
ww.yjc.ir Doc1 ‫به گزارش خبرنگار فوتبال و فوتسال گروه ورزشی باشگاه خبرنگاران جوان؛ مهدی طارمی مهاجم محروم پرسپولیس‬
‫از باشگاه الغرافه قطر پیشنهاد دریافت کرده است و این باشگاه قطری طی نامه ای از باشگاه پرسپولیس درخواست‬
‫ طبق برنامه جلسه هیئت مدیره باشگاه پرسپولیس‬.‫کردند که رقم پیشنهادی خود را برای انتقال این بازیکن اعالم کنند‬
.‫امروز برگزار می شود و قرار است در مورد انتقال طارمی به باشگاه قطری تصمیم گیری شود‬
www.nytimes.com Doc2 WASHINGTON — White House officials on Monday mustered a sweeping defense of their
less-is-more public disclosure practices, arguing that releasing information on a wide array
of topics would strike a blow against personal privacy and impede President Trump’s ability
to govern. This stance, critics say, represents a shift from Mr. Trump’s own drain-the-
swamp campaign message and his promise to decrease the influence of lobbyists, special
interest groups and big political donors.
www.chinadaily.com
Doc3 其实，只要从宪法法院法官的“政治倾向”上来看，弹劾现任总统绝不是一件容易的事
。因为宪法法院的九名法官 (现在在任八名) 中需要六名以上持“赞成”意见才能弹
劾总统。
www.spiegel.de Doc4 Der spanische König Felipe VI. hat die politischen Entscheidungsträger in Katalonien zu
verantwortlichem Handeln aufgerufen. "In Katalonien darf der Weg nicht erneut zu
Konfrontation oder Ausschluss führen", warnte Felipe in seiner Weihnachtsansprache, die
am Abend vom spanischen Staatsfernsehen ausgestrahlt wurde.
Table 14. The detailed structures of the sample copied texts

Name | Characters | Dots (.) | Punctuation Characters | Words | Spaces | Paragraphs | Lines | Language
Doc 1 | 390 | 2 | 3 | 71 | 70 | 1 | 3 | Persian
Doc 2 | 482 | 3 | 14 | 70 | 70 | 1 | 4 | English
Doc 3 | 81 | 2 | 10 | 79 | 2 | 1 | 3 | Chinese
Doc 4 | 316 | 3 | 7 | 40 | 39 | 1 | 3 | German
Assume that we aim to protect the documents in the dataset by hiding a binary watermark (60 bits) in their text contents. We can then analyze the efficiency of the selected techniques in terms of the criteria. Table 15 indicates the embedding capacity of the evaluated techniques, calculated using equation (1). Moreover, Figure 5 illustrates the embedding capacity of the evaluated structural techniques.
TABLE 15. Embedding capacity analysis of structural techniques (bits per doc)
Algorithm Name | Doc 1 | Doc 2 | Doc 3 | Doc 4 | Average Capacity | Summary of Embedding Methods (BPL)
Bender et al., 1996 [44] | 73 | 74 | 5 | 42 | 49 | One bit per location (inter-word spaces, end of the lines, and between sentences)
Lee and Tsai, 2008 [58] | 210 | 210 | 6 | 117 | 135 | 3 bits per location (inter-word spaces)
Cheng et al., 2010 [59] | 852 | 840 | 948 | 480 | 781 | 12 bits per location (word colors)
Chou et al., 2012 [62] | 350 | 350 | 10 | 195 | 226 | 5 bits per location (inter-word spaces)
Por et al., 2012 [27] | 148 | 150 | 12 | 86 | 98 | 2 bits per location (inter-word, inter-sentence, end-of-line, and end of paragraph)
Taleby Ahvanooey and Tabasi, 2014 [20] | 6 | 28 | 20 | 14 | 17 | 2 bits per location (after punctuation characters)
Taleby et al., 2015 [21] | 48 | 224 | 160 | 42 | 118 | 16 bits per location (after punctuation characters, between sentences, and beginning of the paragraphs)
Taleby Ahvanooey et al., 2016 [16] | 120 | 180 | 120 | 180 | 150 | Total bits of watermark string per location (one copy of the watermark bits after each dot (.))
Alotaibi and Elrefaei, 2017 [24] (second method) | 280 | 280 | 8 | 156 | 181 | 4 bits per location (inter-word spaces)
[Figure 5: chart of the number of bits per doc (vertical axis, 0 to 1000) for each structural technique, with one series per document: Doc1 (Persian), Doc2 (English), Doc3 (Chinese), Doc4 (German).]
FIGURE 5. The embedding capacity of structural techniques (Bits per Doc)
As shown in Table 15 and Figure 5, the embedding capacity evaluation results conducted on the dataset demonstrate that some techniques provide high capacity, while others are not able to hide the whole of the watermark bits (60 bits). For example, the method of Por et al. (2012) is able to embed 148 bits into Doc1, 150 bits into Doc2, 12 bits into Doc3, and 86 bits into Doc4.
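Several rows of Table 15 can be checked with a short sketch. Equation (1) is not restated in this section, so the code assumes the simplest reading of the BPL column, namely capacity = number of embedding locations × bits per location; the location counts are copied from Table 14, and the helper names are hypothetical.

```python
# Counts copied from Table 14 (inter-word spaces and punctuation characters).
SPACES = {"Doc1": 70, "Doc2": 70, "Doc3": 2, "Doc4": 39}
PUNCTUATION = {"Doc1": 3, "Doc2": 14, "Doc3": 10, "Doc4": 7}
WATERMARK_BITS = 60  # size of the watermark assumed in the evaluation


def capacity(locations, bits_per_location):
    """Capacity per document, assuming capacity = locations * bits per location."""
    return {doc: n * bits_per_location for doc, n in locations.items()}


def fits(cap, needed=WATERMARK_BITS):
    """Documents whose capacity can hold the whole watermark."""
    return [doc for doc, bits in cap.items() if bits >= needed]


lee_tsai = capacity(SPACES, 3)          # 3 bits per inter-word space
chou = capacity(SPACES, 5)              # 5 bits per inter-word space
taleby_2014 = capacity(PUNCTUATION, 2)  # 2 bits per punctuation character

for name, cap in [("Lee and Tsai", lee_tsai), ("Chou et al.", chou),
                  ("Taleby Ahvanooey and Tabasi", taleby_2014)]:
    print(name, cap, "holds 60 bits in:", fits(cap))
```

Under this assumption the three computed rows match the per-document values in Table 15, and the `fits` helper makes explicit which documents can carry the full 60-bit watermark for each technique.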
TABLE 16. Approximate DR (%) of structural techniques against tampering attacks
Algorithm Name | Doc 1 | Doc 2 | Doc 3 | Doc 4 | Average Robustness
Bender et al., 1996 [44] | 81 | 85 | 93 | 86 | ≅ 86
Lee and Tsai, 2008 [58] | 82 | 85 | 97 | 88 | ≅ 88
Cheng et al., 2010 [59] | 81 | 85 | 2 | 87 | ≅ 64
Chou et al., 2012 [62] | 82 | 85 | 97 | 87 | ≅ 88
Por et al., 2012 [27] | 81 | 84 | 93 | 86 | ≅ 86
Taleby and Tabasi, 2014 [20] | 99 | 97 | 88 | 98 | ≅ 95
Taleby et al., 2015 [21] | 99 | 97 | 88 | 98 | ≅ 95
Taleby et al., 2016 [16] | 99 | 99 | 98 | 99 | ≅ 99
Alotaibi and Elrefaei, 2017 [24] | 82 | 85 | 97 | 88 | ≅ 88
Suppose a malicious user tampers with a character or a word of the watermarked text content. Can the watermark bits still be detected from the watermarked text by the extraction algorithm? To answer this question, we evaluated the approximate distortion robustness (DR) of each technique based on its embedding locations and the document features in Table 14, using equation (2) separately. The DR evaluation results are shown in Table 16. In addition, Figure 6 illustrates both the average capacity and the distortion robustness of the evaluated techniques.
FIGURE 6. The overlap between the average embedding capacity and DR results
Technique | DR (%) | Capacity (bits)
Bender et al. (1996) | 86 | 49
Lee and Tsai (2008) | 88 | 135
Cheng et al. (2010) | 64 | 781
Chou et al. (2012) | 88 | 226
Yee Por et al. (2012) | 86 | 98
Taleby and Tabasi (2014) | 95 | 17
Taleby et al. (2015) | 95 | 118
Taleby et al. (2016) | 99 | 150
Alotaibi and Elrefaei (2017) | 88 | 181
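One way to read the overlap in Figure 6 is to ask which techniques are not beaten on both criteria at once by any other technique. The sketch below applies a simple Pareto-dominance test to the average capacity and DR values listed for Figure 6. Pareto dominance is used here only as an illustrative reading of the trade-off and is not a method taken from the evaluated papers; the function and variable names are assumptions.

```python
# (average capacity in bits, average DR in %) as listed for Figure 6.
results = {
    "Bender et al. (1996)": (49, 86),
    "Lee and Tsai (2008)": (135, 88),
    "Cheng et al. (2010)": (781, 64),
    "Chou et al. (2012)": (226, 88),
    "Yee Por et al. (2012)": (98, 86),
    "Taleby and Tabasi (2014)": (17, 95),
    "Taleby et al. (2015)": (118, 95),
    "Taleby et al. (2016)": (150, 99),
    "Alotaibi and Elrefaei (2017)": (181, 88),
}

def dominates(a, b):
    """True if point a is at least as good as b on both criteria and differs from b."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b

# A technique is on the Pareto front if no other technique dominates it.
pareto_front = [
    name for name, point in results.items()
    if not any(dominates(other, point) for other in results.values())
]
print(pareto_front)
```

Under this reading, three techniques remain: Cheng et al. (2010) for its capacity, Taleby et al. (2016) for its DR, and Chou et al. (2012) in between. Which of them is preferable still depends on the priority that the application gives to each criterion.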
TABLE 17. Structural watermarking techniques criteria analysis
Algorithm name | Embedding capacity | Invisibility | DR | Limitations | Language compatibility
Bender et al., 1996 [44] | Low | Middle | Low | Unpleasant gaps between words | Multilingual
Lee and Tsai, 2008 [58] | Low | Middle | Modest | Unpleasant characters between words | Multilingual
Cheng et al., 2010 [59] | Very High | Middle | Modest | Highlighted words | Multilingual
Gutub et al., 2007 [60] | Low | Imperceptible | Modest | Unpleasant wide “Kashida” between words | Exclusive (Arabic/Persian)
Gutub et al., 2010 [61] | Medium | Imperceptible | Modest | Unpleasant wide “Kashida” between words | Exclusive (Arabic/Persian)
Alginahi et al., 2013 [25] | Medium | Imperceptible | Modest | Unpleasant wide “Kashida” between words | Exclusive (Arabic/Persian)
Alginahi et al., 2014 [26] | High | Imperceptible | Modest | Unpleasant wide “Kashida” between words | Exclusive (Arabic/Persian)
Chou et al., 2012 [62] | Medium | Imperceptible | Modest | Unpleasant gaps between words | Multilingual
Por et al., 2012 [27] | Medium | Imperceptible | Modest | Unpleasant gaps between words | Multilingual
Mir, 2014 [17] | High | Imperceptible | Modest | Only applicable to HTML files | Multilingual
Taleby Ahvanooey and Tabasi, 2014 [20] | Low | Imperceptible | High | Depends on punctuation characters | Multilingual
Taleby Ahvanooey et al., 2015 [21] | Medium | Imperceptible | High | Depends on punctuation characters | Multilingual
Alotaibi and Elrefaei, 2016 [63] | Low | Imperceptible | Modest | Depends on pointed and un-pointed characters | Exclusive (Arabic/Persian)
Taleby Ahvanooey et al., 2016 [16] | High | Imperceptible | High | Depends on dot (.) characters | Multilingual
Alotaibi and Elrefaei, 2017 [24] | Low | Imperceptible | Modest | Unpleasant wide “Kashida” and gaps between words | First: (Arabic/Persian); Second: (Multilingual)
Rizzo et al., 2016 [34]; Rizzo et al., 2017 [11] | High | Imperceptible | Low | Depends on font type (homoglyph Unicode characters) and unpleasant gaps between words | Exclusive (English)
Al-Nofaie et al., 2016 [69] | Medium | Imperceptible | Modest | Unpleasant wide “Kashida” and gaps between words | Exclusive (Arabic/Persian)
Table 17 presents a comparative analysis of structural techniques in terms of the criteria and language compatibility, along with their limitations. Although structural techniques have improved, especially in invisibility and embedding capacity, they still have modest robustness and remain vulnerable to tampering and retyping attacks. As shown in Table 4 and Table 17, we analyzed the distortion robustness criterion of the evaluated techniques according to their limitations against tampering attacks, by considering the probability of losing the embedded watermark bits. Furthermore, we evaluated the embedding capacity of each technique based on its embedding locations (bits per doc), and rated the invisibility of each technique based on the difference between the original text and the watermarked text.

In order to highlight the merits and demerits of the evaluated techniques, six types of conventional attacks are considered for evaluating their limitations: insertion, removal, re-ordering, reformatting, retyping, and copy & paste. Assume that a malicious user copies a portion (or the whole) of a watermarked text containing the watermark string into a new host file, and randomly alters it by means of these conventional attacks. In this case, if even one bit of the watermark is changed, the corresponding extraction algorithm fails to detect the watermark string. The evaluation results on the watermarked texts are listed in Table 18.
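The exact-match failure condition described above can be sketched with a toy structural watermark. This is a minimal illustration under stated assumptions, not the algorithm of any evaluated technique: the mapping of bit 0 to U+200C (zero-width non-joiner) and bit 1 to U+200D (zero-width joiner), the placement of the hidden run after the first word, and the helper names `embed`, `extract`, and `detect` are all hypothetical.

```python
ZERO, ONE = "\u200c", "\u200d"  # assumed mapping: ZWNJ encodes 0, ZWJ encodes 1

def embed(text: str, bits: str) -> str:
    """Hide the bit string as zero-width characters after the first word."""
    hidden = "".join(ONE if b == "1" else ZERO for b in bits)
    head, sep, tail = text.partition(" ")
    return head + hidden + sep + tail

def extract(text: str) -> str:
    """Read back every zero-width marker, in order, as a bit string."""
    return "".join("1" if ch == ONE else "0" for ch in text if ch in (ZERO, ONE))

def detect(text: str, bits: str) -> bool:
    """Detection succeeds only if every watermark bit matches exactly."""
    return extract(text) == bits

original = "Copyright protected sample text."
watermark = "101100"
marked = embed(original, watermark)

# Insertion attack: visible words are appended; the hidden run is untouched.
inserted = marked + " Some extra words."
# Retyping attack: only the visible characters are reproduced.
retyped = "".join(ch for ch in marked if ch not in (ZERO, ONE))
# Tampering that changes a single hidden bit (the first 1 becomes 0).
flipped = marked.replace(ONE, ZERO, 1)

for label, candidate in [("marked", marked), ("inserted", inserted),
                         ("retyped", retyped), ("flipped", flipped)]:
    print(label, detect(candidate, watermark))
```

In this toy scheme the watermark survives the insertion of visible text, is lost entirely under retyping, and fails detection as soon as one hidden bit changes.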
TABLE 18. A comparative analysis of evaluated techniques against conventional attacks. Robustness against each attack: Yes (✓) and No (×)
Algorithm name | Security limitations | Insertion | Removal | Reformatting | Re-ordering | Retyping | Copy & Paste
Topkara et al. (2006a) [42] | Medium safety (3) | × | × | ✓ | × | ✓ | ✓
Topkara et al. (2006b) [43] | Medium safety (3) | × | × | ✓ | × | ✓ | ✓
Meral et al. (2007) & (2009) [52], [53] | Medium safety (3) | × | × | ✓ | × | ✓ | ✓
Kim (2008) [41] | Medium safety (3) | × | × | ✓ | × | ✓ | ✓
Kim et al. (2010) [54] | Medium safety (3) | × | × | ✓ | × | ✓ | ✓
Halvani et al. (2013) [55] | Medium safety (3) | × | × | ✓ | × | ✓ | ✓
Mali et al. (2013) [56] | Medium safety (3) | × | × | ✓ | × | ✓ | ✓
Lu et al. (2009) [57] | Medium safety (3) | × | × | ✓ | × | ✓ | ✓
Bender et al. (1996) [44] | Medium safety (3) | ✓ | × | ✓ | × | × | ✓
Lee and Tsai (2008) [58] | Medium safety (3) | ✓ | × | ✓ | × | × | ✓
Cheng et al. (2010) [59] | Easy to lose (1) | ✓ | × | × | × | × | ×
Gutub et al. (2007) [60] | Medium safety (3) | ✓ | × | ✓ | × | × | ✓
Gutub et al. (2010) [61] | Medium safety (3) | ✓ | × | ✓ | × | × | ✓
Alginahi et al. (2013) [25] | Medium safety (3) | ✓ | × | ✓ | × | × | ✓
Alginahi et al. (2014) [26] | Medium safety (3) | ✓ | × | ✓ | × | × | ✓
Chou et al. (2012) [62] | Medium safety (3) | ✓ | × | ✓ | × | × | ✓
Por et al. (2012) [27] | Medium safety (3) | ✓ | × | ✓ | × | × | ✓
Mir (2014) [17] | Unsafe (0) | × | × | × | × | × | ×
Taleby and Tabasi (2014) [20] | Optimum safety (4) | ✓ | × | ✓ | ✓ | × | ✓
Taleby et al. (2015) [21] | Optimum safety (4) | ✓ | × | ✓ | ✓ | × | ✓
Alotaibi and Elrefaei (2016) [63] | Medium safety (3) | ✓ | × | ✓ | × | × | ✓
Taleby et al. (2016) [16] | Optimum safety (4) | ✓ | × | ✓ | ✓ | × | ✓
Alotaibi and Elrefaei (2017) [24] | Medium safety (3) | ✓ | × | ✓ | × | × | ✓
Rizzo et al. (2016) [34]; Rizzo et al. (2017) [11] | Easy to lose (2) | × | × | ✓ | × | × | ✓
Al-Nofaie et al. (2016) [69] | Medium safety (3) | ✓ | × | ✓ | × | × | ✓
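The security ratings in Table 18 appear to follow from the number of attacks a technique withstands. The sketch below makes that reading explicit. It assumes that the number in parentheses counts the attacks survived and that the labels map to the counts as 0 = Unsafe, 1 to 2 = Easy to lose, 3 = Medium safety, and 4 or more = Optimum safety. These thresholds are inferred from the rows of the table and are not stated by the authors; the function name `safety` is hypothetical.

```python
from typing import Set, Tuple

ATTACKS = ("insertion", "removal", "reformatting",
           "re-ordering", "retyping", "copy & paste")

def safety(survived: Set[str]) -> Tuple[int, str]:
    """Count the attacks a technique withstands and map the count to a label."""
    score = len(survived & set(ATTACKS))
    if score == 0:
        label = "Unsafe"
    elif score <= 2:
        label = "Easy to lose"
    elif score == 3:
        label = "Medium safety"
    else:
        label = "Optimum safety"
    return score, label

# A few rows of Table 18, expressed as the set of attacks each technique survives.
rows = {
    "Mir (2014)": set(),
    "Cheng et al. (2010)": {"insertion"},
    "Rizzo et al. (2016)": {"reformatting", "copy & paste"},
    "Bender et al. (1996)": {"insertion", "reformatting", "copy & paste"},
    "Taleby et al. (2016)": {"insertion", "reformatting",
                             "re-ordering", "copy & paste"},
}

ratings = {name: safety(survived) for name, survived in rows.items()}
for name, (score, label) in ratings.items():
    print(f"{name}: {label} ({score})")
```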
As depicted in Table 18, almost all the evaluated techniques have limitations; however, some of them provide more safety than others. In practice, programmers must prioritize the criteria according to whether a fragile or a robust watermark is needed, and then select a suitable technique whose security limitations provide the most safety in that application.

4. Suggestions for Future Work

Information hiding is a powerful and flexible technique that can be employed in various ways to protect valuable information in areas such as copyright protection, secure communication, and authentication. Although the efficiency of text hiding techniques has drawn much attention from academic researchers, a precise analysis model that takes the intrinsic criteria of the text hiding industry into account when evaluating efficiency is still lacking. As we already pointed out, there are four common criteria for efficiency analysis, which depend on the way of embedding. In other words, the embedding method generally determines how the efficiency of a text hiding technique should be analyzed.
TABLE 19. A general comparison of two major techniques (Linguistic and Structural)
Factors | Linguistic (Natural language) | Structural
Language compatibility | Exclusive (special language) | Multilingual
Embedding capacity | Low | Medium and High
Meaning alteration of text content | Alters the meaning | No effect on text content
Invisibility | Imperceptible | Imperceptible
Computational complexity | Very large (due to the search algorithm in dictionary and replacement of words) | Medium (embedding in special locations)
Robustness against conventional attacks | Low | Modest
Security | Medium safety | Optimum safety
Therefore, to evaluate the efficiency of a certain algorithm, it must be compared with previous works within the same category (e.g., linguistic or structural). In addition, we have outlined the limitations of the two major categories of text hiding techniques in Table 19, which provides a better understanding of the state of the art and will hopefully help in developing future works.

Practically, the linguistic techniques have more limitations than the structural techniques. Moreover, because they require extra dictionaries (e.g., WordNet) and incur a high computational cost, few researchers have focused on linguistic methods in recent years. Over the last two decades, many structural techniques have been introduced to improve the efficiency of text hiding by considering the optimum trade-offs between criteria. However, the robustness of these techniques against tampering attacks still needs to be improved in terms of security requirements. In the following, we suggest some directions aimed at guiding cyber security researchers on the best options for utilizing various types of text hiding techniques, depending on the characteristics of the applications. However, we have to mention that these suggestions are general and empirically derived rules of thumb; these guidelines must not be applied rigidly or dogmatically.

• Where the main concern is protecting valuable documents against retyping attacks, the NL based technique is the best tool to meet that requirement.
• Where the main concern is protecting digital text documents against tampering, reformatting, and re-ordering attacks, the structural techniques can be applied as a fragile or robust tool for different applications (e.g., copyright protection, authentication, proof of ownership, etc.).
• Since the zero-width characters provide high invisibility and compatibility with other languages in different file formats (e.g., Web, Word, PDF, etc.), they can be used as an imperceptible way of hiding secret information in Unicode digital texts.
• High or low robustness can be provided by choosing the embedding locations according to their probability of distortion under tampering attacks (e.g., the ends of sentences or the beginnings of paragraphs).
• New binary encoding (lossless compression) algorithms can be used to convert a watermark string to a binary string (2-bit, 3-bit, 4-bit, etc.).
• Hash functions could be used to secure the watermark bits against unpredictable estimation-based attacks.
• The structural techniques can be applied as a security tool in version control systems (VCS) for protecting open source programs against reverse engineering.
• The ideal text watermarking algorithm should provide optimum trade-offs among the three criteria (embedding capacity, invisibility, and robustness) to achieve a high level of security.

To sum up, which kind of technique provides more accuracy for copyright protection of text documents? We cannot give a precise and perfect answer to this question. Researchers must take into account many things, such as the various merits and demerits of text hiding techniques, together with the guidelines that we have collected. In addition, they should consider whether a text hiding technique is appropriate for their application. When a researcher realizes that some of the merits of a specific technique can provide a valuable benefit to the specific needs of the application at issue, the technique should probably be given a try.

5. Conclusion

This case study provides a comparative analysis of existing information hiding techniques, especially those that focus on altering the structure and content of digital texts for copyright protection. We looked at a range of available approaches and attacks on digital text documents in order to explain current security issues in the copyright protection industry. Moreover, we outlined two categories of text watermarking techniques based on how they process digital texts to embed the watermark bits, namely linguistic (or natural language) and structural (format based). Linguistic techniques alter the text content and sometimes even the original meaning of sentences for
embedding the watermark, which is not desirable and hence hard to apply. This kind of method is not suitable for protecting sensitive documents. The structural techniques utilize some characteristics of text such as layout features (e.g., inter-word spaces, inter-line spaces, etc.) and format (e.g., text color, text font, text height, etc.). Format based methods do not retain the watermark against reformatting, conversion, and sometimes even a simple copy of the text into another file. Those structural techniques that utilize Unicode control characters (e.g., zero-width spaces, special spaces, etc.) for embedding the watermark bits into the original text are able to protect the watermarked text against reformatting, tampering, and copy attacks to some extent. This kind of technique can be applied to sensitive documents because it has shown a greater degree of imperceptibility and optimum robustness. Finally, we have suggested some guidelines and directions that could merit further attention in future works.

Acknowledgments
This paper was supported by the Fundamental Research Funds of China for the Central Universities (No. 30916015104); the National key research and development program of China: key projects of international scientific and technological innovation cooperation between governments (No. 2016YFE0108000); the CERNET Next Generation Internet Technology Innovation Project (NGII20160122); and the project of ZTE Cooperation Research (2016ZTE04-11).

6. References

[1] Z. Jalil and A. M. Mirza, “A Review of Digital Watermarking Techniques for Text Documents”, International Conference on Information and Multimedia Technology, 200; pp.230-234.
[2] P. Mahua, “A Survey on Digital Watermarking and its Application”, International Journal of Advanced Computer Science and Applications, 2016; Vol.7, No.1, pp.153-156.
[3] A. H. Abdullah, “Data Security Algorithm Using Two-Way Encryption and Hiding in Multimedia Files”, International Journal of Scientific & Engineering Research, 2014; Vol.5, No.12, pp.471-475.
[4] M. A. Qadir and I. Ahmad, “Digital text watermarking: Secure content delivery and data hiding in digital documents”, IEEE Aerospace and Electronic Systems Magazine, 2006; Vol.21, No.1, pp.18-21.
[5] Fabien A. P. Petitcolas, J. Ross Anderson and Markus G. Kuhn, “Information Hiding, A Survey”, Proceedings of the IEEE, special issue on protection of multimedia content, 1999; Vol.87, No.7, pp.1062-1078.
[6] N. A. A. S. Al-Maweri, A. Roslizah, W. A. W. Adnan, A. R. Ramli and S. M. S. A. Abdul Rahman, “State-of-the-Art in Techniques of Text Digital Watermarking: Challenges and Limitations”, Journal of Computer Sciences, 2016; Vol.12, No.2, pp.2079-2094. DOI: 10.3844/jcssp.2016.62.80
[7] P. Singh, R. S. Chadha, “A Survey of Digital Watermarking Techniques, Applications and Attacks”, International Journal of Engineering and Innovative Technology, 2013; Vol.2, No.9, pp.165-175.
[8] M. Agarwal, “Text Steganographic Approaches: a comparison”, International Journal of Network Security & Its Applications, 2013; Vol.5, No.1, pp.9-25. DOI: 10.5121/ijcses.2013.4602
[9] J. Guru, H. Damecha, “Digital Watermarking Classification: A Survey”, International Journal of Computer Science Trends and Technology, 2014; Vol.2, No.5, pp.122-124.
[10] M. H. Alkawaz, G. Sulong, S. Tanzila, A. S. Almazyad, A. Rehman, “Concise analysis of current text automation and watermarking approaches”, Journal of Security and Communication Networks, 15 Feb 2017. DOI: 10.1002/sec.1738
[11] S. G. Rizzo, F. Bertini, D. Montesi, and C. Stomeo, “Text Watermarking in Social Media”, In Proceedings of 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017 (ASONAM '17).
[12] A. M. Alhusban, & Jehad Q. O. Alnihoud, (2017), “A Meliorated Kashida Based Approach for Arabic Text Steganography”, International Journal of Computer Science & Information Technology (IJCSIT), Vol.9(2), pp.99-109.
[13] A. Manikandan, T. Meyyappan, “Watermarking Techniques”, International Journal of Computers & Technology, 2012; Vol.2, No.3, pp.122-124.
[14] Y. X. Gu, B. Wyseur, B. Preneel, “Software-Based Protection Is Moving to the Mainstream”, IEEE Computer Society, 2011; pp.56-59.
[15] C. P. Sumathi, T. Santanam, G. Umamaheswari, “A Study of Various Steganographic Techniques Used for Information Hiding”, International Journal of Computer Science & Engineering Survey, 2013; Vol.4, No.6, pp.9-25. DOI: 10.5121/ijcses.2013.4602
[16] M. Taleby Ahvanooey, H. Dana Mazraeh, & S. H. Tabasi, “An innovative technique for web text watermarking (AITW)”, Information Security Journal: A Global Perspective, Vol.25, No.6, 2016, pp.191-196. DOI: 10.1080/19393555.2016.1202356
[17] N. Mir, “Copyright for web content using invisible watermarking”, Computers in Human Behavior, 2014; Vol.30, pp.648-653. DOI: 10.1016/j.chb.2013.07.040
[18] K. F. Rafat, M. Sher, “Secure Digital Steganography for ASCII Text Documents”, Arabian Journal for Science and Engineering, 2013; Vol.38(8), pp.2079-2094. DOI: 10.1007/s13369-013-0574-5
[19] E. Sruthi, A. Scaria, A. T. Ambikadevi, “Lossless Data Hiding Method Using Multiplication Property for HTML File”, International Journal for Innovative Research in Science & Technology, 2015; Vol.1, No.11, pp.420-425.
[20] M. Taleby Ahvanooey, S. H. Tabasi, “A new method for copyright protection in digital text documents by adding hidden Unicode characters in Persian/English texts”, International Journal of Current Life Sciences, 2014; Vol.4, No.8, pp.4895-4900.
[21] M. Taleby Ahvanooey, S. H. Tabasi, S. Rahmany, “A Novel Approach for text watermarking in digital documents by Zero-Width Inter-Word Distance Changes”, DAV International Journal of Science, 2015; Vol.4, No.3, pp.550-558.
[22] Z. Jalil, A. M. Mirza, “A robust zero-watermarking algorithm for copyright protection of text documents”, Journal of the Chinese Institute of Engineers, 2013; Vol.36, No.2, pp.180-189. DOI: 10.1080/02533839.2012.734470
[23] M. Bashardoost, M. S. M. Rahim, N. Hadipour, “A Novel Zero-Watermarking Scheme for Text Document Authentication”, Jurnal Teknologi, 2015; pp.49-56. DOI: 10.11113/jt.v75.5066
[24] R. A. Alotaibi, L. A. Elrefaei, “Improved capacity Arabic text watermarking methods based on open word space”, Journal of King Saud University Computer and Information Sciences, 2017; Vol.29, No.1, pp.1-13. DOI: 10.1016/j.jksuci.2016.12.007
[25] Y. M. Alginahi, M. N. Kabir, O. Tayan, “An enhanced Kashida-based watermarking approach for Arabic text-documents”, International Conference on Electronics, Computer and Computation (ICECCO), IEEE, 2013; pp.301-304. DOI: 10.1109/ICECCO.2013.6718288
[26] Y. M. Alginahi, M. N. Kabir, O. Tayan, “An enhanced Kashida based watermarking approach for increased protection in Arabic text-documents based on frequency recurrence of characters”, International Journal of Computer and Electrical Engineering, 2014; Vol.6, No.5, pp.381-392. DOI: 10.17706/ijcee.2014.v6.857
[27] L. Y. Por, K. Wong, K. O. Chee, “A text-based data hiding method using Unicode space characters”, The Journal of Systems and Software, 2012; Vol.85, No.5, pp.1075-1082. DOI: 10.1016/j.jss.2011.12.023
[28] M. D. Preda, M. Pasqua, “Software Watermarking: A Semantics-based Approach”, Electronic Notes in Theoretical Computer Science, Vol.331, 2017; pp.71-85.
[29] J. Gu, Y. Cheng, “A Watermarking Scheme for Natural Language Documents”, The 2nd IEEE International Conference on Management and Engineering (ICIME), 2010. DOI: 10.1109/ICIME.2010.5477622
[30] J. Raj J., P. Nitin N., “Implementation of a New Technique for Web Document Protection Using Unicode”, Information Conference on Communication and Embedded Systems (ICICES), 2013. DOI: 10.1109/ICICES.2013.6508287
[31] T. Liu, and W. Tsai, “A New Steganographic Method for Data Hiding in Microsoft Word Documents by a Change Tracking Technique”, IEEE Transactions on Information Forensics and Security, Vol.2, No.1, 2007; pp.24-31.
[32] A. A. Mohamed, “An improved algorithm for information hiding based on features of Arabic text: A Unicode approach”, Egyptian Informatics Journal, Vol.15, No.2, 2014; pp.79-87.
[33] N. A. S. Al-maweri et al., “Robust Digital Text Watermarking Algorithm based on Unicode Extended Characters”, Indian Journal of Science and Technology, Vol.9, No.48, 2016; pp.1-14.
[34] S. G. Rizzo, F. Bertini, D. Montesi, “Content-preserving Text Watermarking through Unicode Homoglyph Substitution”, IDEAS '16 Proceedings of the 20th International Database Engineering & Applications Symposium, 2016; pp.97-104.
[35] Y. Zhang, H. Qin, T. Kong, “A Novel Robust Text Watermarking for Word Document”, 3rd International Congress on Image and Signal Processing (CISP2010), 2010. DOI: 10.1109/CISP.2010.5648007
[36] H. O. N. Hebah, “Digital Watermarking a Technology Overview”, International Journal of Research and Reviews in Applied Sciences, 2011; Vol.6, No.1, pp.98-102.
[37] “The Unicode Standard”, May 2017, http://www.unicode.org/standard/standard.html
[38] “Unicode”, Wikipedia (the free encyclopedia), May 2017, https://en.wikipedia.org/wiki/Unicode
[39] “Unicode Control characters”, Fileformate, May 2017, http://www.fileformat.info/info/unicode/char/search.htm
[40] M. Kaur, K. Mahajan, “An Existential Review on Text Watermarking Techniques”, International Journal of Computer Applications, 2015; Vol.120, No.7, pp.29-32.
[41] M. Y. Kim, “Text watermarking by syntactic analysis”, Proceedings of the 12th WSEAS International Conference on Computers (ICC '08), World Scientific and Engineering Academy and Society, Heraklion, Greece, 2008; pp.904-909.
[42] M. Topkara, U. Topkara, M. J. Atallah, “Words are not enough: Sentence level natural language watermarking”, MCPS '06, Santa Barbara, 2006a. DOI: 10.1145/1178766.1178777
[43] U. Topkara, M. Topkara, M. J. Atallah, “The Hiding Virtues of Ambiguity: Quantifiably Resilient Watermarking of Natural Language Text through Synonym Substitutions”, MM&Sec '06 Proceedings of the 8th workshop on Multimedia and security, 2006b; pp.167-174. DOI: 10.1145/1161366.1161397
[44] W. Bender, D. Gruhl, N. Morimoto, A. Lu, “Techniques for data hiding”, IBM Systems Journal, 1996; Vol.35, No.4, pp.313-336.
[45] J. T. Brassil, S. Low, N. F. Maxemchuk, “Copyright protection for the electronic distribution of text documents”, Proceedings of IEEE, 1999; Vol.87, No.7, pp.1181-1196. DOI: 10.1109/5.771071
[46] R. Petrović, B. Tehranchi, M. J. Winograd, (2007), “Security of Copy-Control Watermarks”, 8th International Conference on Telecommunications in Modern Satellite Cable and Broadcasting Services, pp.117-126. DOI: 10.1109/TELSKS.2007.4375954
[47] O. Vybornova, B. Macq, “Natural Language Watermarking and Robust Hashing Based on Presuppositional Analysis”, IEEE International Conference on Information Reuse and Integration, 2007; pp.177-182. DOI: 10.1109/IRI.2007.4296617
[48] Z. Jalil, A. M. Mirza, T. Iqbal, “A Zero-Watermarking Algorithm for Text Documents based on Structural Components”, IEEE International Conference on Information and Emerging Technologies, 2010; pp.1-5. DOI: 10.1109/ICIET.2010.5625705
[49] M. Bashardoost, M. S. M. Rahim, T. Saba, A. Rehman, “Replacement Attack: A New Zero Text Watermarking Attack”, 3D Research, 2017; Vol.8, pp.2-9. DOI: 10.1007/s13319-017-0118-y
[50] F. M. Ba-Alwi, M. M. Ghilan, F. N. Al-Wesabi, “Content Authentication of English Text via Internet using Zero Watermarking Technique and Markov Model”, International Journal of Applied Information Systems, 2014; Vol.7, No.1, pp.25-36.
[51] M. Tanha, S. D. S. Torshizi, M. T. Abdullah, F. Hashim, “An Overview of Attacks against Digital Watermarking and their Respective Countermeasures”, IEEE International Conference on Cyber Security, Cyber Warfare and Digital Forensic (CyberSec), 2012; pp.265-270. DOI: 10.1109/CyberSec.2012.6246095
[52] H. M. Meral, E. Sevinç, E. Ünkar, B. Sankur, A. S. Özsoy et al., “Natural language watermarking via morphosyntactic alterations”, Proc. SPIE 6505, Security, Steganography, and Watermarking of Multimedia Contents, 2007. DOI: 10.1117/12.708111
[53] H. M. Meral, B. Sankur, A. S. Özsoy, T. Güngör, E. Sevinç, (2009), “Natural language watermarking via morphosyntactic alterations”, Computer Speech and Language, Vol.23, pp.107-125. DOI: 10.1016/j.csl.2008.04.001
[54] M. Y. Kim, O. R. Zaiane, R. Goebel, “Natural Language Watermarking Based on Syntactic Displacement and Morphological Division”, Computer Software and Applications Conference Workshops (IEEE COMPSACW), 2010. DOI: 10.1109/COMPSACW.2010.37
[55] O. Halvani, M. Steinebach, P. Wolf, R. Zimmermann, “Natural language watermarking for German texts”, Proceedings of the 1st ACM Workshop on Information Hiding and Multimedia Security, Jun. 17-19, ACM, Montpellier, France, 2013; pp.193-202. DOI: 10.1145/2482513.2482522
[56] M. L. Mali, N. N. Patil, J. B. Patil, “Implementation of text watermarking technique using natural language watermarks”, Proceedings of the International Conference on Communication Systems and Network Technologies, IEEE, 2013; pp.482-486. DOI: 10.1109/CSNT.2013.106
[57] H. Lu, M. Guangling, F. DingYi, G. XiaoLin, “Resilient Natural Language Watermarking Based on Pragmatics”, IEEE Youth Conference on Information, Computing and Telecommunication (YC-ICT '09), 2009. DOI: 10.1109/YCICT.2009.5382387
[58] I. S. Lee, W. H. Tsai, “Secret communication through web pages using special space codes in HTML files”, International Journal of Applied Science and Engineering, 2008; Vol.6, pp.141-149.
[59] W. Cheng, H. Feng, C. Yang, “A robust text digital watermarking algorithm based on fragments regrouping strategy”, IEEE International Conference on Information Theory and Information Security (ICITIS), 2010; pp.600-603.
[60] A. A. A. Gutub, L. Ghouti, A. A. Amin, T. M. Alkharobi, M. Ibrahim, “Utilizing extension character ‘Kashida’ with pointed letters for Arabic text digital watermarking”, In: SECRYPT, 2007; pp.329-332.
[61] A. A. A. Gutub, F. Al-Haidari, K. M. Al-Kahsah, J. Hamodi, “E-text watermarking: Utilizing ‘kashida’ extensions in Arabic language electronic writing”, Journal of Emerging Technologies in Web Intelligence, 2010; Vol.2, pp.48-55. DOI: 10.4304/jetwi.2.1.48-55
[62] Y. Chou, C. Huang, H. Liao, “A Reversible Data Hiding Scheme Using Cartesian Product for HTML File”, Sixth International Conference on Genetic and Evolutionary Computing (ICGEC), 2012. DOI: 10.1109/ICGEC.2012.30
[63] R. A. Alotaibi, L. A. Elrefaei, “Utilizing word space with pointed and un-pointed letters for Arabic text watermarking”, In: UKSim-AMSS 18th International Conference on Computer Modelling and Simulation, 2016; pp.111-116.
[64] M. Shirali-Shahreza, “Pseudo-space Persian/Arabic text steganography”, in Computers and Communications, ISCC 2008, IEEE Symposium on, July 2008; pp.864-868. DOI: 10.1109/ISCC.2008.4625605
[65] A. A. A. Gutub, and M. M. Fattani, (2007), “A Novel Arabic Text Steganography Method Using Letter Points and Extensions”, International Journal of Computer, Electrical, Automation, Control and Information Engineering, Vol.1(3), pp.502-505.
[66] A. A. A. Gutub, & A. A. Al-Nazer, (2010), “High Capacity Steganography Tool for Arabic Text Using ‘Kashida’”, The ISC Int'l Journal of Information Security, Vol.2(2), pp.107-118.
[67] A. A. A. Gutub, W. Al-Alwani, & A. B. Mahfoodh, (2010), “Improved Method of Arabic Text Steganography Using the Extension ‘Kashida’ Character”, Bahria University Journal of Information & Communication Technology, Vol.3(1), pp.68-72.
[68] A. Al-Nazer, & A. A. A. Gutub, (2009), “Exploit Kashida Adding to Arabic e-Text for High Capacity Steganography”, IEEE 2009 Third International Conference on Network and System Security, pp.447-451.
[69] S. M. Al-Nofaie, M. M. Fattani, A. A. A. Gutub, (2016), “Capacity Improved Arabic Text Steganography Technique Utilizing ‘Kashida’ with Whitespaces”, The 3rd International Conference on Mathematical Sciences and Computer Engineering (ICMSCE 2016), pp.38-44.
[70] S. M. Al-Nofaie, M. M. Fattani, A. A. A. Gutub, (2016), “Merging Two Steganography Techniques Adjusted to Improve Arabic Text Data Security”, Journal of Computer Science & Computational Mathematics, Vol.6. DOI: 10.20967/jcscm.2016.03.004
