How would you code an anti-plagiarism site?

First, please note that I am only interested in how something like this would work; I’m not intending to build it for a client, etc., as I’m sure there are already open-source implementations.

How do the algorithms that detect plagiarism in uploaded text work? Do they use regular expressions to send all words to an index, strip out known words like ‘the’, ‘a’, etc., and then see how many words are the same in different essays? Do they then have a magic number of identical words that flags a pair as a possible duplicate? Do they use levenshtein()?

My language of choice is PHP.
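
To make that concrete, the kind of naive check I have in mind would look something like the sketch below (completely untested; the stop-word list and the 0.6 threshold are arbitrary placeholders, and levenshtein() would presumably only be run on individual suspicious sentences):

// Rough sketch of the naive idea: strip stop words, then see what share of
// the remaining distinct words two essays have in common.
function shared_word_ratio($textA, $textB) {
  $stopWords = array('the', 'a', 'an', 'and', 'of', 'to', 'in', 'is', 'it');
  $wordsA = array_diff(array_unique(str_word_count(strtolower($textA), 1)), $stopWords);
  $wordsB = array_diff(array_unique(str_word_count(strtolower($textB), 1)), $stopWords);
  if (count($wordsA) == 0 || count($wordsB) == 0) {
    return 0;
  }
  return count(array_intersect($wordsA, $wordsB)) / min(count($wordsA), count($wordsB));
}

$essay1 = 'The quick brown fox jumps over the lazy dog.';
$essay2 = 'A quick brown fox jumped over a lazy dog.';
if (shared_word_ratio($essay1, $essay2) > 0.6) {  // magic number, to be tuned
  echo "Possible duplicate?\n";
}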

UPDATE

I’m thinking of not checking for plagiarism globally, but rather within, say, 30 essays uploaded by a class, in case students have gotten together on a strictly one-person assignment.

Here is an online site that claims to do so: http://www.plagiarism.org/

Here are the solutions:

We have several solutions to this problem, but we recommend the first one because it has been tested and is the most likely to work for you.

Solution 1

Good plagiarism detection will apply heuristics based on the type of document (e.g. an essay or program code in a specific language).

However, you can also apply a general solution. Have a look at the Normalized Compression Distance (NCD). Obviously you cannot exactly calculate a text’s Kolmogorov complexity, but you can approximate it by simply compressing the text.

A smaller NCD indicates that two texts are more similar. Some compression
algorithms will give better results than others. Luckily PHP provides support
for several compression algorithms, so you can have your NCD-driven plagiarism
detection code running in no-time. Below I’ll give example code which uses
Zlib:

PHP:

function ncd($x, $y) {
  // Compressed size of each text on its own.
  $cx = strlen(gzcompress($x));
  $cy = strlen(gzcompress($y));
  // NCD: how much extra the concatenation costs to compress,
  // relative to the larger of the two compressed sizes.
  return (strlen(gzcompress($x . $y)) - min($cx, $cy)) / max($cx, $cy);
}

print(ncd('this is a test', 'this was a test') . "\n");
print(ncd('this is a test', 'this text is completely different') . "\n");

Python:

>>> from zlib import compress as c
>>> def ncd(x, y):  # x and y are byte strings
...     cx, cy = len(c(x)), len(c(y))
...     return (len(c(x + y)) - min(cx, cy)) / max(cx, cy)
...
>>> ncd(b'this is a test', b'this was a test')
0.30434782608695654
>>> ncd(b'this is a test', b'this text is completely different')
0.74358974358974361

Note that for larger texts (read: actual files) the results will be much more
pronounced. Give it a try and report your experiences!
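
Since the update mentions roughly 30 essays from one class, a simple pairwise loop over the uploads may already be enough. The sketch below assumes the essays sit as plain-text files in an essays/ directory and reuses the ncd() function above; the 0.5 cut-off is just a starting point you would have to tune on essays known to be independent:

$files = glob('essays/*.txt');
$threshold = 0.5; // placeholder value, not empirically validated

foreach ($files as $i => $fileA) {
  for ($j = $i + 1; $j < count($files); $j++) {
    $score = ncd(file_get_contents($fileA), file_get_contents($files[$j]));
    if ($score < $threshold) {  // smaller NCD = more similar
      printf("%s vs %s: NCD %.3f - worth a manual look\n",
             basename($fileA), basename($files[$j]), $score);
    }
  }
}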

Solution 2

I think that this problem is complicated, and doesn’t have one best solution.
You can detect exact duplication of words at the whole-document level (i.e. someone downloads an entire essay from the web) all the way down to the phrase level. Doing this at the document level is pretty easy – the most trivial solution would take the checksum of each document submitted and compare it against a list of checksums of known documents. After that you could try to detect plagiarism of ideas, or find sentences that were copied directly and then changed slightly in order to throw off software like this.
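
A minimal sketch of that document-level checksum idea, assuming you keep a table of hashes of previously seen essays (the light normalisation of whitespace and case is my own addition; anything more than a trivial edit will defeat a checksum check anyway):

// Whole-document duplicate detection via a checksum of normalised text.
function document_checksum($text) {
  $normalized = strtolower(preg_replace('/\s+/', ' ', trim($text)));
  return sha1($normalized);
}

$knownChecksums = array(/* sha1 hashes of documents already on file */);
$submission = file_get_contents('uploaded_essay.txt');

if (in_array(document_checksum($submission), $knownChecksums)) {
  echo "Matches a known document (up to whitespace and case).\n";
}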

To get something that works at the phrase level, you might need to get more sophisticated if you want any level of efficiency. For example, you could look for differences in writing style between paragraphs, and focus your attention on paragraphs that feel “out of place” compared to the rest of a paper.

There are lots of papers on this subject out there, so I suspect there is no one perfect solution yet. For example, these two papers give introductions to some of the general issues with this kind of software, and have plenty of references that you could dig deeper into if you’d like.

http://ir.shef.ac.uk/cloughie/papers/pas_plagiarism.pdf

http://proceedings.informingscience.org/InSITE2007/IISITv4p601-614Dreh383.pdf

Solution 3

Well, you first of all have to understand what you’re up against.

Word-for-word plagiarism should be ridiculously easy to spot. The most naive approach would be to take word tuples of sufficient length and compare them against your corpus. The sufficient length can be incredibly low. Compare Google results:

"I think" => 454,000,000
"I think this" => 329,000,000
"I think this is" => 227,000,000
"I think this is plagiarism" => 5

So even with that approach you have a very high chance of finding a good match or two (fun fact: most criminals are really dumb).
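
A minimal sketch of that word-tuple idea, assuming the corpus document is available as a plain string and using a tuple length of 4 (both choices are arbitrary; a real system would store the tuples in a database or inverted index rather than compare texts one by one):

// Split a text into overlapping word tuples ("shingles") of length $n.
function word_shingles($text, $n = 4) {
  $words = str_word_count(strtolower($text), 1);
  $shingles = array();
  for ($i = 0; $i + $n <= count($words); $i++) {
    $shingles[] = implode(' ', array_slice($words, $i, $n));
  }
  return array_unique($shingles);
}

// Fraction of the submission's tuples that also occur in a corpus text.
function shingle_overlap($submission, $corpusText, $n = 4) {
  $a = word_shingles($submission, $n);
  $b = word_shingles($corpusText, $n);
  return count($a) ? count(array_intersect($a, $b)) / count($a) : 0;
}

Even a small number of matching tuples is a strong signal, for exactly the reason the search counts above suggest.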

If the plagiarist used synonyms, changed word ordering and so on, obviously it gets a bit more difficult. You would have to store synonyms as well and try to normalise grammatical structure a bit to keep the same approach working. The same goes for spelling, of course (i.e. try to match by normalisation or try to account for the deviations in your matching, as in the NCD approaches posted in the other answers).
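
One cheap way to make the tuple matching survive simple synonym swaps is to map each word onto a canonical form before building the tuples. The tiny $synonyms table below is a made-up placeholder; in practice you would generate it from a thesaurus:

// Canonicalise words so that simple synonym substitutions still match.
// The synonym table is a made-up example, not a real thesaurus.
$synonyms = array(
  'big'   => 'large',
  'huge'  => 'large',
  'quick' => 'fast',
  'rapid' => 'fast',
);

function normalize_words($text, $synonyms) {
  $words = str_word_count(strtolower($text), 1);
  foreach ($words as $i => $word) {
    if (isset($synonyms[$word])) {
      $words[$i] = $synonyms[$word];
    }
  }
  return $words;
}

Feeding the normalised word list into the shingling step above, instead of the raw text, keeps the rest of the approach unchanged.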

However the biggest problem is conceptual plagiarism. That is really hard and there are no obvious solutions without parsing the semantics of each sentence (i.e. sufficiently complex AI).

The truth is, though, that you only need to find SOME kind of match. You don’t need to find an exact match in order to find a relevant text in your corpus. The final assessment should always be made by a human anyway, so it’s okay if you find an inexact match.

Plagiarists are mostly stupid and lazy, so their copies will be stupid and lazy, too. Some put an incredible amount of effort into their work, but those works are often non-obvious plagiarism in the first place, so it’s hard to track down programmatically (i.e. if a human has trouble recognising plagiarism with both texts presented side-by-side, a computer most likely will, too). For all the other 80%-or-so, the dumb approach is good enough.

Solution 4

It really depends on “plagiarised from where”.
If you are talking about within the context of a single site, that’s vastly different from across the web, or the Library of Congress, or …

http://www.copyscape.com/ pretty much proves it can be done.

Basic concept seems to be

  • do a Google search for some uncommon word sequences
  • for each result, do a detailed analysis

The detailed analysis portion can certainly reuse techniques like the ones above, since it is a one-to-one comparison, but locating and obtaining the source documents is the key factor.
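
A sketch of that two-step concept, reusing the word_shingles() and shingle_overlap() helpers from the sketch in Solution 3. Note that search_web() is a purely hypothetical placeholder for whatever search API you actually have access to (Google’s terms of service generally do not allow scraping result pages), and the 0.2 threshold is arbitrary:

// Step 1: pick a few word sequences from the submission to search for.
// Picking them at random is crude; rarer phrases would work better.
function sample_phrases($text, $phraseLen = 6, $howMany = 3) {
  $shingles = word_shingles($text, $phraseLen);
  shuffle($shingles);
  return array_slice($shingles, 0, $howMany);
}

$submission = file_get_contents('uploaded_essay.txt');

foreach (sample_phrases($submission) as $phrase) {
  // search_web() is hypothetical: assume it returns an array of result URLs.
  foreach (search_web('"' . $phrase . '"') as $candidateUrl) {
    // Step 2: detailed 1-to-1 comparison against each candidate source.
    $candidate = strip_tags(file_get_contents($candidateUrl));
    if (shingle_overlap($submission, $candidate) > 0.2) {
      echo "Check $candidateUrl against this submission.\n";
    }
  }
}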

Solution 5


For better results on short strings:

There are problems with the direct use of the NCD formula on short strings or little texts: NCD(X,X) is not zero (!). To remove this artifact, subtract the self-comparison.

See similar_NCD_gzip() demo at http://leis.saocarlos.sp.gov.br/SIMILAR.php

function similar_NCD_gzip($sx, $sy, $prec=0, $MAXLEN=90000) {
# NCD with gzip artifact correction and percentual return.
# $sx, $sy = strings to compare.
# Use $prec=-1 for a result in the range [0-1], $prec=0 for a percentage,
#     $prec=1 or 2, 3... for more decimal places (not more reliable).
# Use $MAXLEN=-1 or an approximate compressed length.
# For the NCD definition see http://arxiv.org/abs/0809.2553
# (c) Krauss (2010).
  $x = $min = strlen(gzcompress($sx));
  $y = $max = strlen(gzcompress($sy));
  $xy = strlen(gzcompress($sx . $sy));
  $a = $sx;
  if ($x > $y) { # swap min/max
    $min = $y;
    $max = $x;
    $a = $sy;
  }
  $res = ($xy - $min) / $max; # NCD definition.

  # Optional correction (for short strings): subtract NCD(a,a), which
  # is not exactly zero because of gzip overhead.
  if ($MAXLEN < 0 || $xy < $MAXLEN) {
    $aa = strlen(gzcompress($a . $a));
    $ref = ($aa - $min) / $min;
    $res = $res - $ref; # correction
  }
  return ($prec < 0) ? $res : 100 * round($res, 2 + $prec);
}
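
It is called the same way as the ncd() function from Solution 1, for example:

// Returns a percentage-style distance by default; smaller means more similar.
print(similar_NCD_gzip('this is a test', 'this was a test') . "\n");
print(similar_NCD_gzip('this is a test', 'this text is completely different') . "\n");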

Note: Use and implement Solution 1, because this method has been fully tested on our system.
Thank you 🙂

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
