Plagscan and Continuous Integration
CI (Continuous integration)
In short, CI means running code automatically whenever you push new code to a repository. This code can be anything. Most typically, it is test code. Say you've created a program that can give you the name of a plant based on an image of one of its leaves. Somewhere in your project, you'll keep a folder full of images of leaves, and some data recording the correct plant name for each. Then your code will have tests, where the tool is run on these known leaves, and the tool's answer is checked against the known answer. These tests are run every single time someone pushes code to the project, so you can see if your new feature changes the behaviour of something else in the project. If it does, you'll get a notification that tests failed, and you can go and figure out why.
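As a minimal sketch of such a test: the `identify_plant` function and the `KNOWN_LEAVES` table below are hypothetical stand-ins (they are not from any real project), and the "classifier" only reads the file name so that the example runs without any image library.

```python
# Known answers: image path -> correct plant name (hypothetical data).
KNOWN_LEAVES = {
    "leaves/oak_01.jpg": "oak",
    "leaves/maple_01.jpg": "maple",
    "leaves/birch_01.jpg": "birch",
}


def identify_plant(image_path):
    """Pretend classifier (assumption): guess the plant from the file name."""
    file_name = image_path.rsplit("/", 1)[-1]
    return file_name.split("_")[0]


def test_known_leaves():
    """Run the tool on every known leaf and compare with the known answer."""
    for image_path, expected_name in KNOWN_LEAVES.items():
        assert identify_plant(image_path) == expected_name, image_path


if __name__ == "__main__":
    test_known_leaves()
    print("all leaf tests passed")
```

A test runner such as pytest would typically collect a function named like `test_known_leaves` on its own; the CI service then only has to run that one command on every push.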
It is possible to have tests that cover every single line of your project, which gives you confidence that bugs will not easily enter it. README files for projects hosted on GitHub/GitLab often have badges near the top. For example, if you take a look at the README of Black, a Python code formatter, you'll see a badge showing that it has 96% test coverage, meaning that 96% of all lines of code in the project are run by the suite of tests that run each time someone pushes to the repository. Another badge simply says that tests are currently passing, meaning that the latest push didn't introduce any new bug that the tests would have caught.
Lives in Transit uses CI. You can see the CI instructions for the backend at https://github.com/uzh/marugoto/blob/master/.travis.yml.
Continuous deployment
The other thing you might want to do when pushing code is automatically deploy/release/archive your code. This is continuous deployment (CD). For example, if your project is a website, you can use CD to update your website with the very latest code. If your project is a dataset that people can explore online, you could make sure that every time you push to the master branch (and CI tests pass!), the data gets updated to this latest version.
Therefore, it's common to use GitLab to create a pipeline where the project is built, tested, and then (if tests pass) deployed. For important projects, deployment may be to a test instance that only developers can see. The developers can then play around with the newest version of their code, and when they're confident in it, they can manually create a production release, or deploy this version as the live (i.e. global) system.
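The "build, test, then deploy only if tests pass" idea can be sketched in a few lines. This is only an illustration of the gating logic, not how GitLab implements pipelines; `run_pipeline` and the three stage functions are hypothetical stand-ins for real build, test, and deploy commands.

```python
def run_pipeline(stages):
    """Run (name, step) pairs in order and stop at the first failing step.

    Each step is a callable that returns True on success. The return value
    is the list of stage names that completed successfully.
    """
    completed = []
    for name, step in stages:
        if not step():
            print(f"stage '{name}' failed; skipping the remaining stages")
            break
        completed.append(name)
    return completed


# Hypothetical stand-ins for real commands.
def build():
    return True


def failing_tests():
    return False


def deploy_to_test_instance():
    return True


result = run_pipeline([
    ("build", build),
    ("test", failing_tests),
    ("deploy", deploy_to_test_instance),
])
print(result)  # ['build']
```

Because the test stage fails here, the deploy stage never runs, which is exactly the safety net you want in front of a live system.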
FYI, the test/live setup is in use for Lives in Transit:
- Test: https://marugoto.s3it.uzh.ch
- Live: https://livesintransit.org
plagscan
Enter this poorly named plagiarism detection tool, for which UZH has a license. The tool is pretty neat. I ran it over a paper that cites some old work of mine, and sure enough, it turned up a lot of interesting information. Take a look at the plagiarism analysis here. You can see that it picks up some generally common strings of words:
SUBMITTED PAPER: Risk-taking has always been an integral part of human behaviour.
CHECKED AGAINST: Nature has always been an integral part of our lives
You can quickly see that it's a coincidence, and that the papers have nothing to do with each other. All good.
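To get a feel for how a match like this can arise, here is a rough sketch that looks for word sequences shared by the two sentences. This is not plagscan's actual algorithm (I don't know how it works internally); `word_ngrams` is a hypothetical helper written just for this illustration.

```python
def word_ngrams(text, n):
    """Return the set of all runs of n consecutive words in the text."""
    words = text.lower().replace(".", "").split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


submitted = "Risk-taking has always been an integral part of human behaviour."
checked = "Nature has always been an integral part of our lives"

# Runs of seven words that appear in both sentences.
shared = word_ngrams(submitted, 7) & word_ngrams(checked, 7)
for run in shared:
    print(" ".join(run))  # has always been an integral part of
```

A seven-word overlap made of very common words is enough to get flagged, but it is also easy for a human to dismiss.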
However, there are some other passages ... :
SUBMITTED PAPER: Normally, speakers and hearers assume a causal structure behind such a correlation: that is, two events which are correlated strongly enough to permit prediction of one based upon knowledge about the other, are typically understood as correlated because of some causal relationships.
CHECKED AGAINST: Normally, speakers and hearers assume a causal structure behind such a correlation: that is, two events which are correlated strongly enough to permit prediction of one based on knowledge about the other, are typically understood as correlated because of some causal relationship.
Uh oh. Clicking around on these examples shows us that this text is duplicated from a book that is not cited, but whose authors deal with the same field. This is extra bad, since if we look in the references section, we can see that while this text is not cited, there are citations for the author of the suspiciously similar material.
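You can check for yourself how close the two passages are with Python's standard `difflib` module. Again, this is only a sketch of one way to compare texts word by word, not a description of what plagscan does.

```python
import difflib

submitted = (
    "Normally, speakers and hearers assume a causal structure behind such a "
    "correlation: that is, two events which are correlated strongly enough to "
    "permit prediction of one based upon knowledge about the other, are "
    "typically understood as correlated because of some causal relationships."
)
checked = (
    "Normally, speakers and hearers assume a causal structure behind such a "
    "correlation: that is, two events which are correlated strongly enough to "
    "permit prediction of one based on knowledge about the other, are "
    "typically understood as correlated because of some causal relationship."
)

a, b = submitted.split(), checked.split()
matcher = difflib.SequenceMatcher(None, a, b)

# A score between 0 and 1; 1 means the two word sequences are identical.
similarity = matcher.ratio()

# Collect only the places where the two passages differ.
differences = [
    (a[i1:i2], b[j1:j2])
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag != "equal"
]

print(f"similarity: {similarity:.2f}")
for old, new in differences:
    print(old, "->", new)
```

Only two words differ between the passages ("upon" versus "on", and "relationships" versus "relationship"), which is very hard to explain as coincidence.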
I haven't made a complete investigation here ... for all I know, I have things backward, and the source found by plagscan is actually newer, and contains proper citations and so on. But all this is checkable with the tool, and it doesn't take very long, either.
There are a number of other such cases in the text that raise some red flags. Why don't you guys try it out and let me know what you find? And please remember that this repository is public, before you go casting stones.
plagbot?
So at some point in the course, you guys have to submit written assignments, which can be checked with plagscan. What I hope to do is set up this repository so that when you push your assignment, the CI/CD is triggered, your assignment is checked with plagscan, and a report is generated for all to see. I'm doing some checking to see if this integration is possible, but if it is, we're one step closer to my GitLab Academic dream!
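Here is roughly what the script run by such a CI job could look like. Everything in it is an assumption: I haven't confirmed what plagscan offers for integration, so `fake_plagiarism_check` is a hypothetical placeholder that returns a dummy score, and the choice of Markdown files and the report format are made up for the sketch.

```python
import tempfile
from pathlib import Path


def fake_plagiarism_check(text):
    """Hypothetical placeholder for a plagscan call (assumption).

    Returns a dummy score so the sketch runs offline; a real integration
    would submit the text to the service instead.
    """
    return 0.0


def build_report(assignment_dir, check=fake_plagiarism_check):
    """Check every Markdown file in the directory and return report lines."""
    lines = []
    for path in sorted(Path(assignment_dir).glob("*.md")):
        score = check(path.read_text(encoding="utf-8"))
        lines.append(f"{path.name}: similarity {score:.0%}")
    return lines


if __name__ == "__main__":
    # Demo with throwaway files; in CI this would be the repository checkout.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "alice.md").write_text("My essay.", encoding="utf-8")
        Path(tmp, "bob.md").write_text("Another essay.", encoding="utf-8")
        report = build_report(tmp)
        Path(tmp, "report.txt").write_text("\n".join(report), encoding="utf-8")
        print("\n".join(report))
```

Passing the checker in as the `check` argument keeps the file handling and reporting separate from whatever the real plagscan integration turns out to need.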
Takeaways
- In week one, we mentioned how radical transparency can ensure good work. I think you can probably see what we meant by that in this case in particular: attempts at plagiarism will be detected as a matter of course!
- We mentioned how the open source model of software development allows free, global collaboration. If I hook up plagscan to GitLab and it works well, and if I license my code liberally, GitLab could make this a feature of GitLab Academic. Or I could start a company for this exact purpose, a git-based OLAT.
(This issue will track my progress toward plagscan integration too. No guarantees this is possible, but I'll try...)