feat: add cli to anonymize repositories locally

tdurieux
2023-02-06 15:48:21 +01:00
parent d01c839616
commit dcf7f36917
9 changed files with 676 additions and 123 deletions

1
.gitignore vendored

@@ -1,4 +1,5 @@
.env
build
/repositories
repo/
db_backups


@@ -1,37 +1,35 @@
Anonymous Github
================
# Anonymous Github
Anonymous Github is a system to anonymize Github repositories before referring to them in a double-anonymous paper submission.
To start using Anonymous Github right now: **[http://anonymous.4open.science/](http://anonymous.4open.science/)**
Indeed, in a double-anonymous review process, the open-science data or code that is in the online appendix must be anonymized, similarly to paper anonymization. The authors must
* anonymize URLs: the name of the institution/department/group/authors should not appear in the URLs of the open-science appendix
* anonymize the appendix content itself
- anonymize URLs: the name of the institution/department/group/authors should not appear in the URLs of the open-science appendix
- anonymize the appendix content itself
Anonymizing an open-science appendix needs some work, but fortunately, this can be automated; this is what Anonymous Github is about.
Anonymous Github anonymizes:
* the Github owner / organization / repository name
* the content of the repository
* file contents (all extensions, md/txt/java/etc)
* file and directory names
- the Github owner / organization / repository name
- the content of the repository
- file contents (all extensions, md/txt/java/etc)
- file and directory names
Question / Feedback / Bug report: please open an issue in this repository.
Using Anonymous Github
-----------------------
## Using Anonymous Github
## How to create a new anonymized repository
To use it, open the main page (e.g., [http://anonymous.4open.science/](http://anonymous.4open.science/)), login with GitHub, and click on "Anonymize".
Simply fill 1. the Github repo URL and 2. the id of the anonymized repository, 3. the terms to anonymize (which can be updated afterward).
The anonymization of the content is done by replacing all occurrences of words in a list by "XXXX" (can be changed in the configuration).
Simply fill 1. the Github repo URL and 2. the id of the anonymized repository, 3. the terms to anonymize (which can be updated afterward).
The anonymization of the content is done by replacing all occurrences of words in a list by "XXXX" (can be changed in the configuration).
The word list is provided by the authors, and typically contains the institution name, author names, logins, etc...
The README is anonymized as well as all files of the repository. Even filenames are anonymized.
The README is anonymized as well as all files of the repository. Even filenames are anonymized.
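The term replacement described above can be illustrated with a minimal sketch. It is not the project's implementation: `anonymizeText` and `escapeRegExp` are hypothetical helpers, and the case-insensitive matching is an assumption made for the example.

```typescript
// Hypothetical helper, not part of the Anonymous Github code base:
// escape characters that have a special meaning in a regular expression.
function escapeRegExp(term: string): string {
  return term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Replace every occurrence of each term with the mask ("XXXX" by default).
// Matching is case-insensitive here, which is an assumption of this sketch.
function anonymizeText(
  content: string,
  terms: string[],
  mask = "XXXX"
): string {
  let result = content;
  for (const term of terms) {
    const trimmed = term.trim();
    if (trimmed === "") continue; // an empty term would match everywhere
    const pattern = new RegExp(escapeRegExp(trimmed), "gi");
    // A function replacer keeps "$" sequences in the mask literal.
    result = result.replace(pattern, () => mask);
  }
  return result;
}

console.log(
  anonymizeText("Maintained by Alice at Example University", [
    "alice",
    "Example University",
  ])
);
```

Escaping the terms matters because author-provided words such as `C++` or `a.b` would otherwise be interpreted as regular expression syntax.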
In a paper under double-anonymous review, instead of putting a link to Github, one puts a link to the Anonymous Github instance (e.g.
In a paper under double-anonymous review, instead of putting a link to Github, one puts a link to the Anonymous Github instance (e.g.
<http://anonymous.4open.science/r/840c8c57-3c32-451e-bf12-0e20be300389/> which is an anonymous version of this repo).
To start using Anonymous Github right now, a public instance of anonymous_github is hosted at 4open.science:
@@ -42,15 +40,25 @@ To start using Anonymous Github right now, a public instance of anonymous_github
In double-anonymous peer-review, the boundary of anonymization is the paper plus its online appendix, and only this; it is not the whole world. Googling any part of the paper or the online appendix can be considered a deliberate attempt to break anonymity ([explanation](http://www.monperrus.net/martin/open-science-double-anonymous)).
## CLI
How does it work?
-----------------
This CLI tool allows you to anonymize your GitHub repositories locally, generating an anonymized zip file based on your configuration settings.
Anonymous Github either downloads the complete repository and anonymizes the content of its files, or proxies the requests to GitHub. In both cases, the original and anonymized versions of the files are cached on the server.
```bash
# Install the Anonymous GitHub CLI tool
npm install -g @tdurieux/anonymous_github
# Run the Anonymous GitHub CLI tool
anonymous_github
```
## How does it work?
Anonymous Github either downloads the complete repository and anonymizes the content of its files, or proxies the requests to GitHub. In both cases, the original and anonymized versions of the files are cached on the server.
## Installing Anonymous Github
Installing Anonymous Github
----------------------------
1. Clone the repository
```bash
git clone https://github.com/tdurieux/anonymous_github/
cd anonymous_github
@@ -76,6 +84,7 @@ AUTH_CALLBACK=http://localhost:5000/github/auth,
The callback of the GitHub app needs to be defined as `https://<host>/github/auth` (the same as defined in AUTH_CALLBACK).
3. Run Anonymous Github
```bash
docker-compose up -d
```
@@ -84,14 +93,12 @@ docker-compose up -d
By default, Anonymous Github uses port 5000. It can be changed in `docker-compose.yml`.
## Related tools
Related tools
--------------
[gitmask](https://www.gitmask.com/) is a tool to anonymously contribute to a Github repository.
[blind-reviews](https://github.com/zombie/blind-reviews/) is a browser add-on that enables a person reviewing a GitHub pull request to hide identifying information about the person submitting it.
See also
--------
## See also
* [Open-science and double-anonymous Peer-Review](https://www.monperrus.net/martin/open-science-double-blind)
- [Open-science and double-anonymous Peer-Review](https://www.monperrus.net/martin/open-science-double-blind)

99
cli.ts Normal file

@@ -0,0 +1,99 @@
#!/usr/bin/env node
import { config as dot } from "dotenv";
dot();
import { writeFile } from "fs/promises";
import { join } from "path";
import { tmpdir } from "os";
import * as gh from "parse-github-url";
import * as inquirer from "inquirer";
import config from "./config";
import GitHubDownload from "./src/source/GitHubDownload";
import Repository from "./src/Repository";
import AnonymizedRepositoryModel from "./src/database/anonymizedRepositories/anonymizedRepositories.model";
function generateRandomFileName(size: number) {
const characters =
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
let result = "";
for (let i = 0; i < size; i++) {
result += characters.charAt(Math.floor(Math.random() * characters.length));
}
return result;
}
async function main() {
config.STORAGE = "filesystem";
const inq = await inquirer.prompt([
{
type: "string",
name: "token",
message: `Enter your GitHub token. You can create one at https://github.com/settings/personal-access-tokens/new.`,
default: process.env.GITHUB_TOKEN,
},
{
type: "string",
name: "repo",
message: `URL of the repository to anonymize (if you want to download a specific branch or commit use the GitHub URL of that branch or commit).`,
},
{
type: "string",
name: "terms",
message: `Terms to remove from your repository (separated with comma).`,
},
]);
const ghURL = gh(inq.repo) || { owner: "", name: "", branch: "", commit: "" };
const repository = new Repository(
new AnonymizedRepositoryModel({
repoId: "test",
source: {
type: "GitHubDownload",
accessToken: inq.token,
branch: ghURL.branch || "master",
commit: ghURL.branch || "HEAD",
repositoryName: `${ghURL.owner}/${ghURL.name}`,
},
options: {
terms: inq.terms.split(","),
expirationMode: "never",
update: false,
image: true,
pdf: true,
notebook: true,
link: true,
page: false,
},
})
);
const source = new GitHubDownload(
{
type: "GitHubDownload",
accessToken: inq.token,
repositoryName: inq.repo,
},
repository
);
console.info("[INFO] Downloading repository...");
await source.download(inq.token);
const outputFileName = join(tmpdir(), generateRandomFileName(8) + ".zip");
console.info("[INFO] Anonymizing repository and creating the zip file...");
await writeFile(outputFileName, repository.zip());
console.log(`Anonymized repository saved at ${outputFileName}`);
}
if (require.main === module) {
if (process.argv[2] == "server") {
// start the server
require("./src/server").default();
} else {
// use the cli interface
main();
}
}
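In `cli.ts` above, the terms prompt is split with `inq.terms.split(",")`. A plain split keeps surrounding whitespace (`"a, b"` yields `" b"`) and turns an empty answer into a single empty term. The sketch below shows one possible normalization step; `parseTerms` is a hypothetical helper that does not exist in the commit.

```typescript
// Hypothetical helper: turn the comma-separated prompt answer into a clean
// list of terms (trimmed, non-empty, without exact duplicates).
function parseTerms(input: string): string[] {
  const seen = new Set<string>();
  for (const raw of input.split(",")) {
    const term = raw.trim();
    if (term !== "") seen.add(term);
  }
  return Array.from(seen);
}

console.log(parseTerms("Alice, Example University,,alice "));
```

Under these assumptions, the call in `main()` could pass `parseTerms(inq.terms)` as the `terms` option instead of the raw split result.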

599
package-lock.json generated

File diff suppressed because it is too large


@@ -1,8 +1,10 @@
{
"name": "anonymous_github",
"name": "@tdurieux/anonymous_github",
"version": "2.1.0",
"description": "Anonymise Github repositories for double-anonymous reviews",
"main": "index.ts",
"bin": {
"anonymous_github": "build/cli.js"
},
"scripts": {
"test": "mocha --reporter spec",
"start": "node --inspect=5858 -r ts-node/register ./index.ts",
@@ -23,6 +25,10 @@
"url": "https://github.com/sponsors/tdurieux"
},
"homepage": "https://github.com/tdurieux/anonymous_github#readme",
"files": [
"public",
"build"
],
"dependencies": {
"@octokit/oauth-app": "^4.1.0",
"@octokit/rest": "^19.0.5",
@@ -39,6 +45,7 @@
"express-session": "^1.17.3",
"express-slow-down": "^1.5.0",
"got": "^11.8.5",
"inquirer": "^8.2.5",
"istextorbinary": "^6.0.0",
"marked": "^4.1.1",
"mime-types": "^2.1.35",
@@ -63,6 +70,7 @@
"@types/express-session": "^1.17.5",
"@types/express-slow-down": "^1.3.2",
"@types/got": "^9.6.12",
"@types/inquirer": "^8.0.0",
"@types/marked": "^4.0.7",
"@types/mime-types": "^2.1.0",
"@types/parse-github-url": "^1.0.0",


@@ -15,6 +15,7 @@ import Conference from "./Conference";
import ConferenceModel from "./database/conference/conferences.model";
import AnonymousError from "./AnonymousError";
import { downloadQueue } from "./queue";
import { isConnected } from "./database/database";
export default class Repository {
private _model: IAnonymizedRepositoryDocument;
@@ -208,6 +209,7 @@ export default class Repository {
* Update the last view and view count
*/
async countView() {
if (!isConnected) return this.model;
this._model.lastView = new Date();
this._model.pageView = (this._model.pageView || 0) + 1;
return this._model.save();
@@ -219,9 +221,11 @@ export default class Repository {
* @param statusMessage a potential status message to display
*/
async updateStatus(status: RepositoryStatus, statusMessage?: string) {
if (!status) return this.model;
this._model.status = status;
this._model.statusDate = new Date();
this._model.statusMessage = statusMessage;
if (!isConnected) return this.model;
return this._model.save();
}
@@ -247,18 +251,17 @@ export default class Repository {
* Reset/delete the state of the repository
*/
async resetSate(status?: RepositoryStatus, statusMessage?: string) {
if (status) this._model.status = status;
if (statusMessage) this._model.statusMessage = statusMessage;
const p = this.updateStatus(status, statusMessage);
// remove attribute
this._model.size = { storage: 0, file: 0 };
this._model.originalFiles = null;
// remove cache
return Promise.all([this._model.save(), this.removeCache()]);
return Promise.all([p, this.removeCache()]);
}
/**
* Remove the cached files
* @returns
* @returns
*/
async removeCache() {
return storage.rm(this._model.repoId + "/");
@@ -281,15 +284,15 @@ export default class Repository {
}> {
if (this.status != "ready") return { storage: 0, file: 0 };
if (this._model.size.file) return this._model.size;
function recursiveCount(files) {
function recursiveCount(files: Tree): { storage: number; file: number } {
const out = { storage: 0, file: 0 };
for (const name in files) {
const file = files[name];
if (file.size && parseInt(file.size) == file.size) {
if (file.size && parseInt(file.size.toString()) == file.size) {
out.storage += file.size as number;
out.file++;
} else if (typeof file == "object") {
const r = recursiveCount(file);
const r = recursiveCount(file as Tree);
out.storage += r.storage;
out.file += r.file;
}
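The typed `recursiveCount` from the hunk above can be read as a depth-first walk over the file tree. The following self-contained sketch shows the idea with simplified, hypothetical `Tree` and `FileEntry` types; the project's real types are not visible in this diff. Unlike the `file.size && ...` check in the hunk, which appears to skip entries whose size is 0, this sketch counts zero-byte files as files.

```typescript
// Hypothetical, simplified stand-ins for the project's tree types.
interface FileEntry {
  size: number;
  sha?: string;
}
interface Tree {
  [name: string]: Tree | FileEntry;
}

// A node is treated as a file when it carries a numeric size.
function isFile(node: Tree | FileEntry): node is FileEntry {
  return typeof (node as FileEntry).size === "number";
}

// Sum the size and the number of files of a tree, recursing into folders.
function recursiveCount(files: Tree): { storage: number; file: number } {
  const out = { storage: 0, file: 0 };
  for (const name of Object.keys(files)) {
    const node = files[name];
    if (isFile(node)) {
      out.storage += node.size;
      out.file++;
    } else {
      const r = recursiveCount(node);
      out.storage += r.storage;
      out.file += r.file;
    }
  }
  return out;
}

const tree: Tree = {
  "README.md": { size: 120 },
  src: { "index.ts": { size: 300 }, lib: { "util.ts": { size: 80 } } },
};
console.log(recursiveCount(tree));
```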


@@ -10,10 +10,13 @@ const MONGO_URL = `mongodb://${config.DB_USERNAME}:${config.DB_PASSWORD}@${confi
export const database = mongoose.connection;
export let isConnected = false;
export async function connect() {
await mongoose.connect(MONGO_URL + "production", {
authSource: "admin",
} as ConnectOptions);
isConnected = true;
return database;
}
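The `isConnected` flag added above is what lets `Repository.updateStatus` and `countView` run from the CLI without MongoDB: the in-memory model is updated first, and `save()` is skipped when no connection exists. The sketch below reproduces that guard in isolation. `db`, `StatusHolder`, and the counting `save()` are hypothetical stand-ins, not code from the commit.

```typescript
// Stand-in for the database module: `connect()` would set the flag to true.
const db = { isConnected: false, saves: 0 };

class StatusHolder {
  status = "preparing";
  statusDate: Date | null = null;

  // Same shape as the guarded method in the diff: mutate the in-memory
  // state first, then persist only when a connection exists.
  async updateStatus(status: string): Promise<StatusHolder> {
    this.status = status;
    this.statusDate = new Date();
    if (!db.isConnected) return this;
    return this.save();
  }

  // Hypothetical persistence call; here it only counts invocations.
  private async save(): Promise<StatusHolder> {
    db.saves++;
    return this;
  }
}

const holder = new StatusHolder();
void holder.updateStatus("ready"); // offline: state changes, nothing is saved
db.isConnected = true;
void holder.updateStatus("expired"); // connected: save() is invoked
console.log(holder.status, db.saves);
```

Placing the guard after the assignments, as `updateStatus` does in the diff, keeps the object usable offline while still avoiding a call into a disconnected database.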


@@ -38,7 +38,7 @@ export default class GitHubDownload extends GitHubBase implements SourceBase {
});
}
async download() {
async download(token?: string) {
const fiveMinuteAgo = new Date();
fiveMinuteAgo.setMinutes(fiveMinuteAgo.getMinutes() - 5);
if (
@@ -51,7 +51,10 @@ export default class GitHubDownload extends GitHubBase implements SourceBase {
});
let response: OctokitResponse<unknown, number>;
try {
response = await this._getZipUrl(await this.getToken());
if (!token) {
token = await this.getToken();
}
response = await this._getZipUrl(token);
} catch (error) {
if (error.status == 401 && config.GITHUB_TOKEN) {
try {


@@ -5,13 +5,13 @@
"compilerOptions": {
"target": "es6",
"module": "commonjs",
"outDir": "dist",
"outDir": "build",
"removeComments": true,
"preserveConstEnums": true,
"forceConsistentCasingInFileNames": true,
"sourceMap": false,
"skipLibCheck": true
},
"include": ["src/**/*.ts", "index.ts", "tests3.ts"],
"include": ["src/**/*.ts", "index.ts", "cli.ts"],
"exclude": ["node_modules", ".vscode"]
}