Merge pull request #4 from kktsubota/check-dataset-links
Check if the links of the BAM images are valid
kktsubota authored Oct 21, 2022
2 parents 0b6d638 + c712666 commit 8f5e927
Showing 3 changed files with 72 additions and 2 deletions.
33 changes: 33 additions & 0 deletions .github/workflows/check-dataset.yml
@@ -0,0 +1,33 @@
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python

name: Check links of the dataset

on:
  schedule:
    # run at 1:00 UTC (10:00 JST) on the first day of each month
    - cron: "00 1 1 * *"

permissions:
  contents: read

jobs:
  build:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v3
    - name: Set up Python 3.10
      uses: actions/setup-python@v3
      with:
        python-version: "3.10"
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install flake8 pytest
        pip install pandas tqdm requests
        # if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
    - name: Check if the links of the images in dataset are valid
      run: |
        python tests/check_dataset.py
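
Review note: the schedule comment above equates 01:00 UTC with 10:00 JST. The conversion can be checked with a minimal Python sketch, assuming JST is a fixed UTC+9 offset with no daylight saving time. The helper name `cron_utc_to_jst` and the example month are hypothetical and not part of the commit.

```python
from datetime import datetime, timedelta, timezone

# Assumption: JST is a fixed UTC+9 offset (no daylight saving time).
JST = timezone(timedelta(hours=9), name="JST")


def cron_utc_to_jst(year: int, month: int, day: int = 1,
                    hour: int = 1, minute: int = 0) -> datetime:
    """Return the JST wall-clock time of a run scheduled at the given UTC time.

    The defaults mirror the cron fields "00 1 1 * *": minute 0, hour 1, day 1.
    """
    run_utc = datetime(year, month, day, hour, minute, tzinfo=timezone.utc)
    return run_utc.astimezone(JST)


if __name__ == "__main__":
    # The month is arbitrary and only serves as an illustration.
    print(cron_utc_to_jst(2022, 11).isoformat())
```

For any month the converted time falls at 10:00 on the same calendar day in JST, which agrees with the comment in the workflow.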
4 changes: 2 additions & 2 deletions README.md
@@ -18,9 +18,9 @@ pip install -r requirements.txt --extra-index-url https://download.pytorch.org/w
Prepare a dataset that consists of four domains: natural images, line drawings, comics, and vector arts.

```bash
-# In 2022/08/31, four files have missing links.
-# the four files: (`156526161`, `99117125`, `15642096`, `158633139`)
 python scripts/download_dataset.py
+# In 2022/10/21, four files have missing links.
+# the four files: (`156526161`, `99117125`, `15642096`, `158633139`)
python scripts/replace_empty_image.py
```

37 changes: 37 additions & 0 deletions tests/check_dataset.py
@@ -0,0 +1,37 @@
import random
import time

import requests
import pandas as pd
import tqdm


URL_LIST: str = "datasets/bam_url.csv"
NEW_URL_LIST: str = "datasets/bam_new_url.csv"

# The file size of empty images whose resolution is 600x343
# e.g., 15642096.png (https://mir-s3-cdn-cf.behance.net/project_modules/disp/bf992815642096.56038dc9bf59c.png)
EMPTY_IMAGE_SIZE: int = 3727

def main():
    df = pd.read_csv(URL_LIST)
    df_new = pd.read_csv(NEW_URL_LIST)

    # merge two tables
    indices = df["id"].isin(df_new["id"])
    df.loc[indices, "url"] = df_new["url"].values

    names = list()
    for name, url in tqdm.tqdm(df.values):
        time.sleep(0.5 + random.random())
        r = requests.get(url, allow_redirects=True)

        if len(r.content) == EMPTY_IMAGE_SIZE:
            names.append(name)

    print("file names of empty images:", names)
    assert len(names) == 0, f"The number of empty images is expected to be 0, but actual: {len(names)}."


if __name__ == "__main__":
    main()
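
Review note: two details of the script above may be worth hardening. The merge assigns `df_new["url"].values` by position, so it relies on both CSV files listing the shared ids in the same order, and `requests.get` is called without a timeout or status check. The following is a rough sketch of an alternative, not the author's implementation: `merge_urls`, `find_empty_images`, and `fetch_with_requests` are hypothetical names, the rows are plain `(id, url)` tuples rather than a DataFrame, and treating a failed download as an invalid link is an assumption that differs from the committed behavior.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Byte size of the 600x343 placeholder image, taken from the script above.
EMPTY_IMAGE_SIZE = 3727


def merge_urls(rows: Iterable[Tuple[str, str]],
               overrides: Dict[str, str]) -> List[Tuple[str, str]]:
    """Replace the URL of each row whose id is in `overrides`, matching by id."""
    return [(name, overrides.get(name, url)) for name, url in rows]


def find_empty_images(rows: Iterable[Tuple[str, str]],
                      fetch: Callable[[str], Optional[bytes]]) -> List[str]:
    """Return ids whose download failed or has the placeholder's byte size.

    `fetch` is a hypothetical callable that returns the response body,
    or None when the request failed.
    """
    empty = []
    for name, url in rows:
        content = fetch(url)
        if content is None or len(content) == EMPTY_IMAGE_SIZE:
            empty.append(name)
    return empty


def fetch_with_requests(url: str, timeout: float = 30.0) -> Optional[bytes]:
    """Download `url`; return None on a network error or a non-2xx status."""
    import requests  # imported lazily so the helpers above run without it

    try:
        r = requests.get(url, allow_redirects=True, timeout=timeout)
        r.raise_for_status()
    except requests.RequestException:
        return None
    return r.content


if __name__ == "__main__":
    # Offline demonstration with made-up ids, URLs, and response bodies.
    rows = [("a", "https://example.invalid/a"), ("b", "https://example.invalid/b")]
    merged = merge_urls(rows, {"b": "https://example.invalid/b-new"})
    fake_bodies = {
        "https://example.invalid/a": b"x" * 10,
        "https://example.invalid/b-new": b"x" * EMPTY_IMAGE_SIZE,
    }
    print(find_empty_images(merged, fake_bodies.get))
```

Passing the fetcher as an argument keeps the check testable without network access; in the scheduled job, `fetch_with_requests` could be supplied in its place, with the existing `time.sleep` throttling kept in the loop.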
