
Downloading PDFs, removing duplicate pages and last pages

We need the PDF versions of the Swiss Parliament debates from 1996, 1997 and 1998. They are
publicly available here:

https://www.amtsdruckschriften.bar.admin.ch/showHierarchyDate.do

We need only the collection «Amtliches Bulletin der Bundesversammlung (1891-1999)», and within it
only the years 1996, 1997 and 1998. See pictures below:

Some subfolders are called “Inhalt”. We don’t need the PDFs in these folders. See picture below:

The PDFs in the other folders are parts of large books. Unfortunately, some files appear more than
once. You can recognise copies of the same file by their page numbers, which are exactly the same.
In these cases, keep only one copy and filter out the rest manually. See picture below:
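
The rule above (same page numbers means same file) can be sketched in Python as a helper for spotting
candidates before the manual check. This is only a sketch: the Part record, its field names and the
idea of noting each file's volume and page range in a list are assumptions, not something the
website provides.

```python
from typing import NamedTuple


class Part(NamedTuple):
    """One downloaded PDF (hypothetical record, filled in by hand or by a scraper)."""
    name: str        # file name as saved locally
    volume: str      # book the part belongs to, e.g. "1996-I" (assumed label)
    first_page: int  # first printed page number in the file
    last_page: int   # last printed page number in the file


def drop_exact_duplicates(parts: list[Part]) -> list[Part]:
    """Keep the first file seen for each (volume, first_page, last_page)."""
    seen = set()
    unique = []
    for part in parts:
        key = (part.volume, part.first_page, part.last_page)
        if key not in seen:
            seen.add(key)
            unique.append(part)
    return unique
```

Files dropped by this helper are only likely duplicates; a quick look at the actual PDFs is still
advisable before deleting anything.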

Sometimes the page ranges of two files overlap. In these cases, delete only the duplicate pages,
not the whole document, so that every page appears exactly once. See picture below:
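
One possible way to work out which pages to delete is sketched below. It assumes that each file of
one book is described by a (name, first_page, last_page) tuple, and that the copy in the earlier
file is kept while the later file is trimmed. The text does not say which copy to keep, so that
choice is an assumption.

```python
def trim_overlaps(parts):
    """Return (name, keep_first, keep_last) for each file of one book.

    parts: iterable of (name, first_page, last_page) tuples (hypothetical format).
    Pages already covered by an earlier file are cut from the start of the
    later file; files that are covered completely are left out.
    """
    kept = []
    covered_up_to = None  # highest page number kept so far
    for name, first, last in sorted(parts, key=lambda p: (p[1], -p[2])):
        start = first if covered_up_to is None else max(first, covered_up_to + 1)
        if start > last:
            continue  # every page of this file is already covered
        kept.append((name, start, last))
        covered_up_to = last
    return kept
```

For a file that starts at printed page `first` and is kept from page `start`, the first
`start - first` pages are the ones to delete, provided each PDF page holds exactly one printed page.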
Every PDF file ends with a page that contains metadata. We don’t need these pages, so delete the
last page of each document. See picture below:
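
Removing the last page could be automated roughly as follows. The sketch assumes the third-party
pypdf package is installed; the function names and paths are illustrative only.

```python
from pathlib import Path


def pages_to_keep(page_count: int) -> range:
    """Indices (0-based) of all pages except the last one."""
    return range(max(page_count - 1, 0))


def strip_last_page(src: Path, dst: Path) -> None:
    """Write a copy of src to dst without its final (metadata) page."""
    # Imported here so the helper above works even without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    writer = PdfWriter()
    for index in pages_to_keep(len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(dst, "wb") as handle:
        writer.write(handle)
```

Writing to a separate destination file keeps the original download untouched, which makes it easy
to redo a file if the wrong page was removed.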

In the end, we need:

- A collection of PDF files, sorted by year (three folders: 1996, 1997, 1998)
- No duplicate pages and no missing pages
- The last page (metadata) removed from every PDF
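
The "no duplicates, no missing pages" requirement can be checked with a small script once the page
range kept from each file is known. The sketch below assumes that the page numbers within one book
run continuously and that the ranges are given as (first_page, last_page) tuples; both are
assumptions about how the results are recorded.

```python
def check_coverage(ranges):
    """List overlaps and gaps between the kept page ranges of one book.

    ranges: iterable of (first_page, last_page) tuples.
    Returns tuples like ("overlap", 9, 10) or ("gap", 11, 12); an empty
    list means every page appears exactly once.
    """
    ordered = sorted(ranges)
    if not ordered:
        return []
    problems = []
    covered = ordered[0][1]  # highest page number seen so far
    for first, last in ordered[1:]:
        if first <= covered:
            problems.append(("overlap", first, min(last, covered)))
        elif first > covered + 1:
            problems.append(("gap", covered + 1, first - 1))
        covered = max(covered, last)
    return problems
```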
