Download content
If you only need a specific title from Digi, the full export data packages might not be the best option. In that case it can be more practical to download the txt pages for just that one title instead of everything.
Machine setup (Windows)
Install python
If you do not already have Python installed, install it first. The Anaconda distribution can be useful, as it includes many data science modules in its download package and does not require administrative rights.
Plain Python is an alternative to Anaconda (for example version 2.7).
Choose the newspaper you want
Go to the titles view and take a note of the ISSN of the newspaper you want.
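Before passing the ISSN to the script, it can be worth checking that it was copied correctly. The following is a minimal sketch of the standard ISSN check-digit test; the function name is_valid_issn is made up for this page and is not part of the downloader. Note that some Digi titles are identified by other codes (for example fk01184), which this check will reject.

```python
def is_valid_issn(issn: str) -> bool:
    """Return True if the eight characters (hyphens ignored) form a valid ISSN."""
    compact = issn.replace("-", "").upper()
    if len(compact) != 8 or not compact[:7].isdecimal():
        return False
    if not (compact[7].isdecimal() or compact[7] == "X"):
        return False
    # Weights 8, 7, ..., 2 are applied to the first seven digits.
    total = sum(int(d) * w for d, w in zip(compact[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return compact[7] == expected


print(is_valid_issn("0018-2362"))  # ISSN used in the examples on this page
```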
Setup environment
1) Open the Windows command prompt to start executing the scripts: press Win+R and type cmd.
2) Create a working directory on your PC.
mkdir C:\temp\datadownload
set MYWORKINGDIR=C:\temp\datadownload
3) Change to the working directory.
cd %MYWORKINGDIR%
4) Download digi_downloadcontent.py and copy it to your working directory.
Note! The tool is not actively supported; use it at your own risk.
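If you prefer to do the setup from Python instead of the command prompt, the steps above can be sketched roughly as follows. The helper name prepare_workdir is hypothetical; it only creates the working directory and reports whether the script has already been copied there.

```python
from pathlib import Path

SCRIPT_NAME = "digi_downloadcontent.py"


def prepare_workdir(workdir: str) -> bool:
    """Create `workdir` if needed and report whether the script is in it."""
    path = Path(workdir)
    # Same effect as the mkdir command in step 2; no error if it already exists.
    path.mkdir(parents=True, exist_ok=True)
    return (path / SCRIPT_NAME).is_file()
```

For the directory used on this page, the call would be prepare_workdir(r"C:\temp\datadownload"). A return value of False means the script still has to be copied into the directory (step 4).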
Download pages
Run the provided helper script to download the desired newspaper or journal.
python digi_downloadcontent.py -i 0018-2362
Expected output is:
python digi_downloadcontent.py -i 0018-2362
Digi.nationallibrary.fi - UNSUPPORTED downloader for ALTO XML or TXT from digi for given ISSN.
All years for issn 0018-2362: [1903, 1904, 1905, 1906, 1907, 1908, 1909, 1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929]
ISSN 0018-2362 - Year 1903. 0018-2362\1903\txt
Processing binding :499137
Proc.: http://digi-testi.kansalliskirjasto.fi/aikakausi/binding/499137/page-1.txt to : 0018-2362\1903\txt
Proc.: http://digi-testi.kansalliskirjasto.fi/aikakausi/binding/499137/page-2.txt to : 0018-2362\1903\txt
Proc.: http://digi-testi.kansalliskirjasto.fi/aikakausi/binding/499137/page-3.txt to : 0018-2362\1903\txt
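After a run, you may want to check how much was actually downloaded. The sketch below counts the files per year, assuming the ISSN\year\format directory layout visible in the output above. The function name count_pages is made up for this page, and the assumption that other formats are stored in a directory named after the format is not confirmed by the script.

```python
from pathlib import Path


def count_pages(issn_dir: str, fmt: str = "txt") -> dict:
    """Map each year directory under `issn_dir` to its number of downloaded files."""
    counts = {}
    for year_dir in sorted(Path(issn_dir).iterdir()):
        fmt_dir = year_dir / fmt  # e.g. 0018-2362/1903/txt
        if fmt_dir.is_dir():
            counts[year_dir.name] = sum(1 for p in fmt_dir.iterdir() if p.is_file())
    return counts
```

Calling count_pages("0018-2362") from the working directory would return a dictionary keyed by year, for example with a key "1903" for the run shown above.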
Supported command line parameters
You can see the command line parameters by running the script with the -h parameter:
python digi_downloadcontent.py -h
Digi.nationallibrary.fi - UNSUPPORTED downloader for ALTO XML or TXT from digi for given ISSN.
usage: digi_downloadcontent.py [-h] -i ISSN [-f {alto,txt}] [--no-maxlimit]
[-y TYEAR]
Digi.nationallibrary.fi - UNSUPPORTED downloader for ALTO XML or TXT from digi
for given ISSN.
optional arguments:
  -h, --help            show this help message and exit
  -i ISSN, --issn ISSN  choose an issn and download altos for it
  -f {alto,txt}, --format {alto,txt}
                        choose data format ALTO (xml) or text (txt)
  --no-maxlimit
  -y TYEAR, --year TYEAR which year to download from the paper
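For readers who want to see how such an interface fits together, here is an argparse sketch reconstructed from the help text above. It is not the script's actual source: the destination name tyear is inferred from the TYEAR placeholder, the txt default is taken from the description of the format option on this page, and treating the year as a plain string is an assumption.

```python
import argparse

DESCRIPTION = (
    "Digi.nationallibrary.fi - UNSUPPORTED downloader for ALTO XML or TXT "
    "from digi for given ISSN."
)


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the options listed in the help output."""
    parser = argparse.ArgumentParser(prog="digi_downloadcontent.py",
                                     description=DESCRIPTION)
    parser.add_argument("-i", "--issn", required=True,
                        help="choose an issn and download altos for it")
    parser.add_argument("-f", "--format", choices=["alto", "txt"], default="txt",
                        help="choose data format ALTO (xml) or text (txt)")
    # Flag without a value: present means "no limit on downloaded files".
    parser.add_argument("--no-maxlimit", action="store_true")
    parser.add_argument("-y", "--year", dest="tyear",
                        help="which year to download from the paper")
    return parser


args = build_parser().parse_args(["-i", "0018-2362", "-y", "1903"])
print(args.issn, args.format, args.no_maxlimit, args.tyear)
# prints: 0018-2362 txt False 1903
```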
-f format
The default download format is text. The text is as originally produced by the text recognition (OCR) in the post-processing system of the digitization. Keep in mind that, depending on the material, the OCR output does contain errors.
If you choose 'alto', then you get the ALTO XML, which contains the layout information of the page and the words with their location on the page.
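As an illustration of what the ALTO format gives you, the sketch below reads the words and their page coordinates from an ALTO document using only the standard library. In ALTO, each word is a String element with CONTENT, HPOS, VPOS, WIDTH and HEIGHT attributes. The sample document is hand-written for this page, not real Digi data, and the function name alto_words is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, not taken from Digi.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Helsinki" HPOS="100" VPOS="200" WIDTH="80" HEIGHT="20"/>
    <SP WIDTH="10"/>
    <String CONTENT="1903" HPOS="190" VPOS="200" WIDTH="40" HEIGHT="20"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""


def alto_words(xml_text: str) -> list:
    """Return (content, hpos, vpos, width, height) for every String element."""
    root = ET.fromstring(xml_text)
    words = []
    for elem in root.iter():
        # Ignore the namespace so the code works across ALTO schema versions.
        if elem.tag.rsplit("}", 1)[-1] != "String":
            continue
        words.append((
            elem.get("CONTENT", ""),
            float(elem.get("HPOS", 0)),
            float(elem.get("VPOS", 0)),
            float(elem.get("WIDTH", 0)),
            float(elem.get("HEIGHT", 0)),
        ))
    return words


print([word[0] for word in alto_words(SAMPLE)])
# prints: ['Helsinki', '1903']
```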
--no-maxlimit
With this option, all pages of the title are downloaded. Without it, only around 10 files are downloaded by default, so that initial experiments do not take extra resources from Digi.
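A cap like this could be implemented in several ways; the following sketch shows one of them and is not taken from the script itself. The limit of exactly 10 is an assumption (the text above only says "around 10"), and the name limit_pages is made up.

```python
from itertools import islice

DEFAULT_MAX_FILES = 10  # assumption: the page only says "around 10"


def limit_pages(page_urls, no_maxlimit=False, max_files=DEFAULT_MAX_FILES):
    """Return an iterator over all URLs, or over the first max_files of them."""
    if no_maxlimit:
        return iter(page_urls)
    return islice(page_urls, max_files)
```

Because islice stops as soon as the cap is reached, nothing beyond the first max_files items is ever requested from the source iterable.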
--year
This option lets you pick which year's content to download. A typical way to run the script is to pick one particular year from the desired newspaper or journal:
python digi_downloadcontent.py -i fk01184 -y 1930
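To fetch several years in a row, the command above can be repeated from a small wrapper. This is a sketch that assumes the script is in the current directory and uses only the -i and -y options documented on this page; build_command and download_years are hypothetical helper names.

```python
import subprocess
import sys


def build_command(identifier: str, year: int,
                  script: str = "digi_downloadcontent.py") -> list:
    """Build the argument list for downloading one year of one title."""
    return [sys.executable, script, "-i", identifier, "-y", str(year)]


def download_years(identifier: str, years) -> None:
    """Run the downloader once per year, stopping at the first failure."""
    for year in years:
        subprocess.run(build_command(identifier, year), check=True)
```

For example, download_years("fk01184", range(1930, 1933)) would run the script for 1930, 1931 and 1932. Each run is still subject to the default file limit unless --no-maxlimit is added to the command.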