Analysing the data of digi.kansalliskirjasto.fi in a machine-readable format and creating enriched datasets
If you belong to one of the Haka-authorized user groups at certain Finnish universities and you need the data locally on your machine, follow these steps. This option is available for the duration of the Haka project.
Fill in the Haka-survey of Digi
- Go to http://digi.kansalliskirjasto.fi
- Click 'login' in the top right corner and select 'Haka'
- At first login (and at specific intervals) Digi shows you a survey about how the materials will be used. Please fill in the form, as it helps the National Library of Finland in future negotiations to extend access for researchers.
Set up the environment
If you have Python installed, you are good to go; otherwise, download and install Python.
Download the helper script digi_download_haka.py to a directory of your choice.
On Windows, press Windows key + R and type cmd to start the command prompt, then change to the directory where you saved the script.
Download pages
Run the provided helper script:
python digi_download_haka.py -i 0356-0996
Expected output is:
# python digi_download_haka.py -i 0356-0996
Digi.nationallibrary.fi - UNSUPPORTED downloader via Haka authentication
Download ALTO XML or TXT from digi for given ISSN.
Approaching HAKA identification for helsinki, wait for bit...
Organisation identification page reached.
Organisation login phase: HAKA password for helsinki authorisation (asked at every run):
succesful.
Ready to go!
All years for Warkauden Lehti (issn 0356-0996): [1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929]
ISSN 0356-0996 - Year 1929
Processing binding :1768795
https://digi.kansalliskirjasto.fi/sanomalehti/binding/1768795/page-1.txt -> ./0356-0996/1929/1768795_page-1.txt
https://digi.kansalliskirjasto.fi/sanomalehti/binding/1768795/page-2.txt -> ./0356-0996/1929/1768795_page-2.txt
https://digi.kansalliskirjasto.fi/sanomalehti/binding/1768795/page-3.txt -> ./0356-0996/1929/1768795_page-3.txt
https://digi.kansalliskirjasto.fi/sanomalehti/binding/1768795/page-4.txt -> ./0356-0996/1929/1768795_page-4.txt
Processing binding :1768785
https://digi.kansalliskirjasto.fi/sanomalehti/binding/1768785/page-1.txt -> ./0356-0996/1929/1768785_page-1.txt
https://digi.kansalliskirjasto.fi/sanomalehti/binding/1768785/page-2.txt -> ./0356-0996/1929/1768785_page-2.txt
https://digi.kansalliskirjasto.fi/sanomalehti/binding/1768785/page-3.txt -> ./0356-0996/1929/1768785_page-3.txt
https://digi.kansalliskirjasto.fi/sanomalehti/binding/1768785/page-4.txt -> ./0356-0996/1929/1768785_page-4.txt
Processing binding :1768776
https://digi.kansalliskirjasto.fi/sanomalehti/binding/1768776/page-1.txt -> ./0356-0996/1929/1768776_page-1.txt
https://digi.kansalliskirjasto.fi/sanomalehti/binding/1768776/page-2.txt -> ./0356-0996/1929/1768776_page-2.txt
https://digi.kansalliskirjasto.fi/sanomalehti/binding/1768776/page-3.txt -> ./0356-0996/1929/1768776_page-3.txt
https://digi.kansalliskirjasto.fi/sanomalehti/binding/1768776/page-4.txt -> ./0356-0996/1929/1768776_page-4.txt
12 files downloaded to 0356-0996 folder.
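To check what a run actually fetched, the folder layout shown in the log above (ISSN folder, year folder, files named after the binding id and page number) can be walked with a few lines of Python. This is only a minimal sketch: the function name collect_pages is hypothetical and not part of the helper script, and the file name pattern is an assumption read off the sample output, so it covers the txt format only.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming, taken from the sample log: "<binding id>_page-<n>.txt"
PAGE_NAME = re.compile(r"^(\d+)_page-(\d+)\.txt$")


def collect_pages(issn_dir):
    """Return {(year, binding_id): [page paths in page order]}.

    Hypothetical helper, not part of digi_download_haka.py.
    """
    bindings = defaultdict(list)
    for path in Path(issn_dir).glob("*/*.txt"):
        match = PAGE_NAME.match(path.name)
        if match is None:
            continue  # ignore files that do not follow the assumed naming
        binding_id, page_no = match.group(1), int(match.group(2))
        year = path.parent.name
        bindings[(year, binding_id)].append((page_no, path))
    # Sort numerically so that page-10 comes after page-2
    return {
        key: [p for _, p in sorted(pages, key=lambda item: item[0])]
        for key, pages in bindings.items()
    }
```

For the example run, collect_pages("0356-0996") should give one entry per binding of each downloaded year, each listing that binding's pages in page order.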
Supported command line parameters
Digi.nationallibrary.fi - UNSUPPORTED downloader via Haka authentication
Download ALTO XML or TXT from digi for given ISSN.
usage: digi_download_haka.py [-h] -i ISSN [-u {utu,uef,helsinki}]
[-f {alto,txt}]
-f format
The default download format is plain text (txt). The text is provided as it was received from the text recognition (OCR) step of the digitization post-processing system.
If you choose 'alto', you get ALTO XML, which contains the layout information of the page and the words with their locations on the page.
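As an illustration of what the ALTO format carries, the sketch below reads the words and their positions out of one ALTO XML page using only the Python standard library. It assumes the usual ALTO structure, where each TextLine element holds String elements with CONTENT, HPOS and VPOS attributes, and it ignores the XML namespace because the ALTO version of the files is not stated here. The function names are hypothetical, and hyphenated words split across lines are not rejoined.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a namespace such as '{http://...}String' down to 'String'."""
    return tag.rsplit("}", 1)[-1]


def alto_to_lines(xml_data):
    """Return the page text as a list of strings, one per ALTO TextLine."""
    root = ET.fromstring(xml_data)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = [
            child.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return lines


def alto_words_with_positions(xml_data):
    """Return (word, HPOS, VPOS) tuples; positions stay as attribute strings."""
    root = ET.fromstring(xml_data)
    return [
        (el.get("CONTENT", ""), el.get("HPOS"), el.get("VPOS"))
        for el in root.iter()
        if local_name(el.tag) == "String"
    ]
```

Both functions accept the file content as text or bytes, for example the result of pathlib.Path(...).read_bytes() on a downloaded ALTO file.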
-u university
The -u command line parameter is an experimental switch for logging in through a different university. NB! It requires customization based on the Haka login form of each university. It has been tested with the University of Helsinki setup.
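For readers who want to wrap or extend the script, the usage line above corresponds to an argparse setup along the following lines. This is a reconstruction from the usage text and is not the script's actual source: the dest names are hypothetical, the txt default follows the description of -f, and the helsinki default is only inferred from the sample run, which logs in through helsinki without -u being given.

```python
import argparse


def build_parser():
    """Sketch of a parser matching the documented usage line (not the real source)."""
    parser = argparse.ArgumentParser(
        prog="digi_download_haka.py",
        description="Download ALTO XML or TXT from digi for given ISSN.",
    )
    # -i is the only required option in the usage line
    parser.add_argument("-i", dest="issn", metavar="ISSN", required=True)
    # Default inferred from the sample run, which used helsinki without -u
    parser.add_argument(
        "-u", dest="university",
        choices=["utu", "uef", "helsinki"], default="helsinki",
    )
    # The documentation states that text is the default format
    parser.add_argument(
        "-f", dest="format", choices=["alto", "txt"], default="txt",
    )
    return parser
```

With this sketch, build_parser().parse_args(["-i", "0356-0996", "-f", "alto"]) selects ALTO XML for the example ISSN and leaves the university at its assumed default.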
What next?
You can process the files you have downloaded in many ways.
From the ALTO XML files you can get back to plain text, e.g. via tools provided by the Comhis project.
It is also possible to use online tools such as Voyant Tools for text analysis; you can supply the data either via links or by uploading your own files directly.
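If you plan to upload your own data to an online tool, one file per year is often easier to handle than hundreds of single pages. The sketch below concatenates the txt pages of one year folder into a single file. It assumes the file naming seen in the sample output and that the pages are UTF-8 encoded; both are assumptions, and merge_year and page_key are hypothetical names that are not part of the helper script.

```python
from pathlib import Path


def page_key(path):
    """Sort key (binding id, page number) for names like '1768795_page-3.txt'."""
    binding, _, page = path.stem.partition("_page-")
    return (binding, int(page) if page.isdigit() else 0)


def merge_year(year_dir, out_file):
    """Concatenate all .txt pages in year_dir into out_file; return the page count."""
    pages = sorted(Path(year_dir).glob("*.txt"), key=page_key)
    with open(out_file, "w", encoding="utf-8") as out:
        for page in pages:
            # UTF-8 is assumed; undecodable bytes are replaced, not fatal
            out.write(page.read_text(encoding="utf-8", errors="replace"))
            out.write("\n")
    return len(pages)
```

For example, merge_year("0356-0996/1929", "0356-0996_1929.txt") writes the 1929 pages into one file. Keep the output file outside the year folder so that a later run does not pick it up as a page.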