Analysing the data of digi.kansalliskirjasto.fi in a machine-readable format and creating enriched datasets

If you belong to a Haka-authorized user group at one of the participating Finnish universities and need the data locally on your machine, follow these steps. This option is available for the duration of the Haka project.

Fill in the Haka survey of Digi

  1. Go to http://digi.kansalliskirjasto.fi
  2. Click 'login' in the top right corner and select 'Haka'.
  3. On first login (and at certain intervals afterwards) Digi shows you a survey about how the materials will be used. Please fill in the form: it helps the National Library of Finland in future negotiations to extend access for researchers.

Setup environment

If you have Python installed, you are good to go; otherwise download and install Python.

Download the helper script digi_download_haka to a directory of your choice.

On Windows, press the Windows key + R and type cmd to open a command prompt, then change to the directory where you saved the script.

Download pages

Run the provided helper script:

python digi_download_haka.py -i 0356-0996

The expected output is:

# python digi_download_haka.py -i 0356-0996

Digi.nationallibrary.fi - UNSUPPORTED downloader via Haka authentication
Download ALTO XML or TXT from digi for given ISSN.
Approaching HAKA identification for helsinki, wait for bit...
Organisation identification page reached.
Organisation login phase: HAKA password for helsinki authorisation (asked at every run):
succesful.
Ready to go!
All years for Warkauden Lehti (issn 0356-0996): [1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929]
ISSN 0356-0996 - Year 1929
Processing binding :1768795
https://digi.kansalliskirjasto.fi/sanomalehti/binding/1768795/page-1.txt -> ./0356-0996/1929/1768795_page-1.txt
https://digi.kansalliskirjasto.fi/sanomalehti/binding/1768795/page-2.txt -> ./0356-0996/1929/1768795_page-2.txt
https://digi.kansalliskirjasto.fi/sanomalehti/binding/1768795/page-3.txt -> ./0356-0996/1929/1768795_page-3.txt
https://digi.kansalliskirjasto.fi/sanomalehti/binding/1768795/page-4.txt -> ./0356-0996/1929/1768795_page-4.txt
Processing binding :1768785
https://digi.kansalliskirjasto.fi/sanomalehti/binding/1768785/page-1.txt -> ./0356-0996/1929/1768785_page-1.txt
https://digi.kansalliskirjasto.fi/sanomalehti/binding/1768785/page-2.txt -> ./0356-0996/1929/1768785_page-2.txt
https://digi.kansalliskirjasto.fi/sanomalehti/binding/1768785/page-3.txt -> ./0356-0996/1929/1768785_page-3.txt
https://digi.kansalliskirjasto.fi/sanomalehti/binding/1768785/page-4.txt -> ./0356-0996/1929/1768785_page-4.txt
Processing binding :1768776
https://digi.kansalliskirjasto.fi/sanomalehti/binding/1768776/page-1.txt -> ./0356-0996/1929/1768776_page-1.txt
https://digi.kansalliskirjasto.fi/sanomalehti/binding/1768776/page-2.txt -> ./0356-0996/1929/1768776_page-2.txt
https://digi.kansalliskirjasto.fi/sanomalehti/binding/1768776/page-3.txt -> ./0356-0996/1929/1768776_page-3.txt
https://digi.kansalliskirjasto.fi/sanomalehti/binding/1768776/page-4.txt -> ./0356-0996/1929/1768776_page-4.txt
12 files downloaded to 0356-0996 folder.
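
To check what actually arrived on disk, the folder layout visible in the output above (ISSN folder, year folder, files named by binding and page) can be summarised with a few lines of Python. This is only a sketch: the function name inventory is made up for illustration, and the pattern accepts any file extension because the output above shows only .txt files.

```python
import re
from collections import defaultdict
from pathlib import Path

# File names in the sample output look like "1768795_page-1.txt".
# Any extension is accepted (assumption: ALTO downloads follow the same naming).
PAGE_NAME = re.compile(r"^(?P<binding>\d+)_page-(?P<page>\d+)\.\w+$")


def inventory(issn_dir):
    """Return {year: {binding_id: [sorted page numbers]}} for one ISSN folder."""
    found = defaultdict(lambda: defaultdict(list))
    # Layout assumed from the sample output: <issn_dir>/<year>/<binding>_page-<n>.<ext>
    for path in Path(issn_dir).glob("*/*"):
        match = PAGE_NAME.match(path.name)
        if path.is_file() and match:
            year = path.parent.name
            found[year][match["binding"]].append(int(match["page"]))
    return {
        year: {binding: sorted(pages) for binding, pages in bindings.items()}
        for year, bindings in found.items()
    }


if __name__ == "__main__":
    # Print one line per binding for the folder created by the example run.
    for year, bindings in sorted(inventory("0356-0996").items()):
        for binding, pages in sorted(bindings.items()):
            print(year, binding, len(pages), "pages")
```

Comparing the page counts per binding against the log is a quick way to spot an interrupted download.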

Supported command line parameters

Digi.nationallibrary.fi - UNSUPPORTED downloader via Haka authentication
Download ALTO XML or TXT from digi for given ISSN.
usage: digi_download_haka.py [-h] -i ISSN [-u {utu,uef,helsinki}]
                             [-f {alto,txt}]

-f format

The default download format is plain text. The text is provided as it was originally received from the text recognition (OCR) step of the digitization post-processing system.

If you choose 'alto', you get ALTO XML, which contains the layout information of the page and each word together with its location on the page.
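
As a rough illustration of what the ALTO format offers, the sketch below reads words and their coordinates with Python's standard library. It relies on the general ALTO conventions (String elements carrying CONTENT, HPOS and VPOS attributes, grouped into TextLine elements); the function names are invented here, the namespace is ignored because it differs between ALTO versions, and hyphenation elements are not handled.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace: '{http://...}String' becomes 'String'."""
    return tag.rsplit("}", 1)[-1]


def alto_words(xml_bytes):
    """Return a list of (word, hpos, vpos) for every String element."""
    root = ET.fromstring(xml_bytes)
    words = []
    for element in root.iter():
        if local_name(element.tag) == "String":
            words.append((
                element.get("CONTENT", ""),
                float(element.get("HPOS", "nan")),
                float(element.get("VPOS", "nan")),
            ))
    return words


def alto_text(xml_bytes):
    """Rebuild plain text, one output line per ALTO TextLine."""
    root = ET.fromstring(xml_bytes)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            strings = [child.get("CONTENT", "") for child in element
                       if local_name(child.tag) == "String"]
            lines.append(" ".join(strings))
    return "\n".join(lines)
```

Pass the functions the raw bytes of a downloaded file, for example the result of open(path, "rb").read().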

-u university

The command line parameter -u is an experimental switch for logging in via a different university. NB! It requires customization based on the Haka login form of each university. It has been tested with the University of Helsinki setup.
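
If you call the script from your own Python code, a small wrapper can catch typos before the Haka login starts. The sketch below is not part of the helper script: build_command and issn_is_valid are hypothetical names, the allowed values are copied from the usage text above, and the ISSN check is the standard check-digit rule (weights 8 to 2, modulo 11), applied here as a local pre-check only.

```python
import re
import sys

# Choices copied from the usage text of the helper script.
UNIVERSITIES = ("utu", "uef", "helsinki")
FORMATS = ("alto", "txt")


def issn_is_valid(issn):
    """Check the NNNN-NNNC shape and the ISSN check digit."""
    if not re.fullmatch(r"\d{4}-\d{3}[\dX]", issn):
        return False
    digits = issn.replace("-", "")
    # First seven digits are weighted 8, 7, 6, 5, 4, 3, 2.
    total = sum(int(d) * w for d, w in zip(digits[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return digits[7] == ("X" if check == 10 else str(check))


def build_command(issn, university=None, fmt=None,
                  script="digi_download_haka.py"):
    """Return the argument list for one run of the helper script."""
    if not issn_is_valid(issn):
        raise ValueError(f"not a valid ISSN: {issn!r}")
    if university is not None and university not in UNIVERSITIES:
        raise ValueError(f"unknown university: {university!r}")
    if fmt is not None and fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    command = [sys.executable, script, "-i", issn]
    if university is not None:
        command += ["-u", university]
    if fmt is not None:
        command += ["-f", fmt]
    return command


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run() to execute it.
    print(" ".join(build_command("0356-0996", fmt="alto")))
```

The script still asks for the Haka password at every run, so a wrapper like this only prepares the command and does not make the download unattended.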

What next?

You can process the downloaded files in many ways.

  • From the XML files you can get back to plain text, e.g. with the tools provided by the Comhis project.

  • You can also use online tools such as Voyant Tools for text analysis; you can provide the data either as links or by uploading your own files directly.
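
For a first look at the downloaded text without any external tool, a word frequency count is a simple local starting point. The sketch below is one possible approach, not a recommended pipeline: word_counts is a name chosen here, the files are assumed to be UTF-8 encoded, and the tokenisation is deliberately crude (runs of letters, lowercased), so OCR errors will show up as separate "words".

```python
import re
from collections import Counter
from pathlib import Path

# A "word" is a run of letters, including non-ASCII ones such as ä and ö.
WORD = re.compile(r"[^\W\d_]+")


def word_counts(root):
    """Count lowercased words in every .txt file below root."""
    counts = Counter()
    for path in Path(root).rglob("*.txt"):
        # Assumption: the page texts are UTF-8; undecodable bytes are replaced.
        text = path.read_text(encoding="utf-8", errors="replace")
        counts.update(word.lower() for word in WORD.findall(text))
    return counts


if __name__ == "__main__":
    # Show the 20 most frequent words of the example download.
    for word, count in word_counts("0356-0996").most_common(20):
        print(count, word)
```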
