PoliticalMashup Documentation

Documentation of the PoliticalMashup project; NWO number (add NWO number).

The goal of the project is to collect some of the many available online sources of political data, and make these available for political researchers in a structured, accessible manner. A strong focus lies on the proceedings/hansards from several European countries, and Dutch members of parliament and political parties.

This document describes all steps taken to acquire, process and combine (i.e. mashup) the data.

The data exists in three stages. First is the original source, located somewhere online in many different forms. Second are the clean xml versions of the data, collected through a scraping process. Third is a transformed, fully specified and validated set of structured documents. The final documents have been enriched with explicit links to other documents, both internal and external.

The resulting data set is stored in the native xml database eXist, and is accessible through a multitude of search and listing interfaces. Permanent identifiers for the documents are available through the Handle system.

Related projects include the annotation of speakers in the Dutch proceedings archive.

General todo remarks

crontab automatically pulls transformers/schema from git; this has been disabled due to lack of further development.

Our data resides in the urn:nbn:nl:ui:35 namespace (see e.g. OAI).

Check http://ilps.science.uva.nl/twiki/bin/view/Main/PoliticalMashup for twiki links.

Scraping

Download available documents from different sources, convert them if necessary, and save them as clean utf-8 encoded xml.

The process of automatically scraping data sources is handled by the scrapy toolkit, and managed with a bash script and the crontab.

Different source types (html, xml, pdf) are all converted to valid xml. Ensuring this is a strict responsibility of the scraping module, because the next transformation step requires valid xml as input.

Scrapy

Scraping, the process of downloading and storing documents, is done with a python toolkit called Scrapy.

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages.

Additional dependencies as imported in the python crawlers, apart from scrapy, include: pyexist, lxml, BeautifulSoup, tidylib and wikitools. Make sure these modules are available.

For each of the data sources listed below with a [source identifier], a scrapy crawler has been implemented.
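
A minimal sketch of such a crawler, written against the current Scrapy spider API (the spider name, link filter and yielded fields are illustrative, not the actual implementation from the scrapy-d repository):

import scrapy

class DanishProceedingsSpider(scrapy.Spider):
    # Hypothetical example spider for the [dk] source.
    name = 'dk'
    start_urls = ['http://www.ft.dk/Dokumenter/Referater_efter_modedato.aspx']

    def parse(self, response):
        # Follow links from the overview page to individual meeting reports.
        for href in response.css('a::attr(href)').getall():
            if 'Referat' in href:
                yield response.follow(href, callback=self.parse_meeting)

    def parse_meeting(self, response):
        # Yield the raw page; cleaning into valid xml happens in a pipeline.
        yield {'url': response.url, 'body': response.text}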

Managing scrapers

A scraper implements the process of downloading source data, cleaning it, uploading it to the transformation database and logging the results. A scraper is started with the manage script, also available in the scrapy-d git repository. For example, to start the Danish scraper, run ./manage.sh schedule dk-update. The manage script prevents multiple scraping sessions from running simultaneously, and is executed by the ilps_bg crontab according to the schema listed at the PoliticalMashupScrapyD twiki.

XML output format

Because all other steps in the project require and utilise XML, valid XML is ensured as early as possible in the process.

The data comes from three types of source documents: XML, HTML and PDF. XML documents are downloaded as is. HTML documents are cleaned, repaired if necessary, and stored as XML with the HTML elements intact. PDF documents are converted to XML with the pdf2xml tool.
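
For the HTML case, repairing a page into well-formed markup can be sketched with pytidylib, one of the dependencies listed earlier (the option set is illustrative; the actual crawlers differ per source):

from tidylib import tidy_document

def html_to_xml(raw_html):
    # Repair the markup and emit well-formed XHTML, keeping elements intact.
    xhtml, errors = tidy_document(raw_html, options={
        'output-xhtml': 1,
        'char-encoding': 'utf8',
        'numeric-entities': 1,
    })
    return xhtml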

All source data is downloaded and stored at mashup1:/scratch/data/parliament/ in subfolders named after the source crawlers described below.

pdf2xml

Converting PDF files to XML is based on the output of pdftohtml. Multiple text columns, headers, and footers are reconstructed into a sequential, chronological account of the transcript. Contextual information present on the online overview pages, but not within the PDFs, can be added with a separately created .dcmeta.xml file.
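
The extraction step itself amounts to a call like the following sketch, where pdftohtml's XML output mode provides per-line text fragments with coordinates for the later reconstruction (paths are illustrative):

import subprocess

def pdf_to_raw_xml(pdf_path, out_basename):
    # '-xml' selects XML output for post-processing, '-i' ignores images.
    subprocess.check_call(['pdftohtml', '-xml', '-i', pdf_path, out_basename])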

Sadly, extracting text from PDF files remains a problematic endeavour. The collection of code used in pdf2xml is flaky, and should be considered to work for the currently processed PDF data, but not necessarily for other data.

Sources

All data was downloaded either with verbal permission, or is assumed to be without restrictions, unless stated otherwise. Each xml document can be accessed individually, and has its rights made as explicit as possible in the dc:rights field/element. The permanent data collections are listed below.

Some data sets are available per meeting (few documents), others per topic (many documents). Senate documents were collected only for the Dutch proceedings. Some countries have both a lower and an upper house (commons and senate), but not all; see http://en.wikipedia.org/wiki/National_parliaments_of_the_European_Union for more information. Dutch proceedings have a third house type, "other", that is sometimes used for combined meetings at the start of the legislative year.

Scraped sources

[nl-ob] Officiele Bekendmakingen (OB)
1994-2013, commons/senate
XML documents
https://zoek.officielebekendmakingen.nl/
Current Dutch proceedings and other parliamentary documents are downloaded and available from 1995 onward. These documents were published digitally, and are generally available as well-formatted xml data.
[nl-sgd] Staten Generaal Digitaal (SGD)
1814-1994, commons/senate
PDF/XML documents
http://statengeneraaldigitaal.nl/
Older Dutch proceedings, from 1814 until 1995, were scanned, OCR processed, and made available as pdf and xml by the Koninklijke Bibliotheek (KB, Dutch Royal Library). For the data created here, the PDF sources were used. A project related to PoliticalMashup was the explicit identification and annotation of speakers.
[nl-members] Dutch Political Members
1814-2013, commons/senate/government
HTML documents
http://www.parlement.com/
The Parlementair Documentatie Centrum (PDC) has collected and maintains an up-to-date set of Dutch politicians.
[nl-party] Dutch Political Parties
?-2013, commons/senate
XML documents
http://data.politicalmashup.nl/sources/nl-party/
http://www.rug.nl/dnpp
Manually created set of Dutch political parties, in collaboration with the Documentatiecentrum Nederlandse Politieke Partijen (DNPP). Scraping is emulated (to make use of the existing, controlled and logged processing pipeline) from a local copy of the xml files.
[nl-draft] Drafts of the Dutch Commons proceedings
2011-2013, commons
HTML documents
http://www.tweedekamer.nl/kamerstukken/verslagen/index.jsp
The drafts are typically available directly after the meetings take place, and as such are interesting for current affairs. Officially, however, they are restricted and may not be republished.
[uk] UK proceedings
1935-2013, commons
XML documents
http://ukparse.kforge.net/parldata/scrapedxml/debates/
The UK proceedings had already been processed for quite some time before the start of the PoliticalMashup project. They are used for instance on TheyWorkForYou at http://www.theyworkforyou.com/. We use these processed versions, rather than the officially published documents.
[uk-members] UK parliament members
?-2013, commons/lords/??
XML documents
http://ukparse.kforge.net/parlparse/
As with the UK hansards, the members of parliament are also collected and annotated in the TheyWorkForYou project.
[be-proc-vln] Vlaams parlement, Flanders proceedings
1995-2013, commons
PDF documents
http://www.vlaamsparlement.be/vp/parlementairedocumenten/index.html
The Dutch-language proceedings of the Flanders parliament (i.e. not the Belgian combined parliament).
Draft documents are called "Voorlopige Tekst" and are not processed.
[be-members-vln] Flanders members of parliament
????-2013
HTML documents
http://www.vlaamsparlement.be/vp/vlaamsevolksvertegenwoordigers/index.html
The members of the Flanders parliament are collected into a static member set available at http://transformer.politicalmashup.nl/id/be/vln-ids.xml. This set is not updated automatically, however; identification of new members will fail until the member set is updated manually.
[no] Stortinget, Norwegian proceedings
1998-2013, commons
HTML documents
http://www.stortinget.no/nn/Saker-og-publikasjoner/Publikasjoner/Referater/?mt=Stortinget
The Norwegian parliamentary proceedings are all available in the same format. The members are presented at http://www.stortinget.no/nn/Representanter-og-komiteer/Representantene/Biografier/. The members were collected with an XSLT sheet, http://transformer.politicalmashup.nl/id/no/scrape_no.xsl (possible because the data source was very clean, XML-valid HTML), into a static set of members, available at http://transformer.politicalmashup.nl/id/no/no-ids.xml. New members will not be identified unless the member set is manually updated.
Draft documents are called "midlertidig" and are not processed.
[dk] Folketinget, Danish proceedings
1999-2013, commons
HTML documents
http://www.ft.dk/Dokumenter/Referater_efter_modedato.aspx
Current Danish proceedings are available in HTML format. The archive of old proceedings was processed from PDF files.
Members of parliament already have explicit ids in the source data, so they do not have to be identified by us.
[se] Riksdagen, Swedish proceedings
1990-2013, commons
HTML documents
http://www.riksdagen.se/sv/Dokument-Lagar/Kammaren/Protokoll/
Swedish proceedings are all available in HTML format. Not all available documents have been processed (the actual format changed several times).
Draft documents are called "snabbprotokoll", and are processed and later updated with permanent documents, called "protokoll".
[es] Spanish proceedings
-
PDF documents
Partly implemented but never actively used.

Other sources

Three notable other sources are, or were, used: dbpedia, wikipedia and the PentaPolitica feeds.

The knowledge base dbpedia contains many relations between entities. They are used to add relations to the political member and party documents.

Information from wikipedia was used to extract knowledge about the Dutch governmental periods.

Additional online resources for politicians, such as twitter accounts, were collected by PentaPolitica on http://stolp.lab.dispectu.com/export/feeds. These feeds were explicitly linked from the member documents, but are currently not added.

more external sources?

Process Logging

The entire scrape and transformation process is logged. Logs of the scrape process include counting how many new, updated or existing documents were found, and whether they were transformed correctly.

These logs are stored, up to 15 iterations back, by scrapy, and interpreted by a companion script daily-logs.sh.

Summarising logs are emailed to politicalmashup [at] gmail [dot] com. This email account is used for all process and server admin logs. All messages are labelled, and checking the last twenty or so messages quickly shows whether everything is working. Detailed logs of the current state of transformations, e.g. whether members are being identified correctly, must be inspected using the monitor. Login credentials can be found on the twiki page.

Transformation

Transform the scraped, structurally different xml files into a common verifiable xml structure.

XSLT transformation

Input for the transformer, as noted above, is always valid xml. Transformation is triggered by uploading a file to the monitor. This file is input to an XSLT transformation that extracts content from the file and outputs this content in a pre-defined "politicalmashup" structure. The result of a transformation is stored in the permanent collection if it was successfully validated.

Transformers are created by first importing a set of basic templates. These templates, if used correctly, ensure that the output is valid xml with respect to specifically designed schemata. The source-specific template that imports them typically defines the methods for extracting the correct data elements.
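
The actual transformation runs inside the monitor's eXist database; purely as an illustration, applying a transformer sheet boils down to the following sketch with lxml (file names are illustrative):

from lxml import etree

# Load a source-specific transformer, which imports the basic templates.
transformer = etree.XSLT(etree.parse('nl-ob-transformer.xsl'))

# Apply it to a scraped, valid-xml input document and save the result.
result = transformer(etree.parse('scraped-input.xml'))
result.write('transformed.xml', xml_declaration=True, encoding='UTF-8')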

Document identifiers

The most important responsibility of a transformer is the construction of a unique identifier for that document. The identifier is typically based on an existing unique id for a document, extended with our information on the specific collection and sub-structural element. The identifiers are used to link documents to each other, and to resolve them.

From the documentation of the resolver module:

Identifiers must follow a specific structure, exemplified below.
The identifier is provided by the controller.xql as the $path parameter,
and is the url-part after the resolver base url, without any
get-parameters or #hash parts.


Path/identifier for the bulk of our data (proceedings mostly):
nl.proc.sgd.d.192619270000382.2.14.4
^^^^^^^^^^^ * =============== ++++++
|           | |               \----> document section(s), dot separated
|           | \--------------------> local id (document basename)
|           \----------------------> {d,m,p}
\----------------------------------> collection

{d,m,p} stand for {document, member, party} respectively
Identifiers are always unique within our entire data set.

The document sections are currently only relevant for 'd' documents,
and point to a specific part (e.g. a paragraph or a topic) of the
document. These sections are automatically numbered and given during
processing. As such, if .2.14.4 exists, .2.14 will also exist as its
parent element.

The local id is a string that cannot contain a dot, but otherwise
consists of any sequence of alphanumeric characters or dashes. It is
unique within a specific collection.

The collection part must match exactly a collection hierarchy, within
the permanent data+document+path collection. For instance, nl.proc.sgd.d
will be located in the /db/data/permanent/d/nl/proc/sgd collection.
nl.m.01234 (a member) is local-id 01234 in the collection
/db/data/permanent/m/nl.
The identifier-collection + document-type + local-id together form the
unique identifier for a single xml document in the database.
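
As an illustration, an identifier can be split mechanically into the documented parts; a sketch that assumes a well-formed identifier:

def parse_identifier(identifier):
    parts = identifier.split('.')
    # The first single-letter d/m/p part separates the collection
    # from the local id and the optional document sections.
    t = next(i for i, p in enumerate(parts) if p in ('d', 'm', 'p'))
    return {
        'collection': '.'.join(parts[:t]),  # e.g. nl.proc.sgd
        'type': parts[t],                   # document, member or party
        'local-id': parts[t + 1],           # document basename
        'sections': parts[t + 2:],          # e.g. ['2', '14', '4'], may be empty
    }

# parse_identifier('nl.proc.sgd.d.192619270000382.2.14.4') gives collection
# 'nl.proc.sgd', type 'd', local-id '192619270000382' and sections ['2', '14', '4'].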

Structure of transformed documents

This section only contains short descriptions, and deserves to be more detailed.

Written here are global descriptions of the resulting structure of transformed documents. Exact definitions of the structure are given by the validation RelaxNG schemas. Elements are always in the pm (politicalmashup) namespace, unless specified otherwise (e.g. dc:rights). Documents can be "validly" expanded by putting new or additional information in a separate namespace, and using the extendable "document"X.rnc schemas that allow foreign nodes.

Each document always consists of a root, with: first a docinfo (with information about the status of the transformation and validation); second a meta with meta-data like the date of proceedings, type of document, identifier, rights, etc.; third a document-type specific element, e.g. pm:proceedings for proceedings.

After successful transformation, the monitor adds validation information to the document docinfo before storing it in the database. N.B. this was added later, and has not been processed retroactively, so typically only relatively recent data lists all validation results.

Proceedings

Proceedings consist of one or more topics. A topic consists either of scenes with speeches, or directly of speeches. Speeches link to members and contain paragraphs. For specific, separate content, stage-directions are used.

Speaker roles are also added (mp, chair or government).

N.B. Danish member-refs differ from all our other refs because they do not start with "dk.m.", due to a legacy decision to keep the ids as they are found in the proceedings.

Members

Member documents are largely self-explanatory: member, birthdate, names. Some details deserve explanation: alternative names (i.e. the original name with overridden/added alternative fields); functions-government, which lists the government activities used; and memberships-commons/senate, which hold the dates used to determine whether someone is part of the house or not (memberships-government are about government periods, and do not necessarily reflect the actual activity of a person). Due to rights restrictions, only the basic information can be shown by the resolver (see below); the complete information is available in the database and used for e.g. the member identification (todo: link) script.

UK members are a bit different: they have specific membership-session-ids, and are identified as such in the source data.

Parties

Only Dutch parties are available, which are transformed based on manually defined source documents.

The one interesting thing here is the seats distribution, which is calculated from the number of politicians holding a membership for that party on each day, combining that information into periods with numbers of seats. Because the person seat/membership information is not perfect (actual politicians sometimes temporarily leave or join parties, for instance), there are some deviations from the correct number of seats per day. The distribution is, however, good enough for our purposes, and works better than the earlier approach of scraping wikipedia for seats distributions.
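
The computation can be sketched as follows (data structures are illustrative; the actual calculation is done during transformation, see the list-party-seats.xq utility script):

from datetime import timedelta

def seat_periods(memberships, start, end):
    # memberships: list of (from_date, to_date) seat intervals for one party.
    periods = []
    day = start
    while day <= end:
        seats = sum(1 for frm, to in memberships if frm <= day <= to)
        if periods and periods[-1][2] == seats:
            periods[-1] = (periods[-1][0], day, seats)  # extend the current period
        else:
            periods.append((day, day, seats))           # count changed: new period
        day += timedelta(days=1)
    return periods                                      # [(first_day, last_day, seats)]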

Parliamentary Documents

Parliamentary documents are quite different from proceedings. They typically contain an introduction, textual body, and the signers who introduced or back the proposal.

Monitor

The monitor logs all transformation results. These logs are also read by the scrapy crawler to inspect the transformation status of the last uploaded document. Because eXist can be a bit slow, all uploading is done sequentially. This also makes it much easier to see how a transformation went, and gives the JVM some time to do its required garbage collection.

The monitor interface can be reached at http://monitor.politicalmashup.nl/monitor/. Login and password are the same as the administration login credentials found in the installation folder.

During heavy load from transformations, the number of open files on the monitor can rise quickly towards the (Linux default) maximum of approximately 1000 allowed.

Linking

Add references to other documents to create a rich network of political document relations.

One goal of the project, and an argument for the construction of XML documents, is the idea of linked data. Where possible, relations between the documents and to outside sources are added explicitly, to enable context-sensitive research queries on the data.

Most linking is done during the transformation phase, which, apart from structure, adds the lookup of documents, identification of members, querying of dbpedia, etc. One exception is the querying of wikipedia-links, which is done during scraping for members, and was defined manually for the parties.

Relations

The relations between documents and to outside sources are listed below.

proceedings -> members
Members speaking and present are explicitly linked to member documents.
proceedings -> parties
Parties of speaking members are linked to party documents.
proceedings -> parliamentary documents
Votes on amendments link to the specific amendment document.
parliamentary documents -> parliamentary documents
Amendments are sometimes (often) updated with newer versions, typically because of new content or new people backing the amendment. Links to the replaced documents are made explicit.
Links are added to new documents, and were retroactively added to about half of the available documents; not all replacements are explicit yet.
parliamentary documents -> proceedings
Once an amendment has been voted upon, a link to the actual vote within the proceedings is added.
Adding a link to a vote almost always occurs later than the first processing of the parliamentary document (since votes typically become available later). After the switch to the new eXist setup, this detection had to be rewritten and is not enabled. A concept solution is available as a redo.xq "source" that can be "scraped", as is required for the full validation and logging process.
parliamentary documents -> members
Amendments are introduced or backed by politicians and linked. Parties of the backing politicians are not added.
parliamentary documents -> parties
If an amendment links to a vote, the vote is copied into the meta-data, and thereby contains the parties and how they voted.
members -> parties
Members (of the upper or lower house) are always seated for a specific party (as opposed to government members who are typically affiliated to a party but act "neutral"). Each seat-period is linked to a party.
parties -> members
Parties are implicitly linked to members as the number of seats representing them. These seats are counted based on the members, but not explicitly linked to members.
parties -> parties
Parties tend to split up or combine into new parties. Ancestor and descendant relations, when applicable, are made explicit.
members -> external
Members are linked with the source parlement.com, Wikipedia, dbpedia and, when available through dbpedia, with freebase.
At the time of writing, an identification process is running that detects politicians in newspaper archives. When done, members will also be linked with specific newspaper articles available at the KB.
parties -> external
Parties are linked with Wikipedia, parlement.com, dbpedia and, when available through dbpedia, with freebase.
proceedings -> external
Proceedings are linked with their source documents, when these can be uniquely determined during transformation. The Dutch source data also has explicit document references, which are displayed as actual data links in the HTML view.

Validation

Each processed document, after transformation and linking, is checked for structural and content validity.

Directly after transformation is finished, the transformation monitor checks the resulting document. If the validation is accepted, meaning no problems or only warnings, the validation results are added to the docinfo of the document; otherwise the document is rejected and not stored. Two types of data validation are done: structural and content validation.

RelaxNG

The possibilities and limits of the document structures are explicitly defined in a RelaxNG (compact) definition. These definitions are used to validate that each output document is structurally conformant to the same format.

This means that, after a document has been validated, we know for certain that (x)queries will be correct for that type of data, and will succeed (or, if they do not succeed, we know why).

Jing

Validation of RelaxNG is done with the free tool jing. A documented HTML version of the schema used below can be found at http://schema.politicalmashup.nl/proceedings.html.

Example commandline validation: java -jar /path/jing.jar -c "http://schema.politicalmashup.nl/proceedings.rnc" /path/proceedings-document.xml.

Schematron

Whereas RelaxNG validates structure, Schematron XSLT sheets are used to validate the content of the data.

Content validation is used to evaluate the quality of the content extraction, linking etc.

Schematron validations are defined in XML as a set of rules, patterns and assertions. Such an XML file is converted to an XSLT validator with a set of transformation tools provided by the Schematron developers. The validator sheet, in its turn, "transforms" a data document into a list of evaluations.
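
For quick local checks, lxml also ships an ISO Schematron implementation that compiles and applies a .sch definition in one step. A sketch (this is not the project's tooling, which uses the Saxon-based pipeline described below, and it assumes the definition is ISO Schematron):

from lxml import etree
from lxml.isoschematron import Schematron

# Compile the schematron definition and validate a transformed document.
schema = Schematron(etree.parse('proceedingsschematron.sch'))
doc = etree.parse('proceedings-document.xml')
if not schema.validate(doc):
    print(schema.error_log)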

Saxon

Schematron definitions are available as .sch files in the git repository, and converted with sch-to-xsl.sh (also available in the repository; requires saxon to be present in /home/ilps/parsetools/). The xml definition of the validator used below is available at http://schema.politicalmashup.nl/proceedingsschematron.sch.

Example commandline validation of a data document: java -jar /path/saxon.jar /path/proceedings-document.xml "http://schema.politicalmashup.nl/proceedingsschematron.xsl".

Data Access

Multiple ways to access and search the data have been implemented.

Resolver

All data is stored internally as XML documents. Each document is available on the PoliticalMashup resolver. The resolver should be supplied the identifier of a document or sub-element, and returns the relevant XML.

Parts of documents (sub-elements) can also be requested with the document identifier and the specific part supplied separately, as [identifier]?part=[part].

Apart from the raw xml, we also provide an HTML and an RDF view of the data. These are requested by adding ?view=html or ?view=rdf to the url. For simplified data harvesting, content-types can also be specified as [identifier].xml/.html/.rdf or, for the rdf view only, through header content negotiation.

The documents on the resolver can also be reached by means of handles.

Apart from content elements, the document information and meta data can also be requested separately by adding .docinfo and .meta respectively to a document identifier (i.e. no element identifiers or other views are supported).

Finally, Lucene query words can be highlighted by adding ?q=[query] to the url. This can be quite slow, however, due to implementation restrictions.
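
Put together, the access patterns above look as follows (a sketch using the Python requests library; the resolver base url and the example identifier are taken from this documentation, while the RDF content type and the query word are assumptions):

import requests

BASE = 'http://resolver.politicalmashup.nl/'
DOC = 'nl.proc.sgd.d.192619270000382'

raw_xml = requests.get(BASE + DOC).text                              # raw xml
html_view = requests.get(BASE + DOC + '.html').text                  # HTML view
sub_part = requests.get(BASE + DOC, params={'part': '2.14.4'}).text  # sub-element
meta_only = requests.get(BASE + DOC + '.meta').text                  # meta-data only
rdf_view = requests.get(BASE + DOC,                                  # content negotiation
                        headers={'Accept': 'application/rdf+xml'}).text
highlighted = requests.get(BASE + DOC, params={'q': 'water'}).text   # query highlighting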

Due to rights restrictions, the public data presentation of members excludes any membership, curriculum and education information. This data is available for internal processing, and used for instance during the identification of members in proceedings.

Full-text search, see search.politicalmashup.nl, is implemented with the Lucene backend available in eXist. This interface is built specifically for the Dutch proceedings, with restrictions specific to Dutch parliament/government members and political parties. Searching other collections is available through the demo-search.xq utility script.

The basic idea behind Lucene in combination with eXist is that each xml-element (when defined in a configuration file) is treated as a single "document" and presented to Lucene. Lucene document retrieval thus reduces to xml-element retrieval.

The search is complemented with filters that restrict searches to: speakers, roles (mp, government or chair), political party, house (commons or senate), date of the meeting. Because of the added structure, we allow for entry-point retrieval.

One of the advantages of the structured documents as made available in the PoliticalMashup project, is the ability to retrieve and thus search on specific levels. The search engine can search in topic titles, entire topics, speaker scenes, and specific speeches. The demo-search searches in speeches or single paragraphs.

For each result, the context is always completely specified and retrieved. Additional explanation is given (in Dutch) at http://search.politicalmashup.nl/help.xq

Backend Utility Scripts

analyse-members.xq
Allow analysis of members and their interactions, per legislative period: list all active members, list all text paragraphs from a member, list all scenes for inspection, and list all interruptions.
announce-available.xq
Script where the transformation monitor can announce a newly transformed document. Relevant only for the main, active frontend eXist.
Required for the live presentation and propagation of new, up-to-date data.
check-party-seats.xq
Old script used to analyse the seats distribution, to check whether it is correctly extracted from the member data. Currently not very useful, but could be updated.
complete.xq
Analyses whether the data is complete, by determining the sessions and items that should at least be available, and checking for them. Note that this can be a computation-intensive script to run.
demo-search.xq
Search interface for the proceedings data. Has fewer options than search.politicalmashup.nl but is generally faster, and allows searching in all different language proceedings.
export.xq
Facilitates listing all available data, with multiple input source and output format options. Useful for listings fed to curl etc.
get-votes.xq
Old, flaky, but important script that searches for votes in the proceedings based on a dossier- and sub-number. Throws an error when called without arguments; as an example, see kst-32469-18.
Required for the transformation of parliamentary documents.
id-members.xq
Identify political members based on their name and additional context. Explicit member documents are available for Dutch and UK politicians. Possibly the most important script.
Required during the transformation of proceedings.
Required during the transformation of parliamentary documents.
id-parldoc.xq
Identify parliamentary documents based on a dossier number and dossier subnumber. Script is able to correctly detect reprints.
Required for the identification of parliamentary documents in votes, during the transformation of proceedings.
id-parties.xq
Identify parties based on a string and a date.
Required during the transformation of proceedings.
Required during the transformation of members.
Add input type checking.
list-handle.xq
List a data collection as a Handle batch-script.
list-members.xq
List all members given filters (date, house, member-id etc.). More or less id-members.xq without a query. Used in the past to supply people with lists of politicians in xml.
Used (and required) during the update-fix for the SGD proceedings to add party-links where they are not mentioned in the (older) data. Not actively used.
list-parldoc.xq
List all available parliamentary documents (currently only amendments) and their number of votes.
list-parties.xq
List all parties, given some filters.
Required for the listing of active parties, during the transformation of proceedings.
Required to infer votes from non-listed parties (e.g. "the rest voted against", who is this rest).
list-party-seats.xq
List the seats of a party given some date and house. The seats are calculated dynamically during transformation, based on the people being a member of that party in that house.
Required during the transformation of parties.
list-updates.xq
List all documents added or updated since a given xs:dateTime.
Secondary eXist databases (e.g. search.politicalmashup.nl) use update-available.xq to get the latest updates, requiring list-updates.xq on the main live backend.
list-votes.xq
List all votes per legislative period. Useful for the analyses requested by newspapers etc.
run-backup.xq
Immediately triggers the backup script.
This is too immediate. Make sure it requests a password or something similar (see monitor code).
stats.xq
Calculate statistics on the size of the data and percentage of members identified.
update-available.xq
Update a secondary eXist database with newly available documents. This script is not useful/relevant on the main resolver eXist, since this is the reference database. Used most importantly by the search.politicalmashup.nl database.
Requires a running main eXist server with backend.politicalmashup.nl/list-updates.xq.

Some scripts could/should be updated to fully utilise the export.xqm module and create a usable webinterface: get-votes.xq, list-members.xq, list-parldoc.xq, list-parties.xq, list-party-seats.xq.

Other scripts

autocomplete/query.xq
Script returning a set of possible members matching a term in JSON format. Used for autocomplete options in the main search interface.
monitor.politicalmashup.nl/proc/nl-prev-doc.xq
Find the previous document of a Dutch proceedings document, to copy over the current chair information (this is not repeated in the source for consecutive items).
Required during the transformation of proceedings.
monitor.politicalmashup.nl/proc/logs.xq
Read the transformation logs to see if a document was actually (and properly) processed.
Required after the transformation of documents by scrapy to check if transformation was successful.

OAI

XML data is also available for harvesting through the OAI-PMH protocol. With this protocol, meta-data is listed according to the specification, with links to the actual data on the resolver. OAI uses a set of verb commands, documented separately. For example, the data for Dutch political parties can be retrieved with http://oai.politicalmashup.nl/?verb=ListRecords&set=p:nl
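
A sketch of harvesting such a set programmatically, following OAI-PMH resumption tokens (the oai_dc metadataPrefix is the protocol's common default; whether this endpoint requires the parameter is not verified here):

import requests
from lxml import etree

OAI = 'http://oai.politicalmashup.nl/'
NS = {'oai': 'http://www.openarchives.org/OAI/2.0/'}

params = {'verb': 'ListRecords', 'set': 'p:nl', 'metadataPrefix': 'oai_dc'}
while True:
    tree = etree.fromstring(requests.get(OAI, params=params).content)
    for header in tree.findall('.//oai:header', NS):
        # Each record header carries the identifier that resolves to the data.
        print(header.findtext('oai:identifier', namespaces=NS))
    token = tree.findtext('.//oai:resumptionToken', namespaces=NS)
    if not token:
        break
    params = {'verb': 'ListRecords', 'resumptionToken': token}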

eXist

The data is stored and accessible through the eXist database server. Installation, code development, and database content structure.

eXist is a native XML database. It is used to store all data and xquery scripts, and apply data transformations and validations.

We use version 1.4.3; notes on eXist 2.0 are given at the end.

The current setup favours multiple differentiated databases over one database that does everything. A list of all running database instances, and their use, is available on the twiki. One important exception is an old database that is still running and showcases old demos. More information on xml.politicalmashup.nl is listed below.

Install eXist database/copy

Installation of a personal development (or new live) database is assisted by a deployment script. This script automates many of the necessary steps, such as setting listening ports during installation, and adding data or updating code in an installed database.

The guide below walks through the actual installation process, and then gives an example of adding data. In the guide, the SSH links are used, which allow for actual development but require a working github.science.uva.nl account. The subsection after this one uses the deployment script to streamline actual code development.

The installation was created for the mashup server machines. It requires the availability of wdfs to mount webdav and copy code. This is an unfortunate legacy requirement.

# Get and setup the deployment script.
git clone git@github.science.uva.nl:politicalmashup/parliament-deploy.git
# [[ Enter git pw.
# Rename the deployment to some relevant
# name if wanted.
mv parliament-deploy/ some_name
cd some_name/
# Configure the installation
./configure
# Typically, for frontend development,
# all default settings are fine, but pay
# attention to max. memory, and the local
# port used.
# JAVA max_mem (2000):
# EXIST port (8080):

# Now, start the installation.
make install
# [[ Enter new admin pw.
# [[ Enter new default username.
# [[ Enter new default user pw.
# [[ Enter git pw.

# Done!

Download and add data

After installation, data can be copied from the main database to the local folder, and then added to the local database. To aid collection downloading, a bash_completion script can be loaded, to TAB through the possible options. Below is an example of downloading the (transformed and validated) Dutch parties.

# Load bash_completion
source bash_completion.sh
# Load available collections
make update_available_collections
# Initialise the collection.
make collection_init collection=permanent/p/nl
# Retrieve the parties.
make get_local_copy collection=permanent/p/nl
# Put/copy the files to the local database
make put_local_copy collection=permanent/p/nl
# Clean up downloaded files
rm -rf tmp-data/

Data update from main frontend

After a database has been installed, and initial data added, new/updated data can be retrieved from the main servers with a basic update script. This will update any collections that have been initialised during the data-copying step above.

On the installed database, run: http://[host]:[port]/backend/update-available.xq?since=2013-[MM]-[DD]T00:00:00.
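
The same update can be triggered from a script; a sketch (host, port and timestamp are placeholders):

import requests

# The update may take a while (see below), so allow a generous timeout.
requests.get('http://localhost:8080/backend/update-available.xq',
             params={'since': '2013-06-01T00:00:00'},
             timeout=3600)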

The update process can take some time if files (e.g. nl-members) need updating. Adding files is relatively fast.

Code development

Code and applications are developed in a separate development eXist. If the result is satisfactory, commit and push the changes to git, and then pull and copy them on the live eXist machines.

One exception is the /db/www/private/ collection. Any code therein is ignored by the git repository. So long as the database is not overloaded, it is possible to experiment with all data/code etc. To reach the code, browse to the root of the database, i.e. not to resolver.politicalmashup.nl, but to mashup[*].science.uva.nl:[port]/private/filename.ext.

Another possibility is to write a query, and run it (using oXygen) against a database as datasource.

oXygen

For all development concerning xquery, xml or xslt, the editor oXygen is extremely useful. Apart from code highlighting and interpretation, it allows eXist databases to be added as datasource. See the twiki EXistDB#eXist_with_oXygen for more information on loading eXist data, and licenses.

git push

Often, and for example for this documentation, code is written through oXygen. This code is then copied into the local git version, located in the same folder as the eXist instance.

To update git, copy the changes out of the eXist and commit+push to git. Note that pushing to git is only possible if you know the git user password as given during the installation of the eXist database. This is also the reason the read-only live databases run by ilps_bg cannot push.

make update_git
# Enter admin pw
cd installation/parliament-*
# * == {backend,frontend}-clone
git status
git commit [files you want to commit]
git push
# Enter git pw

git pull

Development on live databases, e.g. the resolver, should not be done. Rather, after pushing development changes to git, pull them into the live databases afterwards. One reason for this is that the main databases are run by the group user ilps_bg, who cannot push changes into the git repository.

To update a database to the latest version available in git, go to the main deployment folder (as given on the twiki) and run the pull+update code (this requires the database password in local-pw).

make pull
make update_exist
# Enter admin pw

eXist uptime

A downside of eXist, at least for version 1.4.*, is that it is not always very stable. Reliably checking whether a system is running is not really possible. A system might simply be temporarily non-responsive, hung completely, or not running at all. Checking for running processes is therefore not sufficient. Also, when a database crashes, it might leave behind logs and locks that prevent it from starting up properly. The best way is to manually check and restart a database if there is a question about its functioning. See the running services overview for guidance on (re)starting specific databases.

In case of emergency requirement of a working eXist, a shadow copy of the live resolver is running (and updated daily) on mashup0. See the twiki database list for its exact location (the current backup is R), and edit the apache configuration on mashup0 to point any required subdomain url's to the backup database.

Modules

Collections in the database serve specific purposes, as listed below.

backend
Code for the backend utilities.
backup
Code to run backups, largely old and unused since automated backups are run by the systems.
data
Contains all data, permanent data for frontends, and additionally upload data for the monitor.
local
Some configuration created at installation time.
logs
Relevant for the monitor only, contains logs of the transformations.
modules
Contains code used by many scripts, notably the export.xqm and functx definitions.
oai
Code for OAI.
resolver
Code for the resolver.
system
Contains, nested within, the data- and lucene-index configuration files.
www
Collection accessible from outside.
www/documentation
Contains this documentation.
www/pop
Old search script for Populisme project. Kept for code reference.
www/proc
Only exists in the monitor. Contains logs and previous document scripts, see other backend utilities.
www/private
Collection created for development; it is ignored when updating git through the parliament-deploy script.
www/search
Contains the code for the Dutch proceedings search.
www/xqueries
Old collection for xqueries, used in xml.politicalmashup.nl etc?

Data collections

Data in eXist is stored in collections. Below is an overview of the crawlers, collection paths and number of documents. Detailed statistics overviews of the proceedings, e.g. number of speeches and percentage of identified members, can be calculated online.

crawler         collection                  #documents
[nl-ob]         //permanent/d/nl/proc/ob    28173+
[nl-ob]         //permanent/d/nl/parldoc    14645+
[nl-sgd]        //permanent/d/nl/proc/sgd   24002
[nl-members]    //permanent/m/nl            3657+
[nl-party]      //permanent/p/nl            151+
[nl-draft]      //permanent/d/nl/draft      201+
[uk]            //permanent/d/uk            11965+
[uk-members]    //permanent/m/uk            13131+
[be-proc-vln]   //permanent/d/be/proc/vln   897+
[no]            //permanent/d/no/proc       12339+
[dk]            //permanent/d/dk/proc       7698+
[se]            //permanent/d/se/proc       2966+

Bugs

The eXist database, although very flexible and useful for our needs, has occasional hiccups. Under heavy load the responsiveness can become very low (if eXist starts to swap RAM, it is usually better to just shut it down and restart).

The most important bug concerns the xs:date index on our dc:date fields. Although xs:dateTime indices work perfectly, the xs:date index sometimes returns erroneous results. It is therefore very important to always check whether your results conform to your expectations. This holds mostly for the search engine (todo link), but also for the export script (todo link), for instance. Scripts like list-updates (todo link) that work on xs:dateTime do not suffer from this bug and can be assumed safe. Switching from xs:date to xs:string indices on the dc:date field was considered and tested, but proved far too slow for the amount of data collected.

eXist 2.0

A new stable version 2.0 of eXist has been available for a short while, but it came too late for us to incorporate. The existing systems therefore run on the most up-to-date previous version, 1.4.3.

The parliament-deploy scripts contain the branch exist2, which implements a working, but not fully automated, installation of eXist 2.0. Although the database works, and existing code can be made to work, it requires quite a lot of redesigning due to a change in structure. It does look promising, both as the current access system and for other uses, e.g. education, and will hopefully reduce the number of bugs.

Fully revise the code to work with eXist 2.0 and update the frontend.

Handle

All documents can be resolved, and referenced elsewhere, through the handle resolution system.

System

The Handle System provides efficient, extensible, and secure resolution services for unique and persistent identifiers of digital objects, and is a component of CNRI's Digital Object Architecture. - http://handle.net/.

Handle is used, for instance, by DOI, and provides a uniform, scalable framework for referencing all things digital. The simplest use, and our current method, is the resolution of a "handle" (an identifier) to an online location (our resolver).

Each handle contains a set of information rows, called "Handle Values". These Values consist of four fields: the Index, the Type, a Timestamp, and the Data. The examples below are the best explanation. Additional information on running a server and adding handles can be found on the twiki.

There is currently no backup server. If the virtual machine breaks down or gets deleted, due to bad maintenance or otherwise, the handles are gone. Keep this in mind when handles are added manually. The PoliticalMashup handles are easily added again through the list-handle.xq script and the batch processing of the GUI tool. Passwords for the current server are available on paper only, in room C3.230.

Identifiers

All documents, as available on 2013-06-21, are available through handle as pm:[identifier]. The only exception is the Dutch proceedings drafts, since these officially may not be republished.

Update

Updating the handles must be done manually, mainly because the password should not end up in an automation script. Manual updates require two steps: downloading the handle definitions as a batch file, and processing the batch file with the GUI tool. This does not require access to the handle server.

Download the handle software through http://handle.net/download.html. Then download the server files from the twiki: gui-required-files.tar.gz.

Assuming both files were saved to /path/, extract both files there. Then, start the gui tool with /path/hsj-*/bin/hdl-admintool.

Download batch file

This example downloads and updates the Dutch modern proceedings.

http://backend.politicalmashup.nl/list-handle.xq?prefix=11145&project=pm&keyfile=/path/to/admpriv.bin&password=copypasswordfrompaper&since=2013-06-10T00:00:00&collection=s%2Fnl%2Fproc%2Fob&view=table.

Now, change the 'keyfile' field to the physical, absolute path where the admpriv.bin file is located on the machine where you are running the GUI tool, i.e. /path/gui-required-files/admpriv.bin. Change the 'password' field to the password that is required to use the admpriv.bin (i.e. not the server certificate password). Change the 'view' field to csv, and click search. Copy-paste the resulting text into a file, e.g. update_collection.batch. Use the GUI tool to process this batch.

Use GUI tool

In the open GUI, click the right-bottom "Authenticate". Then enter 0.NA/11145 (with a zero, not the letter O); select Public/Private Key. Click "Select Key File.." and browse to /path/gui-required-files/admpriv.bin. Click "OK", and enter the handle admin password to decrypt the key.

Click File, Open Batch File. Select update_collection.batch and press OK. If you want, logging the output to file is possible with Send Output To, but the context window is fine for small batches. Then press Run Batch(es), and it will start. When handles already exist, they will generate FAILURE messages, which is fine.

SGD Annotation

Find, identify and visualise speakers in OCR data.

The processing of the old (pre 1995) Dutch proceedings required more specific steps and a higher quality, in part because of the external Koninklijke Bibliotheek project.

Shouldn't there be more information about this project on the twiki?

KB project

The KB ran a project wherein all historical proceedings from 1814-1995, published only on paper, were digitised and automatically parsed with OCR.

Part of the project, detecting speakers, was performed by us in a project related to PoliticalMashup. Apart from the detection of speakers in the text, the exact word coordinates on the scanned images and the unique identification of those speakers were requested. Some of the techniques and insights from this project were reused within PoliticalMashup.

Transformation

The transformation of the SGD data occurred in three stages. The first stage consisted of transforming the PDF documents to relatively structured data. The second stage was determining and adding the unique member ids to the data. The third stage was a more recent clean-up script that is applied through the uploading process described in the transformation section.

Redoing the entire process, specifically the first stage, will most likely be difficult due to (changed) dependencies, data locations etc. The results however are still available.

Exposure

PoliticalMashup has collaborated with journalists, and published and presented at scientific conferences.

Members

list people? maarten marx, arjan nusselder, bart de goede, justin van wees, johan doornik, lars buitinck, isaac sijaranamual, anne schuth, steven grijzenhout?

scrape arjan's twiki for info: http://ilps.science.uva.nl/twiki/bin/view/Main/ArjanNusselderTODO http://ilps.science.uva.nl/twiki/bin/view/Main/ArjanNusselderTodoArchive http://ilps.science.uva.nl/twiki/bin/view/Main/ArjanNusselderLogBook

Publications

list publications? list conferences? (dir, okcon, the Enschede event of Bart and Justin, etc?)

Newspapers

VN, nrc, ?

Blog

politicalmashup.nl

KB/Dienst informatie voorziening?

---

Other collaborations?

Marina Lacroix, Wijfjes, ?