From wikibase-solutions.com

Developer Log #7 - Importing files and documents into the wiki in a structured way

Topics: MediaWiki API (Login, Edit page, Upload file), Semantic MediaWiki properties.

In this developer log I will discuss the use case of importing your scattered, unstructured data into your new wiki. Think of files and documents stored on Dropbox, Google Drive or other storage platforms. You can structure this data in your favourite programming language, using the storage platform's API and the MediaWiki API.

Step 1. Translate the structure

First of all, we want to capture the structure that is already present on the current storage platform. Using the storage platform API, we can describe each file by the names of its parent directories and its own filename. Based on these properties we might already be able to structure some file and document types. In our programming language we create an object that holds the file and the properties found for it. From this object we create a page that references the file and stores its properties as Semantic MediaWiki properties; see the example below.

{{Template
 |Name=<Name>
 |Property1=<Property1>
 |Property2=<Property2>
 |Filename=<Filename>
}}

Each file or document category can have its own template. This template can be used to define the look and feel in the wiki.
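To make Step 1 concrete, here is a minimal sketch of turning a storage path into such an object and rendering the template call from it. The function names and the rule that the first two parent directories become Property1 and Property2 are assumptions for illustration; your own mapping will depend on how the old storage platform is organised.

```python
from pathlib import PurePosixPath


def describe_file(path):
    """Split a storage path into the properties used by the page template."""
    p = PurePosixPath(path)
    # Parent directory names, without the leading root of an absolute path
    parents = [part for part in p.parent.parts if part != "/"]
    return {
        "Name": p.stem,
        # Hypothetical mapping: top-level folder -> Property1, next folder -> Property2
        "Property1": parents[0] if len(parents) > 0 else "",
        "Property2": parents[1] if len(parents) > 1 else "",
        "Filename": p.name,
    }


def render_template(template, properties):
    """Build the wikitext of a template call from a dict of properties."""
    lines = ["{{" + template]
    lines += [f" |{key}={value}" for key, value in properties.items()]
    lines.append("}}")
    return "\n".join(lines)


if __name__ == "__main__":
    props = describe_file("/Customers/Acme/contract.pdf")
    print(render_template("Template", props))
```

The rendered text can be used as the content of the page that references the file.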

Step 2. Upload the files and create the pages

Now we use the MediaWiki API to create pages and upload the corresponding files. I expect you have bot credentials ready; if not, go to Special:BotPasswords and create them there. You might need to enable bot passwords first ($wgEnableBotPasswords).

In Python (my favourite programming language) the MediaWiki API class might look like the class at the end of this page. This class authenticates on creation and exposes methods to create pages and upload files.
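As a rough sketch of how such a class could be driven, the loop below uploads each file and then creates the page that references it. It assumes `api` is any object offering `post_file(file_path, file_name)` and `post_page(title, text)`, as the class at the end of this page does; the shape of the `records` list and the function name are hypothetical.

```python
def import_records(api, records, template="Template"):
    """Upload each file, then create a page that references it.

    Returns a list of (path, error message) pairs for records that failed,
    so they can be inspected and retried later.
    """
    failed = []
    for record in records:
        props = record["properties"]
        # Build the template call that stores the properties on the page
        fields = "".join(f" |{key}={value}\n" for key, value in props.items())
        text = "{{" + template + "\n" + fields + "}}"
        try:
            api.post_file(record["path"], props["Filename"])
            api.post_page(props["Name"], text)
        except Exception as error:
            failed.append((record["path"], str(error)))
    return failed
```

Collecting failures instead of stopping at the first one keeps a long import running and leaves a list of files to look at afterwards.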

Additional tweaks

There might also be files or documents that need more properties than can be derived from the structure on the old storage platform. Which additional properties are needed, and how they can be found, depends heavily on the particular use case and file type, but I will list some examples:

  • Properties that can be scraped from the content of the files.
  • Files containing synonyms/aliases, such as files containing CustomerSurname1 and CustomerSurnameandLastname1.
  • Miscellaneous / non-scrapable files.

Most programming languages offer tools that let you scrape most file types, so I would refer you to those languages and tools. For example, Python's BeautifulSoup4 can be an excellent tool for HTML and XML files, Pandas for Excel-type files, and so on.
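As an illustration of scraping properties from file content, the sketch below pulls the title and the named meta tags out of an HTML file. To keep it free of dependencies it uses Python's standard-library `html.parser` as a stand-in for BeautifulSoup4; the class and function names, and the choice of which tags count as properties, are assumptions.

```python
from html.parser import HTMLParser


class PropertyScraper(HTMLParser):
    """Collect the <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs and "content" in attrs:
            self.properties[attrs["name"]] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append instead of overwrite
        if self._in_title:
            self.properties["title"] = self.properties.get("title", "") + data.strip()


def scrape_html_properties(html):
    """Return a dict of properties found in an HTML document."""
    scraper = PropertyScraper()
    scraper.feed(html)
    scraper.close()
    return scraper.properties
```

The resulting dict can be merged with the properties derived from the directory structure before the page is created.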

Want to use synonyms and aliases? These can be handled by creating a ‘dictionary’. Make sure every form in which the same property occurs is documented in such a way that the scrape tool can recognize them all as the same property.
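Such a dictionary could look like the sketch below, which maps every known spelling of a property onto one canonical name. The canonical name `Customer`, the listed aliases and the normalisation rule (lowercase, letters and digits only) are made up for the example.

```python
import re

# Hypothetical alias dictionary: canonical property name -> known spellings
CANONICAL_ALIASES = {
    "Customer": ["CustomerSurname", "CustomerSurnameandLastname", "customer name"],
}


def _key(label):
    """Reduce a label to lowercase letters and digits so spellings compare equal."""
    return re.sub(r"[^a-z0-9]", "", label.lower())


# Reverse lookup: normalised spelling -> canonical property name
LOOKUP = {
    _key(alias): canonical
    for canonical, aliases in CANONICAL_ALIASES.items()
    for alias in [canonical, *aliases]
}


def normalise_properties(raw):
    """Rename known aliases to their canonical name; keep unknown labels as-is."""
    return {LOOKUP.get(_key(label), label): value for label, value in raw.items()}
```

Labels that are not in the dictionary pass through unchanged, which makes it easy to spot spellings that still need to be added.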

There will always be a chance that a few files do not fit a certain type or are overlooked by your program. We can analyze these files and documents and improve the program so that it either assigns them to a category or marks them as miscellaneous. If the program ends up importing all data, the import can be reproduced without manual actions, and future files are more likely to be picked up correctly.
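One way to keep overlooked files visible is to give every file a category and let anything unmatched fall into a miscellaneous bucket. The sketch below does this by file extension; the extension table and category names are assumptions, and real rules could just as well use directories or scraped content.

```python
from pathlib import PurePosixPath

# Hypothetical rules: file extension -> wiki category
CATEGORY_BY_EXTENSION = {
    ".pdf": "Document",
    ".docx": "Document",
    ".xlsx": "Spreadsheet",
    ".png": "Image",
}


def categorise(paths):
    """Group paths by category; unmatched files end up under 'Miscellaneous'."""
    categorised = {}
    for path in paths:
        suffix = PurePosixPath(path).suffix.lower()
        category = CATEGORY_BY_EXTENSION.get(suffix, "Miscellaneous")
        categorised.setdefault(category, []).append(path)
    return categorised
```

Reviewing the miscellaneous list after each run shows which rules are still missing.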

Example code

import requests # The python module to create requests, other languages might use a curl equivalent
from os import rename

class Api:

    def __init__(self, bot_credentials):
        # bot_credentials is a dict with the keys 'apiurl', 'username' and 'password'
        self.bot_credentials = bot_credentials # Define credentials for api user
        self.csrf_token, self.s = self.get_csrftoken() # Authenticate and get csrf token and session

    def get_csrftoken(self):
        s = requests.Session() # Initialize session
        param = {
            "action": "query",
            "meta": "tokens",
            "type": "login",
            "format": "json"
        }
        r = s.get(url=self.bot_credentials['apiurl'], params=param) # Query login token
        login_token = r.json()['query']['tokens']['logintoken']

        param = {
            "action": "login",
            "lgname": self.bot_credentials["username"],
            "lgpassword": self.bot_credentials["password"],
            "lgtoken": login_token,
            "format": "json"
        }
        r = s.post(self.bot_credentials['apiurl'], data=param) # Login with credentials
        if r.json()['login']['result'] != 'Success':
            print(r.json())
            raise Exception('Logging in to MW api failed for: ' + self.bot_credentials['username'])

        param = {
            "action": "query",
            "meta": "tokens",
            "format": "json"
        }
        r = s.get(url=self.bot_credentials['apiurl'], params=param)
        csrf_token = r.json()['query']['tokens']['csrftoken']
        return csrf_token, s # Get csrf token and keep session alive

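    def post_page(self, title, text, bot=True):
        param = {
            "action": "edit",
            "title": title,
            "text": text,
            "token": self.csrf_token,
            "format": "json",
        }
        if bot:
            # MediaWiki treats a boolean parameter as true whenever it is present,
            # so only send it when the edit should be flagged as a bot edit
            param["bot"] = 1
        r = self.s.post(self.bot_credentials['apiurl'], data=param) # Create page with title and text
        result = r.json()
        if result.get('edit', {}).get('result') != 'Success': # Also covers responses that only contain an error
            print(result)
            raise Exception('Editing/Creating page failed for: ' + title)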

    def post_file(self, file_path, file_name):
        param = {
            "action": "upload",
            "filename": file_name,
            "format": "json",
            "token": self.csrf_token,
            "ignorewarnings": 1
        }
        with open(file_path, 'rb') as f:
            file = {'file': (file_name, f, 'multipart/form-data')}
            r = self.s.post(self.bot_credentials['apiurl'], files=file, data=param) # Upload file to wiki
        try: # Beneath is some error handling as errors might pop up and need some special handling
            result = r.json()
        except Exception:
            print(r.content)
            print('Bad return: File may be too big')
            raise
        if result.get('upload', {}).get('result') == 'Success':
            return
        code = result.get('error', {}).get('code') # None if the response holds no error object
        if code == 'fileexists-no-change':
            print(result['error']['info']) # The identical file is already on the wiki
        elif code == 'verification-error' and file_name.endswith('.docx'):
            # Hacky way as some files saved as docx are doc files:
            # rename the file on disk and retry the upload as .doc
            rename(file_path, file_path[:-1])
            self.post_file(file_path[:-1], file_name[:-1])
        elif code == 'verification-error':
            print(result)
            print(file_name)
        elif code == 'empty-file':
            return # Nothing to upload
        else:
            print(result)
            raise Exception('Uploading file failed for: ' + file_name)