From wikibase-solutions.com
Revision as of 13:17, 24 December 2020 by Admin

Developer Log #7 - Importing files and documents into a wiki in a structured way

Topics: MediaWiki API (login, edit page, upload file), Semantic MediaWiki properties.

In this developer log I discuss the use case of importing your scattered, unstructured data into your new wiki. Think of files and documents stored on Dropbox, Google Drive or other storage platforms. The data can be structured in your favorite programming language, using the storage platform's API and the MediaWiki API.

First of all, we would like to capture the structure that is already present on the current storage platform. Using the storage platform API we can describe each file by the names of its parent directories and its own filename. Based on these properties we might already be able to structure some file and document types. We can create an object in our programming language containing the file and the properties found for it. From this object we can create a page that references the file and stores its properties as Semantic MediaWiki properties, see the example below.

{{Template
 |Name=<Name>
 |Property1=<Property1>
 |Property2=<Property2>
 |Filename=<Filename>
}}

Each file or document category can have its own template. This template can define the look and feel in the wiki.
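The step from a stored file to such a template call can be sketched as follows. This is a minimal sketch, not the implementation used for this log: the function names `describe_file` and `render_template`, the template name `Document`, the property `Folders` and the example path are all hypothetical, and the path is assumed to be a POSIX-style string as returned by a storage platform API.

```python
from pathlib import PurePosixPath


def describe_file(path, root):
    """Derive page properties from a file's location below `root` (hypothetical helper)."""
    relative = PurePosixPath(path).relative_to(root)
    return {
        "Name": relative.stem,                 # filename without its extension
        "Folders": list(relative.parts[:-1]),  # parent directories, outermost first
        "Filename": relative.name,
    }


def render_template(template, properties):
    """Render a wikitext template call from a dict of properties (hypothetical helper)."""
    lines = ["{{" + template]
    for key, value in properties.items():
        if isinstance(value, (list, tuple)):
            value = ", ".join(value)  # flatten lists into one comma-separated value
        lines.append(" |{}={}".format(key, value))
    lines.append("}}")
    return "\n".join(lines)


# Example path and template name are made up for illustration.
record = describe_file("/Shared/Customers/Acme/Contract 2020.pdf", "/Shared")
page_text = render_template("Document", record)
print(page_text)
```

The resulting `page_text` is the wikitext that would be stored on the page for this file.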

Now we can use the MediaWiki API to create this page and upload the corresponding file. I expect you have bot credentials ready; otherwise go to /Special:BotPasswords and create them. You might need to enable the setting for bots ($wgEnableBotPasswords). In Python (my favorite programming language) the MediaWiki API class might look like the class at the end of this document. This class authenticates on creation and exposes the functionality to create pages and upload files.
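How the two calls fit together for a single file might look like the sketch below. It assumes an object with `post_file(file_path, file_name)` and `post_page(title, text)` methods, as the class at the end of this document provides. The function `import_record` and the stand-in class `RecordingApi` are hypothetical and only there so the example runs without a wiki; uploading before creating the page is a design choice here, not a requirement of the API.

```python
def import_record(api, file_path, file_name, page_title, page_text):
    """Upload one file and then create the page that references it (hypothetical helper)."""
    api.post_file(file_path, file_name)   # upload first, so the page does not point at a missing file
    api.post_page(page_title, page_text)


class RecordingApi:
    """Stand-in for the real Api class: records calls instead of contacting a wiki."""

    def __init__(self):
        self.calls = []

    def post_file(self, file_path, file_name):
        self.calls.append(("post_file", file_path, file_name))

    def post_page(self, title, text, bot=True):
        self.calls.append(("post_page", title, text))


api = RecordingApi()  # replace with an instance of the real class to talk to a wiki
import_record(
    api,
    "/tmp/contract.pdf",  # made-up local path
    "Contract 2020.pdf",
    "Contract 2020",
    "{{Document\n |Filename=Contract 2020.pdf\n}}",
)
print(api.calls)
```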

There might also be files or documents that need more properties than can be found from the structure on the old storage platform. These extra properties, and how to find them, depend heavily on the use case and file type, but I will list some examples:

  • Properties that can be scraped from the file's content.
  • Files containing synonyms/aliases, think of files containing CustomerSurname1 and CustomerSurnameandLastname1.
  • Miscellaneous files/non-scrapable files.


Most programming languages provide tools to scrape most file types; I would refer to those languages and tools. For example, in Python BeautifulSoup4 can be an excellent tool for HTML and XML files, Pandas for Excel-type files, and so on.
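As an illustration of scraping properties from a file's content, here is a minimal sketch for HTML. To keep it free of third-party dependencies it uses `html.parser` from Python's standard library rather than BeautifulSoup4; the class `MetaScraper`, the function `scrape_html` and the sample document are hypothetical.

```python
from html.parser import HTMLParser


class MetaScraper(HTMLParser):
    """Collect the <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs and "content" in attrs:
            self.properties[attrs["name"]] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            # Text may arrive in several chunks, so append instead of overwrite
            self.properties["title"] = self.properties.get("title", "") + data


def scrape_html(html):
    """Return the properties found in one HTML document as a dict."""
    scraper = MetaScraper()
    scraper.feed(html)
    scraper.close()
    return scraper.properties


# Made-up sample document
sample = """<html><head><title>Offer 42</title>
<meta name="author" content="J. Jansen">
<meta name="customer" content="Acme"></head><body></body></html>"""
scraped = scrape_html(sample)
print(scraped)
```

The resulting dict can be merged with the properties derived from the directory structure before the page is created.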

Synonyms and aliases can be handled by creating a 'dictionary'. Make sure all forms in which the same property occurs are documented, so that the scrape tool can recognize them as the same property.
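Such a dictionary could be sketched as below. The alias table, the property names and the helper functions `normalise_key` and `canonical_properties` are hypothetical; a real table would be filled from the spellings that actually occur in your files.

```python
# Hypothetical alias table: every spelling found in the files maps to one canonical property.
ALIASES = {
    "customer": "Customer",
    "customername": "Customer",
    "client": "Customer",
    "date": "Date",
    "documentdate": "Date",
}


def normalise_key(raw):
    """Lower-case and drop separators so 'Customer_Name' and 'customer name' compare equal."""
    return "".join(ch for ch in raw.lower() if ch.isalnum())


def canonical_properties(scraped):
    """Split scraped key/value pairs into known properties and unrecognised keys."""
    known, unknown = {}, {}
    for key, value in scraped.items():
        canonical = ALIASES.get(normalise_key(key))
        if canonical is None:
            unknown[key] = value  # keep for review, so the alias table can be extended
        else:
            known[canonical] = value
    return known, unknown


known, unknown = canonical_properties(
    {"Customer Name": "Acme", "document_date": "2020-12-24", "Ref.": "A-17"}
)
print(known)
print(unknown)
```

Returning the unrecognised keys separately makes it easy to see which aliases are still missing from the table.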

There will always be a chance that a few files do not fit a certain type or are missed by your program. We can analyze these files and documents and improve the program to include them in some category or define them as miscellaneous. Making sure that in the end all data is imported by the program means that reproducing the import does not require manual actions, and future files might be picked up better.
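One way to make sure no file is silently skipped is to give every file a category, with an explicit fallback. The sketch below assumes rule functions on the file path; the rules, the category names and the example paths are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical rules: the first matching rule decides the category.
RULES = [
    ("Contract", lambda p: "contracts" in [part.lower() for part in p.parts]),
    ("Spreadsheet", lambda p: p.suffix.lower() in {".xls", ".xlsx", ".csv"}),
]


def categorise(path):
    """Return the first matching category, or 'Miscellaneous' if no rule applies."""
    p = PurePosixPath(path)
    for category, matches in RULES:
        if matches(p):
            return category
    return "Miscellaneous"


# Made-up example paths
paths = [
    "Customers/Acme/Contracts/2020.pdf",
    "Finance/budget.XLSX",
    "Old/notes.txt",
]
categories = {path: categorise(path) for path in paths}
summary = Counter(categories.values())
leftovers = [path for path, category in categories.items() if category == "Miscellaneous"]
print(summary)
print(leftovers)
```

Reviewing `leftovers` after each run shows which rules are still missing, while the files themselves are imported anyway.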

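import requests # The Python module to create requests, other languages might use a curl equivalent
from os import rename

# Config is not defined in this log. It is assumed to be your own settings object whose
# `bot` attribute is a dict with the keys 'apiurl', 'username' and 'password'.

class Api:

    def __init__(self, bot_credentials=Config.bot):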
        self.bot_credentials = bot_credentials # Define credentials for api user
        self.csrf_token, self.s = self.get_csrftoken() # Authenticate and get csrf token and session

    def get_csrftoken(self):
        s = requests.Session() # Initialize session
        param = {
            "action": "query",
            "meta": "tokens",
            "type": "login",
            "format": "json"
        }
        r = s.get(url=self.bot_credentials['apiurl'], params=param) # Query login token
        login_token = r.json()['query']['tokens']['logintoken']

        param = {
            "action": "login",
            "lgname": self.bot_credentials["username"],
            "lgpassword": self.bot_credentials["password"],
            "lgtoken": login_token,
            "format": "json"
        }
        r = s.post(self.bot_credentials['apiurl'], data=param) # Login with credentials
        if r.json()['login']['result'] != 'Success':
            print(r.json())
            raise Exception('Logging in to MW api failed for: ' + self.bot_credentials['username'])

        param = {
            "action": "query",
            "meta": "tokens",
            "format": "json"
        }
        r = s.get(url=self.bot_credentials['apiurl'], params=param)
        csrf_token = r.json()['query']['tokens']['csrftoken']
        return csrf_token, s # Get csrf token and keep session alive

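    def post_page(self, title, text, bot=True):
        param = {
            "action": "edit",
            "title": title,
            "text": text,
            "token": self.csrf_token,
            "format": "json",
        }
        if bot: # Boolean API parameters count as true whenever they are present, so only send it when wanted
            param["bot"] = 1
        r = self.s.post(self.bot_credentials['apiurl'], data=param) # Create page with title and text
        result = r.json()
        if result.get('edit', {}).get('result') != 'Success': # Also covers responses that only contain an 'error' key
            print(result)
            raise Exception('Editing/Creating page failed for: ' + title)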

    def post_file(self, file_path, file_name):
        param = {
            "action": "upload",
            "filename": file_name,
            "format": "json",
            "token": self.csrf_token,
            "ignorewarnings": 1
        }
        with open(file_path, 'rb') as f:
            file = {'file': (file_name, f, 'multipart/form-data')}
            r = self.s.post(self.bot_credentials['apiurl'], files=file, data=param) # Upload file to wiki
        try: # Beneath is some error handling as errors might pop up and need some special handling
            result = r.json()
        except ValueError: # The response body is not JSON
            print(r.content)
            print('Bad return: File may be too big')
            raise
        if result.get('upload', {}).get('result') == 'Success':
            return
        code = result.get('error', {}).get('code')
        if code == 'fileexists-no-change':
            print(result['error']['info'])
        elif code == 'verification-error':
            if file_name.endswith('.docx'):
                # Hacky way as some files saved as docx are doc files: rename the file on disk and retry as .doc
                rename(file_path, file_path[:-1])
                self.post_file(file_path[:-1], file_name[:-1])
            else:
                print(result)
                print(file_name)
        elif code == 'empty-file': # Empty files are skipped
            return
        else:
            print(result)
            raise Exception('Uploading file failed for: ' + file_name)