Developer Log #7 - Importing files and documents into the wiki in a structured way
Topics: MediaWiki API (login, edit page, upload file), Semantic MediaWiki properties.
In this developer log I will be discussing the use case of importing your scattered, unstructured data into your new wiki. Think about files and documents stored on Dropbox, Google Drive or other storage platforms. The structuring of the data can be done in your favorite programming language. We will be using the storage platform's API and the MediaWiki API.
First of all we would like to capture the structure that is already present on the current storage platform. Using the storage platform's API we can describe each file by the names of its parent directories and its own filename. Based on these properties we might already be able to structure some file and document types. We can create an object in our programming language containing the file and the corresponding properties we found. From this object we can create a page referencing the file and storing its properties as Semantic MediaWiki properties, see the example below.
{{Template
|Name=<Name>
|Property1=<Property1>
|Property2=<Property2>
|Filename=<Filename>
}}
Each file or document category can have its own template. This template can define the look and feel in the wiki.
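To make this concrete, here is a minimal sketch in Python, assuming the storage platform hands us a path such as Customers/CustomerName1/Contract.pdf and that the template above is simply called Template; the mapping of the parent directories onto Property1 and Property2 is just an illustration and depends on your own data.

    from pathlib import PurePosixPath

    def build_page_text(remote_path):
        # Split the remote path into its parent directories and the filename
        *directories, filename = PurePosixPath(remote_path).parts
        # Map the directory structure onto the (hypothetical) template parameters
        properties = {
            "Name": PurePosixPath(filename).stem,
            "Property1": directories[0] if len(directories) > 0 else "",
            "Property2": directories[1] if len(directories) > 1 else "",
            "Filename": filename,
        }
        lines = ["{{Template"] + ["|%s=%s" % (k, v) for k, v in properties.items()] + ["}}"]
        return filename, "\n".join(lines)

    filename, page_text = build_page_text("Customers/CustomerName1/Contract.pdf")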
Now we can use the MediaWiki API to create this page and upload the corresponding file. I expect you have bot credentials ready, but otherwise go to <your_url>/Special:BotPasswords and create bot credentials. You might need to enable the setting for bots ($wgEnableBotPasswords). In Python (my favorite programming language) the MediaWiki API class might look like the class at the end of this document. This class authenticates on creation and exposes the functionality to create pages and upload files.
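As a usage sketch, assuming the Api class listed at the end of this document and a credentials dictionary with the keys apiurl, username and password (the values below are placeholders):

    bot_credentials = {
        "apiurl": "https://<your_url>/api.php",  # the wiki's API endpoint (placeholder)
        "username": "ImportBot@import",          # bot username from Special:BotPasswords (placeholder)
        "password": "<bot_password>",            # generated bot password (placeholder)
    }

    api = Api(bot_credentials)  # logs in and fetches a csrf token on creation
    page_text = "{{Template |Name=Contract |Filename=Contract.pdf }}"
    api.post_page("Contract CustomerName1", page_text)         # create the page carrying the properties
    api.post_file("/local/path/Contract.pdf", "Contract.pdf")  # upload the referenced file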
There might also be files or documents which need more properties than can be found using the structure on the old storage platform. These extra properties, and how to find them, depend heavily on the use case and file type, but I will list some examples:
- Properties that can be scraped from the file's content.
- Files containing synonyms/aliases, for example files containing both CustomerSurname1 and CustomerSurnameandLastname1.
- Miscellaneous or non-scrapable files.
Most programming languages provide libraries to scrape the most common file types; I would refer to those languages and tools.
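As an illustration, here is a minimal sketch that scrapes a single property from a plain-text file with a regular expression; the pattern and property name are hypothetical, and for binary formats such as PDF or Word documents you would first extract the text with a dedicated library.

    import re

    def scrape_properties(file_path):
        # Minimal sketch: scrape a customer name from plain-text content.
        # The pattern is a hypothetical example and depends entirely on your documents.
        with open(file_path, encoding="utf-8", errors="ignore") as f:
            content = f.read()
        match = re.search(r"Customer:\s*(\S+)", content)
        return {"Customer": match.group(1)} if match else {}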
Handling synonyms and aliases can be done by creating a 'dictionary': make sure all occurring forms of the same property are documented, so that the scraping tool can recognize them as the same property.
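A minimal sketch of such a dictionary, mapping every documented form to one canonical value (the names are taken from the example above):

    # Every form found in the files maps to the canonical value used in the wiki
    synonyms = {
        "CustomerSurname1": "CustomerSurnameandLastname1",
        "CustomerSurnameandLastname1": "CustomerSurnameandLastname1",
    }

    def canonical(value):
        # Fall back to the scraped value itself when no synonym is documented
        return synonyms.get(value, value)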
There will always be a chance that a few files do not fit a certain type or are missed by your program. We can analyze these files/documents and improve the program to include them in some category, or define them as miscellaneous. Making sure that, in the end, all data is imported by the program ensures that reproducing the import does not require manual actions, and that future files are more likely to be picked up correctly.
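A sketch of that fallback, assuming a list of hypothetical category rules that each return a category name or None:

    def categorize(remote_path, rules):
        # rules is a list of functions that return a category name or None
        for rule in rules:
            category = rule(remote_path)
            if category is not None:
                return category
        return "Miscellaneous"  # nothing matched: import it anyway so no data is lost

    # Example rule: everything under Contracts/ is a contract
    rules = [lambda path: "Contract" if path.startswith("Contracts/") else None]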
import requests  # The Python module to create requests; other languages might use a curl equivalent
from os import rename


class Api:
    def __init__(self, bot_credentials):
        # bot_credentials is a dict with the keys 'apiurl', 'username' and 'password'
        self.bot_credentials = bot_credentials
        # Authenticate on creation and keep the csrf token and session
        self.csrf_token, self.s = self.get_csrftoken()

    def get_csrftoken(self):
        s = requests.Session()  # Initialize session
        param = {"action": "query", "meta": "tokens", "type": "login", "format": "json"}
        r = s.get(url=self.bot_credentials['apiurl'], params=param)  # Query login token
        login_token = r.json()['query']['tokens']['logintoken']
        param = {
            "action": "login",
            "lgname": self.bot_credentials["username"],
            "lgpassword": self.bot_credentials["password"],
            "lgtoken": login_token,
            "format": "json",
        }
        r = s.post(self.bot_credentials['apiurl'], data=param)  # Log in with the bot credentials
        if r.json()['login']['result'] != 'Success':
            print(r.json())
            raise Exception('Logging in to MW api failed for: ' + self.bot_credentials['username'])
        param = {"action": "query", "meta": "tokens", "format": "json"}
        r = s.get(url=self.bot_credentials['apiurl'], params=param)
        csrf_token = r.json()['query']['tokens']['csrftoken']
        return csrf_token, s  # Return the csrf token and keep the session alive

    def post_page(self, title, text, bot=True):
        param = {
            "action": "edit",
            "title": title,
            "text": text,
            "bot": bot,
            "token": self.csrf_token,
            "format": "json",
        }
        r = self.s.post(self.bot_credentials['apiurl'], data=param)  # Create page with title and text
        if r.json()['edit']['result'] != 'Success':
            print(r.json())
            raise Exception('Editing/Creating page failed for: ' + title)

    def post_file(self, file_path, file_name):
        param = {
            "action": "upload",
            "filename": file_name,
            "format": "json",
            "token": self.csrf_token,
            "ignorewarnings": 1,
        }
        with open(file_path, 'rb') as f:
            file = {'file': (file_name, f, 'multipart/form-data')}
            r = self.s.post(self.bot_credentials['apiurl'], files=file, data=param)  # Upload file to wiki
        try:
            # Some error handling, as errors might pop up and need special handling
            response = r.json()
        except Exception:
            print(r.content)
            print('Bad return: file may be too big')
            raise
        if 'upload' in response and response['upload']['result'] == 'Success':
            return
        error_code = response.get('error', {}).get('code')
        if error_code == 'fileexists-no-change':
            print(response['error']['info'])  # The exact same file already exists on the wiki
        elif error_code == 'verification-error':
            if file_name[-4:] == 'docx':
                # Hacky workaround: some files saved as .docx are actually .doc files
                rename(file_path, file_path[:-1])
                self.post_file(file_path[:-1], file_name[:-1])
            else:
                print(response)
                print(file_name)
        elif error_code == 'empty-file':
            return
        else:
            print(response)
            raise Exception('Uploading file failed for: ' + file_name)