Developer Log #7 - Importing files and documents into your wiki in a structured way
Topics: MediaWiki API (Login, Edit page, Upload file), Semantic MediaWiki properties.
In this developer log I will discuss the use case of importing your scattered, unstructured data into your new wiki. Think of files and documents stored on Dropbox, Google Drive or other storage platforms. You can structure this data with your favourite programming language. We will be using the storage platform API and the MediaWiki API.
- Step 1. Translate the structure
First of all, we want to capture the structure that is present on the current storage platform. Using the storage platform API we can describe each file by the names of its parent directories and its own filename. Based on these properties we might already be able to structure some file and document types. We can create an object in our programming language containing the file and the properties found for it. From this object, we can create a page that references the file and stores its properties as Semantic MediaWiki properties, see the example below.
{{Template |Name=<Name> |Property1=<Property1> |Property2=<Property2> |Filename=<Filename> }}
Each file or document category can have its own template. This template can be used to define the look and feel in the wiki.
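As a rough sketch of this step, assume the storage platform API has already handed us a path such as `Acme/Invoices/2020-01.pdf`. The names `FileRecord`, `record_from_path` and `to_wikitext`, and the idea that the first two directory levels map to `Property1` and `Property2`, are made up for illustration; your own mapping will depend on how the old storage is organised.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class FileRecord:
    """One file on the old storage platform plus the properties derived from its path."""
    filename: str
    properties: dict = field(default_factory=dict)


def record_from_path(path, levels=("Property1", "Property2")):
    """Map the parent directories of `path` onto property names (assumed layout)."""
    *directories, filename = PurePosixPath(path).parts
    properties = {"Name": PurePosixPath(filename).stem}
    # Pair each directory level with a property name; extra levels are ignored.
    properties.update(zip(levels, directories))
    return FileRecord(filename=filename, properties=properties)


def to_wikitext(record, template="Template"):
    """Render the record as a template call, one parameter per line."""
    lines = ["{{" + template]
    for key, value in record.properties.items():
        lines.append(f"|{key}={value}")
    lines.append(f"|Filename={record.filename}")
    lines.append("}}")
    return "\n".join(lines)


record = record_from_path("Acme/Invoices/2020-01.pdf")
print(to_wikitext(record))
```

The resulting string can be used as the page text when the page is created in step 2.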
- Step 2. Upload the files and create the pages
Now we use the MediaWiki API to create pages and upload the corresponding files. I expect you have bot credentials ready, but otherwise go to Special:BotPasswords and create bot credentials there. You might need to enable the setting for bots ($wgEnableBotPasswords).
In Python (my favorite programming language) the MediaWiki API class might look like the class at the end of this page. This class authenticates on creation and exposes the functionality to create pages and upload files.
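To show how such a class might be driven, here is a minimal sketch of an import loop. It only assumes an object exposing `post_file(file_path, file_name)` and `post_page(title, text)`, like the class at the end of this page. The function `import_records`, the record layout and the `DryRunApi` stand-in are hypothetical; the stand-in just records calls so the loop can be tried without a wiki.

```python
def import_records(api, records):
    """Upload each file and create its page; return the titles that failed."""
    failed = []
    for record in records:
        try:
            # Upload first so the page never references a missing file.
            api.post_file(record["path"], record["filename"])
            api.post_page(record["title"], record["wikitext"])
        except Exception as error:
            # Keep going and collect failures for a later re-run.
            print(f"Import failed for {record['title']}: {error}")
            failed.append(record["title"])
    return failed


class DryRunApi:
    """Stand-in that records calls instead of contacting a wiki."""

    def __init__(self):
        self.calls = []

    def post_file(self, file_path, file_name):
        self.calls.append(("upload", file_name))

    def post_page(self, title, text):
        self.calls.append(("edit", title))


records = [{
    "path": "/tmp/export/2020-01.pdf",
    "filename": "2020-01.pdf",
    "title": "Invoice 2020-01",
    "wikitext": "{{Template |Name=2020-01 |Filename=2020-01.pdf }}",
}]
api = DryRunApi()
failed = import_records(api, records)
print(api.calls, failed)
```

Swapping `DryRunApi()` for a real, authenticated API object should be the only change needed to run the import for real.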
- Additional tweaks
There might also be files or documents which need more properties than can be found using the structure on the old storage platform. Which additional properties are needed and how they can be found highly depends on the particular use case and file type, but I will list some examples:
- Properties that can be scraped from the content of the files.
- Files containing synonyms/aliases, such as files containing CustomerSurname1 and CustomerSurnameandLastname1.
- Miscellaneous / non-scrapable files.
Most programming languages offer tools that let you scrape most file types, so I refer you to those languages and tools. For example, Python's BeautifulSoup4 can be an excellent tool for HTML and XML files, Pandas for Excel-type files, and so on.
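As an illustration of scraping properties from file content, the sketch below pulls the title and the `<meta>` tags out of an HTML document. To stay dependency-free it uses Python's standard `html.parser` module instead of BeautifulSoup4, which would do the same job in fewer lines. The class name `MetaScraper` and the sample document are made up.

```python
from html.parser import HTMLParser


class MetaScraper(HTMLParser):
    """Collect the <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            if "name" in attributes and "content" in attributes:
                self.properties[attributes["name"]] = attributes["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so append rather than overwrite.
        if self._in_title:
            self.properties["title"] = self.properties.get("title", "") + data


sample_html = """<html><head><title>Offer 42</title>
<meta name="author" content="J. Doe">
<meta name="customer" content="Acme">
</head><body><p>...</p></body></html>"""

scraper = MetaScraper()
scraper.feed(sample_html)
print(scraper.properties)
```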
Want to use synonyms and aliases? These can be handled by creating a ‘dictionary’. Make sure all forms in which the same property occurs are documented in such a way that the scrape tool can recognize them as the same property.
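One possible shape for such a dictionary is sketched below. The canonical name, the alias spellings and the helper names (`CANONICAL`, `normalize`) are invented for this example; the point is only that every spelling is reduced to one lookup key before the property is stored.

```python
# Hypothetical alias table: canonical property name -> spellings seen in the files.
CANONICAL = {
    "CustomerSurname": ["CustomerSurname", "CustomerSurnameandLastname", "Customer surname"],
}


def _key(label):
    """Reduce a label to lowercase letters and digits so case and spacing do not matter."""
    return "".join(ch for ch in label.lower() if ch.isalnum())


# Inverted table: normalized alias -> canonical property name.
LOOKUP = {_key(alias): name for name, aliases in CANONICAL.items() for alias in aliases}


def normalize(scraped):
    """Rename known aliases to their canonical property; keep unknown labels as they are."""
    return {LOOKUP.get(_key(label), label): value for label, value in scraped.items()}


print(normalize({"customer_surname": "Jansen", "Date": "2020-01-31"}))
```

Labels that are not in the table pass through unchanged, which makes it easy to spot aliases that still need to be added.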
There is always a chance that a few files do not fit any type or are overlooked by your program. We can analyze these files and documents and improve the program so that it either assigns them to a category or marks them as miscellaneous. Having the program import all data in the end means that repeating the import requires no manual actions, and that future files are more likely to be picked up correctly.
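A simple way to keep track of such leftovers is to give every file a category and let anything unmatched fall into a miscellaneous bucket. The sketch below assumes categories can be decided by file extension alone; the `RULES` table and the category names are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical rules: file extension -> wiki category.
RULES = {".pdf": "Document", ".docx": "Document", ".xlsx": "Spreadsheet", ".png": "Image"}


def categorize(paths, rules=RULES):
    """Group paths by category; anything without a matching rule lands in 'Miscellaneous'."""
    groups = defaultdict(list)
    for path in paths:
        suffix = PurePosixPath(path).suffix.lower()
        groups[rules.get(suffix, "Miscellaneous")].append(path)
    return dict(groups)


groups = categorize(["a/offer.PDF", "a/prices.xlsx", "b/notes.txt", "b/README"])
for category, members in groups.items():
    print(category, len(members))
```

Reviewing the `Miscellaneous` group after each run shows which rules are still missing.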
Example code
import requests  # The Python module to create requests, other languages might use a curl equivalent
from os import rename


class Api:
    def __init__(self, bot_credentials):
        # bot_credentials is a dict with the keys 'apiurl', 'username' and 'password'.
        # The original default was Config.bot, taken from a project-specific config module.
        self.bot_credentials = bot_credentials  # Define credentials for api user
        self.csrf_token, self.s = self.get_csrftoken()  # Authenticate and get csrf token and session

    def get_csrftoken(self):
        s = requests.Session()  # Initialize session
        param = {
            "action": "query",
            "meta": "tokens",
            "type": "login",
            "format": "json"
        }
        r = s.get(url=self.bot_credentials['apiurl'], params=param)  # Query login token
        login_token = r.json()['query']['tokens']['logintoken']
        param = {
            "action": "login",
            "lgname": self.bot_credentials["username"],
            "lgpassword": self.bot_credentials["password"],
            "lgtoken": login_token,
            "format": "json"
        }
        r = s.post(self.bot_credentials['apiurl'], data=param)  # Login with credentials
        if r.json()['login']['result'] != 'Success':
            print(r.json())
            raise Exception('Logging in to MW api failed for: ' + self.bot_credentials['username'])
        param = {
            "action": "query",
            "meta": "tokens",
            "format": "json"
        }
        r = s.get(url=self.bot_credentials['apiurl'], params=param)
        csrf_token = r.json()['query']['tokens']['csrftoken']
        return csrf_token, s  # Get csrf token and keep session alive

    def post_page(self, title, text, bot=True):
        param = {
            "action": "edit",
            "title": title,
            "text": text,
            "token": self.csrf_token,
            "format": "json",
        }
        if bot:
            # The API treats a boolean parameter as true whenever it is present,
            # so only send the flag when it is wanted.
            param["bot"] = 1
        r = self.s.post(self.bot_credentials['apiurl'], data=param)  # Create page with title and text
        result = r.json()
        if result.get('edit', {}).get('result') != 'Success':
            print(result)
            raise Exception('Editing/Creating page failed for: ' + title)

    def post_file(self, file_path, file_name):
        param = {
            "action": "upload",
            "filename": file_name,
            "format": "json",
            "token": self.csrf_token,
            "ignorewarnings": 1
        }
        with open(file_path, 'rb') as f:
            file = {'file': (file_name, f, 'multipart/form-data')}
            r = self.s.post(self.bot_credentials['apiurl'], files=file, data=param)  # Upload file to wiki
        try:  # Beneath is some error handling as errors might pop up and need some special handling
            result = r.json()
        except Exception:
            print(r.content)
            print('Bad return: File may be too big')
            raise
        if result.get('upload', {}).get('result') == 'Success':
            return
        code = result.get('error', {}).get('code')
        if code == 'fileexists-no-change':
            print(result['error']['info'])
        elif code == 'verification-error' and file_name.endswith('docx'):
            # Hacky way as some files saved as docx are doc files:
            # rename the file on disk and retry with the .doc name
            rename(file_path, file_path[:-1])
            self.post_file(file_path[:-1], file_name[:-1])
        elif code == 'verification-error':
            print(result)
            print(file_name)
        elif code == 'empty-file':
            return
        else:
            print(result)
            raise Exception('Uploading file failed for: ' + file_name)