Writing a research paper with ReStructuredText

Why?

As a part of my life as a Ph.D. student, I have to read a large number of scientific papers, and I have seen a couple of them written. What struck me was that these papers have a great deal of internal structure (bibliographic references, references to figures, definitions, adaptation to a particular audience or journal, etc.), yet they are written all at once, usually in a Word document, so that all of that structure lives, and can only be verified, in the writer’s head.

As a programmer, I am used to organizing complex, structured text as well – my source code. However, experience has shown me that relying on what is inside my brain to keep the structure of a project together only works for a couple of lines of code. Beyond that, I have no choice but to rely on compilers, linters, static analysis and IDE tools to help me deal with the logical structure of my program and to prevent me from planting logical bombs and destroying one aspect of my work while I am focusing on another. An even bigger problem is keeping myself motivated while writing a program over several months and learning the new tools I need to do it efficiently.

From that perspective, writing papers in a Word file is very similar to trying to implement high-level abstractions in very low-level code. In fact, it is so low-level that you are not allowed to import from other files (say, to automatically declare common abbreviations in your paper) or to define functions and call them when needed. Basically, the only thing you can rely on is manuscript review – by the writers themselves and by the editors/reviewers of the journals where the paper is submitted. Well, unless they get stuck in an infinite loop.


So I decided to give it a try and write my paper the same way I would write a program: using modules, declarations, imports and compilation – and also throwing in some comments, version control and a way to organize revisions.

Why ReStructuredText?

I loved working with it when I was documenting my Python projects with Sphinx. It supports quite a lot of the operations I was looking for, such as the .. include:: and .. replace:: directives. It is well supported in the PyCharm editor I am using, which lets me fire up my favorite dark theme and go fullscreen to remain distraction-free. It can also be translated with pandoc both to .docx for my non-techy Professor and to LaTeX for my mathy collaborator.
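
As an illustration, a root file can declare shared abbreviations once and pull the individual parts in (a minimal sketch – the file names are made up):

    .. |ER| replace:: endoplasmic reticulum

    .. include:: introduction.rst
    .. include:: methods.rst
    .. include:: results.rst

After that, writing |ER| anywhere in the included files expands to the full term, so the abbreviation is declared in exactly one place.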

It also allowed me to type formulas in raw LaTeX notation quite easily by using the .. math:: directive.
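
For example (the equation itself is only an illustration):

    .. math::

        \mu = \frac{1}{N} \sum_{i=1}^{N} x_i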

How did it go?

Not too bad so far. I had some problems with pandoc’s ability to convert my .rst file tree into .docx, especially when it came to its failing on the .. include:: directive and on citation formatting. (link corresponding issues) There was also an issue with rendering .png images in the .docx format (link issue). In the end, I had to translate the .rst files into HTML with the rst2html tool and then convert the HTML to .docx. For now, I am still trying to see how much of an advantage all this is giving me.

After some writing, I’ve noticed that I am now being saved from dangling references. For instance, at some point I wrote a reference [KimKishnoy2013]_ in my text, and while writing the bibliography I realized the paper actually came out in 2012, so I defined it there as .. [KimKishnoy2012]. The rst compilation engine then threw an error on compilation: Unknown target name: "kimkishnoy2013". Yay, no more dead references! The same thing is true for references defined in the bibliography but never used in the text.
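
In markup terms, the mismatch looked roughly like this (the bibliography entry is paraphrased):

    As was first shown in [KimKishnoy2013]_, ...

    .. [KimKishnoy2012] Kim, Kishnoy. 2012.

The reference in the text and the target in the bibliography no longer match, so the build fails loudly instead of silently shipping a dead reference.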

Now, a shortcoming of this method of writing is that transitions between parts do not come naturally. It can be easily mitigated, once the writer’s block has been overcome by writing all the parts separately, by opening a compiled HTML or .docx document and editing the elements so that they align properly.

An additional difference from the tools that have been developed to review code is that the information density and consistency of programming languages is closer to mathematical notation than to human-readable text, with all the redundancy necessary for proper understanding. A word change or a line change is a big deal in programming; it isn’t so important in writing, so the annotation and diff tools built for code are not very useful here.

On the other hand, this comes back to the fact that human language is still a low-level language. Git won’t be as useful for reviewing it as it is for reviewing source code – much as it isn’t very useful for reviewing compiled binaries.

Over time, two significant problems emerged with this approach. The first is incorporating revisions. Since the other people in the revision pipeline are using MS Word’s built-in review tools, in order to address every single revision I have to find the location in the source file tree where the revision needs to be made and then correct it. Doing it once is pretty straightforward. Doing it hundreds upon hundreds of times, across tens of revisions by different collaborators, is another thing altogether and is pretty tiresome.

The second problem is related to the first. When the revisions are more major and require rewriting and reorganizing entire parts, I need to go and edit the parts one by one, then figure out which part’s contents are going into which new part. That is a lot of side work for not a lot of added value in the end.

What is still missing?

  • Conditional rendering rules. There are some tags that I would want to see when I am rendering my document for proofreading and correction (like part names, my comments, reviewer comments), but not in the final “build”. (A sketch of what this could look like follows this list.)

  • Linters. I would really like to run something like the Hemingway app on my text, as well as some kind of Clonedigger to figure out where I tend to repeat myself over and over again, and to make sure I only repeat myself when I want to. In other terms, automated tools for proofreading the text and figuring out how well it will be understood. It seems that I am not the only one to have had the idea: the Proselint creators implemented exactly that – a linter for prose, in Python. It is something I am quite excited about, even though such tools are still in their infancy because of how liberal natural language is compared to a programming language. We will likely see a lot of improvement in the future with the developments in NLP and machine learning. Here are a couple of other linter functions I could see being really useful:

    • Check that the first sentence of each paragraph describes what the paragraph will be about.
    • Check that all sentences can be parsed (subject-verb-object-qualifiers).
    • Check that there is no abrupt change in the vocabulary used between neighboring sentences.
    • Check for digressions and build a narration tree.
  • Words outside common vocabulary that are undefined. I tend to write for several very different audiences, about topics they are sometimes very knowledgeable about and sometimes not really. The catch is that, since I am the one writing about them, I am knowledgeable about them – to the point that I sometimes fail to realize that some words might need a definition. If I had an app that showed me the rare words I introduce without defining, I could rapidly adapt my paper to a different audience or, as the reviewers like to ask, “unpack a bit”.

  • Chaining with other apps. There are already applications that do a great job of structuring citations and references into a desired format. I need to find a way to pipe the results of the .rst compilation into them, so that they can adapt the citation formatting to be consistent with the publication we are writing for. (One possible route is sketched after this list.)

  • Skeptic module. I am writing scientific papers. Ideally, every assertion in the introduction has to be backed by an appropriate citation, and every paragraph in the methods and results sections has to be backed by data, either in the figures or in the supplementary data. (A hypothetical sketch follows this list.)

  • Proper management of mathematical formulas. They tend to get really broken. LaTeX is the answer, but it would be nice if the renderings of formulas could also be translated into HTML and .docx, which has its own set of mathematical notation (MS Office has always had to do it differently from open-source standards).

  • A way to write from requirements. In software we have something we refer to as unit tests: pre-defined behaviors we want our code to be able to implement. Accumulating unit tests is a tedious process that is nonetheless critical for building a system and for validating that, upon change, our software still behaves the way we expect it to. In writing, we want to transmit a certain set of concepts to our readers, but because human time is so valuable, we regularly fail at that task – especially when we fail to see that the 100th-plus revision has left a concept that the paper refers to undefined. Checking this by re-reading is a little like acting as a computer and executing the software in your head: it works well for small pieces, but edge cases, and your prior knowledge of what the program should do, really get in the way.
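
For the conditional rendering item above, Sphinx (though not plain docutils) already suggests what this could look like with its .. only:: directive and build tags. A minimal sketch, assuming a custom “draft” tag that is switched on with sphinx-build -t draft:

    .. only:: draft

        Note to self: Reviewer 2 wants the controls described earlier;
        revisit this paragraph before resubmission.

The body of the directive only appears in builds where the tag is set, so proofreading annotations can be kept out of the final “build”.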
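
For chaining with citation managers, one route I have not tried yet (so treat it as an assumption on my part) is the sphinxcontrib-bibtex extension, which would let the .rst source cite directly out of a BibTeX file that reference managers already know how to produce:

    The effect was first described by :cite:`KimKishnoy2012`.

    .. bibliography:: refs.bib

Here refs.bib is whatever BibTeX file the citation application maintains; the heavy lifting of formatting would then be done from the .bib data rather than by hand in the text.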
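
As for the skeptic module, nothing like it exists as far as I know, so here is a purely hypothetical sketch of what it could flag; the “skeptic:” comments stand in for the checker’s output, and [AuthorYYYY]_ is a placeholder citation:

    Aggregates accumulate with age [AuthorYYYY]_.

    .. skeptic: OK – the assertion carries a citation reference.

    Aggregates are cleared by autophagy.

    .. skeptic: FLAG – an introduction assertion with no citation.

The same pass could demand that every results paragraph contain at least one reference to a figure or to the supplementary data.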