Jupyter/IPython notebooks

After writing it down a couple of weeks ago for Hacker News, here is the recap and some updates:

I am a computational biologist with a heavy emphasis on data analysis. I tried Jupyter a couple of years ago, and here are my concerns with it compared to my usual flow (PyCharm + pure Python + pickle to store the results of heavy processing).

  1. Extracting functions is harder
  2. Your git commits become completely borked
  3. Opening some data-heavy notebooks is nigh impossible once they have been shut down
  4. Importing other modules you have locally is pretty non-trivial.
  5. Refactoring is pretty hard
  6. Sphinx for autodoc extraction is pretty much out of the picture
  7. Non-deterministic re-runs – depending on the cell
    execution order you can get very different results. That’s an issue
    when you come back to your code a couple of months later and
    try to figure out what you did to get there.
  8. Connecting to the IPython notebook, even from environments like PyCharm, is highly non-trivial, just like the mapping to the OS
    filesystem
  9. Hard to impossible to inspect the contents of the IPython notebook when it’s hosted on GitHub due to the encoding snafus

There are likely work-arounds for most of these problems, but the issue is that with my standard workflow they are non-issues to start with.
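As an illustration of the "pure Python + pickle" half of that workflow, here is a minimal sketch of caching the result of a heavy processing step on disk. The helper name `cached`, the stand-in function `heavy_processing` and the file name in the usage note are hypothetical, chosen only for the example.

```python
import pickle
from pathlib import Path


def cached(path, compute):
    """Return the pickled result stored at `path`; compute and store it if absent."""
    path = Path(path)
    if path.exists():
        # The heavy step already ran once: reload its result from disk.
        with path.open("rb") as handle:
            return pickle.load(handle)
    result = compute()
    with path.open("wb") as handle:
        pickle.dump(result, handle)
    return result


def heavy_processing():
    # Stand-in for an expensive analysis step (hypothetical).
    return {"mean": sum(range(1000)) / 1000}
```

Calling `cached("heavy_result.pkl", heavy_processing)` runs the computation the first time and only reloads the file afterwards; deleting the file forces a recomputation. Pickle files should only be loaded when they come from a source you trust.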

In my experience, Jupyter is pretty good if you rely only on existing libraries that you are piecing together, but once you need to do more involved development work, you are screwed.

How to upgrade MediaWiki – approximate 2018 guide

Unfortunately, unlike WordPress, MediaWiki doesn’t come with a single-button update. Perhaps because of that, perhaps because of my laziness, I have been postponing the updates of my MediaWiki websites for over five years by now. However, in the light of recent vulnerability revelations, I have finally decided to upgrade my installations and started trying to figure out what exactly I needed, given that I only have web interfaces and FTP access to the website I manage.

First of all, this link gives a good overview of the whole process. In my specific case, I was upgrading to 1.30, which required a number of edits to the config file, explained here. Now, what seemed to be needed was that, after backing up my database (done for me by my hosting provider) and files (which I could do by FTP), I just had to take the files from the latest release version (REL1_30 in my case – DO NOT DO IT, see edit below), copy them to the directories via FTP and then run the database update script at wiki.mywebsite.org/mw-config/. Seems pretty easy, right?

Nope, not so fast! The problem is that this distribution does not contain a crucial directory that you need to run the installation and without which your wiki installation will fail with a 500 code without leaving anything in the error logs of the server.

This step isn’t really mentioned in the installation guide, but you actually need to remove the existing /vendor folder in your installation over FTP, get the latest version for your build with a git clone https://gerrit.wikimedia.org/r/p/mediawiki/vendor.git into a /vendor folder on your machine and then upload the files to your server.

Only after that step can you connect to /mw-config/ and finish upgrading the wiki.
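For the upload half of that step, here is a minimal Python sketch using the standard `ftplib` module. It assumes the vendor repository has already been cloned locally and the old /vendor folder already removed from the server. The function names are hypothetical, and the host, credentials and remote path in the usage note are placeholders.

```python
import ftplib
import os
import posixpath


def plan_upload(local_root, remote_root):
    """List the remote directories to create and the (local, remote) file pairs."""
    dirs, files = [], []
    # os.walk is top-down, so parent directories are listed before their children.
    for current, _subdirs, filenames in os.walk(local_root):
        rel = os.path.relpath(current, local_root)
        if rel == ".":
            remote_dir = remote_root
        else:
            remote_dir = posixpath.join(remote_root, *rel.split(os.sep))
        dirs.append(remote_dir)
        for name in sorted(filenames):
            files.append((os.path.join(current, name),
                          posixpath.join(remote_dir, name)))
    return dirs, files


def upload_tree(ftp, local_root, remote_root):
    """Upload local_root to remote_root over an already logged-in ftplib.FTP."""
    dirs, files = plan_upload(local_root, remote_root)
    for remote_dir in dirs:
        try:
            ftp.mkd(remote_dir)
        except ftplib.error_perm:
            pass  # most likely the directory already exists
    for local_path, remote_path in files:
        with open(local_path, "rb") as handle:
            ftp.storbinary("STOR " + remote_path, handle)
```

Usage would look like `ftp = ftplib.FTP("ftp.example.org")`, `ftp.login("user", "password")`, then `upload_tree(ftp, "vendor", "/wiki/vendor")`. Printing the result of `plan_upload` first is a cheap way to check the remote paths before touching the server.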

So yeah, let’s hope that in the not-so-distant future MediaWiki will have the same handy ‘update now’ button as WordPress. Because something is telling me that there are A LOT of outdated MediaWiki installs out there…

Edit:

After spending a couple additional hours dealing with additional issues: do not use the “core” build, but instead download the complete one, including all the skins, extensions and vendor files from here.

Recommendation engine lock-in

YouTube’s recommendation engine, at least in my experience, has three modes:
– Suggest the channels whose content I’ve already watched
– Suggest that I re-watch content I’ve already watched
– Suggest new updates to the playlists from which I’ve already watched several videos

Unfortunately, while it works very well when I’ve just discovered a couple of new channels and have their content chosen and pushed to me, it fails to deliver the experience of discovery – it overfits my recent preferences, locking me into videos similar to what I have already watched instead of suggesting new content and new types of content I might be interested in. I also experience the same problem with Quora’s recommendation engine (a couple of upvotes and my feed is almost exclusively army weapon tech).

I feel like the recommendation engine creators should abandon their blind faith in general algorithms and try to figure out how to create feeds that are interesting and engaging with respect to several categories of interest of their user, as well as covering the several reasons I might be seeking a recommendation of what to watch (what everyone else is watching, so that I have something to discuss with my friends; discovering something new; following up on topics I am already interested in, …)
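One simple way to picture a feed that covers several reasons at once is to interleave one ranked list per reason. The sketch below is only an illustration of that idea, not how YouTube or Quora work; the function name, the reason labels and the item names are all hypothetical.

```python
from itertools import zip_longest


def diversified_feed(ranked_by_reason, limit):
    """Interleave per-reason ranked lists so that no single reason dominates."""
    if limit <= 0:
        return []
    feed, seen = [], set()
    # Take the best remaining item of every reason in turn (round-robin).
    for round_items in zip_longest(*ranked_by_reason.values()):
        for item in round_items:
            if item is not None and item not in seen:
                seen.add(item)
                feed.append(item)
                if len(feed) == limit:
                    return feed
    return feed
```

With `{"popular": ["a", "b"], "new": ["c", "a", "d"], "follow_up": ["e"]}` the first slots go to one item from each reason before any reason gets a second slot, and an item recommended for two reasons appears only once.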

Synergy from the boot on Ubuntu

This one seemed to be quite trivial per the official blog, but the whole pipeline gets a bit more complicated once SSL enters the game. Here is how I made it work with synergy and Ubuntu 14.04.

  • Configure the server and the client with the GUI application
  • Make sure the SSL server certificate fingerprint is stored in ~/.synergy/SSL/Fingerprints/TrustedServers.txt
  • Run sudo -su myself /usr/bin/synergyc -f --enable-crypto my.server.ip.address
  • After that check everything was working with sudo /usr/bin/synergyc -d DEBUG2 -f --enable-crypto my.server.ip.address
  • Finally add the greeter-setup-script=sudo /usr/bin/synergyc --enable-crypto my.server.ip.address line into the /etc/lightdm/lightdm.conf file under the [SeatDefaults] section
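The last step can be sketched in Python with the standard `configparser` module, working on the text of the file rather than on /etc/lightdm/lightdm.conf itself. The function name is hypothetical, and note that `configparser` drops comments when it rewrites the file, so keep a backup of the original.

```python
import configparser
import io


def add_greeter_script(conf_text, command):
    """Return lightdm.conf text with greeter-setup-script set under [SeatDefaults]."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(conf_text)
    if not parser.has_section("SeatDefaults"):
        parser.add_section("SeatDefaults")
    parser.set("SeatDefaults", "greeter-setup-script", command)
    out = io.StringIO()
    # Write key=value without spaces, matching the form used in the step above.
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()
```

The command passed in would be the same `sudo /usr/bin/synergyc --enable-crypto my.server.ip.address` string as in the list, and the returned text would then be written back to the file with root rights.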

Why shouldn’t you do it?

Despite the convenience, there seemed to be a bit of interference with keyboard commands and their interpretation on my side, so since my two computers are side by side and since I have a USB button switch from before I got synergy, I’ve decided to manually start synergy every time I log in.

Linux server security

DISCLAIMER: I AM NOT AN INFOSEC EXPERT. THIS ARTICLE IS MORE OF A MEMO FOR MYSELF. IF YOU LOSE DATA OR HAVE A BREACH, I BEAR NO RESPONSIBILITY FOR IT.

Now, because of all the occasions on which I had to act as a makeshift sysadmin, I did end up reading a number of policies and picked up some advice I wanted to group in a single place, if only for my own memory.

Installation:

  • Use an SELinux-enabled distro
  • Use an intrusion prevention tool, such as Fail2Ban
  • Configure primary and secondary DNS
  • Switch away from the password-protected SSH to a key-based SSH log-in. Disable root login altogether (/etc/ssh/sshd_config, PermitRootLogin no). Here is an Ubuntu/OpenSSH guide.
  • Remove network super-service packages
  • Disable Telnet and FTP (SFTP should be used)
  • use chroot where available, notably for webservers and FTP servers
  • encrypt the filesystem
  • disable remote root login
  • disable sudo su – all the root actions need to be done with a sudo
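To double-check the SSH items of that list, here is a small Python sketch that reads the text of an sshd_config file and reports which of two options are not explicitly set to "no". It is a rough check under simplifying assumptions: it ignores Match blocks and Include directives and expects keyword and value to be separated by whitespace. `PasswordAuthentication no` is my reading of "switch away from password-protected SSH"; the function names are hypothetical.

```python
def effective_sshd_options(config_text):
    """Map lower-cased keywords to their first value (sshd uses the first one it reads)."""
    options = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts
        options.setdefault(key.lower(), value.strip())
    return options


def audit(config_text):
    """Return the keywords that are not explicitly set to the expected value."""
    expected = {"permitrootlogin": "no", "passwordauthentication": "no"}
    options = effective_sshd_options(config_text)
    return [name for name, wanted in expected.items()
            if options.get(name, "").lower() != wanted]
```

An empty list from `audit(open("/etc/ssh/sshd_config").read())` means both options are set as expected; anything else is worth a look before restarting sshd.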

Audit:

  • Once the server has been built, run Lynis. It will audit your system and suggest additional steps to protect your machine
  • Force multi-factor authentication for the roots, especially via SSH. Here is a tutorial from Digital Ocean.

Watching the logs:

If you have more than one logging system to watch:

Configuring PyCharm for remote development

I do most of my programming from my Windows laptop and/or desktop computer. However, in order to be able to develop anything sane, I need to operate fully in Linux. I used to have to dual-boot or even to have two machines, but now that I have access to a stable server I can safely ssh into, I would rather just use my IDE to develop directly on it. Luckily for me, PyCharm has an option for it.

Doing this is pretty straightforward and well explained on the PyCharm blog, with the docs explaining how to configure a remote server that is not a Vagrant box.

There are three steps in the configuration:

  • setting up the deployment server and auto-update
  • setting up the remote interpreter
  • setting up the run configuration

Setting up the deployment server:

Tools | Deployment | Configuration > configure your sftp server, go ahead and perform the root autodetection (usually /home/uname) and uncheck the “available only for this project” option. You will need that last option in order to configure the remote interpreter. Then go into the mappings and perform the equivalence mappings for the project, but be aware that the home from the previous screen, if filled in, will be prepended to any path you try to map to on the remote server. So if you want your project to go to /home/uname/PycharmProjects/my_project and your root is /home/uname/, the path you are mapping to needs to be /PycharmProjects/my_project.
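The root-prepending rule is easy to get wrong, so here is a tiny Python sketch of the arithmetic. It only models the behaviour described above with the standard `posixpath` module; the function names are hypothetical and nothing here talks to PyCharm.

```python
import posixpath


def deployment_path(root, target):
    """Path to enter in the mapping so that root + mapping equals target."""
    root = posixpath.normpath(root)
    target = posixpath.normpath(target)
    if posixpath.commonpath([root, target]) != root:
        raise ValueError("target is not inside the deployment root")
    rel = posixpath.relpath(target, root)
    return "/" if rel == "." else "/" + rel


def resolved_path(root, mapping):
    """Remote path obtained once the root is prepended to the mapping."""
    return posixpath.normpath(root.rstrip("/") + "/" + mapping.lstrip("/"))
```

`deployment_path("/home/uname/", "/home/uname/PycharmProjects/my_project")` gives the `/PycharmProjects/my_project` value from the example above, while `resolved_path` shows the pitfall: entering the full path as the mapping ends up under `/home/uname/home/uname/...`.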

Now, head to Tools | Deployment and click the automatic upload, so that every edit you make on your machine is constantly uploaded to the remote server.

Setting up the remote interpreter:

Head to File | Settings | Project | Interpreter, click on the cogwheel and click on add remote. At that point, by default, PyCharm will fill in the properties from the “deployment configuration”. In my case I needed to tweak the Python interpreter path a bit, since I use Anaconda Python (scientific computing). If like me you use Anaconda2 and store it in your home directory, you will need to replace the interpreter path with /home/uname/anaconda/bin/python. At that point, just click save and you are good for this part.

Setting up the run configuration:

With the previous two steps finished, go into Run | Edit configuration, add the main running script to the Script field, check that the Python interpreter is configured to be the remote one and then click on the three small dots next to the “path mappings” field and fill it out, at least with the location of the script on your machine mapped to its location on the remote.
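What a path mapping does can be sketched in a few lines of Python: a file under the local project root is translated to the same relative location under the remote root. This is only an illustration with the standard `pathlib` module; the function name is hypothetical and the Windows path in the usage note is a placeholder.

```python
from pathlib import PurePosixPath, PureWindowsPath


def to_remote(local_file, local_root, remote_root):
    """Translate a Windows path under local_root into the matching remote POSIX path."""
    # relative_to raises ValueError if local_file is not under local_root.
    relative = PureWindowsPath(local_file).relative_to(PureWindowsPath(local_root))
    return str(PurePosixPath(remote_root).joinpath(*relative.parts))
```

For instance, with a local root of `C:\Users\me\PycharmProjects\my_project` and a remote root of `/home/uname/PycharmProjects/my_project`, the script `main.py` in the local root maps to `/home/uname/PycharmProjects/my_project/main.py`.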

That’s it, you are good to go!

Health Data interpretation

I used to like to use Tactio Health App back in the day, before the introduction of the Apple Health Kit.

However, after getting a more modern iPhone and installing the app on it, I realized that despite the fact that Tactio Health was reading tons of data from the Health app, it was only writing weight to it. So all of my details related to blood pressure measurements, blood analyses and so on were locked inside the app, and it had no intention of sharing them.

Scanning the App Store for apps that would cover that angle actually led me to a realization – there are tons of copycat apps with slightly different flavors covering four major directions: workout tracking/guidance, weight loss/gain, period tracking, and baby-related apps.

All in all, there are no lifestyle tracking apps to keep an eye on your habits and warn when you are getting into a lifestyle that would lead to dire health consequences. And there is even less collaboration between apps that try to do it – and Tactio Health is a case in point.

More interestingly, it looks like there is no market right now for that kind of app – either the users are already bent on keeping their health intact and don’t need any reminders, or they are so hopelessly behind that the “you are too bad” tone of the current apps is way too discouraging.

At the same time, I can understand the reticence of the users to put their health data out there, in the wild, while knowing that potentially this data can be used to deny them coverage in the future or drive their premiums up.

Food/activity tracking apps

I am back to trying to get an insight on quantifying my life and am running into the same problem that I used to always experience with the activity/food trackers in the past. They are simply not made to encourage people to change and maintain changes. Just a couple of problems to start with:

  • The activity tracking suggests at least ~150 minutes of cardio per week. If a new user is just starting out and trying to switch from a sedentary lifestyle to an active one, this will be deadly to them – the most they can carry out is 60 minutes of cardio for the first month and a half. Trying to get to 150 minutes is a guaranteed recipe for failing to adhere for more than the first week or so, either because of a lack of will or because they will hurt themselves by trying to ramp up too fast. A better way of doing it would be to take ~2 weeks of monitoring upon each uninterrupted session, then suggest a ramp-up that would gradually improve the habits of the user in a way that would stick in the long run.
  • In my own experience, the reason a lot of people end up in pretty bad shape is not necessarily that they don’t know any better, but that they don’t have the time because of their work and other occupations, which constantly make self-care slide to the end of their list of priorities. A lot of activity/food tracking solutions require a lot of active input from the user and, because of that, tend to have a low adherence rate, especially in the long term. A much better option would be to perform long-run monitoring that requires almost none.
  • Specifically for the food trackers – the lack of a unified repository of products and of the ability to enter a fraction of the amount consumed. In some of them I was able to find teas that contain cholesterol (WTF?), but wasn’t able to see what was in them unless I reviewed the labels.
  • And as per usual, the current state of the trackers is deplorable when it comes to measuring anything outside the calories. A lot of “healthy” foods are healthy not so much because they contain fewer calories, but because they contain a lot of micro-elements and vitamins that make them cover and prevent cravings in the long run.
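The gradual ramp-up from the first point can be sketched as a weekly schedule. This is only an illustration: the 10% weekly increase is my own assumption, not a medical recommendation, and the 60 and 150 minute figures from that point are both treated as weekly totals. The function name is hypothetical.

```python
def ramp_up_schedule(baseline, goal, weekly_increase=0.10):
    """Weekly cardio targets in minutes, growing from baseline to goal."""
    if baseline <= 0 or goal < baseline or weekly_increase <= 0:
        raise ValueError("need 0 < baseline <= goal and a positive weekly increase")
    schedule = [baseline]
    while schedule[-1] < goal:
        grown = round(schedule[-1] * (1 + weekly_increase))
        # Always move up by at least one minute, and never overshoot the goal.
        schedule.append(min(goal, max(schedule[-1] + 1, grown)))
    return schedule
```

`ramp_up_schedule(60, 150)` starts at the monitored baseline and only reaches the 150-minute target after a couple of months, which is the kind of slope a tracker could suggest instead of the full target on day one.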

Bonus point: the Apple Health app unifying different apps. That doesn’t seem like much, but it definitely stitches all the apps together into one, making sure the information flows inside the Health app ecosystem, allowing me to log an activity once, as opposed to 3-4 times before that, and still benefit from the best of all the apps without having to deal with the worst.

Sleep monitors and internet of things

I do think that sleep monitors should not require the user to actively switch them on every night. Instead, they should be something that runs in the background – like the GPS or pedometer in your phone for walking distance monitoring.

Hence I see a tool that would have the following two functions:

  • movement detection for the quality of sleep computation
  • light detection, in order to figure out when you are sleeping or could be potentially sleeping
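A toy Python sketch of how those two signals could be combined: light decides when sleep is possible, movement decides how restful that period was. The thresholds, units and function names are all hypothetical, and a real device would need calibration and smoothing over time.

```python
def likely_asleep(samples, max_light=5.0, max_movement=0.05):
    """Flag each (light, movement) sample as asleep when it is dark and still."""
    return [light <= max_light and movement <= max_movement
            for light, movement in samples]


def sleep_quality(samples, max_light=5.0, max_movement=0.05):
    """Fraction of dark samples that were also still; None if it was never dark."""
    dark = [movement for light, movement in samples if light <= max_light]
    if not dark:
        return None
    return sum(m <= max_movement for m in dark) / len(dark)
```

Each sample is a `(light, movement)` pair taken at a regular interval, so nothing has to be switched on by the user: the dark periods are found after the fact.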

Usability of adhesion systems

Catch-22 with a pretty large health insurance website:

  • you need to give us the first payment to get your card
  • in order to perform a payment, you need to log in
  • to log in, you first need to register
  • to register, you need your adhesion number
  • to get your adhesion number, you first need your card

Best part? When I tried calling, I had to wait ~1 hour to get connected to the right person, with every telephone tree branch telling me that I needed to go to the website to do everything I needed. In addition to that, after waiting all that time, I was told I needed to wait until the invoice was generated.

Moral:

  1. Make sure you solicit the user’s action only when your system is ready for it and when that action is likely to succeed.
  2. Make your user create an account that would be recognized from the get-go, even if it would mean that there will be nothing shown on their account.
  3. Have a collection point where the reports of your “happy system” malfunctions would go.
  4. Register failures to properly use the interface and progressively build a database of corner cases and edit your system fall-backs to account for them.
  5. Always test for usability to check that there are no catch-22s that will waste your tech support time.
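The catch-22 described above is a cycle in the graph of "step X requires step Y", and that kind of loop can be checked for automatically. Here is a minimal Python sketch using a depth-first search; the function name and the step labels are hypothetical, and the graph in the usage note is just my reading of the steps listed earlier.

```python
def find_cycle(requires):
    """Return one dependency cycle as a list of steps, or None if there is none."""
    visiting, done = [], set()

    def visit(step):
        if step in done:
            return None
        if step in visiting:
            # We came back to a step that is still being resolved: a catch-22.
            return visiting[visiting.index(step):] + [step]
        visiting.append(step)
        for needed in requires.get(step, ()):
            cycle = visit(needed)
            if cycle:
                return cycle
        visiting.pop()
        done.add(step)
        return None

    for step in requires:
        cycle = visit(step)
        if cycle:
            return cycle
    return None
```

Feeding it `{"card": ["first payment"], "first payment": ["log-in"], "log-in": ["registration"], "registration": ["adhesion number"], "adhesion number": ["card"]}` returns the full loop, starting and ending at the card.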

Bonus points for the website – there is a paper invoice I hold in my hands, but the website shows that no invoice that I could pay was generated. Final bonus point – COMIC SANS. On the main USER-facing GUI page. Overriding other “sane” types.