Research and Stuff

• academia • cities • machine-learning • open-data • fairness • evolution • philosophy

Aug 2, 2021
Notes on text mining human evolution
This is a follow-up to the post on trends in human evolution research, outlining how the analysis was done. This is not an elegant solution, I apologize. I wanted a quick prototype, so I used a zoo of tools. I’m sure most of the steps could be done more elegantly and efficiently, but my solution worked for the purpose (exploration over a weekend).
Jul 30, 2021
Trends in human evolution research
Tegan Foister asked me one day if we could analyze human evolution via text mining. As a test case I did topic modeling and sentiment analysis on research articles published in the Journal of Human Evolution over the last 50 years. The analysis covered 4014 articles.
Jun 4, 2020
Scaling of tooth wear rates
We get rejections all the time. I’ve had perhaps hundreds of them over the years: jobs, funding applications, research papers. But this one somehow caught me off guard. This was a small-scale study prepared over Christmas break and sent to a local journal. And it went unheard, hopelessly.
May 13, 2020
Academic fiction
Long days of quarantine. Here is a piece of fiction on research and data.
Apr 30, 2020
Can data speak for themselves?
I wrote an opinion piece about objectivity in evolutionary sciences, palaeontology and data science in general. It appears as a blog post in the Nature Ecology and Evolution Community.
Apr 29, 2020
The nature of time information in the fossil record
I wrote a beginner’s guide to time information in the fossil record. I used my tablecloths as illustrations. It begins like this:
Sep 29, 2019
There will be no single AI culture
Artificial intelligence is a part of our culture. Technology is our culture, as is our ability to externalise thoughts and experiences in myths, in laws, in atlases or poetry. Be it for mundane tasks from managing our calendars to our street navigation, shopping lists or information retrieval, AI already externalises our brain, just like learning to control fire externalised our digestion [1].
Sep 14, 2019
Happy birthday Alexander von Humboldt
I wrote a blog post in Nature Ecology and Evolution on the occasion of Alexander von Humboldt’s 250th birthday – Greetings from now to then.
Dec 11, 2018
My address to PhD students
Recently I was invited to speak at a PhD student dinner. When the time came to speak, I improvised, but here is the script that I wrote for myself in preparation earlier.
Nov 30, 2017
Do species age?
Mikael Fortelius and I wrote a blog post in Nature Ecology and Evolution to accompany our recent paper on The Red Queen. It starts like this.
May 6, 2016
Normalizing R2 for using with cross-validation
The coefficient of determination, denoted as $R^2$, is commonly used in evaluating the performance of predictive models, particularly in life sciences. It indicates what proportion of variance in the target variable is explained by model predictions.
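The definition can be sketched in a few lines. This is a minimal illustration of the standard formula $R^2 = 1 - SS_{res}/SS_{tot}$, not the normalized variant discussed in the post; the function name `r_squared` and the toy data are assumptions made for the example.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: what the predictions fail to explain.
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Total sum of squares: variance of the target around its own mean.
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


y = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y, y))                     # perfect predictions
print(r_squared(y, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean
print(r_squared(y, [4.0, 3.0, 2.0, 1.0]))  # worse than predicting the mean
```

Perfect predictions give 1, predicting the mean gives 0, and predictions that are worse than the mean give a negative value, which is what can happen on held-out data.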
Nov 3, 2015
A survey on measuring indirect discrimination in machine learning
A first version of A survey on measuring indirect discrimination in machine learning is now posted to arXiv.
Oct 31, 2015
Oversampling paradox
A recent study in criminology points out a major flaw in the interpretation of previous studies, which suggested that 50% of offenders released from state prisons return to prison within 3 to 5 years. The study points out that 2/3 of the prisoners actually never return. The figures are overstated due to the sampling procedure. If a survey targets prisoners currently serving their sentence, it is biased towards returning prisoners, and not representative of all prisoners.
Oct 28, 2015
Measuring similarity between sets in paleontological analysis
Symmetric similarity measures
Jul 22, 2015
Omitted variable bias and discrimination
Omitted variable bias occurs when a regression model is fit leaving out an important causal variable.
Jun 1, 2015
Why cars in the next lane seem to go faster
Because more time is spent being overtaken by other vehicles than is spent in overtaking them, that’s what a Nature paper points out.
May 30, 2015
Publishing in computer science vs. natural sciences
Publishing culture in computer science is quite different from natural sciences. Here are my impressions as an author and a reviewer in both, and as an editor in computer science (data mining and machine learning). In CS I’m mostly familiar with publishing related to data analysis, in NS I’m familiar with forestry, atmospheric sciences, palaeoecology, and medicine domains. Counterexamples can always be found, but here are general trends that I have observed.
May 28, 2015
Apartment prices in Helsinki relate to accessibility by public transport
Traditionally, apartment prices are considered to relate to the apartment characteristics and its location. We had a hypothesis that the accessibility of a neighbourhood is perhaps even more important than its location. So we did a pilot study in the Helsinki region to check that.
May 23, 2015
Digital redlining
Back in the day, American banks used to draw maps with red lines around the areas in which they did not want to lend. Hence the term redlining.
May 14, 2015
Berkson's paradox
Berkson’s paradox describes a situation in which two independent events become negatively correlated once we only look at cases where at least one of them occurs. Here is an illustration.
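The effect can be checked with exact probabilities. This is a minimal sketch, assuming that "selection" means keeping only the cases where at least one of two independent events A and B occurred; the function name `berkson` and the probabilities of 1/10 are made up for the example.

```python
from fractions import Fraction
from itertools import product


def berkson(p_a, p_b):
    """Return P(A | selected) and P(A | B, selected).

    A and B are independent; 'selected' means A or B (or both) occurred.
    """
    # Joint distribution of two independent events.
    joint = {}
    for a, b in product([True, False], repeat=2):
        joint[(a, b)] = (p_a if a else 1 - p_a) * (p_b if b else 1 - p_b)
    # Keep only the outcomes where at least one event occurred.
    selected = {k: v for k, v in joint.items() if k[0] or k[1]}
    total = sum(selected.values())
    p_a_given_sel = sum(v for (a, _), v in selected.items() if a) / total
    p_b_and_sel = sum(v for (_, b), v in selected.items() if b)
    p_a_given_b_sel = selected[(True, True)] / p_b_and_sel
    return p_a_given_sel, p_a_given_b_sel


p_sel, p_sel_b = berkson(Fraction(1, 10), Fraction(1, 10))
print(p_sel)    # 10/19
print(p_sel_b)  # 1/10
```

Within the selected cases, A has probability 10/19, but once we also know that B occurred it drops back to 1/10. Knowing B makes A less likely, even though the two events are independent in the full population.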
May 6, 2015
How to explain overfitting to a non-specialist?
I explain like this.
May 4, 2015
Propensity models
Propensity models are used in clinical studies and in marketing to account for differences between treatment and control groups. Often assignment to treatment is not a random procedure, but somebody decides. The propensity score is the probability for a person to be assigned to treatment. The propensity score comes from a model.
Apr 28, 2015
What is digital discrimination?
Modern forms of discrimination are subtle and difficult to spot, and perhaps not even intentional. Indirect discrimination is such. Typically, it is a rule or a procedure that puts certain groups of people at a disadvantage. For example, a requirement to fill in a research grant application form in MS Excel puts users of Linux or Mac OS at a disadvantage. Software platform, of course, is not yet a legally recognised ground for discrimination, but it makes a good example.
Mar 8, 2015
Data-driven decision making may discriminate
In the era of big data more and more decisions are made using predictive models, built on historical data, for example, automated CV screening of job applicants, credit scoring for loans, or profiling of potential suspects by the police.
Nov 5, 2014
Seimo rinkimų balsavimo analizė (Analysis of voting in the Seimas elections)
This post is about analysis of voting data from Lithuanian Parliament elections in 2012. It is in Lithuanian, since, perhaps, it is of little interest to non-Lithuanian speakers.
Oct 31, 2014
Detecting auroras in all-sky camera images
In February, while on a winter holiday trip, I visited aurora researchers at The University Centre in Svalbard (UNIS). We talked about machine learning and stuff. They recently set up a colour camera at Kjell Henriksen Observatory, and are interested in detecting and recognising auroras from images in real time.
Jul 15, 2014
Braess's paradox
Braess’s paradox describes a situation in traffic planning, when adding an extra road makes things worse.
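A small numeric sketch makes the paradox concrete. The network below is the usual textbook example and is an assumption, not taken from the post: 4000 drivers go from Start to End via A or via B; Start→A and B→End take (number of cars)/100 minutes, while A→End and Start→B always take 45 minutes.

```python
DRIVERS = 4000
FIXED = 45  # minutes on the wide roads, independent of traffic


def congested(n_cars):
    """Minutes on a narrow road used by n_cars drivers."""
    return n_cars / 100


def times_without_shortcut(n_via_a):
    """Travel times of the two routes when n_via_a drivers go via A."""
    n_via_b = DRIVERS - n_via_a
    t_via_a = congested(n_via_a) + FIXED  # Start -> A -> End
    t_via_b = FIXED + congested(n_via_b)  # Start -> B -> End
    return t_via_a, t_via_b


# Without the extra road, an even split is an equilibrium: both routes
# take equally long, so nobody gains by switching.
t_via_a, t_via_b = times_without_shortcut(2000)
print(t_via_a, t_via_b)  # 65.0 65.0

# Add a free shortcut A -> B. If everyone drives Start -> A -> B -> End:
t_shortcut = congested(DRIVERS) + 0 + congested(DRIVERS)
print(t_shortcut)  # 80.0

# A single driver who skips the shortcut (Start -> A -> End) is worse off,
# so everyone stays on the shortcut route.
t_deviate = congested(DRIVERS) + FIXED
print(t_deviate)  # 85.0
```

Every driver ends up with 80 minutes instead of 65, and no individual driver can improve the situation by changing route alone.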
Jul 8, 2014
Heatmaps for visualizing events over time on a map
Here is an experiment to track prominent locations in a city.
Jan 3, 2014
PLS regression
Partial Least Squares (PLS) regression is popular in chemometrics, but not so well known in data streams. It is a linear regression model. Data is projected into lower dimensional space, and a regression model is produced.
Dec 13, 2013
Online adaptive estimation of mean and variance
Suppose we have a random variable $x$. Observations arrive in a stream, $x_t$ indicates the observation at time $t$. If we have access to all the historical observations, the mean is $\bar{x}_t = \frac{1}{t}\sum_{i=1}^t x_i$.
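The mean above does not require storing the history: it can be updated recursively as each $x_t$ arrives. Here is a minimal sketch using Welford's update for the mean and the (population) variance; the class name `RunningStats` is made up for the example, and the sketch weights all observations equally, without any forgetting.

```python
class RunningStats:
    """Mean and population variance of a stream, updated one value at a time."""

    def __init__(self):
        self.t = 0       # number of observations seen so far
        self.mean = 0.0  # running mean
        self.m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        self.t += 1
        delta = x - self.mean
        # mean_t = mean_{t-1} + (x_t - mean_{t-1}) / t
        self.mean += delta / self.t
        # Welford's update: uses the deviation from the old and the new mean.
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / self.t if self.t > 0 else 0.0


stats = RunningStats()
for x in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(x)
print(stats.mean, stats.variance)
```

For this toy stream the mean is 5 and the population variance is 4, the same values as when computing them offline from the full list.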
Dec 11, 2013
Distance between two geographical coordinates
Suppose we have two objects with known geographic coordinates in WGS84. Here is a simplified formula for calculating the Earth distance $D$ between these two objects in kilometres.
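For comparison, here is a sketch of one common way to do this, the haversine formula. It is not necessarily the simplified formula from the post: it assumes a spherical Earth with a mean radius of 6371 km, and the function name `haversine_km` is made up for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of latitude along a meridian is roughly 111 km.
print(haversine_km(0.0, 0.0, 1.0, 0.0))
```

The spherical approximation ignores the flattening of the WGS84 ellipsoid, so it is only meant for cases where an error of a fraction of a percent is acceptable.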
Nov 27, 2013
Predicting ratings of academic journals based on titles
I was wondering whether the title of an academic journal or conference somehow reflects its (perceived) quality. So I did an experiment.
Nov 23, 2013
Online adaptive regression
Earlier I wrote about online regression, which receives observations one by one and recursively learns a regression model. We get the same model as when learning offline on all the training observations. What if we want the model to adapt over time?
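One common way to make the model adapt is to down-weight old observations with a forgetting factor. The sketch below does that for a single input plus an intercept; this is an illustration under those assumptions, and the post itself may use a different scheme. The function name `adaptive_fit`, the parameter `forgetting` and the drifting toy stream are made up for the example.

```python
def adaptive_fit(stream, forgetting=0.95):
    """Fit y = b0 + b1*x, with exponentially decaying weights on older data.

    forgetting=1.0 keeps all observations equally weighted (no adaptation).
    """
    n = sx = sy = sxx = sxy = 0.0
    for x, y in stream:
        # Shrink the old sums, then add the new observation with weight 1.
        n = forgetting * n + 1.0
        sx = forgetting * sx + x
        sy = forgetting * sy + y
        sxx = forgetting * sxx + x * x
        sxy = forgetting * sxy + x * y
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("need at least two distinct x values")
    b1 = (n * sxy - sx * sy) / denom
    b0 = (sy - b1 * sx) / n
    return b0, b1


# Hypothetical drifting stream: the relationship changes halfway through,
# from y = 2x to y = 3 - x.
old = [(i % 5, 2.0 * (i % 5)) for i in range(50)]
new = [(i % 5, 3.0 - (i % 5)) for i in range(50)]
stream = old + new

print(adaptive_fit(stream, forgetting=1.0))  # compromise between both regimes
print(adaptive_fit(stream, forgetting=0.5))  # follows the recent regime
```

With strong forgetting the estimate is close to intercept 3 and slope -1, the most recent relationship. The price is a noisier estimate, since fewer observations effectively contribute.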
Nov 19, 2013
Online regression
Linear regression models assume that the relationship between $r$ input variables $X = (x_1,x_2,\ldots,x_r)$ and the target variable $y$ is linear in the form $y = b_1x_1 + b_2x_2 + \ldots + b_rx_r + e = XB + e,$ where the vector $B = (b_1, b_2,\ldots,b_r)^T$ contains the parameters of the linear model (regression coefficients), and $e$ is a random error.
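To show the online idea in its simplest form, here is a sketch for a single input plus an intercept, which is a simplification of the $r$-variable model above. It keeps only a few running sums and never stores the observations; the class name `OnlineSimpleRegression` is made up for the example.

```python
class OnlineSimpleRegression:
    """Least-squares fit of y = b0 + b1*x from running sums."""

    def __init__(self):
        self.n = 0
        self.sx = 0.0   # sum of x
        self.sy = 0.0   # sum of y
        self.sxx = 0.0  # sum of x*x
        self.sxy = 0.0  # sum of x*y

    def update(self, x, y):
        """Absorb one observation; the observation itself is not stored."""
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def coefficients(self):
        """Return (b0, b1) for the observations seen so far."""
        denom = self.n * self.sxx - self.sx ** 2
        if denom == 0:
            raise ValueError("need at least two distinct x values")
        b1 = (self.n * self.sxy - self.sx * self.sy) / denom
        b0 = (self.sy - b1 * self.sx) / self.n
        return b0, b1


model = OnlineSimpleRegression()
for x in range(5):
    model.update(x, 1.0 + 2.0 * x)  # noise-free toy data: y = 1 + 2x
print(model.coefficients())  # (1.0, 2.0)
```

The coefficients are the same as those from an offline least-squares fit on all observations, because the running sums contain everything the closed-form solution needs.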
Nov 18, 2013
Adaptive learning for traffic prediction by Yandex
Yandex provides congestion maps that include traffic jam forecasts. They use adaptive learning for that: predictive models are updated daily. Here is some more information about the algorithmic solution (in Russian).
Nov 17, 2013
How much energy does a mobile phone consume?
We have a new project, called TrafficSense. One of the goals is to infer and predict movement patterns of people using mobile sensing for better efficiency in transportation.
Nov 16, 2013
My new research blog
I am starting a research blog. I will post work in progress, interesting findings on related work and related applications, and shortcuts on how to do stuff. For example, how to calculate how much energy a mobile phone is using. Hence the blog name - Research and Stuff.
