• Notes on text mining human evolution

This is a follow-up to the post on trends in human evolution research, outlining how the analysis was done. It is not an elegant solution, and I apologize. I wanted a quick prototype, so I used a zoo of tools. I’m sure most of the steps could be done more elegantly and efficiently, but my solution worked for its purpose (exploration over a weekend).

• Trends in human evolution research

Tegan Foister asked me one day if we could analyze human evolution via text mining. As a test case I did topic modeling and sentiment analysis on research articles published in the Journal of Human Evolution over the last 50 years. The analysis covered 4014 articles.

• Scaling of tooth wear rates

We get rejections all the time. I’ve had perhaps hundreds of them over the years: jobs, funding applications, research papers. But this one somehow caught me off guard. This was a small-scale study prepared over the Christmas break and sent to a local journal. And it got unlistened to, hopelessly.

Long days of quarantine. Here is a piece of fiction on research and data.

• Can data speak for themselves?

I wrote an opinion piece about objectivity in evolutionary sciences, palaeontology and data science in general. It appears as a blog post in Nature Ecology and Evolution Community.

• The nature of time information in the fossil record

I wrote a beginner’s guide to time information in the fossil record. I used my tablecloths as illustrations. It begins like this:

• There will be no single AI culture

Artificial intelligence is a part of our culture. Technology is our culture, as is our ability to externalise thoughts and experiences in myths, in laws, in atlases or poetry. Be it for mundane tasks from managing our calendars to our street navigation, shopping lists or information retrieval, AI already externalises our brain, just like learning to control fire externalised our digestion [1].

• Happy birthday Alexander von Humboldt

I wrote a blog post in Nature Ecology and Evolution on the occasion of Alexander von Humboldt’s 250th birthday – Greetings from now to then.

• My address to PhD students

Recently I was invited to speak at a PhD student dinner. When the time came to speak, I improvised, but here is the script that I wrote for myself in preparation earlier.

• Do species age?

Mikael Fortelius and I wrote a blog post in Nature Ecology and Evolution to accompany our recent paper on The Red Queen. It starts like this.

• Normalizing R2 for using with cross-validation

The coefficient of determination, denoted as $R^2$, is commonly used in evaluating the performance of predictive models, particularly in life sciences. It indicates what proportion of variance in the target variable is explained by model predictions.
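As a point of reference, here is a minimal sketch of the textbook definition, $R^2 = 1 - SS_{res}/SS_{tot}$. It is not the normalized variant that the post title refers to, and the function name `r_squared` is only illustrative.

```python
def r_squared(y_true, y_pred):
    """Textbook coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: what the predictions fail to explain.
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Total sum of squares: variation of the target around its own mean.
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot
```

Perfect predictions give 1, and always predicting the mean of `y_true` gives 0. Predictions worse than the mean give negative values, so under this definition the score is not bounded below.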

• A survey on measuring indirect discrimination in machine learning

A first version of A survey on measuring indirect discrimination in machine learning is now posted to arXiv.

A recent study in criminology points out a major flaw in interpretation by previous studies, which have been suggesting that 50% of offenders released from state prisons return to prison within 3 to 5 years. The study points out that 2/3 of the prisoners actually never return. The figures are overstated due to sampling procedure. If a survey targets prisoners currently serving their sentence, it is biased towards returning prisoners, and not representative of all prisoners.

• Omitted variable bias and discrimination

Omitted variable bias occurs when a regression model is fit leaving out an important causal variable.
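A small simulation can make this concrete. The sketch below is an illustration under assumed numbers, not taken from the post: a hidden variable `z` drives both `x` and `y`, the true effect of `x` on `y` is 1, and we regress `y` on `x` alone.

```python
import random


def simple_slope(x, y):
    """Least-squares slope of y on x for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    return cov / var


def simulate_omitted_variable(n=20000, seed=0):
    """Fit y ~ x while leaving out z, which causes both x and y."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [zi + rng.gauss(0, 1) for zi in z]
    # True model: y = 1 * x + 2 * z + noise.
    y = [1.0 * xi + 2.0 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]
    return simple_slope(x, y)
```

With these assumed coefficients the fitted slope comes out close to 2 rather than the true value of 1, because `x` absorbs part of the effect of the omitted `z`.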

• Why cars in the next lane seem to go faster

Because more time is spent being overtaken by other vehicles than overtaking them. That’s what a Nature paper points out.

• Publishing in computer science vs. natural sciences

Publishing culture in computer science is quite different from natural sciences. Here are my impressions as an author and a reviewer in both, and as an editor in computer science (data mining and machine learning). In CS I’m mostly familiar with publishing related to data analysis, in NS I’m familiar with forestry, atmospheric sciences, palaeoecology, and medicine domains. Counterexamples can always be found, but here are general trends that I have observed.

• Apartment prices in Helsinki relate to accessibility by public transport

Traditionally, apartment prices are considered to relate to the apartment characteristics and its location. We had a hypothesis that accessibility of a neighbourhood is perhaps even more important than its location. So we did a pilot study in the Helsinki region to check that.

• Digital redlining

Back in the day, American banks used to draw maps redlining the areas in which they did not want to lend. Hence the term redlining.

Berkson’s paradox describes a situation in which two independent events become negatively correlated once we look only at cases where at least one of them occurs. Here is an illustration.
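The effect is easy to reproduce in a simulation. The sketch below is a hypothetical example, not the illustration from the post: two events `a` and `b` are drawn independently with the same assumed probability `p`, and we keep only the cases where at least one of them happened.

```python
import random


def berkson_demo(n=100000, p=0.3, seed=0):
    """Return P(a | b) and P(a | not b) among cases where a or b occurred."""
    rng = random.Random(seed)
    selected = []
    for _ in range(n):
        a = rng.random() < p  # event a, independent of b
        b = rng.random() < p  # event b, independent of a
        if a or b:            # selection: at least one event occurred
            selected.append((a, b))
    a_when_b = [a for a, b in selected if b]
    a_when_not_b = [a for a, b in selected if not b]
    return sum(a_when_b) / len(a_when_b), sum(a_when_not_b) / len(a_when_not_b)
```

Within the selected cases, `a` is certain whenever `b` is absent (otherwise the case would not have been selected), while it stays near `p` when `b` is present. Knowing that one event happened therefore makes the other look less likely, although the two were generated independently.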

• How to explain overfitting to a non-specialist?

I explain like this.

• Propensity models

Propensity models are used in clinical studies and in marketing to account for differences between treatment and control groups. Often assignment to treatment is not a random procedure, but somebody decides. The propensity score is the probability for a person to be assigned to treatment. The propensity score comes from a model.

• What is digital discrimination?

Modern forms of discrimination are subtle and difficult to spot, and perhaps not even intentional. Indirect discrimination is such. Typically, it is a rule or a procedure that puts certain groups of people at a disadvantage. For example, a requirement to fill in a research grant application form in MS Excel puts users of Linux or Mac OS at a disadvantage. Software platform, of course, is not yet a legally recognised ground for discrimination, but it makes a good example.

• Data-driven decision making may discriminate

In the era of big data more and more decisions are made using predictive models, built on historical data, for example, automated CV screening of job applicants, credit scoring for loans, or profiling of potential suspects by the police.

• Analysis of voting in the Seimas elections (Seimo rinkimų balsavimo analizė)

This post is about analysis of voting data from Lithuanian Parliament elections in 2012. It is in Lithuanian, since, perhaps, it is of little interest to non-Lithuanian speakers.

• Detecting auroras in all-sky camera images

In February, while on a winter holiday trip, I visited aurora researchers at The University Centre in Svalbard (UNIS). We talked about machine learning and stuff. They recently set up a colour camera at Kjell Henriksen Observatory, and are interested in detecting and recognising auroras from images in real time.

• Heatmaps for visualizing events over time on a map

Here is an experiment to track prominent locations in a city.

• PLS regression

Partial Least Squares (PLS) regression is popular in chemometrics, but not so well known in data streams. It is a linear regression model. Data is projected into a lower-dimensional space, and a regression model is built in that space.

• Online adaptive estimation of mean and variance

Suppose we have a random variable $x$. Observations arrive in a stream, $x_t$ indicates the observation at time $t$. If we have access to all the historical observations, the mean is $\bar{x}_t = \frac{1}{t}\sum_{i=1}^t x_i$.
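When storing the full history is not an option, the same mean can be maintained recursively. The sketch below uses Welford's algorithm, a standard choice that I am assuming here for illustration; it is not necessarily the adaptive estimator that the post develops, and the class name `RunningStats` is hypothetical.

```python
class RunningStats:
    """Running mean and population variance of a stream (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation x_t into the estimates."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n          # mean_t = mean_{t-1} + (x_t - mean_{t-1}) / t
        self._m2 += delta * (x - self.mean)  # uses the old and the new mean

    @property
    def variance(self):
        """Population variance of the observations seen so far."""
        return self._m2 / self.n if self.n else 0.0
```

Each update costs constant time and memory. Every observation keeps equal weight here, so this version does not forget old data.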

• Distance between two geographical coordinates

Suppose we have two objects with known geographic coordinates in WGS84. Here is a simplified formula for calculating the Earth distance $D$ between these two objects in kilometres.
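The simplified formula from the post is not reproduced in this excerpt. As a stand-in, here is a sketch of the haversine formula, a common way to compute such a distance. It treats the Earth as a sphere with an assumed mean radius of 6371 km, so it is an approximation rather than an exact WGS84 distance.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of a spherical Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1 to guard against rounding errors for antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))
```

For example, `haversine_km(0, 0, 0, 90)` covers a quarter of the equator under the spherical model.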

• Predicting ratings of academic journals based on titles

I was wondering whether the title of an academic journal or conference somehow reflects its (perceived) quality. So I did an experiment.

Earlier I wrote about online regression, which receives observations one by one and recursively learns a regression model. We get the same model as when learning offline on all the training observations. What if we want the model to adapt over time?

• Online regression

Linear regression models assume that the relationship between $r$ input variables $X = (x_1,x_2,\ldots,x_r)$ and the target variable $y$ is linear in the form $y = b_1x_1 + b_2x_2 + \ldots + b_rx_r + e = XB + e,$ where the vector $B = (b_1, b_2,\ldots,b_r)^T$ contains the parameters of the linear model (regression coefficients), and $e$ is a random error.
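One common way to learn $B$ one observation at a time is recursive least squares. The sketch below is a minimal pure-Python version under my own assumptions: the class name `OnlineLinearRegression` and the initial scale `delta` are hypothetical, and the post may derive its updates differently.

```python
class OnlineLinearRegression:
    """Recursive least squares for y = XB + e, updated one observation at a time."""

    def __init__(self, n_features, delta=1e6):
        self.coef = [0.0] * n_features
        # P approximates the inverse of X^T X; a large delta means a weak prior on B.
        self.P = [[delta if i == j else 0.0 for j in range(n_features)]
                  for i in range(n_features)]

    def predict(self, x):
        return sum(b * xi for b, xi in zip(self.coef, x))

    def update(self, x, y):
        r = len(x)
        # P x, reused for the gain and for the update of P (P is symmetric).
        Px = [sum(self.P[i][j] * x[j] for j in range(r)) for i in range(r)]
        denom = 1.0 + sum(x[i] * Px[i] for i in range(r))
        gain = [v / denom for v in Px]
        error = y - self.predict(x)  # prediction error before the update
        self.coef = [b + g * error for b, g in zip(self.coef, gain)]
        for i in range(r):
            for j in range(r):
                self.P[i][j] -= gain[i] * Px[j]
```

A possible usage is to create `OnlineLinearRegression(2)` and call `update(x, y)` for each pair as it arrives. With a large `delta`, the coefficients approach the offline least-squares solution computed on the same observations.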

• Adaptive learning for traffic prediction by Yandex

Yandex provides congestion maps that include traffic jam forecasts. They use adaptive learning for that: predictive models are updated daily. Here is some more information about the algorithmic solution (in Russian).

• How much energy does a mobile phone consume?

We have a new project, called TrafficSense. One of the goals is to infer and predict movement patterns of people using mobile sensing for better efficiency in transportation.

• My new research blog

I am starting a research blog. I will post work in progress, interesting findings on related work and related applications, and shortcuts on how to do stuff. For example, how to calculate how much energy a mobile phone is using. Hence the blog name - Research and Stuff.