The article “Why Machine Learning Doesn’t Work Well for Some Problems?” (Shahab, 2017) describes the effect of emergence as a barrier to predictive inference.
Emergence is the phenomenon of completely new behavior arising (emerging) from interactions of elementary entities, such as life emerging from biochemistry and collective intelligence emerging from social animals.
In general, the effects of emergence cannot be inferred through a priori analysis of a system (or of its elementary entities). While weak emergence can still be understood by observing or simulating the system, the qualities arising from strong emergence cannot be simulated with current systems.
Sheikh-Bahei suggests interpreting emergence (in a predictive context) as an additional dimension, called the E-Dimension, where moving along that dimension gives rise to new qualities. Crossing E-Dimensions during inference reduces predictive power, because emergent qualities cannot necessarily be described as a function of the observed features alone. The more E-Dimensions are crossed during inference, the lower the prediction success will be, regardless of the amount of feature noise. Current-generation algorithms do not handle this kind of problem well, and further research is required in this area.
Hypothetical example of the E-Dimension concept: emergence phenomena can be considered a barrier to making predictive inferences. The further the target is from the features along this dimension, the less information the features provide about the target. The figure shows an example of predicting organism-level properties (target) from molecular and physicochemical properties (feature space). (Shahab, 2017)
Effects of emergence on example machine learning problems (Shahab, 2017):
Sadly, running jupyter notebook from within a conda environment does not imply your notebook also runs in the same environment. Thankfully, there’s an easy fix for that, namely nb_conda, and you’ll get it using
conda install nb_conda
in the environment of your choice. After that, start up your notebook and select the Kernel you want either when creating a new notebook or from the notebook’s Kernel menu:
There we go.
August 23rd, 2017 GMT+2 by
Markus
· 1 comment
The Baum-Welch algorithm determines the (locally) optimal parameters for a Hidden Markov Model by essentially using three equations.
One for the initial probabilities:
\begin{align}
\pi_i &= \frac{E\left(\text{Number of times a sequence started with state}\, s_i\right)}{E\left(\text{Number of times a sequence started with any state}\right)}
\end{align}
Another for the transition probabilities:
\begin{align}
a_{ij} &= \frac{E\left(\text{Number of times the state changed from}\, s_i \, \text{to}\,s_j\right)}{E\left(\text{Number of times the state changed from}\, s_i \, \text{to any state}\right)}
\end{align}
And the last one for the emission probabilities:
\begin{align}
b_{ik} &= \frac{E\left(\text{Number of times the state was}\, s_i \, \text{and the observation was}\,v_k\right)}{E\left(\text{Number of times the state was}\, s_i\right)}
\end{align}
If one had a fully labeled training corpus representing all possible outcomes, this would be exactly the optimal solution: count each occurrence, normalize, and you’re done. If, however, no such labeled corpus is available (i.e. only observations are given, without the corresponding state sequences), the expected values \(E(c)\) of these counts have to be estimated. This can be done (and is done) using the forward and backward probabilities \(\alpha_t(i)\) and \(\beta_t(i)\), as described below.
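For the fully labeled case, the three ratios above really do reduce to plain counting. Here is a small Python sketch of that counting step (the function name and the toy sequences are my own; this is not the full Baum-Welch with forward/backward estimates):

```python
# Sketch: with a fully labeled corpus, the Baum-Welch ratios reduce to counting.
from collections import Counter

def count_hmm_params(state_seqs, obs_seqs, states, symbols):
    """Estimate pi, a and b by counting over labeled (state, observation) sequences."""
    starts = Counter(seq[0] for seq in state_seqs)
    trans, emit, state_count = Counter(), Counter(), Counter()
    for sseq, oseq in zip(state_seqs, obs_seqs):
        for t, (s, o) in enumerate(zip(sseq, oseq)):
            state_count[s] += 1          # times the state was s_i
            emit[(s, o)] += 1            # times state s_i emitted v_k
            if t + 1 < len(sseq):
                trans[(s, sseq[t + 1])] += 1  # transitions s_i -> s_j
    pi = {s: starts[s] / len(state_seqs) for s in states}
    a = {(i, j): trans[(i, j)] / max(sum(trans[(i, k)] for k in states), 1)
         for i in states for j in states}
    b = {(i, v): emit[(i, v)] / max(state_count[i], 1)
         for i in states for v in symbols}
    return pi, a, b
```

Without labels, the counts in `starts`, `trans` and `emit` are replaced by their expectations computed from \(\alpha_t(i)\) and \(\beta_t(i)\).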
September 1st, 2014 GMT+2 by
Markus
· 0 comments
While planning an eleven-day trekking trip through the Hardangervidda in Norway, I came across the age-old problem of estimating the walking time for a given path on the map. While one can easily determine the times for the main west-east and north-south routes from a travel guide, there is sadly no such information for self-made problems (i.e. custom routes). Obviously, a simple and correct solution needs to be found.
Of course, there is no such thing. When searching for hiking time rules, two candidates pop up regularly: Naismith’s rule (including Tranter’s corrections) and Tobler’s hiking function.
William W. Naismith’s rule (and I couldn’t find a single scientific source for it) is more a rule of thumb than an exact formula. It states:
For every 5 kilometres, allow one hour. For every 600 metres of ascent, add another hour.
which reads as
\begin{align}
\theta &= \tan^{-1}(\frac{\Delta a}{\Delta s}) \\
t &= \Delta s \left( \frac{1\mathrm{h}}{5\mathrm{km}} \right) + \Delta a \left( \frac{1 \mathrm{h}}{0.6 \mathrm{km}} \right) \\
|\vec{w}| &= \frac{\Delta s}{t}
\end{align}
where \(|\vec{w}|\) is the walking speed, \(\Delta s\) the length on the horizontal plane (i.e. “forward”), \(\Delta a\) the ascent (i.e. the difference in height) and \(\theta\) the slope angle.
function [w, t, slope] = naismith(len, ascend)
% Naismith's rule: len = horizontal distance [km], ascend = height gain [km].
% Returns walking speed w [km/h], time t [h] and the slope (rise over run).
% ("len" instead of "length" to avoid shadowing the Matlab builtin.)
slope = ascend ./ len;
t = len .* (1/5) + ascend .* (1/0.6);
w = len ./ t;
end
That looks like this:
Interestingly, this implies that if you climb a 3 km mountain straight up, it will take you 5 hours. By recognising that \(5 \textrm{km} / 0.6 \textrm{km} \approx 8.3 \approx 8\) , the 8 to 1 rule can be employed, which allows the transformation of any (Naismith-ish) track to a flat track by calculating
\begin{align}
\Delta s_{flat} &= \Delta s + \frac{5 \mathrm{km}}{0.6 \mathrm{km}} \cdot \Delta a\\
&\approx \Delta s + 8 \cdot \Delta a
\end{align}
So a track of \(20\,\mathrm{km}\) in length with \(1\,\mathrm{km}\) of ascent would make for \(20\,\mathrm{km} + 8 \cdot 1\,\mathrm{km} = 28\,\mathrm{km}\) of total track length. Assuming an average walking speed of \(5\,\mathrm{km/h}\), that route will take \(28\,\mathrm{km} / 5\,\mathrm{km/h} = 5.6\,\mathrm{h}\), or 5 hours and 36 minutes. Although quite inaccurate, somebody found this rule to be accurate enough when comparing it against times of men running down hills in Norway. Don’t quote me on that.
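The 8-to-1 shortcut is simple enough to sketch in a few lines of Python (the function name is mine; it uses the rounded factor 8 from above, not the exact 8.33):

```python
def naismith_flat_time(dist_km, ascent_km, speed_kmh=5.0):
    """Naismith time via the 8-to-1 rule: convert ascent to equivalent flat
    distance (8 km of flat per 1 km of climb), then divide by walking speed."""
    flat_km = dist_km + 8.0 * ascent_km
    return flat_km / speed_kmh

# The worked example from the text: 20 km with 1 km of ascent at 5 km/h.
hours = naismith_flat_time(20, 1)  # -> 5.6 hours, i.e. 5 h 36 min
```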
Robert Aitken assumed that 5 km/h might be too much and settled for 4 km/h on all off-track surfaces. Unfortunately, Naismith’s rule still didn’t say anything about descent or slopes in general, so Eric Langmuir added some refinements:
When walking off-track, allow one hour for every 4 kilometres (instead of 5 km). When on a small decline of 5 to 12°, subtract 10 minutes per 300 metres (1000 feet). For any steeper decline (i.e. over 12°), add 10 minutes per 300 metres of descent.
Now that’s the stuff wonderfully non-differentiable functions are made of:
It should be clear that 12 km/h is a highly unlikely walking speed, even on roads.
function [w, t, slope] = naismith_al(len, ascend, base_speed)
% Naismith's rule with the Aitken/Langmuir refinements.
% len = horizontal distance [km], ascend = height difference [km]
% (negative for descent), base_speed = flat walking speed [km/h].
if ~exist('base_speed', 'var')
    base_speed = 4; % km/h, Aitken's off-track speed
end
slope = ascend/len;
t = len*(1/base_speed);
if slope >= 0
    % ascent: one hour per 600 m of climb
    t = t + ascend*(1/0.6);
elseif atand(slope) <= -5 && atand(slope) >= -12
    % gentle decline (5 to 12 degrees): subtract 10 min per 300 m of descent
    t = t - abs(ascend)*((10/60)/0.3);
elseif atand(slope) < -12
    % steep decline (over 12 degrees): add 10 min per 300 m of descent
    t = t + abs(ascend)*((10/60)/0.3);
end
w = len./t;
end
So Waldo Tobler came along and developed his “hiking function”, an equation that assumes a top speed of 6 km/h and comes with an interesting feature: it, though still non-differentiable, adapts gracefully to the slope of the ground. The function can be found in his 1993 report “Three presentations on geographical analysis and modeling: Non-isotropic geographic modeling; speculations on the geometry of geography; global spatial analysis” and looks like the following:
It boils down to the following equation for the walking speed \(|\vec{w}|\) “on footpaths in hilly terrain” (with \(s=1\)) and for “off-path travel” (with \(s=0.6\)):
\begin{align}
|\vec{w}| &= s \cdot 6\,\mathrm{km/h} \cdot e^{-3.5 \left|\tan(\theta) + 0.05\right|}
\end{align}
where \(\tan(\theta)\) is the tangent of the slope (i.e. vertical distance over horizontal distance). By taking into account the exact slope of the terrain, this function is superior to Naismith’s rule and a much better alternative to the Langmuir bugfix, especially when used on GIS data.
function [w] = tobler(slope, scaling)
% Tobler's hiking function: slope = tan(theta) (rise over run),
% scaling = 1 on footpaths, 0.6 for off-path travel. Returns w in km/h.
w = scaling*6*exp(-3.5 * abs(slope+0.05));
end
It, however, lacks the one thing that makes Naismith’s rule stand out: Tranter’s corrections for fatigue and fitness. (Yes, I know it gets weird.) Sadly, these corrections seem to exist only in the form of a mystical table that looks basically like this:
Time in hours according to Naismith’s rule:

| Fitness in minutes | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 12 | 14 | 16 | 18 | 20 | 22 | 24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15 (very fit) | 1 | 1½ | 2 | 2¾ | 3½ | 4½ | 5½ | 6¾ | 7¾ | 10 | 12½ | 14½ | 17 | 19½ | 22 | 24 |
| 20 | 1¼ | 2¼ | 3¼ | 4½ | 5½ | 6½ | 7¾ | 8¾ | 10 | 12½ | 15 | 17½ | 20 | 23 | | |
| 25 | 1½ | 3 | 4¼ | 5½ | 7 | 8½ | 10 | 11½ | 13¼ | 15 | 17½ | | | | | |
| 30 | 2 | 3½ | 5 | 6¾ | 8½ | 10½ | 12½ | 14½ | | | | | | | | |
| 40 | 2¾ | 4¼ | 5¾ | 7½ | 9½ | 11½ | | | | | | | | | | |
| 50 (unfit) | 3¼ | 4¾ | 6½ | 8½ | | | | | | | | | | | | |
where the minutes are a rather obscure measure of how fast somebody is able to hike up 300 metres of height over a distance of 800 metres (\(\approx 20^\circ\)). With that table, the rule is: if you get into nastier terrain, drop one fitness level. If you suck at walking, drop a fitness level. If you carry a 20 kg backpack, drop one level. Sadly, there’s no equation to be found, so I had to make one up myself.
By looking at the table and the mesh plot it seems that each time axis for a given fitness is logarithmic.
I did a log-log plot, and it turns out that the series appear to be logarithmic not only in time but also in fitness. By deriving the (log-log) linear regression for each series, the following equations can be found:
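A log-log linear regression of this kind can be sketched as follows (NumPy-based; the function name is mine, and the post’s actual per-series coefficients are not reproduced here):

```python
import numpy as np

def loglog_fit(t_naismith, t_corrected):
    """Fit log(t_corrected) = m * log(t_naismith) + c for one fitness series.
    Returns the slope m and intercept c of the log-log regression line."""
    x = np.log(np.asarray(t_naismith, dtype=float))
    y = np.log(np.asarray(t_corrected, dtype=float))
    m, c = np.polyfit(x, y, 1)  # least-squares line in log-log space
    return m, c
```

A slope \(m\) close to 1 would mean the correction is a pure scaling; slopes above 1 mean the correction grows faster than the Naismith time, which matches the fatigue idea.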
These early approximations appear to be quite good, as can be seen in the following linear plot. The last three lines \(t_{30}\), \(t_{40}\) and \(t_{50}\), however, begin to drift away. That’s expected for the last two due to the small number of samples, but the \(t_{30}\) line was irritating.
My first assumption was that the \(t_{40}\) and \(t_{50}\) lines are simply outliers and that the real coefficient for the time variable is the (outlier-corrected) mean of \(1.2215 \pm 0.11207\). This would imply that the intercept coefficient is the variable for fitness.
Unfortunately, this only seems to make things better in the log-log plot, but makes them a little bit worse in the linear world.
Equidistant intercept coefficients did not do the trick either. Well, well. In the end, I decided to give the brute-force method a chance and defined several fitting functions for use with genetic-algorithm and pattern-search solvers, including exponential, third-order and sigmoidal forms. The best version I could come up with was
This function results in a least-squares error of about 21.35 hours over all data points. The following shows the original surface from the table and the synthetic surface from the function.
A maximum deviation of about 1 hour can be seen clearly in the following error plot for the \(t_{30}\) line, which really seems to be an outlier.
For comparison (here’s the original table), this is the synthetic correction table:
Time in hours according to Naismith’s rule:

| Fitness in minutes | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 12 | 14 | 16 | 18 | 20 | 22 | 24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15 (very fit) | 1¼ | 2 | 2¾ | 3½ | 4½ | 5¼ | 6¼ | 7¼ | 8¼ | 10¼ | 12¼ | 14½ | 16½ | 18¾ | 21¼ | 23½ |
| 20 | 1½ | 2½ | 3½ | 4½ | 5½ | 6¾ | 7¾ | 9 | 10¼ | 12¾ | 15½ | 18¼ | 21 | 23¾ | | |
| 25 | 1¾ | 3 | 4 | 5¼ | 6¾ | 8 | 9½ | 10¾ | 12¼ | 15½ | 18½ | | | | | |
| 30 | 2 | 3¼ | 4¾ | 6¼ | 7¾ | 9¼ | 11 | 12½ | | | | | | | | |
| 40 | 2½ | 4¼ | 6 | 7¾ | 9¾ | 11¾ | | | | | | | | | | |
| 50 (unfit) | 3 | 5 | 7¼ | 9½ | | | | | | | | | | | | |
June 14th, 2014 GMT+2 by
Markus
· 0 comments
So, suppose you’re in university, it’s that time of the year again (i.e. exams) and you have already written some of them. Some are still left, though, and you wonder: how badly can you do (or: how good do you have to be) in the remaining exams, given that you do not want your mean grade to be worse than a given value?
Linear programming
Say you’re in Germany and the possible grades are {1, 1.3, 1.7, 2, …, 4} (a discrete set of steps), with 1 being the best grade and 4 being only a minor step away from a major fuckup. Given that you’ve already written four exams with the grades 1, 1, 1.3 and 1, that you do not want a mean grade worse than 1.49 in the end (because 1.5 would be rounded to 2 on your diploma), and that there are still 9 more exams to write, the question is: which worst-case grades can you afford in the upcoming exams, and what would those imply for the others?
This is what’s known as a linear programming (or linear optimization) problem, and since the values (here: the grades) are constrained to discrete steps, it is an integer programming problem.
The goal of linear programming is to find the arguments \(\vec{x}\) of the objective function \(f(\vec{x})\) such that \(f(\vec{x})\) is maximized, given some constraints on \(\vec{x}\). In Matlab, all linear programming functions try to minimize the cost function, so the problem is formulated as
\begin{align}
\min_{\vec{x}} \; \vec{f}^{\,T} \vec{x} \quad \text{subject to} \quad A\vec{x} \le \vec{b}, \quad A_{eq}\vec{x} = \vec{b}_{eq}, \quad \vec{l} \le \vec{x} \le \vec{u}
\end{align}
Obviously, maximizing an objective function is the same as minimizing its negative, so \(\max f(\vec{x}) = -\min\left(-f(\vec{x})\right)\). In Matlab, this kind of problem can be solved with the linprog function.
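As a sketch of the grade example in SciPy terms (a continuous relaxation: the discrete German grade steps would additionally require integer programming, e.g. via scipy.optimize.milp; the variable names are mine):

```python
# Worst-case grade budget for the remaining exams, as a linear program.
import numpy as np
from scipy.optimize import linprog

done = [1.0, 1.0, 1.3, 1.0]   # grades already written (example from the text)
n_left, target_mean = 9, 1.49

# maximize sum(x)  <=>  minimize -sum(x)
c = -np.ones(n_left)
# mean constraint: (sum(done) + sum(x)) / (len(done) + n_left) <= target_mean
A_ub = [np.ones(n_left)]
b_ub = [target_mean * (len(done) + n_left) - sum(done)]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(1, 4)] * n_left)

worst_total = -res.fun  # largest grade sum you can still afford overall
```

With these numbers the budget is \(1.49 \cdot 13 - 4.3 = 15.07\), i.e. an average of roughly 1.67 per remaining exam; the relaxation tells you the budget, while the discrete version would tell you which actual grade combinations stay under it.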