-
-
Notifications
You must be signed in to change notification settings - Fork 16
/
08-splitcombine.qmd
252 lines (179 loc) · 7.26 KB
/
08-splitcombine.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
# Split-apply-combine
```{julia}
#| echo: false
#| output: false
import Pkg
Pkg.activate(".")
using GeoStats
import CairoMakie as Mke
```
In **Part II** of the book, we introduced **transform pipelines** to
process the `values` and the `domain` of geotables using high-level
abstractions that preserve geospatial information. In this chapter, we
start to introduce other tools to **query** geotables after they have been
pre-processed by pipelines.
## Motivation
In geospatial data science, geoscientific questions are often posed in
terms of both the `values` and the `domain` of a geotable. For example:
1. **Where** are the areas with high probability of landslide?
2. What is the average rainfall **per watershed** over the last year?
3. How much lithium will be mined from each **geological unit**?
4. What is the variation of log-permeability per **depositional facies**?
The word "where" is often present in these questions to indicate that
answers must be georeferenced. If the variable of interest is already
present in the geotable, then we can effectively answer "where" questions
using the `viewer` and the `Filter` transform from previous chapters:
```julia
geotable |> Filter(row -> row.probability > 0.9) |> viewer
```
If the variable of interest is not present in the geotable, or if the
"where" word is not present in the original question, then there will
be some reference to "geospatial units" on which geostatistics must be
computed. These questions can be answered with a **geospatial** version
of the **split-apply-combine** strategy [@Wickham2011] from data science.
Our framework provides the `@groupby`, `@transform` and `@combine` macros
to split-apply-combine geotables.
## Bonnie data set
We will use the `Bonnie` data set to illustrate our geospatial split-apply-combine:
> The Bonnie Project Example is under copyright of Transmin Metallurgical Consultants, 2019.
> It is issued under the Creative Commons Attribution-ShareAlike 4.0 International Public License.
```{julia}
using GeoIO
gt = GeoIO.load("data/bonnie.csv", coords = ("EAST", "NORTH", "RL"))
```
It represents a 3D mineral deposit with grades in parts per million (ppm),
sulfur contaminant in percent, and other categorical variables with geological
and lithological units:
```{julia}
names(gt)
```
The "EAST", "NORTH" and "RL" coordinates used to `georef` the CSV file
represent the centroids of the mining blocks. The sides of these blocks
are provided as metadata (e.g., 5x5x5):
```{julia}
gt |> viewer
```
Let's clean the geotable using what we learned in previous chapters.
We will reject the column with sulfur, will rename the variables for
greater readability, and will drop missing values. We will also add
units to some of the variables using bracket notation:
```{julia}
clean = Reject("Sper") →
Rename("Agppm" => "Ag [ppm]",
"Auppm" => "Au [ppm]",
"Asppm" => "As [ppm]",
"Cuppm" => "Cu [ppm]",
"ISBD" => "ρ [Mg/m^3]",
"CODE" => "geo",
"OX" => "litho") →
DropMissing() →
Unitify()
gt = gt |> clean
```
::: {.callout-note}
The `Unitify` transform recognizes unit bracket notation in column names.
It adds the units specified within brackets to the values of the columns,
and removes the brackets from the column names.
:::
That is a lot better! Let's assume that we want to answer the following
business question:
*What is the total mass of gold that will be mined from each geological unit?*
## Splitting geotables
We can split geotables into lazy **geospatial partitions** based on values stored
in a column. In this case, we want to split the mineral deposit in geological units
stored in the `geo` column:
```{julia}
groups = @groupby(gt, "geo")
```
There are two geological units in this deposit, represented as `SubGeoTable`. We
can access these units by indexing into the geospatial partition:
```{julia}
groups[1]
```
```{julia}
viz(groups[1].geometry, color = "teal")
viz!(groups[2].geometry, color = "slategray3")
Mke.current_figure()
```
## Applying expressions
The mass of gold in a mining block is a function of the gold grade (`Au`), the rock
density (`ρ`), and the `volume` of the block:
$$
m = Au \times \rho \times V
$$
Let's use an auxiliary function to convert the `Point`s in the `geometry` column into
`Box`es of sides 5x5x5. The function takes a centroid point as input and produces a
box centered at the point with corners that are 2.5x2.5x2.5 units away in both
directions:
```{julia}
box(point) = Box(point - Vec(2.5, 2.5, 2.5), point + Vec(2.5, 2.5, 2.5))
```
```{julia}
box.(gt.geometry)
```
The `@transform` macro modifies or creates new columns in the geotable based on
expressions with existing column names. In this case, we want to replace the
`geometry` column by calling the auxiliary function above:
```{julia}
gt = @transform(gt, :geometry = box(:geometry))
```
The macro will broadcast the expression to all rows of the geotable.
::: {.callout-note}
We used the `:geometry` symbol to refer to the `geometry` column instead of the usual
`"geometry"` string. The `@transform` macro understands that strings can also appear
as valid values in the right-hand-side of the expression, which are not columns in the
geotable. To mark a string as a column name, we need to use curly braces:
```julia
@transform(gt, {"geometry"} = box({"geometry"}))
```
:::
::: {.callout-note}
## Tip for all users
The use of symbols to represent column names is preferred in macros.
:::
The mass of gold on each mining block can be computed now that the geometries have `volume`:
```{julia}
@transform(gt, :m = :Au * :ρ * volume(:geometry))
```
::: {.callout-note}
The `@transform` macro can be used with both geotables and geospatial partitions.
:::
## Combining results
We can use the `@combine` macro to reduce columns of geotables in a geospatial partition
obtained with the `@groupby` macro. The macro is similar to the `@transform` macro,
but expects valid reduction functions such as `sum`, `mean`, `std`. Reduction functions
take a vector of values as input and produce a single scalar as output:
```{julia}
groups = @groupby(gt, :geo)
@combine(groups, :μ = mean(:Au), :σ = std(:Au))
```
Note that the macro reduces the geometry column and produces a new complex `Multi`
geometry with all the `Box`es that are inside each geological unit. This is a very
advanced feature of the framework that
**cannot be represented with the simple features standard**.
It is also possible to use a custom reduction function for the `geometry` column:
```{julia}
groups = @groupby(gt, :geo)
@combine(groups, :μ = mean(:Au), :σ = std(:Au), :geometry = first(:geometry))
```
## Answering questions
Let's recall our original business question:
*What is the total mass of gold that will be mined from each geological unit?*
We can now answer this question with three lines of code:
```{julia}
groups = @groupby(gt, :geo)
mass = @transform(groups, :m = :Au * :ρ * volume(:geometry))
answer = @combine(mass, :m = sum(:m))
```
::: {.callout-note}
## Tip for all users
We can simplify the code further using the `@chain` macro. It forwards the
resulting geotable or geospatial partition to the next macro:
```{julia}
@chain gt begin
@groupby(:geo)
@transform(:m = :Au * :ρ * volume(:geometry))
@combine(:m = sum(:m))
end
```
:::