From pixels to patches: Pooling strategies for earth embeddings

As geospatial foundation models shift from patch-level to pixel-level embeddings, practitioners must aggregate thousands of pixel vectors into patch representations that preserve class-discriminative signal while matching downstream label resolution. The default choice, mean pooling, discards within-patch variability and can reduce accuracy by more than 10% under spatial shift. To evaluate this effect, we introduce EuroSAT-Embed, a dataset of 81,000 embedding GeoTIFFs derived from three foundation models: AlphaEarth, OlmoEarth, and Tessera. We benchmark 11 training-free and 2 parametric pooling methods under both random and geographically disjoint test splits. Results show that richer pooling schemes reduce the geographic generalization gap by up to 40% relative to mean pooling and increase accuracy by up to 5% on spatial splits. We recommend Generalized Mean Pooling (GeM) as a drop-in replacement for mean pooling, as it improves accuracy without increasing embedding dimensionality. For maximum accuracy, Stats pooling, which concatenates the per-dimension minimum, maximum, mean, and standard deviation, performs best, at the cost of a fourfold increase in embedding dimensionality. We further find that pooling effectiveness varies across embedding sources and that higher-dimensional embeddings benefit most from distributional statistics.
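
To make the three pooling operators named above concrete, the following is a minimal NumPy sketch, not the benchmark implementation. It assumes a patch is an array of shape (H, W, D) holding one D-dimensional embedding per pixel. The function names `mean_pool`, `gem_pool`, and `stats_pool` are hypothetical, as are the exponent `p=3.0` and the clamp `eps`. GeM is written in its common form from image retrieval, which clamps inputs to be positive. The abstract does not state how negative embedding values or the exponent are handled, so both choices are assumptions.

```python
import numpy as np


def _flatten(patch: np.ndarray) -> np.ndarray:
    """Reshape an (H, W, D) patch into an (H*W, D) matrix of pixel vectors."""
    return patch.reshape(-1, patch.shape[-1])


def mean_pool(patch: np.ndarray) -> np.ndarray:
    """Average every embedding dimension over all pixels; output shape (D,)."""
    return _flatten(patch).mean(axis=0)


def gem_pool(patch: np.ndarray, p: float = 3.0, eps: float = 1e-6) -> np.ndarray:
    """Generalized mean: (mean(x ** p)) ** (1 / p) per dimension; output shape (D,).

    Assumption: values are clamped to at least `eps` before the power is taken,
    as in the usual image-retrieval formulation. p = 1 gives mean pooling on
    positive inputs, and larger p moves the result toward max pooling.
    """
    clamped = np.clip(_flatten(patch), eps, None)
    return np.mean(clamped ** p, axis=0) ** (1.0 / p)


def stats_pool(patch: np.ndarray) -> np.ndarray:
    """Concatenate per-dimension min, max, mean, and std; output shape (4 * D,)."""
    flat = _flatten(patch)
    return np.concatenate(
        [flat.min(axis=0), flat.max(axis=0), flat.mean(axis=0), flat.std(axis=0)]
    )


if __name__ == "__main__":
    # Hypothetical patch: 8 x 8 pixels with 4-dimensional embeddings.
    rng = np.random.default_rng(0)
    patch = rng.random((8, 8, 4)) + 0.1
    print(mean_pool(patch).shape, gem_pool(patch).shape, stats_pool(patch).shape)
```

In this sketch, `mean_pool` and `gem_pool` both return a vector of length D, which is why GeM can stand in for mean pooling without changing the downstream classifier's input size, whereas `stats_pool` returns a vector of length 4D.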