High-Resolution Soybean Yield Mapping

Nov 25, 2020


In the United States, farmers planted over 90 million acres of soybean in 2017; in 2018, soybean planted acreage outpaced corn for the first time in decades [1]. Soybean is the United States’ most lucrative agricultural export, driven by increasing demand for animal feeds associated with rising meat consumption around the world [1].

With such an important agricultural commodity, understanding spatial and temporal patterns in yield is of widespread interest. This heterogeneity across space and time can be used to help identify yield gaps, inform farm management strategies, and guide sustainable intensification [2, 3]. One familiar example is using high-resolution yield maps to vary management at a subfield scale, with potential for closing yield gaps while improving input efficiency and environmental outcomes [4,5].

Where ground-truth data on yields are not available, leveraging remote sensing and machine learning to accurately estimate farm yields can help provide these insights. To this end, cloud computing and freely available, high-resolution satellite data have enabled recent progress in crop yield mapping at fine scales. However, being able to test the performance of a high-resolution crop yield mapping algorithm, with extensive data at a matching spatial resolution, remains uncommon or infeasible due to data availability. This has limited the ability to evaluate different yield estimation models and improve understanding of key features useful for yield estimation.

Furthermore, among global agricultural staples, soybean is a crop for which high-resolution yield mapping has not been fully explored. In past soybean yield mapping efforts, many different approaches have been taken, in terms of spatial resolution (ranging from a few meters to the county level), spatio-temporal extent (from single fields in a single year, to thousands of fields over many years), and predictor variables. Comparing approaches on a robust, extensive ground-truth dataset might help point to how we can improve yield estimation efforts in the future.

In a recent study published in Remote Sensing, we assessed machine learning models’ capacity for soybean yield prediction using a unique ground-truth dataset of high-resolution (5 m) yield maps generated from combine harvester yield monitor data for over a million field-year observations across the Midwestern United States from 2008 to 2018. That’s a lot of beans!

First, we explored how well a machine learning algorithm can predict soybean yields in new regions/years, and the relative differences in performance for key metrics of crop growth. Then, building on this modelling work, we examined whether models trained at the aggregated level of freely available data differ considerably from those trained with the rarer fine-scale data. Finally, we leveraged findings from the empirical modelling work to improve a simulations-based approach to yield mapping, which is essential for regions/years in which we do not have ground data.

To ascertain how well a model might perform at predicting fine-scale yields in new regions and years, we selected the best performing random forest model and tested it on unseen counties. The preferred model was able to explain 44% of yield variability on held-out test data. The preferred model implementation relied on predictors derived from harmonic regressions — additive combinations of sine and cosine waves — which have been shown to capture the growth and decline of crop development during the growing season. Figure 1 illustrates how harmonic regressions can smooth noisy satellite time series: the raw time series of a vegetation index (GCVI) is shown on the top row, and the resulting harmonic regression fit is shown in orange on the bottom row.

Figure 1.
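To make the idea of a harmonic regression concrete, here is a minimal sketch that fits a constant plus sine and cosine terms to a noisy vegetation index series by ordinary least squares. It is not the study's implementation: the number of harmonics, the function names, and the synthetic values standing in for GCVI observations are all assumptions chosen for illustration.

```python
import numpy as np

def harmonic_design(t, n_harmonics=2):
    """Build columns [1, cos(2*pi*k*t), sin(2*pi*k*t), ...] for k = 1..n_harmonics.

    t is time expressed as a fraction of the year (0.0 to 1.0).
    """
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t)]
    for k in range(1, n_harmonics + 1):
        cols.append(np.cos(2 * np.pi * k * t))
        cols.append(np.sin(2 * np.pi * k * t))
    return np.column_stack(cols)

def fit_harmonic(t, vi, n_harmonics=2):
    """Least-squares fit of harmonic coefficients to a vegetation index series."""
    X = harmonic_design(t, n_harmonics)
    coefs, *_ = np.linalg.lstsq(X, np.asarray(vi, dtype=float), rcond=None)
    return coefs

def predict_harmonic(coefs, t):
    """Evaluate the fitted harmonic curve at times t."""
    n_harmonics = (len(coefs) - 1) // 2
    return harmonic_design(t, n_harmonics) @ coefs

# Synthetic example only: these numbers are made up, not real satellite data.
rng = np.random.default_rng(0)
t_obs = np.sort(rng.uniform(0.3, 0.8, size=25))      # irregular acquisition dates
true_coefs = np.array([3.0, -2.0, 0.5, 0.4, -0.3])   # hypothetical coefficients
clean = harmonic_design(t_obs) @ true_coefs
noisy = clean + rng.normal(0.0, 0.2, size=t_obs.size)

coefs = fit_harmonic(t_obs, noisy)
smooth = predict_harmonic(coefs, t_obs)               # smoothed series
```

The fitted coefficients themselves, or values read off the smoothed curve, can then serve as predictors for a yield model.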

Next, we assessed a suite of metrics which attempt to parsimoniously capture crop growth information in just a few predictor variables. For example, we compared the results of summing daily vegetation index (estimated using harmonic regressions, as above) with using just the peak vegetation index value to predict end-season yields. We found that the approaches all performed relatively similarly in our implementation (which included weather covariates), but that these parsimonious metrics tended to under-perform a full set of monthly vegetation index (or raw satellite band) observations. This may illustrate the capacity of modern machine learning algorithms to effectively ‘learn’ relevant interactions and responses even with many predictor variables.
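The two parsimonious metrics mentioned above, the peak vegetation index and the sum of daily values, can both be read off a fitted harmonic curve. The sketch below assumes the coefficients are ordered as a constant followed by cosine and sine pairs, and that the season is given as a range of days of the year; the function names, the example coefficients, and the day range are hypothetical.

```python
import math

def harmonic_value(coefs, day, period=365.0):
    """Evaluate c0 + sum_k [a_k cos(2*pi*k*d/P) + b_k sin(2*pi*k*d/P)] at one day."""
    value = coefs[0]
    n_harmonics = (len(coefs) - 1) // 2
    for k in range(1, n_harmonics + 1):
        angle = 2 * math.pi * k * day / period
        value += coefs[2 * k - 1] * math.cos(angle)
        value += coefs[2 * k] * math.sin(angle)
    return value

def season_metrics(coefs, start_day, end_day):
    """Return the peak value, the day of the peak, and the sum of daily values."""
    values = [harmonic_value(coefs, d) for d in range(start_day, end_day + 1)]
    peak = max(values)
    peak_day = start_day + values.index(peak)
    return {"peak": peak, "peak_day": peak_day, "sum": sum(values)}

# Hypothetical first-order coefficients and an assumed day-120 to day-300 season.
metrics = season_metrics([2.0, -1.0, 0.0], 120, 300)
```

Either metric compresses the whole season into one number, which is the trade-off the comparison above is probing.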

Building on this modelling work, we explored the differences between the models described above (trained using subfield-scale yield data) and a model trained using aggregated (i.e., United States county-level) yield data. This question is important because many yield-mapping efforts in the literature are restricted to training on freely available, highly aggregated data, and it is unknown how they might perform at finer scales or whether they learn the same responses. For our county-scale model, we used a random forest with monthly observations of all Landsat bands (red, blue, green, near infrared, and short wave infrared) and weather covariates. We then retrained a model with these same predictor variables on our fine-scale yield data, and compared performance across scales. We observed that the decrease in performance when using a county-trained model at the fine scale was considerably larger than the decrease when using a fine-scale model at the county scale (Figure 2). Based on variable importance measures, we inferred that county-trained models learned a simpler response function, one that relied heavily on a small subset of predictors, compared with the fine-scale-trained model. The takeaway is that algorithms which rely on county-scale labels for training may not be able to capture the underlying response at the fine (subfield) scale.

Figure 2.
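One way to compare performance across scales is to score the same predictions twice: once against fine-scale observations, and once after averaging both observations and predictions within each county. The sketch below assumes a simple unweighted mean for aggregation and the coefficient of determination as the score. Both are illustrative choices rather than the study's exact procedure, and the county labels and yield values are toy numbers.

```python
from collections import defaultdict

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def aggregate_by_county(counties, values):
    """Unweighted mean of values within each county label."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for county, value in zip(counties, values):
        sums[county] += value
        counts[county] += 1
    return {county: sums[county] / counts[county] for county in sums}

def cross_scale_r2(counties, observed, predicted):
    """Score predictions at the fine scale and again at the county scale."""
    fine = r_squared(observed, predicted)
    obs_c = aggregate_by_county(counties, observed)
    pred_c = aggregate_by_county(counties, predicted)
    keys = sorted(obs_c)
    county = r_squared([obs_c[k] for k in keys], [pred_c[k] for k in keys])
    return fine, county

# Toy values: a model that predicts each county's mean yield for every pixel.
counties = ["A", "A", "B", "B", "C", "C"]
observed = [3.0, 4.0, 2.0, 3.0, 4.0, 5.0]
predicted = [3.5, 3.5, 2.5, 2.5, 4.5, 4.5]
fine_r2, county_r2 = cross_scale_r2(counties, observed, predicted)
```

In this toy case the county-scale score is perfect while the fine-scale score is about 0.73, because the predictions capture no within-county variation. That gap is the kind of behaviour the comparison above is designed to expose.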

Lastly, we applied lessons from this empirical modelling work (trained and tested with yield monitor data) to modify a simulations-based approach. This approach, known as the Scalable Crop Yield Mapper (SCYM), involves training a machine learning model on the estimated yields of crop simulations, run many times over a suite of realistic scenarios for weather, variety, fertilizer use, etc., and then applying this simulations-based model to observed satellite and weather data. Our modelling work at both the subfield and county scales indicated that August precipitation was far and away the most predictive weather variable for Midwestern soybean yields, and the simulations-based approach indeed performed best when using August precipitation as the only weather covariate. Additionally, harmonic-based metrics (in this case, the peak vegetation index and an additional observation 30 days later) also improved performance, consistent with our observations from the empirical modelling work that harmonic regressions can help capture key aspects of crop growth. Improvements to the simulations-based model are essential for expanding yield mapping to regions in which subfield-scale yield data simply do not exist; further updates to this approach for soybean will be key questions for future exploration.
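The simulate, train, and apply workflow can be sketched in a few lines. Everything here is a stand-in: `toy_crop_simulator` is a hypothetical placeholder for a real crop model, its numbers are invented, a plain linear regression replaces whatever model is actually trained, and the two features (peak vegetation index and August precipitation) are a simplification of the predictors described above.

```python
import numpy as np

def toy_crop_simulator(rng, n_runs):
    """Hypothetical stand-in for a crop model; all values are made up."""
    peak_vi = rng.uniform(4.0, 9.0, n_runs)           # simulated peak vegetation index
    aug_precip = rng.uniform(20.0, 150.0, n_runs)     # simulated August precipitation (mm)
    sim_yield = 0.3 * peak_vi + 0.01 * aug_precip + rng.normal(0.0, 0.1, n_runs)
    return np.column_stack([peak_vi, aug_precip]), sim_yield

def fit_linear(features, target):
    """Least-squares fit with an intercept; returns [intercept, slopes...]."""
    X = np.column_stack([np.ones(len(features)), features])
    coefs, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coefs

def predict_linear(coefs, features):
    """Apply the fitted regression to new feature rows."""
    features = np.atleast_2d(features)
    return coefs[0] + features @ coefs[1:]

# Step 1: run many simulated scenarios (toy simulator, not a real crop model).
rng = np.random.default_rng(42)
sim_features, sim_yields = toy_crop_simulator(rng, 500)

# Step 2: train on simulated outputs only; no ground-truth yields are used.
coefs = fit_linear(sim_features, sim_yields)

# Step 3: apply to observed satellite and weather features (hypothetical values).
observed_features = np.array([[6.5, 80.0], [5.0, 40.0]])
estimated_yields = predict_linear(coefs, observed_features)
```

The key design point is that ground-truth yields never enter training, which is what makes the approach usable where yield monitor data are unavailable.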

Altogether, we leveraged an extensive, fine-scale dataset of ground-truth yields to begin answering important questions for soybean yield mapping. We found that harmonic regressions continue to show great promise for feature engineering when using spectral satellite imagery, and that random forests performed better when given many ‘raw’ inputs than when given metrics which attempt to compress the signal into one or two variables. We uncovered key differences between models trained at fine and aggregated scales, with implications for the generalizability of yield models across scales. Using some of these findings, we were able to improve the methodology of a more widely applicable, simulations-based approach. As precision agriculture tools grow more common, understanding subfield yield heterogeneity and its drivers opens opportunities for targeted management; continuing work in this field is essential for sustainable, efficient food production.




[1] USDA ERS. Oil Crops Sector at a Glance. Available online: https://www.ers.usda.gov/topics/crops/soybeans-oil-crops/oil-crops-sector-at-a-glance/ (accessed on Mar 29, 2020).

[2] Jain, M.; Srivastava, A.K.; Balwinder-Singh; Joon, R.K.; McDonald, A.; Royal, K.; Lisaius, M.C.; Lobell, D.B. Mapping smallholder wheat yields and sowing dates using micro-satellite data. Remote Sens. 2016, 8, 1–18, doi:10.3390/rs8100860.

[3] Lobell, D.B. The use of satellite data for crop yield gap analysis. Field Crops Res. 2013, 143, 56–64, doi:10.1016/j.fcr.2012.08.008.

[4] Basso, B.; Dumont, B.; Cammarano, D.; Pezzuolo, A.; Marinello, F.; Sartori, L. Environmental and economic benefits of variable rate nitrogen fertilization in a nitrate vulnerable zone. Sci. Total Environ. 2016, 545–546, 227–235, doi:10.1016/j.scitotenv.2015.12.104.

[5] Basso, B.; Dumont, B.; Cammarano, D.; Pezzuolo, A.; Marinello, F.; Sartori, L. Environmental and economic benefits of variable rate nitrogen fertilization in a nitrate vulnerable zone. Sci. Total Environ. 2016, 545–546, 227–235, doi:10.1016/j.scitotenv.2015.12.104.

About the Author:

This article was written by Walter “Teke” Dado, a graduate of the Stanford Earth Systems master’s program who now works as a data scientist at Farmers Business Network.




Stanford's Center on Food Security and the Environment (FSE) leads cutting-edge research on global issues of food, hunger, poverty and the environment.