Summary

Palm oil is a widely used product whose production is having a serious impact on tropical rain forests in several parts of the world, because forest is cleared to make way for palm oil plantations. This deforestation is driven by the large and profitable international market for palm oil. In this article, we show that it is possible to automatically identify palm oil plantations using satellite imagery and, in the case of Indonesia, to identify which of them are illegal. The model identifies palm oil plantations with 99% accuracy, and its output can be used to estimate both the tonnage and the value of the conflict palm oil. We found that in the area analysed in Central Kalimantan, in Indonesian Borneo, 48% of production was illegal, with a total yearly value of almost $200 million. The ability to identify high-risk areas of production, as well as the expected production quantity, could help multinational companies source palm oil from legal and sustainable sources, and help governmental bodies and NGOs better enforce local and international environmental laws.

Introduction

The rain forests of Indonesia are home to a staggering array of wildlife, including iconic species such as the Javan rhino, the Sumatran elephant, the Sumatran tiger and of course the orangutan. The fauna of the Indonesian archipelago need the dense forest that traditionally covers the islands to provide them with food, breeding sites and places to hide from predators (usually us). Unfortunately for them, Indonesian rain forests are also a particularly productive area for growing Elaeis guineensis, the African oil palm. Although palm oil was traditionally used in West African cooking, its range of applications has exploded in recent years. It is currently one of the most commonly used vegetable oils in the world.

Why is palm oil so widely used?

Palm oil is high in saturated fats, which makes it stable at high temperatures and gives it a long shelf life. In addition, it is solid yet spreadable at room temperature and can be made tasteless. These characteristics mean that palm oil is in almost every type of ready-made food and snack food, including 'Palm Base Cheese Analog'...yummy! Its qualities also mean that it is commonly found in soaps and cosmetics, as well as being a component of biodiesel. Unsurprisingly, this wide range of uses means that palm oil has a very large market. The USDA reports that in 2016 global palm oil production was 64 million tonnes, with an average price of c.$620 a tonne, which puts the annual value of the global palm oil market at almost $40 billion.

What does this mean for the rain forests?

With such an incredibly valuable crop, it is no surprise that palm oil plantations are replacing rain forest, with some dramatically negative consequences. The cheapest way to remove the rain forest from an area is to slash and burn. This method is so popular that it caused the infamous "haze", an enormous cloud of acrid smoke covering multiple countries and millions of people (for further reading see CNN, BBC, The Telegraph). The haze is so bad that the Government of Singapore is actually attempting to sue several Indonesian companies for damages due to air pollution.

What can be done about it?

The causes of this situation are obviously complex. Among the complicating factors are rural poverty in a developing country, cash crops and international demand for a versatile substance. Ultimately, however, it comes down to money: controlling where the money for palm oil goes is a key part of preventing further deforestation.

The Roundtable on Sustainable Palm Oil (RSPO) is an organisation that works with producers, buyers and NGOs to reduce the impact of the industry on the environment and to encourage sustainable practices, so that the soil being farmed doesn't become degraded and force farmers deeper into forest territory. It certifies palm oil growers and independently audits their operations to ensure sustainable environmental practices, with the result that international buyers can purchase from such growers with greater confidence.

How does data science come into this?

The world is constantly being photographed by satellites, which provide a rich data set for analysis that is sometimes freely available to the public. One example is the imagery from Landsat 8, which NASA provides and which is free to download. Companies like Descartes Labs are already using machine learning and satellite imagery to improve agricultural forecasting, and so improve efficiency in commodity trading, insurance and futures in the agricultural industry. However, there is also a lot of research happening for non-commercial purposes. A group of researchers at Harvard, for example, used satellite imagery to predict poverty in developing economies. The Department of Plant Sciences at Cambridge has used remote sensing to analyse deforestation in Chile as well as carbon storage in West Africa. This article will use satellite images from Borneo, home of the orangutan, and maps of legal palm plantations, downloaded from the RSPO, to detect illegal "conflict palm".

Method

You don't have to read the method if you don't want to. Some people might find it interesting; others will find it too technical. Please feel free to skip over it and come back if you want to read a bit more about something. The process is as follows:

1. Get data
2. Identify training set
3. Make new variables
4. Build model (XGBoost)
5. Create new map of predicted land use
6. Find illegal sites and calculate value

Get Data

This analysis uses Landsat 8 satellite images, which are available through the easy-to-use interface at EarthExplorer, a website run by the US Geological Survey. The palm oil plantation data is available from the Sustainable Palm Oil Transparency Toolkit (SPOTT). In this analysis, we focus on the ecologically rich and palm-oil-dense region of Central Kalimantan in Indonesian Borneo. The area being analysed is 5,111 km², which is just under 1.5 times the size of Cornwall (or just over 1.5 times Rhode Island, if you prefer).

Identify training set

This analysis uses XGBoost, a supervised learning method. Supervised learning means giving an algorithm examples of what you want to find (in this case forest, plantation, water and cloud), teaching it to recognise these examples, and then giving it a bunch of new data for which the labels are unknown and asking it to identify them (a fairly horrible explanation; read the Wikipedia page if you're interested). In our data set there are 4 types to recognise: Forest, Plantation, Water and Cloud. The algorithm needs as many examples as possible of each to build a good model. To provide these examples, sections of the satellite image were hand-labelled as each of the 4 categories. Hand-labelling is time consuming, but once the model is trained it can be applied to many different satellite images extremely quickly. The training areas were highlighted using QGIS, a free application for working with geospatial data.

Make new Variables

Landsat 8 takes pictures of the Earth using 11 "Spectral Bands", details of which can be found here. What this essentially means is that it takes a series of monochrome pictures at different light frequencies, which can then be combined to make colour images. However, Landsat's sensors cover a wider range of wavelengths than human vision, so the multiple images can be combined to create a much richer image than we can perceive.

In addition, research has shown that combining the bands can create new pseudo-bands that help identify certain classes of things, such as water or vegetation. The new bands created here are the Transformed Normalized Difference Vegetation Index (TNDVI), used to identify forest and cropland, and two different ways of separating water and land.

Build the Model

As mentioned earlier, the supervised learning algorithm XGBoost will be used. This is a popular method for classification, which can deliver extremely accurate results. The algorithm uses a technique called boosting, which first classifies the data set, sees how many examples were classified incorrectly, then re-weights the data set so that the misclassified examples are given greater importance. The algorithm then rebuilds the model and sees if it gets a higher score. Eventually the results are combined to make an accurate predictive model. In this model the process repeats 200 times.

Create new map of predicted land use

By taking the newly created model and crawling over the original satellite image, a map can be created that shows the classification of each pixel. This can then be combined with the known locations of palm plantations in the final step of our process.

Find Conflict palm and calculate value

The locations of the known palm oil plantations can then be overlaid onto the map created in the previous step. All areas that are classified as palm oil but do not fall within the bounds of the known palm oil plantations are classified as "Conflict Palm". The value of the palm oil produced in this area can be approximated because we know that the area a single pixel covers is 90m², that the amount of palm oil that can be grown per hectare per year is 3.69 tonnes (according to the PalmOil Forum), and that the price of palm oil is approximately $780 a tonne (2017).
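
The arithmetic described above can be sketched in a few lines of Python. This is only an illustration built from the figures quoted in the text: the function and constant names are made up, and it assumes the classified pixels have already been summed into an area in km².

```python
# Figures quoted in the text; all names here are illustrative, not from the original analysis.
YIELD_TONNES_PER_HECTARE = 3.69   # palm oil grown per hectare per year
PRICE_USD_PER_TONNE = 780         # approximate 2017 price
HECTARES_PER_KM2 = 100            # 1 km² = 100 hectares


def annual_tonnes(area_km2):
    """Estimated palm oil production per year for a plantation area given in km²."""
    return area_km2 * HECTARES_PER_KM2 * YIELD_TONNES_PER_HECTARE


def annual_value_usd(area_km2):
    """Estimated yearly market value of the palm oil grown on that area."""
    return annual_tonnes(area_km2) * PRICE_USD_PER_TONNE


if __name__ == "__main__":
    example_area_km2 = 10  # hypothetical area, purely for illustration
    print(round(annual_tonnes(example_area_km2)), "tonnes per year")
    print(round(annual_value_usd(example_area_km2)), "USD per year")
```

Working from an area rather than a pixel count keeps the sketch independent of the image resolution; converting pixels to area is a single multiplication done beforehand.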

This is the original image downloaded from Earth Explorer. The clouds on the image make classification of the ground more difficult.

Results

The initial satellite image had quite a few clouds. Cloud cover is a common problem in satellite imagery, and over a rain forest it is unsurprising. Some high-altitude cloud types were quite thin and wispy and can probably be dealt with by the algorithm using the various bands it has. Thicker, lower clouds, however, are difficult if not impossible to deal with. The decision was made to treat cloud as a category of its own to try to minimise confusion. You can see how much cloud there was in the original image shown above.

The amount of cloud scattered around meant that the training data had to be chosen carefully, as it was easy to accidentally include cloud in the areas being selected. Ultimately there was always going to be a bit of cross-contamination, and trying to minimise it was the best option.

Building and testing the model

Once the training set was made, the model could be run. It tested well, achieving an accuracy of 99%. The TNDVI was by far the most useful variable for classification, followed by another variable that separates crop types and then a water-land separator. A bar plot of the importance can be seen below.
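
To make the re-weighting idea from the method concrete, here is a toy sketch in plain Python. It is an AdaBoost-style illustration of boosting as described above, not the XGBoost implementation used for the analysis: XGBoost uses gradient boosting, where each new tree is fitted to the errors of the current ensemble rather than to literally re-weighted examples. The data is made up and the function names are hypothetical.

```python
import math


def stump_predict(x, threshold, polarity):
    """A one-split 'tree': predict `polarity` at or above the threshold, else the opposite."""
    return polarity if x >= threshold else -polarity


def fit_stump(xs, ys, weights):
    """Find the stump with the lowest weighted error on the training data."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def boost(xs, ys, rounds):
    """AdaBoost-style training: fit, re-weight the mistakes, refit, repeat."""
    weights = [1.0 / len(xs)] * len(xs)
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = fit_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # this stump's say in the final vote
        ensemble.append((alpha, threshold, polarity))
        # Misclassified examples (y * prediction == -1) gain weight; correct ones lose it.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all the stumps."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


def accuracy(ensemble, xs, ys):
    """Fraction of examples the ensemble labels correctly."""
    return sum(predict(ensemble, x) == y for x, y in zip(xs, ys)) / len(xs)


# Toy 1-D data (made up): no single threshold separates the two classes.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

if __name__ == "__main__":
    for rounds in (1, 3):
        print(rounds, "round(s):", accuracy(boost(xs, ys, rounds), xs, ys))
```

On this toy data a single stump gets 7 of the 10 points right, while three boosting rounds classify all ten correctly, which is the effect the repeated re-weighting is meant to produce.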

With the model made, the entire original image was classified and visualised. The results look very positive: on manual inspection, the classified image matches well what was actually there. However, there were some issues. The lower right quadrant of the image below has spots of plantation in the middle of a cloud out at sea, which shows that the contamination of the training set has affected the accuracy of the model.

This bar chart shows the relative importance of each of the variables. The transformed normalised difference vegetation index is by far the most useful in this model, which is to be expected when we are interested in separating two different types of vegetation.
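
The article does not spell out how the TNDVI was computed, so the following is a sketch that assumes the standard definition, the square root of NDVI plus 0.5, with NDVI taken from the near-infrared and red bands (bands 5 and 4 on Landsat 8). The function names and the reflectance values are illustrative only.

```python
import math


def ndvi(nir, red):
    """Normalised Difference Vegetation Index for a single pixel."""
    if nir + red == 0:
        return 0.0  # assumption: treat an all-zero pixel as neutral rather than failing
    return (nir - red) / (nir + red)


def tndvi(nir, red):
    """Transformed NDVI: sqrt(NDVI + 0.5), clamped at zero when NDVI is below -0.5."""
    return math.sqrt(max(ndvi(nir, red) + 0.5, 0.0))


if __name__ == "__main__":
    # Hypothetical reflectances: dense vegetation reflects strongly in the
    # near-infrared and weakly in red, while water reflects little of either.
    print("vegetation:", round(tndvi(nir=0.45, red=0.05), 3))
    print("water:", round(tndvi(nir=0.02, red=0.04), 3))
```

The clamp at zero is a choice made for this sketch so that the square root is always defined; other implementations may handle strongly negative NDVI values differently.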

But didn't the model get really high test scores before?

Yes it did, and this is something to be aware of when making models: if your labelled data is not 100% accurate then your test results won't be either. The model scored 99% on a test set that had some flaws in it, which means that some of the things it "got right" were actually wrong! How does this affect our belief in the model's effectiveness? For this kind of analysis, not a lot. As we are only trying to get a broad understanding of what is going on, this is a prototype model, with the inevitable rough edges that entails. To put the problem in perspective, if the model's errors are being underestimated by a factor of 10, the model is still 90% accurate. If we wanted to make a truly accurate model, we would have to be very careful about selecting our training sets to minimise the kind of bias observed here.

The classified image. Generally speaking the model has done a good job, but there are clearly some errors. Notice the "Plantation" in the middle of the sea. This is caused by not being careful enough when selecting the training data.

Finding and analysing Conflict Palm

Type             Km²      Percent of Total
Cloud            1,579    30
Forest           1,157    22
Plantation       743      14
Water            949      18
Conflict Palm    683      13
Total            5,111    100

Finally, we overlay the maps of the legal palm oil concessions and identify all plantation outside the legal zones. In this image the conflict palm has been changed to red. It also highlights the error (mentioned previously) where cloud is classified as plantation. A breakdown of the types of land use as classified by the model is shown in the table above. The table shows that cloud cover makes up a comparatively large part of the map at 30%. However, palm plantations (legal and otherwise) make up the second largest fraction at 27%, with Forest and Water coming in at 22% and 18% respectively.
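
The overlay step can be illustrated with a small sketch. It assumes the concession maps have already been converted to a true/false mask on the same pixel grid as the classified image, which is an assumption about the workflow rather than something the article describes. The tiny maps and the function names are invented for the example.

```python
from collections import Counter

PLANTATION = "Plantation"
CONFLICT = "Conflict Palm"


def flag_conflict_palm(classified, in_concession):
    """Relabel plantation pixels that fall outside every known concession."""
    return [
        [
            CONFLICT if label == PLANTATION and not legal else label
            for label, legal in zip(label_row, legal_row)
        ]
        for label_row, legal_row in zip(classified, in_concession)
    ]


def pixel_counts(label_map):
    """Count how many pixels carry each label."""
    return Counter(label for row in label_map for label in row)


# Hypothetical 3 x 4 classified map and concession mask (True = inside a legal concession).
classified = [
    ["Forest", "Plantation", "Plantation", "Water"],
    ["Forest", "Plantation", "Cloud", "Water"],
    ["Plantation", "Plantation", "Forest", "Forest"],
]
in_concession = [
    [False, True, True, False],
    [False, True, False, False],
    [False, False, False, False],
]

result = flag_conflict_palm(classified, in_concession)
counts = pixel_counts(result)

if __name__ == "__main__":
    for label, count in sorted(counts.items()):
        print(label, count)
```

Only pixels labelled Plantation are ever relabelled, so a cloud wrongly classified as plantation would still be flagged as conflict palm, which is the error visible in the image.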

From this new data it is possible to calculate the percentage of total palm plantation that is conflict palm, the total volume produced and the value that conflict palm generates per year. The analysis indicates that conflict palm makes up 48% of all palm grown in the region, which suggests that conflict palm is a substantial problem. This percentage equates to a quarter of a million tonnes of conflict palm being produced per year in this region. The annual value of conflict palm from this region is almost $200 million. The table below summarises the findings.

Key                         Value
Percent Conflict            48
Tonnes Production           526,194
Tonnes Conflict             252,027
Value of Conflict (MUSD)    197

Conclusion

This model is not yet perfect. We have seen how some clouds can be misclassified as palm plantations, and because of the speed at which the palm oil industry is growing it is likely that the concession maps used are out of date. However, despite these caveats, the amount of illegal palm is so large that even if 50% of the area classified as conflict palm was actually legal (compensating for the potential inaccuracies mentioned), the value of illegally grown palm would still be about $100 million a year and represent just under 25% of total production. Clearly, although the model has errors, they are small relative to the scale of palm production in this part of Borneo. Of course, identifying and quantifying illegal production is only one step in a complicated process of separating conflict and legal palm production. However, it isn't very hard to do, and it provides useful insight into which palm oil mills are operating illegally and which companies have high-risk product. This project has shown that it is possible to perform risk audits of palm production areas using a laptop, freely available software and a long weekend.
Given the relative ease with which this kind of geographical assessment can be done, the major international buyers of palm oil should ensure that they are taking the steps necessary to stop funding the illegal deforestation of some of the most biologically important areas of the planet, which directly damages the health of millions of people and puts in jeopardy one of the jewels in the inheritance of the Indonesian people.
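
The sensitivity check in the conclusion can be reproduced with a short sketch. It uses only the areas, yield and price quoted in this article; the function name and the idea of passing in the fraction of flagged area that is truly illegal are illustrative choices, not part of the original analysis.

```python
# Figures quoted in the article; names are illustrative.
YIELD_TONNES_PER_HECTARE = 3.69
PRICE_USD_PER_TONNE = 780
LEGAL_PLANTATION_KM2 = 743
CONFLICT_PALM_KM2 = 683


def conflict_summary(fraction_truly_illegal):
    """Conflict share, tonnage and value if only part of the flagged area is really illegal."""
    total_palm_km2 = LEGAL_PLANTATION_KM2 + CONFLICT_PALM_KM2
    illegal_km2 = CONFLICT_PALM_KM2 * fraction_truly_illegal
    tonnes = illegal_km2 * 100 * YIELD_TONNES_PER_HECTARE  # 100 hectares per km²
    return {
        "percent_conflict": 100 * illegal_km2 / total_palm_km2,
        "tonnes_conflict": tonnes,
        "value_musd": tonnes * PRICE_USD_PER_TONNE / 1e6,
    }


if __name__ == "__main__":
    for fraction in (1.0, 0.5):
        summary = conflict_summary(fraction)
        print(fraction, {key: round(value, 1) for key, value in summary.items()})
```

With the full flagged area this gives roughly 48%, 252,000 tonnes and $197 million, and with half of it roughly 24% and $98 million, in line with the figures above.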

Further work

There are several things that could be done in this project to make it more robust and provide greater insight.

TAKING CARE OF THOSE CLOUDS! Although it doesn't appear to be a major problem, fixing the cloud issue would remove an easy point of attack. This could be done by putting more time into selecting the training set, using the thermal bands (not included in this analysis), and using images from different days.

Separating out urban areas and other settlements: this wasn't done, and doing it would probably reduce the total amount of plantation.

Distance from a mill: the locations of the processing mills are known and can probably be identified on a map. By reviewing the plantations in the area around a mill, it would be possible to estimate how much of its production is legal.

Deforestation rate: using time-lapsed images, it would be possible to measure the forward march of the plantations into the forest. This kind of work has been done quite a lot, so it could get very good results.

Notes on doing this yourself

Analysing remote sensing data is one of those things that is easy to learn but hard to master (it's also not that easy to learn). If you are new to analysing satellite imagery, check out Ali Santacruz's excellent tutorial "Image classification with Random Forests in R". Also point yourself in the direction of Benjamin Leutner's package "RStoolbox", which makes life a lot easier.