If you needed money to consolidate debt, pay for a wedding, take a vacation, fix your home, or cover unexpected bills, would you apply for an online loan? Last year millions of people answered that question with a resounding yes!
If you choose to join them, what will your loan interest rate be? Most people take for granted that a poor credit score translates directly to a higher interest rate. It's a valid assumption... isn't it?
This case study follows Jonathan Blum, a New York-based author and GIS novice, as he attempts to answer that question using GIS. He builds a simple regression model to test the assumed, unquestioned relationship between loan grades and interest rates. What he learns is that measuring and mapping expected relationships can sometimes lead to unexpected, and very interesting, findings.
As you read through Jonathan's story, keep in mind this workflow can be used to evaluate the correlation between any two variables. Other examples of how to use this workflow are provided at the end of this document.
Jonathan is writing a layman's guide to information science and, as part of his work, he's been investigating online marketplace lending. These Internet-based, typically non-bank lenders use proprietary computer algorithms and credit checking tools to connect borrowers, lenders, and investors. They can provide loan approval and funding within minutes or hours. Contrast that with loan processing by traditional banks, which can take days or even weeks. While online marketplace lending is still small compared to the trillions of dollars in loans provided by traditional banks, growth rates for this industry are exponential. This is attracting attention not only from borrowers and investors, but also from lawyers and regulators.
Thus far, online lending has thrived with minimal oversight. This is changing. The U.S. Supreme Court recently upheld a regulator's use of disparate impact in a housing-related discrimination case. Disparate impact can occur when loan decisions that are not intentionally discriminatory result in discriminatory outcomes. A policy of only funding home loans above $200,000, for example, even if applied consistently, could have the unintended impact of redlining if the average home value in a region's minority neighborhoods is less than $200,000. Avoiding disparate impact is difficult for lenders because it isn't exposed until many loans have been made.
After direct interviews with company executives and experts in online marketplace lending, Jonathan decides to see if he can teach himself enough about GIS and spatial analysis to map the online lending frontier and test assumptions about what drives loan interest rates. In the end, credit scores do matter. They are an important component in calculating loan grades and setting interest rates, but they don't tell the whole story.
Jonathan begins his exploratory analysis by seeing which areas of the country are participating in online lending, and where interest rates are highest and lowest. He then creates a very basic model to predict average interest rates from average loan grades. Jonathan's results are interesting and certainly not what he was expecting.
LendingClub provides loan data that can be easily downloaded, linked to ZIP3 areas, and analyzed. (ZIP3 areas are the geometry defined by the first three digits of a standard 5-digit ZIP Code). Jonathan downloads data for all LendingClub loan applications, funded or rejected, between August 2007 and September 2015. Summarizing the data by ZIP3 area yields the total number of loan submissions, the total number of loans issued, the average interest rate for loans issued, and the average loan grade for loans issued. He also obtains data for the total number of households within each ZIP3 area during 2014.
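The summarizing step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record keys (`zip`, `issued`, `rate`) and the sample values are hypothetical and do not reflect the actual layout of the LendingClub download.

```python
from collections import defaultdict

def zip3_of(zip_code):
    """Return the first three digits of a ZIP Code as the ZIP3 key."""
    return str(zip_code).strip()[:3]

def summarize_by_zip3(loans):
    """Aggregate loan records into per-ZIP3 summaries.

    Each record is a dict with hypothetical keys:
    'zip' (str), 'issued' (bool), 'rate' (percent, or None if rejected).
    """
    totals = defaultdict(lambda: {"submitted": 0, "issued": 0, "rate_sum": 0.0})
    for loan in loans:
        area = totals[zip3_of(loan["zip"])]
        area["submitted"] += 1
        if loan["issued"]:
            area["issued"] += 1
            area["rate_sum"] += loan["rate"]

    summary = {}
    for key, t in totals.items():
        # Average rate is only defined where at least one loan was issued.
        avg_rate = t["rate_sum"] / t["issued"] if t["issued"] else None
        summary[key] = {
            "submitted": t["submitted"],
            "issued": t["issued"],
            "avg_rate": avg_rate,
        }
    return summary

# Made-up sample records, for illustration only.
loans = [
    {"zip": "10001", "issued": True, "rate": 10.0},
    {"zip": "10025", "issued": True, "rate": 14.0},
    {"zip": "10040", "issued": False, "rate": None},
    {"zip": "94110", "issued": True, "rate": 8.0},
]
summary = summarize_by_zip3(loans)
```

The average loan grade per ZIP3 area could be accumulated the same way once grades are expressed as numbers.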
LendingClub assigns a loan grade to every loan application it receives, ranging from A1 (lowest interest rate) to E5 (highest interest rate). These loan grades were converted to simple numeric rankings for analysis. A1 loan grades were assigned a rank of 1, A2 loan grades were assigned a rank of 2, and so on. The higher the ranking, the riskier the loan tends to be.
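A minimal sketch of that conversion follows. It assumes the "and so on" means ranks continue consecutively across letters (so B1 becomes 6 and E5 becomes 25); the function name is hypothetical.

```python
def grade_to_rank(grade):
    """Convert a loan grade such as 'B3' to a numeric rank (A1 -> 1, A2 -> 2, ...).

    Assumes five sub-grades per letter and the A-to-E range described above.
    """
    letter, sub = grade[0].upper(), int(grade[1:])
    if not ("A" <= letter <= "E") or not (1 <= sub <= 5):
        raise ValueError(f"unexpected loan grade: {grade!r}")
    return (ord(letter) - ord("A")) * 5 + sub

ranks = [grade_to_rank(g) for g in ["A1", "A2", "B1", "E5"]]
```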
Jonathan examines the number of LendingClub loan applications submitted each year. There is a clear increase in the number of submissions.
He wonders if participation in online lending is evenly distributed across the United States. Since there will be more loan applications (more of most everything, in fact) in the ZIP3 areas that have more people, a map of loan application counts would not be very helpful here. It would probably only reveal where most people in the contiguous United States live. Consequently, to get a picture of the online lending frontier (locations where online lending is concentrating), he must create a rate variable. He divides the number of loan applications in each ZIP3 area by the number of households to get per-household online loan application rates. He maps these rates using hot spot analysis. The result, shown below, identifies which areas of the country are participating most heavily in online lending (red) and which areas are not (blue).
The dark red areas on both the west and the east coast have the most intense clustering of high per-household online loan application rates, followed by southern areas near Atlanta, Montgomery, and areas north of Miami. In contrast, vast expanses of the country appear not to be participating in online lending at all. Households in Iowa, Nebraska, North Dakota, Maine, and pockets of both South Dakota and Idaho are either not interested or unable to participate.
Having determined that the level of participation in online lending varies across the country, Jonathan wonders if the average interest rates people pay for their online loans vary as well.
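The rate variable itself is a simple division. The sketch below uses made-up counts and hypothetical ZIP3 keys to show why raw counts and rates can tell different stories: the area with far fewer applications has the higher per-household rate.

```python
def application_rate(applications, households):
    """Applications per household; None when the household count is unusable."""
    if not households or households <= 0:
        return None
    return applications / households

# (applications, households) per ZIP3 area; values are invented for illustration.
areas = {"100": (5000, 1_000_000), "597": (600, 50_000)}
rates = {key: application_rate(a, h) for key, (a, h) in areas.items()}
```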
To ensure the average interest rate reported for each ZIP3 area is both reliable and representative, the remaining analyses focus on ZIP3 areas where at least 30 loans have been funded.
The map below shows a hot spot analysis of average interest rates.
There is a definite geography to online loan interest rates. Red areas are locations where the highest interest rates concentrate. Similarly, the blue areas are locations with concentrations of the lowest average interest rates. Excluded from the map are locations with fewer than 30 funded loans.
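The case study does not say which statistic sits behind the hot spot maps. One common choice is the Getis-Ord Gi* statistic, and the sketch below assumes it, with simple binary neighbor weights. The neighbor lists and values are invented; a real analysis would derive neighbors from the ZIP3 geometry.

```python
import math

def gi_star(values, neighbors):
    """Return a Gi* z-score per area (assumed statistic, binary weights).

    values: one number per area; neighbors: list of index lists.
    Each area is counted as its own neighbor, as Gi* requires.
    """
    n = len(values)
    mean = sum(values) / n
    variance = max(0.0, sum(v * v for v in values) / n - mean * mean)
    s = math.sqrt(variance)
    scores = []
    for i in range(n):
        members = set(neighbors[i]) | {i}
        w_sum = len(members)  # with binary weights, sum(w) == sum(w^2)
        local_sum = sum(values[j] for j in members)
        numerator = local_sum - mean * w_sum
        denominator = s * math.sqrt((n * w_sum - w_sum ** 2) / (n - 1))
        scores.append(numerator / denominator if denominator else 0.0)
    return scores

# Six areas in a row: high values cluster at one end, low values at the other.
values = [10, 9, 8, 2, 1, 1]
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
scores = gi_star(values, neighbors)
```

Large positive scores correspond to the red (hot) areas on the maps and large negative scores to the blue (cold) areas.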
Company executives, online lending experts, and the LendingClub website all confirm that interest rates are a function of loan grades. The logic is simple enough. The borrowers who are assigned A and B loan grades tend to have the healthiest credit metrics and represent the lowest lending risk. Borrowers assigned grades C, D, and E tend to have progressively lower credit scores. Once a loan grade is assigned to a loan application, the corresponding interest rate can simply be obtained from a table. Consequently, if interest rates are higher in Alabama, as shown in the hot spot map above, it is fair to assume it is because the loan grades assigned there reflect riskier loans. A risky borrower in San Francisco should be just as risky in Alabama, right?
Ever the skeptic, Jonathan decides to dig deeper. He learns about Ordinary Least Squares (OLS) regression and creates a basic model to predict the average interest rate in each ZIP3 area based only on average loan grade rankings.
The OLS tool Jonathan uses computes the predicted average interest rate values and then creates a residual map, shown below. The residual map indicates where the model predicted well, where it predicted too high, and where it predicted too low.
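For a single predictor, the fit and its residuals can be sketched by hand. This is not the tool Jonathan used; the function names and sample values are hypothetical, and the residual is taken here as actual minus predicted, so a negative residual means the model predicted too high.

```python
def fit_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def residuals(x, y, intercept, slope):
    """Actual minus predicted; negative means the prediction was too high."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Invented per-area averages: loan grade rank (x) and interest rate in percent (y).
avg_rank = [4, 6, 8, 10, 12]
avg_rate = [8.0, 10.1, 11.9, 14.2, 15.8]

intercept, slope = fit_ols(avg_rank, avg_rate)
resid = residuals(avg_rank, avg_rate, intercept, slope)
labels = ["too high" if r < 0 else "too low" for r in resid]
```

Mapping `resid` by area is what turns the regression output into a residual map.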
If interest rates are purely a function of loan grades, Jonathan expects his model to confirm two things:
1. The relationship between average loan grades and average interest rates is strong. The strength of the relationship is reported as an adjusted R² value ranging from 0.0 to 1.0.
2. The relationship is consistent across the country. There may be locations where the model predicts a little better or a little worse, but overall, the spatial pattern of underpredictions and overpredictions is randomly distributed.
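The first expectation rests on the adjusted R-squared value. A sketch of the standard calculation is shown below, with hypothetical names and invented numbers: it adjusts the usual R-squared for the number of observations and predictors.

```python
def adjusted_r_squared(actual, predicted, n_predictors=1):
    """Adjusted R-squared from actual and predicted values."""
    n = len(actual)
    mean = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot
    # Penalize the fit for each predictor used relative to the sample size.
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)

actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
score = adjusted_r_squared(actual, predicted)
```

The second expectation cannot be read from this number at all, which is why the residual map matters.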
Results are not quite what Jonathan expected. While the relationship is strong (a value of 0.94), the residual map reveals a problem.
If the prediction for a particular ZIP3 area is too high (purple), it means the actual average interest rate value for that ZIP3 area is lower than expected, given the associated average loan grade rank. Similarly, if the prediction is too low (green), it means the actual average interest rate is higher than expected, given the corresponding average loan grade rank.
Notice the state of Mississippi is purple. There is nothing spatially random about lower-than-expected interest rates for an entire state. Apparently, average loan grade rankings are not an effective predictor of average interest rates in that part of the country.
Finding lower-than-expected interest rates throughout the state of Mississippi is important. It gives the impression, at least, of either intentional bias or disparate impact. In any case, it is clear that average interest rates are not purely a function of average loan grades everywhere in the country.
Jonathan may have missed this geographic disparity completely if he hadn't mapped the overpredictions and underpredictions from his model. Mapping the results of analysis is an important step in any workflow involving spatial data.
When the relationship between two variables is strong, you can predict the value of one from the other. This is what Jonathan did with his simple OLS model above. The OLS method, however, summarizes relationship strength using a single value (a single coefficient). In other words, it assumes the relationship between average loan grades and average interest rates is the same for every ZIP3 area in the country. If Jonathan wants to examine how this relationship changes, that is, to see where average loan grade rankings have a larger or smaller impact on average interest rates, he needs to learn about another regression technique called Geographically Weighted Regression (GWR). GWR computes a potentially unique coefficient for every single ZIP3 area. Where coefficients are large, changes in the average loan grade ranking will have a larger impact on average interest rates; where coefficients are small, changes in average loan grade rankings will have a smaller impact on average interest rates.
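The idea behind GWR can be sketched as a separate weighted regression at each location, where nearby areas count more than distant ones. The version below is a simplification under stated assumptions: a Gaussian distance kernel, a fixed bandwidth chosen by hand, one predictor, and invented coordinates and values. Production GWR tools handle bandwidth selection and diagnostics that are omitted here.

```python
import math

def gwr_slopes(coords, x, y, bandwidth):
    """Local slope at each location via kernel-weighted least squares."""
    slopes = []
    for cx, cy in coords:
        # Gaussian kernel: weight falls off with distance from (cx, cy).
        w = [
            math.exp(-0.5 * (math.hypot(px - cx, py - cy) / bandwidth) ** 2)
            for px, py in coords
        ]
        total = sum(w)
        mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
        mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
        sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
        slopes.append(sxy / sxx)
    return slopes

# Two invented clusters of areas: in the west, rates rise steeply with grade
# rank (slope 1.0); in the east, they barely respond (slope 0.2).
coords = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0), (11, 0), (12, 0), (13, 0)]
rank = [4, 6, 8, 10, 4, 6, 8, 10]
rate = [6.0, 8.0, 10.0, 12.0, 8.8, 9.2, 9.6, 10.0]

local = gwr_slopes(coords, rank, rate, bandwidth=2.0)
```

A single OLS fit over all eight areas would report one blended slope; the local slopes recover the two different relationships.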
Jonathan creates a map of the GWR regression coefficients below.
The darkest areas reflect locations where the relationship between average loan grade rankings and average interest rates is strongest. This is where average loan grade rankings are most effective at predicting average interest rates. Improvements in average loan grade rankings in these locations will have the largest impact on reducing interest rates. Conversely, a change in the average loan grade ranking in the lightest areas (where the relationship is weak) will have the smallest impact on average interest rates.
The map suggests that interest rates are not solely dependent on loan grades, at least not everywhere. In both Mississippi and much of Kansas, for example, there is a weak relationship between average loan grades and average interest rates. Interest rates are lower than expected, on average, throughout Mississippi. They are higher than expected, however, in much of Kansas.
This has tangible and material consequences. Differences in loan interest rates impact the entire economy. When access to loans is limited because of high interest rates, people tend to save, to spend less, and businesses tend to scale back. When loan interest rates are low, people are more willing to both borrow and spend, and businesses are more likely to expand.
The question asked at the beginning of this case study is: do online lenders unfairly discriminate? Researchers have found evidence of both race and gender discrimination in a variety of online marketplaces. Jonathan's exploratory analysis contributes to this important research area by uncovering evidence of geographic discrimination associated with online lending. Jonathan has only considered loan grades, however. Despite published tables indicating a direct relationship between loan grades and interest rates, the maps above suggest other factors must also be involved. For example, some researchers are finding that as many as one-third of borrowers will purposely choose the loan with the fastest funding time over the one with the lowest interest rate.
Jonathan is not a professional data scientist. He is not looking to statistically prove one outcome over another. He is a journalist. His job is to report and inform emerging debates around the important story of online lending. Maps and the analyses described here are new, critical storytelling tools.
Jonathan can now sketch the geography of online lending on napkins. He can send email and post on social media about his work. He can disclose a transparent, statistically viable argument that stakeholders can easily respond to.
He is finding that good maps are ideal material for regulators and managers as well. Maps create neutral storytelling ground that teams of people can collaborate on and easily understand. Regulators seem to be more willing to talk; managers are more transparent. As scrutiny increases around marketplace lending, Jonathan expects these maps will help focus the debate.
Jonathan senses competitors will also be drawn to these maps. This case study only considers data from LendingClub, but billions of dollars in other loans have been made. Perhaps competing firms will find opportunities by offering lower rates in the locations associated with higher-than-expected interest rates.
Data-driven mapping provides a powerful tool for storytelling.
This workflow focuses on the relationship between average interest rates and average loan grade rankings, testing an assumed correlation. A complete step-by-step tutorial, including data, is available if you want to do the analyses described here for yourself, or apply the workflow to your own data.
Here are some ideas to get you started. Communities with higher average incomes will likely pay higher average income taxes. But is this consistently true? Where is it less true or more consistent across the country? Agricultural areas with the best growing conditions should produce the highest yields. Is that the case everywhere? If not, why not? Wouldn't it be reasonable to assume schools with better teacher-to-student ratios have higher test scores?
Have fun!