Credit risk in marketplace lending: Lending Club


Lending Club (LC) was among the world’s largest marketplace lenders: as of September 30, 2020, the company had issued 60,188,236,052 USD in loans.¹ It enabled borrowers to obtain unsecured fixed-term loans at interest rates they found attractive, and investors to fund loans whose credit characteristics, interest rates and other terms they found attractive. The platform charged borrowers an origination fee and investors a service fee.


LC verified the identity of borrowers, obtained their credit profiles from consumer reporting agencies such as TransUnion, Experian or Equifax, and screened borrowers for eligibility to participate in the platform. The screening was based on the prospective borrower’s FICO score,² debt-to-income ratio and credit profile (as reported by a consumer reporting agency). Lending Club assigned its own risk grades and set interest rates according to the risk level. In addition, LC made available detailed anonymized information on the loans granted and their subsequent credit performance, which sophisticated investors could use to build their own screening models.³


The objective of the assignment is to apply the knowledge of Credit Risk Management gained in the course to real-life data provided by LC. The data can be obtained from any public source that currently offers the anonymised LC data for analysis. (I will send you the specific data by email, so there is no need to look for additional information.)







1 In 2020 Lending Club acquired Radius Bank and announced that it would be closing the marketplace platform.

2 A credit score developed by Fair Isaac Corporation (FICO), the most popular summary measure of creditworthiness in the USA. It ranges between 300 and 850. Higher scores correspond to a lower risk of defaulting on credit obligations.

3 Lending Club Prospectus.



The assignment consists of producing an essay/report of no more than 3000 words (excluding tables and graphs, bibliography and appendices). The report should include the following:


  • The development of a standard credit scoring model (50%), with the following elements:
      • A description of the dataset: the source it was obtained from, the number of variables and their types; if a subset of the data is used, the rationale for selecting certain samples and variables.
      • An overview of the model building process with all steps clearly stated and described.
      • An evaluation of the model’s predictive accuracy.


  • The detailed investigation of a research/modelling question of your choice (50%). The question can relate to any aspect of credit risk and can be either within the scope of the material covered by the course or outside it. Additional data may be used at this stage (although there is no requirement to do so) as a complement to the LC data, which should remain the main focus. Examples may include (but are not limited to): comparison of different ways to transform/select predictors, comparison of different modelling/classification algorithms (beyond just predictive accuracy), comparison of reject inference techniques.


You can use any software of your choice. (SAS is recommended. Lectures 3, 5, 9 and 10 in the courseware I sent you are computer experiments, and some of their SAS code may be reused. Whatever software you use, you need to tell me which one it is and send me the code.)


About the data

Data channel 1: Download from the following website. This database does not seem to contain rejected applications; all the loans in it appear to have been reviewed and approved (I did not check carefully, so there may be some): /index.php?data=loans_full_schema


Data channel 2: I have not sent you this database yet. It is larger and has more variables, and it contains all the accepted and rejected loan information from 2012 to 2013. The file is very large; I will send it to you if you need to use it.






Two questions need to be completed separately

The data has been sent to you by email, together with the meaning of each variable. The two compressed archives attached to the email contain the same data in CSV and SAS formats; just choose one.

Please see the separate Word file for the specific steps of the first question. It is very important.

The following text is the part that needs extra attention


First question

The purpose of the first question is to build a model on the data provided by the teacher and to write up the pre-modelling preparation, the modelling process, the modelling results and the quality of the model, rather than copying large sections of text directly from the Internet.
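
The quality of a scoring model is often summarised with discrimination measures such as AUC, the Gini coefficient (2 × AUC − 1) and the Kolmogorov-Smirnov (KS) statistic. The following is a minimal Python sketch of these measures, not the course's prescribed method; the scores and labels are made-up toy values, and the quadratic loops are for illustration only and would be slow on the full LC data.

```python
def auc(scores, labels):
    """AUC: probability that a randomly chosen bad loan gets a higher score
    than a randomly chosen good loan (ties count one half).
    scores are predicted default probabilities; labels are 1 = bad, 0 = good."""
    bad = [s for s, y in zip(scores, labels) if y == 1]
    good = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((b > g) + 0.5 * (b == g) for b in bad for g in good)
    return wins / (len(bad) * len(good))


def ks(scores, labels):
    """KS: largest gap between the cumulative score distributions of bads and goods."""
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    best = 0.0
    for t in sorted(set(scores)):
        cum_bad = sum(1 for s, y in zip(scores, labels) if y == 1 and s <= t) / n_bad
        cum_good = sum(1 for s, y in zip(scores, labels) if y == 0 and s <= t) / n_good
        best = max(best, abs(cum_bad - cum_good))
    return best


# Toy data (made up for illustration): predicted default probabilities
# and observed outcomes, 1 = bad (defaulted), 0 = good.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

model_auc = auc(scores, labels)
model_gini = 2 * model_auc - 1
model_ks = ks(scores, labels)
print(f"AUC={model_auc:.3f}  Gini={model_gini:.3f}  KS={model_ks:.3f}")
```

In a real report these measures would be computed on a hold-out sample rather than on the data used to fit the model.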

The question asks for “an overview of the model building process with all steps clearly stated and described”. The steps are illustrated in the picture on the last page of lecture 2 ppt1.

Reject inference: this step is optional. Whether or not you do it, you need to explain the reason for your choice.



Characteristic analysis is required first. Its purpose is to convert the characteristics into a form suitable for logistic regression and to reduce the number of variables before proceeding.
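
One common way to put a characteristic into a form suitable for logistic regression is weight-of-evidence (WOE) coding, with the Information Value (IV) used to rank characteristics. The sketch below assumes the characteristic is already categorical or binned. The function name `woe_iv`, the smoothing constant and the toy data are illustrative assumptions, not part of the assignment materials.

```python
import math
from collections import defaultdict


def woe_iv(values, defaults, smoothing=0.5):
    """Return ({category: WOE}, IV) for one categorical or pre-binned characteristic.

    values    : category label of each loan
    defaults  : 1 if the loan is bad (e.g. charged off), 0 if good
    smoothing : small count added to every cell so that an empty cell
                does not lead to log(0) (an assumption of this sketch)
    """
    good = defaultdict(float)
    bad = defaultdict(float)
    for v, d in zip(values, defaults):
        if d == 1:
            bad[v] += 1
        else:
            good[v] += 1
    cats = sorted(set(good) | set(bad))
    total_good = sum(good[c] + smoothing for c in cats)
    total_bad = sum(bad[c] + smoothing for c in cats)
    woe, iv = {}, 0.0
    for c in cats:
        share_good = (good[c] + smoothing) / total_good
        share_bad = (bad[c] + smoothing) / total_bad
        woe[c] = math.log(share_good / share_bad)  # > 0 means safer than average
        iv += (share_good - share_bad) * woe[c]
    return woe, iv


# Toy data (made up): a home-ownership style characteristic and loan outcomes.
values = ["RENT"] * 6 + ["MORTGAGE"] * 6
defaults = [1, 1, 1, 0, 0, 0] + [1, 0, 0, 0, 0, 0]

woe, iv = woe_iv(values, defaults)
print(woe, iv)
```

Each category is then replaced by its WOE value in the regression, and characteristics with a very low IV become candidates for removal.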


  1. Some variables, such as grade, sub_grade and interest rate, are calculated by Lending Club before the loan is granted, based on annual income, verified income, debt to income, earliest_credit_line, inquiries_last_12m, total_credit_lines, etc., so the two groups of variables are likely to be correlated. If both groups are used in a logistic regression, multicollinearity will appear. Predictive accuracy will not be affected much, because the coefficient estimates remain unbiased, but they are no longer the most efficient: their standard errors (variances) are inflated.
  2. Other variables, such as balance (the current balance on the loan), paid total and paid principal, differ from the aforementioned annual income, verified income, debt to income, earliest_credit_line, inquiries_last_12m, total_credit_lines, etc.: they should be unrelated to those application-time variables and are instead directly related to loan status. If your regression includes variables such as balance, paid total and paid principal, you will get very high predictive accuracy, but the model will not be practical in reality, because at the time a loan is reviewed you cannot know what balance and paid total will be later. All in all, careful consideration is needed when deciding which variables to include in the model.
  3. For variable selection, you can use Weight of Evidence (WOE) and Information Value (IV), which can be used to rank your characteristics and select the most predictive ones. You can also use stepwise selection (forward or backward) within logistic regression and then rank the variables by AUC.
  4. Use the variance inflation factor (VIF) to measure multicollinearity.
  5. Following the teacher’s suggestion, you can start with all variables, or remove only those that are obviously not applicable, and then filter the variables using the methods above. The final model should usually contain around 10-20 variables (not mandatory).
  6. It is best not to simply drop variables with a particularly large share of missing values; you need to think of other ways to handle them.
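
The VIF check mentioned in the list above can be sketched in plain Python as follows. It relies on the fact that the VIF of predictor j equals the j-th diagonal element of the inverse of the predictors' correlation matrix. The function names and the toy columns are illustrative assumptions; in practice a statistics package would do this, and the predictors would be the (possibly WOE-coded) model variables.

```python
def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length numeric columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[x - m for x in c] for c, m in zip(columns, means)]
    norms = [sum(x * x for x in c) ** 0.5 for c in centered]
    k = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
             for j in range(k)] for i in range(k)]


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    k = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("perfectly collinear predictors: VIF is infinite")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def vif(columns):
    """VIF of each column: diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]


# Toy columns (made up): income in thousands, debt-to-income in percent,
# and an interest rate that loosely depends on both.
annual_income = [40, 55, 60, 75, 90, 120, 35, 80]
debt_to_income = [25, 18, 30, 12, 10, 8, 28, 15]
interest_rate = [14.0, 11.5, 13.9, 9.2, 7.8, 5.5, 15.1, 9.0]

vifs = vif([annual_income, debt_to_income, interest_rate])
print(vifs)
```

A VIF of 1 means a predictor is uncorrelated with the others; values above roughly 5 to 10 are a common rule-of-thumb warning sign, although the threshold is a matter of judgement.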



Second question

Whatever the final research question is, do not deviate from the data set I emailed to you.

You can start with a background introduction explaining why you chose this research question.

If you choose to compare different classification models, you cannot just calculate the predictive accuracy of several models and compare them; you must also explain the advantages and disadvantages of each method, the potential problems of using it, and so on.