Exploring pseudo-labeling for reject inference

Martins, Margarida

http://hdl.handle.net/10400.14/44863

Use this identifier to reference this record.

Advisor(s)

Brandão, Susana

Abstract(s)

Banks use algorithms to estimate the credit risk of loan applicants, and these models need to be retrained. At retraining time, the label (whether the applicant defaulted or not) is known only for applicants who were accepted for a loan. Retraining on accepted applicants alone introduces selection bias, which yields biased models and losses for the bank. To counteract this, the labels of rejected applicants can be inferred, a practice known as reject inference. This thesis pursues pseudo-labeling for reject inference, which requires two models: the first creates pseudo-labels for the rejected applicants and the second makes the final predictions. We create the pseudo-labels by training a LightGBM model on the available data and then apply a logistic regression as the final model. We compare the results against a baseline that sets all rejected applicants to a single category (default or not default). In addition, we compare against a scenario in which rejection results from random decision-making, experiment with five rejection rates, and examine the effect of setting rejected applicants to default versus not default. We found that using LightGBM to infer the labels gave a lower F1 score, AUC, and profit for the bank than the baseline. As such, the bank should set all rejected applicants to a single category. Additionally, setting all rejected applicants to default yields higher recall in the rejected population and higher profit than setting them to not default. Moreover, a lower rejection rate increases profits.
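The two-model pipeline described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the thesis implementation: a k-nearest-neighbour vote stands in for LightGBM as the pseudo-labeler, the logistic regression is a hand-written gradient-descent version, and the synthetic applicants, the two features, the 30% rejection rule, and all function names are assumptions made purely for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def knn_label(train_X, train_y, x, k=5):
    """Stand-in pseudo-labeler (NOT LightGBM): majority vote of the k nearest accepted applicants."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    return 1 if 2 * sum(label for _, label in nearest) > k else 0


def predict_proba(w, b, x):
    """Predicted probability of default under a fitted logistic model."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic(X, y, lr=0.1, epochs=200):
    """Logistic regression fitted by batch gradient descent; returns (weights, bias)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Synthetic applicants with two hypothetical features; label 1 means default.
random.seed(0)
applicants = []
for _ in range(400):
    x = [random.gauss(0, 1), random.gauss(0, 1)]
    p_default = sigmoid(1.5 * x[0] - 1.0 * x[1] - 1.0)
    applicants.append((x, 1 if random.random() < p_default else 0))

# Assumed past policy: reject the 30% of applicants with the highest first feature.
applicants.sort(key=lambda a: a[0][0])
cut = len(applicants) * 7 // 10
accepted, rejected = applicants[:cut], applicants[cut:]
acc_X = [x for x, _ in accepted]
acc_y = [y for _, y in accepted]
rej_X = [x for x, _ in rejected]  # true labels of the rejected are unobserved

# Model 1: create pseudo-labels for the rejected from the accepted data.
pseudo_y = [knn_label(acc_X, acc_y, x) for x in rej_X]

# Model 2: final logistic regression on accepted plus pseudo-labeled rejected.
w, b = fit_logistic(acc_X + rej_X, acc_y + pseudo_y)

# Baseline: set all rejected to one category (here: default) and fit the same model.
w0, b0 = fit_logistic(acc_X + rej_X, acc_y + [1] * len(rej_X))
```

Both fitted models can then be scored on held-out applicants with whichever metrics are of interest (the thesis reports F1 score, AUC, and profit); this sketch stops at fitting and makes no claim about which variant performs better.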

Banks use algorithms to estimate the credit risk of loan applicants. However, these algorithms need to be retrained, and doing so requires labeled historical data. In this case, a variable is needed that indicates whether the applicant repaid the loan in full. In practice, we only know the label of applicants who were approved for a loan. Retraining only on these observations biases the model, resulting in monetary losses for the bank. To prevent such losses, we attempt to determine the labels of the rejected applicants. In this thesis, we use pseudo-labeling to infer this label. Pseudo-labeling works with two models. First, pseudo-labels are created by training a LightGBM model. Next, we apply logistic regression. Finally, these results are compared with the scenario of classifying the rejected applicants into each of two categories, analyzing both. In parallel, we compare with the scenario in which the initial rejection decision results from chance and experiment with five rejection rates on the logistic regression. Using LightGBM yielded a lower F1, AUC, and profit. As such, the bank should classify the rejected applicants into one of the categories. We also found that classifying the rejected applicants as defaulters gives higher recall in the rejected population and leads to higher profit, and that a lower rejection rate yields higher profit.