Which conversational variables can be extracted from inbound retention call transcripts, and how are these variables associated with call outcomes in the telecommunications sector, based on historical call data?

Paiva, Nuno; Cheng, Paulo
2026-05-27; 2026-05-27; 2026-04-20; 2026-03-01
bf870e7f-42f8-4389-af9e-10601e23868c
http://hdl.handle.net/10400.14/57850

Traditional churn prediction models in the telecommunications sector rely heavily on structured CRM data and often overlook the context embedded in customer conversations. This thesis investigates which conversational variables extracted from inbound retention call transcripts are most strongly associated with customer churn outcomes. Using a dataset of 8,592 real-world retention calls from a major Portuguese telecom operator (NOS), the study engineered a set of conversational features: structural metrics, rule-based thematic flags, sentiment scores, and unsupervised topic probabilities (LDA and BERTopic). These variables were evaluated using statistical tests and predictive algorithms, including Logistic Regression and XGBoost. The results show that churn is driven primarily by specific conversational triggers, such as explicit cancellation language and competitor mentions, rather than by emotional tone or overall call length. Furthermore, augmenting the baseline models with a refined 10-topic LDA solution substantially improved predictive performance, raising the ROC-AUC from 0.794 to 0.905. The findings indicate that unstructured call transcripts contain highly predictive, quantifiable signals of customer intent. Integrating these interpretable conversational features into existing churn management systems provides actionable, data-driven insights that can strengthen proactive customer retention strategies.

eng
Churn; Telecommunications; Natural language processing (NLP); Topic modeling; Latent Dirichlet allocation (LDA); Predictive modeling; Call transcripts
master thesis
204308259
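As an illustration of the structural metrics and rule-based thematic flags mentioned in the abstract, the following is a minimal sketch using only the Python standard library. The keyword patterns, the feature names, and the `extract_features` helper are hypothetical: the record does not publish the thesis's actual rule set, and the original transcripts are presumably in Portuguese, so the English patterns here are for illustration only.

```python
import re

# Hypothetical keyword patterns; the thesis's real rule set is not given in this record.
CANCELLATION_PATTERNS = [r"\bcancel\w*", r"\bterminate\w*", r"\bend my contract\b"]
COMPETITOR_PATTERNS = [r"\bcompetitor\w*", r"\bother (?:operator|provider)\b", r"\bbetter offer\b"]


def _matches_any(text, patterns):
    """Return True if any of the regex patterns occurs in the text (case-insensitive)."""
    return any(re.search(p, text, flags=re.IGNORECASE) for p in patterns)


def extract_features(transcript):
    """Compute simple structural metrics and binary thematic flags for one call transcript.

    Assumes one speaker turn per non-empty line, which is an assumption of this sketch.
    """
    turns = [line for line in transcript.splitlines() if line.strip()]
    words = re.findall(r"\w+", transcript)
    return {
        "n_turns": len(turns),   # structural metric: number of speaker turns
        "n_words": len(words),   # structural metric: total word count
        "flag_cancellation": int(_matches_any(transcript, CANCELLATION_PATTERNS)),
        "flag_competitor": int(_matches_any(transcript, COMPETITOR_PATTERNS)),
    }


example = (
    "Agent: Thank you for calling, how can I help?\n"
    "Customer: I want to cancel my contract, another provider gave me a better offer.\n"
)
features = extract_features(example)
print(features)
```

Features of this kind could then be placed alongside sentiment scores and topic probabilities in a feature table before statistical testing or model fitting.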