Name: Jaiswal, Devesh
Date: November 29th, 2024
NYU ID: N14682845 (dj1380)
Course Section: Database Systems CSCI-GA 2433, Section 001

Database Systems Project Part 3: Logical Schema Optimization and Machine Learning Model Creation

Total Points (100 points in total):
Professor's Comments:
Affirmation of my Independent Effort: Devesh Jaiswal
EDA Physical Database Design

1: I decided to use SQLite3 since it is an application I have used before; specifically, I am using DB Browser for SQLite. I converted each entity into a table while simultaneously converting its attributes into columns. I had to match the data types that SQLite supports, so char attributes became text, and the rest converted cleanly. We now have 29 tables along with their attributes. I then established primary keys, unique keys, and relationships for each table and verified that the tables are in 3NF. Most of this was done in the previous part for the logical design, but I wanted to ensure these properties were maintained in the physical design as well. As mentioned in the previous part, my data is hosted on Microsoft Azure, so that is the cloud platform I am using to deploy my database alongside the required RDBMS, SQLite.

The next step of the physical design was to define indexes, since we want our database to be efficient. Columns with many unique values are typically good index candidates, and that is the logic I used when defining them. In total, 10 indexes were created: an index on VehicleClaim consisting of CustSSN, VehicleID, and PolicyNum; indexes on AgentID, CampaignID, CustSSN, InvoiceNumber, and PolicyNumber; an index on VehicleID consisting of CustSSN and VehicleID; and an index on PolicyNumber consisting of CustSSN and PolicyNumber. These were chosen because they are the attributes most likely to be used for lookups in their respective tables, so making them indexes reduces the number of disk lookups needed. Since I created many tables to represent each specific entity, I don't think partitioning or clustering is needed in my case; the tables are in 3NF, which implies they are already simplified. A sketch of the index definitions appears below.
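To make the indexing step concrete, here is a minimal sketch of how one of the composite indexes described above could be defined in SQLite through Python's sqlite3 module. The table and column names (VehicleClaim, CustSSN, VehicleID, PolicyNum) follow the report's description, but the CREATE TABLE definition and database file name are assumptions for illustration only.

```python
import sqlite3

# Connect to the physical database file (file name assumed for illustration).
conn = sqlite3.connect("insurance.db")
cur = conn.cursor()

# Assumed minimal definition of the VehicleClaim table; the real schema
# has 29 tables with full attribute lists and foreign keys.
cur.execute("""
CREATE TABLE IF NOT EXISTS VehicleClaim (
    ClaimID      INTEGER PRIMARY KEY,  -- SQLite has no CHAR, so char became TEXT
    CustSSN      TEXT NOT NULL,
    VehicleID    TEXT NOT NULL,
    PolicyNum    TEXT NOT NULL,
    ClaimStatus  TEXT
);
""")

# Composite index on the high-cardinality lookup columns of VehicleClaim,
# matching the physical design described above.
cur.execute("""
CREATE INDEX IF NOT EXISTS idx_vehicleclaim_lookup
ON VehicleClaim (CustSSN, VehicleID, PolicyNum);
""")

conn.commit()
conn.close()
```

A composite index like this also supports prefix lookups, so queries that filter on CustSSN alone, or on CustSSN and VehicleID together, can reuse it without needing additional single-column indexes.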
2: There are multiple business use cases for my insurance company design that support a workflow-based application. One is customer registration, where customers create their accounts and manage their information: they input their details, create a username and password, and verify their account. Another use case is customers browsing insurance policies: the customer enters the type of insurance (in our case it must be either health or vehicle) and the region the policy is needed in, and can then filter through the different policies offered based on premiums, deductibles, and expiration date. A third use case is quote generation on the business side based on customer information: the customer enters information about what or who is to be covered by the policy and selects the coverage option they want, and the system evaluates these inputs and generates an appropriate quote. A fourth use case is policy purchasing, where the customer has chosen a policy and a quote has been generated: the customer reviews the final quote, selects a payment method from the available options, and the system generates the policy ID. Finally, a fifth use case is claim help, when customers process a claim on a policy they own: the first step is to display the claim procedure (generated when the policy is bought), then submit the claim and get back the claim status.

An example workflow based on these use cases: the customer registers for an account with an insurance company, inputs the insurance they are looking for, the company's system generates a quote based on the customer's information, the customer reviews and purchases the policy, and finally the customer makes a claim on the purchased policy when needed. A sketch of this ordering as code follows.
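As a rough illustration of the workflow ordering and its preconditions, here is a minimal Python sketch. The step names, state flags, and precondition checks are assumptions drawn from the use-case descriptions, not an actual implementation of the system.

```python
# Ordered workflow steps with their preconditions, mirroring the five use cases.
# Each step may only run once every earlier step has completed.
WORKFLOW = [
    ("registration",     lambda s: s["age"] > 21),   # portal requires age over 21
    ("browse_policies",  lambda s: s["has_account"]),
    ("quote_generation", lambda s: s["has_account"]),
    ("policy_purchase",  lambda s: s["quote_generated"]),
    ("claim_filing",     lambda s: s["policy_purchased"]),
]

def run_workflow(state: dict) -> None:
    for name, precondition in WORKFLOW:
        if not precondition(state):
            raise RuntimeError(f"Precondition failed before step: {name}")
        print(f"Running step: {name}")
        # Mark progress so later preconditions can check it.
        if name == "registration":
            state["has_account"] = True
        elif name == "quote_generation":
            state["quote_generated"] = True
        elif name == "policy_purchase":
            state["policy_purchased"] = True

run_workflow({"age": 30, "has_account": False,
              "quote_generated": False, "policy_purchased": False})
```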
Below is a flowchart diagram as a visual representation; arrows indicate the ordering of the use cases that should be followed.

[Flowchart: ordering of the five use cases]

Use Case: Customer Registration
Groups Involved: Customer, Insurance Portal
Conditions: Age is over 21
Workflow: 1. Customer inputs personal details. 2. Customer makes a username and password. 3. Customer verifies identity. 4. Customer creates the account.
Result: Customers can now access the insurance page and are redirected to the company's main page.

Use Case: Browsing Insurance
Groups Involved: Customer, Insurance Page
Conditions: Customer has an account
Workflow: 1. Customer filters for the type of insurance they are looking for. 2. Customer enters the region the insurance must cover. 3. Customer looks through the different options, which can be filtered based on premiums, deductibles, etc.
Result: Customer is brought to the insurance search page and is shown a list of policies that can be filtered and searched through.

Use Case: Insurance Quote Generation
Groups Involved: Customer, Company's Insurance System
Conditions: Customer has an account and has browsed through insurance options
Workflow: 1. Customer inputs the necessary details for the insurance. 2. Customer chooses the type of insurance coverage. 3. The insurance system calculates and displays the quote based on this information.
Result: The final quote for the insurance is displayed to the customer.

Use Case: Policy Purchase
Groups Involved: Customer, Insurance System, Payment Portal
Conditions: Customer has an account, has browsed through insurance, and has chosen an insurance to buy
Workflow: 1. Customer reviews the insurance and final quote. 2. Customer chooses a payment method. 3. Customer submits payment. 4. The system produces a policy document and ID.
Result: Customers receive a copy of the policy they bought and an ID for the policy.

Use Case: Claim Filing
Groups Involved: Customer, Insurance System, Claim Portal
Conditions: Customer has an account, has browsed through insurance, has chosen an insurance, and has bought a policy
Workflow: 1. Customer chooses the policy they want to file a claim for. 2. Customer inputs the claim details. 3. Customer submits the claim. 4. Customer receives the claim status.
Result: Customer gets back the status of the claim, which indicates whether it has been accepted or not.

Machine Learning Model Creation

1: Refined Machine Learning Use Cases: The first machine learning use case is Fraudulent Claim Detection, where the objective is to predict whether an insurance claim made by a customer is fraudulent using customer behavior, claims patterns, and historical data. The second use case is Policy Claim Likelihood Prediction, where we want to predict the likelihood that a policy will be claimed based on customer information and historical claim data. The third use case is Customer Risk Assessment, where we evaluate whether a customer is high risk for a policy, to limit the risk taken on by the insurance company, based on customer information and past customer data. Finally, the last use case is Insurance Product Recommendation, where we try to predict whether a customer is more likely to purchase health insurance, vehicle insurance, or both, based on past customer transactions and customer information. These four use cases are vital to an insurance company since they prevent losses, reduce risk, attract more customers, and improve company profits.
The approach I will take is to use our data lake and perform preprocessing, which includes data collection and cleaning and feature engineering/selection. Then I will test different models and use different metrics to evaluate which one is best. The short-term idea is to implement fraud detection to minimize immediate losses and to build a recommendation system for selling policies to increase profits. The medium-term ideas are to develop the claim likelihood prediction and introduce a dynamic pricing model that adjusts premiums based on these likelihoods. The long-term idea is to build a customer risk profiling system and to automate all of these machine learning facets so they run as new customers arrive.

2: Selection and Training of Machine Learning Models: I chose to run classification models for two of the four use cases listed above, fraud detection and customer risk assessment. I started with basic exploratory data analysis. Then I proceeded with feature engineering, which consisted of feature encoding, scaling, and selection; I chose scaling methods based on the skewness and outliers of the features and used various methods for feature selection. I then ran oversampling on the training data to increase the amount of data the models can be trained on and to address class label imbalances. Finally, I converted the data features to floats and trained the models. I used different classification models such as XGBoost, Random Forest, Logistic Regression, and more. The results of my models are shown in the plots below, and some plots show the intermediary steps for visualization.

[Plot: model performance for the first model]
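To illustrate the training procedure described above, here is a minimal sketch of the encode/scale/oversample/train loop using scikit-learn, imbalanced-learn, and XGBoost. The dataset file, feature encoding, and single scaler are assumptions for brevity; the project's actual scaling and feature-selection choices were driven per-feature by skewness and outlier analysis.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

# Hypothetical claims dataset with a binary fraud label.
df = pd.read_csv("claims.csv")
X = pd.get_dummies(df.drop(columns=["is_fraud"]))   # feature encoding
y = df["is_fraud"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scale the features; scaling also leaves them as floats, matching the
# report's final conversion step before training.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Oversample only the training split to address class imbalance.
X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Compare several classifiers on the held-out test set.
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=200, random_state=42),
    "xgb": XGBClassifier(eval_metric="logloss"),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "F1:", f1_score(y_test, model.predict(X_test)))
```

Note that the oversampling is applied only after the train/test split, so the held-out test set stays untouched and the reported scores are not inflated.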
[Plot: model performance for the second model]
[Plot: model performance for the third model]

3: I added some functionality to the data lake, including for the machine learning models. One piece of functionality shows the outlier distribution of the features. Another checks whether the data has missing values, duplicated rows, or class imbalances. In addition, I added statistical measures like the F-score, mutual information, and the chi-square test to gauge the importance of the different features. Furthermore, I added functionality that lets us see the data types of the columns in the dataset. Finally, I added functionality to check whether features are related to each other or to the class label through correlation tests and thresholds. A sketch of these checks follows.
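Here is a minimal sketch of what those data-quality and feature-importance checks could look like with pandas and scikit-learn. The function names, thresholds, and the 1.5*IQR outlier rule are assumptions for illustration, not the project's exact code.

```python
import pandas as pd
from sklearn.feature_selection import f_classif, mutual_info_classif, chi2

def data_quality_report(df: pd.DataFrame, label: str) -> None:
    """Print missing values, duplicates, class balance, dtypes, and outliers."""
    print("Missing values per column:\n", df.isna().sum())
    print("Duplicated rows:", df.duplicated().sum())
    print("Class balance:\n", df[label].value_counts(normalize=True))
    print("Column dtypes:\n", df.dtypes)
    # Outlier counts per numeric column using the 1.5*IQR rule.
    num = df.select_dtypes("number")
    q1, q3 = num.quantile(0.25), num.quantile(0.75)
    iqr = q3 - q1
    outliers = ((num < q1 - 1.5 * iqr) | (num > q3 + 1.5 * iqr)).sum()
    print("Outliers per column (1.5*IQR rule):\n", outliers)

def feature_importance_report(df: pd.DataFrame, label: str) -> pd.DataFrame:
    """Score numeric features against the class label with F-score,
    mutual information, and chi-square (chi2 needs non-negative inputs)."""
    X = df.drop(columns=[label]).select_dtypes("number")
    y = df[label]
    scores = pd.DataFrame({
        "f_score": f_classif(X, y)[0],
        "mutual_info": mutual_info_classif(X, y, random_state=42),
        "chi2": chi2(X - X.min(), y)[0],  # shift so all values are >= 0
    }, index=X.columns)
    return scores.sort_values("mutual_info", ascending=False)

def correlated_features(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """Flag feature pairs whose absolute correlation exceeds the threshold."""
    corr = df.select_dtypes("number").corr().abs()
    return [(a, b) for a in corr.columns for b in corr.columns
            if a < b and corr.loc[a, b] > threshold]
```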
4: Because we are utilizing hybrid data, one factor to consider in a reference architecture is data integration. We are usually getting data from multiple sources, where some may be structured data and some may be unstructured data, so we must integrate the two for efficiency. This can be done using ETL, data APIs, data streaming, or other methods. Another factor to consider is the physical storage of the data, which must also support the different types of data; here we can use a data lake or data warehouse, where we can store unstructured data as well and still have efficient querying and analytics. Once the data is stored, we can actually use it for things like prediction, business intelligence, etc. In addition, a few properties we want in the reference architecture are scalability and flexibility, to accommodate different data sources and the evolution of the insurance company. Furthermore, since the data usually contains important information, a security layer is another property we need in the reference architecture to ensure the data remains private. Currently we have established a data lake and are hosting our data on Microsoft Azure, which allows for the integration of data from multiple sources. For physical storage of the data we are using the SQLite database together with Microsoft Azure to manipulate the data through our designed data lake. Finally, we have created the machine learning models that will use the data. A rough sketch of the ETL step appears below.
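As one example of the integration layer, here is a minimal ETL sketch that extracts a structured CSV export, applies light cleaning, and loads it into the SQLite database. The file names, table name, and cleaning rules are assumptions for illustration.

```python
import sqlite3
import pandas as pd

def etl_csv_to_sqlite(csv_path: str, db_path: str, table: str) -> int:
    """Extract a CSV, apply light cleaning, and load it into SQLite."""
    # Extract: read the structured source file.
    df = pd.read_csv(csv_path)

    # Transform: drop exact duplicates and normalize column names.
    df = df.drop_duplicates()
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

    # Load: append into the target table, creating it if needed.
    conn = sqlite3.connect(db_path)
    try:
        df.to_sql(table, conn, if_exists="append", index=False)
    finally:
        conn.close()
    return len(df)

# Example usage with hypothetical paths.
rows = etl_csv_to_sqlite("azure_export_claims.csv", "insurance.db", "ClaimStaging")
print(f"Loaded {rows} rows")
```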
5: Below are the insights I got from running analytics on the datasets in the Azure cloud, shown as visual representations (plots) of the datasets' distributions.

[Plots: distributions of the datasets from the Azure cloud analytics]