The project provides a robust and standardized pipeline for reading, parsing,
cleaning/preprocessing, analyzing, and modeling data from CSV (or similarly
delimited) files. The typical flow the pipeline is designed to handle is as follows:
- Input file lines are captured line by line and stored in in-memory data structures (char**), with special functions to detect delimiters (commas, semicolons, tabs, etc.).
- Identifying and trimming problematic characters such as whitespace, repeated delimiters, and common placeholder or missing-value markers.
- Ensuring consistent date/time string representations by detecting occurrences of common date/time formats and converting them to their Unix time equivalents, so that downstream stages can treat them uniformly as numeric values.
- Extracting and/or standardizing units attached to numeric fields (see the documentation in StringUtilities.h/c).
- Inferring the parameters and type of each field, and identifying which fields are plottable and which might need special handling.
- In the context of this program, any data that can be effectively displayed and analyzed through graphs or plots, i.e., mapped to a coordinate system and visually represented, is considered valid for plotting. This typically includes numerical values (measurements, statistics, time-series data, etc.), categorical data that can be mapped to a coordinate system (e.g., using a color palette or labels), or any other data that can be transformed into a plottable format by a method the program accommodates (some functions can be plotted by evaluating them at different points along a coordinate system).
It is important to note that not all data is inherently plottable. For instance, very complex or high-dimensional data may require advanced techniques or dimensionality-reduction methods to be effectively visualized, which is largely outside the scope and purpose of this project.
Raw Unmodified Data vs. Preprocessed Extracted Data

- Once field types are identified, each field’s data is split out and systematically written to a dedicated file within a generated directory (located at the same level as the original file path). This acts as structured storage with clear boundaries between raw data, cleaned CSV lines, and final numeric fields that are ready for direct processing.
- Each field’s dedicated file adheres to a naming scheme that encodes relevant metadata in three distinct parts: the name of the dataset file, the field’s index relative to all other plottable fields, and the name of the field itself. This organized layout ensures that any subsequent script, statistical routine, or plotting function can discover and load numeric data easily, without re-parsing the entire CSV.
- Because the filename records the containing dataset, the field’s plottable order/index, and the field name itself, any future analysis (statistical computations, modeling, plotting) can easily reference a single file dedicated to exactly one column or variable. A sketch of composing such a filename appears after this list. For example:
  - Field: “mass”
  - Dataset file: “physics_particles.txt”
  - Order/Index: seventh among all fields and fourth among plottable fields
  - Resulting filename (plottable index counted from zero): “physics_particles_Plottable_Field_3-mass.txt”
- Non-numeric columns (e.g., text descriptions) are also preserved in their own consistent structure, maintaining a one-to-one correspondence between row indices.
- As the dataset grows or new columns appear, the directory structure simply expands or prunes itself accordingly. Such compartmentalized outputs facilitate advanced or domain-specific analysis steps with downstream tools like MATLAB, reduce clutter in intermediate computations, and enable a modular approach that readily extends to future transformations or data-processing features.
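A minimal sketch of how such a filename might be composed; the helper name and buffer size here are illustrative, not the project’s actual API:

    #include <stdio.h>

    /* Illustrative: builds "<dataset>_Plottable_Field_<k>-<field>.txt". */
    static void build_plottable_filename(char *out, size_t outSize,
                                         const char *dataSetName,
                                         int plottableIndex,
                                         const char *fieldName)
    {
        snprintf(out, outSize, "%s_Plottable_Field_%d-%s.txt",
                 dataSetName, plottableIndex, fieldName);
    }

    int main(void)
    {
        char name[256];
        build_plottable_filename(name, sizeof name, "physics_particles", 3, "mass");
        printf("%s\n", name); /* physics_particles_Plottable_Field_3-mass.txt */
        return 0;
    }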
Example of Directory Generated for the Plottable Data from the file 'physics_particles.txt'

Example of Directory Generated for the file 'weather_measurements.txt'

- Once the numeric columns have been isolated in separate files, the program computes a variety of standard statistics and descriptive measures, without the need for repeated parsing or ad hoc cleaning.
- Typical analyses include computing standard statistics (mean, median, variance, skewness), applying histogram-based techniques (e.g., Freedman–Diaconis binning to manage outliers), or performing normality tests (such as Anderson–Darling) to detect whether a column follows a normal distribution.
- These methods leverage the uniform schema established earlier—namely, that each column truly is numeric, free of spurious text, and dimensionally consistent.
- By referencing the separated data files, the pipeline’s statistical routines can iterate through each numeric column, generate and store results (like summary statistics), and produce textual analysis summarizing each field’s key properties and reporting mean, standard deviation, skewness, A–D test statistic, bin counts, etc.
- Through these steps, each numeric field ends up with both its raw numeric data (in its own file) and a companion analysis file detailing its statistical properties.
Example of Directory Generated for the Analysis Results from the Plottable Data Directory 'physics_particles_Plottable_Fields'

- The program’s modeling phase builds upon these clean numeric columns to generate more advanced plots or fit more sophisticated computational models, such as Gaussian error functions, theoretical distributions, or time-series forecasting stubs.
- This step involves creating MATLAB script files that load the numeric data, overlay reference curves (like a normal PDF), and/or apply further transformations (smoothing, curve fitting).
- Because the pipeline stores each stage’s outputs in a structured manner with uniform data definitions, modeling scripts can systematically reference column i as “FieldName:numeric,” and it becomes straightforward to loop over each numeric file, load data, and produce standardized plots and summaries—ensuring all fields follow the same analysis pipeline.
- The standardized data directory structures become a reusable asset, where new modeling steps can be appended without a complete re-run from scratch.
- The program thus bridges raw CSV data to higher-level modeling steps by ensuring that the data’s shape and types remain coherent from initial ingestion through final curve-fitting routines.
- 'physics_particles.txt' dataset.
- 'weather_measurements.txt' dataset.
- 'physics_particles.txt' Modeling, Generated from Analysis Results 'physics_particles_Plottable_Fields_Full_Analysis_Results'

- 'weather_measurements.txt' Modeling, Generated from Analysis Results 'weather_measurements_Plottable_Fields_Full_Analysis_Results'

- Terminal output after the program is run with the 'physics_particles.txt' dataset file pathname supplied by the user
- Only key properties/steps are printed to the terminal
----------------------------------------------------------------
Blind extraction of File Contents --> Formatting:
----------------------------------------------------------------
pdg_id,pdg_name,name,charge,rank,quarks,mass,mass_lower,mass_upper,width,width_lower,width_upper
-2212,p,anti_proton,-1.0,4,UUD,938.27208816,2.9e-07,2.9e-07,0.0,0.0,0.0
-2112,n,anti_neutron,0.0,4,UDD,939.5654205,5e-07,5e-07,7.493e-25,4e-28,4e-28
-321,K,kaon-,-1.0,0,Us,493.677,0.016,0.016,5.317e-14,9e-17,9e-17
-211,pi,pion,-1.0,0,Ud,139.57039,0.00018,0.00018,2.5284e-14,5e-18,5e-18
-24,W,W-,-1.0,0,-,80379.0,12.0,12.0,2080.0,40.0,40.0
-16,nu(tau),anti_tau_neutrino,0.0,0,-,-,-,-,0.0,0.0,0.0
-15,tau,anti_tau,1.0,0,-,1776.86,0.12,0.12,2.267e-09,4e-12,4e-12
-14,nu(mu),anti_muon_neutrino,0.0,0,-,-,-,-,0.0,0.0,0.0
-13,mu,anti_muon,1.0,0,-,105.6583755,2.3e-06,2.3e-06,2.9959836e-16,3e-22,3e-22
-12,nu(e),anti_electron_neutrino,0.0,0,-,-,-,-,0.0,0.0,0.0
-11,e,anti_electron (positron),1.0,0,-,0.51099895,1.5e-10,1.5e-10,0.0,0.0,0.0
-6,t,anti_top,-0.6666666666666666,0,T,172500.0,700.0,700.0,1420.0,150.0,190.0
-5,b,anti_bottom,0.3333333333333333,0,B,4180.0,20.0,30.0,-,-,-
-4,c,anti_charm,-0.6666666666666666,0,C,1270.0,20.0,20.0,-,-,-
-3,s,anti_strange,0.3333333333333333,0,S,93.4,3.4,8.6,-,-,-
-2,u,anti_up,-0.6666666666666666,0,U,2.16,0.3,0.5,-,-,-
-1,d,anti_down,0.3333333333333333,0,D,4.67,0.2,0.5,-,-,-
1,d,down,-0.3333333333333333,0,d,4.67,0.2,0.5,-,-,-
2,u,up,0.6666666666666666,0,u,2.16,0.3,0.5,-,-,-
3,s,strange,-0.3333333333333333,0,s,93.4,3.4,8.6,-,-,-
4,c,charm,0.6666666666666666,0,c,1270.0,20.0,20.0,-,-,-
5,b,bottom,-0.3333333333333333,0,b,4180.0,20.0,30.0,-,-,-
6,t,top,0.6666666666666666,0,t,172500.0,700.0,700.0,1420.0,150.0,190.0
11,e,electron,-1.0,0,-,0.51099895,1.5e-10,1.5e-10,0.0,0.0,0.0
12,nu(e),electron_neutrino,0.0,0,-,-,-,-,0.0,0.0,0.0
13,mu,muon,-1.0,0,-,105.6583755,2.3e-06,2.3e-06,2.9959836e-16,3e-22,3e-22
14,nu(mu),muon_neutrino,0.0,0,-,-,-,-,0.0,0.0,0.0
15,tau,tau,-1.0,0,-,1776.86,0.12,0.12,2.267e-09,4e-12,4e-12
16,nu(tau),tau_netrino,0.0,0,-,-,-,-,0.0,0.0,0.0
21,g,gluon,0.0,0,-,0.0,0.0,0.0,0.0,0.0,0.0
22,gamma,photon,0.0,0,-,0.0,0.0,0.0,0.0,0.0,0.0
23,Z,Z0,0.0,0,-,91187.6,2.1,2.1,2495.2,2.3,2.3
24,W,W+,1.0,0,-,80379.0,12.0,12.0,2080.0,40.0,40.0
25,H,higgs,0.0,0,-,125250.0,170.0,170.0,3.2,2.2,2.8
111,pi,pion-,0.0,0,(uU-dD)/sqrt(2),134.9768,0.0005,0.0005,7.81e-06,1.2e-07,1.2e-07
211,pi,pion+,1.0,0,uD,139.57039,0.00018,0.00018,2.5284e-14,5e-18,5e-18
321,K,kaon+,1.0,0,uS,493.677,0.016,0.016,5.317e-14,9e-17,9e-17
2112,n,neutron,0.0,4,udd,939.5654205,5e-07,5e-07,7.493e-25,4e-28,4e-28
2212,p,proton,1.0,4,uud,938.27208816,2.9e-07,2.9e-07,0.0,0.0,0.0
3112,Sigma,sigma-,-1.0,4,dds,1197.449,0.03,0.03,4.45e-12,3.2e-14,3.2e-14
3122,Lambda,lambda0,0.0,4,uds,1115.683,0.006,0.006,2.501e-12,1.9e-14,1.9e-14
3212,Sigma,sigma0,0.0,4,uds,1192.642,0.024,0.024,0.0089,0.0008,0.0009
3222,Sigma,sigma+,1.0,4,uus,1189.37,0.07,0.07,8.209e-12,2.7e-14,2.7e-14
3312,Xi,xi-,-1.0,4,dss,1321.71,0.07,0.07,4.02e-12,4e-14,4e-14
3322,Xi,xi0,0.0,4,uss,1314.86,0.2,0.2,2.27e-12,7e-14,7e-14
3334,Omega,omega-,-1.0,4,sss,1672.45,0.29,0.29,8.02e-12,1.1e-13,1.1e-13
----------------------------------------------------------------
Formatted Extraction of File Contents --> Preprocessing:
----------------------------------------------------------------
pdg_id,pdg_name,name,charge,rank,quarks,mass,mass_lower,mass_upper,width,width_lower,width_upper
-2212,p,anti_proton,-1.0,4,UUD,938.27208816,2.9e-07,2.9e-07,0.0,0.0,0.0
-2112,n,anti_neutron,0.0,4,UDD,939.5654205,5e-07,5e-07,7.493e-25,4e-28,4e-28
-321,K,kaon-,-1.0,0,Us,493.677,0.016,0.016,5.317e-14,9e-17,9e-17
-211,pi,pion,-1.0,0,Ud,139.57039,0.00018,0.00018,2.5284e-14,5e-18,5e-18
-24,W,W-,-1.0,0,0.0,80379.0,12.0,12.0,2080.0,40.0,40.0
-16,nu(tau),anti_tau_neutrino,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-15,tau,anti_tau,1.0,0,0.0,1776.86,0.12,0.12,2.267e-09,4e-12,4e-12
-14,nu(mu),anti_muon_neutrino,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-13,mu,anti_muon,1.0,0,0.0,105.6583755,2.3e-06,2.3e-06,2.9959836e-16,3e-22,3e-22
-12,nu(e),anti_electron_neutrino,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-11,e,anti_electron(positron),1.0,0,0.0,0.51099895,1.5e-10,1.5e-10,0.0,0.0,0.0
-6,t,anti_top,-0.6666666666666666,0,T,172500.0,700.0,700.0,1420.0,150.0,190.0
-5,b,anti_bottom,0.3333333333333333,0,B,4180.0,20.0,30.0,0.0,0.0,0.0
-4,c,anti_charm,-0.6666666666666666,0,C,1270.0,20.0,20.0,0.0,0.0,0.0
-3,s,anti_strange,0.3333333333333333,0,S,93.4,3.4,8.6,0.0,0.0,0.0
-2,u,anti_up,-0.6666666666666666,0,U,2.16,0.3,0.5,0.0,0.0,0.0
-1,d,anti_down,0.3333333333333333,0,D,4.67,0.2,0.5,0.0,0.0,0.0
1,d,down,-0.3333333333333333,0,d,4.67,0.2,0.5,0.0,0.0,0.0
2,u,up,0.6666666666666666,0,u,2.16,0.3,0.5,0.0,0.0,0.0
3,s,strange,-0.3333333333333333,0,s,93.4,3.4,8.6,0.0,0.0,0.0
4,c,charm,0.6666666666666666,0,c,1270.0,20.0,20.0,0.0,0.0,0.0
5,b,bottom,-0.3333333333333333,0,b,4180.0,20.0,30.0,0.0,0.0,0.0
6,t,top,0.6666666666666666,0,t,172500.0,700.0,700.0,1420.0,150.0,190.0
11,e,electron,-1.0,0,0.0,0.51099895,1.5e-10,1.5e-10,0.0,0.0,0.0
12,nu(e),electron_neutrino,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13,mu,muon,-1.0,0,0.0,105.6583755,2.3e-06,2.3e-06,2.9959836e-16,3e-22,3e-22
14,nu(mu),muon_neutrino,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15,tau,tau,-1.0,0,0.0,1776.86,0.12,0.12,2.267e-09,4e-12,4e-12
16,nu(tau),tau_netrino,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
21,g,gluon,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
22,gamma,photon,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23,Z,Z0,0.0,0,0.0,91187.6,2.1,2.1,2495.2,2.3,2.3
24,W,W+,1.0,0,0.0,80379.0,12.0,12.0,2080.0,40.0,40.0
25,H,higgs,0.0,0,0.0,125250.0,170.0,170.0,3.2,2.2,2.8
111,pi,pion-,0.0,0,(uU-dD)/sqrt(2),134.9768,0.0005,0.0005,7.81e-06,1.2e-07,1.2e-07
211,pi,pion+,1.0,0,uD,139.57039,0.00018,0.00018,2.5284e-14,5e-18,5e-18
321,K,kaon+,1.0,0,uS,493.677,0.016,0.016,5.317e-14,9e-17,9e-17
2112,n,neutron,0.0,4,udd,939.5654205,5e-07,5e-07,7.493e-25,4e-28,4e-28
2212,p,proton,1.0,4,uud,938.27208816,2.9e-07,2.9e-07,0.0,0.0,0.0
3112,Sigma,sigma-,-1.0,4,dds,1197.449,0.03,0.03,4.45e-12,3.2e-14,3.2e-14
3122,Lambda,lambda0,0.0,4,uds,1115.683,0.006,0.006,2.501e-12,1.9e-14,1.9e-14
3212,Sigma,sigma0,0.0,4,uds,1192.642,0.024,0.024,0.0089,0.0008,0.0009
3222,Sigma,sigma+,1.0,4,uus,1189.37,0.07,0.07,8.209e-12,2.7e-14,2.7e-14
3312,Xi,xi-,-1.0,4,dss,1321.71,0.07,0.07,4.02e-12,4e-14,4e-14
3322,Xi,xi0,0.0,4,uss,1314.86,0.2,0.2,2.27e-12,7e-14,7e-14
3334,Omega,omega-,-1.0,4,sss,1672.45,0.29,0.29,8.02e-12,1.1e-13,1.1e-13
----------------------------------------------------------------
Preprocessed Extraction of File Contents --> Plottability:
----------------------------------------------------------------
pdg_id,charge,rank,mass,mass_lower,mass_upper,width,width_lower,width_upper
-2212,-1.0,4,938.27208816,2.9e-07,2.9e-07,0.0,0.0,0.0,
-2112,0.0,4,939.5654205,5e-07,5e-07,7.493e-25,4e-28,4e-28,
-321,-1.0,0,493.677,0.016,0.016,5.317e-14,9e-17,9e-17,
-211,-1.0,0,139.57039,0.00018,0.00018,2.5284e-14,5e-18,5e-18,
-24,-1.0,0,80379.0,12.0,12.0,2080.0,40.0,40.0,
-16,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,
-15,1.0,0,1776.86,0.12,0.12,2.267e-09,4e-12,4e-12,
-14,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,
-13,1.0,0,105.6583755,2.3e-06,2.3e-06,2.9959836e-16,3e-22,3e-22,
-12,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,
-11,1.0,0,0.51099895,1.5e-10,1.5e-10,0.0,0.0,0.0,
-6,-0.6666666666666666,0,172500.0,700.0,700.0,1420.0,150.0,190.0,
-5,0.3333333333333333,0,4180.0,20.0,30.0,0.0,0.0,0.0,
-4,-0.6666666666666666,0,1270.0,20.0,20.0,0.0,0.0,0.0,
-3,0.3333333333333333,0,93.4,3.4,8.6,0.0,0.0,0.0,
-2,-0.6666666666666666,0,2.16,0.3,0.5,0.0,0.0,0.0,
-1,0.3333333333333333,0,4.67,0.2,0.5,0.0,0.0,0.0,
1,-0.3333333333333333,0,4.67,0.2,0.5,0.0,0.0,0.0,
2,0.6666666666666666,0,2.16,0.3,0.5,0.0,0.0,0.0,
3,-0.3333333333333333,0,93.4,3.4,8.6,0.0,0.0,0.0,
4,0.6666666666666666,0,1270.0,20.0,20.0,0.0,0.0,0.0,
5,-0.3333333333333333,0,4180.0,20.0,30.0,0.0,0.0,0.0,
6,0.6666666666666666,0,172500.0,700.0,700.0,1420.0,150.0,190.0,
11,-1.0,0,0.51099895,1.5e-10,1.5e-10,0.0,0.0,0.0,
12,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,
13,-1.0,0,105.6583755,2.3e-06,2.3e-06,2.9959836e-16,3e-22,3e-22,
14,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,
15,-1.0,0,1776.86,0.12,0.12,2.267e-09,4e-12,4e-12,
16,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,
21,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,
22,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,
23,0.0,0,91187.6,2.1,2.1,2495.2,2.3,2.3,
24,1.0,0,80379.0,12.0,12.0,2080.0,40.0,40.0,
25,0.0,0,125250.0,170.0,170.0,3.2,2.2,2.8,
111,0.0,0,134.9768,0.0005,0.0005,7.81e-06,1.2e-07,1.2e-07,
211,1.0,0,139.57039,0.00018,0.00018,2.5284e-14,5e-18,5e-18,
321,1.0,0,493.677,0.016,0.016,5.317e-14,9e-17,9e-17,
2112,0.0,4,939.5654205,5e-07,5e-07,7.493e-25,4e-28,4e-28,
2212,1.0,4,938.27208816,2.9e-07,2.9e-07,0.0,0.0,0.0,
3112,-1.0,4,1197.449,0.03,0.03,4.45e-12,3.2e-14,3.2e-14,
3122,0.0,4,1115.683,0.006,0.006,2.501e-12,1.9e-14,1.9e-14,
3212,0.0,4,1192.642,0.024,0.024,0.0089,0.0008,0.0009,
3222,1.0,4,1189.37,0.07,0.07,8.209e-12,2.7e-14,2.7e-14,
3312,-1.0,4,1321.71,0.07,0.07,4.02e-12,4e-14,4e-14,
3322,0.0,4,1314.86,0.2,0.2,2.27e-12,7e-14,7e-14,
3334,-1.0,4,1672.45,0.29,0.29,8.02e-12,1.1e-13,1.1e-13,
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Dataset Directory physics_particles_Plottable_Fields
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Directory Pathname: /Users/98dav/Desktop/Xcode/ECE-370_Project/ECE-370_Standardized_CSV_Data_Analysis/physics_particles_Plottable_Fields
Count of Files: 9
Files in Directory:
physics_particles_Plottable_Field_5-mass_upper
physics_particles_Plottable_Field_7-width_lower
physics_particles_Plottable_Field_3-mass
physics_particles_Plottable_Field_8-width_upper
physics_particles_Plottable_Field_4-mass_lower
physics_particles_Plottable_Field_6-width
physics_particles_Plottable_Field_0-pdg_id
physics_particles_Plottable_Field_2-rank
physics_particles_Plottable_Field_1-charge
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Dataset Directory physics_particles_Plottable_Fields_Full_Analysis_Results
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Directory Pathname: /Users/98dav/Desktop/Xcode/ECE-370_Project/ECE-370_Standardized_CSV_Data_Analysis/physics_particles_Plottable_Fields_Full_Analysis_Results
Count of Files: 49
Files in Directory:
width_upper_normality
mass_lower_full_analysis
mass_upper_plot
charge_full_analysis
width_upper_full_analysis
mass_upper_normality
width_lower_stats
charge_histogram
charge_stats
mass_normality
width_upper_plot
width_full_analysis
pdg_id_normality
mass_upper_full_analysis
mass_lower_normality
width_lower_plot
width_lower_full_analysis
rank_full_analysis
pdg_id_plot
width_upper_stats
rank_stats
width_histogram
width_lower_normality
charge_stats
rank_histogram
mass_lower_plot
width_stats
width_plot
mass_plot
mass_lower_histogram
mass_upper_stats
pdg_id_histogram
mass_stats
mass_stats
pdg_id_stats
rank_normality
width_lower_histogram
width_normality
mass_lower_stats
rank_plot
rank_plot
width_upper_histogram
pdg_id_full_analysis
mass_full_analysis
charge_plot
mass_histogram
comprehensive_plots
charge_normality
mass_upper_histogram
The objective of this documentation is to elucidate the consistent and uniform logical framework governing the representation and description of data in a Comma-Separated Values (CSV) file. This framework emphasizes the uniform parameterization of data entries by a consistent set of fields: each row is treated as a parameterized set of fields that directly corresponds to a declared schema, and the program leverages this structured representation throughout.
Below is the overarching philosophy and the set of assumptions the project uses when interpreting CSV files (NOTE: how the analysis and modeling phases leverage this uniform schema is omitted here for brevity):
- By convention, the first line of the CSV is the header line; the number of comma-separated fields in this header is fixed, and subsequent rows must match that count.
- Each element in this header is understood to be the “field name,” and all subsequent lines (or rows) contain data entries aligned with these field names. In essence, the CSV is conceived of as a table where columns are fixed, in both count and meaning, and each row must fill in every column’s value (or mark it as missing).
- The project’s code explicitly enforces that the header’s number of character-delimited fields remains consistent with the number of fields recognized in every subsequent data row.
- Internally, each field (column) is assigned a type label (“numeric” or “nonnumeric”), which is inferred from the CSV contents. The program uses that labeling, for instance, to decide which columns can be plotted.
- The code that parses a CSV enforces a single, consistent set of fields across all data entries. This is accomplished by counting how many comma-delimited (or otherwise similarly delimited) segments appear in the first line, then ensuring each of the following lines matches that count. In short, every row is parameterized by the same set of column names.
- When the code parses a file, it reads the first line (the header) to establish fieldCount.
- The next lines are split on the same delimiter into exactly fieldCount tokens each.
- Each row’s data is aligned to the corresponding field name in the header.
- This consistency is key for ensuring that if, say, column 2 is “mass,” then every row can supply a value for mass (or explicitly mark it missing). It also underpins plottability checks, missing-value strategies, and so on. A minimal sketch of this field-count enforcement appears just below.
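A minimal sketch of the field-count enforcement idea, under assumed names (count_delimited_fields and rows_match_header are illustrative, not the project’s actual functions):

    #include <stdio.h>

    /* Illustrative: counts delimiter-separated segments in one line. */
    static int count_delimited_fields(const char *line, char delim)
    {
        int count = 1;
        for (const char *p = line; *p != '\0'; p++)
            if (*p == delim) count++;
        return count;
    }

    /* Returns 1 if every data row matches the header's field count. */
    static int rows_match_header(char **lines, int lineCount, char delim)
    {
        int fieldCount = count_delimited_fields(lines[0], delim);
        for (int i = 1; i < lineCount; i++)
            if (count_delimited_fields(lines[i], delim) != fieldCount)
                return 0;
        return 1;
    }

    int main(void)
    {
        char *rows[] = { "pdg_id,pdg_name,mass", "11,e,0.51099895", "13,mu,105.6583755" };
        printf("%s\n", rows_match_header(rows, 3, ',') ? "consistent" : "mismatch");
        return 0;
    }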
- Beyond simply splitting data into columns, the project code adds a further type-annotation stage for each field. Specifically, the program “looks” at the second line (i.e., the first data row after the header) to form an initial guess as to whether each column is numeric or nonnumeric, and then compares this guess against every other data entry’s corresponding column to ensure accurate type inference.
- An internal representation of each column is constructed, like "FieldName:numeric" or "FieldName:nonnumeric", where a field will only be declared nonnumeric once all potential numeric representations of that field’s contents have been ruled out (e.g., scientific notation, signs, decimals, etc.). A simplified sketch of the numeric check follows below.
- This pairing is used throughout the rest of the workflow—plotting routines, integrators, analysis modules, etc.—so that code can quickly check if a field is “plottable.”
- Numeric columns are “plottable” and can be taken for further analysis without significant refactoring. Nonnumeric columns are stored but not fed to numeric-only analysis routines.
This preprocessing phase underpins the entire program’s compatibility with data and is what makes the standardization of data analysis possible in my approach.
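A simplified sketch of the numeric-vs-nonnumeric check using strtod; the acceptance rules here are only a sketch of what a routine like string_is_numeric(...) must cover (signs, decimals, scientific notation), not its verbatim logic:

    #include <stdlib.h>
    #include <ctype.h>

    /* Simplified: a token is numeric if strtod consumes it entirely.
       This accepts signs, decimals, and scientific notation like 2.9e-07. */
    static int token_is_numeric(const char *token)
    {
        if (token == NULL || *token == '\0') return 0;
        char *end;
        (void)strtod(token, &end);
        while (isspace((unsigned char)*end)) end++;
        return end != token && *end == '\0';
    }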
- Primarily concerned with enforcing a dynamic tracking of data quality to ensure compatibility throughout the program, the preprocessing phase blindly accepts all contents of the stored dataset and enforces the declared data type.
- If a column is expected to be numeric but a CSV row’s token is nonnumeric (other than a “missing” hyphen), the code replaces it with a default numeric placeholder (e.g., 0.0) to keep arrays consistent, so all numeric columns remain valid and their relative order is maintained. A sketch of this substitution follows this list.
- For each column in each row, the code tallies how many entries are blank (“-”) or otherwise mismatched (e.g., expected numeric but found nonnumeric)—again possible because all rows present the same columns in order.
- Result: for each field, the program knows (a) if it is numeric (plottable) or not, (b) how many missing/malformed entries occur, and (c) if it might need further transformation (i.e., date/time → Unix time, see the StringUtilities.h documentation; isolated extraction of units; expanding scientific notation, 1e5 → 100000; etc.).
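A minimal sketch of the placeholder substitution and missing-value tally; the function name is illustrative, and it reuses the token_is_numeric check from the sketch above:

    #include <string.h>

    static int token_is_numeric(const char *token); /* from the sketch above */

    /* Illustrative: enforce the numeric type on one column's tokens.
       Missing ("-") or malformed tokens become "0.0" and are tallied. */
    static int enforce_numeric_column(char **tokens, int rowCount)
    {
        int missingDataCount = 0;
        for (int row = 0; row < rowCount; row++) {
            if (strcmp(tokens[row], "-") == 0 || !token_is_numeric(tokens[row])) {
                tokens[row] = "0.0"; /* default numeric placeholder */
                missingDataCount++;
            }
        }
        return missingDataCount;
    }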
The uniform schema approach is central. From the assumptions it enforces, the rest of
the program can reliably parse, clean, analyze, and model the data using the same
structured assumption: all rows must provide the same fields, in the same order, typed
consistently. This “once interpreted, consistently used” approach cuts down on
guesswork or repeated parsing at every analysis step, but it requires a comprehensive
and exhaustive transformation of data into what the program and its assumptions
consider compatible.
- Data Consistency: by enforcing strict header and field matching, the project avoids mid-file re-interpretations or inconsistent row lengths.
- Enhanced Analysis: this uniformity underpins consistent downstream operations—statistical integrators know exactly how many numeric columns exist and can skip nonnumeric columns, while plotting routines can extract “x” and “y” values by name, counting on consistent ordering.
- Modularity: each step in the pipeline—reading, cleaning, labeling, extracting numeric columns, generating analysis scripts—is encapsulated in its own function or set of functions, making it easy to replace, extend, or even refactor steps without affecting the larger system.
Below is a comprehensive, end-to-end guide covering every header/source file of this project in detail. The discussion begins with the high-level structure and end-to-end flow, then drills down into each file: its purpose, key functions, important data structures, and how it interacts with the rest of the program. Where relevant, I integrate the extensive in-code comments to clarify each function’s role.
Purpose:
- Defines shared constants and macros used across the entire project.
- Houses special-case arrays (like months[], weekDays[], commonDateTimeFormats[], and commonUnitFormats[]), so they remain accessible in one place.
- Introduces certain global variables and macros (like BUFFER_SIZE and ARRAY_ELEMENT_TO_STRING()) that other modules rely on.
- Central location for rarely changing “special case” definitions that unify the rest of the code.
Key Definitions & Structures:
- MAX_STRING_SIZE / MAX_NUM_FILE_LINES
  - Provide upper bounds for typical string and line usage (e.g., no more than 1000 characters in a line, or 100000 lines in a file).
- Macros for extracting string representations
  - VAR_NAME_AS_STRING(var), ARRAY_ELEMENT_NAME_AS_STRING(var), and ARRAY_ELEMENT_TO_STRING(arr, n).
  - They’re rarely invoked in the rest of the code but can be used to convert variable or array element names to strings at compile time.
- Mutex localtime_mutex
  - A global pthread_mutex_t, enabling thread-safe calls around localtime.
  - Used primarily by the function thread_safe_localtime(...) in GeneralUtilities.
- Arrays for date/time
  - months[12], weekDays[7], and commonDateTimeFormats[12] store textual month names, day names, and commonly encountered date/time patterns (e.g., "%Y-%m-%d %H:%M").
  - The code uses these arrays to parse date/time strings into struct tm.
- commonUnitFormats[]
  - Holds about 45 strings representing typical SI units, derived units, or other specialized measurement units.
  - Helps in detecting unit-suffixed numeric tokens (like “23.1kg”).
Key Implementation Notes:
- commonDateTimeFormats[] is crucial for convert_to_unix_time(...) in GeneralUtilities.c and for string_is_date_time(...) in StringUtilities.c.
- commonUnitFormats[] is used by string_is_unit(...) or is_numeric_with_units(...) in StringUtilities.c.
- Implementation: the .c file simply defines the extern arrays declared in CommonDefinitions.h and implements a helper n_array_element_to_string(...) used by the macros.
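A minimal sketch of the declaration/definition split described above; the array contents shown are abbreviated, illustrative examples, not the project’s full lists:

    /* CommonDefinitions.h (excerpt, illustrative) */
    extern const char *months[12];
    extern const char *commonDateTimeFormats[12];

    /* CommonDefinitions.c (excerpt, illustrative) */
    const char *months[12] = {
        "January", "February", "March", "April", "May", "June", "July",
        "August", "September", "October", "November", "December"
    };
    const char *commonDateTimeFormats[12] = {
        "%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M", "%Y-%m-%d",
        "%m/%d/%Y %H:%M:%S", "%m/%d/%Y", "%d-%m-%Y", "%d/%m/%Y",
        "%H:%M:%S", "%H:%M", "%Y%m%d", "%d %b %Y", "%b %d, %Y"
    };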
Purpose:
- Provides a broad set of utility functions that do not specifically belong to only string or file manipulation. This includes:
  - Memory allocation helpers
  - Sorting (merge sort, radix sort for doubles)
  - Time conversions (Unix time, thread-safe localtime)
  - Bitwise reinterpretations of doubles (e.g., double_to_uint64(), flipping sign bits)
  - Basic math (min/max element, etc.)
Important Function Categories:
- Safe Memory Allocation
  - allocate_memory_int_ptr(size_t sizeI), allocate_memory_char_ptr(size_t sizeC), etc.
  - Provide wrappers around malloc with error checking. For instance, if memory can’t be allocated, they perror(...) and exit(1).
- Bitwise Operations
  - double_to_uint64(double value), uint64_to_double(uint64_t), double_to_mapped_uint64(double), mapped_uint64_to_double(uint64_t), and flip_sign_bit(uint64_t).
  - These are essential for the specialized radix sort to handle negative vs. positive doubles in correct numerical order by mapping them to a region of the unsigned integer space.
- Sorting
  - Merge Sort: merge_data(...), merge_sort_data(...), merge_sort(...).
    - Classic divide-and-conquer mergesort for an array of doubles.
  - Radix Sort for doubles: radix_sort_doubles(double *unsortedData, const int numElements).
    - Uses the “bitwise reinterpretation” approach (see the sketches after this list).
    - Steps:
      - Map each double to uint64_t in an order-preserving way for sorting.
      - Sort the uint64_ts by standard byte-level passes (256 buckets each pass).
      - Convert the sorted uint64_ts back to double.
- Time Conversions
  - convert_to_unix_time(const char *dateTimeString): tries each pattern in commonDateTimeFormats[], calling strptime(...), returning a time_t if successful or -1 if not (see the sketches after this list).
  - thread_safe_localtime(const time_t *tim, struct tm *result): wraps localtime(...) calls in a mutex lock/unlock for safety, copying out to result.
- Basic Math
  - minimum(double a, double b), maximum(double a, double b): straightforward macros, though implemented as small functions.
  - min_element(...) / max_element(...): skip NaNs and find the true min or max.
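A minimal sketch of the order-preserving double → uint64_t mapping that makes byte-wise radix sorting of doubles possible; this is the standard technique the function names above suggest, but the bodies here are a sketch rather than the project’s exact code:

    #include <stdint.h>
    #include <string.h>

    /* Reinterpret a double's bits as uint64_t without violating aliasing rules. */
    static uint64_t double_to_uint64(double value)
    {
        uint64_t bits;
        memcpy(&bits, &value, sizeof bits);
        return bits;
    }

    /* Order-preserving map: set the sign bit for non-negative doubles,
       flip all bits for negative ones. Afterwards, unsigned integer order
       matches numerical double order, so byte-level bucket passes work. */
    static uint64_t double_to_mapped_uint64(double value)
    {
        uint64_t bits = double_to_uint64(value);
        return (bits & 0x8000000000000000ULL) ? ~bits
                                              : bits | 0x8000000000000000ULL;
    }

    /* Inverse of the mapping, applied after the byte-level passes. */
    static double mapped_uint64_to_double(uint64_t mapped)
    {
        uint64_t bits = (mapped & 0x8000000000000000ULL)
                            ? mapped & ~0x8000000000000000ULL
                            : ~mapped;
        double value;
        memcpy(&value, &bits, sizeof value);
        return value;
    }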
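Likewise, a minimal sketch of the strptime loop behind convert_to_unix_time(...), assuming the commonDateTimeFormats[] array from CommonDefinitions; error handling is simplified:

    #define _XOPEN_SOURCE 700 /* exposes strptime on glibc */
    #include <string.h>
    #include <time.h>

    extern const char *commonDateTimeFormats[12]; /* from CommonDefinitions */

    /* Try each known format; return Unix time on success, -1 on failure. */
    static time_t convert_to_unix_time_sketch(const char *dateTimeString)
    {
        for (int i = 0; i < 12; i++) {
            struct tm tmValue;
            memset(&tmValue, 0, sizeof tmValue);
            char *end = strptime(dateTimeString, commonDateTimeFormats[i], &tmValue);
            if (end != NULL && *end == '\0') {
                tmValue.tm_isdst = -1; /* let mktime resolve DST */
                return mktime(&tmValue);
            }
        }
        return (time_t)-1;
    }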
Integration:
- Radix sort is used in higher-level code (like extract_radix_sorted_data(...) in DataExtraction or process_data_set_for_analysis(...) in DataSetModeling) to handle large numeric arrays.
- Unix time conversions are central in replace_date_time_with_unix(...) in StringUtilities.
- The memory-allocation wrappers show up throughout the code.
Purpose:
- Houses a comprehensive suite of string handling routines: trimming, splitting/tokenizing, analyzing whether a string is numeric, detecting date/time or units, plus logic to unify data line structure.
Important Function Categories:
- Character-Level Checks
  - char_is_alpha(), char_is_digit(), char_is_delimiter(), etc.
  - These help parse CSV fields, skipping whitespace, detecting punctuation, etc.
- String Classification
  - string_is_numeric(...), string_is_nonnumeric(...), string_is_date_time(...), string_is_unit(...).
  - For example, string_is_date_time(...) will tokenize a line, attempt strptime(...) with each commonDateTimeFormats[] pattern, and set an integer array to 1 or 0 for each field.
  - Similarly, string_is_unit(...) checks for recognized units from commonUnitFormats[].
- Trimming/Pruning
  - trim_string_whitespaces(...), prune_string_whitespaces(...), prune_repeated_delimiters_from_string(...).
  - These ensure consistent CSV lines by removing extra spaces or repeated delimiters (e.g., ,, → ,0, so that an empty field is replaced by “0” or another placeholder; see the sketch after this list).
- Tokenizing & Concatenation
  - tokenize_string(...), tokenize_string_r(...): akin to strtok, but re-entrant where needed.
  - combine_strings(...), concatenate_string(...), concatenate_string_array(...): join arrays or pairs of strings efficiently.
- Higher-Level Preprocessing
  - prune_and_trim_problematic_characters_from_string(...):
    - Trims leading/trailing whitespace.
    - Removes internal whitespace.
    - Handles repeated delimiters.
    - Optionally detects date/time fields and converts them to Unix time.
  - replace_date_time_with_unix(...): specifically scans fields for date/time and substitutes their numeric Unix time.
  - extract_units_from_fields(...): tries to detect units in a numeric token and separate values and units into two arrays.
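A minimal sketch of the repeated-delimiter repair (,, → ,0,) described under Trimming/Pruning; buffer handling here is simplified relative to whatever prune_repeated_delimiters_from_string(...) actually does:

    #include <stdio.h>

    /* Illustrative: rewrite "1,,3," as "1,0,3,0" so empty fields get a placeholder. */
    static void fill_empty_fields(const char *in, char *out, size_t outSize, char delim)
    {
        size_t j = 0;
        char prev = delim; /* treat start-of-line as following a delimiter */
        for (size_t i = 0; in[i] != '\0' && j + 2 < outSize; i++) {
            if (in[i] == delim && prev == delim)
                out[j++] = '0'; /* placeholder for an empty field */
            out[j++] = in[i];
            prev = in[i];
        }
        if (prev == delim && j + 1 < outSize)
            out[j++] = '0'; /* trailing empty field */
        out[j] = '\0';
    }

    int main(void)
    {
        char fixed[64];
        fill_empty_fields("1,,3,", fixed, sizeof fixed, ',');
        printf("%s\n", fixed); /* prints 1,0,3,0 */
        return 0;
    }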
Integration:
- These string routines are heavily used in DataExtraction and FileUtilities when reading lines from CSV, pruning them, and assembling the final numeric or textual fields.
- Many CSV “quirks” (like repeated commas) get fixed here.
Purpose:
- Focused on file handling and directory operations.
- Functions to read entire files into memory, count lines/fields, create directories, retrieve file properties, etc.
Important Function Categories:
- File Path Analysis
  - determine_file_depth(...): counts "/" in the path.
  - identify_file_extension(...): e.g., returns .csv from /home/user/data.csv.
  - find_file_directory_path(...): e.g., from /home/user/data.csv, returns /home/user/.
  - find_directory_name_from_path(...): from /home/user/, returns user/.
  - find_name_from_path(...): extracts the bare filename (without extension).
- Counting Fields & Lines
  - count_data_fields(...): splits a header line to see how many columns exist.
  - count_file_lines(filePathName, maxLines): loops through the file with fgets until maxLines or EOF.
  - Additional helpers like count_plot_data_fields(...).
- Reading & Writing
  - read_file_contents(...): reads the entire file into a char** array, each element containing one line.
  - write_file_contents(...): writes a char** back out line by line.
  - write_file_numeric_data(...): writes an array of doubles to a file, often storing them with a header that includes the field name.
- Merging & Generating Filenames
  - generate_merged_filename(filePath1, filePath2), merge_two_files(...): combine or unify data from two distinct files, or produce a new filename that references both.
- Directory Operations
  - create_directory(filePathName, directoryName): uses mkdir or similar, forming e.g. /home/user/data.csv_plottable_fields/.
  - delete_directory(...), count_files_in_directory(...), get_file_pathnames_in_directory(...), etc.
- FileProperties & DirectoryProperties Structures
  - FileProperties can hold (fileName, fileExtension, filePathName, fileLineCount).
  - DirectoryProperties can hold (directoryPathName, directoryName, fileCount, fileProperties), so you can enumerate which files exist in that directory. A sketch of plausible shapes follows this list.
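A minimal sketch of what these property structures might look like; the field types are assumptions inferred from the tuples above, not the verbatim headers:

    #include <stddef.h>

    /* Illustrative shapes for the structures described above. */
    typedef struct {
        char  *fileName;      /* bare name, without extension */
        char  *fileExtension; /* e.g., ".csv" */
        char  *filePathName;  /* full path */
        size_t fileLineCount; /* number of lines in the file */
    } FileProperties;

    typedef struct {
        char           *directoryPathName;
        char           *directoryName;
        size_t          fileCount;
        FileProperties *fileProperties; /* one entry per file in the directory */
    } DirectoryProperties;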
Integration:
- Called first in the pipeline to read CSV lines (read_file_contents), and again to write them back out at the end (write_file_contents).
- Helps create the specialized subdirectories where numeric columns are stored.
Purpose:
- Central data set parsing and preliminary analysis.
- Identifies missing values, pairs each field in the header with data types, and extracts “plottable” numeric columns.
Important Function Categories:
- parse_entire_file(char **fileContents, int lineCount, int *fieldCount, const char *delim)
  - Tokenizes each line, storing the i-th field of each row in a 2D array structure; the table is effectively stored column-major (see the sketch after this list).
  - For example, separatedData[i][line] = the i-th column’s value in row line.
  - Also returns fieldCount if needed.
- count_missing_values(...)
  - Iterates over each row, checking for blank ("-") or mismatched data.
  - Uses determine_data_entry_types(...) to see if a column is numeric or not. If a column is flagged numeric but the data is invalid, increments missingDataCount.
- Header & Data Type Identification
  - capture_data_set_header_for_plotting(...): pairs each field name in the header with its type (e.g., "mass:numeric"), returning an array of those pairs.
  - determine_data_entry_types(...): checks each token in a single line, returning "numeric" or "nonnumeric".
  - determine_common_data_entry_types(...): repeats the above across all lines, concluding the most typical type for each column.
- Identifying & Formatting Fields
  - identify_plottable_fields(...): from the name/type pairs, returns an integer array: 1 if numeric, 0 if not.
  - capture_plottable_fields(...) and format_data_entry_for_plotting(...) then produce strings (or arrays) with only numeric columns, substituting defaults for missing or invalid numeric fields.
- Writing Data
  - write_plottable_data(...): writes each numeric column to a text file named after the field.
  - write_data_set(...): can rewrite the entire dataset, splitting numeric vs. nonnumeric columns.
- Extracting Full Numeric Arrays
  - extract_plottable_data_field(...), extract_plottable_data(...): read all numeric columns from the “clean” lines into double arrays.
  - extract_radix_sorted_data(...): the same, but also sorts each column. Great for statistical analysis.
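A minimal sketch of the column-major storage idea behind parse_entire_file(...); allocation checks are omitted, and strtok-style splitting is a simplification (it would collapse empty fields, which the real pipeline avoids by inserting placeholders first):

    #define _POSIX_C_SOURCE 200809L /* for strdup */
    #include <stdlib.h>
    #include <string.h>

    /* Illustrative: separatedData[field][row] holds the field-th token of row. */
    static char ***parse_lines_column_major(char **lines, int lineCount,
                                            int fieldCount, const char *delim)
    {
        char ***separatedData = malloc((size_t)fieldCount * sizeof *separatedData);
        for (int f = 0; f < fieldCount; f++)
            separatedData[f] = malloc((size_t)lineCount * sizeof *separatedData[f]);

        for (int row = 0; row < lineCount; row++) {
            char *copy = strdup(lines[row]); /* strtok mutates its input */
            char *token = strtok(copy, delim);
            for (int f = 0; f < fieldCount; f++) {
                separatedData[f][row] = token ? strdup(token) : strdup("-");
                token = token ? strtok(NULL, delim) : NULL;
            }
            free(copy);
        }
        return separatedData;
    }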
Integration:
- Most of the code that cleans and infers types is located or invoked here, calling StringUtilities frequently.
- The results from DataExtraction flow into DataAnalysis_DataSetModeling or other analysis modules that want an array of numeric columns.
Purpose:
- Perform standard statistical computations such as mean, standard deviation, skewness, histogram binning, and normality tests (Anderson–Darling).
Important Function Categories:
- Basic Stats
  - compute_mean(...), compute_standard_deviation(...), compute_skewness(...).
- Histogram & Binning
  - Freedman–Diaconis approach with compute_IQR(...) and compute_bin_width(...) (see the sketches at the end of this section).
  - Histogram structure: bins[], num_bins, bin_width, min_value, max_value.
  - compute_data_set_binning(...) uses Freedman–Diaconis or a comparable approach to produce optimal bins.
- Anderson–Darling Normality
  - anderson_darling_normality_test(...): sorts the data, then compares the empirical CDF to the normal CDF.
  - If the data has many ties, it might incorrectly reject normality, but that’s a known limitation.
- Gaussian / Integration
  - compute_gaussian(n, x[]) returns the PDF-like integrand 2/sqrt(pi)*exp(-x^2).
  - gaussian_riemann_sum_integration(...) does a Riemann sum to approximate the error function (see the sketches at the end of this section).
  - These are used for demonstration or for plotting normal distributions.
- File-Based Helpers
  - Helpers like compute_mean_from_file(...) read numeric data from a file and do the stats in one step.
  - analyze_numeric_data(double *data, int n, const char *outputDirectory, const char *fieldName) returns a StatisticalReport with (mean, std_dev, skewness, ad_stat, histogram).
Integration:
- Typically invoked after numeric columns are extracted. For each column’s double array, code calls analyze_numeric_data(...), obtains the mean, histogram, etc., and writes a small _analysis.txt or _histogram.txt.
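A minimal sketch of the Freedman–Diaconis rule mentioned above, which sets the bin width from the interquartile range as h = 2 * IQR * n^(-1/3). The quantile computation uses simple linear interpolation and assumes sorted input; the project’s compute_IQR(...) and compute_bin_width(...) may differ in detail:

    #include <math.h>

    /* Linear-interpolated quantile of sorted data (0 <= q <= 1). */
    static double quantile_sorted(const double *sorted, int n, double q)
    {
        double pos  = q * (n - 1);
        int    lo   = (int)pos;
        double frac = pos - lo;
        return (lo + 1 < n) ? sorted[lo] * (1.0 - frac) + sorted[lo + 1] * frac
                            : sorted[lo];
    }

    /* Freedman–Diaconis bin width: 2 * IQR * n^(-1/3). */
    static double freedman_diaconis_bin_width(const double *sorted, int n)
    {
        double iqr = quantile_sorted(sorted, n, 0.75)
                   - quantile_sorted(sorted, n, 0.25);
        return 2.0 * iqr / cbrt((double)n);
    }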
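And a minimal sketch of approximating the error function erf(x) = (2/sqrt(pi)) * integral from 0 to x of exp(-t^2) dt with a midpoint Riemann sum, in the spirit of gaussian_riemann_sum_integration(...); the step count is illustrative:

    #include <math.h>
    #include <stdio.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    /* The erf integrand: (2/sqrt(pi)) * exp(-t^2). */
    static double erf_integrand(double t)
    {
        return 2.0 / sqrt(M_PI) * exp(-t * t);
    }

    /* Midpoint Riemann sum approximation of erf(x) on [0, x]. */
    static double erf_riemann_sum(double x, int steps)
    {
        double dt = x / steps, sum = 0.0;
        for (int i = 0; i < steps; i++)
            sum += erf_integrand((i + 0.5) * dt) * dt;
        return sum;
    }

    int main(void)
    {
        /* Compare against the C library's erf(); compile with -lm. */
        printf("approx: %.6f  libm: %.6f\n", erf_riemann_sum(1.0, 1000), erf(1.0));
        return 0;
    }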
(This is effectively the orchestrator for analysis and modeling.)
Purpose:
- Coordinates the final pipeline steps: once data is “cleaned” and “plottable,” it can be analyzed in detail or modeled.
- Creates or references directories for numeric data, calls extract_radix_sorted_data(...), and writes out modeling scripts for MATLAB.
Key Exports:
- Structures
  - DataSetProperties: (entryCount, fieldCount, delimiter, dataSetFieldNames, fieldNameTypePairs, dataSetFilePathName, missingDataCount, plottabilityStatus, commonDataTypes, dataSetFileContents).
    - Encapsulates all key information about a single dataset.
  - DataSetAnalysis: extends DataSetProperties with double **radixSortedData, plus references to the directories in which plottable or analysis data is stored.
- Functions
  - analyze_data_set_properties(...): top-level function that:
    - Reads lines from a file and identifies the delimiter.
    - Counts fields, calls capture_data_set_header_for_plotting(...), determine_common_data_entry_types(...), etc.
    - Returns a fully built DataSetProperties.
  - capture_data_set_configurations(...): a sub-variant for when the lines have already been read.
  - process_data_set_for_analysis(...):
    - Reads lines and calls extract_and_format_data_set(...) to unify them.
    - Writes the dataset out via write_data_set(...).
    - Returns a DataSetAnalysis that includes directory info for the numeric fields.
  - preprocess_data_set_for_analysis(...): a lighter wrapper that just returns the directory path with the plottable data.
  - perform_full_analysis_and_modeling(...): enumerates plottable field files, calls analyze_numeric_data(...), writes out _analysis.txt or _histogram.txt, and can generate MATLAB scripts.
- MATLAB Script Generation
  - Routines like generate_individual_matlab_scripts(...) or generate_comprehensive_matlab_script(...) produce .m files that load the numeric data, plot histograms, overlay normal curves, etc. (see the sketch after this list).
  - Typically each numeric field is given a separate script, or a combined script loops over them.
Integration:
- This is often called by main.c, providing a single function invocation that orchestrates reading, cleaning, analyzing, and generating results.
- DataAnalysis_DataSetModeling thus depends on DataExtraction, StringUtilities, FileUtilities, and StatisticalMethods to do the actual “leg work.”