Advanced Statistical Programming and Data Science.
A perspective by Darko Medin

Today’s advanced procedures and models in Statistics and Data Science rely largely on a solid programming foundation. Whether the field in question is Biostatistics, Machine Learning, AI algorithms, Engineering, Database engineering or even Economic statistics, programming is an essential part of the practical application of these areas of Statistics and Data Science.

I have put together a perspective on the most important details in moving from moderate to advanced Statistical and Data Science programming.

1. Small details in a Programming language: As knowledge of a Programming language grows, a programmer learns small but important details and tricks, such as how to resolve a blockage in the deployment of a predefined script, how to avoid stalling, how to do something faster without a function, or how to write a new function that will speed up the process.

2. Finding balance in writing code for detailed, but still concise checks and clear logs/errors/warnings: An advanced Statistical Programmer knows how to evaluate error and warning logs in detail and, even better, knows how to create new ones based on the needs of a procedure. Still, these parts of the code must not be a burden on the overall workflow. When creating custom functions, these logs again play a vital role. This step also calls for theoretical knowledge of the underlying mathematics and statistics.
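As an illustration of this balance, here is a minimal Python sketch of a custom function that keeps its checks short while still logging, warning and raising errors where it matters. The function name `sample_mean_and_sd`, the logger name `stats_checks` and the `min_n` threshold are assumptions made up for this example, not part of any package.

```python
import logging
import math
import warnings

logger = logging.getLogger("stats_checks")  # hypothetical logger name


def sample_mean_and_sd(values, min_n=3):
    """Return (mean, sd) of a numeric sequence, with concise checks.

    min_n is an illustrative threshold, not a statistical rule.
    """
    values = list(values)
    # Drop missing entries (None or NaN) and log how many were removed.
    data = [float(v) for v in values
            if v is not None and not math.isnan(float(v))]
    n_dropped = len(values) - len(data)
    if n_dropped:
        logger.info("dropped %d missing value(s)", n_dropped)

    # Hard failure: the estimate is undefined (theory dictates this check).
    if len(data) < 2:
        raise ValueError("need at least 2 non-missing values for an sd")

    # Soft failure: the estimate exists but may be unstable.
    if len(data) < min_n:
        warnings.warn(
            f"only {len(data)} observations; estimate may be unstable",
            RuntimeWarning,
        )

    mean = sum(data) / len(data)
    # Sample variance divides by n - 1 (Bessel's correction).
    variance = sum((x - mean) ** 2 for x in data) / (len(data) - 1)
    return mean, math.sqrt(variance)
```

The split between an error (the result is undefined) and a warning (the result is defined but questionable) is where the statistical theory enters. The checks themselves take only a few lines and do not slow the main computation.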

3. Fast programming and Shortcuts: Being advanced in Statistical Programming and Data Science requires not just knowledge, but also skills. Being able to program large scripts quickly requires strong skills. Through experience a programmer learns the shortcuts in detail, and these make it possible to program quickly using just the keyboard with little mouse use (reaching for the mouse means moving a hand from the keyboard and back, which slows down the whole process).

4. Knowing more compatible Programming languages: Sometimes a procedure is written in one programming language and then optimized in another. These transfers enable all sorts of flexible environments and a large number of options in Statistical Programming and Data Science. Languages like Python, R, C++, Java, SAS, Scala, Perl and others are often viewed as a gold standard for Statistical Programming and Data Science, and they are best utilized when a Data Scientist knows a combination of two or more of them, chosen according to need. Since they cover slightly different areas of Statistical Programming and Data Science, knowledge of multiple languages enables smooth flow and the creation of efficient pipelines between them. In the latest versions of these programming languages there are even GUI channels where a transfer to another language's scripting rules is available in the same window, as this has become a very important segment of Data Science.

5. Workflow, pipelines and automation: Strong Data Science skills definitely include having a clear workflow and knowing how to automate it. How to use loops and pipelines efficiently, use main-loop procedures, and adapt algorithms automatically are key procedures every advanced Data Scientist should know in detail.
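A small Python sketch of what such automation can look like: each processing step is a plain function, the pipeline is an ordered list of steps, and one loop applies the same pipeline to every dataset. The step names, the `run_pipeline` helper and the toy datasets are all assumptions invented for this illustration.

```python
from functools import reduce
from statistics import mean, stdev


def drop_missing(xs):
    """Remove missing (None) entries."""
    return [x for x in xs if x is not None]


def standardize(xs):
    """Center to mean 0 and scale to sample standard deviation 1."""
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]


def run_pipeline(data, steps):
    """Apply each step in order; the output of one step feeds the next."""
    return reduce(lambda acc, step: step(acc), steps, data)


# The workflow is declared once, as data, and reused for every dataset.
PIPELINE = [drop_missing, standardize]

# Toy inputs, made up for the example.
datasets = {
    "a": [1.0, None, 2.0, 3.0],
    "b": [10.0, 20.0, 30.0, None],
}

# One loop automates the whole workflow across all datasets.
results = {name: run_pipeline(xs, PIPELINE) for name, xs in datasets.items()}
```

Keeping the steps in a list makes the workflow easy to read, reorder or extend without touching the loop that runs it.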

6. Customizing procedures: A good Data Scientist will be an expert in customizing methods and procedures, but also in creating new ones. This is one of the most advanced aspects of a Data Scientist's work. It requires problem solving, logic, and mathematical, statistical, programming and industry-level knowledge.

7. Strong theoretical Mathematical/Statistical basis: A strong Mathematical basis enables most of the things I mentioned before. Mathematical logic makes it possible to resolve complex problems quickly, which is essential in this area. Statistical theories play a key role in moving from Mathematics to applicable solutions. Knowing how to code these theories with ease is another trait of an advanced Statistical Programmer, as is knowing that not every aspect of a theory can be applied in practice. Choosing the right combination of theoretical and practical aspects is the key.

8. Ability to read mathematical equations efficiently and implement them in a programming language without stalling: Mathematical expressions can be complex, and an advanced Data Scientist knows how to program them in a way that does not cause stalling on a moderate computer configuration. This is one of the key aspects of algorithm efficiency.
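One way to illustrate this point is the sample variance. Read literally, the formula s^2 = sum((x_i - mean)^2) / (n - 1) needs the mean first and then a second pass over all the data, so the whole dataset has to sit in memory. The Python sketch below uses Welford's one-pass algorithm instead, which is a technique I am swapping in for the example. It updates the mean and the sum of squared deviations value by value, so memory use stays constant even for a long stream. The function name is an assumption made for this illustration.

```python
def streaming_mean_variance(stream):
    """One-pass mean and sample variance (Welford's algorithm)."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        n += 1
        delta = x - mean          # deviation from the old mean
        mean += delta / n         # update the mean
        m2 += delta * (x - mean)  # uses both old and new mean
    if n < 2:
        raise ValueError("need at least 2 values for a sample variance")
    return mean, m2 / (n - 1)


# A generator feeds values one at a time, so the million numbers
# are never held in memory together.
big_mean, big_var = streaming_mean_variance(
    float(i) for i in range(1, 1_000_001)
)
```

The same equation is computed, but the way it is coded decides whether a moderate machine handles the data comfortably or not.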

9. Being able to efficiently deploy the algorithm outputs/models in the beta version: Deployment is the final phase of a Statistical Programming/Data Science structure and is vital to the functionality of most Data-driven products. Building smooth workflow pipelines is key in this segment, and when advanced Data Science products are in question, beta solutions should be as close as possible to the real solution.

10. Last but not least, Industry knowledge and adaptability: An advanced Data Scientist or Statistical Programmer should be able to understand the problem that is to be resolved. This understanding is not expected to amount to top-level technical knowledge of the industry, but any advanced real-world solution requires being flexible enough to understand both the industry-relevant and the technical details. This transfer between Math/Statistics/Programming/Data Science and the real-world problem is a key channel in the whole story and must work well for a customized Statistical product or Data Science algorithm to perform well in real-world settings and situations. This last part comes with experience, and being able to understand something that is not naturally within the scope of Data Science or Statistics is the key here.

The analytics presented in the article were performed by Darko Medin using the ‘ComplexHeatmap’ and ‘brms’ packages in the R programming language.

References to the packages and Programming language:

1. Gu, Z. (2016). “Complex heatmaps reveal patterns and correlations in multidimensional genomic data.” Bioinformatics.

2. Bürkner, P. (2017). “brms: An R Package for Bayesian Multilevel Models Using Stan.” Journal of Statistical Software, 80(1), 1–28.

3. R Core Team (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Data Scientist and a Biostatistician