I gave this talk to give an overview of few tools that our department relies heavily on. This talk was part of our IS department’s commons session to give employees the chance to share their skills with examples to set and follow.
We are hoping to give an insight into 5 tools that we rely heavily to make our life easy in Analytics department.
I have covered 5 tools in this document
Each tool has it’s on section covering
If you have questions or need help using any of these, please feel free to reach our to Robin Varghese, it will be my at-most joy to assist you in your endeavor.
SQL is a database language. SQL is used to communicate with a RDBMS database(i.e. Microsoft SQL Server, MS Access, MySQL, Oracle, Sybase, Ingres). We live in a data-driven world: We at Arnot search through data to find insights to inform strategy, marketing, operations, and a plethora of other categories. We have a several systems that use large, relational databases, which makes a basic understanding of SQL a great employable skill not only for data analyst, but for almost everyone.
We have written close to 100k lines of SQL code in last 5 years. We modify close to 38 files with 3073 insertions(+) and 216 deletions(-) every month. SQL is a crucial tool for analyzing data.
SQL could be distilled down to these 5 basic commands
SELECT
column_name(s)
FROM
table_name
WHERE
condition
GROUP BY
column_name(s)
ORDER BY
column_name(s);
I highly recommend taking one of these online course
Python is a popular programming language. Python was designed for readability, and has some similarities to the English language.
Python uses new lines to complete a command, as opposed to other programming languages which often use semicolons or parentheses. Python relies on indentation, using whitespace, to define scope.
It is used for:
What can Python do?
We use python programming language in a several ways
This generic integration agent integrates data into our data warehouse from variety of data sources( ex. sql server, csv, txt, excel files etc). It also de-dupes data it is integrating.
For e.g. this is how we integrate AMS credentials file to ensure we have the latest information on Arnot Health Providers.
'job_key':{
'extract_home': '~/Projects/flat_files_folder',
'extract_filename_part': 'CREDENTIALING STATUS REPORT',
'file_extensions': ['.xlsx'],
'sheetname': 'DEMOGRAPHICS MASTER',
'skiprows': 0,
'table_name': '[analytics_prod_database].[AMS_schema].[credentialing_master]',
'primary_keys': ['NPI'],
'severity': 'minor',
'frequency': 'weekly',
},
We trigger automated python jobs through our job engine. Job engine is programmed to tag each triggered job with basic diagnostic details(ex. time, host name, job_id etc). Every job also has few diagnostic tags(ex. ‘#critical #weekly #job #awhitt’) These tags are used by job monitor later to notify of errors or delays. Finally all job logs along with diagnostic details are pushed to a database.
We monitor over 80 automated jobs across our entire infrastructure for delays and failures. Thrice a day it sends out an email with a subject similar to Job Monitor: Critical errors: 1, Minor errors: 0, Warnings: 11, Success: 85
with details on current state of jobs.
Our Infinity Platform server is written in python
I highly recommend going through one of these online material
https://en.wikipedia.org/wiki/Python_(programming_language) https://share.percipio.com/cd/cm5WXG5QF https://arnothealth.percipio.com/search?q=python https://wiki.python.org/moin/BeginnersGuide/NonProgrammers
Git is a wonderful tool for source control. Git is the most popular version control system. If you have been programming for less than a decade, it’s highly likely that you haven’t used any other method of version control. The git workflow dictates how a software team collaborates, builds, and ships software.
We use a Community Edition of gitlab as our Centralized Version control System(CVS) and wiki for maintaining our documents.
We also use gitlab to manage our teams’ tasks and projects.
This is used to examine specific points of a file’s history and get context as to who the last author was that modified the line. This is used to explore the history of specific code and answer questions about what, how, and why the code was added to a repository.
For ex below file shows a history of a 122 line document how it has evolved over last 3 years.
https://github.com/jupyter/notebook/blame/master/.gitignore
git checkout master
git pull --rebase
Our convention is to create branches from gitlab issues
git checkout -t origin/<branch_name_here>
git add 1.txt
git commit -m '....'
git push
Team Management Project ex. python data visualization (#1167) · Issues · AHA-dev-team _ br1ck · GitLab.pdf
Project Jupyter is three things: a collection of standards, a community, and a set of software tools. Jupyter Notebook, one part of Jupyter, is software that creates a Jupyter notebook.
A Jupyter notebook is a document that supports mixing executable code, equations, visualizations, and narrative text. Specifically, Jupyter notebooks allow the user to bring together data, code, and prose, to tell an interactive, computational story.
We use this primarily when we need to attach a narrative with our analysis. Refer to attached file jupyter_notebook_demo.html
This is one of the latest tools that we have uncovered so even we are learning. If you want to learn more get in touch with us we can learn this together.
Our in-house analytics platform surfaces information from most of our in house systems via 48 dashboards and calculates 355 key metrics http://infinity.aha http://8.aha
Give us feedback and suggestion for improvement.
Project link: https://github.com/codein/codein.github.io/blob/master/files/resume.md