4.8. Making EDA Reports
When doing EDA, whether in academia or industry, you will often have to present your findings in a report rather than simply sending a Jupyter Notebook of code, plots, and test results. This report should also read well; well-written reports are more likely to make an impact on readers.
In this section, we will go over basic writing principles for EDA and how to prepare your Jupyter Notebooks for export as a report.
4.8.1. EDA Writing Principles
Many people are familiar with the Scientific Method, which lays out a cyclical workflow for doing science. This process involves:
Making preliminary observations and reading prior research;
Formulating hypotheses based on these observations and research;
Designing and performing experiments to test these hypotheses;
Evaluating the results collected from the experiments;
Drawing conclusions based on the results, which feed back into the first step.
In scientific research papers, these phases map onto the sections of the report. Initial observations, prior research, and hypotheses form an Introduction section. The experiments performed form a Methods section, and their results form a Results section. Finally, the conclusions drawn from the results form a Discussion section.
Data science, being an offshoot of science, benefits from this workflow as well. However, EDA reports follow a modified structure, placing less emphasis on the methods and requiring less rigor for introduction and discussion sections. Because of this, EDA reports are not typically published but rather can form a piece of a larger work.
Here, we will go over some of the sections of an EDA report in depth, and how to write them.
Introduction
The introduction of the EDA should answer, at the very least, the following questions:
Where did you get the data from?
What are your guiding questions throughout this process?
Why did you form these questions?
The introduction can also include some background on the topic, especially for reports that will be visible to the public. This helps readers without prior knowledge of the subject understand what the topic is about and why it is important. In practice, this means incorporating preliminary observations and looking into prior research where possible.
Data Preparation
You undoubtedly had to perform some data manipulation before reaching the basic analysis portion of your EDA. You do not need to include any code in your EDA report, or describe basic preprocessing such as removing N/A values, but you should highlight any atypical steps you took, such as transformations that create new variables. Your rationale for these steps is also important in this section.
In other words, this section should answer:
What did you do to the data that is notable?
Why did you do it?
Results
Rather than having a separate methods section where you introduce the tests and plots that you made, it is better to introduce them alongside the results they support. One way to organize this is to go through each of your results in turn or, alternatively, each of the features for which you found striking results. This approach lends itself to a sub-structure as follows:
Feature title
    Feature description
    Important result
    Plots and tests to back up your result
    Interpretations of the result
Next feature title…
This is likely to be the most important section of your report, so put as much effort into it as possible. Make sure it is polished and that your rationale and results make sense.
This section should answer, in short:
What did you find?
What do your findings mean?
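The per-feature sub-structure described above can be turned into a reusable skeleton that you fill in as you write. The following is a minimal sketch, not a required format: the function name `results_outline` and the example feature names are hypothetical, and the heading levels assume that Results is a section-level (two-hash) heading in a markdown report.

```python
def results_outline(features):
    """Return a markdown skeleton for a Results section, one block per feature."""
    lines = ["## Results", ""]
    for name in features:
        lines += [
            f"### {name}",
            "",
            "Feature description: ...",
            "",
            "Important result: ...",
            "",
            "Plots and tests: ...",
            "",
            "Interpretation: ...",
            "",
        ]
    return "\n".join(lines)


# Hypothetical feature names, for illustration only.
print(results_outline(["age", "income"]))
```

The printed text can be pasted into a markdown cell and each placeholder replaced with your own writing, plots, and test results.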
Discussion
This section does not have to be labeled as a “Discussion” section, and it will likely not be as extensive as it typically is in many formal research papers. However, it should tackle the questions:
What did you find, overall? (Summarize your results!)
Why are your findings important?
What next steps could be taken in analysis?
Such next steps may occur to you as you perform the analysis, but often they come from feedback that others give on your analysis. Feedback is important for the overall strength of a report; an additional set of eyes can catch errors that you missed when writing or even revising the report. This is more likely than you think: with your own writing, you may be tempted to skim a section, especially if you think you know it well.
4.8.2. General Writing Tips
With many reports, one big tip is to elaborate as much as possible. Adding more content is generally better than adding less, because in writing (as in many aspects of life) it is easier to cut material than to add something completely new.
Another tip is to take feedback well. As mentioned previously, feedback is important and will help you improve your analysis with each iteration. Although feedback can sometimes be quite harsh, it often points out areas for significant improvement.
Writing for EDA reports, and for scientific reports in general, is very technical. There is typically no need to try to make a report pleasing to read; rather, it is more important that the report is not confusing and covers all the important details. That said, being able to state succinctly what you need to say is valuable, as unnecessary text can make your writing more confusing.
Finally, I encourage you to enjoy the process. As with many things in life, you will get more satisfaction out of writing reports, and you will do a better job, if you enjoy it. For some people, writing is not a strength or something they particularly enjoy. Fortunately, there are many people who are willing and able to help. Having additional people look at your reports or your progress over time can open your eyes to things about your own work that you may have missed, such as your own improvement in writing.
4.8.3. Notebook Preparation
Thankfully for EDA report-writers, Jupyter Notebooks support markdown cells in addition to code cells. This combination of markdown cells and code cells forms the foundation of report-writing, on top of which some additional cleanup is necessary to make a presentable report.
Here, we will go over some markdown basics, hiding code, and outputting a PDF from your notebook.
Markdown Basics
Here, we introduce some of the more common features of markdown that may be helpful to you as you make reports.
One of the most useful features of markdown is headings. They help you section your report and thus make it easier to read. Typically, a report should have one main title at the top, followed by section titles and sub-section titles. These are all indicated with the hash symbol (#), also called the pound sign or hashtag, placed at the start of the line and followed by a space. One symbol is used for the main title, two for sections, and three for sub-sections. You can use even more for sub-subsections if you need.
Some people like to write some words in italics or boldface to emphasize points. This is done with asterisks surrounding the portion you want to italicize or make bold, with one asterisk making italics and two making boldface. For example,
*This text will be italicized.*
**This text will be in bold.**
Lists are another common tool in making reports and are done with the hyphen character (-) or a number followed by a period (like “1.”) at the beginning of a new line.
Markdown also works with LaTeX and HTML, and either can be used within markdown in most cases. LaTeX is good for writing mathematical formulas: in-line math is written by surrounding a LaTeX math formula with single dollar signs ($), while larger formulas can be set on their own lines by placing $$ on a new line, the LaTeX formula on the next line, and another $$ on the line after that. HTML tags typically have equivalents in either LaTeX or markdown; because HTML tags are clunky, they are not recommended unless you know HTML very well.
Finally, some characters like the asterisk and the dollar sign are special, as you saw in the prior paragraphs. To print these characters as they are and not have markdown read them for their function, simply include a backslash (\) before the character, like \$, and it will print out as normal.
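If you generate markdown text from Python (for example, a sentence that contains a dollar amount), the backslash escaping described above can be applied automatically. This is a minimal sketch: the helper name `escape_markdown` is hypothetical, and it only handles the two characters mentioned here (the asterisk and the dollar sign); other special characters can be passed through the `specials` parameter.

```python
def escape_markdown(text, specials="*$"):
    """Prefix each special character with a backslash so markdown prints it literally."""
    return "".join("\\" + ch if ch in specials else ch for ch in text)


escaped = escape_markdown("Tickets cost $5 * 2 people")
print(escaped)  # Tickets cost \$5 \* 2 people
```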
Hiding Code
Before exporting notebooks, it is important to place tags into your code cells to hide them; code should not be present in a report as it may make the report much longer than it should be. To hide code cells, click on the cell’s setting to add a cell tag, then add “remove-input”. If there is a code chunk for which you want to hide the output as well, add “remove-cell”.
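Tagging many cells by hand can be tedious. A notebook file is JSON, and each cell's tags are stored as a list under the "tags" key of that cell's metadata, so the tagging can also be scripted. The sketch below uses only the standard library; the function name `tag_code_cells` and the tiny in-memory notebook are hypothetical stand-ins for your own file.

```python
import json


def tag_code_cells(notebook, tag="remove-input"):
    """Add `tag` to every code cell of a notebook given as a parsed-JSON dict."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # leave markdown cells untouched
        tags = cell.setdefault("metadata", {}).setdefault("tags", [])
        if tag not in tags:
            tags.append(tag)
    return notebook


# A tiny in-memory notebook (hypothetical content) instead of a real file.
example = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "source": ["1 + 1"],
         "outputs": [], "execution_count": None},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}
tag_code_cells(example)
print(json.dumps(example["cells"][1]["metadata"]))  # {"tags": ["remove-input"]}
```

To apply this to a real notebook, you would load it with `json.load`, call the function, and write the result back with `json.dump`; working on a copy of the file is a sensible precaution.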
Now you can export to a PDF. However, it is not quite that simple. If you try to export a notebook with tagged code cells, the code cells will still appear in the output. Additionally, the tools required for generating PDF output may not be installed on your computer. To fix these problems, first make sure that nbconvert and its dependencies (pandoc and some version of LaTeX) are installed.
To install nbconvert, refer to https://nbconvert.readthedocs.io/en/latest/install.html#installing-nbconvert. With Anaconda, this can be done in the terminal with:
conda install nbconvert
To install pandoc, refer to https://pandoc.org/installing.html. With Homebrew, this can be done in the terminal with:
brew install pandoc
To install a version of LaTeX, refer to https://tug.org/mactex/ (which installs MacTeX for macOS) or https://miktex.org/ (which installs MiKTeX for Windows).
Make sure to restart your computer after installing these.
After this, make sure you are in the same directory as the notebook that you wish to export, and type the following in the terminal:
jupyter nbconvert --to pdf --no-input <file_name>
replacing <file_name> with the file name of the notebook you want to export. The --no-input flag leaves the input of every code cell out of the output. At this point you should be able to produce PDFs from your Jupyter Notebooks!
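If you export often, the command can be assembled in Python and run with `subprocess`. The sketch below is illustrative rather than definitive: the function name `build_pdf_command` and the file name `report.ipynb` are hypothetical. The `use_tags` branch relies on nbconvert's TagRemovePreprocessor options and assumes they are honored by the PDF exporter in your installed version, which you should confirm against the nbconvert documentation.

```python
import subprocess


def build_pdf_command(notebook_path, use_tags=False):
    """Build the nbconvert command line as a list of arguments."""
    cmd = ["jupyter", "nbconvert", notebook_path, "--to", "pdf"]
    if use_tags:
        # Assumption: hide only tagged cells instead of all code input.
        cmd += [
            "--TagRemovePreprocessor.enabled=True",
            "--TagRemovePreprocessor.remove_input_tags", "remove-input",
            "--TagRemovePreprocessor.remove_cell_tags", "remove-cell",
        ]
    else:
        # Same behavior as the command shown in the text: hide all code input.
        cmd.append("--no-input")
    return cmd


cmd = build_pdf_command("report.ipynb")
print(" ".join(cmd))  # jupyter nbconvert report.ipynb --to pdf --no-input
# To actually run the export (requires nbconvert, pandoc, and LaTeX):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems when the notebook's file name contains spaces.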
4.8.4. EDA Project, Part 3
Format your findings from the previous sections into a well-written and presentable PDF report with the aforementioned sections.