This is very useful for exploring correlations between multidimensional data, when you'd like to plot all pairs of values against each other. We'll demo this with the well-known Iris dataset, which lists measurements of petals and sepals of three iris species. Visualizing the multidimensional relationships among the samples is as easy as calling sns.
Sometimes the best way to view data is via histograms of subsets. Seaborn's FacetGrid makes this extremely simple. We'll take a look at some data that shows the amount that restaurant staff receive in tips based on various indicator data.
Factor plots can be useful for this kind of visualization as well. This allows you to view the distribution of a parameter within bins defined by any other parameter. Similar to the pairplot we saw earlier, we can use sns. Time series can be plotted using sns.
In the following example, we'll use the Planets data that we first saw in Aggregation and Grouping. For more information on plotting with Seaborn, see the Seaborn documentation, a tutorial, and the Seaborn gallery.
Here we'll look at using Seaborn to help visualize and understand finishing results from a marathon. I've scraped the data from sources on the Web, aggregated it and removed any identifying information, and put it on GitHub where it can be downloaded (if you are interested in using Python for web scraping, I would recommend Web Scraping with Python by Ryan Mitchell).
We will start by downloading the data from the Web, and loading it into Pandas. By default, Pandas loaded the time columns as Python strings (type object); we can see this by looking at the dtypes attribute of the DataFrame. Once the time columns are converted to a proper time type instead of strings, that looks much better. For the purpose of our Seaborn plotting utilities, let's next add columns that give the times in seconds.
The dotted line shows where someone's time would lie if they ran the marathon at a perfectly steady pace. The fact that the distribution lies above this indicates (as you might expect) that most people slow down over the course of the marathon.
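The string-to-time conversion and the seconds columns described above can be sketched as follows. The rows are made up, and the column names split, final, split_sec and final_sec are assumptions for illustration; the real file may be laid out differently.

```python
import pandas as pd

# Hypothetical rows standing in for the marathon data; times arrive as strings.
data = pd.DataFrame({
    "age": [33, 32, 31],
    "gender": ["M", "M", "W"],
    "split": ["01:05:38", "01:06:26", "01:10:00"],
    "final": ["02:08:51", "02:09:28", "02:25:30"],
})
print(data.dtypes)  # split and final are plain strings at this point

for col in ["split", "final"]:
    # Parse "HH:MM:SS" strings into timedeltas ...
    data[col] = pd.to_timedelta(data[col])
    # ... and add a numeric column in seconds for plotting.
    data[col + "_sec"] = data[col].dt.total_seconds()

print(data.dtypes)
```

Keeping both the timedelta and the seconds column is a convenience: the former reads well in tables, the latter is what the plotting functions need.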
If you have run competitively, you'll know that those who do the opposite (run faster during the second half of the race) are said to have "negative-split" the race. Let's create another column in the data, the split fraction, which measures the degree to which each runner negative-splits or positive-splits the race. Where this split difference is less than zero, the person negative-split the race by that fraction. Let's do a distribution plot of this split fraction.
Let's see whether there is any correlation between this split fraction and other variables.
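One definition of the split fraction that matches the sign convention above (negative when the second half is faster) is 1 - 2 * split / final, where split is the first-half time. This formula and the column names split_sec, final_sec and split_frac are assumptions for the sketch; the values are invented.

```python
import pandas as pd

# Hypothetical first-half and finishing times, in seconds.
data = pd.DataFrame({
    "split_sec": [3600.0, 4000.0, 4500.0],
    "final_sec": [7200.0, 7800.0, 10000.0],
})

# Zero for a perfectly even race, negative when the second half was faster,
# positive when the runner slowed down.
data["split_frac"] = 1 - 2 * data["split_sec"] / data["final_sec"]

# Boolean mask of runners who negative-split the race.
negative_split = data["split_frac"] < 0
print(data.assign(negative_split=negative_split))
```

In this toy table only the second runner negative-splits: 4000 s for the first half against 3800 s for the second.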
We'll do this using a PairGrid, which draws plots of all these correlations. It looks like the split fraction does not correlate particularly with age, but does correlate with the final time: faster runners tend to have closer to even splits on their marathon time. We see here that Seaborn is no panacea for Matplotlib's ills when it comes to plot styles: in particular, the x-axis labels overlap.
Because the output is a simple Matplotlib plot, however, the methods in Customizing Ticks can be used to adjust such things if desired. The difference between men and women here is interesting.
- Seaborn: Python's Statistical Data Visualization Library.
Let's look at the histogram of split fractions for these two groups. The interesting thing here is that there are many more men than women who are running close to an even split!
This almost looks like some kind of bimodal distribution among the men and women. Let's see if we can suss out what's going on by looking at the distributions as a function of age.
Let's look a little deeper, and compare these violin plots as a function of age. We'll start by creating a new column in the DataFrame that specifies the decade of age that each person is in. Looking at this, we can see where the distributions of men and women differ: the split distributions of men in their 20s to 50s show a pronounced over-density toward lower splits when compared to women of the same age (or of any age, for that matter).
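The decade-of-age column mentioned above takes one line of integer arithmetic. In this sketch the column name age_dec and the sample rows are assumptions for illustration.

```python
import pandas as pd

# Hypothetical runners; only the age column matters here.
data = pd.DataFrame({
    "age": [24, 35, 39, 52, 80],
    "gender": ["W", "M", "W", "M", "W"],
})

# Floor each age to the start of its decade: 35 -> 30, 39 -> 30, 52 -> 50.
data["age_dec"] = (data["age"] // 10) * 10

# Counting runners per decade shows which bins are too thin to trust.
runners_per_decade = data.groupby("age_dec").size()
print(runners_per_decade)
```

A column like this can then serve as the categorical axis of a violin plot, with gender as the hue.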
Also surprisingly, the year-old women seem to outperform everyone in terms of their split time. This is probably due to the fact that we're estimating the distribution from small numbers, as there are only a handful of runners in that range.
Back to the men with negative splits: who are these runners? Does this split fraction correlate with finishing quickly? We can plot this very easily. We'll use regplot, which will automatically fit a linear regression to the data.
There are several valid complaints about Matplotlib that often come up: Prior to version 2.
This provides great flexibility in terms of controlling which Axes is to be used for plotting. Each Axes-level function also returns the Axes on which the plot has been made. If an Axes has been passed to the ax argument, the same Axes object will be returned. The returned Axes object can then be used for further customisation using different methods like Axes. If no Axes is passed to the ax argument, seaborn uses the current active Axes to make the plot.
If no Axes is passed to the ax argument and there is no currently active Axes object, seaborn creates a new Axes object to make the plot and then returns that Axes object. The Axes-level functions in seaborn do not have any direct parameter to control the figure size.
However, since we can specify which Axes is to be used for plotting by passing it in the ax argument, we can control the figure size by creating the Figure and Axes ourselves at the desired size (for example with matplotlib's plt.subplots and its figsize argument) and handing that Axes to seaborn.
When exploring a multidimensional dataset, one of the most common use cases for data visualisation is drawing multiple instances of the same plot on different subsets of the data. The figure-level functions in seaborn are tailor-made for this use case. A figure-level function has complete control over the entire figure, and each time a figure-level function is called, it creates a new figure which can include multiple Axes, all organised in a meaningful way.
Consider the following use case: we want to visualise the relationship between total bill and tip via a scatter plot on different subsets of data. Each subset of data is categorised by a unique combination of values for the following variables 1. The above code can be broken down into three steps. Using FacetGrid, we can create Axes for dividing the dataset in up to three dimensions using the row, col and hue parameters. We also need to pass the names of the columns to be used for plotting.
Using FacetGrid, we neither have to explicitly create Axes for each subset nor do we have to explicitly divide the data into subsets. That is done internally by FacetGrid and FacetGrid.
We can pass different Axes-level functions to FacetGrid. So the above three functions are different in terms of what Axes-level functions can be passed to each one of them. Explicitly using FacetGrid provides more flexibility than directly using high-level interfaces like relplot, catplot or lmplot; for example, with FacetGrid, we can also pass custom functions to FacetGrid. If you do not need that flexibility, you can directly use the high-level interfaces. Each of the above three figure-level functions, as well as FacetGrid, returns an instance of FacetGrid.
Using the FacetGrid instance, we can get access to the individual Axes, which can then be used to tweak the plot, for example by adding axis labels and titles. Also, controlling the size of figure-level functions is different from controlling the size of matplotlib figures.
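A short sketch of both points, assuming relplot as the figure-level function and a synthetic table with hypothetical column names. The returned FacetGrid exposes its Axes for tweaking, and the size is given per facet through height and aspect, not through a figsize.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for a tips table (hypothetical values).
rng = np.random.default_rng(0)
n = 100
tips = pd.DataFrame({
    "total_bill": rng.uniform(10, 50, n),
    "time": rng.choice(["Lunch", "Dinner"], n),
})
tips["tip"] = tips["total_bill"] * rng.uniform(0.05, 0.30, n)

# Each facet is 3 inches tall and 3 * 1.5 = 4.5 inches wide.
g = sns.relplot(data=tips, x="total_bill", y="tip",
                col="time", height=3, aspect=1.5)

# The FacetGrid gives access to every Axes for further tweaks.
g.set_axis_labels("Total bill", "Tip")
for ax in g.axes.flat:
    ax.grid(True)
g.axes[0, 0].set_title("first facet")

print(type(g).__name__, g.axes.shape)
```

With two columns and one row, the whole figure comes out roughly 9 by 3 inches.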
Instead of setting the overall figure size, we can set the height and aspect of each facet subplot using the height and aspect parameters. Refer to FacetGrid for more examples.
PairGrid is used to plot pairwise relationships between variables in a dataset. Each subplot shows a relationship between a pair of variables. Consider the following use case: we want to visualise the relationship between every pair of variables via scatter plots.
This can be easily done in matplotlib as follows. The above code can be broken down into two steps. We can pass different Axes-level functions to PairGrid.
It does not make sense to plot a scatter plot on the diagonal Axes. It is possible to plot one kind of plot on the diagonal Axes and another kind of plot on the non-diagonal Axes. It uses PairGrid and PairGrid. Using the PairGrid instance, we can get access to the individual Axes, which can then be used to tweak the plot, for example by adding axis labels and titles. Refer to PairGrid for more examples.