Practical Data Analysis Using Jupyter Notebook
上QQ阅读APP看书,第一时间看更新

Understanding a Python NumPy array and its importance

Several Python courses on NumPy focus on building programming or statistical examples intended to create a foundation for data science.

While this is important, I want to stay true to anyone who is just getting started working with data so the focus will be the practical usage of Python and NumPy for data analysis.This means not all of the features of NumPy will be covered, so I encourage you to learn more by looking at resources in the Further reading section. The history of the NumPy library has evolved from what was originally named Numerical Python. It was created as an open source project in 2001 by David Ascher, Paul Dubois, Konrad Hinsen, Jim Hugunin, and Travis Oliphant. According to the documentation, the purpose was to extend Python to allow the manipulation of large sets of objects organized in a grid-like fashion.

Python does not support arrays out of the box but does have a similar feature called lists, which has limitations in performance and scalability.

Additional research on the subject of why NumPy was created points to a need for efficiency in memory and storage when processing large volumes of data.Today, NumPy can be found as a dependent library for millions of Python projects in a public search of GitHub, including thousands of examples that handle image manipulation used for facial recognition.

The NumPy library is all about arrays, so let's walk through what an array is and why it is important.Any computer science or programming class I have taken has always included arrays. I was first introduced to an array beforeI evenunderstood the conceptthirty-seven years ago when I was introduced to a computer, the Apple IIe, in Mrs. Sherman's 4th-grade classroom.

One of the educational software available to run on the Apple IIe was called Logo,which was a programming language that allowed you to write simple commands to control the movement of a cursor on the computer monitor. To make the process more appealing to a younger audience, the commands allowed you to create geometric shapes and print values represented by a turtle. In Logo, arrays used the list command, which groups together one or more words or numbers as a single object that can be referenced during the same session.You can still find emulators available that allow you to run the Logo programming language, which wasa fun walk down memory lane for me that I hope you will enjoy as well.

A more formal definition of an array is it is a container used to store a list of values or collections of values called elements. The elements must be defined with a data type that applies to all of the values in an array, and that data type cannot be changed during the creation of the array.This sounds like a rigid rule but does create consistency between all of the data values.There is some flexibility using the data types of arrays found in the NumPy library, which are known as dtype(data types).The most common dtype are Boolean for true/false values, char for words/string values, float for decimal numbers, and int for integers. A full list of supported data types could be found in the documentation found in the Further reading section.

Some examples of an array can be a list of numbers from 1 to 10 or a list of characters such as stock tickers, APPL, IBM, and AMZN. Even board games such as Battleship and chess are examples of arrays where pieces are placed on the board and identified with the interaction of letters and numbers. Arrays in NumPy support complex data types including sentences but remember, you must keep the data type defined and consistent to avoid errors.Arrays come in all shapes and sizes, so let's walk through a few examples.

Differences between single and multiple dimensional arrays

If the array only has one dimension, it would represent that list of values in a single row or column (but not both).The following example shows a one-dimensional array assigned to variable named 1d_array:

1d_array = ([1, 2, 3, 4, 5])

A two-dimensional array, also known as a matrix, would be any combination of multiple rows and columns.The following equation is an example of a two-dimensional array:

2d_array =
([1, 'a'],
[2, 'b'],
[3, 'c'],
[4, 'e'],
[5, 'f'])

You may have already realized from the examples that a structured data table that is made up of rows and columns is a two-dimensional array! Now you can see why understanding the array concept builds the foundation for data analysis against structured data.

Once an array is defined, it is available for use in calculations or manipulation by referencing it during the same session such as when changing the sequence of the values or even replacing values as needed. Arrays have a multitude of uses in programming so I want to stay focused on specific use cases related to data analysis.

Understanding arrays goes beyond just a simple table of rows and columns.The examples discussed have either one or two dimensions. If the array has more than one dimension, you can reference the values along the axis (X, Y, or Z).

With the numpy library package, the core feature is the ndarray object, which allows for any number of dimensions, which is called n-dimensional.This refers to the shape and size of the array across multiple axes. Hence, a 3D cube with an X, Y, and Z axis can also be created using NumPy arrays. A scatter plot visualization would be a useful way to analyze the type of data, which we will cover in Chapter 9, Plotting, Visualization, and Storytelling, with some examples.

Some other key features of NumPy include the following:

  • The ability to perform mathematical calculations against big datasets
  • Using operators to compare values such as greater than and less than
  • Combining values in two or more arrays together
  • Referencing individual elements in the sequence from how they are stored

Visual representations of the different types of arrays using blocks to represent the elements are shown in the following diagrams. The following diagram is a one-dimensional array with three elements, which is also known as a vector when dealing with ordinal numbers or a tuple when working with ordered pairs. The first element of any array is referenced with an index value of 0:

The following diagram is a two-dimensional array, which is also known as a matrix. This matrix builds from the first one-dimensional array but now has two rows with three columns for a total of six elements:

An n-dimensional array could be represented as a cube similar to the one in the following diagram. This n-dimensional array continues to build from the first two examples and now includes a third dimension or axis:

I find the best way to learn is to get hands-on and comfortable using the commands to work with the data, so let's launch a Jupyter notebook and walk through some simple examples.