Filtering and subsetting data in R Programming Language

Rumman Ansari, Software Engineer, 2023-03-24

The data that we read in the previous recipes exists in R as data frames. For how to read data into R, see the earlier recipe. Data frames are the primary structure for tabular data in R; by tabular, we mean the row-column format. The columns of a data frame can hold different types, such as numeric or factor. In this recipe, we look at some simple operations for extracting parts of a data frame, adding a new chunk of data, or filtering the rows that satisfy certain conditions.
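
As a quick illustration of the row-column structure described above, the sketch below builds a tiny data frame by hand and inspects its column types. The object name `plants` and its values are made up for this example and are not part of the recipe.

```r
# Hypothetical toy data: one numeric column and one factor column
plants <- data.frame(
  height = c(10.2, 11.5, 9.8),
  kind   = factor(c("a", "b", "a"))
)

str(plants)            # compact summary of the structure
dim(plants)            # number of rows, then number of columns
sapply(plants, class)  # class of every column
```

Each column has a single type, but different columns of the same data frame can have different types.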

The following items are needed for this recipe:

  • A data frame loaded in the R session, to be modified or filtered (in our case, the iris data)
  • Another set of data to add to that data frame, or a set of conditions for filtering it

Perform the following steps to filter and create a subset from a data frame:

  1. Load the iris data as explained in the earlier recipe.
  2. The goal is to extract the species names and the corresponding sepal dimensions (length and width). First, take a look at the structure of the data as follows:
     
    > str(iris)
    'data.frame':  150 obs. of  5 variables:
     $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 …
     $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 …
     $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 …
     $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 …
     $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
    
  3. To extract the relevant data to the myiris object, use the data.frame function that creates a data frame with the defined columns as follows:
    > myiris <- data.frame(Sepal.Length = iris$Sepal.Length, Sepal.Width = iris$Sepal.Width, Species = iris$Species)
    
  4. Alternatively, extract the relevant columns by their positions (however, this style of subsetting should be avoided, because positional indices break silently if the column order changes):
    > myiris <- iris[,c(1,2,5)]
    
  5. Instead of the two previous methods, you can also take the removal approach and drop the columns you do not need by using negative indices:
    > myiris <- iris[,-c(3,4)]
    
  6. You can add to the data by adding a new column with cbind or a new row through rbind (the rnorm function generates a random sample from a normal distribution and will be discussed in detail in the next recipe):
    > Stalk.Length <- c(rnorm(30,1,0.1), rnorm(30,1.3,0.1), rnorm(30,1.5,0.1), rnorm(30,1.8,0.1), rnorm(30,2,0.1))
    > myiris <- cbind(iris, Stalk.Length)
    
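  7. Alternatively, you can assign the new column directly to an existing data frame in one step (this adds the column to whatever myiris currently holds):
    > myiris$Stalk.Length <- c(rnorm(30,1,0.1), rnorm(30,1.3,0.1), rnorm(30,1.5,0.1), rnorm(30,1.8,0.1), rnorm(30,2,0.1))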
    
  8. Check the new data frame using the following commands:
    > dim(myiris)
    [1] 150   6
    > colnames(myiris) # get column names for the data frame myiris
    [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"      "Stalk.Length"
    
  9. To add a new row, create a one-row data frame with the same columns and append it with rbind:
    > newdat <- data.frame(Sepal.Length=10.1, Sepal.Width=0.5, Petal.Length=2.5, Petal.Width=0.9, Species="myspecies")
    > myiris <- rbind(iris, newdat)
    > dim(myiris)
    [1] 151   5
    > myiris[151,]
        Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
    151         10.1         0.5          2.5         0.9 myspecies
    
  10. Extract the part of the data frame that meets a certain condition in one of the following ways:
    • Use the subset function:
      > mynew.iris <- subset(myiris, Sepal.Length == 10.1)
      
    • Alternatively, index the rows with a logical condition:
      > mynew.iris <- myiris[myiris$Sepal.Length == 10.1, ]
      > mynew.iris
          Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
      151         10.1         0.5          2.5         0.9 myspecies
      
    • The same approach works for a condition on a factor column such as Species:
      > mynew.iris <- subset(iris, Species == "setosa")
      
  11. Check the first row of the extracted data:
    > mynew.iris[1,]
      Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    1          5.1         3.5          1.4         0.2  setosa
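
Steps 4 and 5 above select columns by position. A minimal sketch of the same selection done by column name is shown below; it is one possible alternative, not the recipe's own method. The object names `keep`, `unwanted`, `myiris_by_name`, and `myiris_dropped` are hypothetical and introduced only for this example.

```r
# Select columns by name rather than by position
keep <- c("Sepal.Length", "Sepal.Width", "Species")
myiris_by_name <- iris[, keep]

# Drop columns by name: keep every column whose name is not in `unwanted`
unwanted <- c("Petal.Length", "Petal.Width")
myiris_dropped <- iris[, !(names(iris) %in% unwanted)]

# Compare the two results with each other and with the positional version
identical(myiris_by_name, myiris_dropped)
identical(myiris_by_name, iris[, c(1, 2, 5)])
```

Selecting by name keeps working if columns are later reordered or new ones are inserted, which is the main weakness of positional indices.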
    

    You can use any comparison operator (==, !=, <, >, <=, >=), and you can combine more than one condition with the logical operators & (AND), | (OR), and ! (NOT) if required.
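
The following sketch shows how such combined conditions might look on the iris data. The threshold 5.5 and the object names `long_setosa`, `two_species`, and `not_setosa` are arbitrary choices made for illustration.

```r
# AND: setosa flowers whose sepal length is greater than 5.5
long_setosa <- subset(iris, Species == "setosa" & Sepal.Length > 5.5)

# OR: rows that belong to either of two species, using bracket indexing
two_species <- iris[iris$Species == "setosa" | iris$Species == "virginica", ]

# NOT: every row except those of setosa
not_setosa <- iris[!(iris$Species == "setosa"), ]

nrow(long_setosa)
nrow(two_species)
nrow(not_setosa)
```

Note that inside subset you refer to columns by their bare names, whereas with bracket indexing each column must be qualified with the data frame, as in iris$Species.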