1

I have a bunch of text strings in Excel 2016. I am trying to search for various substrings. To be more helpful for reviewers, I want to return the word, delimited by spaces, which contains that substring.

For example, if my string is Will be retiring in June and my substring is ret, I would like the formula to return retiring. This will allow reviewers to distinguish retired from retiring or return.

I have written the below formula. There are multiple main strings to search. They are in columns H - M. There are multiple substrings to search for; they are in cells AQ1 - BF1. I would like the found word to populate the AQ2 - BF60000.

AP1 will determine which main column to search. If cell AP1 is H, then AQ2 will return the word that AQ$1 is found in from cell $H2. If AP1 is changed to I, AQ2 would return the word from cell $I2. This was simply to cut down on the number of formulas needed, looking at only one column at a time.

Here's a picture.

H I ... AP AQ AR AS ...
1 Work Status 1 Work Status 2 ... H ret vol unem ...
2 Out on disability FD volunteeer ... #VALUE! #VALUE! #VALUE! ...
3 Still unemployed Returned April 2022 ... #VALUE! #VALUE! unemployed ...
4 Hasn't returned, volunteering Will be retiring 2023 ... returned volunteering #VALUE! ...
5 Retired 2017 ... Retired #VALUE! #VALUE! ...

But if I change AP1 to be I instead of H, the table changes to

H I ... AP AQ AR AS ...
1 Work Status 1 Work Status 2 ... I ret vol unem ...
2 Out on disability FD volunteeer ... #VALUE! volunteeer #VALUE! ...
3 Still unemployed Returned April 2022 ... Returned #VALUE! #VALUE! ...
4 Hasn't returned, volunteering Will be retiring 2023 ... retiring #VALUE! #VALUE! ...
5 Retired 2017 ... #VALUE! #VALUE! #VALUE! ...

The formula works, but it is a little slow, since I have about a million total cells to calculate. Is there any way this could be optimized or rewritten completely? This is a work project, so I have no ability to upgrade past Excel 2016 or install anything.

=MID(INDIRECT($AP$1&ROW()),

This finds the beginning of the word (finds first space beforehand and adds 1).

     IFERROR(FIND("|",
                  SUBSTITUTE(LEFT(INDIRECT($AP$1&ROW()),
                                  SEARCH(AQ$1,INDIRECT($AP$1&ROW()))
                                 ),
                             " ",
                             "|",
                             LEN(LEFT(INDIRECT($AP$1&ROW()),
                                      SEARCH(AQ$1,INDIRECT($AP$1&ROW()))
                                     )
                                )
                             - LEN(SUBSTITUTE(LEFT(INDIRECT($AP$1&ROW()),
                                                   SEARCH(AQ$1,INDIRECT($AP$1&ROW()))
                                                  ),
                                              " ",
                                              ""
                                             )
                                  )
                            )
                 ),
             0
            ) + 1,

This determines the number of characters from the start of the word to the end.

     SEARCH(" ",
            INDIRECT($AP$1&ROW()),
            SEARCH(AQ$1,INDIRECT($AP$1&ROW()))
           )
     - IFERROR(FIND("|",
                    SUBSTITUTE(LEFT(INDIRECT($AP$1&ROW()),
                                    SEARCH(AQ$1,INDIRECT($AP$1&ROW()))
                                   ),
                               " ",
                               "|",
                               LEN(LEFT(INDIRECT($AP$1&ROW()),
                                        SEARCH(AQ$1,INDIRECT($AP$1&ROW()))
                                       )
                                  )
                               - LEN(SUBSTITUTE(LEFT(INDIRECT($AP$1&ROW()),
                                                     SEARCH(AQ$1,INDIRECT($AP$1&ROW()))
                                                    ),
                                                " ",
                                                ""
                                               )
                                    )
                              )
                   ),
               0
              ) - 1
    )
1
  • 3
    I think you need to go the VBA or possibly power query route. I don't believe you would gain much speed using excel 2016 formulas, espcially since you are talking about 1M formulas. Maybe the only possibility of increasing performance using formulas is switching from Indirect to index. Good luck.
    – gns100
    Commented Nov 15, 2023 at 22:03

2 Answers 2

0

I don't know if this will be any faster than VBA or Power Query for the amount of data, but it is a different approach using the FILTERXML function.

I added an IFERROR to return a blank if there is no match. I also had to lowercase the strings since the xpath arguments are case-sensitive.

I also simplified the formula a bit by requiring that AP1 contains the column number and not the column name.

I suggest you change the reference $A:$M to reflect the maximum possible number of rows and columns in your data array. If you do that you would need to change the number in AP1 so that, for example, if you used: $H$1:$M$300000 then you would use a 1 in AP1 to represent Column H.

AQ2: =IFERROR(
    FILTERXML(
        "<t><s>" &
            LOWER(
                SUBSTITUTE(INDEX($A:$M, ROWS($1:2), $AP$1), " ", "</s><s>")
            ) & "</s></t>",
        "//s[contains(.,'" & AQ$1 & "')] "
    ),
    ""
)

Then drag/fill across and down as far as needed.

Data
enter image description here

Results with H (8)
enter image description here

Results with I (9)
enter image description here

1
  • Thanks! I will try this tomorrow. Commented Nov 17, 2023 at 3:56
0

The amount of data is terrifying, perhaps it can be divided into segments.
However, you can simplify formulas and use them as array formulas.
In the proposal shown below in the screenshot, there is one array formula for the entire range. I've defined the name "col" in the name manager to specify the selected data column.
Formulas need to be matched to the original data. Suggested modifications are shown in the screenshot.
An array formula entered in the range F10:H13 was

=TRIM(MID(SUBSTITUTE(col," ",REPT(" ",LEN(col))), 
FLOOR(SEARCH($F$9:$H$9, SUBSTITUTE(col," ",REPT(" ",LEN(col)))), 
LEN(col))+1, LEN(col)))

In Excel 2016 you should select the enire range and press Ctrl+Shift+Enter.
Arkusz2 is the sheet name. keywords

4
  • Thank you! I will try this tomorrow at the office. I did try separating the equation up into parts and saving them in the workbook. It makes the file size HUGE, but it does make the formulas a lot faster. I'll try your suggestion! Commented Nov 17, 2023 at 3:56
  • I tried this and it works well. It takes about the same amount of time as when I split the huge formula up in to multiple parts without the drawback of the massive file size. One odd thing is that I get the same results whether or not I make your solution an array formula. Maybe because you defined the name "col" to be an array, Excel automatically treats all formulas like array formulas when used, whether or not Ctrl+Shift+Enter is used? Commented Nov 17, 2023 at 23:37
  • I actually got rid of all the array parts and just used regular references like $H2. The formula isn't the same for every cell, but it works the same and will be easier for upkeep. That's a smart algorithm, padding with blanks to make picking out words easier. Thank you! Commented Nov 17, 2023 at 23:59
  • If you commit a formula with just Enter, you have to copy it in the entire range. And each cell has its own formula. I suggested selecting the range, entering the formula into the first cell, and committing Ctrl+Shift+Enter as a single array formula in the entire range. This should work faster with multiple cells, but I don't have enough data to check it.
    – MGonet
    Commented Nov 18, 2023 at 8:56

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .