Python to STATA Cheat Sheet

Seungjun (Josh) Kim
7 min readMar 17, 2021

--

From Pexels Images. A photo classified under the “royalty free” section. Also has “free to use” label on it.

Python has become the go-to language in the field of Data Science. However, other programming languages or statistical software are still being actively used depending on the particular industry or field you are immersed in. One best example is the field of Economics Research. Many economists and post-doctoral employees use this program called STATA for data wrangling, running regressions and plotting graphs. Most econometrics classes ask students to use STATA instead of Python for class projects and assignments. How is STATA different from Python or any other programming languages commonly used for Data Science?

I would say the characteristic of STATA that differentiates it the most from Python is that it is not a programming language. It’s more of a “software” that enables economists who don’t know how to code in any programming language to more intuitively deal with datasets to perform economics research. For example, as opposed to Python, a object oriented programming language, STATA does not allow for this “assignment” of data into “objects” or “variables”. The closest alternative STATA has is the “temporary files” which can be run using the tempfile command. As such, due to this fundamental difference and also some major syntax disparities, I myself had a pretty steep learning curve for getting used to STATA.

Furthermore, I found it very difficult to do “googling” to find additional information on how to perform specific tasks for STATA. Unlike Python which has myriads of e-books, blog posts, communities (e.g. stackoverflow) and tutorials online for various things you would like to do with your data, I had a hard time with STATA in that regard.

The purpose of this post is to provide a cheat sheet of how the commonly used Python commands and operations you might have used for Data Science translate into STATA so that you don’t have to spend hours searching for information on statalist (similar to stackoverflow but for STATA; does not have the most useful info or user-friendly interface) for something you could have so easily done in Python.

Reading and Saving Files

Reading .dta(stata) file

Stata) use file.dtaPython) pd.read_stata("file.dta")

Using .dta file with only specific columns

Stata) use var1 var2 using file.dta
Python) pd.read_stata("file.dta", columns=['var1','var2'])

Read in .xlsx or .csv files

Stata) import excel using excel_file.xlsx
Python) pd.read_excel('excel_file.xlsx')
Stata) import delimited using csvfile.csv
Python) pd.read_csv("csvfile.csv")

Save as .dta file to certain directory

Stata)save file.dta, replacePython) df.to_stata("file.dta") ORdf.to_pickle("file.dta") (for Python-native file type)

The parameter “replace” above in the STATA command means you are replacing the pre-existing file.dta file with the current version that you are saving.

Save as .csv or excel file to certain directory

Stata) outsheet using csvfile, comma
Python) df.to_csv("csvfile.csv")
Stata)export excel using excelfilePython)df.to_excel("excelfile.xlsx")

Description and Summary Statistics

Breakdown of values and their counts

Stata) tab var 
==> count, percentage and cumulative percentage
Python) df[‘var’].value_counts( )
==> just count breakdown

STATA allows abbreviated commands. For example, the above command “tab” will work with just “ta”.

Describe

Stata)describePython) df.info( ) OR df['var'].dtype

Total Number of Observations

Stata) countPython) df.shape[0] OR len(df). 
==> Here df.shape returns a tuple with the length and width of the DataFrame.

If you combine the “count” command in STATA with the “if statement”, then you can count the number of observations that satisfy certain conditions.

e.g. count if name_last == “kim” (number of observations whose last name is kim)

Describe specific variable

Stata) summ varsumm var if <condition>Python) df['var'].describe( )df[<condition>][‘var’].describe( ) OR   
df.loc[condition, ‘var’].describe()

Basic Data Wrangling and Operations (e.g. creation or dropping of new variable)

Sort observations using var or multiple vars

Stata) sort var1 var2Python) df.sort_values(['var1', 'var2'])

Rename variable names

Stata) rename var1 var2 
==> rename var1 to var2
rename (var1 var3) (var2 var4)
==> rename var1 to var3 and rename var3 to var4
Python) df.rename(columns = {‘var1’:‘var2’}, inplace=True)

Keep or drop observations based on conditions

Stata) keep if <condition>drop if <condition>
Python)df = df[<condition>]df = df[~(<condition>)]

Keep or drop columns/variables

Stata) keep var1 var2…drop var1 var2…Python) df = df[[‘var1’, ‘var2’]]del df[‘var1’] OR df = df.drop(labels=   [‘var1’,’var2’], axis=1)

Create new variables and replace

Stata) gen newvar = <expression>gen newvar = <expression> if   <condition>replace newvar = <expression> if   <condition>Python)df[‘newvar’] = <expression>

STATA also has this “egen” command which does the same thing as “gen” but is used for statistical or aggregate functions as you can see below:

Stata)  egen newvar = max(var)egen newvar = mean(var)Python) newvar = df[‘var’].max()newvar = df[‘var’].mean()

Other Operations

Checking if certain values are included in a variable

Stata)  inlist(var, “val1”, “val2”)Python) df[‘var’].isin([“val1”, “val2”])

Checking if values in a column/variable are within a specific range

Stata) inrange(<var>, ‘val1’, ‘val2’)Python) df[‘var’].between(‘val1’, ‘val2’) 
**inclusive = False / True

Preserve & Restore Operations

Stata) 
preserve
keep if missing(dob)
drop dob
save “mystata/dob_missing.dta”, replace
restore
Python) Assign modified/wrangled dataframe in another variable

Preserve, Restore operation is something unique to STATA. This operation is useful because STATA does not allow for assignment of current data to an object or variable in memory (as I have mentioned in the beginning of the post of how STATA is not an object oriented programming language but rather a software like SPSS). In the above example, say you want to save a list of observations whose date of birth is missing. You keep only the observations with missing date of births and save it separately. Now you want to go back to the original FULL DATA to do some other operations. Unfortunately, the original full data is non-present and only the modified data is in the current memory. preserve-restore operation comes in handy in such cases.

try & except equivalent in STATA

Stata)
capture {
<some stata commands>
}
if _rc==0 {
display "your calculation is wrong"
}
Python)
try:
2 + 2 == 5
except:
print("your calculation is wrong")

String Operations

Split operations

Stata) split n, p(“,“) gen(n_)
==> split the values in variable “n” on comma and make them into separate new columns named n_1, n_2…
Python)df[‘n’].str.split( pat= “,”)

Removing Blanks

Stata)Say s is a stringstritrim(s): remove blanks
stritrim(s) returns s with all consecutive, internal blanks collapsed to one blank
strltrim(s) returns s with leading blanks removed
strrtrim(s) returns s with trailing blanks removed
strtrim(s) returns s with leading and trailing blanks removed.
e.g. replace name_last = strtrim(stritrim(lower(name_last)))Python)df['last_name'] = df['last_name'].str.strip( )OR df['last_name'] = df['last_name'].str.replace(" ","")

Regular Expression

Stata)
gen last_name = "kim" if regexm(n, “kim([0-9]+)$”)
==> apply regex “kim([0-9]+)$” on variable/column "n" and if it satisfies that regex, create a new variable named last_name and fill in with "kim"
Python)
https://docs.python.org/3/library/re.html

Parsing subset of strings

Stata)gen dob_yr = substr(DOB, -4, . ) # parse 4th   from last to last stringsubstr(raw_name, 1, 3) # parse 3 characters   starting from 1st indexed character from values in raw_name columnPython)df[‘dob_yr’] = df.DOB.apply(lambda x: x[-4: ])

Substituting strings with some other value/string

Stata) subinstr(var, “x”, “”, . ) #replace “x” with   empty string(“”) for all values in column varPython) df[‘var’] = df.var.replace({“x”: “”})

Merge and Join Operations

append

Stata) use df1.dta
append using df2.dta
Python) df_joint = df1.append(df2)

merge

In STATA, there are four types of merges — 1:1, 1:m, m:1, and m:m.

Stata) use df1.dta
merge 1:1 <vars> using "df2.dta"
Python) df_joint = pd.merge(df1, df2, on=<vars>)

One thing to keep in mind is that when you are doing 1:1 or 1:m in STATA, you need to make sure that the master data (whatever data you are using currently; FYI “using” data is the data that is being merged or appended onto this master data) is uniquely defined in every observation across these variables specified in <vars>. Otherwise, the system will complain to you with an error that says “observations in master not uniquely defined by var1 var2….” in red font.

STATA merging also has this parameter called “keep” which allows you to control which observation are kept and dropped during the merge process. It is basically the same as the “how” parameter in Python merging.

Python | Statahow='left' | keep(1, 3)
==> Keeps all observations in the "left" DF
how='right' | keep(2, 3)
==> Keeps all observations in the "right" DF
how='inner' | keep(3)
Keeps observations that are in both DF
how='outer' | keep(1 2 3)
==> Keeps all observations.

For more detailed information on merging in STATA, refer to this documentation: https://www.stata.com/manuals/dmerge.pd

Econometrics

T-test

Stata) ttest <var>, by(<var>)Python) 
from scipy.stats import ttest_ind
ttest_ind(<array1>, <array2>)

Simple Linear Regression under certain conditions

Stata) reg <yvar> <xvar> if <condition>, rPython) 
import econtools.metrics as mt
results = mt.reg(df[<condition>], <yvar>, <xvar>, robust=True)

A lot of people ask me which programming language or software is the best for data science. There is no straightforward answer. In fact, time allowing, being versatile in several programming languages and programs (though in different levels of proficiency) is highly beneficial because each language / program is best suited for different data science tasks. For instance, if you want to build simple but pretty visualizations, R’s ggplot might be the way to go instead of Python’s matplotlib which requires more coding and syntax to create the same plots. Similarly, having STATA in your toolbox allows you to maybe perform certain tasks more efficiently and faster than when you use Python to finish the same tasks. In that light, I hope this article was of help to those who would like to have both Python and STATA in their tool kits, ready to use any of them that is more appropriate to use for a certain task.

References

[1] G. Rodríguez, Stata Tutorial, Germán Rodríguez Personal Website

[2] D. M. Sullivan, Stata to Python Equivalents, D. M. Sullivan Personal Website

[3] Pandas Comparison with Stata, Pandas Documentation Page

I can be reached at the following :)

Website [ https://seungjun-data-science.github.io/ ]

LinkedIn [ https://www.linkedin.com/in/kimseungjun/ ]

Github [ https://github.com/Seungjun-Data-Science ]

Kaggle [ https://www.kaggle.com/juminator ]

Youtube [ https://www.youtube.com/user/juny1226 ]

Instagram [ https://www.instagram.com/juny.

--

--

Seungjun (Josh) Kim
Seungjun (Josh) Kim

Written by Seungjun (Josh) Kim

Data Scientist; PhD Student in Informatics; Artist (Singing, Percussion); Consider Supporting Me : ) https://joshnjuny.medium.com/membership

No responses yet